Yann LeCun’s vision for creating autonomous machines

Amid the heated debate over AI sentience, conscious machines and artificial general intelligence, Yann LeCun, chief AI scientist at Meta, published a blueprint for creating “autonomous machine intelligence.”

LeCun compiled his ideas in a paper that draws on advances in machine learning, robotics, neuroscience and cognitive science. He lays out a roadmap for creating AI that can model and understand the world, reason, and plan tasks on different time scales.

Although the paper is not a scientific document, it provides a very interesting framework for thinking about the pieces needed to replicate animal and human intelligence. It also shows how the mindset of LeCun, an award-winning pioneer of deep learning, has changed and why he thinks current approaches to AI will not bring us to human-level AI.

A modular structure

One of the most important elements of LeCun’s vision is a modular structure of different components inspired by various parts of the brain. This is a departure from the popular approach in deep learning, where a single model is trained end to end.

At the center of the architecture is a world model that predicts the state of the world. Although world modeling has been discussed and tried in different AI architectures, it is usually task-specific and cannot be adapted to different tasks. LeCun proposes that autonomous systems, like humans and animals, should have a single, flexible world model.

“One hypothesis in this paper is that animals and humans have only one world model engine somewhere in their prefrontal cortex,” LeCun writes. “That world model engine is dynamically configurable for the task at hand. With a single, configurable world model engine, rather than a separate model for each situation, knowledge about how the world works can be shared across tasks. It can make reasoning by analogy possible by applying the model set up for one situation to another.”

LeCun’s proposed architecture for autonomous machines

The world model is complemented by several other modules that help the agent understand the world and take actions relevant to its goals. The “perception” module plays the role of the animal sensory system, gathering information from the world and estimating its current state with the help of the world model. Here, the world model performs two important tasks: first, it fills in the pieces of information missing from the perception module (e.g., occluded objects), and second, it predicts the plausible future states of the world (e.g., where a flying ball will be in the next time step).

The “cost” module evaluates the agent’s “discomfort,” measured in energy. The agent must take actions that reduce its discomfort. Some of the costs are hardwired, or “intrinsic costs.” In humans and animals, for example, these would be hunger, thirst, pain and fear. Another submodule is the “trainable critic,” whose purpose is to reduce the cost of achieving a particular goal, such as navigating to a location, building a tool, and so on.

The “short-term memory” module stores relevant information about the states of the world over time and the corresponding values of the intrinsic cost. Short-term memory plays an important role in helping the world model function properly and make accurate predictions.

The “actor” module turns predictions into specific actions. It receives input from all the other modules and controls the outward behavior of the agent.

Finally, a “configurator” module handles executive control, adapting all the other modules, including the world model, to the specific task at hand. It is the key module that ensures a single architecture can handle many different tasks. It adjusts the agent’s perception model, world model, cost function and actions based on the goal it wants to achieve. For example, if you are looking for a tool to drive a nail, your perception module should be configured to look for items that are heavy and solid, your actor module should plan actions to pick up the makeshift hammer and use it to drive the nail, and your cost module should be able to calculate whether the object can be swung and is within reach, or whether you should look for something else.
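
To make the division of labor concrete, here is a minimal Python skeleton of how these modules might fit together. All class names, methods and placeholder values are assumptions made for illustration; LeCun’s paper describes the modules conceptually and does not specify an implementation.

```python
# Illustrative skeleton of the modular agent described above. All class and
# method names here are hypothetical; the paper describes the modules
# conceptually rather than as a concrete API.

class WorldModel:
    def complete(self, observation):
        # Fill in unobserved parts of the state (e.g., occluded objects).
        return dict(observation, hidden="inferred")

    def predict(self, state, action):
        # Predict the likely next state of the world given an action.
        return state

class Perception:
    def estimate_state(self, observation, world_model):
        # Estimate the current state of the world with the world model's help.
        return world_model.complete(observation)

class Cost:
    def intrinsic(self, state):
        return 0.0  # hardwired "discomfort": hunger, thirst, pain, fear, ...

    def critic(self, state):
        return 0.0  # trainable estimate of future cost toward the current goal

    def __call__(self, state):
        return self.intrinsic(state) + self.critic(state)

class Actor:
    def act(self, state, world_model, cost):
        # Pick the candidate action whose predicted next state is cheapest.
        candidates = ["wait", "move", "grasp"]
        return min(candidates, key=lambda a: cost(world_model.predict(state, a)))

class ShortTermMemory:
    def __init__(self):
        self.episodes = []

    def store(self, state, value):
        # Keep track of world states over time and their associated costs.
        self.episodes.append((state, value))

class Configurator:
    def configure(self, task, modules):
        # Executive control: adapt the other modules to the task at hand.
        pass

def agent_step(observation, task, perception, world_model, cost, actor,
               memory, configurator):
    configurator.configure(task, [perception, world_model, cost, actor])
    state = perception.estimate_state(observation, world_model)
    memory.store(state, cost(state))
    return actor.act(state, world_model, cost)

perception, world_model, cost = Perception(), WorldModel(), Cost()
actor, memory, configurator = Actor(), ShortTermMemory(), Configurator()
action = agent_step({"ball": "flying"}, "catch the ball", perception,
                    world_model, cost, actor, memory, configurator)
```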

Interestingly, in his proposed architecture, LeCun considers two modes of operation, inspired by Daniel Kahneman’s “Thinking, Fast and Slow” dichotomy. The autonomous agent needs a “Mode 1” mode of operation, a fast and reflexive behavior that directly links perceptions to actions, and a “Mode 2” mode of operation, which is slower and more deliberate and uses the world model and the other modules to reason and plan.
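
Continuing the hypothetical skeleton above, the two modes could be sketched roughly as follows: Mode 1 maps the perceived state directly to an action, while Mode 2 searches over imagined action sequences, scoring each rollout with the world model and the cost module. The horizon and candidate actions are arbitrary choices for illustration.

```python
from itertools import product

def mode_1(state, reactive_policy):
    # Fast, reflexive: a direct mapping from the perceived state to an action.
    return reactive_policy(state)

def mode_2(state, world_model, cost, candidate_actions, horizon=3):
    # Slow, deliberate: imagine every action sequence of length `horizon`,
    # score each rollout with the world model and the cost module, and
    # return the first action of the cheapest plan.
    best_plan, best_cost = None, float("inf")
    for plan in product(candidate_actions, repeat=horizon):
        simulated_state, total_cost = state, 0.0
        for action in plan:
            simulated_state = world_model.predict(simulated_state, action)
            total_cost += cost(simulated_state)
        if total_cost < best_cost:
            best_plan, best_cost = plan, total_cost
    return best_plan[0]

# Reusing the placeholder world_model and cost from the sketch above:
action = mode_2({"ball": "flying"}, world_model, cost,
                candidate_actions=["wait", "move", "grasp"])
```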

Self-supervised learning

Although the architecture LeCun proposes is interesting, implementing it poses several major challenges. Among them is training all the modules to perform their tasks. In his paper, LeCun makes extensive use of the terms “differentiable,” “gradient-based” and “optimization,” all of which indicate that he believes the architecture will be based on a series of deep learning models, as opposed to symbolic systems in which knowledge is embedded in advance by humans.

LeCun is a proponent of self-supervised learning, a concept he has been talking about for several years. One of the major bottlenecks of many deep learning applications is their need for human-annotated examples, which is why they are called “supervised learning” models. Data labeling does not scale, and it is slow and expensive.

Unsupervised and self-supervised learning models, on the other hand, learn by observing and analyzing data without the need for labels. Through self-supervision, human children gain commonsense knowledge of the world, including gravity, dimensionality and depth, object permanence, and even things like social relationships. Autonomous systems should also be able to learn on their own.

Recent years have seen some major advances in unsupervised and self-supervised learning, mainly through transformer models, the deep learning architecture used in large language models. Transformers learn the statistical relationships between words by masking parts of a known text and trying to predict the missing parts.
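
As a toy illustration of that masking objective (not the architecture of any actual language model), the sketch below hides one word of a five-word sentence and trains a tiny network to fill it back in; the vocabulary, model and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

vocab = ["the", "cat", "sat", "on", "mat", "[MASK]"]
tok = {w: i for i, w in enumerate(vocab)}

sentence = ["the", "cat", "sat", "on", "mat"]
masked_pos = 2                                 # hide "sat"
inputs = [tok[w] for w in sentence]
target = inputs[masked_pos]
inputs[masked_pos] = tok["[MASK]"]

model = nn.Sequential(
    nn.Embedding(len(vocab), 16),              # token embeddings
    nn.Flatten(start_dim=0),                   # concatenate them into one vector
    nn.Linear(16 * len(sentence), len(vocab)), # score every word in the vocabulary
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

x = torch.tensor(inputs)
y = torch.tensor([target])

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x).unsqueeze(0), y)   # predict the hidden token
    loss.backward()
    optimizer.step()

print(vocab[model(x).argmax().item()])         # typically recovers "sat"
```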

One of the most popular forms of self-supervised learning is “contrastive learning,” in which a model learns the latent features of images through masking, augmentation and exposure to different views of the same object.
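
A compact sketch of the contrastive idea, with a random stand-in encoder and synthetic “augmentations,” might look like the following; the only point is the structure of the loss, which pulls two views of the same example together and pushes different examples apart.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    # z1[i] and z2[i] are embeddings of two views of the same example i.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature    # pairwise similarities between views
    labels = torch.arange(z1.size(0))   # matching pairs sit on the diagonal
    return F.cross_entropy(logits, labels)

encoder = nn.Linear(32, 8)              # stand-in for a real image encoder
images = torch.randn(4, 32)             # a tiny batch of fake "images"
view1 = images + 0.1 * torch.randn_like(images)   # synthetic "augmentations"
view2 = images + 0.1 * torch.randn_like(images)
loss = contrastive_loss(encoder(view1), encoder(view2))
loss.backward()
```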

However, LeCun proposes a different type of self-supervised learning, which he describes as “energy-based models.” EBMs try to encode high-dimensional data such as images into low-dimensional embedding spaces that preserve only the relevant features. By doing so, they can compute whether two observations are related to each other.
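
In the simplest terms, such an energy function can be sketched as a distance between the embeddings of two observations. The encoders below are untrained placeholders; in practice the whole point is to train them so that related pairs of observations receive low energy.

```python
import torch
import torch.nn as nn

# Untrained placeholder encoders; training would shape them so that
# related pairs of observations end up with low energy.
encode_x = nn.Linear(64, 8)   # encodes observation x (e.g., a video frame)
encode_y = nn.Linear(64, 8)   # encodes observation y (e.g., a candidate next frame)

def energy(x, y):
    # Low energy means the two observations are judged to go together;
    # high energy means they are judged unrelated.
    return ((encode_x(x) - encode_y(y)) ** 2).sum()

x, y = torch.randn(64), torch.randn(64)
print(energy(x, y).item())
```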

In his paper, LeCun proposes the “Joint Embedding Predictive Architecture” (JEPA), a model that uses EBMs to capture dependencies between different observations.

Joint Embedding Predictive Architecture (JEPA)

“A significant advantage of JEPA is that it can choose to ignore the details that are not easily predictable,” LeCun writes. Basically, this means that instead of trying to predict the state of the world at the pixel level, JEPA predicts the latent, low-dimensional features that are relevant to the task at hand.
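
A rough sketch of that idea, with placeholder dimensions and plain linear layers standing in for real encoders, is shown below. The prediction error is computed between representations rather than between raw observations, so details that the representations discard never have to be predicted. A real implementation would also need a mechanism to keep the representations from collapsing to a trivial constant, an issue the paper discusses; the sketch only shows where the prediction happens.

```python
import torch
import torch.nn as nn

enc_x = nn.Linear(256, 16)      # encoder for the observed context x
enc_y = nn.Linear(256, 16)      # encoder for the target observation y
predictor = nn.Linear(16, 16)   # predicts the representation of y from that of x

def jepa_loss(x, y):
    sx, sy = enc_x(x), enc_y(y)
    # The prediction error lives in representation space, so details of y
    # that the representation discards never have to be predicted.
    return ((predictor(sx) - sy) ** 2).mean()

x = torch.randn(8, 256)   # batch of contexts (e.g., current video frames)
y = torch.randn(8, 256)   # corresponding targets (e.g., next frames)
loss = jepa_loss(x, y)
loss.backward()           # gradients flow into both encoders and the predictor
```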

In the paper, LeCun further discusses Hierarchical JEPA (H-JEPA), a plan to stack JEPA models on top of each other to address reasoning and planning on different time scales.

“JEPA’s capacity to learn abstractions points to an expansion of the architecture to handle prediction on multiple time scales and multiple levels of abstraction,” LeCun writes. “Intuitively, low-level representations contain many details about the inputs, and can be used to predict in the short term. But it can be difficult to produce accurate long-term forecasts with the same level of detail. Conversely, high-level abstract representation can make long-term predictions possible, but at the expense of eliminating many details.”

Hierarchical Joint Embedding Predictive Architecture (H-JEPA)
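
In the same hypothetical style, a hierarchical version might stack two such levels: a lower level that works on detailed representations over short horizons, and a higher level that works on coarser representations over longer horizons. The dimensions and horizons below are illustrative only.

```python
import torch
import torch.nn as nn

class JEPALevel(nn.Module):
    def __init__(self, in_dim, rep_dim):
        super().__init__()
        self.encoder = nn.Linear(in_dim, rep_dim)
        self.predictor = nn.Linear(rep_dim, rep_dim)

    def represent(self, x):
        return self.encoder(x)

    def predict_ahead(self, rep, steps):
        # Roll the predictor forward `steps` time steps in latent space.
        for _ in range(steps):
            rep = self.predictor(rep)
        return rep

low = JEPALevel(in_dim=256, rep_dim=32)   # detailed representation, short horizon
high = JEPALevel(in_dim=32, rep_dim=8)    # abstract representation, long horizon

frame = torch.randn(256)
detail = low.represent(frame)
near_future = low.predict_ahead(detail, steps=2)     # fine-grained, short-term
abstract = high.represent(detail)
far_future = high.predict_ahead(abstract, steps=20)  # coarse, long-term
```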

The road to autonomous agents

In his paper, LeCun acknowledges that many questions remain unanswered, including how to configure the models to learn the optimal latent features, and the precise architecture and function of the short-term memory module and its beliefs about the world. LeCun also says that the configurator module remains a mystery and that more work is needed to make it function properly.

But LeCun makes it clear that, in his view, current proposals for achieving human-level AI will not work. For example, one argument that has received a lot of attention in recent months is “it’s all about scale.” Some scientists suggest that by scaling up transformer models with more layers and parameters and training them on larger datasets, we will eventually reach artificial general intelligence.

LeCun rejects this theory, arguing that LLMs and transformers work only as long as they are trained on discrete values.

“This approach does not work for high-dimensional continuous modalities, such as video. To represent such data, it is necessary to eliminate irrelevant information about the variable that needs to be modeled by an encoder, as in the JEPA,” he writes.

Another theory is “reward is enough,” suggested by scientists at DeepMind. According to this theory, the right reward function and correct reinforcement learning algorithm are all you need to create artificial general intelligence.

But LeCun argues that while RL requires the agent to constantly interact with its environment, much of the learning that humans and animals do is through pure perception.

LeCun also rejects the hybrid “neuro-symbolic” approach, saying that the model is unlikely to require explicit mechanisms for symbol manipulation, and describing reasoning as “energy minimization or constraint satisfaction by the actor using multiple search methods to find a suitable combination of actions and latent variables.”

Much more needs to happen before LeCun’s blueprint becomes a reality. “This is basically what I plan to work on, and what I hope to inspire others to work on, over the next decade,” he wrote on Facebook after publishing the paper.
