This AI Model Can Intuit How the Physical World Works

Imagine being able to understand the world around you without needing to see every detail. This is the kind of intuition that infants develop as they learn about the physical world, and it's a skill that artificial intelligence (AI) models are now close to replicating. Researchers at Meta have developed an AI system called Video Joint Embedding Predictive Architecture (V-JEPA) that can learn about the world through videos and demonstrate a notion of "surprise" when presented with information that goes against the knowledge it has gleaned.

How V-JEPA Works

V-JEPA is built from several artificial neural networks, but its core idea is simple: instead of predicting what's behind masked regions of a video at the level of individual pixels, it predicts at a higher level of abstraction, using "latent" representations to model the content. The model therefore focuses on creating and reproducing latent representations, which capture only the essential details of the data.

The architecture of V-JEPA is split into three parts: encoder 1, encoder 2, and a predictor. The training algorithm takes a set of video frames, masks the same set of pixels in all frames, and feeds the frames into encoder 1. Encoder 1 converts the masked frames into latent representations, while encoder 2 converts the unmasked frames into another set of latent representations. The predictor then uses the latent representations produced by encoder 1 to predict the output of encoder 2.
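The data flow described above can be sketched in a few lines of Python. This is a toy illustration, not Meta's implementation: the encoders and predictor here are untrained random linear maps, the mean-absolute-error loss is an assumption, and every name (make_mask, latent_loss, encoder_1 and so on) is hypothetical.

```python
import random

def make_mask(height, width, mask_fraction, rng):
    """Choose one set of pixel positions to hide; it is reused for every frame."""
    positions = [(r, c) for r in range(height) for c in range(width)]
    count = int(len(positions) * mask_fraction)
    return set(rng.sample(positions, count))

def apply_mask(frame, mask):
    """Return a copy of the frame with the masked pixels set to zero."""
    return [[0.0 if (r, c) in mask else value for c, value in enumerate(row)]
            for r, row in enumerate(frame)]

def flatten(frames):
    """Turn a list of 2D frames into one flat list of pixel values."""
    return [value for frame in frames for row in frame for value in row]

def random_matrix(rows, cols, rng):
    """Untrained weights standing in for a neural network (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(weights, vector):
    """Matrix-vector product, the simplest possible 'network'."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def latent_loss(predicted, target):
    """Mean absolute error measured in latent space, not pixel space."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

rng = random.Random(0)
T, H, W, D = 4, 6, 6, 8  # frames, height, width, latent size (all arbitrary)
video = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(T)]

mask = make_mask(H, W, 0.5, rng)
masked_video = [apply_mask(frame, mask) for frame in video]  # same mask in all frames

encoder_1 = random_matrix(D, T * H * W, rng)  # sees the masked frames
encoder_2 = random_matrix(D, T * H * W, rng)  # sees the unmasked frames
predictor = random_matrix(D, D, rng)

z_masked = linear(encoder_1, flatten(masked_video))
z_target = linear(encoder_2, flatten(video))
z_pred = linear(predictor, z_masked)  # predict encoder 2's output from encoder 1's

loss = latent_loss(z_pred, z_target)
print("latent prediction loss:", loss)
```

The point of the sketch is where the loss is computed: the comparison happens between two short latent vectors rather than between thousands of pixel values.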

The Benefits of V-JEPA

V-JEPA has several benefits over traditional pixel-space models. Those models come with limitations, such as focusing too much on irrelevant details and missing important aspects of the video. V-JEPA, on the other hand, can discard unnecessary information and focus on what matters. This enables the model to learn to see the cars on the road and not fuss about the leaves on the trees.

Intuitive Physics Understanding

In February, the V-JEPA team reported how well their systems understood the intuitive physical properties of the real world: object permanence, the constancy of shape and color, and the effects of gravity and collisions. On a test called IntPhys, which requires AI models to identify whether the events in a video are physically plausible or implausible, V-JEPA was nearly 98 percent accurate. A well-known model that predicts in pixel space was only a little better than chance.
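One way such a test could be scored is sketched below. It assumes each test item is a matched pair of clips, one plausible and one implausible, and that the model assigns each clip a single number for how unexpected it looks. Both the pairing and the scores are illustrative assumptions, and pairwise_accuracy is a hypothetical name, not part of the benchmark.

```python
def pairwise_accuracy(score_pairs):
    """Fraction of pairs where the implausible clip got the higher score.

    Each pair is (plausible_score, implausible_score); a guessing model
    would land near 0.5.
    """
    correct = sum(1 for plausible, implausible in score_pairs
                  if implausible > plausible)
    return correct / len(score_pairs)

# Made-up scores for four pairs; the third pair is judged wrongly.
example_pairs = [(0.1, 0.9), (0.2, 0.7), (0.5, 0.4), (0.3, 0.8)]
accuracy = pairwise_accuracy(example_pairs)
print(accuracy)  # 0.75
```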

Surprise and Uncertainty

The V-JEPA team also explicitly quantified the "surprise" exhibited by their model when its prediction did not match observations. They took a V-JEPA model pretrained on natural videos, fed it new videos, then mathematically calculated the difference between what V-JEPA expected to see in future frames of the video and what actually happened. The team found that the prediction error shot up when the future frames contained physically impossible events. For example, if a ball rolled behind some occluding object and temporarily disappeared from view, the model generated an error when the ball didn’t reappear from behind the object in future frames. The reaction was akin to the intuitive response seen in infants.
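The measurement can be illustrated with a toy example. In the sketch below, a constant-velocity extrapolation stands in for the model's predictor, and a one-number "latent" (the ball's position) stands in for a learned representation. Both are assumptions made for illustration; the actual system predicts learned latent vectors with a neural network.

```python
def surprise(predicted, observed):
    """Mean absolute difference between predicted and observed latents."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)

def constant_velocity(history):
    """Stand-in predictor: assume the latent keeps moving at its last velocity."""
    previous, last = history[-2], history[-1]
    return [2 * l - p for p, l in zip(previous, last)]

def surprise_trace(predict_next, latents, context=2):
    """Surprise at every step that has at least `context` earlier frames."""
    return [surprise(predict_next(latents[:t]), latents[t])
            for t in range(context, len(latents))]

# A ball moving steadily to the right, one latent per frame.
plausible = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
# The same ball suddenly jumps ahead, which steady motion cannot explain.
implausible = [[0.0], [1.0], [2.0], [3.0], [9.0], [10.0]]

print(surprise_trace(constant_velocity, plausible))    # [0.0, 0.0, 0.0, 0.0]
print(surprise_trace(constant_velocity, implausible))  # [0.0, 0.0, 5.0, 5.0]
```

The error stays at zero while the motion matches expectations and jumps at the frame where the impossible event occurs, which is the pattern the team describes.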

Robotics and Future Work

In June, the V-JEPA team at Meta released their next-generation 1.2-billion-parameter model, V-JEPA 2, which was pretrained on 22 million videos. They also applied the model to robotics: They showed how to further fine-tune a new predictor network using only about 60 hours of robot data (including videos of the robot and information about its actions), then used the fine-tuned model to plan the robot’s next action. “Such a model can be used to solve simple robotic manipulation tasks and paves the way to future work in this direction,” said Quentin Garrido, a research scientist at Meta.
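The article does not spell out how the planning works, so the following is only a sketch of one plausible scheme: imagine the outcome of each candidate action with an action-conditioned predictor, then pick the action whose predicted latent lands closest to a goal latent. The additive predictor, the distance measure, and all names here are hypothetical.

```python
def predict_next(latent, action):
    """Hypothetical stand-in for a fine-tuned, action-conditioned predictor."""
    return [z + a for z, a in zip(latent, action)]

def distance(a, b):
    """Sum of absolute differences between two latent vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def plan_next_action(current, goal, candidate_actions, predictor):
    """Pick the action whose predicted outcome is closest to the goal latent."""
    return min(candidate_actions,
               key=lambda action: distance(predictor(current, action), goal))

current = [0.0, 0.0]  # latent for what the robot sees now (toy values)
goal = [2.0, 0.5]     # latent for the desired outcome (toy values)
actions = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

best = plan_next_action(current, goal, actions, predict_next)
print(best)  # (1.0, 0.0)
```

Because the comparison happens in latent space, the planner never has to generate an image of the future, only a compact description of it.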

Implications and Future Directions

The development of V-JEPA, along with its applications in robotics and intuitive physics, has significant implications for the field of AI. Learning about the world from videos, and registering "surprise" at information that contradicts what has been learned, is a key aspect of human intelligence. By replicating this ability in AI models, researchers can create systems that are more robust and adaptable to changing environments.

V-JEPA also has limitations: it can handle only a few seconds of video as input and predict only a few seconds into the future. These are areas for future research. By addressing them, researchers can create more advanced AI models that learn about the world in a more comprehensive and intuitive way.

In conclusion, V-JEPA shows that a model trained only on videos can pick up an intuitive sense of how the physical world works and register surprise when that sense is violated, and V-JEPA 2 carries the same approach into robot planning.

Source: https://www.wired.com/story/how-one-ai-model-creates-a-physical-intuition-of-its-environment/