What's in Sora's Pandora's Box? Does Sora Truly Understand the Physical World?

baoshi.rao

OpenAI, a global leader in AI, has dropped a new bombshell: the text-to-video model Sora.

Not only can it create highly realistic and imaginative scenes from text prompts, but it can also generate seamless videos up to one minute long, with striking consistency in characters and backgrounds and smooth transitions between shots.

In text-to-video, giants like Meta and Runway, as well as startup competitors, lag far behind on these fronts. For example, Runway can generate 4-second videos, extendable up to 16 seconds, which was the previous record for AI-generated video length. Stable Video and Pika offer 4-second and 3-second videos, respectively.

Sora's multi-shot capability leaves competitors in the dust. Previous AI videos were limited to single shots, whereas Sora switches between multiple angles while keeping characters, backgrounds, and style highly consistent.

People can't help but exclaim: Will short videos, advertisements, and even blockbuster movies soon be 'one-click generated' from scripts?!

Before the industry could fully absorb the impact on business models, experts noticed that Sora's capabilities may go even further: it may even understand the physical world. Not only can it interpret user requests, but it also appears to grasp how people and objects exist in the physical realm. Some analyses use the now-classic 'pirate ships battling in a coffee cup' clip to illustrate this. To shrink pirate ships until they can battle inside a coffee cup, Sora has to understand and adjust the relative sizes these objects have in real life. Because the liquid in the cup affects how the ships move, Sora also has to simulate fluid dynamics. On top of that, it has to handle lighting, shadows, gravity, buoyancy, motion, and other physical effects.

Although the generated results still have some imperfections, Sora appears to have some understanding of 'physics.' NVIDIA senior research scientist Jim Fan even asserts that 'Sora is a data-driven physics engine.'

However, there are differing opinions in the tech community about whether Sora truly understands the physical world.

The most direct opposition comes from Turing Award winner Yann LeCun, Meta's chief AI scientist. In his view, generating realistic videos from prompts does not mean a model understands the physical world, because generating video is an entirely different process from causal prediction based on a world model. As he put it on X: 'There is a "huge" misconception here.'

A 'world model' is a model of the real physical world, one that gives machines a comprehensive and accurate understanding of the world, much as humans have.

The 'world model' is in fact one of the goals that text-to-video companies are striving toward, and Sora is clearly aiming for it as well: OpenAI's research report on Sora is titled 'Video Generation Models as World Simulators.' Either way, the industry believes that Sora's arrival is disruptive and that a new technological revolution is underway. It is still difficult, however, to say when that impact will be felt in full.