Investment Opportunities Brought by Sora in AI

Posted in AI Insights · ai-articles · by baoshi.rao
    On February 14th, OpenAI released the video generation model Sora. Its stunningly realistic simulation of the real world sparked exclamations that "the film industry will be completely overturned, and film production staff will lose their jobs."

    However, some believe that although Sora is powerful, its impact is limited to visual fields such as film and gaming, far from matching the broad, life-changing reach of ChatGPT and other large language models.

    In reality, Sora is about much more than video applications. Consider a generated video in which a store is occluded by a person in the foreground: when the person moves away, the store becomes visible again. This is trivial for the human brain, but computers have traditionally required explicit 3D modeling to capture such spatial relationships between objects. That Sora generates these scenes correctly suggests the algorithm can autonomously comprehend the physical motion relationships of objects in the real world.

    Therefore, its greater significance lies in being a crucial milestone towards achieving Artificial General Intelligence (AGI). This is a very significant breakthrough. Physical laws are not inherently present in our brains. For example, infants under one year old do not understand why their mother disappears when she moves behind a wall, even though they can see her movement. Only after repeatedly observing their mother disappearing and reappearing from different places can they gradually comprehend the physical laws of the world, including 3D consistency and object permanence.

    How does Sora achieve this?

    According to the official technical documentation, Sora's architecture combines a diffusion model with a transformer. The diffusion model is similar to those used in most text-to-image generators, while the transformer is the architecture underlying large language models such as ChatGPT — and it is the latter that drives the qualitative leap. Sora borrows the transformer's approach of converting information into tokens: during training, it takes video material of varying durations, resolutions, and aspect ratios and decomposes it into spatiotemporal ("spacetime") patches, each carrying both spatial and temporal representations of the video. This is the key to its understanding of the world.
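    To make the "diffusion model + transformer" pairing concrete, here is a toy NumPy sketch of a deterministic diffusion sampling loop. Nothing here comes from Sora's actual implementation: the noise predictor — which in Sora would be a transformer over spacetime patch tokens — is replaced by an oracle that returns the true noise, purely so the loop's mechanics can be followed end to end.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-in for a batch of spacetime patch tokens: (tokens, channels).
    x0 = rng.standard_normal((64, 16))

    # Linear noise schedule over T diffusion steps.
    T = 50
    betas = np.linspace(1e-4, 0.02, T)
    alpha_bars = np.cumprod(1.0 - betas)

    # Forward process: noise the clean tokens to the final step in closed form.
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[-1]) * x0 + np.sqrt(1.0 - alpha_bars[-1]) * eps

    def predict_noise(x_t, t):
        # In Sora this would be a learned transformer over patch tokens;
        # here an oracle returns the true noise (an assumption for the demo).
        return eps

    # Deterministic reverse process: estimate x0, then step back one level.
    for t in reversed(range(T)):
        eps_hat = predict_noise(x_t, t)
        x0_hat = (x_t - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        x_t = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat

    # With a perfect predictor, the clean tokens are recovered exactly.
    assert np.allclose(x_t, x0)
    ```

    In a real system the predictor is imperfect, so the reverse loop only approximately recovers a sample from the data distribution — but the mechanics (noise to the final step, then iteratively denoise) are the same.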

    Traditional video generation methods simply decompose a video into a series of consecutive frames, without encoding where objects are positioned or how they move within each frame. This is like the one-year-old baby who perceives the world as two-dimensional: unable to relate the mother's movement over time to her spatial relationship with the wall, and thus unable to understand why she disappears.

    In contrast, the Sora large model, which builds upon these spatiotemporal patches as its "building blocks," can simultaneously consider the spatiotemporal relationships of objects in videos, enabling more precise generation of subtle movements and changes of objects while ensuring content coherence. This is akin to how children after the age of two can understand that the mother hasn't disappeared but is merely behind the wall. When algorithms understand the real physical relationships of the world, it can be said that Sora's application capabilities extend far beyond just video generation.
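    To illustrate what a spatiotemporal patch is, here is a minimal NumPy sketch — the shapes, patch sizes, and function name are illustrative assumptions, not Sora's actual pipeline — that splits a toy video into spacetime patch tokens, each spanning several frames as well as a spatial window:

    ```python
    import numpy as np

    # Toy "video": 8 frames of 32x32 RGB, values in [0, 1].
    video = np.random.default_rng(0).random((8, 32, 32, 3))  # (T, H, W, C)

    def spacetime_patches(video, pt=2, ph=8, pw=8):
        """Split a (T, H, W, C) video into flattened spatiotemporal patches.

        Each patch spans pt frames and a ph x pw spatial window, so every
        token carries motion (time) as well as appearance (space) — unlike
        per-frame patches, which capture space only.
        """
        T, H, W, C = video.shape
        assert T % pt == 0 and H % ph == 0 and W % pw == 0
        x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
        x = x.transpose(0, 2, 4, 1, 3, 5, 6)        # group the patch axes last
        return x.reshape(-1, pt * ph * pw * C)      # (num_tokens, patch_dim)

    tokens = spacetime_patches(video)
    print(tokens.shape)  # (64, 384): 4*4*4 tokens, each 2*8*8*3 = 384 values
    ```

    A per-frame scheme would instead produce tokens of shape `(ph * pw * C,)` with no time axis; folding `pt` consecutive frames into each token is what lets a transformer attend over motion directly.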

    Up until now, generative AI has been focused on understanding and creating human-made information worlds—such as text, sound, images, and videos—rather than the real external world. It can help you find 100 Tang poems about Mount Lu, but it cannot connect with the actual Mount Lu in the physical world.

    However, Sora shows us the possibility of AI interacting with the world. Sora can simulate simple actions that affect the state of the world — a painter can leave new brushstrokes on a canvas that persist over time, or a person can take a bite out of a hamburger and leave visible marks. If Sora truly understands this process, it should be able to apply that understanding to the real world. One example is industrial software.

    Currently, there are many industrial software applications designed for single domains. They can accurately simulate real-world physical behaviors on a factory production line, such as object movement, fluid flow, structural responses, and system performance under various environmental conditions. However, these industrial software solutions have no artificial intelligence and operate entirely on pre-programmed routines. Once the environment changes beyond what was programmed, they simply grind to a halt.

    Conversely, general AI models like Sora still exhibit illogical errors when simulating specific physical scenarios and can fail to grasp causal relationships — chairs floating in mid-air, or an elderly woman blowing out birthday candles while the flames never move. In the future, Sora could endow specialized industrial software with artificial intelligence, enabling it to solve more complex problems at the level of a skilled engineer. This is how Sora will impact the real world.

    Another example is autonomous driving.

    Tesla has also begun exploring world models that can simultaneously predict future scenarios from eight surrounding cameras on the vehicle. These models can accurately simulate previously hard-to-describe scenarios, such as smoke and dust, and can be used for segmentation tasks. Current autonomous driving simulations employ a technical approach combining NeRF, material library permutations, and game engines. While this ensures scene authenticity, it falls short in AI generalization. Sora-like world models, however, can comprehend the laws of physics and generalize beyond training samples, enabling rapid generation of highly realistic and diverse driving scenarios for autonomous driving simulations.

    Based on Elon Musk's public replies on Twitter, industry insiders infer that Tesla is likely pursuing a generative AI route similar to OpenAI's diffusion + transformer approach. Sora-like models may therefore become foundational models for autonomous driving.

    From the perspectives of industrial software and autonomous driving, it's evident that Sora's disruptive potential extends far beyond the realm of video production. Finally, let's return to investment and see what new opportunities Sora will bring us.

    First is computing power. With Sora's continuous iteration and optimization, and the expansion of training dataset sizes, future computing power demands will experience exponential growth. Therefore, the most certain investment opportunities lie in upstream computing power infrastructure.

    Next is applications. Technological breakthroughs point the way forward: companies with strong tool products in video modalities are expected to benefit, especially those with a high proportion of overseas business. Finally, watch industrial software companies, which often possess simulation algorithms and a range of physical models. Collaborating with large-model companies can enhance the application capabilities of their software, while the large-model companies, in turn, can use such collaborations to significantly improve how faithfully their video generation models represent the physical world.
