Sora Goes Viral: A Classic OpenAI Victory

    baoshi.rao wrote last edited by
    #1

    During the 2023 Spring Festival, OpenAI's ChatGPT quickly ignited excitement in both investment and AI circles, ushering in the current era of AI exploration.

    This year, a similar story unfolded. In the early hours of February 16, without any prior warning or leaks, OpenAI suddenly released its first text-to-video model, Sora, catching the entire AI industry off guard.

    Compared to existing AI video models on the market, Sora demonstrated capabilities far beyond expectations: it extended video generation duration roughly 15-fold in one go, and it showed significant improvements in the stability of video content. More importantly, in the demo videos released, Sora displayed an understanding of certain physical-world laws, which had been a major pain point for previous text-to-video models.

    With the release of Sora, another interesting question arises: why is it always OpenAI? It is worth noting that before Sora's debut, quite a few companies were already exploring AI video models, including well-known names like Runway and Pika, which had made significant progress. Yet OpenAI still managed to deliver a breakthrough that operates on a different level.

    This is a classic OpenAI-style victory: focusing on the ultimate goal of AGI without being confined to specific scenarios, and extending the 'magic' of generative AI from text to video and the real world through Scaling Law.

    In this process, the boundary between the virtual world created by AI and the real world is gradually blurring, bringing OpenAI closer to its AGI goal.

    Before Sora's release, text-to-video tools were already familiar to the public. According to statistics from the well-known investment firm a16z, there were 21 publicly available AI video models on the market by the end of 2023, including well-known ones like Runway, Pika, Genmo, and Stable Video Diffusion.

    Compared with existing AI video models, Sora's demonstrated advantages fall mainly into the following areas:

    First is the significant improvement in video length. Sora can generate videos of up to one minute, far longer than anything produced by the AI video models currently on the market. According to a16z statistics, most videos generated by current AI models are under 10 seconds; popular tools like Runway Gen-2 and Pika typically produce clips of only about 4 seconds and 3 seconds respectively. A 60-second duration would essentially meet the content requirements of short-video platforms like Douyin.

    Second is video stability. AI-generated videos essentially create frames and try to produce temporally coherent animation between them. Without an intrinsic understanding of 3D space or of how objects should interact, however, these videos often exhibit distorted and deformed figures.

    For instance, it is common to see a person walking down the street in the first half of a clip and then melting into the ground in the second half, because the model lacks the concept of a "solid" surface. And without three-dimensional scene comprehension, generating consistent clips from different angles remains particularly difficult.

    What makes Sora unique is that its 60-second videos not only hold together as a single continuous shot, but the female protagonist and the background characters also maintain astonishing consistency. Even as the shots switch freely, the characters remain highly stable. Below is the prompt for one of the demo videos OpenAI released alongside Sora:

    Prompt: A fashionable woman walks down a Tokyo street filled with warm neon lights and animated city signs. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is wet and reflective, creating a mirror effect under the colorful lights. Many pedestrians come and go.

    Third, Sora's deep language understanding enables it to follow user instructions accurately, producing rich expression and vivid emotion in the generated videos. This understanding goes beyond simple commands: Sora also grasps how the requested elements exist in the physical world and can even render fairly convincing physical interactions. Take hair, for example. When Pixar created the furry protagonist of Monsters, Inc., its technical team spent months developing software to simulate the soft, flowing texture of 2.3 million strands of hair. Sora now does this effortlessly, without any explicit instruction.

    "It has learned about 3D geometry and consistency," said Tim Brooks, a research scientist on the project. "This wasn't pre-programmed—it emerged naturally from observing vast amounts of data."

    Undoubtedly, compared to other 'toy-level' video generation AIs, Sora represents a dimensional leap forward in the field of AI video.

    From a technical perspective, the underlying frameworks for image and video generation are quite similar, primarily including recurrent neural networks, generative adversarial networks (GANs), autoregressive Transformers, and diffusion models.

    Unlike mainstream AI video tools such as Runway and Pika, which are built purely on diffusion models, Sora adopts a newer architecture: the Diffusion Transformer (DiT). As the name suggests, this model combines a diffusion model with the Transformer architecture, replacing the U-Net backbone commonly used in diffusion models. The Diffusion Transformer architecture was proposed in 2023 by William Peebles from the University of California, Berkeley, and Saining Xie from New York University.
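    As a rough illustration of that combination, the sketch below is a hedged toy example in PyTorch, not Sora's or the DiT paper's actual implementation: a small Transformer that denoises a sequence of flattened visual patch tokens (the "patches" described in the next paragraph), conditioned on a diffusion timestep. The class name TinyDiT and all dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Toy Diffusion-Transformer-style denoiser. Dimensions, the conditioning scheme,
# and the name TinyDiT are illustrative assumptions, not Sora's actual design.
class TinyDiT(nn.Module):
    def __init__(self, patch_dim=3072, d_model=512, n_heads=8, n_layers=4, n_steps=1000):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)         # project each patch token
        self.time_embed = nn.Embedding(n_steps, d_model)   # diffusion timestep embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, patch_dim)           # predict the noise per patch

    def forward(self, noisy_patches, t):
        # noisy_patches: (batch, num_patches, patch_dim); t: (batch,) integer timesteps
        h = self.embed(noisy_patches) + self.time_embed(t)[:, None, :]
        return self.out(self.blocks(h))

# One denoising call on a batch of 256 patch tokens at timestep 500.
model = TinyDiT()
noise_pred = model(torch.randn(1, 256, 3072), torch.tensor([500]))
```

    The point of the design is that denoising becomes a sequence-modeling problem over patch tokens, which is what lets the scaling recipe familiar from LLMs carry over to video.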

    In this new architecture, OpenAI follows the approach used in large language models (LLMs) by introducing "patches" (visual patches) as the unit of video data for training. These patches serve as a unified representation in a low-dimensional space, roughly analogous to tokens in text. Just as LLMs abstract all text, symbols, and code into tokens, Sora abstracts images and videos into patches. In simple terms, OpenAI divides videos and images into many small pieces, much like individual puzzle pieces; each of these patches acts like a small card carrying a piece of information for the model to learn from.
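    To make the puzzle-piece analogy concrete, here is a hedged toy sketch of how a short clip could be cut into flattened "spacetime" patches. The patch sizes and video shape are illustrative assumptions; OpenAI has not published Sora's exact values.

```python
import torch

# Toy "spacetime patch" extraction; sizes below are assumptions for illustration.
def patchify(video: torch.Tensor, pt: int = 4, ph: int = 16, pw: int = 16) -> torch.Tensor:
    """Split a video of shape (T, H, W, C) into flattened spacetime patches.

    Each patch covers `pt` consecutive frames and a `ph` x `pw` pixel block and is
    flattened into one vector, loosely analogous to a text token in an LLM.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dimensions must divide evenly"
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.permute(0, 2, 4, 1, 3, 5, 6)        # group patch indices, then patch contents
    return v.reshape(-1, pt * ph * pw * C)    # (num_patches, patch_dim)

# Example: a 16-frame 128x128 RGB clip becomes 256 patch "tokens",
# each a 3072-dimensional vector (4 * 16 * 16 * 3).
clip = torch.rand(16, 128, 128, 3)
tokens = patchify(clip)
print(tokens.shape)  # torch.Size([256, 3072])
```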

    Through this method, OpenAI compresses videos into a low-dimensional space and then uses a diffusion process to generate content: starting from frames filled with random noise, the model removes the noise step by step until a clear, coherent video scene emerges. The whole process is somewhat like gradually sharpening a blurry photo.
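    That "sharpening" loop can be sketched as follows. This is a toy illustration assuming a TinyDiT-style denoiser like the one above, not OpenAI's actual sampler; real samplers (e.g. DDPM or DDIM) use a learned noise prediction and a proper noise schedule, and the stand-in denoiser here just shrinks the noise so the loop runs on its own.

```python
import torch

# Toy reverse-diffusion loop over the compressed patch space.
def generate_latents(denoise_step, num_patches=256, patch_dim=3072, steps=50, seed=0):
    torch.manual_seed(seed)
    x = torch.randn(num_patches, patch_dim)   # begin with pure random noise
    for t in reversed(range(steps)):          # walk the diffusion process backwards
        x = denoise_step(x, t)                # each step removes a little noise
    return x                                  # a separate decoder maps latents back to pixels

# Illustration only: each "step" merely shrinks the noise slightly.
latents = generate_latents(lambda x, t: 0.98 * x)
print(latents.shape)  # torch.Size([256, 3072])
```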

    According to OpenAI, this unified representation of visual data brings two main benefits. First, sampling flexibility: Sora can sample widescreen 1920x1080 videos, vertical 1080x1920 videos, and everything in between. This lets Sora create content directly for different devices in their native aspect ratios and quickly prototype content at lower resolutions.

    Second, improved framing and composition. Empirical findings show that training on videos at their native aspect ratios improves composition and framing. For example, models that crop all training videos to squares sometimes generate videos in which the subject is only partially visible. In contrast, Sora demonstrates improved video framing.

    Why was OpenAI able to conceive this unified representation of visual data? Beyond the technical reasons, it largely comes down to OpenAI thinking about AI video generation differently from Pika and Runway. Before Sora's release, AI video generation was often seen as one of the first vertical applications of AI, since it readily suggests the potential to disrupt industries like short video, film and TV, and advertising.

    Because of this, nearly all AI video generation companies have fallen into homogeneous competition, focusing excessively on higher resolution, higher success rates, and lower costs rather than on longer durations or on building world models. Pika and Runway, for example, produce videos no longer than about 4 seconds; while the visuals may be impressive, the dynamic motion of objects remains subpar.

    OpenAI's exploration of AI video generation, however, seems to follow a different path: leveraging world models to bridge the boundary between virtual and real worlds, achieving true AGI. As stated in OpenAI's Sora technical report: "We believe the capabilities demonstrated by Sora today prove that continued scaling of video models is a promising path toward developing simulators of the physical and digital world—including the objects, animals, and people that inhabit it."

    The concept of a "world model" has been championed most prominently by Yann LeCun, Chief AI Scientist at Meta, in his 2022 position paper on autonomous machine intelligence. In essence, it refers to modeling the real physical world so that machines can develop a comprehensive and accurate understanding of the world, much as humans do, particularly of the many natural laws that govern it.

    In other words, OpenAI prefers to regard Sora as a foundation model for understanding and simulating the real world, a significant milestone on the road to AGI, rather than merely another AI application scenario. This means that, compared to other players, OpenAI consistently approaches problems from a higher-dimensional perspective, which in practice makes them much easier to solve. As Einstein put it, we cannot solve problems with the same thinking we used when we created them. From this perspective, it also explains why OpenAI manages to surprise the industry time and again.

    Although AI-generated videos still have various issues, such as models struggling to accurately simulate the physics of complex scenes or failing to grasp specific instances of cause and effect, it is undeniable that Sora has at least begun to comprehend some rules of the physical world. This challenges the notion that "seeing is believing" and poses an unprecedented challenge to judging the authenticity of anything that appears to obey physical rules.

    When large models shift from learning patterns in past texts to learning from videos and the real world, and as the logic of Scaling Law emerges across various fields, the boundary between the cyber world and the physical world may become increasingly blurred.

    Disclaimer: This article (report) is written based on publicly available information or information provided by interviewees, but ReadCJ and the article author do not guarantee the completeness or accuracy of such information. Under no circumstances shall the information or opinions expressed in this article (report) constitute investment advice to any individual.
