Deep Dive into Sora: A Billion-Dollar Investment in Video Aesthetics

    baoshi.rao wrote:

    Two weeks ago, Sora emerged, dropping another bombshell in the AI field. Demo videos show that Sora can already generate complex scenes with multiple characters performing specific movements in intricate settings.

    [Video generated by Sora. Source: OpenAI]

    OpenAI stated in its technical report: "Video generation models like Sora are simulators of the world. Sora is a foundation for models that can understand and simulate the real world, and we believe this capability will be an important milestone toward achieving AGI."

    However, not everyone agrees with this characterization. Yann LeCun, Chief AI Scientist at Meta, argues: "Generating realistic videos from text prompts alone doesn't mean the model understands the physical world."

    Why does Sora achieve such stunning results? After studying the technical report and interviewing industry experts, we found that Sora doesn't employ particularly groundbreaking techniques. But in a field where video model technology routes have not yet converged, its outstanding performance significantly reduces trial-and-error costs for other market participants and provides valuable design concepts and product logic for video generation. What disruptive changes will Sora bring to the industry? How will the video model industry respond to the challenges and seize the opportunities this time?

    "Sora has shown peers in this field a path forward, demonstrating that Transformer can also exhibit strong emergent capabilities in the video modality," said Bai Zeren, Vice President of Investments at Linear Capital, to 36Kr.

    He believes this will accelerate the R&D pace of other large video model companies, ushering in new opportunities, while open-source technologies will also see further advancements. For many, Sora has unlocked new possibilities for multimodal video models. OpenAI has once again single-handedly elevated multimodal video models to unprecedented levels. Prior to this, the video domain had not seen any groundbreaking products like ChatGPT or Midjourney due to technical challenges and dataset limitations.

    From a product perspective, Sora demonstrates clear advantages over other similar models in terms of video duration, content consistency, coherence, and resolution.

    From the demo effects released by OpenAI, it's evident that within the one-minute generated videos, scenes change smoothly with camera movements while maintaining content consistency. This was something we rarely experienced when using video model products like Pika and Runway before. For example, in "Hands-on Testing of Pika 1.0: After Investing 390 Million Yuan, the Actual Results Fall Short of the Hype | Product Review," we found that after Pika converted a static image into a dynamic video, the faces in the video appeared distorted and slightly blurrier than in the original photo.

    Take this model photo as an example. We uploaded the image to Pika (the image is from IC photo).

    [Dynamic video generated by Pika from the image above]

    Video generation is technically more complex than text-to-image. AI video generation tools not only need to master basic natural language understanding, but also require strong performance in smoothness, style accuracy, stability, consistency, and motion coherence.

    Sora has achieved a one-minute duration, which no other product has accomplished. From a technical perspective, even extending a generated video from 4 seconds to 10 seconds involves extremely complex challenges.

    One critical consideration is the video generation logic: whether it follows the image-to-video approach (generating images first, then converting them into videos) or the video-native design concept (unifying image and video encoding for mixed training). "If choosing the image-to-video route, using frame-by-frame images to compose a video—for example, first generating a 4-second video consisting of 32 frames, then taking the last frame of this video as the starting point for the next 4-second video—this method is theoretically feasible. However, errors accumulate during the process, and video generation involves content continuity issues, making the problem more complex. As a result, the image at the 10th second would significantly deviate from the initial video," said Yao Ting, CTO of Zhixiang Future.
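    To make the error-accumulation problem concrete, here is a toy sketch of the chunked image-to-video route Yao Ting describes. The clip generator below is a stand-in (a random walk, not a real model); the point is only that each chunk is seeded with the last frame of the previous one, so per-chunk deviations compound instead of averaging out.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def generate_clip(seed_frame: np.ndarray, num_frames: int = 32) -> np.ndarray:
        """Toy stand-in for an image-conditioned clip generator: each frame is
        the previous one plus a small error term. Not a real model."""
        frames = [seed_frame]
        for _ in range(num_frames - 1):
            frames.append(frames[-1] + rng.normal(0, 0.01, seed_frame.shape))
        return np.stack(frames)

    # Image-to-video route: chain short chunks, seeding each one with the last
    # frame of the previous chunk. The only "memory" carried forward is that
    # single frame, so drift accumulates across chunks.
    first_frame = np.zeros((8, 8))
    chunks = [generate_clip(first_frame)]
    for _ in range(14):                       # ~15 chunks of 32 frames
        chunks.append(generate_clip(chunks[-1][-1]))
    video = np.concatenate(chunks)

    drift = np.abs(video - first_frame).mean(axis=(1, 2))
    print(f"drift after chunk 1: {drift[31]:.3f}, after chunk 15: {drift[-1]:.3f}")
    ```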

    Pika once mentioned the challenges of this approach in an interview: "When the video is long, ensuring each frame is coherent is quite a complex issue. During training, handling video data requires processing multiple images—how to transfer 100 frames to the GPU is one challenge. During inference, generating a large number of frames makes the process slower compared to single images, and computational costs also increase."

    Sora adopted a mixed training approach. In the technical report, OpenAI mentioned training on a mixture of images and videos, representing the visual data as patches (visual patches). Yao Ting believes that the video-native design philosophy adopted by OpenAI, which naturally incorporates images into model training as single-frame videos, allows Sora to seamlessly double as an image generation model. This will prompt engineers to rethink the design logic of video generation.
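    A minimal sketch of what "images and videos as patches" can look like, assuming ViT-style spacetime patches. OpenAI has not published Sora's patchification details, so the shapes and patch sizes below are illustrative only.

    ```python
    import numpy as np

    def patchify(video: np.ndarray, pt: int = 4, ph: int = 16, pw: int = 16) -> np.ndarray:
        """Split a video of shape (T, H, W, C) into spacetime patches and flatten
        each patch into one token vector. An image is just the T=1 special case,
        which is what lets image and video data share one training pipeline."""
        T, H, W, C = video.shape
        t, h, w = T // pt, H // ph, W // pw
        patches = (video[: t * pt, : h * ph, : w * pw]
                   .reshape(t, pt, h, ph, w, pw, C)
                   .transpose(0, 2, 4, 1, 3, 5, 6)
                   .reshape(t * h * w, pt * ph * pw * C))
        return patches  # (num_tokens, token_dim)

    clip = np.random.rand(16, 256, 256, 3)       # a short video
    image = np.random.rand(1, 256, 256, 3)       # an image as a 1-frame video
    print(patchify(clip).shape)                  # (1024, 3072)
    print(patchify(image, pt=1).shape)           # (256, 768)
    ```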

    He mentioned: "This also gives us inspiration. From the effects of Sora, we can see that mixed training with images and videos is crucial. Without this, it would be difficult to achieve such heights. Of course, this also proves that OpenAI has coupled the technical architecture very well."

    Additionally, regarding the smooth camera movements in the videos generated by Sora, some speculate that, since the team includes dedicated digital content professionals, Sora incorporates 3D-rendered data in its training set, making it more adept at generating camera movements and simulating 3D visual effects than other products. These are some of the product design details behind Sora's stunning effects.

    Beyond the amazement, another question is worth pondering: while OpenAI refers to Sora as a simulator of the world, its current output also reveals certain limitations.

    "Sora may not truly understand the world," Professor Wang Jun from the UCL Department of Computer Science told 36Kr. He gave an example: in the real physical world, when a glass bottle shatters, its collisions with other objects must adhere to the laws of physics. "If Sora generates videos by predicting the next token, building a world model that truly aligns with logical and physical laws becomes a challenge. Just like language models, some may focus solely on producing human-understandable language, but that doesn’t mean they genuinely comprehend physical logic."

    According to information on OpenAI’s official website, the Sora team has been in existence for less than a year, with a core team of just 15 members, some of whom are even from the post-00s generation.

    Why Sora has achieved such impressive results in such a short time remains a mystery. In its recent technical blog, OpenAI also said it won't share technical details, providing only the model's design philosophy and demo videos. Given OpenAI's increasingly closed approach, it's unlikely we'll gain more meaningful technical insight in the future. Many are discussing Sora's technical approach. Currently, there are two mainstream video model frameworks: the Diffusion model and the Auto-regressive model, the latter being the framework behind the well-known GPT models. Historically, the mainstream framework for video generation models hasn't converged on a definitive path the way language models have.

    Yao Ting, CTO of Zhixiang Future, told 36Kr that the difference between the two approaches lies in: "The Diffusion model, based on its noise-adding and denoising mechanism, can better structure and generate higher-quality video content, while the Auto-regressive model is more suitable for long-context understanding and naturally adapts to multimodal dialogue generation methods." In the specific technical implementation, different sub-architectures continue to emerge under the two major routes. For example, under the Diffusion model route, Gen-2 and Pika adopt the U-net (convolutional neural network) architecture, while some companies replace the U-net architecture with the Transformer architecture, adopting the DiT (Diffusion Transformer) architecture.

    Sora is believed to have adopted the DiT architecture. This is currently a widespread market speculation, primarily based on the 2023 paper Scalable Diffusion Models with Transformers co-authored by Bill Peebles, one of Sora's lead researchers, and New York University assistant professor Xie Saining. According to 36Kr, domestic multimodal video model startup Aishu Technology chose this technical route from its inception, while another startup HiDream also adopted the DiT architecture.

    Yao Ting stated: "We've actually independently developed and validated a mature DiT architecture for image generation models. Compared to U-Net, the DiT architecture offers greater flexibility and can enhance the quality of image and video generation."
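    For context on what the DiT route looks like in code, below is a minimal Diffusion Transformer block in the spirit of the Peebles and Xie paper: patch tokens pass through a standard Transformer block whose layer norms are modulated by the conditioning signal (diffusion timestep, text embedding). This is a generic sketch of the published DiT idea, not Sora's or any particular company's implementation.

    ```python
    import torch
    import torch.nn as nn

    class DiTBlock(nn.Module):
        """One Transformer block with adaptive layer norm (adaLN) conditioning,
        the core idea of DiT: replace the U-Net backbone of a diffusion model
        with Transformer blocks operating on patch tokens."""

        def __init__(self, dim: int, num_heads: int):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
            self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            # The conditioning vector (timestep + text embedding) predicts
            # per-block shift/scale/gate parameters for both sub-layers.
            self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

        def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
            # x: (batch, num_patch_tokens, dim); cond: (batch, dim)
            s1, g1, b1, s2, g2, b2 = self.adaLN(cond).chunk(6, dim=-1)
            h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
            x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
            h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
            x = x + g2.unsqueeze(1) * self.mlp(h)
            return x

    # Example: 1024 spacetime-patch tokens of width 768, one conditioning vector each.
    block = DiTBlock(dim=768, num_heads=12)
    tokens = torch.randn(2, 1024, 768)
    cond = torch.randn(2, 768)
    print(block(tokens, cond).shape)   # torch.Size([2, 1024, 768])
    ```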

    From a purely technical perspective, Sora's chosen architecture isn't particularly rare; different video model companies simply made different choices earlier on based on their own considerations.

    "There's nothing particularly special in the technical approach shown by Sora. **OpenAI must have its own unique training methods," Wang Jun told 36Kr. He mentioned, "Through large-scale training, it is possible to leverage massive amounts of data and computational resources to achieve outstanding engineering results. In my view, computational power and data have not yet reached their limits; there is still room for further development. We can further tap into the potential of data, conducting deeper processing on text, images, and even videos, to elevate model capabilities to new heights."

    So, although there is no innovation in the underlying technical approach, OpenAI's strength lies in continuing to practice the brute-force aesthetics of large-scale computation and big data on this path: achieving breakthroughs through sheer scale and relying on meticulous engineering to keep improving the model's emergent capabilities. OpenAI mentioned in the report that its results suggest scaling video generation models is a promising path to building general-purpose simulators of the physical world: under the same sample conditions, as the scale of training computation increases, video quality improves significantly, and many interesting emergent capabilities appear, enabling Sora to simulate certain aspects of people, animals, and environments in the real world.

    Additionally, OpenAI also noted in the paper that Sora incorporates capabilities from products like GPT.

    Yao Ting believes that Sora's strength is built upon previous research on the DALL-E and GPT models. "Sora is a vehicle for OpenAI to integrate its capabilities in language (GPT), visual understanding (GPT-4V), and image generation (DALL-E). It uses DALL-E 3's recaptioning technique to generate highly descriptive annotations for visual training data, enabling it to follow users' text instructions more faithfully."

    Currently, there is rampant speculation about Sora's parameter count and training data, with widely varying estimates. Some guess that Sora's parameter scale is in the billions, with training costs in the tens of millions of dollars, while others believe the parameter scale might be as low as 3 billion, though data annotation costs remain high. There are even claims that Sora's inference compute demand could exceed GPT-4's by over 1,000 times.

    Li Zhifei, founder of Mobvoi, suggested that Sora might have used millions of hours of training data: "Typically, video resolution exceeds 128*128, resulting in a token count of at least tens of trillions. If Sora was trained on 5 million hours of video data, that would roughly equate to 9 days' worth of data production on YouTube."
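    As a rough sanity check on that estimate, the back-of-the-envelope arithmetic looks like this; the frame rate and patch size are assumptions filled in for illustration (none of these figures are disclosed by OpenAI):

    ```python
    # Rough check of the token estimate quoted above. All inputs are
    # assumptions for illustration, not disclosed values.

    hours = 5_000_000          # assumed training data, per Li Zhifei's estimate
    fps = 30                   # assumed frame rate
    height = width = 128       # the minimum resolution mentioned in the quote
    patch = 16                 # assumed spatial patch size (ViT-style)

    frames = hours * 3600 * fps                              # ~5.4e11 frames
    patches_per_frame = (height // patch) * (width // patch)  # 64
    total_tokens = frames * patches_per_frame                 # ~3.5e13

    print(f"{total_tokens:.2e} visual tokens")  # 3.46e+13, i.e. tens of trillions
    ```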

    Parameters and data volume are just one aspect of the model. Compared to text models, video models involve higher data complexity, more dimensions, fewer high-quality data sources, and greater challenges in data annotation, all of which pose engineering difficulties for companies training such models.

    At this moment, for other video AI companies, Sora's impressive capabilities have validated the DiT architecture on one hand, reducing the trial-and-error costs in technical architecture selection and enabling faster progress. On the other hand, they now face more challenging practical issues: how to enhance their engineering capabilities in algorithms, data, and other aspects to catch up with Sora without the same level of talent and computational resources as OpenAI.

    After Sora's release, some are pessimistic, believing that "with Sora's emergence, other video companies are doomed" and "the gap between domestic and international players has widened further." Others, after analyzing more details, argue that the opportunities for video models will expand significantly, stimulated by Sora, leading to new development spaces.

    On one hand, Sora's technical approach offers valuable insights, allowing other companies to avoid indecision in their strategies and accelerate product development. On the other hand, as Sora draws more attention to the market, it will attract more talent, computational power, data, and funding, creating new entrepreneurial opportunities.

    From Sora's current progress, it's evident that real-time generation has not been achieved, and the waiting time for video generation is quite long. This means Sora has not yet undergone the kind of large-scale user testing that ChatGPT has, and its computational resources and optimization level are not yet ideal, requiring further iteration. This leaves time and space for other companies.

    According to user reports on Reddit, when demonstrating Sora's capabilities, OpenAI primarily used pre-selected examples and did not allow the public to generate videos with custom prompts. Additionally, rendering a 1-minute video with Sora takes over an hour.

    Wang Changhu, founder of Aishu Technology, stated that in his view, Sora's current technological development is between GPT-2 and GPT-3, not yet at the level of GPT-4, leaving significant room in the market.

    Bai Zeren, Vice President of Investments at Linear Capital, told 36Kr: "The development of models will accelerate the emergence of more prosperous upper-layer applications, bringing more opportunities for application innovation, including video models and multi-modal application scenarios. However, how to differentiate and establish long-term moats is a constant challenge for product-layer startups. Entrepreneurial teams need to pay more attention to building barriers beyond models, returning to product experience, application scenarios, and the essence of business."

    In terms of market progress, many domestic companies have long been laying out their strategies. Major players have been continuously active in the video field, advancing their language model businesses while also deploying video model businesses. Compared to tech giants with abundant computing resources and talent pools, startups face greater challenges but still have opportunities. According to 36Kr, several Chinese startups including HiDream.AI, Aishu Tech, HeyGen, Shengshu Tech, and RightBrain Tech have already made strategic moves in video AI model development. Unlike the previous wave of language models, which developed separately in domestic and international markets, companies like Aishu Tech are targeting overseas markets from the outset, competing with Sora in the same arena.

    Industry veterans have already entered this field. Wang Changhu, founder of Aishu Tech, previously served as ByteDance's visual technology lead, overseeing products like Douyin and TikTok, and spearheaded ByteDance's visual AI model development from scratch. Mei Tao, founder of HiDream.AI, was formerly JD.com's vice president and a senior researcher at Microsoft Research Asia. Shengshu Tech is led by Professor Zhu Jun, deputy director of Tsinghua University's AI Institute, with core team members coming from Tsinghua's AI research center.

    Considering the current progress of domestic video AI model companies, both tech giants and startups will significantly increase investments, intensifying industry competition. Tech giants hold advantages in talent, funding, computing power, data resources, and application scenarios, while startups can leverage their agility to accelerate model and product iteration, seizing innovation opportunities at the product level. Additionally, on the commercialization path, since Sora has not been opened for public testing like ChatGPT, there is currently no clear business model to observe. However, based on the signals released by OpenAI, it is likely that the core will still revolve around a general-purpose model.

    For Chinese startups, under multiple pressures such as computing power costs and data training, they will face route selection earlier in the commercialization process.

    In the future, video model startups will also diverge into different paths as they develop. One path is to continuously enhance the capabilities of the foundational model and pursue C-end products; Aishu Technology has chosen this path, and according to the traffic monitoring website similarweb.com, its overseas product PixVerse has seen rapid growth in monthly visits, already exceeding one million. The other path is to focus on specific scenarios for training, creating specialized video models to quickly close a commercialization loop in certain B-end scenarios.

    Yao Ting believes that in the video generation field, startups need to consider early on how to build their products in order to find opportunities for differentiation. "Currently, video production is still in the single-shot stage. In the future, if we want to produce a short video or mini-drama, the video production process will need to consider various issues such as multi-shot sequences, storyboarding, narrative logic, etc. These product-related issues must be considered in advance."

    Technology, product development, and commercialization: each area contains thousands of detailed problems to be solved. For every large video model player, the remaining time in 2024 will be a tough battle.
