Sora Is Here: How Do We Understand and Surpass It?
On February 15th, OpenAI's text-to-video model Sora made its debut, unexpectedly sparking another multimodal industrial revolution.
Sora can generate one-minute videos based on text instructions or static images. These videos feature intricate scenes, vivid character expressions, and complex camera movements. It also supports extending existing videos or filling in missing frames.
Huafu Securities noted that Sora has achieved industry-leading levels in video fidelity, resolution, and text comprehension. Moreover, when trained on sufficiently large datasets, Sora demonstrates emergent capabilities, endowing the video generation model with the potential to act as a universal simulator of the physical world. The technological shockwave from across the ocean is sweeping the globe, and the AIGC-related industrial chain is stirring.
Recently, the information consulting company Six Degrees Intelligence received requests from multiple clients who wished to interview experts in the AIGC field to gain a deeper understanding of industry trends, which indirectly confirms the booming state of the AIGC industry. Six Degrees Intelligence is a business information retrieval platform that provides clients with high-quality, industry-leading research decision support and expert knowledge-sharing services. It currently has a network of over 50,000 overseas experts and operates in regions including North America, Asia, Europe, and Southeast Asia.
According to interviews facilitated by Six Degrees Intelligence, many experts believe that Sora has demonstrated a positive impact in the fields of virtual reality (VR), augmented reality (AR), and mixed reality (MR), and has also shown potential in multimedia processing. Given its technical characteristics, Sora may be widely used in the future to produce high-risk or highly creative shots. In fact, text-to-video AIGC technology is not a new track. What technical advantages lie behind Sora's stunning performance?
From text generation model GPT, text-to-image model DALL·E, to text-to-video model Sora, OpenAI may have established its own AGI (Artificial General Intelligence) technical route. The continuously expanding and deepening applications of multimodal large models indicate that OpenAI has mastered the core competitiveness of large models, namely the "scaling law" - the larger the model and the more data, the better the performance.
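The "scaling law" mentioned above is usually expressed as a power law: loss falls predictably as parameter count grows. The following is an illustrative sketch only; the constants are invented for demonstration and are not OpenAI's actual fit.

```python
# Toy power-law "scaling law": L(N) = a * N**(-alpha).
# The constants a and alpha below are made up for illustration.

def predicted_loss(n_params: float, a: float = 10.0, alpha: float = 0.08) -> float:
    """Predicted loss as a function of parameter count under a toy power law."""
    return a * n_params ** -alpha

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")

# Under any power law with alpha > 0, a larger model yields strictly lower loss.
assert predicted_loss(1e11) < predicted_loss(1e9) < predicted_loss(1e8)
```

The point of the sketch is only the monotonic trend: under such a fit, "larger model, more data, better performance" becomes a quantitative prediction rather than a slogan.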
In an interview facilitated by Six Degrees, a former Baidu Cloud Solutions Engineer stated: In the field of artificial intelligence, the latest advancements in image and video generation technology are mainly reflected in several aspects. First, current technology can map text into latent representations, meaning we can generate corresponding images and videos from simple text descriptions.
Second, by utilizing diffusion models and physics engines, we can ensure that the generated images are not only continuous but also comply with physical laws. Such technology can iteratively refine details on sketches or images, thereby improving resolution and expressiveness.
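The iterative refinement the engineer describes can be caricatured in a few lines. This is emphatically not Sora's actual algorithm: the "latent" is a short list of floats, the "text-conditioned target" is hypothetical, and the denoiser is a toy blending step, but the shape of the process, starting from noise and repeatedly pulling the sample toward a conditioned target, is the core idea of diffusion.

```python
import random

random.seed(0)

def denoise_step(sample, target, strength=0.3):
    """One toy refinement pass: blend the noisy sample toward the target."""
    return [s + strength * (t - s) for s, t in zip(sample, target)]

# Stand-in for the latent a text prompt maps to (purely illustrative values).
target = [0.2, -0.5, 0.9, 0.0]
sample = [random.gauss(0, 1) for _ in target]  # start from pure Gaussian noise

def error(x):
    """Squared distance from the target latent."""
    return sum((a - b) ** 2 for a, b in zip(x, target))

initial_error = error(sample)
for _ in range(20):  # repeated "patching" of details, pass after pass
    sample = denoise_step(sample, target)

assert error(sample) < initial_error  # each pass sharpens the result
```

Real diffusion models learn the denoising step from data and condition it on a text embedding; the fixed blend here only mimics the convergent behavior.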
Additionally, technologies like Sora can generate high-definition videos while ensuring content continuity and adherence to physical laws during the generation process. With the improvement in AI-generated video clarity, we have witnessed rapid development from 2K to 4K. Sora's characteristics in video processing and 3D data conversion are primarily demonstrated through its meticulous data processing. Instead of merely segmenting long videos into shorter clips, it performs annotation and processing at the smallest unit granularity.
This approach significantly differs from traditional video training methods and data preprocessing. In converting video data into 3D data, tools like NeRF or PiCkBirds are typically required. Sora's model has important applications in scenarios such as augmented reality (AR) and virtual reality (VR), enabling spatial computing, 3D reconstruction, and world simulation.
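What "annotation and processing at the smallest unit granularity" looks like mechanically can be sketched as cutting a video tensor into non-overlapping spacetime patches, the token unit OpenAI's report describes. The dimensions and patch sizes below are invented for illustration.

```python
# Toy spacetime patching: a video of (frames, height, width) is divided into
# non-overlapping (pt, ph, pw) blocks, each becoming one token.

def count_spacetime_patches(frames, height, width, pt, ph, pw):
    """Number of non-overlapping spacetime patches in a video."""
    assert frames % pt == 0 and height % ph == 0 and width % pw == 0
    return (frames // pt) * (height // ph) * (width // pw)

def patch_origins(frames, height, width, pt, ph, pw):
    """Yield the (t, y, x) origin of each patch in scan order."""
    for t in range(0, frames, pt):
        for y in range(0, height, ph):
            for x in range(0, width, pw):
                yield (t, y, x)

# Hypothetical example: 16 frames of 64x64 video with 4x16x16 patches.
n = count_spacetime_patches(16, 64, 64, 4, 16, 16)
assert n == 64  # 4 temporal x 4 vertical x 4 horizontal blocks
assert len(list(patch_origins(16, 64, 64, 4, 16, 16))) == n
```

Because each patch spans time as well as space, a single token already carries motion information, which is one reason the approach differs from simply slicing long videos into shorter clips.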
From the Sora technical report published on OpenAI's official website, it can be observed that the theoretical foundation of Sora's DiT architecture is an academic paper titled Scalable Diffusion Models with Transformers. The paper was co-authored in December 2022 by William (Bill) Peebles, then a researcher at UC Berkeley and now the technical lead of the Sora team, and Saining Xie, a researcher at New York University. After the release of Sora, Xie wrote on the platform X, "When Bill and I were involved in the DiT project, we didn't focus on innovation but rather on two aspects: simplicity and scalability." He stated, "Scalability is the core theme of the paper. The optimized DiT architecture runs significantly faster than U-Net (the backbone traditionally used in diffusion models). More importantly, Sora has demonstrated that the DiT scaling laws apply not only to images but now also to videos—Sora replicates the visual scaling behavior observed in DiT."
Overall, Sora's leading position is first attributed to the sufficiently large-scale GPT models. Building on this, Sora processes data more meticulously, annotating and handling it at the smallest unit granularity, leveraging unique text-annotated video datasets and other technical and resource advantages. These factors likely contribute to Sora's industry-leading status.
Many industry professionals believe that the emergence of Sora could transform a range of creative industries, affecting fields from film production and advertising to graphic design, game development, social media, influencer marketing, and even educational technology. The turmoil in the secondary market has already confirmed this statement. On the day after Sora's release, Adobe's stock price plummeted by over 7%, Shutterstock dropped by more than 5%, and Alphabet, which had recently launched the text-to-video tool Lumiere, saw its shares decline by 1.58%. The three companies collectively lost nearly $48 billion in market value in a single day.
Setting aside simplistic discussions about which industries might be replaced, it is clear that Sora and its technology significantly expand the application scenarios of AIGC (AI-generated content). These applications are not limited to the creative content industry; fields like autonomous driving can also benefit from this technology.
In another interview facilitated by Six Degrees, a former product director at SenseTime noted that the Sora model shows broad application potential across multiple industries. For instance, in the gaming industry, the Sora model can be used to generate intricate maps, significantly enriching the detail and realism of game worlds. In the autonomous driving sector, the model's ability to understand video and visual content may positively advance the development of autonomous driving technologies.
Professionals in the content creation field should also take note: as artificial intelligence technologies are integrated, the methods of content production and editing will change, potentially prompting a reevaluation of skill requirements and workflows. As a visual model, Sora relies on GPT to enhance its understanding of text, a strength distinct from that of the GPT1.5 model. Additionally, the Sora model incorporates transformer algorithms to improve the consistency of video content, an area where it excels.
It is important to clarify that the Sora model is unlikely to completely replace manual video editing. Human creativity and aesthetic understanding play a crucial role in the AI-assisted creative process. In autonomous driving technology, the visual system is a critical component, and the Sora model may have learned data that includes three-dimensional depth information. As AI's accuracy in simulating three-dimensional worlds continues to improve, the challenges of autonomous driving are correspondingly reduced. Just like the birth of ChatGPT, from a technical perspective, OpenAI's series of products have undoubtedly reconstructed the foundation of artificial intelligence from the ground up. Technologies based on large models are the inevitable direction of the new wave of AI revolution.
However, when it comes to commercialization, it remains an unavoidable issue in every AI revolution.
Currently, OpenAI, which operates primarily on a to-C (consumer-facing) model, is still in a phase where investment far exceeds revenue. Despite its astonishing technology, it has yet to produce a killer application among consumers. This is also the current state of the large model frenzy in China. In another interview facilitated by Six Degrees, a former AI scientist at SenseTime stated,
"When discussing the commercialization prospects and potential challenges of the Sora model in the current market, I believe the implementation of new technologies always comes with high expectations. However, the Sora model may face some commercialization gaps in the short term.
Despite this, the Sora model has enormous application potential in fields such as e-commerce, controllable content generation, game production, the entertainment industry, and interior design. However, realizing these applications will take some time. Therefore, investing in the commercialization of the Sora model at this stage may not be entirely reasonable. Nevertheless, we should recognize that the era of artificial intelligence being applied to real-life scenarios is approaching, which will bring more opportunities for the future development of the Sora model." Meanwhile, large model developers in China are facing even greater challenges.
A former Baidu Cloud solution engineer stated:
"When developing AI technology products similar to Sora in China, the main technical challenges we face include establishing model architectures and exploring data processing and training methods. Developing physics engines also presents a significant challenge." It's worth mentioning that short video companies and professional video content producers have obvious advantages in accumulating datasets for AI model training. Meanwhile, the computational power required for AI models will increase significantly for both training and inference.
Looking ahead, model training clusters will require large quantities of NVIDIA H100 GPUs, with training cycles potentially lasting two to three months. These are all critical issues we need to focus on and solve during the development process.
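A back-of-envelope calculation shows how a multi-month training cycle follows from cluster size and throughput. Every number below is an assumption for illustration (total training FLOPs, per-GPU throughput, sustained utilization), not a figure from OpenAI or NVIDIA.

```python
# Toy training-cluster sizing: days = total FLOPs / sustained cluster FLOP/s.

def training_days(total_flops, n_gpus, flops_per_gpu=1e15, utilization=0.4):
    """Days to finish training given cluster size and sustained throughput.

    flops_per_gpu is a rough ~1 PFLOP/s peak per accelerator; utilization
    reflects that real training sustains only a fraction of peak.
    """
    sustained = n_gpus * flops_per_gpu * utilization  # FLOP/s the cluster sustains
    return total_flops / sustained / 86_400           # 86,400 seconds per day

# Hypothetical run: 1e25 FLOPs on a 4,096-GPU cluster at 40% utilization.
days = training_days(1e25, 4_096)
print(f"about {days:.0f} days")  # lands in the two-to-three-month range
```

Under these assumed inputs the run takes roughly seventy days, which is why a two-to-three-month cycle on thousands of H100-class GPUs is a plausible order of magnitude.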
Synthesizing expert opinions, we can draw the following conclusions. First, Sora's text-to-video technology leads the industry, but the model's characteristics are still being explored. Second, in terms of commercialization, Sora has wide-ranging application scenarios: beyond the content creation industry, potential uses include gaming and autonomous driving, though specific business applications remain to be seen. Third, when developing Sora-like models capable of generating long videos, it is crucial to enhance the model's memory capabilities, including contextual memory, reasoning abilities, long-term memory, and modeling of long sequences.
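The long-sequence memory idea can be caricatured as chunked generation with a rolling context window: each new chunk is conditioned on the tail of what was already generated, so content stays continuous across chunk boundaries. The "generator" below is a stand-in that merely extends a numeric sequence smoothly; a real model would condition on latent representations of prior frames.

```python
# Toy rolling-memory generation: extend a sequence chunk by chunk, feeding
# each step only the most recent `window` frames as context.

def generate_chunk(context, length=8):
    """Continue the sequence from recent context (toy linear extrapolation)."""
    last, prev = context[-1], context[-2]
    step = last - prev
    return [last + step * (i + 1) for i in range(length)]

def generate_long_sequence(seed, n_chunks=4, window=4):
    frames = list(seed)
    for _ in range(n_chunks):
        context = frames[-window:]   # rolling memory: only the recent tail
        frames.extend(generate_chunk(context))
    return frames

frames = generate_long_sequence([0.0, 1.0])
# Continuity check: no frame-to-frame jump exceeds the step size.
assert all(abs(b - a) <= 1.0 for a, b in zip(frames, frames[1:]))
```

The design point is that memory cost stays bounded (only `window` frames are carried forward) while continuity is preserved, which is the trade-off any long-video architecture must manage.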