Starting with Sora: A Comprehensive Interpretation of the Development History of AI Video Large Models

    Sora, OpenAI's large AI video-generation model, has garnered global attention since its announcement on February 15, 2024. Silicon Valley AI video researchers (not on the Sora team) have praised it: "It's really good, undoubtedly the No. 1."

    What makes Sora stand out? What are the challenges in the development of generative AI video? Is OpenAI's video model necessarily the right path? Has the so-called "world model" reached a consensus? In this video, we delve into the different schools of thought in the development history of generative AI video large models, the controversies, and future directions through interviews with frontline AI professionals in Silicon Valley. We actually wanted to tackle AI-generated video last year because, during conversations with many people, including VC investors, we realized that most didn't clearly understand the difference between AI video models and large language models like ChatGPT. But why didn't we proceed? Because by the end of last year, the best available in the market were Runway's Gen-1 and Gen-2, which offered video-to-video and text-to-video generation. However, the results we got... well, let's just say they left much to be desired.

    For example, when we used Runway to generate a video with the prompt "Super Mario walking in a desert," the output looked more like Mario jumping on the moon. Whether it was gravity or friction, the laws of physics seemed to have suddenly disappeared in that video. Then we tried another prompt, "A group of people walking down a street at night with umbrellas on the windows of stores." This prompt was also attempted by an investor, Garrio Harrison. The resulting video looked like this:

    Look at these floating umbrellas in the air—isn't it eerie... Yet, this was Runway, representing the most advanced technology last year. Later, Pika Labs, founded by Chinese entrepreneur Demi Guo, gained some popularity and was considered slightly better than Runway. However, it was still constrained by the 3-4 second length limitation, and the generated videos still had issues with video understanding logic, hand composition, and other defects.

    So, before OpenAI released the Sora model, generative AI video models did not attract global attention like ChatGPT or Midjourney, largely because the technical difficulty of generating videos is extremely high. Videos involve two-dimensional space plus time, transitioning from static to dynamic, from flat surfaces to the three-dimensional effects displayed across different time segments. This requires not only powerful algorithms and computing power but also solutions to a series of complex problems such as consistency, coherence, physical plausibility, and logical rationality. So, the topic of generative video models has always been on our Silicon Valley 101 list, but we kept postponing it, waiting for a major breakthrough in generative AI video models before tackling this subject. Unexpectedly, that moment has arrived much sooner than anticipated.

    Sora's demonstration undoubtedly outperforms previous models like Runway and Pika Labs.

    First, one of the biggest breakthroughs is quite obvious: the generated video length has significantly increased. Previously, Runway and Pika could only produce 3-4 second videos, which were too short. As a result, the only AI video works that gained widespread attention were fast-paced movie trailers, as other applications requiring longer footage simply couldn't be fulfilled. On Runway and Pika, if you need longer videos, you have to continuously prompt to extend the video duration. However, our video editor Jacob discovered a major problem with this approach.

    Jacob, Silicon Valley 101 Video Editor:

    The pain point is that as you keep extending the duration, the later parts of the video become deformed, resulting in inconsistency between the earlier and later segments. This makes the footage unusable. In Sora's latest showcased papers and demos, it's revealed that the system can directly generate approximately 1-minute video scenes based on text prompts. Simultaneously, Sora maintains seamless transitions between character scenes and thematic consistency throughout the video. This capability has even impressed professional editors who've seen it.

    (One Sora demo features) a girl walking through Tokyo streets... which I find particularly impressive. Even during dynamic motion sequences with spatial movement and rotation, Sora maintains consistent movement of characters and objects within the video scene.

    Third, Sora can accept videos, images, or text prompts as input, with the model generating videos based on user input. For example, one published demo shows a cloud bursting into bloom. This indicates the Sora model can animate static images, extending videos forward or backward in time. Fourth, Sora can read different video formats, whether widescreen or vertical, sample them, and output videos of different sizes from the same source while maintaining style consistency, such as in this sample clip of a small sea turtle. This is actually very helpful for video post-production. Currently, for 1920×1080 widescreen videos on platforms like YouTube and Bilibili, we need to manually recut them into vertical 1080×1920 videos to fit short video platforms like Douyin and TikTok. However, it's conceivable that in the future, we might be able to use Sora for one-click AI conversion, which is a feature I'm really looking forward to.

    Fifth, long-distance coherence and temporal continuity have been significantly improved. Previously, a major challenge in AI-generated videos was maintaining temporal continuity. However, Sora excels at remembering people and objects in videos, ensuring they remain logically consistent even when temporarily obscured or moved out of the frame. For example, in the sample video of a dog released by Sora, when people walk past it and completely block the view, the dog continues to move naturally once it reappears, maintaining temporal and object coherence. Sixth, the Sora model can already simulate simple actions of the world state. For example, a painter leaving new brushstrokes on a canvas that persist over time, or a person leaving bite marks on a hamburger. Some optimistic interpretations suggest that this indicates the model possesses a certain level of common-sense ability, "understanding" the physical world in motion and predicting what might happen next in the scene.

    Therefore, these groundbreaking updates from the Sora model have significantly heightened expectations and excitement for the development of generative AI videos. Although Sora still exhibits some logical errors—such as a cat appearing with three paws, unconventional obstacles in street scenes, or a person running in the wrong direction on a treadmill—it is undeniably the absolute leader compared to previous generative video models like Runway, Pika, or Google's VideoPoet. More importantly, OpenAI seems to be using Sora to demonstrate that the "brute force" approach of stacking computational power and parameters can also be applied to generative videos. By integrating diffusion models and large language models, this new model approach aims to lay the foundation for what is termed a "world model." These perspectives have sparked intense controversy and discussion within the AI community. Next, we will attempt to review the technological development path of generative AI models and analyze how the Sora model operates, as well as whether it is the so-called 'world model'.

    In the early stages of AI-generated video, it mainly relied on two models: GAN (Generative Adversarial Network) and VAE (Variational Autoencoder). However, the video content generated by these two methods was relatively limited, monotonous, and static, with often unsatisfactory resolution, making it completely unsuitable for commercial use. Therefore, we will not discuss these two models further here.

    Later, AI-generated video evolved along two technical approaches: one is the diffusion model adapted specifically for video, and the other is the Transformer model. Let's first talk about the diffusion model approach, which has been adopted by companies such as Runway and Pika Labs. Many people are unaware that the original Stable Diffusion model, now one of the most important open-source models, was jointly released by Runway and a team from the University of Munich. Stable Diffusion is also the underlying technological foundation for Runway's core products, the video tools Gen-1 and Gen-2.

    The Gen-1 model was released in February 2023, allowing users to alter the visual style of original videos by inputting text or images, such as transforming a real-world street scene captured on a phone into a cyberpunk world. In June, Runway released Gen-2, which further enables users to generate videos directly from text prompts.

    The principle of the Diffusion Model can be somewhat inferred from its name: it generates images or videos through a gradual diffusion process. To better explain the model's principle, we invited Dr. Zhang Songyang, one of the authors of Meta's Make-A-Video model paper who currently works on video generation models at Amazon's AGI team, to provide an explanation.

    Dr. Zhang Songyang, co-author of Meta's Make-A-Video paper and Applied Scientist on Amazon's AGI team:

    The reason why this paper initially used the term 'diffusion' comes from a physical phenomenon. For example, when we drop ink into a glass of water, the ink will spread out - this is called diffusion. This process is physically irreversible, but AI can learn this process and reverse it. Applied to images, it means an image keeps adding noise until it becomes a mosaic-like effect - a pure noise image. Then we learn how to transform this noisy image back into the original picture. Training such a model to complete the task in one step would be very difficult, so it is divided into many steps—for example, 1,000 steps. For instance, when a small amount of noise is added, the model can restore what the image looks like after denoising. Then, when more noise is added, how should the model predict the noise? It involves many steps, gradually removing the noise iteratively. It's like water and ink completely mixed together, and the challenge is to predict how it can step by step return to its original state—a single drop of ink. Essentially, it's the reverse process of diffusion.

    Dr. Zhang Songyang explained it vividly: The core idea of the diffusion model is to gradually generate realistic images or videos by continuously introducing randomness to the original noise. This process is divided into four steps:

    1) Initialization: The diffusion model starts with a random noise image or video frame as the initial input.

    2) Diffusion Process (forward process): The goal of the diffusion process is to make the image progressively less clear until it turns into complete noise.

    3) Reverse Process (backward diffusion): At this stage, a neural network, such as a U-Net built on convolutional neural networks (CNNs), is introduced to predict, at each time step, the noise that was added to produce the current blurred image. By removing this predicted noise, a less noisy version of the image is produced, eventually yielding realistic image content.

    4) Repeat: Repeat the above steps until the desired length of generated image or video is achieved. (A minimal code sketch of this loop follows.)
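    To make these four steps concrete, here is a minimal Python sketch of the forward (noising) and reverse (denoising) loops. The noise schedule, the placeholder denoiser, and the 64×64 frame size are illustrative assumptions for a generic DDPM-style model, not the actual configuration of Runway's or Stable Diffusion's video models.

    ```python
    import numpy as np

    T = 1000                                    # number of diffusion steps
    betas = np.linspace(1e-4, 0.02, T)          # noise schedule (assumed values)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    def forward_noise(x0, t, rng):
        """Step 2: diffuse a clean image x0 into its noisy version at step t."""
        eps = rng.standard_normal(x0.shape)
        xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
        return xt, eps                          # eps is what the network learns to predict

    def fake_denoiser(xt, t):
        """Placeholder for the trained noise-prediction network (e.g. a U-Net)."""
        return np.zeros_like(xt)                # a real model would predict the added noise

    def reverse_process(shape, rng):
        """Steps 1, 3, 4: start from pure noise and iteratively remove predicted noise."""
        x = rng.standard_normal(shape)          # step 1: initialize with random noise
        for t in reversed(range(T)):            # step 4: repeat until t = 0
            eps_hat = fake_denoiser(x, t)       # step 3: predict the noise at this step
            x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
            if t > 0:                           # inject a little fresh noise except at the end
                x += np.sqrt(betas[t]) * rng.standard_normal(shape)
        return x

    rng = np.random.default_rng(0)
    x0 = np.zeros((64, 64, 3))                  # a stand-in "clean" 64x64 RGB frame
    noisy, _ = forward_noise(x0, t=500, rng=rng)        # forward: clean frame -> noisy frame
    frame = reverse_process((64, 64, 3), rng)           # reverse: pure noise -> generated frame
    ```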

    The above describes the generation method for video-to-video or image-to-video, which is also the general underlying technical operation of Runway Gen-1. If the goal is to achieve text-to-video by inputting prompts, a few additional steps are required.

    For example, let's take the Imagen model released by Google in mid-2022: our prompt is "a boy riding on a rocket." This prompt is converted into tokens and passed to the text encoder. Google's Imagen model uses the T5-XXL LLM encoder to encode the input text into embeddings. These embeddings represent our text prompt, but in a form the machine can understand. The embedded text is then passed to an image generator, which produces low-resolution 64x64 images. Imagen then uses a super-resolution diffusion model to upscale the image from 64x64 to 256x256, followed by another super-resolution diffusion stage, ultimately generating high-quality 1024x1024 images that closely align with our text prompt.

    In summary, during this process, the diffusion model starts with random noise images and uses encoded text in the denoising process to generate high-quality images.
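    To show how these pieces chain together, here is a schematic of such a cascaded text-to-image pipeline in Python. The three stage functions are simple stand-ins for Imagen's T5-XXL encoder, its base diffusion model, and its super-resolution stages; their internals are placeholders written for illustration, not Google's implementation.

    ```python
    import numpy as np

    def t5_encode(prompt: str) -> np.ndarray:
        """Stand-in for the frozen T5-XXL text encoder: prompt -> embeddings."""
        rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
        return rng.standard_normal((len(prompt.split()), 4096))   # (tokens, embed_dim)

    def base_diffusion(text_emb: np.ndarray) -> np.ndarray:
        """Stand-in for the text-conditional base diffusion model: outputs a 64x64 image."""
        return np.zeros((64, 64, 3))

    def super_resolution(image: np.ndarray, target: int) -> np.ndarray:
        """Stand-in for a super-resolution diffusion stage (64->256, then 256->1024)."""
        return np.zeros((target, target, 3))

    prompt = "a boy riding on a rocket"
    emb = t5_encode(prompt)                      # 1. encode the prompt into embeddings
    low_res = base_diffusion(emb)                # 2. denoise from random noise, guided by the text
    mid_res = super_resolution(low_res, 256)     # 3. upscale 64x64 -> 256x256
    final = super_resolution(mid_res, 1024)      # 4. upscale 256x256 -> 1024x1024
    ```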

    Why is generating videos so much more difficult than generating images? The principle remains essentially the same, with the only difference being the addition of a timeline. As mentioned earlier, an image is 2D, consisting of height and width. A video, however, adds a timeline, making it 3D—comprising height, width, and time. In learning the reverse process of diffusion, it transitions from a 2D inverse process to a 3D inverse process, which is the key distinction.

    Consequently, issues present in images, such as whether generated faces appear realistic, will similarly manifest in videos. Videos also introduce unique challenges, such as maintaining consistency in the subject of the frames. Currently, the results for landscapes are quite acceptable, but when it comes to humans, the requirements are more precise, making it a more complex problem. Another significant challenge is extending the duration of generated videos, as the current output of 2-4 seconds falls far short of practical application needs.

    Diffusion models offer three main advantages over previous models like GANs: 1. Stability: The training process is generally more stable and less prone to issues like mode collapse.

    2. Image Generation Quality: Diffusion models can generate high-quality images or videos, especially when fully trained, with results that are often highly realistic.

    3. No Need for Specific Architecture: Diffusion models do not rely on specific network structures and have good compatibility, allowing various types of neural networks to be used. However, diffusion models also have two major drawbacks, including:

    First, high training costs: Compared to some other generative models, training diffusion models can be more expensive because they need to learn denoising at different noise levels, requiring longer training times.

    Second, generation takes more time. This is because generating images or videos requires gradual denoising rather than producing the entire sample at once. The primary reason we currently can't generate long videos is that our GPU memory is limited. Generating a single image may occupy a portion of the GPU memory, and generating 16 images might nearly fill up the entire memory. When you need to generate more images, you have to figure out how to consider the previously generated information while predicting what to generate next. This imposes higher requirements on the model architecture, and computational power is also an issue. Perhaps in many years, our GPU memory will become significantly larger, and this problem might disappear. But for now, we need better algorithms, though better hardware could potentially solve this problem too.

    Therefore, it's clear that current video diffusion models may not be the best algorithms, even though companies like Runway and Pika Labs are continuously optimizing them.

    Next, let's discuss another approach: Transformer-based large language models for video generation. In late December 2023, Google released VideoPoet, a generative AI video model based on large language models (LLMs). At the time, it was seen as an alternative solution in the field of video generation, distinct from diffusion models. So, how does it work?

    How do large language models generate videos?

    Large language models generate videos by understanding the temporal and spatial relationships within video content. Google's VideoPoet is an example of using LLMs for video generation. At this point, let's once again invite Dr. Zhang Songyang, a generative AI scientist, to provide a vivid explanation.

    Dr. Zhang Songyang: Then there's the large language model approach, which is fundamentally different in principle. Initially, it was used for text, specifically for predicting the next word. For example, given "I love telling the truth," and then the last part "I love telling the," what's the next word? The more context you provide, the easier it is to guess the next word. But with fewer words, there's more room for creativity. That's how it works.

    This approach was then extended to video. We can learn a vocabulary for images or videos by dividing an image into small sections—say, cutting it horizontally and vertically into 16 pieces each—and treating each small square or grid as a word. These are then fed into the large language model for training. For instance, if you already have a well-trained large language model, you can learn how the model's words interact with text or video words, and what kind of relationships exist between them. By learning these interactions, we can leverage large language models to perform video-related tasks or text-based tasks.

    Simply put, here's how VideoPoet, based on large language models, operates (a rough code sketch follows these steps):

    1) Input and Understanding: First, VideoPoet receives text, audio, images, depth maps, optical flow maps, or videos to be edited as input.

    2) Video and Audio Encoding: Since text is naturally discrete, large language models inherently require inputs and outputs to be discrete features. However, video and audio are continuous quantities. To enable large language models to process images, videos, or audio as inputs and outputs, VideoPoet encodes videos and audio into discrete tokens. In deep learning, a token is a symbol or identifier used to represent a specific element within a dataset; in the context of VideoPoet, tokens can be understood simply as "words" for videos and audio.

    3) Model Training and Content Generation: With these token vocabularies, a Transformer can be trained to predict video tokens one by one based on user input, similar to learning text tokens. The model then begins generating content. For video generation, this means the model needs to create coherent frame sequences that are not only visually logical but also temporally continuous.

    4) Optimization and Fine-tuning: The generated videos may require further optimization and fine-tuning to ensure quality and coherence. This could involve adjusting colors, lighting, and transitions between frames. VideoPoet leverages deep learning techniques to optimize the generated videos, ensuring they align with the text descriptions while remaining visually appealing.

    5) Output: Finally, the generated video is output for end users to watch.
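    Putting the five steps above together, here is a rough Python sketch of the token-based pipeline. The grid size, vocabulary size, and all stand-in components (text encoder, next-token predictor, decoder) are assumptions chosen for illustration; they are not VideoPoet's actual tokenizer or Transformer.

    ```python
    import numpy as np

    VOCAB_SIZE = 8192            # size of the discrete "visual vocabulary" (step 2, assumed)
    TOKENS_PER_FRAME = 16 * 16   # each frame represented by a 16x16 grid of visual tokens

    def encode_text(prompt: str) -> list[int]:
        """Step 1: accept the user's text input and map it to discrete token ids (stand-in)."""
        return [hash(word) % VOCAB_SIZE for word in prompt.split()]

    def next_visual_token(context: list[int]) -> int:
        """Step 3: stand-in for the Transformer predicting the next visual token from context."""
        rng = np.random.default_rng(len(context))
        return int(rng.integers(0, VOCAB_SIZE))

    def decode_tokens(tokens: list[int], frames: int) -> np.ndarray:
        """Steps 4-5: stand-in decoder mapping visual tokens back to pixels for output."""
        return np.zeros((frames, 128, 128, 3))

    prompt_tokens = encode_text("a dog chasing a ball on the beach")
    video_tokens: list[int] = []
    num_frames = 8
    for _ in range(num_frames * TOKENS_PER_FRAME):           # predict video tokens one by one
        video_tokens.append(next_visual_token(prompt_tokens + video_tokens))
    video = decode_tokens(video_tokens, num_frames)           # output: an (8, 128, 128, 3) frame stack
    ```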

    However, the approach of using large language models for video generation has both advantages and disadvantages.

    Let's first discuss the advantages: 1) High Comprehension Capability: Transformer-based large language models can process and understand vast amounts of data, including complex text and image information. This enables the models to possess cross-modal understanding and generation capabilities, effectively learning the associations between different modalities such as text, images, and videos. This allows them to generate more accurate and relevant outputs when converting text descriptions into video content.

    2) Long Sequence Data Processing: Due to the self-attention mechanism, Transformer models are particularly adept at handling long sequence data, which is especially important for video generation since videos are inherently long sequences of visual representations. 3) Scalability of Transformer: Generally speaking, the larger the model, the stronger its fitting capability. However, when a model reaches a certain size, the performance gains from increasing the size of convolutional neural networks (CNNs) slow down or even stop, whereas Transformers continue to improve. This has been demonstrated by large language models based on Transformers, and now they are gradually making their mark in the field of image and video generation.

    Now, let's talk about the disadvantages: 1) Resource Intensive: Using large language models to generate videos, especially high-quality ones, requires substantial computational resources. The approach of encoding videos into tokens often results in a much larger vocabulary than a single sentence or paragraph. Additionally, predicting tokens one by one significantly increases time consumption. This makes both the training and inference processes of Transformer models expensive and time-consuming.

    There's a fundamental issue I find quite significant: Transformers aren't fast enough. This is an essential problem because Transformers predict small blocks one by one, whereas diffusion models generate an entire image at once. Therefore, Transformers are inevitably slower.

    Chen Qian, Host of Silicon Valley 101 Video: Is there any concrete data on how much slower it is? For example, how much slower exactly?

    For instance, if I generate one image directly, a diffusion model might in principle take as little as one step, though in practice it needs some iteration. If I use 4 denoising steps, generating the image costs 4 steps, so call it 4; currently, with a good implementation, 4 steps can produce decent results. But if you use a Transformer on, say, a 16×16 grid, that's 16×16 = 256 steps. That's the speed difference.

    Here, 4 means I performed denoising iterations four times. For transformers, it's equivalent to predicting an image by predicting 256 tokens for a 16×16 grid. Their units are certainly different, but you can see the difference in complexity. Diffusion models have constant-order complexity, while transformer complexity scales with width × height. So from a complexity perspective, diffusion models are clearly superior. I think this issue would become more pronounced with higher-resolution images – transformers would face greater challenges.
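    A quick back-of-the-envelope script makes this complexity gap tangible. The 4 denoising steps and the 16×16 token grid come from the discussion above; extending the grid to larger sizes is our own extrapolation for illustration.

    ```python
    # Compare sampling cost: diffusion uses a roughly constant number of denoising
    # iterations, while an autoregressive Transformer predicts one token per grid cell.
    DIFFUSION_STEPS = 4                      # a few denoising iterations, independent of grid size

    for grid in (16, 32, 64):
        autoregressive_steps = grid * grid   # one Transformer prediction per visual token
        print(f"{grid}x{grid} grid: diffusion ~{DIFFUSION_STEPS} steps, "
              f"autoregressive ~{autoregressive_steps} steps "
              f"(~{autoregressive_steps // DIFFUSION_STEPS}x more)")
    ```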

    Some other issues with transformer models include: 2) Quality Fluctuations: Although Transformer models can generate creative video content, the output quality may be unstable, especially for complex or insufficiently trained models.

    3) Data Dependency: The performance of Transformer models largely depends on the quality and diversity of training data. If the training data is limited or biased, the generated videos may not accurately reflect the input intent or may lack diversity. 4) Understanding and Logic Limitations: Although Transformer models have made progress in understanding text and image content, they may still struggle to fully grasp complex human emotions, humor, or subtle socio-cultural signals, which could affect the relevance and appeal of generated videos.

    5) Ethical and Bias Issues: Automated video generation technology may inadvertently replicate or amplify biases present in the training data, leading to ethical concerns. However, when it comes to the fifth point, I suddenly remembered recent news about Google's multimodal model Gemini. No matter what person you input, the output always shows a person of color - including America's Founding Fathers, a black female version of the Pope, Vikings depicted as people of color, and even Elon Musk generated as a black man.

    The reason behind this might be Google's attempt to correct biases in Transformer architecture by adding AI ethics and safety adjustment instructions. However, they overcorrected, resulting in this major blunder. This incident occurred after OpenAI released Sora, which indeed led to Google being widely mocked again.

    However, industry insiders also point out that these five issues are not unique to the Transformer architecture. Currently, all generative models may have these problems, with different models having slightly different strengths and weaknesses in various aspects. So, to summarize here, both diffusion models and Transformer models have their shortcomings when it comes to generating videos. So, how does OpenAI, as one of the most technologically advanced companies, approach this? Well, you might have guessed it—both models have their strengths, and what if combining them could achieve 1+1>2? Thus, Sora, the combination of diffusion models and Transformer models, was born.

    To be honest, the details of Sora remain largely unknown to the public. It hasn’t been released to the general audience yet, not even through a waiting list. Only a select few in the industry and design fields have been invited to use it, and the videos produced have been shared online. As for the technology, most insights are based on speculation and analysis of the demo videos OpenAI has released. On the day of Sora’s announcement, OpenAI provided a somewhat vague technical explanation, but many key details were missing.

    But let’s start with the publicly available technical analysis to understand how OpenAI’s diffusion + large language model approach works. Sora makes it clear from the outset: OpenAI jointly trains text-conditional diffusion models on videos and images with variable durations, resolutions, and aspect ratios. Additionally, it utilizes a Transformer architecture that operates on spacetime patches of latent codes from videos and images.

    Therefore, the generation process of the Sora model includes:

    Step 1: Video Compression Network

    In discussing video generation based on large language models, we mentioned encoding videos into discrete tokens. Sora adopts the same idea. A video is a three-dimensional input (two spatial dimensions plus one temporal dimension), and it is divided into small tokens in this three-dimensional space, which OpenAI calls "spacetime patches."
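    As a rough illustration of what spacetime patches could look like in practice, here is a minimal Python sketch that cuts a video tensor into flattened patches spanning a few frames and a small spatial window. The patch sizes and flattening order are assumptions; OpenAI has not disclosed Sora's actual patch dimensions or its compression network.

    ```python
    import numpy as np

    def spacetime_patches(video: np.ndarray, pt: int = 4, ph: int = 16, pw: int = 16) -> np.ndarray:
        """Split a (T, H, W, C) video into flattened patches of shape (pt, ph, pw, C)."""
        T, H, W, C = video.shape
        video = video[: T - T % pt, : H - H % ph, : W - W % pw]      # crop to a multiple of the patch size
        t, h, w = video.shape[0] // pt, video.shape[1] // ph, video.shape[2] // pw
        patches = video.reshape(t, pt, h, ph, w, pw, C)              # carve the time, height, and width axes
        patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)             # group patch indices before patch contents
        return patches.reshape(t * h * w, pt * ph * pw * C)          # one flattened token per spacetime patch

    clip = np.random.rand(16, 256, 256, 3)          # 16 frames of 256x256 RGB video (toy example)
    tokens = spacetime_patches(clip)
    print(tokens.shape)                             # (4 * 16 * 16, 4 * 16 * 16 * 3) = (1024, 3072)
    ```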

    Step 2: Text Understanding

    With the support of OpenAI's text-to-image model DALL·E 3, Sora can automatically annotate many videos that lack text labels and use them for video generation training. Additionally, with GPT's assistance, user inputs can be expanded into more detailed descriptions, making the generated videos more aligned with user inputs. The Transformer framework helps the Sora model learn and extract features more effectively, capturing and understanding a vast amount of detailed information, thereby enhancing the model's generalization ability for unseen data.

    For example, if you input "a cartoon kangaroo dancing disco," GPT can help by suggesting additional details: the kangaroo is in a disco hall, wearing sunglasses and a colorful shirt, with flashing lights and a crowd of various animals dancing in the background. The richness of GPT's expansions and details largely determines how well Sora generates the video. And since the GPT model is OpenAI's own, unlike other AI video startups that must call GPT externally, OpenAI's access to the GPT architecture for Sora is undoubtedly the most efficient and comprehensive. This might be why Sora performs better in semantic understanding.

    Step 3: Diffusion Transformer Imaging

    Sora adopts a combination of Diffusion and Transformer methods. Previously, in our discussion on large language model-based video generation technologies, we mentioned that Transformer exhibits excellent scalability. This means that the performance of Transformer improves as the model size increases, a characteristic not shared by all models. For instance, when a model reaches a certain size, the performance gains from increasing the size of convolutional neural networks (CNNs) slow down or even plateau, whereas Transformer continues to show improvements.

    Many have noticed that Sora demonstrates remarkable stability and consistency in maintaining object coherence, scene rotation, and other aspects, far surpassing video models like Runway, Pika, and Stable Video, which are based on Diffusion models.

    Recall that when discussing Diffusion models, we also pointed out that the challenge in video generation lies in achieving stability and consistency of generated objects. This is because, although Diffusion is the mainstream technology for video generation, previous work has been limited to structures based on CNNs, failing to fully exploit the potential of Diffusion. Sora ingeniously combines the strengths of both Diffusion and Transformer, leading to significant advancements in video generation technology. On a deeper level, the continuity of videos generated by Sora may be achieved through the Transformer Self-Attention mechanism. Sora can discretize time and then understand the relationship between different timelines through the self-attention mechanism. The principle of self-attention is that each time point connects with all other time points, which is a capability that Diffusion Models lack.

    Currently, there are some external speculations suggesting that in the third step of the diffusion model we previously mentioned, Sora might have replaced the U-Net architecture with a Transformer architecture. This modification allows the Diffusion Model, acting as a painter during the reverse diffusion and drawing process, to find more appropriate elements from OpenAI's vast database based on the probability distribution of keyword feature values when eliminating noise, thereby making more precise strokes. During an interview with another AI practitioner, he used a vivid example to explain the difference here. He said: "Diffusion models predict noise. By subtracting the predicted noise from an image at a certain point in time, you get the original noise-free image, which is the final generated image. This is more like sculpting. As Michelangelo said, he simply removed the parts of the stone that shouldn't be there, following God's will, and thus created great sculptures. On the other hand, transformers, through their self-attention mechanism, understand the relationships between timelines, allowing the sculpture to step down from its pedestal." Isn't that quite vivid?
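    To illustrate this speculated combination, here is a toy Python sketch of a "diffusion transformer" style noise predictor: it operates on spacetime patch tokens and lets every patch attend to every other across space and time, standing in for the U-Net of a classic diffusion model. The single attention layer, the dimensions, and the random weights are purely illustrative assumptions, not Sora's architecture.

    ```python
    import numpy as np

    D = 64                                          # token embedding dimension (assumed)
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))

    def self_attention(tokens: np.ndarray) -> np.ndarray:
        """Each spacetime patch token attends to all others, across space and time."""
        Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
        scores = Q @ K.T / np.sqrt(D)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    def predict_noise(noisy_patch_tokens: np.ndarray, t: int) -> np.ndarray:
        """Transformer-based denoiser standing in for the U-Net of a classic diffusion model."""
        time_signal = np.sin(t / 1000.0)                        # crude stand-in for timestep conditioning
        return self_attention(noisy_patch_tokens + time_signal)

    # 1024 noisy spacetime patch tokens, already projected to embedding dimension D
    noisy_tokens = rng.standard_normal((1024, D))
    eps_hat = predict_noise(noisy_tokens, t=500)                # one reverse-diffusion step's noise estimate
    print(eps_hat.shape)                                        # (1024, 64)
    ```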

    Finally, Sora's Transformer+Diffusion Model generates images from spatiotemporal patches, which are then stitched together into video sequences, resulting in a Sora video. To be honest, the methodology of combining Transformer with diffusion models is not OpenAI's original creation. Before OpenAI released Sora, during an interview with Dr. Zhang Songyang in January this year, he already mentioned that the approach of integrating Transformer with diffusion models had begun to be widely researched in the industry.

    Currently, we can see some attempts to combine Transformer models with diffusion models, and the results may not be inferior—some papers even suggest they might be better. So, I’m not sure how models will evolve in the future, but I think it might involve a combination of these two approaches. For example, Transformer has a natural advantage in predicting the next video frame, while diffusion models, though high in quality, often generate a fixed number of frames. How to effectively combine these two approaches is a topic for future research.

    This also explains why OpenAI is now releasing Sora. On OpenAI’s forum, the Sora team clarified that Sora is not yet a mature product—it is not publicly available, there is no waitlist, and no expected release date. External analysis suggests that Sora is not yet mature, and OpenAI's computing power may not be able to support its public release. Additionally, there are concerns about fake news security and ethical issues post-release, so Sora may not be officially launched soon. However, since the combination of transformer and diffusion has become a widely explored direction in the industry, OpenAI needs to showcase Sora's capabilities to reaffirm its leading position in the increasingly competitive generative AI video field.

    With OpenAI's validation, we can be fairly certain that the direction of AI video generation will shift toward this new technical approach. OpenAI has also made clear in its technical report that the "brute force" approach of massive parameters used in ChatGPT is proving effective in AI video generation as well: "We've discovered that video models exhibit many interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate certain aspects of people, animals, and environments in the real world."

    This indicates that Sora, like GPT-3, is showing signs of 'emergence,' which means that similar to large language models, AI video generation will require more parameters, greater GPU computing power, and increased financial investment.

    Scaling remains the key advantage of current generative AI, and this likely means that generative AI video may ultimately become a game dominated by large corporations. A more intuitive way to understand this is to compare model sizes: a video model might occupy tens of gigabytes, while large language models can be a thousand times larger, reaching the terabyte scale. This illustrates the general trend, even though current video models are still at the billion-parameter level.

    Looking at image models like Stable Diffusion, they released Stable Diffusion XL by scaling up the model size, which brought improved results—not necessarily better quality, but more realistic images with more pronounced effects. This scaling trend is inevitable, but the actual performance gains depend on the current model architecture and the nature and volume of training data.

    This represents our very preliminary analysis of Sora. We must reiterate that since many technical details about Sora remain undisclosed, our observations are speculative from an external perspective. We welcome corrections, critiques, and discussions regarding any inaccuracies.
