AI Continues to Dominate in the Year of the Dragon: Not Just Sora, Three Major AI Models Debut on the Same Day
During the Spring Festival holiday, the field of artificial intelligence has made rapid progress. To summarize, here's what you need to know:
OpenAI released their video generation model Sora. It's outstanding.
Google released Gemini 1.5 Pro, with performance close to 1.0 Ultra and a nearly unlimited context length (up to 10 million tokens).
A model named Mistral-Next surfaced on the Chatbot Arena platform, hinting at an imminent release. Initial tests suggest it is, at minimum, a solid model.
The main content below is translated from the Interconnects article "OpenAI’s Sora for video, Gemini 1.5's infinite context, and a secret Mistral model", originally written by Nathan Lambert.

We knew it was coming, but we are still amazed by how good it is. You need to see some AI-generated videos. OpenAI released Sora, and Sam Altman spent the whole day sharing its magical generated videos on Twitter. Later that day, OpenAI published a slightly more technical blog post confirming most of the rumors that had been circulating.
In short, Sora is a combination of Vision Transformers (ViT) and diffusion models. The core idea behind Vision Transformers and Sora's data processing seems to be compressing video into a latent space, then splitting that latent into "patches" that are treated as tokens.
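To make the patches-as-tokens idea concrete, here is a minimal sketch of turning a latent video tensor into a sequence of flattened spacetime patch tokens. The shapes, patch sizes, and the use of a raw array in place of a learned video encoder are all illustrative assumptions, not details from OpenAI's report.

```python
import numpy as np

def patchify_video(latent, pt=2, ph=4, pw=4):
    """Split a latent video tensor into flattened spacetime patch tokens.

    latent: array of shape (T, H, W, C), a stand-in for the compressed
    latent a video encoder would produce; each dim must divide evenly
    by the corresponding patch size (pt, ph, pw).
    """
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # carve the tensor into a grid of (pt x ph x pw) blocks
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # bring the grid axes to the front, the within-patch axes to the back
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # flatten: one row per patch, one vector per token
    tokens = x.reshape(-1, pt * ph * pw * C)
    return tokens  # (num_patches, token_dim)

latent = np.random.randn(8, 32, 32, 4)  # toy latent: 8 frames of 32x32x4
tokens = patchify_video(latent)
print(tokens.shape)  # (256, 128): 4*8*8 patches, each 2*4*4*4 values
```

Because the patch grid adapts to whatever (T, H, W) comes in, the same tokenization handles variable durations, resolutions, and aspect ratios, which is one plausible reason the multi-resolution training described below is workable.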
Quoted from OpenAI's blog: Sora is a diffusion model; given input noisy patches (along with conditioning information like text prompts), it is trained to predict the original "clean" patches. Importantly, Sora is a diffusion transformer. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, computer vision, and image generation.
In this work, we find that diffusion transformers can also effectively scale as video models.
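The quoted objective, predicting clean patches from noisy ones, can be sketched in a few lines. Everything here is a toy stand-in: the linear noise schedule, the identity-function "denoiser" (where a real system would run a diffusion transformer over the patch tokens), and the omission of text conditioning are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_tokens(clean, t, num_steps=1000):
    """Forward diffusion: blend clean patch tokens with Gaussian noise.

    Uses a naive linear schedule; production systems use tuned schedules.
    """
    alpha = 1.0 - t / num_steps  # remaining signal fraction at step t
    eps = rng.standard_normal(clean.shape)
    noisy = np.sqrt(alpha) * clean + np.sqrt(1.0 - alpha) * eps
    return noisy, eps

def training_loss(denoiser, clean, t):
    """One training step: the model sees noisy tokens (conditioning
    omitted here) and is scored on recovering the clean ones."""
    noisy, _ = noise_tokens(clean, t)
    pred = denoiser(noisy)
    return np.mean((pred - clean) ** 2)  # regress the "clean" patches

# identity map as a placeholder for a diffusion transformer
clean_tokens = rng.standard_normal((256, 128))
loss = training_loss(lambda x: x, clean_tokens, t=500)
```

The point of the sketch is the data flow: because the denoiser only ever consumes a flat sequence of patch tokens, swapping in a transformer backbone is what makes the "diffusion transformer" framing, and its scaling behavior, natural.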
Lambert points out that while the blog post mentions many interesting aspects, none address truly critical elements like model size, architecture, or data. The data likely consists largely of YouTube content and some procedurally generated videos (from game engines or other custom sources, to be detailed later). Key insights include:
They train on multiple resolutions (unlike most multimodal models, which are fixed at a single resolution like 256x256), including 1920x1080 in both landscape and portrait orientations. "We bring the re-captioning technique from DALL-E 3 to video." This includes two key points:
Language-model rewriting of prompts remains crucial for obtaining high-quality outputs. People typically would not do this unless it were necessary; I believe it will ultimately be obviated by better data control.
More importantly, this is linked to their "highly descriptive captioning model" (which converts videos into text), which is essential for labeling data. It confirms either that vanilla GPT-4 can do this, or that OpenAI has more cutting-edge models hidden behind the scenes. Sora can animate, edit, and perform similar operations from image inputs.
Sora can edit videos through video inputs.
An anonymous ML account on Twitter surfaced a paper with a similar architecture. [Architecture diagram omitted.]

Sora's most impressive feature is how realistically it simulates the physical world (OpenAI describes this as an "emerging simulation capability"). No previous text-to-video model has come close. Just a few weeks ago, Google's Lumiere made a strong impression, yet it pales in comparison to Sora.
There are many rumors that Neural Radiance Fields (NeRFs), a popular technique for 3D reconstruction from images, might be used under the hood, based on properties of the videos (such as their physical consistency), but there is no clear evidence to confirm this. Lambert believes the data could include procedurally generated game-engine content. Simply using games is not enough; as with all synthetic data, you need a method for generating data diversity. The data we built for RL agents at Hugging Face is a good example. Data diversity might unlock another level of generation performance, something we often observe in large models.

All the commentary about the death of Pika and Runway ML (other popular ML video startups) is completely exaggerated. If progress is this fast, there are still many turns ahead. If the best models come and go quickly, then the most important thing is the user touchpoint. That hasn't been established in video yet, and MidJourney is still relying on Discord (though the user experience is quite good)!
Just hours before Sora's release, Google already shocked everyone with the next version of Gemini. The immediate changes this might bring to how people use LLMs could arguably be more impactful than Sora's videos, though Sora's visual demo quality is mesmerizing.
Gemini 1.5 Pro's performance approaches Gemini 1.0 Ultra, but with higher parameter efficiency and Mixture of Experts (MoE) as its foundational architecture. Gemini 1.5 Pro can now extend context length to 10 million tokens. For reference, when OpenAI increased GPT-4 to 128k, it was a big deal. Ten million is almost beyond comprehension; at that scale it is presumably not a pure transformer. The amount of information it can take in far exceeds what the average ChatGPT user might imagine.
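The MoE idea mentioned above can be illustrated with a toy top-k routing layer: a learned gate scores experts per token, and each token is processed by only its top few experts, so compute grows with k rather than with the total expert count. Google has not published Gemini 1.5's routing details; the shapes, the gate, and the per-token loop below are generic illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def moe_forward(x, experts, gate_w, top_k=2):
    """Toy top-k Mixture-of-Experts layer.

    x: (tokens, dim); experts: list of (dim, dim) weight matrices;
    gate_w: (dim, num_experts) router weights. Each token is sent
    only to its top_k experts.
    """
    logits = x @ gate_w                           # (tokens, num_experts)
    top = np.argsort(-logits, axis=1)[:, :top_k]  # chosen experts per token
    out = np.zeros_like(x)
    for i, tok in enumerate(x):
        sel = top[i]
        w = np.exp(logits[i, sel])
        w /= w.sum()                              # softmax over chosen experts
        out[i] = sum(wj * (tok @ experts[j]) for wj, j in zip(w, sel))
    return out

dim, n_exp = 16, 8
experts = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(n_exp)]
gate_w = rng.standard_normal((dim, n_exp))
y = moe_forward(rng.standard_normal((4, dim)), experts, gate_w)
```

This is why MoE buys parameter efficiency: total capacity is all n_exp experts, but each token pays for only top_k of them.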
Google may have found a new method that combines a long-context architectural idea with their TPU computing stack, to excellent effect. According to Pranav Shyam, one of the leads on Gemini's long-context work, the idea only took shape a few months ago. Given that it shipped as a minor version (v1.5) rather than v2.0, there is surely room for much more. As a thought experiment, discussions around Gemini 1.5 show that you can fit an entire production codebase within the model's context (see the examples Google provided). This could genuinely change the fate of libraries that are not yet popular enough to be scraped thousands of times into the next GPT's training data. As an enterprise tool, it is invaluable. Making ten million tokens concrete shows the value: imagine processing 3 hours of video, or 22 hours of audio, through a single model without segmentation or loss.
To clarify, paid Gemini users will soon have access to 1 million token context lengths (similar to ChatGPT Plus plans), while the technical report mentions 10 million token windows. Lambert suggests the current limitation is primarily due to cost considerations. The computational requirements for any model at this scale are substantial.
This context-length figure has been particularly perplexing. Longer context windows enable greater precision. Seeing this, it is hard to believe the model is a pure transformer; it must have some way to route information through non-attention components. Many people have mentioned Mamba, but it is more likely that Google implemented its own architecture with optimized TPU code, since Mamba ships with custom Nvidia kernels and integrations.
Lambert is very excited about this because, in the future, the models we interact with will allocate computation to sub-models specialized in different tasks. Lambert expects that if we see the Gemini 1.5 Pro architecture diagram, it will resemble a system more than a typical language model diagram. This is what the research and development phase looks like.
The famous prompt engineer Riley Goodside shared thoughts on this kind of shift: "There are many implications here. Why bother with [supervised fine-tuning] when you can do 100K-shot learning? If it can translate Kalamang with just a grammar and a dictionary, what more can the right words teach it?"
Fundamentally, this means we can now directly instruct models on how to behave in context; fine-tuning is no longer a necessity. Lambert believes the two will prove complementary, and that fine-tuning will be cheaper once inference reaches a certain scale, but this is still exciting.
For more information, refer to Google's Gemini 1.5 blog post or technical report.
Finally, the CEO of Perplexity mentioned in an interview that Google quadrupled the compensation of people he wanted to hire. That's insane; I don't know whether to read it as a bullish or bearish signal for Google. If that wasn't enough, there's another Mistral model quietly chatting away on the LMSYS Arena. I've heard rumors about another model coming soon, but this one seems much more tangible.
Basic testing indicates it's a powerful model. Of course, the Twitter mob will now rush to conduct more vibes-evals, but Mistral will tell us soon enough. I'm guessing this is their API-based GPT-4 competitor.