What Impact Will OpenAI's Amazing Text-to-Video Sora Bring?
Coincidentally, two AI giants released blockbuster new models on the same day, and OpenAI's text-to-video model Sora once again stole the spotlight. Sora's stunning debut not only outclassed its many AI video competitors but may also rewrite the rules of the game for the film, television, advertising, and gaming industries.
On Thursday, Google unexpectedly released its next-generation multimodal large model, Gemini 1.5 Pro, pressing its advantage in the race against OpenAI. Google billed it as the most capable large language model in the industry: it supports a context window of up to 1 million tokens (with research versions tested at up to 10 million), far surpassing OpenAI's GPT-4 Turbo on this dimension.
What does a million-token context mean? Jeff Dean, Google's chief scientist, explained that with Gemini 1.5 Pro's million-token context window, users can carry out complex content interactions: easily parsing entire books, movies, and podcasts, understanding very long documents, and even ingesting codebases with hundreds of files and hundreds of thousands of lines. The release of Gemini 1.5 Pro signals that Google has gained a strong performance edge in its arms race with OpenAI. By comparison, OpenAI's GPT-4 Turbo handles only 128k tokens and recently suffered performance degradation, which wasn't addressed until last month's update.
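To put those context-window figures in perspective, here is a back-of-envelope sketch. It uses the common rule of thumb that one token corresponds to roughly 0.75 English words (an approximation, not an exact figure; real tokenizers vary by text), and an assumed average novel length of about 90,000 words:

```python
# Rough capacity of an LLM context window, using the common rule of thumb
# that one token ~ 0.75 English words (an approximation - actual token
# counts depend on the tokenizer and the text).

def words_that_fit(context_tokens: int, words_per_token: float = 0.75) -> int:
    """Approximate number of English words a context window can hold."""
    return int(context_tokens * words_per_token)

def novels_that_fit(context_tokens: int, words_per_novel: int = 90_000) -> float:
    """Approximate number of average-length (~90k-word) novels that fit."""
    return words_that_fit(context_tokens) / words_per_novel

if __name__ == "__main__":
    # Gemini 1.5 Pro's 1M-token window vs. GPT-4 Turbo's 128k-token window
    print(words_that_fit(1_000_000))              # ~750,000 words
    print(words_that_fit(128_000))                # ~96,000 words
    print(round(novels_that_fit(1_000_000), 1))   # roughly 8 novels
```

By this estimate, a million-token window holds roughly eight average-length novels at once, versus about one for GPT-4 Turbo's 128k window.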
However, OpenAI didn't let Google steal all the spotlight. On the same day, it released Sora, an AI model that generates videos from text. Following its text model ChatGPT and image model DALL-E, OpenAI has now begun disrupting the video domain.
While Google's Gemini 1.5 Pro showcases hard technical advantages in raw performance, Sora's visually stunning output made a more immediate impression, quickly becoming a hot topic on social media. What makes Sora so astonishing? OpenAI showcased multiple video clips created by Sora, and these snippets alone are enough to leave people in awe. In its official blog, OpenAI wrote that Sora not only understands what users ask for but also how those things exist in the physical world.
With just a text input, Sora can automatically generate high-definition videos up to one minute long. Incredibly, Sora can accurately grasp the complex meanings in user text, break them down into different elements, and transform them into video content with specific creative concepts—looking as if they were produced by professional directors, cinematographers, and editors.
In one demo, a fashionable woman in sunglasses and a leather jacket walks through the streets of Tokyo on a rainy night, her brightly painted lips curling slightly upward; even behind the sunglasses her smile is visible, and the puddles on the ground reflect her figure and the neon lights of the vibrant city. In another, a dragon dance winds through a bustling Chinatown, the lively crowd's attention fixed on the leaping, colorful dragon, the festive atmosphere making viewers feel as if they were right there.

Unlike earlier AI-generated videos with their plasticky aesthetics, Sora's creations stand apart in realism and artistic quality: naturally curling human hair, visible facial blemishes like moles and acne, neon reflections in puddles, diverse street-vendor food displays, and delicately rendered cherry-blossom snowflakes - these details achieve near-perfect verisimilitude.
More astonishingly, Sora videos exhibit distinct cinematic characteristics in composition, color grading, creativity, and camera work. They seamlessly transition between single-take shots and multi-angle sequences, even capturing nuanced "actor" expressions - capabilities absent in previous text-to-video products. With this debut, OpenAI has elevated the entire AI video industry by a significant margin.
While Sora's videos aren't flawless (observant viewers may spot continuity errors, such as a cookie that shows no bite mark after being bitten), they represent a qualitative leap in visual fidelity over previous AI videos, achieving genuine film-like texture. Most impressively, the system can produce multi-shot cinematic sequences from abstract text descriptions, demonstrating semantic comprehension and cinematographic skill approaching that of a human director. Clearly, the ChatGPT moment for video generation has arrived.

After Sora's release, the internet was in awe, almost entirely overshadowing Gemini. The speed of AI's evolution is truly astonishing: it has been only 14 months since OpenAI introduced ChatGPT and ushered in the generative AI era. Until last year, we were just getting familiar with text-to-image products, and just six months ago, Midjourney's AI-generated images still featured characters with six fingers. Now, Sora's videos have made everyone feel the boundary between reality and virtuality blurring.
Although OpenAI's GPT-4 Turbo recently experienced performance declines and slower speeds, raising concerns about a bottleneck in generative AI's growth, the release of Sora has undoubtedly dispelled such worries. Aaron Levie, founder and CEO of cloud computing company Box, remarked after Sora's release, "If anyone was still worried about AI's evolution slowing down, we've once again seen the exact opposite."
Currently, Sora is available only to invited creators and safety experts testing it to identify and resolve potential risks; no public release schedule has been announced. After all, on an internet already flooded with misinformation, the ethical concerns surrounding deepfakes have become a focal point, and if hyper-realistic videos like Sora's were misused, the consequences could be catastrophic.

Around the same time as Sora's release, OpenAI also completed a secondary share sale. This was not a fundraising round for the company; rather, it allowed employees to sell existing shares for cash to venture capital firms led by Thrive Capital. Notably, as a member of OpenAI's board, Sam Altman himself does not hold company shares, so the soaring valuation does not translate into immense personal wealth for him.
This transaction values OpenAI at roughly $80 billion, nearly triple its $30 billion valuation from early last year. According to CB Insights, a market research firm that tracks venture investment, OpenAI has become one of the world's most valuable startups, behind only ByteDance and SpaceX.
In fact, this transaction was supposed to be completed last November but was delayed due to the conflict between Altman and the board. With Altman returning as OpenAI's CEO, investors have once again cast a vote of confidence in this AI giant. Clearly, after the official release of Sora, OpenAI's valuation is set to rise even further. So, what impacts will the stunning text-to-video Sora bring?
AI video peers are undoubtedly the most directly affected. After Sora's release, Cristóbal Valenzuela, CEO of AI video startup Runway, posted just two words on X (formerly Twitter): "Game On." Only months earlier, Runway had released its Gen-2 video model. Meanwhile, Emad Mostaque, CEO of rival AI video company Stability AI, exclaimed, "Sam Altman is truly a magician."
Runway has been around for five years and enjoys a first-mover advantage in AI video; its tools are already used by mainstream Hollywood studios. Everything Everywhere All at Once, which took home seven Oscars last year, used Runway's AI tools in its visual-effects work. Following that success, Runway's valuation soared to $1.5 billion in its latest funding round, three times its valuation a year earlier.

Text-to-video is currently the hottest startup sector. Over the past few months, amid the generative AI boom, numerous startups specializing in text-to-video and image-to-video have emerged. Justin Moore, an AI investment partner at a16z, tracks more than 20 text-to-video startup teams, including notable newcomers like Pika and Zeroscope that have each had viral moments.
At the end of last year, Pika, founded by Stanford graduates of Chinese descent, stunned both the Chinese and American internet communities. Thanks to its impressive AI-generated videos, this four-person startup completed three funding rounds totaling over $55 million in less than six months, skyrocketing its valuation to $250 million.
Now, however, AI giant OpenAI has unveiled Sora. In video duration, visual fidelity, detail, and multi-shot sequencing, Sora far surpasses the output of these smaller startups - calling it dominant would hardly be an overstatement. While the AI video field still has vast room to improve and grow, whether these smaller companies can compete with OpenAI is a serious question. And Sora's impact is not limited to squeezing the survival space of other AI video startups; it may also change the rules of the game for Hollywood and the wider film, television, advertising, and gaming industries.
The use of AI in Hollywood for creating images and videos is nothing new. From CG (computer graphics), VR to AI, the film and entertainment industry has always been the first to adopt cutting-edge technologies. However, unlike other technologies, AI tools have always been a thorn in the side of Hollywood professionals.
Apart from Everything Everywhere All at Once using Runway's AI video tools, 20th Century Fox collaborated with IBM Watson back in 2016 to create a trailer for the AI-themed horror film Morgan. More recently, Disney's Marvel used AI to produce the opening title sequence of Secret Invasion. At the time, Hollywood was in the midst of a major strike by the actors' and writers' unions, and the application of generative AI in the film industry was one of the key points of contention. When actors and writers learned during negotiations that Secret Invasion's opening had been created with AI, the news stalled the talks once again.
Why has the use of AI tools in the film industry sparked so much controversy? Industry insiders are primarily concerned that production companies might use existing materials to train AI models and frequently employ AI tools to generate content in the future. This not only infringes on the copyright of creators' existing works without providing them adequate compensation but also threatens their future job opportunities and creative space.
Last year, writers and actors were willing to shut down the industry and risk unemployment to win temporary concessions from production companies, leading to new rules on AI tool usage. But the next labor negotiations, three years from now, may be even harder for them, as AI capabilities are expected to advance significantly in the meantime. With the stunning debut of the text-to-video model Sora, Hollywood professionals face a monumental question: given AI's exponential rate of advancement, it may not be long before AI can generate complete short films or even full-length movies, handling everything from scripting to filming, acting, and post-production. What, then, will Hollywood's future look like?
Dave Clark, the Hollywood director behind the horror film When She Wakes, is already using AI tools to make movies. In his view, AI technologies like Sora are not a threat but something creators should embrace to produce content that was previously unimaginable or impossible. "This is a game-changing technology. Instead of worrying about your job, you should be concerned about who is using these tools," he says.
A survey conducted last month by industry research firm CVL Economics, which polled 300 Hollywood leaders, reveals widespread anxiety across the industry. Thirty-six percent of respondents said generative AI has already reduced demand for routine skills at their companies, while 72% of the surveyed firms are early adopters of generative AI tools. The harsher reality: 75% of respondents admitted that generative AI tools, software, and models have already prompted their business units to cut and consolidate jobs. Industry leaders predict that more than 200,000 Hollywood jobs will be affected by AI over the next three years, especially post-production roles such as visual-effects artists, sound engineers, and illustrators.
Jason Hellerman, screenwriter of the film Shovel Buddies, believes that as AI tools gradually improve, future producers may indeed generate videos with tools like Sora instead of paying a production team, and AI-generated content may even spawn an entirely new genre. But if anyone can use AI to make videos and movies and become a "content creator," professional standards will inevitably decline.
He predicts that in the future, everyone will be able to generate their own videos, much like how everyone now shoots and watches short TikTok videos on their phones. The Gen Z youth, accustomed to short videos, will gradually abandon long-form content like movies and TV shows. Perhaps in the future of AI-generated videos, movies and TV will also transform into formats similar to TikTok short videos.