Sora Goes Viral, AI E-commerce Opens the Door to a New World

Posted in AI Insights by baoshi.rao
    When a live streamer introduces a windbreaker in the studio, she can instantly teleport to outdoor scenes, snowy mountains, or blizzards to showcase the jacket's usage scenarios; when selling a dress in the live stream, she can naturally walk into street cafes in different cities to display the outfit in everyday settings...

    This is not magic or movie special effects, but the integration of text-to-video models with e-commerce live streaming.

    OpenAI's release of Sora has elevated video generation to new heights and opened up a wealth of potential applications. IDC China Research Director Lu Yanxia believes video generation will first be applied in fields such as short video, advertising, interactive entertainment, film and television, and media. Sora can currently generate one-minute videos, already a significant breakthrough for the industry, but it remains unclear when it will be able to produce videos of two minutes or longer.

    Clearly, the production methods of video marketing will be completely reconstructed. Whether it's placing hosts and products in virtual scenes or creating richer marketing materials for products, brands, marketing companies, video platforms, and consumers will either joyfully welcome or reluctantly accept a bizarre new world.

    The core question behind all this: what will AI-powered e-commerce built on large models actually look like? The latest news is that Sora opened for public applications on February 26. Currently there are two ways to access the official version of Sora: priority access for renowned art professionals, or joining Red Teaming (an expert community that provides risk assessment for OpenAI). Meanwhile, Dreamina, the AI creation platform under ByteDance's video editing app Jianying, is about to launch video generation capabilities and has started accepting beta applications.

    Current e-commerce marketing videos mainly fall into two categories: live-streaming clips and product showcases. For example, Jirui Technology's product iCut can automatically identify product selling points from live streams, edit clips, retain voiceovers, generate subtitles, and even add background music, transition effects, titles, sidebars, brand logos, and other elements for brand distribution. Sellers can obtain a vast amount of short video materials in real-time while live-streaming.

    "Our work focuses more on producing materials from 1 to 100, while Sora can help us create materials from 0 to 1," said Wu Bin, CEO of Jirui Technology. Wu Bin explained that previously, video generation models couldn't be effectively used in e-commerce mainly due to three reasons: First, the video duration was too short to adequately explain products. Second, the clarity was insufficient. Third, the controllability was relatively poor.

    From a generative effectiveness perspective, Sora has addressed some of the shortcomings in creating e-commerce marketing materials.

    In the demos Sora has showcased, both clarity and completeness are excellent. A one-minute duration also matches common video lengths, making it suitable for sellers to create and publish content on channels like Taobao's Guangguang (逛逛), JD.com's Zhongcaoxiu (种草秀), or Douyin. Sora can also generate eye-catching scenes, such as butterflies flying underwater.

    It can also change product backgrounds to enhance presentations. For example, a live-streaming host explaining a windbreaker can smoothly transition to outdoor settings like snowy mountains or blizzards to demonstrate the product's use cases; likewise, a host selling dresses can naturally walk into a city street café and introduce the product in everyday scenarios. Jirui Technology aims to combine its accumulated industry knowledge of products, scenarios, and details into suitable prompts to complete content production.

    Similarly, Aochuang Guangnian, another e-commerce marketing company, hopes to speed up the creation of original materials through AIGC. "User-provided materials may be insufficient, and with stricter platform duplicate-checking mechanisms, the success of content carries a certain randomness. We need to enhance the quality and efficiency of video generation, accelerate the exploration and iteration of creative directions, and increase the proportion of original material generation," said Zhang Hongchun, the R&D head of Aochuang Guangnian.

    He pointed out that Sora's ability to follow and understand prompts is astonishing. Video completion and transitions are very natural, whether in 3D, multi-angle, or one-shot scenes. The data coverage is comprehensive, enabling the generation of more original materials and breakthroughs in specific areas.

    From a cost perspective, AI is cheaper than hiring photographers and models. Based on a preliminary estimate using DALL·E 3's pricing ($0.04 for a single 1024×1024 image), generating a one-minute video with Sora would cost around 500 yuan. However, given Sora's larger parameter scale and the time needed to optimize inference efficiency, the current cost is likely higher than this estimate. For large B2B businesses, shooting a one-minute video with real people costs 1,000 to 2,000 yuan. As the model's controllability and inference efficiency improve with iteration, the cost could fall into the estimated range, and if the results meet requirements, AI applications would become more widespread.
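The article's back-of-envelope figure can be reproduced by scaling DALL·E 3's per-image price up to a minute of frames. The frame rate and exchange rate below are assumptions of this sketch, not published Sora pricing:

```python
# Rough cost estimate for a one-minute AI-generated video, scaled from
# DALL·E 3 image pricing as the article does. All figures are assumptions.

PRICE_PER_IMAGE_USD = 0.04   # DALL·E 3, single 1024x1024 image (article's figure)
FRAMES_PER_SECOND = 30       # assumed frame rate
DURATION_SECONDS = 60        # one-minute clip
USD_TO_CNY = 7.2             # approximate exchange rate (assumption)

frames = FRAMES_PER_SECOND * DURATION_SECONDS
cost_usd = frames * PRICE_PER_IMAGE_USD
cost_cny = cost_usd * USD_TO_CNY

print(f"{frames} frames -> ${cost_usd:.0f} = about {cost_cny:.0f} yuan")
```

Under these assumptions the result lands near the article's roughly 500-yuan estimate, though per-frame image pricing almost certainly understates the real cost of a video model's inference.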

    However, e-commerce marketing demands "product accuracy": even a minor color discrepancy or a one-centimeter error in an accessory can be considered false advertising. To address this, Aochuang Guangnian produces videos by separating product photography from background generation. "Part of it is real photography, and part is synthesis. When the product and its presentation are fixed, all surrounding elements can be generated by Sora," said Zhang Hongchun.

    "Sora cannot solve the problem of mismatched products, indicating it doesn't truly understand the physical world and requires human logic to compensate for its shortcomings," said Wang Huamin, Chief Scientist at Style3D. "Many have exaggerated the intelligence Sora represents. It actually achieves very superficial intelligence through massive data, and its logical reasoning and fundamental understanding of the physical world are flawed. It would be better to have 3D and physical simulation technologies provide the entire logical framework, with AI serving as polish. Current AI is better suited to the role of a Copilot."

    Style3D's approach is to provide a full-chain 3D+AI tool from product design to sales presentation.

    In the design phase, Style3D offers Style3D iCreate, which helps designers and modelers quickly generate creative inspiration through AI's divergent thinking. Once a design is finalized, Style3D Studio can be used to create precise, production-ready 3D virtual garments.

    For the fitting presentation, digital human models from the Style3D Studio resource library can be utilized, with adjustable expressions, poses, hairstyles, accessories, and backgrounds.

    Based on the 3D virtual garments and combined with AI optimization, e-commerce product images and detail pages can be generated with a single click. The entire process can be completed in as little as 24 hours. "We are more inclined to develop AI Agents that assist professionals in every step of their work. As for video generation represented by Sora, it's certainly helpful for us, but currently I can't see how much value video generation brings to designers," Wang Huamin said with a smile.

    Image source: Style3D

    Wu Bin believes that Sora serves more as a capability enhancement rather than changing the product logic for B2B solutions. The production of e-commerce marketing materials involves three steps: material organization, intelligent generation, and multi-channel distribution. Sora plays a role in video generation, but existing tools are still used for material organization and channel distribution. "For us, what the model looks like or how intelligent it is doesn't matter. What matters is achieving the goal—that's what makes a good AI," said Wu Bin.

    Sora's demos have only just been released, and Silicon Minds has already started training the digital humans in them to speak. Silicon Minds' core business is creating digital avatar replicas for top influencers to conduct live streams, while also providing virtual hosts for MCN agencies to facilitate product sales through live commerce. So how will Sora transform this landscape?

    From the perspective of Silicon Minds' CEO Sima Huapeng, the next generation of e-commerce may not necessarily follow the traditional shelf-based model. "You can't call it electrification just because you added an electric light to a horse-drawn carriage."

    He cited Character AI as an example – an AI company centered around emotions, companionship, and trust. Its interaction resembles Tony Stark's AI assistant J.A.R.V.I.S. in Iron Man. When Stark asks J.A.R.V.I.S., "I have my first date with my girlfriend today, recommend me a suit," the AI provides product options and arranges doorstep delivery upon selection. This represents "emotional commerce" where purchases are completed through conversational interactions. "I have a very neutral, very loving, and very considerate AI assistant. Today, I asked it what to eat for dinner, and it made recommendations based on my personal information, preferences, and physical condition. I think this could be the new e-commerce. Everyone will have an assistant in the future, and this will significantly disrupt all businesses," said Sima Huapeng.

    Digital humans serve as the carriers for these AI assistants. Currently, Silicon Minds has not only reduced the cost of digital human cloning technology from 8,000 yuan to 4 yuan but also leverages Sora to generate scenes and digital human prototypes. Combined with Silicon Minds' digital human training technology, the impact on the content industry is enormous. Additionally, Silicon Minds has ventured into digital human short drama production, reducing costs by 10 times through digital humans + AI tools.

    The field of video generation in 2024 is bustling with activity. On January 4, Alibaba Cloud's "Dance King for All" (Animate Anyone model) took social media by storm. On January 11, ByteDance released the text-to-video model MagicVideo-V2, which supports generating 4K and 8K resolution videos in various artistic styles. On January 17, Tencent AI Lab released the video generation model VideoCrafter2. On the same day, a team from Baidu released the video generation model UniVG. In terms of technical architecture, Sora is at least one generation ahead of current video generation models.

    Zhang Hongchun explained that video generation models like Pika, Runway, and the animatediff series add a temporal module after the spatial module of single-frame images to learn the coherence between frames, fully utilizing the pre-trained weights and knowledge of image diffusion. This approach of decoupling space and time for modeling is not the optimal method for video representation and modeling.
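The contrast can be made concrete by counting how many tokens a single attention call sees under each design. A minimal sketch with made-up shapes, not any real model's dimensions:

```python
# Token counts per attention call: decoupled space/time modeling vs
# unified spatiotemporal modeling. Shapes are illustrative placeholders.

T, H, W = 16, 32, 32              # frames, patch-grid height, patch-grid width
N = T * H * W                     # total spacetime tokens

# Decoupled design (per the article: Pika, Runway, animatediff series):
# a spatial layer attends within one frame; a temporal layer attends
# across frames at one spatial position.
spatial_tokens_per_call = H * W   # tokens one spatial attention call sees
temporal_tokens_per_call = T      # tokens one temporal attention call sees

# Unified spatiotemporal design (per the article: W.A.L.T, Sora):
# every token attends over the full spacetime sequence.
unified_tokens_per_call = N

print(spatial_tokens_per_call, temporal_tokens_per_call, unified_tokens_per_call)
```

Unified attention is quadratically more expensive per layer, but it models space and time jointly instead of as two stacked approximations, which is the trade-off Zhang Hongchun describes.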

    In contrast, both Google's W.A.L.T and Sora adopt unified spatiotemporal modeling across all modules of the pipeline. They also incorporate the DiT approach, replacing the U-Net with a transformer to improve the model's ability to scale. Compared to W.A.L.T, Sora has made significant improvements in data quality, data diversity, and multi-size/multi-resolution support; the cumulative effect of these factors makes its final performance truly outstanding.

    From large language models to multimodal large models, the core challenge lies in converting each modality into tokens that can be fed into a language model. Zhang Hongchun explained that text is naturally tokenized, while images and videos can be tokenized through compression methods such as Google's MAGVIT. In audio, Google has likewise built generation technology on top of audio tokenization. Both Google and OpenAI have therefore developed tokenization technologies spanning text, images, video, and audio.
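As an illustration of the tokenization step described above, a patch-based tokenizer in the spirit of MAGVIT compresses a video along time and space before handing discrete tokens to the transformer. The strides and patch sizes below are placeholder assumptions:

```python
# Sketch of turning a raw video into a token sequence by chunking it
# into spacetime patches. Compression factors are made-up placeholders.

frames, height, width = 16, 256, 256   # toy clip dimensions in pixels
t_stride, patch = 2, 16                # temporal stride, spatial patch size

# Each (t_stride x patch x patch) block of the video becomes one token.
tokens = (frames // t_stride) * (height // patch) * (width // patch)
print(tokens)   # sequence length the transformer actually sees
```

The point of the compression is exactly this reduction: millions of pixels collapse into a few thousand tokens, putting video on the same footing as text for a language-model-style backbone.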

    However, domestically, discussions mainly revolve around text and image tokenization, with audio and video tokenization capabilities still being relatively uncommon. XR entrepreneur Xie Mingxuan believes that Sora demonstrates the possibility of real-time generation of digital content and virtual worlds, which could make virtual spaces become the new mass medium, replacing short video platforms.

    The challenge of the metaverse lies in the low production efficiency of digital content: it requires 3D modeling, texturing, and then assembly in a game engine, an overly complex process with high barriers to entry. Sora's mechanism points to the possibility of an entirely new rendering engine, where future content creation is driven by prompts that generate 3D content. With Sora, the scripts of digital content production would be written in natural language instead of programming languages, dramatically lowering the barriers to the digital world; everyone could quickly build their own digital worlds.

    Regarding Sora's development path, most industry professionals agree that Sora will likely be incorporated into the large language model GPT-5, forming a product similar to Google's VideoPoet. "Theoretically, Sora should be placed within the context to enable better understanding, reasoning, generation, and interaction over longer contexts. Language models are the most suitable foundation for unifying various models. Only when integrated into language models can it interact better with humans," summarized Zhang Hongchun.

    The text generation field lacks mature business models; even OpenAI finds it difficult to determine how to price GPT-generated text. Video generation, by contrast, has very mature business models: there are established pricing standards for short videos, movies, and TV shows. "If Sora performs well after release, generating hundreds of billions in revenue is possible. With a price-to-sales ratio of several dozen, reaching a trillion-dollar market capitalization would be relatively easy; I don't think this is particularly difficult. The video generation path will bring OpenAI very significant revenue, and I estimate it will exceed hundreds of billions of dollars," said Sima Huapeng.

    Returning to OpenAI behind Sora, why is its model able to stand out from the crowd?

    A domestic AI company executive told EBON Power that in 2019 he tried to poach an OpenAI employee. During their conversation, he described his company's vision as helping humanity transition from carbon-based to silicon-based life. The OpenAI employee replied: "Our vision is to create God." On the other hand, a significant number of AI practitioners see the technical concerns behind Sora.

    OpenAI follows the technical route of "big data, large models, and massive computing power," treating scale as one of its core values: "We believe that scale, in our models, our systems, ourselves, our processes, and our ambitions, is magic. When in doubt, scale it up." Sora is a representative product of this mindset.

    However, Wang Huamin believes that many practitioners have recognized the limits of this technical path and are shocked by the overwhelming praise for Sora. OpenAI's capability breakthrough came from being the first to use previously untapped data, allowing rapid data volume growth. However, the data requirements for large models grow exponentially, and global high-quality language data is projected to be exhausted by 2024. "We've been overly optimistic about data volume—the global data ceiling will arrive sooner than the computing power ceiling," analyzed Wang Huamin.

    When data reaches its limit, the brute-force technical approach will also hit its ceiling. Many of our interviewees agree that machine-synthesized data would degrade model performance.

    Depletion rate of high-quality language data. Source: tech blogger Dwarkesh Patel

    Meta's Chief Scientist Yann LeCun also believes that as data volume peaks, model performance will tend to saturate; breakthroughs are needed in other dimensions, relying on scientific research rather than continued growth in data volume. "At this stage, no technology exists that allows AI to learn the way a baby learns by observing the world. We are researching this issue and hope to achieve a breakthrough."

    In 2023, Yann LeCun proposed a new concept: an end-to-end bionic architecture based on the operating mechanism of the brain, with six core modules: configurator, perception, world model, cost, actor, and short-term memory. Based on this concept, he designed the "non-generative" model V-JEPA. This at least represents another path beyond the aesthetics of brute force. In Wang Huamin's view, "What Yann LeCun is doing may or may not succeed, but at least his team has a clear understanding of these matters and knows where the existing problems lie. If you don't even know what the problems are, it's fundamentally impossible to solve them."
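As a rough illustration only, the six modules named above can be wired into a single perceive-predict-evaluate-act cycle. This sketch is purely schematic and is not LeCun's actual design:

```python
# Schematic wiring of the six modules in LeCun's proposed architecture.
# Every value here is a placeholder; only the module roles come from the text.

def configurator(task):
    """Configurator module: prepares the other modules for the task."""
    return {"task": task}

def cycle(observation, memory, task="reach-goal"):
    config = configurator(task)
    state = ("percept", observation)     # perception module encodes the input
    prediction = ("predicted", state)    # world model imagines an outcome
    cost = 1.0                           # cost module: placeholder score
    action = ("act", config["task"])     # actor module picks an action
    memory.append(state)                 # short-term memory stores the state
    return action, cost

memory = []
action, cost = cycle("frame_0", memory)
print(action, cost, len(memory))
```

The key contrast with the pure scaling route is structural: each module has a dedicated job, rather than one monolithic model learning everything from data volume alone.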

    Schematic diagram of Yann LeCun's autonomous intelligent system architecture, from 'A Path Towards Autonomous Machine Intelligence, Version 0.9.2, 2022-06-27'

    Technological development is nonlinear. When one technical approach reaches a certain level, it stops progressing, and new approaches emerge to surpass it and reach higher levels. "The same goes for AI. Sora's current wave doesn't seem capable of achieving physical understanding or realizing AGI. New technologies will emerge later to surpass it, and ultimately we will be able to create AGI or world models," Wang Huamin concluded.

    No one knows what OpenAI is thinking. "Before Sora's release, the outside world had no idea what they were doing or how far they had progressed," an entrepreneur told EBON. "OpenAI has implemented military-style management internally. They have more models than just Sora, but no one knows what they are, and they're using these models extensively."

    Perhaps before the next wave of technological innovation arrives, we can look forward to seeing Sora implemented in more scenarios.
