From 'AI ID Photos' to 'AI Dubbed Films': Exploring the Commercialization Path and Challenges of Viral AIGC Applications
-
HeyGen, the video translation tool that made Guo Degang speak English and Taylor Swift speak Chinese, has gone the way of Miaoya Camera, the app that sparked the AI ID photo trend: after a frenzy of sharing, it quickly faded from group chats.
In October, cross-language translation videos of celebrities created with HeyGen went viral online. People were amazed by the natural-sounding Chinese and English the AI produced, completely free from the stilted tone of dubbed films. The lifelike voice reproduction and tightly synchronized lip movements led many to exclaim, 'This is truly shocking' and 'Voice actors might lose their jobs'...
There is nothing new under the sun, and this phenomenal AIGC application couldn't escape the fate of rapid obsolescence.
Now, in the LLM discussion groups I'm in, someone occasionally shares a Chinese-English translation video, but it sparks no discussion. Most people probably don't even feel like clicking to watch it.
The public's sense of novelty fades quickly. 'Star-studded dubbed films' only serve as occasional entertainment, not a frequent necessity. After the initial curiosity wears off, when it comes to paying real money, the trend vanishes without a trace.
This year, large models have undoubtedly been the hottest global topic. But despite the hype, few large model applications have truly established themselves in the commercial market.
With so many models competing, why are there only a handful of successfully productized, phenomenal applications?
And why can't these mature, high-profile AIGC applications, which already have no shortage of attention, convert their traffic into lasting economic benefits, leaving commercialization still shrouded in mystery?
This article hopes to explore the productization conditions and commercialization myths of large models through the small lens of 'AI dubbing'.
Overnight Sensation
A Victory in Productization
First and foremost, it's important to clarify that the overnight success of AIGC applications like Miaoya Camera and HeyGen is undoubtedly a positive development for the large model industry.
Large models are merely foundational technologies, akin to steel, with large model manufacturers acting as steel mills. It's essential to have someone design specific products like washing machines, treadmills, and microwaves for new technologies to be utilized effectively.
The sudden rise of HeyGen represents a triumph in productization.
Technically speaking, cross-language video translation and production are not novel concepts. Many tech companies, film studios, and post-production firms have already explored and launched professional-grade tool platforms in this field.
In simple terms, it's an upgraded version of TTS (Text To Speech) technology. Large language models translate the text more idiomatically, and better modeling of the sound space is then used to train a cross-language transfer TTS model. This makes style transfer, timbre transfer, and emotion transfer more robust, so the synthesized speech sounds more natural and more faithful to the original speaker.
The characteristic of this technology is its efficiency—the entire translation process is fully automated, enabling batch generation of translated videos. However, in terms of naturalness and expressive details, it still falls short of the nuanced and creative performances of human voice actors.
To summarize, the technical principles behind HeyGen are not some exclusive secret.
The reason for its popularity is its exceptional productization capability.
Generally, the productization of AI technology involves three steps:
Step One: Tool Selection.
As the saying goes, "A craftsman must sharpen his tools to do his work well." Tool selection is a topic developers love to spend considerable time debating. HeyGen's approach to tool selection is quite pragmatic, even appearing particularly "newbie-friendly"—opting for top-tier closed-source models combined with an open-source "starter pack."
Netizens have uncovered that HeyGen utilizes Whisper for speech-to-text conversion, GPT-4 (currently not open-source) for text translation, so-vits-svc for voice cloning and audio generation, and finally GeneFace++ to synchronize the translated speech with the speaker's lip movements in the video.
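The four-stage chain described above can be sketched as a simple sequential pipeline. This is only an illustrative sketch, not HeyGen's actual implementation: the `Job` class, the stage function names, and the file name are all hypothetical, and each stage is a placeholder that does not call the real tools.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Job:
    """State handed from one stage of the hypothetical pipeline to the next."""
    video_path: str
    target_lang: str
    transcript: str = ""
    translation: str = ""
    dubbed_audio: str = ""
    output_video: str = ""


# Each stage below is a placeholder. In a real system it would wrap a
# speech-to-text model, a translation LLM, a voice-cloning model and a
# lip-sync model respectively. No real tool's API is called here.
def transcribe(job: Job) -> Job:
    job.transcript = f"<transcript of {job.video_path}>"
    return job


def translate(job: Job) -> Job:
    job.translation = f"<{job.target_lang} translation of {job.transcript}>"
    return job


def clone_voice(job: Job) -> Job:
    job.dubbed_audio = f"<cloned-voice audio for {job.translation}>"
    return job


def sync_lips(job: Job) -> Job:
    job.output_video = f"<{job.video_path} re-rendered with {job.dubbed_audio}>"
    return job


# The order mirrors the description: speech-to-text, translation,
# voice cloning, then lip synchronization.
PIPELINE: List[Callable[[Job], Job]] = [transcribe, translate, clone_voice, sync_lips]


def run_pipeline(video_path: str, target_lang: str) -> Job:
    """Run every stage in order and return the final job state."""
    job = Job(video_path=video_path, target_lang=target_lang)
    for stage in PIPELINE:
        job = stage(job)
    return job


if __name__ == "__main__":
    result = run_pipeline("talk.mp4", "en")
    print(result.output_video)
```

Because every stage shares the same signature, swapping one model for another means replacing a single function, which fits the article's point that a relatively optimal tool combination matters more than any single "best" tool.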
Since the rise of large models, we've seen many developers evaluating and selecting the "best" large models. However, with various base model providers in the market offering similar competing services, it's nearly impossible for developers to find tools that are absolutely the best. The cutting-edge nature of these underlying tools, such as base models and programming languages, is subject to change. What developers should focus on is selecting relatively optimal tool combinations, then quickly developing demos, validating ideas, and iterating for improvements.
Step Two: Prototype Design.
The tools chosen by HeyGen, whether it's the GPT-4 API or open-source models, are relatively easy to obtain. However, most ordinary people don't derive much pleasure from searching GitHub repositories or debugging software issues.
Take cross-language video translation as an example. It involves translating multimodal content, including speech, text, and video. While there are already excellent automation solutions for subtitle translation, speech synthesis, and intelligent dubbing, there are still few products that integrate these multimodal functions to achieve end-to-end, one-click translation.
Therefore, HeyGen has built a simple and easy-to-use interface. By integrating multiple models and tools, it lowers the barrier to translation. Users only need to upload the original video, select the target language, and click a button, then wait for voice cloning to finish and the translated video to be generated.
The core value of HeyGen is to allow non-technical users to avoid getting bogged down in numerous technical details. Without needing to install multiple additional tools, they can interact with various models to accomplish a range of complex tasks such as transcription, translation, dubbing, image processing, and audio-visual synchronization, enabling effortless high-dimensional and interactive content creation.
Step Three: Productization.
While celebrity cross-language translation videos are entertaining, they represent just one use case and remain confined to consumer-level entertainment due to copyright issues involving natural persons' voices and likenesses, making large-scale commercialization impossible. Thus, although celebrity dubbed videos have popularized HeyGen, the platform must demonstrate stronger product capabilities to truly deliver market value.
HeyGen's official website reveals that its core product strength lies in digital avatars combined with cross-language video translation. The platform highlights several practical applications, including cross-border e-commerce marketing videos, multilingual brand promotion, educational video creation for teachers, social media audience growth, and personalized videos for memorable occasions like birthdays and weddings.
Building on this foundation, HeyGen allows digital avatars to translate videos across languages through automated production pipelines.
Users can upload their own photos for personalized avatar customization or choose from HeyGen's provided digital avatar materials and templates. By inputting a script, they can generate multilingual videos tailored to their needs.
With this, HeyGen has successfully completed the transformation of AI translation into a productized solution, achieving remarkable success. This has led to the phenomenon where "years of translation work went unnoticed, but HeyGen became famous overnight."
From AI-generated portraits to the explosive popularity of AI-translated videos, the same pattern shows again and again that productization is a crucial, indispensable step.
It can be said with certainty that the inability to complete the transition from technology to prototyping and then to productization will be a major reason for the low return on investment in many large models, and one of the reasons for the failure of many AI startup projects.
Hard to Escape the Fate of "Rapid Obsolescence"
The Curse of Commercialization
However, even with such successful productization, HeyGen has repeated the story of its predecessor Miaoya Camera: after a sudden spike in traffic, it quickly disappeared from major group chats.
The ebb of public traffic seems to be the common fate of viral AIGC applications.
Some argue that HeyGen is 'making a fortune quietly.' Although the novelty-seeking players have dispersed, the remaining users have contributed to HeyGen's revenue growth, with a month-on-month growth rate exceeding 50% for nine consecutive months. Founder Joshua Xu also shared relevant data on social media, revealing that the Annual Recurring Revenue (ARR) reached $1 million in just seven months.
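To put these figures in perspective, here is a back-of-envelope calculation. It assumes growth of exactly 50% in each of the nine months (the article says "exceeding 50%", so the result is a lower bound) and treats ARR as twelve times the monthly recurring revenue. The function names are illustrative.

```python
def compound_growth(monthly_rate: float, months: int) -> float:
    """Total growth multiple after compounding a monthly rate."""
    return (1 + monthly_rate) ** months


def monthly_from_annual(arr: float) -> float:
    """Monthly recurring revenue implied by an annual recurring revenue."""
    return arr / 12


# Nine consecutive months at 50% month-on-month growth.
multiple = compound_growth(0.50, 9)

# Monthly revenue implied by an ARR of $1 million.
mrr = monthly_from_annual(1_000_000)

print(f"Growth multiple over nine months: {multiple:.1f}x")
print(f"Implied monthly recurring revenue: ${mrr:,.0f}")
```

Nine months at 50% compounds to roughly a 38-fold increase, while a $1 million ARR corresponds to only about $83,000 of revenue per month. The growth rate is striking, but the absolute base is still small.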
The question arises: Is HeyGen's commercialization potential sustainable?
We believe that HeyGen will face substantial commercialization challenges.
First, technological tools cannot be monopolized, and relying solely on multimodal AI cannot establish a viable business model.
HeyGen has achieved remarkable results in cross-language video translation production, leveraging the powerful multimodal and comprehension capabilities of large models, surpassing traditional AI translation methods. This is indeed impressive work. However, large models, like C++ or databases, are merely new technological tools that cannot be monopolized. The open-source tools used by HeyGen are easily accessible, and closed-source model APIs are widely available. Therefore, relying solely on underlying tools cannot establish a business model or competitive barriers.
Moreover, the development threshold for product creativity and user interfaces is not high. Numerous tech companies and individual developers can easily replicate and optimize these products, meaning HeyGen could be surpassed in no time.
Nowadays, when you open reports from overseas tech media, you'll find HeyGen (formerly movio) listed as just one of as many as 95 recommended video generation tools. HeyGen has provided a valuable AIGC use case, but it also quickly ignited fierce competition, which poses a significant threat to its sustained revenue growth.
Second, the rigidity of C-end payments and the deep industry barriers in the B-end sector will slow down the revenue growth curve.
Currently, HeyGen's revenue mainly relies on payments from C-end customers. The free version includes only one free credit, clearly just for casual use, while the cheapest paid tier, Creator, costs $24/month. Although this isn't too expensive for individual bloggers, as homogeneous products flood in and start price wars, HeyGen may soon face the dilemma of being perceived as poor value for money.
While commercial (business) users have strong payment capabilities and high price acceptance, their demands on cross-language video translation are more complex. Most clients of HeyGen's commercial version create e-commerce marketing advertisements, digital language learning assistants, multilingual news broadcasts, and dubbed films. They require finer-grained translation quality, such as keeping the length of the translated text close to that of the original speech so that lip-sync stays consistent. Additionally, each speaker has a unique speech rhythm: pauses and stress positions must align to reflect personal style accurately.
For example, when an elderly person and a child speak the same text, the wording and phrasing should differ due to their distinct character settings. Both the translated text and voice must align with these character portrayals.
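The length-matching requirement mentioned above can be illustrated with a rough check. This is a minimal sketch under stated assumptions, not how any vendor actually does it: the speaking rates, the 15% tolerance, and all names are hypothetical, and a real system would measure the synthesized audio rather than estimate from text.

```python
from dataclasses import dataclass

# Assumed average speaking rates (illustrative, not measured):
# English is counted in words per second, Chinese in characters per second.
ASSUMED_RATE = {"en": 2.5, "zh": 4.0}


@dataclass
class Segment:
    start: float          # start time of the original utterance, in seconds
    end: float            # end time of the original utterance, in seconds
    translated_text: str  # translation to be spoken in this time slot
    lang: str             # target language code, "en" or "zh"


def estimated_duration(text: str, lang: str) -> float:
    """Roughly estimate how long the text takes to speak, in seconds."""
    if lang == "zh":
        units = len(text.replace(" ", ""))  # count characters
    else:
        units = len(text.split())           # count whitespace-separated words
    return units / ASSUMED_RATE[lang]


def fits_time_slot(seg: Segment, tolerance: float = 0.15) -> bool:
    """True if the estimated speech length is within tolerance of the slot."""
    slot = seg.end - seg.start
    ratio = estimated_duration(seg.translated_text, seg.lang) / slot
    return abs(ratio - 1.0) <= tolerance
```

A segment that fails this check would be sent back for a shorter or longer rewording before voice synthesis, which is the kind of fine-grained control business users expect.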
There are also many cultural details that require strict control in cross-language translation, often necessitating human translators. After all, commercial scenarios differ from entertainment settings—misinterpretations in cross-language contexts can lead to ambiguity. Even a 1% error can negate the 99% of correct work, risking lost deals or compliance issues in overseas markets.
Therefore, commercial users require complex, high-quality, and highly controlled products. This requires vendors to have accumulated exclusive, high-quality data, especially for low-resource languages. Model training and strong industry knowledge both take long-term accumulation and iteration to reach the level of professional translators.
Currently, some AI companies have launched ToB product solutions to meet the demand for high-quality video translation. They train cross-lingual Voice Conversion (VC) models, capture the lip movements of voice actors, and manually control the process. The results generated by the VC model are more expressive and retain more detail than those of TTS models.
Domestic AI giants are now highly focused on the B2B market, with ample resources and deep expertise in machine translation, TTS, and multimodal AI technologies. They are likely to become competitors for HeyGen's commercial users.
The innovation wave of large models has just begun. Maintaining a leading edge in commercialization is like sailing against the current—even the most popular players cannot afford to be complacent.
The Rapid Rise and Fall of Popular Apps
The Commercial Myths of Large Models
On November 30, 2022, ChatGPT was launched. Just after its first anniversary, the wave of large language models has swept everyone along. Some may remain unaware, but they are inevitably caught up in it.
Over the past year, viral applications like Miaoya Camera and HeyGen have frequently taken social media by storm. This demonstrates several key points:
1. Large models are a condition, not the outcome. Those who skillfully leverage these new tools to create innovative products will unlock boundless opportunities in this new era.
2. Infrastructure is a challenge, but not an insurmountable one. Discussions about large models often revolve around concerns like restricted access to computing power (e.g., GPU shortages) and the performance gap of domestic models. While pessimists may be technically correct, it's the optimists who drive progress. In reality, neither computational infrastructure, development tools, nor foundational models should—or do—pose barriers for application developers today.
Some in the industry have said that domestic chips only need to reach 60% of Nvidia's performance for users to buy them. Several developers have told me that after intensive use of domestic large models like Baidu's ERNIE and iFlytek's Spark, the basic logical reasoning capabilities can indeed rival GPT-3.5-turbo, while non-core capabilities like function calling and stability are also commendable. Products like Miaoya and HeyGen, both developed by Chinese companies, prove that action speaks louder than insight.
3. Productization is a necessary condition for large model commercialization. However many general-purpose and industry-specific large models are created, without substantial productization they cannot be turned into practical value and economic benefits. What "changes the world" isn't large models themselves, but the products built on them. Helping developers move from prototyping to productization while reducing trial-and-error costs, so that countless applications like HeyGen can emerge, will become the most crucial move for large model providers.
4. Business barriers are built on essential scenarios + strong domain knowledge/data + software engineering. HeyGen's commercialization challenges demonstrate that neither large models nor products alone constitute barriers, as these can be easily replicated. Instead, industry-specific knowledge/data, large-scale software engineering process control, and cost-efficiency improvements—enabling deep scenario-specific development, rapid iteration, and optimization—align with AI's technical characteristics and ensure successful commercialization.
Several developers working on large model applications in the industry have independently shared a common insight: first identify the application scenario, then optimize the product and service. In other words, map out the commercialization path upfront, ensure your competitive moat is established, and only then focus on solid product development—this approach keeps anxiety at bay.
For example, a ToC homestay large model addresses the delicate balance in guest interactions. When property managers intervene too much, it feels intrusive and boundary-crossing; when they intervene too little, the service lacks value and fails to resolve issues promptly. A voice-interaction assistant powered by a large model serves as an effective buffer between guests and managers, delivering just the right level of service. Additionally, activities like dining, entertainment, and shopping during the guest's stay revolve around their accommodation. By leveraging the homestay large model to provide high-quality, reliable recommendations, it unlocks potential for commercial conversions.
A ToB financial large model developer also noted that the diverse needs within enterprise organizations cannot be met by a single generic, standardized software product. Therefore, ToB large model ventures must combine business analysis consulting with hands-on software development to truly serve clients well. Streamlining and automating AI software development is critical for cost control—relying on teams of PhDs to manually code for every project is unsustainable.
Deep insights into business and scenarios, as well as understanding the industry and customers, are far more challenging than mastering algorithms or technical skills. These are the core competencies that developers should prioritize the most.
Finally, I'd like to say that although large models have become extremely popular, there's no need to rush into worrying about 'bubbles' or fearing 'chasing highs.' This is just the beginning.
Survey reports from international consulting agencies show that 65% of respondents currently use generative AI occasionally or rarely, while about 90% of respondents believe AI should be used 'frequently or always.'
This means that public acceptance of machine learning and generative AI (Gen AI) is high, but actual penetration rates remain low. Phenomenal Gen AI products like Miaoya and HeyGen have undoubtedly taken a big step forward, but they alone are far from enough.
Popular AIGC applications represent only a small fraction of the value potential of AI and large models. The fact that no business model has been proven viable in the long term precisely indicates that there remains ample room for pioneers and builders to explore and develop on this new technological frontier.