Post-Sora Observations: Eight Anchoring Points for the Industrial Implementation of Large AI Models
In 2024, domestic large models will penetrate further into industries and take concrete form. Beyond technological breakthroughs, there will be greater downward compatibility with industry needs and more practical case studies, accelerating the digital transformation of industries as a new form of productive force.
"The film tells the space adventure of a 30-year-old man wearing a red wool knit and a motorcycle helmet, with nothing but blue skies and salt deserts in sight. Please produce a vibrant, cinematic-style short film shot on 35mm film."
This prompt comes from the introduction page of OpenAI's first text-to-video model, Sora. In the corresponding video, the smooth playback, high-definition quality, video length, coherence, and seamless multi-shot transitions are nothing short of astonishing. Notably, within less than three months after Pika's product launch, OpenAI's Sora has taken a groundbreaking step forward in video duration, aspect ratio, and video expansion capabilities, marking a significant leap in the field.
The capital market has responded enthusiastically to the explosive debut of the AI video model Sora, with the A-share market experiencing an AI boom. The artificial intelligence index surged by over 7% during trading, and several individual stocks hit their daily limit-up.
The advent of AI large models capable of text-to-video generation has heralded 2024 as the dawn of a new era. Over the past year, from the explosive innovation in generative AI, to the deep integration of model miniaturization and scenario-based applications, to the prosperity of the open-source ecosystem and cross-domain synergies, AI large models have been reshaping the world at an unprecedented pace.
At this historic turning point, coinciding with the emergence of Sora, we delve into eight key anchors of AI large model development: What will be the defining markers of AI technological advancement in 2024? Where will they emerge? And what paths will the world take toward AGI?
Undoubtedly, a new wave of AI productivity revolution is upon us. The market consensus is that in the field of general large models, given the high financial barriers to R&D, only a few tech giants are expected to emerge victorious from the competition; meanwhile, foundational large models alone cannot cover the diverse needs of most small and medium-sized enterprises (SMEs).
Currently, mainstream large model providers in the market are predominantly internet companies, accounting for nearly half of the sector. Examples include Baidu's ERNIE model, Alibaba's Tongyi model, and Tencent's Hunyuan model.
Objectively speaking, general large models often struggle to precisely address the specific problems of every enterprise. When selecting a large model, key considerations for businesses also include its alignment with industry characteristics, data security strategies, iterative upgrade capabilities, and overall cost-effectiveness. It is foreseeable that this year's development of large models will exhibit an increasingly pronounced specialization trend, dividing mainly into general-purpose models, specialized models, and models designed for specific scenarios.
Vertical industry-specific models will play a crucial role in promoting the widespread application of large models, by integrating general public domain data with industry-specific data to collectively build the data foundation required for industrial-scale large model training. For instance, in the healthcare sector, 'Good Doctor Xiao Hui' developed by Runda Medical in collaboration with Huawei Cloud is a vertical large model specializing in medical testing. It is built upon the Pangu large model and the Huijian testing knowledge graph. Similarly, in the education field, NetEase Youdao's first officially registered education vertical large model, 'Zi Yue,' has been successfully applied in smart hardware and App products.
In supply chain management, QiQitong, leveraging its 'multi-empowerment' strategy in digital procurement, has likely explored or already developed a vertical AI model optimized for procurement and supply chain processes. In the tourism industry, Ctrip's first tourism vertical large model, 'Ctrip Ask,' provides users with comprehensive intelligent service support, from pre-trip planning to in-journey services and post-trip feedback, demonstrating the profound impact of AI on the transformation of the tourism industry. This year, industry-specific vertical models, focused on particular industries, sectors, and vertical data, will become a core implementation trend beyond technological breakthroughs. More specialized fields, such as enterprise security management and financial and tax management, may see new AI opportunities emerge in 2024.
With advancements in artificial intelligence theories like deep learning and reinforcement learning, as well as the successful practical applications of large models such as the GPT series and Alpha series, today's AI Agents have already achieved relatively mature capabilities in knowledge representation, learning, and reasoning. From a global perspective, OpenAI's GPT-3 is now utilized in various scenarios such as code generation and text creation, serving as a mature C-end tool for public use.
Beyond mere tool-level applications, with technological breakthroughs and gradual implementation, AI Agents are progressively achieving comprehensive processing of multimodal information including vision, hearing, and language. This enables them to understand and adapt to more complex real-world environments, extending their applications to the C-end market.
For instance, Google's proposed CoCa is a multimodal pre-training model that integrates image and text understanding, with its application scenarios continuously expanding. In fields like customer service, education, healthcare, and industrial manufacturing, AI Agent-based systems such as intelligent customer service, teaching assistants, diagnostic aids, and automated production line decision-support systems are beginning to see large-scale deployment and application. Furthermore, in 2024, it is evident that AI Agent advancements are not limited to software tools but also include the intelligent upgrades of hardware devices (such as robots and drones), achieving integrated applications that combine software and hardware, further driving their practical implementation. Examples include decision-making systems in autonomous vehicles and interaction modules in home service robots.
Whether it's the consolidation of theoretical foundations, the launch of technological products, the enrichment of practical cases, or the refinement of industry chains, all clearly indicate that AI Agents are gradually transitioning from theoretical research to practical application. Domestic companies are also accelerating their competition in this market, with products such as DingTalk, Feishu, and Kingsoft Office integrating AI Agent capabilities.
DingTalk has integrated a large model called "Tongyi Qianwen" into its product. By incorporating this powerful AI technology, DingTalk can provide users with more intelligent collaborative services, such as intelligent customer service, speech-to-text conversion, automatic meeting minutes generation, and smart schedule management.
Additionally, "Tongyi Qianwen" may also assist users in solving complex problems in work scenarios, providing cross-departmental information queries and customized solutions based on business needs. Feishu has launched the intelligent assistant 'MyAI,' which can understand and execute users' natural language commands, handling daily workflow tasks such as document retrieval, project progress tracking, and internal communication coordination. It also continuously optimizes the user experience by leveraging machine learning. Reportedly, Feishu's MyAI is advancing toward more advanced office automation features, such as predicting team workloads and intelligently recommending workflow optimizations.
As cases like these multiply and both software and hardware mature, AI Agents are moving from mere 'technical showcases' into practical use.
MaaS (Model-as-a-Service) is a cloud computing model that provides pre-trained AI models to developers and enterprise users via APIs or SDKs. This allows them to quickly integrate AI technology into their products and services without having to build complex machine learning models from scratch. Specifically, MaaS simplifies the process of using AI, eliminating the need for users to possess deep AI expertise or substantial computational resources to train models. This significantly reduces the difficulty and cost for both enterprises and individuals to apply AI technologies. MaaS also provides standardized interfaces, allowing users to flexibly call different model services according to their needs, saving considerable R&D time and financial investment.
Users no longer need to maintain and run complex models locally. Instead, they can access cloud services on demand, achieving efficient utilization of computational resources and cost-effectiveness. The MaaS model supports businesses of various industries and scales to quickly implement intelligent solutions, such as precision marketing, risk assessment, and intelligent customer service, further accelerating the adoption and application of AI across different sectors.
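The API-driven pattern described above can be sketched in a few lines. The client below is purely illustrative: the endpoint URL, payload fields, and response shape are assumptions for demonstration, not any real vendor's API. A pluggable transport keeps the sketch offline-testable while showing where a real HTTP call would go.

```python
import json

# Hypothetical MaaS client sketch. Endpoint, model names, and the
# request/response schema are invented for illustration only.
class MaaSClient:
    def __init__(self, api_key, endpoint="https://maas.example.com/v1/infer", transport=None):
        self.api_key = api_key
        self.endpoint = endpoint
        # transport(url, body) -> raw response string; inject a real HTTP
        # layer in production, or a stub for offline use.
        self.transport = transport

    def build_request(self, model, prompt, **params):
        # A standardized payload: the same interface regardless of which
        # underlying model the service routes the call to.
        return {
            "model": model,
            "input": prompt,
            "parameters": params,
            "headers": {"Authorization": f"Bearer {self.api_key}"},
        }

    def infer(self, model, prompt, **params):
        if self.transport is None:
            raise RuntimeError("no transport configured")
        request = self.build_request(model, prompt, **params)
        raw = self.transport(self.endpoint, json.dumps(request))
        return json.loads(raw)["output"]

# Offline usage with a stub transport standing in for the cloud service:
def fake_transport(url, body):
    prompt = json.loads(body)["input"]
    return json.dumps({"output": f"echo: {prompt}"})

client = MaaSClient(api_key="demo-key", transport=fake_transport)
print(client.infer("text-model-v1", "hello", temperature=0.7))  # echo: hello
```

The point of the sketch is the division of labor: the user supplies only business inputs and parameters, while model hosting, scaling, and upgrades stay on the provider's side of the API boundary.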
Under this model, service providers are responsible for the continuous optimization and updates of models, while users need only focus on business logic and final outcomes, benefiting from the latest AI advancements. Among cloud vendors, major players such as Huawei, Tencent Cloud, Alibaba Cloud, and Baidu Intelligent Cloud all provide such services. Specialized companies like Enflame Technology also offer large model-based services on their "YaoTu Text-to-Image MaaS Platform." Additionally, many startups and traditional software service providers focused on specific fields or industries have begun offering MaaS-related services.
It is foreseeable that this model will become a new service paradigm for cloud vendors, offering enterprises a novel payment model beyond SaaS, PaaS, and IaaS. For the cloud computing market, this represents a new direction for development and market expansion.
Since 2023, numerous model and hardware manufacturers have announced their vision of embedding large models into end devices. Chip manufacturers like NVIDIA, Intel, and Arm are actively developing AI chip products for end devices, strongly supporting the widespread application of large models in the consumer electronics market. With advancements in technology and optimization, including model miniaturization, lightweight design, enhanced edge computing capabilities, and low-power consumption designs, more and more large models or their simplified versions are expected to be embedded into various smart terminals such as personal computers, smartphones, AR glasses, and home appliances.
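One of the "miniaturization" techniques mentioned above is post-training weight quantization: storing weights as 8-bit integers plus a scale factor instead of 32-bit floats, cutting storage roughly fourfold. The sketch below is a minimal, dependency-free illustration of symmetric per-tensor quantization; real on-device toolchains are far more sophisticated (per-channel scales, calibration, quantization-aware training).

```python
# Minimal sketch of symmetric int8 weight quantization, for illustration
# only: float32 weights -> (int8 values, one scale factor).

def quantize_int8(weights):
    """Map a list of floats to int8 codes plus a per-tensor scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes and scale."""
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.003, 0.5, -0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is bounded by half the quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
assert max_err <= scale / 2 + 1e-9
```

The trade-off is exactly the one the paragraph describes: a small, bounded accuracy loss in exchange for a model footprint and memory bandwidth that a phone or AR device can actually afford.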
Additionally, industry experts are optimistic about the application of large models in more vertical fields. Currently, domestic large model manufacturers like Zhipu and Tongyi have gradually introduced "lightweight" models adapted for mobile terminals.
On the smartphone manufacturer side, Xiaomi announced its first GPT large model product, MiLM; OPPO released its personalized exclusive large model and intelligent agent, AndesGPT; vivo officially launched its self-developed AI large model, BlueLM; the Honor Magic6 supports Honor's self-developed 7B on-device AI large model; and Huawei announced that its Pangu large model is being integrated into smartphones. Under this trend, it is anticipated that more customized and industry-specific 'lightweight' large models will achieve commercial implementation in 2024, providing users with more personalized, efficient, and real-time local intelligent services.
With the realization of this vision, some previously difficult-to-achieve technologies will also become a reality.
For example, highly personalized voice assistants capable of deeply understanding user needs can more accurately predict user behavior and provide decision-making suggestions, assisting with daily tasks, travel planning, and more. In fields such as healthcare, law, and education, large models can function as expert systems, offering professional consultation services directly on mobile devices—for example, providing preliminary diagnostic suggestions based on patient symptoms or legal advice instantly.
Large model-driven creative tools for image generation, video editing, and text writing allow users to produce high-quality content with simple instructions, such as generating marketing posters with one click or automatically creating short video scripts.
Integrated large models in smart home devices enable autonomous learning and optimization of the home environment, including automated decision-making for energy-saving management, security protection, and comfortable living experiences, with stronger understanding and interaction capabilities. Large model applications in enterprise software, such as financial analysis, market trend prediction, and customer relationship management, can quickly respond to complex issues on mobile devices, providing managers with real-time decision-making support.
In summary, in 2024 the application scenarios combining large models with terminal devices will further diversify and deepen, moving from theory to practice and potentially spawning new killer applications and services. Smartphone manufacturers and smart home device makers in particular, the gateways of the previous era, are striving to become the new gateways of the AI era.
With the emergence of Sora, it is evident that beyond the development of models in specific fields like computer vision and natural language processing, the further cross-integration of multimodal large models may become a crucial practical direction in 2024. Unlike traditional interaction methods limited to single modalities like keyboard input or touchscreen operations, multimodal large models can integrate and understand multiple input modes (such as voice, images, text, gestures, etc.). This allows them to mimic the complexity and richness of natural human communication, closely resembling how we interact with others in daily life.
As mentioned at the beginning of the article, OpenAI's Sora is a prime example of a multimodal large model. The attitude of capital toward it clearly reflects its vast commercial potential for future applications.
It is foreseeable that future multimodal large models will be able to recognize and respond to users' voice commands, facial expressions, body movements, and even eye contact, enabling interactions with machines as natural and comfortable as conversing with a real person. They will also integrate information from different modalities to extract deeper meaning, such as understanding context by combining visual and auditory cues, enabling machines to better interpret user intentions and communicate effectively even in ambiguous, noisy, or informal situations.
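The idea of combining modalities to resolve ambiguity can be illustrated with a toy late-fusion scheme: each modality emits a confidence and a score distribution over possible intents, and a confidence-weighted average decides. This is a deliberately simplified sketch; real multimodal models fuse learned embeddings rather than hand-set scores, and the intents and weights below are invented for the example.

```python
# Toy late-fusion sketch: confidence-weighted averaging of per-modality
# intent scores. All numbers below are illustrative assumptions.

def fuse(modality_scores):
    """modality_scores: {modality: (confidence, {intent: score})} -> top intent."""
    weight_sum = sum(conf for conf, _ in modality_scores.values())
    totals = {}
    for conf, scores in modality_scores.values():
        for intent, s in scores.items():
            totals[intent] = totals.get(intent, 0.0) + conf * s / weight_sum
    return max(totals, key=totals.get)

# Voice alone is ambiguous ("play" vs "pause"), but a confident gesture
# observation tips the combined decision:
observations = {
    "voice":   (0.6, {"play": 0.5, "pause": 0.5}),
    "gesture": (0.9, {"play": 0.9, "pause": 0.1}),
}
print(fuse(observations))  # play
```

Even this crude scheme shows the core property the paragraph claims: a signal that is useless in isolation becomes decisive once combined with context from another modality.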
Deep learning-based large models can self-optimize and provide personalized services based on user habits and preferences, offering more accurate feedback and suggestions, thereby achieving dynamic and personalized interaction processes.
For users with special needs, such as individuals with disabilities, multimodal interaction provides more diverse means of interaction, allowing them to communicate in the most suitable way for themselves, thereby enhancing the inclusivity and accessibility of technology. In virtual reality (VR) and augmented reality (AR) environments, multimodal large models can create highly immersive experiences by perceiving comprehensive sensory inputs from users, enabling real-time feedback and interaction.
In team collaboration and remote work scenarios, multimodal systems can capture and interpret different modal signals from multiple individuals simultaneously, facilitating efficient communication and collaboration.
This type of multimodal large model will significantly enrich future human-machine interactions, potentially enabling multi-dimensional communication through text, visuals, voice, and more, thereby improving efficiency. Currently, major tech companies are actively deploying AI technologies. For instance, Alibaba Cloud's DAMO Academy has extensive applications of multimodal technologies in natural language processing and image recognition, having launched corresponding services and products. Tencent YouTu conducts in-depth research in computer vision and multimodal intelligence, with products and services covering various application scenarios from content understanding to social interaction. Baidu's large-scale pre-trained models like ERNIE-ViLG possess multimodal understanding and generation capabilities, serving multiple scenarios including search, advertising, and maps.
At the end of 2023, an agreement between OpenAI and Axel Springer signaled that AI companies will need to pay media brands for content used to train large models. This suggests that compensating data providers for intellectual property may become an industry trend.
In 2023, many regions in China introduced policies to promote AI technology development, such as "Beijing's Measures to Promote General AI Innovation" and "Shenzhen's Action Plan for Accelerating High-Quality AI Development and Application," both of which mention "high-quality datasets." Additionally, the Interim Measures for the Management of Generative Artificial Intelligence Services jointly issued by the Cyberspace Administration of China and six other departments stipulates that generative AI service providers must not infringe upon others' intellectual property rights.
This demonstrates that with the recent surge in AI policy implementations, the issues surrounding high-quality datasets and training data copyrights are receiving increased attention. The value of premium training databases will become more prominent in the future.
Currently, in the process of large model training, particularly in the field of deep learning, certain vector databases and distributed storage systems have shown outstanding performance in managing and accessing massive datasets. Notable examples include Tencent Cloud's vector database service and Alibaba Cloud's distributed NoSQL database. Moreover, data issues extend beyond mere database concerns. In 2024, questions regarding data privacy protection and ownership rights will come to the forefront: For instance, what types of data can AI model developers legitimately use for training? Where should proprietary datasets be sourced from? How can better datasets be obtained through annotation? And crucially, who owns the copyright to products generated by AI models?
These data-related challenges will become new flashpoints for AI in 2024.
Statistics show that average GPU and TPU costs for AI enterprises currently stand at ¥73,900 and ¥22,900 respectively. Despite being more expensive, GPUs' superior performance in parallel computing, particularly for deep learning workloads, makes the additional expenditure an unavoidable cost for businesses. In terms of market share, GPUs remain the most popular processor architecture for deep learning. Nvidia currently holds a strong competitive advantage and brand influence in the GPU field, but diversified supplier choices still exist in practice.
In one survey, all nine companies surveyed chose Nvidia GPUs as their primary solution, though AMD GPUs were also favored by some (companies C, D, and H). Notably, domestic players Huawei and Cambricon have begun making their mark in the GPU market, with their products selected as suppliers by several companies. In China's AI chip market, Huawei HiSilicon's Ascend 910 currently leads in single-card AI computing power, achieving 320 TFLOPS at half precision, on par with Nvidia's A100 PCIe version.
Overall, while there remains a significant gap between domestic and overseas chip technologies and software ecosystems, various constraints have conversely accelerated the growth of domestic chip manufacturers.
What's evident is that with China's strategic emphasis on self-reliant information technology, the government has provided policy support and technical guidance to local GPU enterprises, encouraging independent R&D of GPU technologies. This is steadily reducing reliance on external suppliers. With an increasing number of domestic companies achieving breakthroughs in GPU core technologies, they are enhancing product competitiveness by optimizing designs and reducing costs. Simultaneously, they are customizing products to meet the specific needs of the domestic market, thereby lowering the total procurement and usage costs for users. Local GPU enterprises are also strengthening cooperation with upstream and downstream industry partners to jointly build a complete ecosystem chain. This involves resource integration and collaborative innovation across multiple stages, from raw material supply and design manufacturing to system integration, improving overall efficiency and reducing costs.
In summary, against the backdrop of high GPU unit costs, domestic companies are growing rapidly, driven by the external environment and market demand. Although a noticeable gap with foreign enterprises remains and will likely persist in the coming years, it is narrowing under the catalysis of internal and external factors. Over the past year, while there have been some successful cases of large models in the B2B sector, their customization and practicality in vertical fields are still developing. Additionally, data privacy and security regulations may not have fully kept pace with technological advancements, presenting compliance challenges for enterprises adopting large models.
More importantly, internal corporate awareness and acceptance of new technologies vary, and large-scale deployment still requires time to build market confidence and technical readiness. The supporting industrial chain needs further improvement, including hardware computing power, software ecosystems, and talent reserves, which require additional accumulation and development.
As technological maturity increases, large model technology is expected to reach higher maturity levels in 2024, with not only enhanced model performance and generalization capabilities but also better adaptability and specificity in vertical applications. This will enable large models to more effectively address complex issues in B2B operations. Beyond this, as digital transformation deepens, B2B enterprises have accumulated vast amounts of industry and operational data. In the future, large models will be able to better leverage this data for deep learning and predictive analysis, providing robust support for decision optimization, productivity enhancement, and cost control.
Moreover, the construction of infrastructure such as cloud computing and edge computing has become more advanced, creating the conditions for deploying large models on terminal devices. This enables large models to respond in real-time across various business scenarios, meeting B2B users' demands for fast, accurate, and personalized services.
If 2023 saw a wave of entrepreneurs targeting the consumer (C-end) market in the large model field, 2024 will make the B2B sector the most critical battleground. For cloud providers and software vendors, beyond mere C-end visibility, more effort will be directed toward monetization and implementation in the B2B space, aiming to transform AI into genuine productivity. With the deepening of AI applications, the demand for high-quality, large-scale, and representative training data has become more urgent. However, the cost and difficulty of obtaining and cleaning such data are high, especially when dealing with multi-source heterogeneous and real-time streaming data. Ensuring data quality, integrity, and timeliness remains an ongoing challenge.
In addition, although computing power continues to improve, enhancing model accuracy, robustness, efficiency, and reducing resource consumption remains a significant challenge in the face of increasingly complex task scenarios and more refined application requirements. Particularly in the field of deep learning, the high cost of large model training and the need for further development and refinement of optimization techniques such as model compression, acceleration, and fine-tuning strategies persist. Despite rapid advancements in AI technology, transforming cutting-edge innovations into viable products and services involves navigating development costs, maintenance expenses, hardware investments, and ensuring sustainable business models with significant economic returns. This process tests the market's productization capabilities and open ecosystem.
Moreover, each industry presents unique requirements and standards. For AI technology to achieve successful commercialization, it must deeply understand and adapt to sector-specific characteristics, identify practical application scenarios, and overcome inter-industry barriers—a formidable challenge in itself.
In summary, while the emergence of Sora demonstrates remarkable progress in AI technology, implementation hurdles persist. Key challenges for large AI models in 2024 will revolve around data complexities, pushing model performance to its limits, and solving the triangular dilemma of balancing effectiveness, cost, and marginal returns, all critical realities of commercialization. Over the past year, we've witnessed transformative concepts like MaaS, AI Agents, multimodal systems, open-source developments, parameter competitions, and industry-specific models, all accelerating industrial evolution and propelling China's digital transformation. Through 2024, domestic large models are poised to become more deeply embedded in practical applications. Beyond technological breakthroughs, we'll see greater downward compatibility with industries and more real-world implementation cases, serving as a new productive force that accelerates the ship of industrial digital transformation forward.