How Competitive is AI Large Model Training? Unveiling the Mystery of Computing Power for Large Models

Posted by baoshi.rao

    Using 40 years of global weather data and 200 GPUs for pre-training, a Pangu weather model with hundreds of millions of parameters was developed in about two months.

    This is the story of Bi Kaifeng, a Tsinghua University graduate who trained this large model three years after graduation.

    The cost, however, is steep: at a rental price of 7.8 yuan per GPU-hour, training Bi Kaifeng's Pangu weather model could exceed 2 million yuan. And this is a vertical large model for a single field, meteorology; for a general-purpose large model, the cost might increase a hundredfold.
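
    To make the arithmetic concrete, here is a back-of-the-envelope version of that cost estimate in Python. The 60-day run length is an assumption based on "about two months"; idle time, storage, and failed runs are ignored.

    ```python
    # Rough training-cost estimate from the figures quoted above.
    # Assumes 200 GPUs rented continuously for ~2 months at 7.8 yuan/GPU-hour.
    gpus = 200
    price_per_gpu_hour = 7.8   # yuan
    training_days = 60         # "about two months"

    gpu_hours = gpus * training_days * 24
    cost_yuan = gpu_hours * price_per_gpu_hour
    print(f"{gpu_hours:,} GPU-hours -> {cost_yuan:,.0f} yuan")
    # 288,000 GPU-hours -> 2,246,400 yuan, i.e. just over 2 million yuan
    ```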

    Statistics show that China currently has more than a hundred large models with over 1 billion parameters. Yet the industry's rush to 'refine' large models runs into a problem with no ready solution: high-end GPUs are extremely scarce. High computing costs, scarce computing resources, and insufficient funding have become the industry's most visible challenges.

    How scarce are high-end GPUs?

    'Scarce, of course, but what can we do?' blurted out a senior executive at a major company when asked about the shortage of computing power.

    This seems to be a problem the whole industry acknowledges but cannot solve. At the peak, the price of a single NVIDIA A100 was inflated to 200,000–300,000 yuan, and the monthly rental for an A100 server soared to 50,000–70,000 yuan. Even at these prices, buyers may still not secure the chips; some have had the once-rare experience of suppliers backing out of deals.

    Zhou Lijun, a senior executive in the cloud computing industry, also shared similar sentiments: 'The shortage of computing power is real. Many of our clients want high-end GPU resources, but what we can provide still falls short of meeting the broad market demand.'

    (Image: a cloud service provider's high-performance computing cluster equipped with A100 GPUs, shown as sold out.)

    Facts prove that the shortage of high-end GPUs is an industry-wide issue with no short-term solution. The explosion of large models has led to rapid growth in demand for computing power, but the supply has not kept pace. While the computing power supply will eventually shift from a seller's market to a buyer's market in the long run, how long this will take remains unknown.

    Companies are calculating how many 'goods' (NVIDIA GPUs) they have on hand, even using this to gauge market share. For example, if a company holds close to 10,000 cards and the total market has 100,000 cards, its share would be 10%. 'By the end of the year, if the inventory reaches 40,000 and the market has 200,000, the share might be 20%,' an insider explained.

    On one hand, GPUs are hard to acquire; on the other, the threshold for training large models is not as low as the industry suggests. As mentioned earlier, the training cost for Bi Kaifeng's Pangu weather model could exceed 2 million yuan. However, it's important to note that this model is a vertical large model built on the foundation of the general-purpose Pangu model, with parameters in the hundreds of millions. Training a general-purpose large model with 1 billion parameters or more could increase costs tenfold or even a hundredfold.

    “Currently, the largest investment is in training. Without billions in capital, it’s difficult to sustain large-scale model development,” revealed Qiu Yuepeng, Senior Vice President of Tencent Group, COO of Cloud and Smart Industries Group, and President of Tencent Cloud.

    “You have to move fast—at least achieve results before funding runs out to secure the next round of financing,” described one entrepreneur about the current state of the large model race. “This path is a dead end. Without tens or hundreds of billions in backing, it’s nearly impossible to succeed.”

    Amid this scenario, the consensus in the industry is that as competition in the large model market intensifies, the market will shift from frenzy to rationality, and companies will adjust their strategies and control costs accordingly.

    Proactive Measures in a Challenging Landscape

    No conditions? Create them—this seems to be the prevailing mindset among most participants in the large model space. Companies are employing various methods to address real-world challenges.

    Because high-end GPU chips are scarce, and the latest-generation GPUs available in China are limited in supply and often underperform, companies need longer training times for large models. Many are exploring innovative ways to compensate for the computational shortfall.

    One approach involves using higher-quality training data to improve efficiency. Recently, the China Academy of Information and Communications Technology (CAICT) released a report on the standard framework and capability architecture for industry-specific large models, emphasizing the importance of data quality. The report recommends manual annotation and verification of at least a portion of raw data to build high-quality datasets, as data quality significantly impacts model performance.

    Beyond reducing costs through better data, improving infrastructure to ensure stable operation of clusters with thousands of GPUs for at least two weeks without failures is both a technical challenge and a method to optimize large model training.

    “As a cloud service provider, we help clients build stable and reliable infrastructure. GPU servers are relatively failure-prone, and any failure can interrupt training and stretch the overall duration. High-performance computing clusters offer more reliable service, reducing training time and easing some of the computational pressure,” said Zhou Lijun.

    Additionally, efficient scheduling of computing resources tests a provider’s technical capabilities. Xu Wei, head of East China internet solutions at Volcano Engine, told Titanium Media that acquiring computing resources is only part of the equation. The real challenge lies in effectively deploying and utilizing these resources. “Splitting a single GPU into smaller, distributed units for fine-grained scheduling can further reduce computational costs,” Xu noted.
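
    A minimal sketch of the fine-grained scheduling idea Xu describes: treat each physical card as a pool of fractional slices and bin-pack jobs onto them, keeping whole cards free for large jobs. The class names, the best-fit policy, and the fractional granularity are illustrative assumptions, not any provider's actual scheduler.

    ```python
    # Hypothetical fractional-GPU scheduler: jobs request a fraction of a card
    # (e.g. 0.3 GPUs) and are placed best-fit, so whole cards stay available.
    from dataclasses import dataclass, field

    @dataclass
    class Gpu:
        name: str
        free: float = 1.0                      # unallocated fraction of the card
        jobs: list = field(default_factory=list)

    def schedule(job_id: str, demand: float, pool: list) -> bool:
        """Place a job on the fullest card that still fits it (best-fit)."""
        candidates = [g for g in pool if g.free >= demand]
        if not candidates:
            return False
        best = min(candidates, key=lambda g: g.free)
        best.free = round(best.free - demand, 2)
        best.jobs.append(job_id)
        return True

    pool = [Gpu("gpu-0"), Gpu("gpu-1")]
    for jid, demand in [("infer-a", 0.3), ("infer-b", 0.5), ("infer-c", 0.4)]:
        print(jid, "placed" if schedule(jid, demand, pool) else "rejected")
    ```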

    Network conditions significantly impact the training speed and efficiency of large AI models. Training these models often requires thousands of GPUs, with high-speed network connections between hundreds of GPU servers. Any network congestion can drastically slow down training and reduce efficiency. "If just one server overheats and crashes, the entire cluster may need to stop, and the training task must restart. This places extremely high demands on cloud service operation, maintenance, and troubleshooting capabilities," said Qiu Yuepeng.
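
    The standard mitigation for the restart problem Qiu describes is periodic checkpointing, so a node failure costs only the steps since the last save rather than the whole run. A minimal sketch in PyTorch follows; the checkpoint path and save interval are arbitrary choices for illustration.

    ```python
    # Minimal checkpoint/resume loop: after a crash, a restarted job loses at
    # most `save_every` steps instead of the entire training run.
    import os
    import torch

    CKPT = "checkpoint.pt"  # arbitrary path for this sketch

    def train(model, optimizer, data_loader, total_steps, save_every=500):
        step = 0
        if os.path.exists(CKPT):               # resume after a failure
            state = torch.load(CKPT)
            model.load_state_dict(state["model"])
            optimizer.load_state_dict(state["optim"])
            step = state["step"]
        for batch in data_loader:
            if step >= total_steps:
                break
            loss = model(batch).mean()          # stand-in for the real loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step % save_every == 0:
                torch.save({"model": model.state_dict(),
                            "optim": optimizer.state_dict(),
                            "step": step}, CKPT)
    ```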

    Some manufacturers have taken alternative routes, such as moving from cloud computing architectures to supercomputing architectures to cut costs. For tasks that do not demand high-throughput, highly parallel computing, 'supercomputing cloud' services (supercomputing centers offering cloud access) can cost about half as much as 'cloud supercomputing' (cloud providers offering supercomputing services). Performance optimization can also raise resource utilization from around 30% to 60%.
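
    The utilization claim is worth unpacking: at a fixed rental price, doubling utilization halves the effective cost of each useful GPU-hour. A quick check, reusing the 7.8 yuan/GPU-hour rental figure quoted earlier:

    ```python
    # Effective cost per *useful* GPU-hour at different utilization levels.
    list_price = 7.8  # yuan per GPU-hour, rental figure quoted earlier
    for utilization in (0.30, 0.60):
        print(f"{utilization:.0%} -> {list_price / utilization:.1f} yuan per useful GPU-hour")
    # 30% -> 26.0 yuan; 60% -> 13.0 yuan: the same hardware does twice the work
    ```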

    Other manufacturers have chosen to use domestic platforms for large model training and inference to replace the hard-to-obtain NVIDIA GPUs. "We jointly launched the iFlytek Spark integrated machine with Huawei, which can perform training and inference on domestic platforms. This is truly remarkable. I'm particularly pleased to share that Huawei's GPU capabilities are now on par with NVIDIA's. Ren Zhengfei has placed great emphasis on this, with three Huawei board members working in dedicated teams at iFlytek to achieve performance comparable to NVIDIA's A100," said Liu Qingfeng, founder and chairman of iFlytek.

    Each of these methods is a significant engineering challenge in itself, and most companies cannot meet such demands with self-built data centers, so many algorithm teams turn to professional computing power providers. Parallel storage is another substantial cost, alongside the technical capabilities and reliability guarantees bundled into hardware costs. On top of these come operational costs: electricity for IDC availability zones, software, platforms, and personnel.

    Only GPU clusters on the scale of thousands of cards achieve economies of scale. Buying from a computing power service provider lets users share in that scale, driving the marginal cost of additional computing power toward zero.

    Sun Ninghui, a researcher at the Institute of Computing Technology of the Chinese Academy of Sciences and an academician of the Chinese Academy of Engineering, noted in a speech that AIGC has spurred explosive growth in the AI industry. The large-scale application of intelligent technology, however, faces a classic long-tail problem: entities with strong AI capabilities (such as cybersecurity agencies and meteorological bureaus), research institutions, and large enterprises account for only about 20% of computing power demand, while the remaining 80% comes from small and medium-sized enterprises (SMEs). Given their limited size and budgets, SMEs often cannot access computing resources or afford their high costs, making it hard for them to benefit from the AI era.

    To achieve the large-scale application of intelligent technology and ensure the AI industry is both "highly praised" and "profitable," there is a need for affordable and easy-to-use intelligent computing power, enabling SMEs to access computing resources conveniently and cheaply.

    Whether addressing the urgent demand for computing power in large models or solving various challenges in its application, a key new development is that computing power has evolved into a new service model driven by market demand and technological advancements.

    Exploring New Models of Computing Power Services

    What kind of computing power are we competing for in large models? To answer this, we must first discuss computing power services.

    In terms of types, computing power is divided into general-purpose computing power, intelligent computing power, and supercomputing power. The transformation of these computing powers into services is the result of both market and technological drivers.

    The 2023 Computing Power Service White Paper (hereinafter referred to as the "White Paper") defines computing power services as a new domain in the computing power industry, which is based on diverse computing power, linked by computing power networks, and aims to provide effective computing power.

    The essence of computing power services lies in achieving unified output of heterogeneous computing power through new computing technologies, while integrating with cloud computing, big data, AI, and other technologies. Computing power services encompass not only computing power but also the unified encapsulation of resources such as storage and networks, delivered in the form of services (e.g., APIs).

    Seen this way, a significant portion of those competing for NVIDIA chips are computing power service providers, i.e., computing power producers. End users who call computing power APIs only need to specify their requirements.
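
    In practice, "specifying requirements" means submitting a declarative job request to a provider's service interface. The sketch below is purely hypothetical; the endpoint, field names, and values do not correspond to any real provider's API and only illustrate the shape of such a call.

    ```python
    # Hypothetical compute-service job submission: the user declares what the
    # job needs and never touches physical GPUs. Endpoint and fields are made up.
    import json
    import urllib.request

    job_request = {
        "task": "train",
        "framework": "pytorch",
        "gpu_type": "A100-80G",   # a requirement, not a specific machine
        "gpu_count": 64,
        "storage_gb": 2048,
        "max_hours": 336,
    }

    req = urllib.request.Request(
        "https://compute.example.com/v1/jobs",   # placeholder URL
        data=json.dumps(job_request).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer <token>"},
        method="POST",
    )
    # response = urllib.request.urlopen(req)     # would return a job handle to poll
    ```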

    From a software perspective, large model usage generated by software interactions can be categorized into three types: 1) API calls to large models, priced and settled accordingly; 2) proprietary small models where users purchase or even deploy their own computing power; and 3) collaborations between large model providers and cloud providers, such as dedicated clouds with monthly payments. "Generally, these are the three models. Kingsoft Office currently primarily uses API calls and has developed its own computing power scheduling platform for internal small models," said Yao Dong, Vice President of Kingsoft Office.

    In the computing power industry chain, upstream enterprises primarily supply foundational resources such as general computing power, AI computing power, supercomputing power, storage, and networks. In the large model computing power competition, for example, NVIDIA sits upstream, supplying chips to the industry; the rising stock prices of server manufacturers like Inspur Information reflect that market demand.

    Midstream enterprises, such as cloud service providers and new computing power service providers, focus on computing power orchestration, scheduling, and trading to produce computing power, which is then supplied via APIs. Companies like Tencent Cloud and Volcano Engine operate in this segment. The stronger the service capabilities of midstream enterprises, the lower the barriers for application-side users, facilitating the widespread and inclusive development of computing power.

    Downstream enterprises rely on computing power services to generate value-added services or manufacturing. These users only need to specify their requirements, and computing power producers configure the necessary resources to complete the assigned "computing tasks."

    Compared to purchasing servers and building large model computing environments independently, this approach offers cost and technical advantages. For instance, Bi Kaifeng trained the Pangu weather model by directly accessing Huawei Cloud's high-performance computing services. How does this differ from other large model enterprises' computing power usage or payment processes?

    Evolution of Computing Power Business Models

    ChatGLM was among the first general-purpose large models launched in China. Taking Zhipu AI's ChatGLM as an example, publicly disclosed information indicates that Zhipu AI draws on multiple mainstream domestic AI computing power providers. 'Theoretically, all of them are used,' said an insider, likely meaning all of the major domestic computing power and cloud service providers.

    Pay-as-you-go and subscription billing are the mainstream models in today's computing power service market. Usage demand falls roughly into two types. The first is selecting a suitable computing power service instance: cloud providers offer high-performance GPU servers equipped with mainstream NVIDIA A800, A100, and V100 cards on their official websites.

    (Image: high-performance computing GPU types offered by a computing power service provider.)

    The other option is choosing a corresponding MaaS (Model-as-a-Service) platform for industry-specific fine-tuning of large models. Taking Tencent Cloud's TI-ONE platform as an example, the pay-as-you-go price for an 8C40G V100*1 configuration is 20.32 RMB per hour, which can be used for automated learning (vision), task-based modeling, Notebook, and visual modeling.
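
    At that published rate, pay-as-you-go costs are straightforward to estimate. As an illustration (the 72-hour job length is an assumed example, not a figure from the platform):

    ```python
    # Cost of a hypothetical fine-tuning job at the quoted TI-ONE rate.
    rate_rmb_per_hour = 20.32   # 8C40G V100*1 configuration, pay-as-you-go
    job_hours = 72              # assumed job length for illustration
    print(f"{job_hours} h x {rate_rmb_per_hour} RMB/h = {job_hours * rate_rmb_per_hour:,.2f} RMB")
    # 72 h x 20.32 RMB/h = 1,463.04 RMB
    ```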

    The industry is also advancing the 'integration of computing and network resources.' By comprehensively analyzing computing tasks and network resource status, cross-architecture, cross-regional, and cross-provider computing network orchestration solutions are formed, enabling flexible resource deployment. For instance, users can deposit funds into a computing power network and freely allocate resources across partitions, selecting the most suitable, fastest, or cost-effective partition based on application needs, with fees deducted from the prepaid balance.
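
    A sketch of that partition-selection logic: given per-partition quotes, pick the one that satisfies the job under the chosen criterion. All partition names, prices, and latencies below are invented for illustration.

    ```python
    # Illustrative partition selection in a computing power network.
    partitions = [
        {"name": "east-a",  "price_per_hour": 22.0, "latency_ms": 8,  "gpus_free": 32},
        {"name": "north-b", "price_per_hour": 18.5, "latency_ms": 35, "gpus_free": 128},
        {"name": "west-c",  "price_per_hour": 15.0, "latency_ms": 60, "gpus_free": 64},
    ]

    def pick(gpus_needed, criterion="cheapest"):
        """Return the partition with enough free GPUs that best meets the criterion."""
        usable = [p for p in partitions if p["gpus_free"] >= gpus_needed]
        key = {"cheapest": lambda p: p["price_per_hour"],
               "fastest":  lambda p: p["latency_ms"]}[criterion]
        return min(usable, key=key) if usable else None

    print(pick(64, "cheapest")["name"])  # west-c: lowest price with capacity
    print(pick(64, "fastest")["name"])   # north-b: lowest latency with capacity
    ```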

    Cloud service providers are naturally involved as well: computing power services, a distinctive product within cloud services, give them a fast route into the computing power industry chain.

    According to data from the Ministry of Industry and Information Technology, China's total computing power scale reached 180 EFLOPS in 2022, ranking second globally. By 2022, the computing power industry in China had grown to 1.8 trillion RMB. Large models have significantly accelerated the development of the computing power industry.

    Some argue that current computing power services resemble a new form of 'selling electricity.' However, depending on their role, some providers may also assist users with system performance tuning, software installation, large-scale job monitoring, and operational analysis—effectively handling last-mile maintenance tasks.

    With the normalization of high-performance computing demands for large models, computing power services, which originated from cloud services, have rapidly gained prominence, forming a unique industry chain and business model. However, at the onset of the computing power boom driven by large models, shortages of high-end GPUs, soaring computing costs, and the scramble for chips have become defining features of this era.

    'Right now, the competition is over who can secure GPUs in the supply chain. NVIDIA dominates the industry and controls the whole market; that is the reality,' commented an insider. In the current supply-constrained environment, securing GPUs is critical for delivering services.

    However, not everyone is scrambling for GPUs, as shortages are temporary and will eventually be resolved. 'Those engaged in long-term research aren’t rushing—they can wait because their work isn’t at risk. The ones desperately competing for GPUs are mostly startups trying to survive until next year,' the source added.

    Amid numerous uncertainties, the trend of computing power as a service remains certain. Computing power providers must stay prepared to adapt as large models stabilize and market dynamics shift rapidly.
