Who Will Be the Next Computing Power Leader in the AI Era?
"If we are to make investment layouts for the new generation of artificial intelligence, what are the relatively certain investment opportunities at the current stage? I would suggest focusing on the AI infrastructure layer, which includes not only AI chips but also a series of services and tools for large model training and inference," said Wang Mengfei, a partner at Fanzhuo Capital, to Xiaofanzhuo.
In fact, a noticeable sentiment in venture capital circles is that although large AI models are heavily hyped, truly investable opportunities are scarce.
Because generative AI is still in its early stages, opinions on investing in AI applications vary widely, ranging from aggressive to conservative. There is only one consensus: investing in AI infrastructure is the right move.
The reason is simple: Nvidia's latest quarterly report shows that it sold roughly 500,000 H100 and A100 GPUs in the third quarter, booking a staggering $14.5 billion in data center revenue, nearly triple the figure from the same period last year. In the second quarter, Nvidia had already sold some 900 tons of GPUs, bringing in $10.3 billion in data center revenue.
While generative AI has yet to generate massive profits, Nvidia has already made a fortune with its unparalleled AI infrastructure.
An awkward reality is that despite overwhelming demand for AI computing power and basic services, China's AI infrastructure got a relatively late start and has yet to form a mature ecosystem.
Under the combined effects of a heated AI market, Nvidia GPU price premiums, production shortages, and chip export bans, domestic AI infrastructure construction and investment appear to be facing a brand-new opportunity.
To explore how domestic AI infrastructure companies can seize opportunities and catch up in the era of large models, and to identify potential investment opportunities in AI-era infrastructure, Xiaofanzhuo organized a special salon on AI infrastructure. The event featured a dialogue between Wang Mengfei, a partner at Fanzhuo Capital, and Wei Haomin, a director at Tiankai Intelligent Computing, along with discussions with numerous professionals and investors from across the AI chip industry chain. Together, they explored the current state and future of AI infrastructure.
After the rise of generative AI, an industry consensus emerged that AI will disrupt countless industries. But before AI disrupts all these industries, the first thing to be disrupted may well be the computing power foundation itself.
Data centers have long served the functions of transmitting, accelerating, displaying, computing, and storing data. With the widespread adoption of the internet, data centers have become as integral to modern society as transportation networks, exerting profound impacts across numerous industries.
Wang Mengfei, a partner at Fanzhuo Capital, asserts: "In the era of large models, data centers remain a vital component of infrastructure as hubs for computational power. However, the heightened demands of large model training will necessitate extensive upgrades to traditional data centers."
This is because nearly all operational challenges in data centers boil down to power consumption and cooling—issues that will only intensify in the era of large models.
Taking electricity consumption as an example, according to figures cited by conference participants, a single training run of GPT-4 consumes roughly 62 million kWh of electricity. At that scale, regional differences in electricity prices translate directly into large cost differences.
Public data shows that China's lowest electricity price tier is about 0.32 yuan per kWh, while prices in Beijing can reach 0.80 yuan per kWh, nearly 2.5 times the lowest tier.
At a gap of roughly 0.48 yuan per kWh across 62 million kWh, a company would pay close to 30 million yuan more for a single GPT-4-scale training run for the same service. For most startups, annual net profit may not even reach 30 million yuan.
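A back-of-the-envelope calculation makes the gap concrete. The snippet below simply restates the figures quoted above; it is an illustrative sketch, not a cost model.

```python
# Electricity cost gap for one large training run, using the figures quoted above.
ENERGY_KWH = 62_000_000   # cited energy for one GPT-4-scale training run
PRICE_LOW = 0.32          # yuan/kWh, lowest price tier cited
PRICE_BEIJING = 0.80      # yuan/kWh, Beijing price cited

cost_low = ENERGY_KWH * PRICE_LOW
cost_beijing = ENERGY_KWH * PRICE_BEIJING
extra = cost_beijing - cost_low

print(f"Low-tier region: {cost_low / 1e6:.1f} million yuan")
print(f"Beijing pricing: {cost_beijing / 1e6:.1f} million yuan")
print(f"Extra cost:      {extra / 1e6:.1f} million yuan per run")
# 62e6 kWh * (0.80 - 0.32) yuan/kWh ≈ 29.8 million yuan
```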
Beyond steep electricity bills, the chain of problems that comes with higher power consumption is also imposing new requirements on data centers in the AI era.
For instance, as AI training requires higher computing power and greater electricity consumption, the heat dissipation issue in data centers is bound to become more severe.
At that point, will data centers need additional air conditioning? Will traditional air-cooling systems need to be replaced with liquid cooling? Will they even need to switch to new materials with better heat dissipation? A data center professional at the event said: "These issues are not just 'likely' to occur; they are inevitable. Data centers are already conducting extensive testing and upgrades to meet AI's new requirements."
Wang Mengfei, a partner at Fanzhuo Capital, also noted that, from a broader perspective, energy conservation and emission reduction in data centers are imperative under the "dual carbon" policy framework. Future investments in data centers and related infrastructure must therefore factor in energy efficiency and environmental protection.
According to data released by the State Grid, data centers nationwide consumed 270 billion kWh of electricity in 2022, roughly 3% of society-wide electricity consumption. In 2021, data centers consumed 216.6 billion kWh, about twice the cumulative power generation of the Three Gorges Dam over the same period.
Hong Jingyi, Deputy Secretary-General of the China Electronics Society, also stated: "Data centers consume large amounts of electricity and have high power loads. Under the 'dual carbon' goals, data centers must undergo green transformation."
Faced with the energy conservation and environmental protection issues brought by data centers, several investors at the scene expressed concerns: "Could projects be shut down due to excessive energy consumption?"
However, Wang Mengfei believes that demand for data centers in the AI era is certain, and the industry should not "give up eating for fear of choking." On the contrary, truly solving the problem of excessive energy consumption in data centers means investing in innovative technologies that fundamentally improve efficiency and reduce consumption.
Taking Huawei's data center in Ulanqab as an example, by adopting an indirect evaporative cooling solution and AI-based iCooling energy efficiency optimization technology, the data center has achieved an annual average PUE as low as 1.15. Compared to traditional chilled water solutions, it saves over 30 million kWh of electricity annually and reduces carbon dioxide emissions by approximately 14,000 tons per year.
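For context, PUE (power usage effectiveness) is total facility energy divided by IT equipment energy, so a PUE of 1.15 means only about 0.15 kWh of cooling and power-distribution overhead for every 1 kWh delivered to servers. The sketch below shows how a PUE improvement translates into annual savings; the 1.40 baseline PUE and the 120 million kWh annual IT load are illustrative assumptions, not figures from the Huawei case.

```python
# Rough annual savings from a PUE improvement (illustrative assumptions only).
IT_ENERGY_KWH_PER_YEAR = 120_000_000  # assumed annual IT equipment energy
PUE_BASELINE = 1.40                   # assumed conventional chilled-water facility
PUE_IMPROVED = 1.15                   # annual average PUE cited for Ulanqab

def facility_energy(it_energy_kwh: float, pue: float) -> float:
    """Total facility energy = IT equipment energy * PUE."""
    return it_energy_kwh * pue

saved = (facility_energy(IT_ENERGY_KWH_PER_YEAR, PUE_BASELINE)
         - facility_energy(IT_ENERGY_KWH_PER_YEAR, PUE_IMPROVED))
print(f"Annual savings: {saved / 1e6:.0f} million kWh")  # ≈ 30 million kWh
```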
It is foreseeable that as the foundation of computing power, data centers in the future AI era will continue to face old challenges like power consumption and heat dissipation, while also confronting new demands such as higher computing power and energy efficiency.
With the transformation of data centers imminent, what will the specific form of data centers in the era of large models look like? Moreover, for AI startups with limited financial resources, will data centers become more "democratized"?
Several conference participants pointed out that while the exact form of data centers in the era of large models is difficult to predict, from the demand side, future data centers may split into two categories: those for training general-purpose large models and those for scenario-specific training.
Specifically, enterprises developing proprietary large-scale AI models have extremely high demands for computing power and can better afford the substantial construction costs. Therefore, such companies are likely to choose regions with lower electricity prices, like Ulanqab City, to build large data centers for training their models. This approach offers two advantages: lower electricity costs and naturally cooler temperatures, which are beneficial for data center cooling.
On the other hand, scenario-specific training companies, whose models focus on particular areas such as marketing, documentation, finance, images, and short videos, have relatively lower computing power requirements. These companies can opt to build smaller data centers near their offices to save on management costs.
Wei Haomin, a director at Tiankai Intelligent Computing, further added: "The challenges of computing resource shortages and high costs in large-scale model development have drawn significant government attention. In the future, large-scale model enterprises may even consider utilizing public computing resources."
For example, Beijing is already constructing the Beijing Artificial Intelligence Public Computing Platform to provide open computing services that support R&D and innovation. Additionally, the initiative encourages enterprises to integrate computing resources and conduct R&D based on domestic computing power. Companies with highly innovative and ecologically rich large models can receive computing power subsidies of up to 10 million yuan.
In simple terms, large corporations construct massive data centers, while startups search for more economical solutions.
For both tech giants and startups facing the "Nvidia GPU shortage" issue, Wei Haomin suggests that "the solution lies in starting with the software ecosystem, combining independent hardware and software R&D to gradually break Nvidia's monopoly."
Wei specifically noted that in raw hardware performance, China's first-tier AI chips are already close to Nvidia's; Huawei's AI chips, for example, can rival Nvidia's high-end A100 GPU.
However, matching Nvidia's hardware performance doesn't mean domestic AI chips can replace Nvidia. Nvidia's real barrier isn't just GPU hardware performance but also its vast AI software ecosystem, CUDA.
In simple terms, Nvidia, a company founded 30 years ago that focused on GPUs from early on, defined the CUDA general-purpose computing framework, and developers have grown accustomed to building GPU-driven applications on CUDA's proprietary programming model.
If developers want to migrate to GPUs from Google, Amazon, Microsoft, or domestic manufacturers, they must learn entirely new software stacks, making the switching cost prohibitively high. This is why the industry scrambles for foreign AI chips rather than adopting domestic alternatives: the software ecosystem around domestic chips still lags behind.
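As a concrete illustration of that switching cost, consider how much mainstream deep learning code implicitly assumes CUDA. The snippet below is a generic PyTorch sketch, not code from any vendor discussed here; the point is that the "cuda" device string, the driver stack behind it, and any hand-written CUDA kernels all have to be replaced when moving to a non-Nvidia accelerator.

```python
# Minimal PyTorch sketch: typical training code assumes the CUDA backend,
# which is part of what makes migrating off Nvidia hardware costly.
import torch

# On an Nvidia GPU this resolves to the CUDA backend; on another vendor's
# accelerator the device string, toolchain, and any custom CUDA kernels
# must be swapped for that vendor's equivalents.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(32, 1024, device=device)
y = model(x)  # executes via CUDA kernels when the device is "cuda"
print(y.shape, device)
```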
Wei Haomin suggests that one potential solution lies in assisting more domestic AI manufacturers with custom chip development.
Wei Haomin revealed that Tiankai Intelligent Computing is collaborating with some large model developers on specialized chips. "Custom chips mean AI companies can modify both software and hardware according to their specific needs, ensuring the chip design fully aligns with algorithmic requirements to improve computational efficiency and performance. At the same time, AI companies can build private ecosystems around these custom chips, including software libraries, frameworks, tools, and support services to meet customized demands."
In other words, while domestic AI chip companies may find it difficult to compete head-on with Nvidia in the short term, they can at least address the urgent need for AI computing power through customized and privatized solutions.
When discussing how to break through the "computing power bottleneck," participants from several startups shared insights from their practical experience: optimizing algorithms and software for inference scenarios could be another viable approach.
Wang Mengfei summarized the on-site discussion: "Investing in AI infrastructure requires us to not only focus on our bottlenecks but also leverage China's industrial and scenario advantages. Large models must ultimately solve practical problems. On the inference side, many targeted optimization solutions can be developed for different task requirements and scenario characteristics. If these can be standardized as products or services, each scenario category holds significant potential."
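One common family of such inference-side optimizations is post-training quantization, which trades a little accuracy for lower memory and compute cost. The snippet below is only one illustrative example of the kind of targeted optimization discussed above, using PyTorch's dynamic quantization API; the toy model and its sizes are placeholders, not any participant's actual system.

```python
# Illustrative inference-side optimization: post-training dynamic quantization.
import torch
import torch.nn as nn

# A small stand-in model; in practice this would be a trained scenario model.
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 128),
)
model.eval()

# Convert Linear layers to int8 dynamic quantization for cheaper inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    out = quantized(x)
print(out.shape)  # same interface, lower memory and compute cost at inference
```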
From a longer-term perspective, Xiaofanzhuo also believes that the rapid iteration characteristics of AI chips mean the competitive landscape in this field will never remain fixed.
For instance, even major tech giants have grown weary of Nvidia's dominance. According to market analysis firm Omdia, nearly all companies purchasing large quantities of H100 chips are developing their own custom processors for AI, HPC, and video workloads, aiming to reduce reliance on Nvidia hardware.
Looking further ahead, silicon-based chips, represented by Nvidia's GPUs, have already been pushed close to their computational limits, with little room for breakthrough improvements. Meanwhile, emerging computing paradigms like optical computing and quantum computing are flourishing globally, offering potential performance and energy efficiency improvements that could exceed silicon chips by orders of magnitude.
The next computing power leader of the AI era may not necessarily be Nvidia.