Breaking the Computational Power Bottleneck: Huawei's New Cluster, Equivalent to 18,000 GPUs, Supports Trillion-Parameter Model Training
On September 22, at Huawei Connect 2023, Huawei officially launched the Atlas 900 SuperCluster, a new-architecture Ascend AI computing cluster capable of supporting trillion-parameter model training.
Wang Tao, Huawei's Executive Director and President of both the ICT Infrastructure Business Management Committee and the Enterprise BG, said that the new cluster adopts Huawei's new Galaxy AI Intelligent Computing Switch, the CloudEngine XH16800. With its high-density 800GE ports, a two-layer switching network can achieve ultra-large-scale, non-blocking cluster networking of 2,250 nodes (equivalent to 18,000 GPUs).
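The headline figure follows from simple arithmetic, assuming the usual eight-accelerator Atlas node configuration (the per-node count is our inference; the announcement states only the two totals):

$$
2{,}250 \ \text{nodes} \times 8 \ \frac{\text{NPUs}}{\text{node}} = 18{,}000 \ \text{accelerators}
$$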
The new cluster also adopts an innovative super-node architecture that substantially improves large-model training performance.
Additionally, Huawei has leveraged its combined strengths in computing, networking, storage, and energy to improve reliability at the device, node, cluster, and workload levels, extending the achievable uninterrupted training time for large models from days to months.
To accelerate large-model innovation, Huawei released CANN 7.0, a more open and developer-friendly version that is fully compatible with mainstream AI frameworks, acceleration libraries, and large models. It also opens up low-level capabilities, allowing AI frameworks and acceleration libraries to directly invoke and manage compute resources, and enabling developers to write custom high-performance operators.
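To give a sense of what "directly invoking and managing compute resources" looks like at the lowest layer, here is a minimal sketch in C against AscendCL, the C runtime API that CANN exposes. The device ID, buffer size, and error handling are illustrative choices rather than Huawei reference code, and a real custom operator would add kernel compilation and launch on top of this.

```c
/* Minimal AscendCL sketch: initialize the runtime, claim an NPU,
 * and manage device memory directly, as an operator developer would.
 * Illustrative only; not Huawei reference code. */
#include <stdio.h>
#include "acl/acl.h"

int main(void) {
    /* Initialize the AscendCL runtime (NULL = default configuration). */
    if (aclInit(NULL) != ACL_SUCCESS) {
        fprintf(stderr, "aclInit failed\n");
        return 1;
    }

    /* Bind this host thread to NPU device 0 (device ID is an assumption). */
    if (aclrtSetDevice(0) != ACL_SUCCESS) {
        fprintf(stderr, "aclrtSetDevice failed\n");
        aclFinalize();
        return 1;
    }

    /* Allocate a 1 MiB buffer in device memory, as a custom operator's
     * workspace allocation would. */
    void *dev_buf = NULL;
    size_t size = 1024 * 1024;
    if (aclrtMalloc(&dev_buf, size, ACL_MEM_MALLOC_HUGE_FIRST) != ACL_SUCCESS) {
        fprintf(stderr, "aclrtMalloc failed\n");
    } else {
        printf("allocated %zu bytes on NPU 0\n", size);
        aclrtFree(dev_buf);
    }

    /* Release the device and tear down the runtime. */
    aclrtResetDevice(0);
    aclFinalize();
    return 0;
}
```

In practice, frameworks such as MindSpore or PyTorch adapters sit on top of calls like these; the point of the opened low-level layer is that acceleration libraries and operator authors no longer have to go through the framework to reach the hardware.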
Wang Tao noted that as AI enters the era of large models, massive computational power is becoming the core engine of AI development. Huawei has therefore shifted from stacking traditional servers to innovating at the system-architecture level, co-designing computing, networking, and storage within its AI clusters to break through the computational power bottleneck.