Tencent's Latest Large Model Training Method: Angel Framework Upgrade Boosts Efficiency by 2.6x

Amid exponential growth in large model parameter scales, Tencent recently disclosed its latest training method for the Hunyuan large model. By upgrading its self-developed machine learning framework Angel, the company improved large model training efficiency by 2.6x, cutting the computing power cost of training billion-parameter models by up to 50% and providing strong support against computing power shortages. Beyond the efficiency gains, the upgraded Angel framework supports ultra-large-scale single-task training on tens of thousands of GPUs, further improving the performance and efficiency of Tencent Cloud's HCC dedicated computing cluster for large models.

[Image: AI-generated image, authorized by Midjourney]

To further improve the training and inference efficiency of large models, Tencent independently developed the machine learning training framework AngelPTM. On the storage side, AngelPTM spreads model state across GPUs with multi-dimensional parallelism, combining data parallelism, model parallelism, pipeline parallelism, and sequence parallelism.

Additionally, by introducing a unified memory view based on ZeRO-Cache, the framework pools GPU memory and host memory, effectively expanding usable GPU memory and increasing single-machine storage capacity by 90%. For communication, Tencent took a hardware-software co-design approach: it built a 3.2 Tbps RDMA network to widen bandwidth, and added GPU topology awareness at the framework software level to achieve load-balanced pipeline parallelism. For stability, Tencent monitors the infrastructure network, hardware, storage, and cloud-native scheduling, complemented by automatic retraining and system fault-tolerance mechanisms.

To address rising inference costs, Tencent also launched the large model inference framework AngelHCF. It expands parallel capabilities and optimizes key components, including Embedding sharing, Attention operator optimization, and Paged Attention, to raise inference performance. Compared with mainstream frameworks, AngelHCF delivers a 1.3x inference speedup; in Tencent's Hunyuan large model for text-to-image generation, it cut inference time from 10 seconds to 3-4 seconds.

Tencent has thus achieved significant efficiency gains in large model training and substantial optimizations in the inference phase. These capabilities are now available on Tencent Cloud, giving users training and inference acceleration along with end-to-end fine-tuning for customized intelligent applications. More than 300 internal Tencent services and application scenarios have already integrated the Hunyuan large model for testing, covering text summarization, content creation, translation, coding, and more. This marks a comprehensive upgrade across the entire production pipeline, from model development to application deployment, forming a one-stop platform that further accelerates the adoption of large model applications.
