Global and Domestic Top 10 AI Chip Computing Power Rankings: How Many Do You Know?
As part of the '2022 Research and Analysis Report on 45 Domestic AI Chip Manufacturers,' the AspenCore analyst team compiled a list of 10 domestic AI chips and 10 international AI chips to showcase the latest technological advancements in global AI chips.
The Top 10 domestic AI chips come from the following manufacturers: Cambricon, Horizon Robotics, Kunlunxin Technology, Alibaba T-Head, Enflame Technology, Hanbo Semiconductor, Iluvatar CoreX, Corerain Technology, Black Sesame Technologies, and SiEngine Technology.
The Top 10 international AI chips come from the following manufacturers: NVIDIA, Intel, Google, AWS, Qualcomm, Esperanto, Graphcore, Cerebras, Ambarella, and Hailo.
During the research and compilation of the 'AI Chip' report, significant support was provided by Arm China, a leader in processor IP; Hejian Industrial Software, a domestic EDA company; and SemiDrive, a leading domestic AI chip design company. We express our deepest gratitude for their contributions.
Top 10 Domestic AI Chips
Cambricon Third-Generation Cloud AI Chip Siyuan 370
The Siyuan 370 is based on a 7nm process, integrates 39 billion transistors, and utilizes chiplet technology, achieving a maximum computing power of 256 TOPS (INT8), twice that of the Siyuan 270. Built on Cambricon's latest intelligent chip architecture, MLUarch03, the Siyuan 370 combines AI training and inference and performs strongly in practical tests. For example, on ResNet-50, the MLU370-S4 accelerator card (half-height, half-length) delivers twice the performance of mainstream GPUs of the same size, while the MLU370-X4 accelerator card (full-height, full-length) matches the performance of mainstream GPUs of the same size with a significant lead in energy efficiency.
The Siyuan 370 integrates two AI computing chiplets (MLU-Die) within a single chip, each equipped with independent AI computing units, memory, I/O, and MLU-Fabric control interfaces. The MLU-Fabric ensures high-speed communication between the two MLU-Dies, while different MLU-Die configurations enable diverse product offerings, providing users with cost-effective AI chips suitable for various application scenarios.
Horizon Journey 5 Vehicle Intelligence Computing Platform
Journey 5 represents Horizon Robotics' third-generation automotive-grade AI chip, manufactured using TSMC's 16nm FinFET process. Developed following ISO 26262 functional safety certification procedures, it has achieved ASIL-B certification. Based on the latest dual-core BPU Bayesian architecture, Journey 5 pairs an eight-core Arm Cortex-A55 CPU cluster with up to 128 TOPS of equivalent AI computing power. It includes a CV engine, a dual-core DSP, a dual-core ISP, and strong codec support, and can handle multiple 4K and full-HD video inputs. A dual-core lockstep MCU achieves the ASIL-B(D) functional safety level, and the chip meets AEC-Q100 Grade 2 automotive standards.
Designed for advanced autonomous driving and smart cockpit applications, the chip offers rich external interfaces supporting over 16 channels of HD video input with dual-channel "instant" image processing. Leveraging BPU, DSP, and CPU resources, it not only accelerates advanced image perception algorithms but also supports multi-sensor fusion including LiDAR and millimeter-wave radar. Equipped with PCIe 3.0 high-speed interfaces and dual gigabit real-time Ethernet (TSN), it provides hardware-level support for multi-sensor synchronization (PTP), while supporting predictive planning and H.265/JPEG real-time encoding/decoding.
Kunlun Core 2
The second-generation general-purpose AI cloud chip, launched by Kunlunxin Technology (formerly Baidu's Intelligent Chip and Architecture Department), adopts 7nm process technology and is based on the new self-developed XPU-R architecture. Its computing power reaches 256 TOPS @ INT8 and 128 TFLOPS @ XFP16/FP16, with a maximum power consumption of 120W. It supports GDDR6 high-performance memory, integrates an Arm CPU, and supports encoding/decoding, chip interconnection, security, and virtualization.
In terms of hardware design, this chip is the first general-purpose AI chip to adopt GDDR6 memory. On the software side, Kunlun Core 2 significantly upgrades its compilation engine and development kit, supporting C and C++ programming. Additionally, Kunlun Core 2 has completed end-to-end adaptation with multiple domestic general-purpose processors such as Feiteng (Phytium), domestic operating systems such as Kylin, and Baidu's self-developed PaddlePaddle deep learning framework, offering full-stack domestic AI capability with integrated hardware and software. The chip suits cloud, edge, and end-device scenarios and can be applied in fields such as internet core algorithms, smart cities, and smart industries, as well as broader applications like high-performance computing clusters, bio-computing, intelligent transportation, and autonomous driving.
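Because Kunlun Core 2 is adapted to PaddlePaddle, targeting it from user code is mostly a device-selection step. The sketch below is a minimal illustration, assuming a PaddlePaddle build with Kunlun XPU support (the "xpu" device string is PaddlePaddle's documented convention); the toy model merely stands in for a real workload.

```python
# Minimal sketch: run a PaddlePaddle model on a Kunlun XPU.
# Assumes a PaddlePaddle build compiled with XPU support; falls
# back to CPU otherwise so the script still runs anywhere.
import paddle
import paddle.nn as nn

device = "xpu" if paddle.is_compiled_with_xpu() else "cpu"
paddle.set_device(device)

# A toy network standing in for a real workload.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

x = paddle.randn([8, 128])          # batch of 8 dummy feature vectors
with paddle.no_grad():
    logits = model(x)               # executes on the selected device
print(device, logits.shape)
```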
Alibaba T-Head's Hanguang 800
T-Head released the Hanguang 800 data center AI inference chip in 2019. Based on a 12nm process, it integrates 17 billion transistors and delivers a peak computing power of 820 TOPS. In the industry-standard ResNet-50 test, its inference performance reaches 78,563 IPS with an energy efficiency ratio of 500 IPS/W.
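Those two numbers pin down the board power arithmetically: dividing throughput by the energy-efficiency ratio gives the implied power draw, a quick sanity check on the quoted figures.

```python
# Back-of-envelope check of the Hanguang 800 figures quoted above:
# throughput (IPS) divided by efficiency (IPS/W) gives power.
throughput_ips = 78_563        # ResNet-50 inferences per second
efficiency_ips_per_w = 500     # quoted energy-efficiency ratio

power_w = throughput_ips / efficiency_ips_per_w
print(f"Implied power draw: {power_w:.0f} W")   # ~157 W
```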
The Hanguang 800 adopts T-Head's self-developed architecture, achieving performance breakthroughs through software-hardware co-design. T-Head's independently developed AI chip software development kit enables the Hanguang 800 chip to deliver high throughput and low latency for deep learning applications. The Hanguang 800 has been successfully applied in scenarios such as data centers and edge servers.
Enflame's 'Suisi' 2.5 Cloud AI Inference Chip
The Suisi 2.5 AI inference chip, based on the second-generation GCU-CARA architecture, serves as the computing core of the Yunsui i20 high-performance inference card. Housed in a large 55mm × 55mm package, it provides AI computing power across the full precision range from single-precision floating point down to INT8. Using an HBM2E memory solution, it offers 819GB/s of memory bandwidth, and hardware-based power monitoring and optimization yield a 3.5X improvement in energy efficiency. The chip supports model inference for applications including vision, speech, NLP, search, and recommendation.
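The 819GB/s figure is consistent with a common HBM2E layout. The configuration below (two 1,024-bit stacks at 3.2Gbps per pin) is an assumption chosen to reproduce the quoted number, not a detail Enflame discloses here.

```python
# Sanity check of the quoted 819 GB/s HBM2E bandwidth. The 2-stack,
# 3.2 Gbps/pin configuration is an assumption chosen to match the
# number; the actual layout may differ.
stacks = 2
bus_width_bits = stacks * 1024      # each HBM2E stack has a 1,024-bit bus
pin_rate_gbps = 3.2                 # data rate per pin

bandwidth_gb_s = bus_width_bits * pin_rate_gbps / 8
print(f"{bandwidth_gb_s:.1f} GB/s")  # 819.2 GB/s
```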
The new-generation Suisi AI inference chip adopts a 12nm process. Through architectural upgrades, it significantly improves transistor efficiency per unit area, achieving computing power comparable to current 7nm GPUs. The cost advantage of the mature 12nm process makes the Yunsui i20 accelerator card more cost-effective at the same performance level.
Hanbo Semiconductor AI Inference Chip SV100
The SV100 series chip (SV102) is positioned as a general-purpose AI inference chip for cloud applications. Its main features include high inference performance (a single-chip INT8 peak of 200 TOPS, with FP16/BF16 data types also supported), low latency, and video decoding of 64+ channels of 1080p (supporting H.264, H.265, and AVS2).
The SV102 chip incorporates a dedicated hardware video decoding unit, delivering video processing and deep learning inference performance metrics several times higher than existing mainstream data center GPUs. Based on Hanbo's self-developed, general-purpose architecture optimized for various deep learning inference workloads, this chip supports AI inference applications such as computer vision, video processing, natural language processing, and search recommendations. It also integrates high-density video decoding, making it widely suitable for cloud and edge solutions, reducing equipment investment and operational costs.
Iluvatar CoreX (Tianshu Zhixin) GPGPU Cloud Training Chip BI
Iluvatar CoreX's GPGPU cloud training chip BI, based on a fully self-developed GPGPU architecture, utilizes TSMC's 7nm process with 2.5D CoWoS wafer-level packaging and integrates 24 billion transistors. The chip supports mixed-precision training across FP32, FP16, BF16, and INT8, as well as chip-to-chip interconnect, achieving single-chip computing power of 147 TFLOPS @ FP16.
With its rich self-developed instruction set, this chip supports scalar, vector, and tensor operations. Its programmable and configurable characteristics efficiently underpin various high-performance computing tasks. This GPGPU chip emphasizes high performance, versatility, and flexibility, providing computational power that matches the rapid development of artificial intelligence and related vertical application industries. It also addresses industry pain points such as product usability challenges and high development platform migration costs through standardized software and hardware ecosystems.
Corerain (Kunyun) Technology's Dataflow AI Chip CAISA
The CAISA chip adopts Corerain's self-developed custom dataflow architecture, CAISA 3.0, which significantly improves efficiency and delivered performance over the previous generation. CAISA 3.0 offers four times the parallelism through multi-engine support, greatly enhancing architectural scalability. Within the chip, each CAISA engine can process AI workloads simultaneously, yielding a sixfold increase in peak computing power while maintaining chip utilization as high as 95.4%. The chip also supports a broader set of operators, enabling rapid deployment of most neural network models for detection, classification, and semantic segmentation tasks.
Corerain has achieved a breakthrough in delivered computing power through its self-developed dataflow technology, with chip utilization exceeding 95%, up to 11.6 times higher than comparable products. This customized dataflow technology does not rely on advanced process nodes or larger die areas; instead, it improves delivered performance by controlling the computation sequence through data flow, offering users more cost-effective computing power.
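The practical meaning of utilization is easy to make concrete: delivered throughput is peak throughput multiplied by utilization. The 100-TOPS peak below is an illustrative stand-in, not a Corerain specification.

```python
# Delivered compute = peak compute x utilization. The peak figure
# here is illustrative only, not a Corerain specification.
peak_tops = 100.0
corerain_util = 0.954                  # quoted utilization
rival_util = corerain_util / 11.6      # "11.6x higher" implies ~8.2%

print(f"dataflow chip: {peak_tops * corerain_util:.1f} effective TOPS")
print(f"comparable product at {rival_util:.1%} utilization: "
      f"{peak_tops * rival_util:.1f} effective TOPS")
```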
Black Sesame Technologies' Autonomous Driving Chip Huashan II A1000 Pro
Black Sesame's A1000 Pro is built on a 16nm process, delivering 106 TOPS of INT8 and 196 TOPS of INT4 computing power at a typical power consumption of 25W, and it meets ISO 26262 ASIL-D functional safety requirements. The A1000 Pro integrates a high-performance GPU supporting high-definition 360-degree 3D surround-view rendering. Internally it can be configured with different data paths and computing mechanisms, deploying dual redundant systems and safety-island verification within the chip.
Based on a single, dual, or quad A1000 Pro configuration, Black Sesame's FAD full autonomous driving platform can meet the computing power requirements for L3/L4 autonomous driving functions, supporting scenarios from parking to urban roads and highways.
SiEngine Technology's "Dragon Eagle One" Intelligent Cockpit Chip
SiEngine (Xinqing) Technology's 7nm automotive-grade smart cockpit multimedia chip, "Dragon Eagle One," is manufactured by TSMC. It enables "one chip, multiple screens, multiple systems" smart cockpit designs, combining functions such as voice recognition, gesture control, LCD instrument clusters, HUD, DMS, and ADAS fusion to give drivers a more intuitive and personalized interactive experience.
"Dragon Eagle One" features 8 CPU cores, 14-core GPU, and 8 TOPS INT 8 programmable convolutional neural network engine. The chip meets AEC-Q100 Grade 3 standards and adopts an ASIL-D compliant safety island design, with an independent Security Island for information security. It provides high-performance encryption and decryption engines, supporting national cryptographic algorithms such as SM2, SM3, and SM4, as well as secure boot, secure debugging, and secure OTA updates. The powerful CPU, GPU, VPU, ISP, DPU, NPU, and DSP heterogeneous computing engines, along with high-bandwidth, low-latency LPDDR5 memory channels and high-speed, large-capacity external storage, offer comprehensive computing support for smart cockpit applications.
Top 10 International AI Chips
NVIDIA A100 Tensor Core GPU
The NVIDIA A100 Tensor Core GPU is built on the NVIDIA Ampere architecture and comes in 40GB and 80GB configurations. As the engine of the NVIDIA data center platform, the A100 delivers up to 20x performance improvement over the previous generation and can be partitioned into seven GPU instances for dynamic adjustment based on changing demands. The A100 provides exceptional acceleration for AI, data analytics, and HPC applications across various scales, effectively powering high-performance elastic data centers.
For deep learning training, the A100's Tensor Cores, using Tensor Float 32 (TF32) precision, offer up to 20x higher performance than the previous-generation NVIDIA Volta without requiring any code changes; with automatic mixed precision and FP16, performance can be doubled again. A cluster of 2,048 A100 GPUs can process large-scale training workloads such as BERT in about a minute. For ultra-large models with massive data tables (e.g., DLRM for recommendation systems), the A100 80GB provides up to 1.3 TB of unified memory per node, delivering up to 3x the speed of the A100 40GB.
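A minimal PyTorch sketch of the two features just mentioned, TF32 and FP16 automatic mixed precision, is shown below. It assumes an Ampere-class GPU and uses only standard torch APIs; TF32 is enabled by default on Ampere, so the flags simply make the choice explicit.

```python
# Minimal sketch: TF32 plus automatic mixed precision (AMP) in
# PyTorch on an Ampere-class GPU such as the A100.
import torch

torch.backends.cuda.matmul.allow_tf32 = True   # TF32 for matmuls
torch.backends.cudnn.allow_tf32 = True         # TF32 inside cuDNN

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()           # loss scaling for FP16

x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

with torch.cuda.amp.autocast():                # mixed-precision region
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()                  # scaled FP16 gradients
scaler.step(optimizer)
scaler.update()
```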
For deep learning inference, the A100 accelerates across the entire precision range from FP32 to INT4. Multi-Instance GPU (MIG) technology allows multiple networks to run simultaneously on a single A100, optimizing computational resource utilization. On top of other inference performance gains, structured sparsity support alone can deliver up to 2x performance improvement.
Intel Neuromorphic Chip Loihi 2
Intel's second-generation neuromorphic chip, Loihi 2, measures 31mm² and can pack up to 1 million artificial neurons, compared to the previous generation's 60mm² die supporting 131,000 neurons. Loihi 2 runs up to 10 times faster than its predecessor, with a 15-fold increase in resource density and higher energy efficiency. It features 128 neuromorphic cores, each now supporting 8 times more neurons and synapses than the first generation, and its neurons are interconnected through 120 million synapses.
Loihi 2 utilizes a more advanced manufacturing process—Intel's first EUV process node, Intel 4—now requiring only half the space per core. Additionally, Loihi 2 enables inter-chip communication not just through a 2D connection grid but also in three dimensions, significantly increasing the total number of neurons it can handle. The number of embedded processors per chip has increased from three to six, and the number of neurons per chip has grown eightfold.
The Loihi 2 neuromorphic chip uses spiking neural networks (SNNs), which can solve many problems very efficiently. The current challenge is that this very different programming paradigm requires an equally different approach to algorithm development, and most experts in the field come from theoretical neurobiology, so Loihi 2's research focus limits its market reach. Intel pairs Loihi 2 with the Lava open-source software framework, hoping Loihi derivatives will eventually appear in broader systems, from coprocessors in embedded devices to large-scale Loihi clusters in data centers.
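To give a flavor of the event-driven model SNNs use, here is a toy leaky integrate-and-fire neuron in plain Python. It is a pedagogical sketch only, not Lava or Loihi code.

```python
# Toy leaky integrate-and-fire (LIF) neuron, the basic unit of the
# spiking networks Loihi 2 runs. Pedagogical sketch, not Lava code.
import numpy as np

def lif(inputs, decay=0.9, threshold=1.0):
    """Return the 0/1 spike train produced by a single LIF neuron."""
    v, spikes = 0.0, []
    for current in inputs:
        v = decay * v + current        # leak, then integrate input
        if v >= threshold:             # fire when the threshold is hit...
            spikes.append(1)
            v = 0.0                    # ...and reset the membrane
        else:
            spikes.append(0)
    return spikes

rng = np.random.default_rng(0)
print(lif(rng.uniform(0.0, 0.4, size=20)))   # sparse spike train
```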
Google TPU v4
Google's fourth-generation AI chip, TPU v4, is 2.7 times faster than TPU v3. By integrating 4,096 TPU v4 chips into a TPU v4 Pod, it can achieve exaflop-level computing power, equivalent to the combined performance of 10 million laptops and twice that of the world's top supercomputer, Fugaku. In addition to using these systems for its own AI applications (such as search suggestions, language translation, or voice assistants), Google also offers TPU infrastructure as a paid cloud service to Google Cloud users.
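The pod-level claim can be sanity-checked with simple arithmetic. The per-chip peak used below (~275 BF16 TFLOPS) is the figure Google has published for TPU v4 and is an assumption here, since the article itself does not quote it.

```python
# Rough check of the "exaflop-level pod" claim. The ~275 TFLOPS
# per-chip peak (BF16) is Google's published TPU v4 figure and is
# an assumption here, not a number from this article.
chips_per_pod = 4096
tflops_per_chip = 275

pod_exaflops = chips_per_pod * tflops_per_chip / 1e6   # TFLOPS -> EFLOPS
print(f"~{pod_exaflops:.2f} EFLOPS per TPU v4 Pod")    # ~1.13 EFLOPS
```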
The fourth-generation TPU provides more than double the matrix-multiplication TFLOPS of TPU v3 and significantly higher memory bandwidth. A TPU v4 pod performs 10 times better than a TPU v3 pod and will run primarily on carbon-free energy, making it not only faster but also more energy-efficient.
AWS Trainium Cloud Training Chip
AWS's second custom machine learning chip, AWS Trainium, is optimized specifically for deep learning training workloads, including image classification, semantic search, translation, speech recognition, natural language processing, and recommendation engines, and it supports frameworks such as TensorFlow, PyTorch, and MXNet. Compared to standard AWS GPU instances, the EC2 Trn1 instances based on this chip offer 30% higher throughput at 45% lower cost.
AWS Trainium shares the AWS Neuron SDK with AWS Inferentia, making it easy for developers already using Inferentia to get started with Trainium. AWS Trainium will be available through Amazon EC2 instances, AWS Deep Learning AMIs, and managed services including Amazon SageMaker, Amazon ECS, Amazon EKS, and AWS Batch.
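Because both chips sit behind the Neuron SDK, the developer-facing flow is an ahead-of-time compile step. The sketch below follows the torch-neuron tracing flow AWS documents for Inferentia; Trainium training instead goes through the related torch-neuronx/XLA path, so treat this as an illustration of the shared SDK idea rather than a Trainium recipe.

```python
# Minimal sketch of the Neuron SDK flow: trace (compile) a PyTorch
# model ahead of time for a Neuron accelerator. Requires the Neuron
# SDK packages on an AWS Neuron-capable instance.
import torch
import torch_neuron  # registers the neuron backend with PyTorch

model = torch.nn.Sequential(
    torch.nn.Linear(224, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
).eval()

example = torch.randn(1, 224)
# Compile the model for the Neuron accelerator.
model_neuron = torch.neuron.trace(model, example_inputs=[example])
model_neuron.save("model_neuron.pt")   # deployable compiled artifact
```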
Qualcomm Cloud AI 100
The Qualcomm Cloud AI 100 inference chip is manufactured on a 7nm process and features 16 AI cores delivering 400 TOPS of INT8 inference throughput. It incorporates four 64-bit LPDDR4X-4200 (2100MHz) memory controllers, each managing four 16-bit channels, for a total system bandwidth of 134 GB/s.
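The 134 GB/s figure follows directly from that memory configuration: four controllers, each 64 bits wide, at 4200 MT/s.

```python
# Reconstructing the quoted system bandwidth from the controller
# layout: 4 controllers x 64 bits x 4200 MT/s.
controllers = 4
bus_width_bits = 64          # four 16-bit channels per controller
transfer_rate_mt_s = 4200    # LPDDR4X-4200 (2100MHz, double data rate)

bandwidth_gb_s = controllers * bus_width_bits * transfer_rate_mt_s / 8 / 1000
print(f"{bandwidth_gb_s:.1f} GB/s")   # 134.4 GB/s
```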
Qualcomm offers three packaging options for commercial deployment: a mature PCIe 4.0 x8 half-height half-length (HHHL) card and the DM.2 and DM.2e form factors, with power envelopes of 75W, 25W, and 15W respectively.
Esperanto ET-SoC-1
Esperanto's RISC-V-based ET-SoC-1 chip integrates more than 1,000 cores and is designed specifically for AI inference in data centers. The chip is manufactured on TSMC's 7nm process, features 160MB of on-chip SRAM, and contains 24 billion transistors.
The ET-SoC-1's cores comprise 1,088 ET-Minion units and 4 ET-Maxion units. The ET-Minion is a general-purpose 64-bit in-order core with proprietary machine learning extensions, including vector and tensor operations on up to 256 bits of floating-point data per clock cycle. The ET-Maxion is the company's proprietary high-performance 64-bit single-threaded core, featuring four-wide out-of-order execution, branch prediction, and prefetch algorithms.
Graphcore IPU Colossus Mk2 GC200
Graphcore's second-generation IPU chip, Colossus Mk2 GC200, is also manufactured using TSMC's 7nm process. Its architecture is similar to the previous generation IPU but increases the number of cores to 1,472 (a 20% increase) and boosts on-chip SRAM to 900MB (a 3-fold increase). In terms of interconnect scalability, it offers a 16-fold enhancement compared to its predecessor.
The IPU-M2000 system, which includes four Mk2 chips, can scale up to 1,024 IPU-PODs (512 racks, with a maximum of 64,000 Mk2 chips in a cluster), delivering up to 16 ExaFLOPs of FP16 computing power. The M2000 incorporates a Gateway chip providing access to DRAM, 100Gbps IPU-Fabric links, PCIe interfaces for SmartNIC connections, a 1GbE OpenBMC management interface, and M.2 interfaces. In neural network training, the M2000 is 7-9 times faster than its predecessor, with inference performance improving by more than 8 times.
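Working backwards from the cluster numbers gives the per-chip FP16 peak, a useful cross-check of the 16-ExaFLOPs claim.

```python
# Deriving the per-chip peak from the cluster-level claim:
# 16 FP16 ExaFLOPs across 64,000 Mk2 chips.
cluster_exaflops = 16
chips = 64_000

per_chip_tflops = cluster_exaflops * 1e6 / chips   # EFLOPS -> TFLOPS
print(f"{per_chip_tflops:.0f} TFLOPS FP16 per GC200")  # 250 TFLOPS
```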
Cerebras WSE-2
Cerebras has designed and manufactured the largest chip ever built, the Wafer Scale Engine (WSE). The second-generation WSE-2, produced on TSMC's N7 process, covers an area of 46,225mm² and contains 2.6 trillion transistors, with 850,000 cores fully optimized for deep learning. More than 56 times larger than NVIDIA's A100 GPU, the WSE-2 features 40GB of on-chip memory, memory bandwidth of up to 20 PB/s, and fabric bandwidth of up to 220 Pb/s.
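The "56 times larger" comparison checks out against published die areas. The A100 die size used below (~826mm²) is public NVIDIA data, assumed here since the article does not state it.

```python
# Checking the "56 times larger" claim against die areas. The A100
# figure is NVIDIA's published die size, assumed here.
wse2_mm2 = 46_225
a100_mm2 = 826

print(f"WSE-2 is {wse2_mm2 / a100_mm2:.1f}x the A100 die area")  # ~56x
```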
The CS-2 AI acceleration system built on WSE-2 significantly increases memory and fabric bandwidth while keeping system power consumption unchanged (23 kW). A single CS-2's computational performance is equivalent to dozens or even hundreds of GPUs, cutting the time needed for the most complex AI workloads from months to minutes.
Ambarella CV52S
Based on Ambarella's CVflow architecture and an advanced 5nm process, the CV52S single SoC delivers ultra-low power consumption while supporting 4K encoding and powerful AI processing. The chip features dual 1.6GHz Arm Cortex-A76 CPU cores with 1MB of L3 cache. Its advanced ISP provides excellent wide dynamic range, low-light performance, fisheye correction, and rotation handling, and built-in privacy masking can block out specific parts of the camera scene. Newly added PCIe and USB 3.2 interfaces enable more complex multi-chip security system designs, while hardware-level digital security technologies, including secure boot, OTP, and Arm TrustZone, protect surveillance camera devices. The chip supports multi-stream video input from up to 14 cameras via MIPI virtual channel interfaces and is compatible with LPDDR4x/LPDDR5/LPDDR5x DRAM.
Compared to the previous generation, the CV52S series chips designed for monocular security cameras support 4K60fps video recording while delivering 4x improvement in AI computer vision performance, 2x faster CPU performance, and over 50% increased memory bandwidth. The enhanced neural network (NN) performance enables more types of AI processing to be executed on edge devices without requiring cloud uploads.
Hailo Edge AI Processor Hailo-8
Israeli AI chip company Hailo's edge AI processor, the Hailo-8, delivers 26 tera-operations per second (TOPS) with a high energy efficiency of 2.8 TOPS/W. According to the company, the Hailo-8 outperforms hardware such as NVIDIA's Xavier AGX, Intel's Myriad X, and Google's Edge TPU module in several AI semantic segmentation and object detection benchmarks.
The Hailo-8-based M.2 module is an accelerator module dedicated to AI applications, providing up to 26 TOPS of computing power for scenarios such as edge computing, machine learning, and inference-driven decision making. The module exposes a full PCIe Gen 3.0 x4 interface and can be plugged into existing edge devices with an M.2 socket to run deep neural network inference in real time at low power, serving a wide range of market segments.