Open Accelerator Specification AI Server Design Guide Released to Address Surging Computing Demands of Generative AI
On August 10, at the 2023 Open Compute Project China Summit (OCP China Day 2023), the Open Accelerator Specification AI Server Design Guide (hereinafter, the Guide) was released. Targeting generative AI application scenarios, the Guide develops and refines the design theories and methodologies for AI servers built to open accelerator specifications. It is intended to help community members develop standards-compliant AI accelerator cards efficiently and to significantly shorten server adaptation cycles, so that users can obtain AI computing solutions matched to their applications and the industry can capture the opportunities created by the generative AI boom.
Generative AI technology is advancing rapidly and driving a new wave of AI innovation. Large AI models serve as the foundational platform for generative AI and offer significant potential to raise productivity and transform traditional industries, but training them efficiently typically requires AI server clusters powered by thousands of high-performance AI chips. With the accelerated adoption of generative AI, industry demand for AI servers equipped with such chips has surged. Over a hundred companies worldwide are now developing new AI accelerator chips, a clear sign that AI computing solutions are diversifying. Without unified industry standards, however, vendors' AI accelerator chips vary significantly, and each chip type requires a customized hardware platform, which raises development costs and lengthens development cycles.
OCP (the Open Compute Project) is the largest and most influential open-source organization in foundational hardware technology. In 2019, OCP established the Open Accelerator Infrastructure (OAI) group to define standardized AI accelerator card specifications optimized for large-scale deep learning training, addressing fragmentation in form factors and interfaces. Later in 2019, OCP released the OAI-UBB (Universal Baseboard) 1.0 design specification, followed by open accelerator hardware platforms built on this standard that can host OAM (Open Accelerator Module) products from different vendors without hardware modifications. In recent years, system manufacturers such as Inspur have developed multiple AI servers compliant with open accelerator standards, bringing the specifications to industrial-scale deployment.
Building on product development and engineering experience with open accelerator computing, the Guide further refines the design theories and methodologies for AI servers. It introduces four core design principles and a full-stack design approach, covering reference hardware designs, management interface specifications, and performance testing criteria. These guidelines aim to help community members develop AI accelerator cards and adapt them to open-standard servers more efficiently, addressing the computational challenges of generative AI.
The Guide sets out four design principles for open accelerator-compliant AI servers: application orientation, diversity and openness, energy efficiency, and holistic design. To improve deployment efficiency, system stability, and usability, it advocates multi-dimensional co-design, comprehensive system testing, and performance evaluation and optimization.
Multi-dimensional co-design requires early-stage collaboration between system and chip manufacturers to minimize custom development. A large-model computing system is typically a high-density cluster that spans compute, storage, networking, software frameworks, models, and infrastructure (racks, power delivery, and air or liquid cooling). Only by coordinating across these dimensions can a system reach its optimum in performance, energy efficiency, or TCO (Total Cost of Ownership), improving system adaptation and cluster deployment efficiency. The Guide provides full-stack reference designs from the node level to the cluster level.
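To make the co-design point concrete, the back-of-the-envelope sketch below estimates cluster-level training time from a handful of node-level parameters. Every number in it (model size, token count, per-accelerator peak FLOPS, utilization, cluster size) is an illustrative assumption rather than a figure from the Guide; what it shows is how directly node-level utilization, which interconnect, power, and cooling choices determine, drives cluster-level outcomes.

# Back-of-the-envelope cluster sizing. All numbers are illustrative
# assumptions, not figures from the Guide.

PARAMS = 70e9          # model parameters (assumed 70B-class model)
TOKENS = 2e12          # training tokens (assumed)
PEAK_FLOPS = 300e12    # peak dense FLOPS per accelerator (assumed)
MFU = 0.4              # assumed model FLOPs utilization after co-design
ACCELS_PER_NODE = 8    # a UBB baseboard hosts eight OAM modules
NUM_NODES = 128        # assumed cluster size

# Common approximation: training cost is about 6 * parameters * tokens FLOPs.
total_flops = 6 * PARAMS * TOKENS

sustained = PEAK_FLOPS * MFU * ACCELS_PER_NODE * NUM_NODES
days = total_flops / sustained / 86400
print(f"Estimated training time: {days:.1f} days on {NUM_NODES} nodes")

# A drop to 30% utilization from weak interconnect or thermal design
# stretches the same run accordingly.
days_low = total_flops / (PEAK_FLOPS * 0.3 * ACCELS_PER_NODE * NUM_NODES) / 86400
print(f"At 30% utilization the same run takes {days_low:.1f} days")

With these assumed numbers the run takes roughly 79 days at 40% utilization and over 105 days at 30%, which is why the Guide treats co-design across all of these dimensions as a first-order concern rather than an afterthought.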
Comprehensive system testing addresses the typically higher failure rates of heterogeneous accelerator nodes. Rigorous testing is essential to minimize risks during production, deployment, and operation, ensuring system stability and reducing training interruptions. The Guide details testing protocols for structural integrity, thermal management, stress, stability, and software compatibility.
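As a hypothetical illustration of what a stress-and-stability check can look like in practice, the Python sketch below sustains a heavy compute load while polling accelerator telemetry and failing fast if thermals drift out of spec. It is not a protocol from the Guide: it assumes an NVIDIA GPU with PyTorch installed and uses nvidia-smi as the telemetry source, where another accelerator would substitute its vendor's equivalent tool, and the duration and thermal limit are placeholder values.

import subprocess
import time

import torch

def read_telemetry():
    # Query temperature (C) and power draw (W) for GPU 0 via nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", "--id=0",
         "--query-gpu=temperature.gpu,power.draw",
         "--format=csv,noheader,nounits"],
        text=True)
    temp, power = out.strip().split(", ")
    return float(temp), float(power)

def stress_test(minutes=10, temp_limit_c=90.0):  # assumed placeholder limits
    # Sustain large bf16 matrix multiplications, checking telemetry
    # between batches of work.
    a = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
    deadline = time.time() + minutes * 60
    while time.time() < deadline:
        for _ in range(50):       # batch of GEMMs between telemetry polls
            _ = a @ b
        torch.cuda.synchronize()  # wait for the batch before reading sensors
        temp, power = read_telemetry()
        print(f"temp={temp:.0f}C power={power:.0f}W")
        assert temp < temp_limit_c, "thermal limit exceeded; failing the run"

if __name__ == "__main__":
    stress_test()

A production burn-in would run far longer, cover every accelerator and fault domain in the node, and log rather than assert, but the structure, sustained load plus continuous telemetry against pass/fail limits, is the same.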
Performance evaluation and tuning calls for multi-level performance assessment and deep software-hardware optimization of large-model accelerated computing systems. The Guide specifies key test points and metrics for basic performance, interconnect performance, and model performance, and highlights optimization priorities for large-model training and inference, ensuring that open accelerator-compliant AI servers can effectively support innovative applications of today's mainstream large models.
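A minimal example of the kind of basic-performance probe such an assessment might start from, again a sketch rather than the Guide's prescribed benchmark suite: time a series of large bf16 matrix multiplications and compare achieved throughput against an assumed datasheet peak (peak_tflops and the problem sizes below are placeholders).

import torch

def gemm_tflops(n=8192, iters=100, peak_tflops=300.0):
    # Measure achieved bf16 GEMM throughput on one accelerator.
    a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    for _ in range(10):          # warm-up excludes one-time kernel selection
        _ = a @ b
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        _ = a @ b
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1e3       # elapsed_time returns ms
    achieved = 2 * n**3 * iters / seconds / 1e12  # an n x n GEMM is ~2*n^3 FLOPs
    print(f"achieved {achieved:.1f} TFLOPS "
          f"({100 * achieved / peak_tflops:.0f}% of assumed peak)")

if __name__ == "__main__":
    gemm_tflops()

Interconnect and model-level tests extend the same pattern: measure achieved bandwidth or end-to-end throughput under a representative workload and compare it against an expected bound.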