Making Large Models Accessible: Yanxi AI Development Computing Platform Launched
In the year when large models became a sensation, the most critical bottleneck in the generative AI industry emerged on the computing power side. On September 20, Sequoia Capital mentioned in its article 'Generative AI's Act Two' that many generative AI companies quickly realized over the past year that their development bottleneck was not customer demand but GPU shortages. Long GPU wait times have become the norm, giving rise to a simple business model: paying a subscription fee to skip the queue and gain access to better models.
In the training of large models, the exponential increase in parameter scale has led to a sharp rise in training costs. For scarce GPU resources, maximizing hardware performance and improving training efficiency have become even more critical.
An AI development computing platform is a key solution. Using such a platform, a large model developer can complete the entire AI development process—from data preparation and model development to model training and deployment—in one place. Beyond lowering the barrier to large model development, AI computing platforms enhance resource efficiency by offering training optimization and inference management services.
On September 26, JD Cloud introduced the Yanxi AI Development Computing Platform at the Xi'an City Conference. Using the platform, the entire workflow from data preparation and model training to deployment can be completed in less than a week, and tasks that previously required a team of more than 10 scientists can now be handled by just one or two algorithm engineers. With the platform's model acceleration tools, teams can cut inference costs by 90%.
More importantly, as large models rapidly integrate into various industries, the Yanxi AI Development Computing Platform empowers both large model algorithm developers and application developers. For the latter, it enables low-code development of large model products. This makes industrial large model development more accessible, simplifying the utilization and adaptation of large models.
The Era of Large Models Demands New Digital Infrastructure
For a large model developer, the absence of an AI development computing platform means having to independently set up systems for GPU resource scheduling, storage networks, model management, and more during algorithm and application development. This makes the entire development process highly primitive and raises the barrier to entry significantly.
For companies implementing large model applications internally, this leads to rapidly escalating costs and difficulties in ensuring training efficiency.
Over the past year, industries such as finance, marketing, automotive, content creation, legal, and office automation have actively integrated large models. The powerful momentum of large models has become a key factor reshaping competitive landscapes across many sectors. The ability to quickly identify scenarios where business operations can leverage large models and efficiently execute implementations has become crucial for competitiveness.
However, developing industry-specific models is not without challenges. A series of obstacles and opportunities remain:
- Data Challenges: Different industries exhibit varying levels of data concentration and dispersion, with unique cycles and difficulties in data preparation. Efficiently loading massive multimodal data during training is a critical issue that must be resolved.
- Training Stability: The stability of the training environment, fault recovery, and resumption of interrupted training significantly impact efficiency. Additionally, optimizing computing resource scheduling during training and deployment to improve utilization is a key cost consideration for enterprises.
At the Xi'an City Conference, JD Cloud shared insights from its recent practices, emphasizing that the challenges of industrial large models are not just about the technology itself. The real challenges lie in integrating the technology with industry application scenarios and balancing cost, efficiency, and user experience.
Back at the level of fundamental development work, balancing cost, efficiency, and experience means re-examining and optimizing a series of engineering issues from scratch.
Gong Yicheng, head of JD Cloud's IaaS Product R&D Department, explained in an interview that the requirements for development infrastructure in the era of large models have significantly diverged from traditional needs. In terms of efficiency, while relatively low-cost GPUs could handle many tasks in past AI development, large model scenarios now heavily rely on high-cost GPUs like the A100 and A800, demanding greater computational power and performance, which rapidly escalates costs.
"Therefore, under high costs, maximizing the performance of these hardware becomes particularly crucial for the cost efficiency of large model development," Gong noted.
In traditional AI development, data-throughput concurrency was far lower than in large model training, which requires many GPUs reading data simultaneously. Even with relatively small data volumes, this concurrent reading, and the latency it can introduce, places new demands on high-performance storage that conventional storage mechanisms often fail to meet.
Gong also mentioned that lower latency in data access significantly boosts overall model efficiency. Utilizing self-developed smart chips with low-latency networks can enhance the efficiency of model training.
Additionally, at the scale level, training large models with hundreds of billions of parameters typically requires thousands of GPUs. Gong pointed out that this was extremely rare in past AI development, presenting entirely new and high demands on development experience and infrastructure.
For companies looking to enhance large model development efficiency and facilitate better industry adoption, a new infrastructure has become essential.
JD Cloud Launches Yanxi AI Computing Platform
On September 26, JD officially released the Yanxi AI Development Computing Platform at the Xi'an City Conference. The platform covers the full lifecycle of AI development, including data preparation, model development, training, and deployment. It comes pre-installed with mainstream open-source and commercial large models, as well as over 100 inference tools and frameworks, significantly lowering the barriers and costs associated with large model development.
In terms of performance improvements, the Yanxi AI Development Computing Platform has achieved several technological breakthroughs in computing power and storage. At the foundational level, the platform optimizes GPU resource allocation and scheduling, enhancing the efficiency of underlying resource utilization.
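To make the scheduling problem concrete, here is a toy sketch of first-fit placement of training jobs onto GPU nodes. This is our own illustration of the general idea, not JD Cloud's actual scheduler; production schedulers layer gang scheduling, topology awareness, preemption, and fairness on top of it.

```python
# Toy first-fit placement of training jobs onto GPU nodes.
# Illustrative only -- real cluster schedulers are far more sophisticated.
def first_fit(jobs, node_capacity, num_nodes):
    """jobs: GPU counts requested per job; returns node index per job (None = unplaced)."""
    free = [node_capacity] * num_nodes
    placement = []
    for need in jobs:
        # Pick the first node with enough free GPUs, if any.
        slot = next((i for i, f in enumerate(free) if f >= need), None)
        if slot is not None:
            free[slot] -= need
        placement.append(slot)
    return placement

# Two 8-GPU nodes; the 8-GPU job cannot be placed once the nodes are fragmented.
print(first_fit([4, 2, 6, 8], node_capacity=8, num_nodes=2))  # [0, 0, 1, None]
```

The unplaceable final job illustrates why fragmentation-aware scheduling matters for keeping expensive GPU fleets fully utilized.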
According to JD Cloud, the platform will feature fifth-generation cloud hosts and various high-performance product configurations, supporting computing power for up to hundreds of thousands of GPU nodes. On the network front, the platform employs a self-developed RDMA congestion algorithm to globally manage RDMA network traffic paths, enabling up to 3.2 Tbps of RDMA bandwidth between GPU nodes with latency as low as 2 microseconds.
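A back-of-envelope calculation shows why that bandwidth figure matters for distributed training. The model size and precision below are our own illustrative assumptions, not platform specifications:

```python
# Rough estimate: time to move one full set of fp16 gradients for a
# 100B-parameter model across a 3.2 Tbps link. Illustrative math only;
# real all-reduce traffic patterns and effective bandwidth differ.
params = 100e9                      # 100B parameters (assumed)
payload_bytes = params * 2          # fp16 = 2 bytes per parameter -> 200 GB
bandwidth_bytes_s = 3.2e12 / 8      # 3.2 Tbps -> 400 GB/s
transfer_s = payload_bytes / bandwidth_bytes_s
print(f"{transfer_s:.2f} s")        # 0.50 s
```

At these assumed numbers a full gradient exchange fits in about half a second, which is why interconnect bandwidth, not just raw GPU throughput, bounds how often thousands of GPUs can synchronize.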
For storage, addressing the high throughput demands of large model training, JD Cloud's Yunhai distributed storage supports massive data concurrency requirements, delivering millions of IOPS with latency in the hundreds of microseconds. Coupled with a new compute-storage separation architecture, Yunhai can save customers over 30% in infrastructure costs and is already widely used in emerging scenarios like high-performance computing and AI training, as well as traditional applications such as audio-video storage and data reporting.
In addition to optimizing underlying resources, the Yanxi AI Computing Platform helps large model developers improve efficiency across the entire pipeline, enabling streamlined data processing, model development, training, deployment, evaluation, as well as training and inference optimization, and model security.
In the data management phase, Yanxi leverages intelligent annotation models, data augmentation models, and data conversion toolkits to assist developers in data import, cleaning, labeling, and enhancement. It supports multiple file formats for data import and intelligent parsing, offering automated and semi-automated data labeling capabilities. This helps address challenges such as fragmented data storage, inconsistent data formats, varying data quality, and inefficient manual labeling.
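As an illustration of what such a cleaning stage does (a generic sketch, not JD Cloud's actual toolkit API), a minimal pass might normalize whitespace and drop empty or duplicate records before labeling:

```python
# Minimal data-cleaning pass: normalize whitespace, drop empties,
# deduplicate. A generic sketch of the concept, not a platform API.
def clean_corpus(records):
    seen, cleaned = set(), []
    for text in records:
        text = " ".join(text.split())      # collapse runs of whitespace
        if not text or text in seen:       # drop empties and duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

raw = ["  hello   world ", "hello world", "", "foo"]
print(clean_corpus(raw))  # ['hello world', 'foo']
```

Real pipelines add near-duplicate detection, quality scoring, and format conversion on top of this kind of pass.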
In the distributed training phase, the Yanxi platform is compatible with domestic hardware, supports HPC (high-performance computing), and integrates high-performance file systems. It provides resource allocation and scheduling strategies to keep hardware utilization high, along with a unified interface that simplifies training task management. This tackles the resource scarcity and waste caused by rapidly growing network and algorithm complexity, the difficulty of adapting to HPC, high-performance file systems, and heterogeneous hardware, and the steep learning curve of diverse model training scenarios.
For no-code development, the platform further simplifies the large model development process. Users can select a built-in large model, upload data, choose a training method, and either specify hyperparameters manually or let AutoML tune them, producing a fine-tuned model or application without writing code.
At the application layer, the Yanxi platform includes no-code development tools for common scenarios such as Q&A development, document analysis, and plugin development. Users can select models, knowledge bases, prompt templates, and development platforms for one-click deployment, with support for monitoring, tracking, testing, and evaluation.
Overall, the Yanxi AI development computing platform meets the needs of users with different levels of expertise. For large model algorithm developers, it provides end-to-end support from data preparation, model selection, code optimization, to deployment. For application-layer developers, it enables no-code model selection, data uploading, and parameter configuration through visual interfaces, eliminating coding requirements to initiate model training and lowering the entry barrier.
Regarding model integration, the platform currently includes commercial and open-source models such as Yanxi, Spark, and Llama 2. Gong Yicheng stated that Yanxi prioritizes quality over quantity in model selection: it chooses top-tier commercial models across technical domains, plus industry-specific models built on foundational models, to spare users from selection paralysis.
Furthermore, Yanxi will focus on introducing JD.com's industry-specific model applications based on foundational models, such as retail and healthcare scenarios, along with already scaled industry application models to help platform developers implement relevant business solutions.
Yanxi currently offers three delivery methods:
- MaaS service: developers explore and use large models through APIs with pay-as-you-go pricing.
- Public cloud SaaS version: one-stop model development, training, and deployment, leveraging public cloud resource elasticity for cost-effective entry into industrial large model projects.
- Privatized deployment version: for clients with special data security requirements, ensuring complete data localization.
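Under the MaaS pay-as-you-go model, cost scales with tokens consumed. A minimal sketch of that billing logic follows; the rate is invented for illustration and is not JD Cloud's actual pricing:

```python
# Hypothetical token-metered billing: cost grows linearly with usage.
# The rate_per_1k value is made up for this example.
def estimate_cost(prompt_tokens, completion_tokens, rate_per_1k=0.01):
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens / 1000 * rate_per_1k

print(estimate_cost(1500, 500))  # 0.02
```

The appeal of this model for developers is exactly this linearity: there is no upfront GPU commitment, so experimentation costs stay proportional to actual usage.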
In the future, Yanxi will continue upgrading its capabilities in domestic hardware support, model ecosystem collaboration, plugin development, application evaluation services, all-in-one machine delivery, and Agent development services. This systematic approach aims to address the difficulty of developing and implementing industrial large models and their applications, the high cost of training and inference, the limited accessibility of models and applications, and the challenge of adapting to high-performance computing, file systems, and heterogeneous hardware.
Promoting Large Model Adoption Across Industries
At the Xi'an City Conference, Cao Peng, Chairman of JD.com's Technology Committee and President of JD Cloud, emphasized in his speech that as large AI models gradually integrate into industries, there is a growing demand to improve industrial efficiency, create greater industrial value, and enable replication across more scenarios. This essentially raises higher requirements for model training processes and infrastructure: models need to be more user-friendly, require lower thresholds and costs, and allow flexible computing power allocation.
An AI development computing platform is one of the key solutions to address these challenges. A high-performance and user-friendly AI development computing platform can enable more industries to participate in large model development at lower costs, stimulate the emergence of more industrial-scale models, and accelerate their implementation across various sectors.
In practice, Gong Yicheng noted, industry clients weigh two key factors when selecting an AI computing platform: industry understanding and platform efficiency. Compared with other AI computing platforms, the Yanxi AI Development Computing Platform not only delivers strong performance but also draws on JD.com's extensive experience in retail, finance, logistics, healthcare, and other sectors where it leads, offering more specialized industry-specific model options.
Within the Yanxi AI Computing Platform's model ecosystem, in addition to its built-in commercial and open-source models, the platform augments these large models with capabilities such as Chinese language processing and mathematical skills, making them more accessible and professional for users.
More importantly, as the Yanxi AI Development Computing Platform also targets large model application developers, it supports no-code development of proprietary models. Beyond the foundational models mentioned above, the Yanxi platform will provide users with more proprietary models for specific application scenarios, enabling rapid deployment in their respective industries.
Currently, the proprietary models offered by the Yanxi platform mainly include mature high-frequency scenarios such as Q&A development and document analysis development. These applications have been repeatedly validated in JD.com's core business areas and can significantly improve efficiency when combined with large models.
Taking conversational tools as an example, since 2021, MINISO has partnered with JD Cloud to implement JD's Yanxi series of customer service technology products across MINISO's store customer service teams, user operations teams, and IT service operation teams. In April 2022, the Yanxi product series was gradually launched, including online customer service robots, voice response robots, outbound call robots, intelligent quality inspection, and intelligent knowledge base systems, all of which have delivered remarkable results.
Feedback data shows that the Yanxi product series currently handles nearly 10,000 daily consultation services. The online customer service robot achieves an accuracy rate of over 97% with an independent reception rate exceeding 70%, reducing service costs by 40%. The voice response robot maintains an accuracy rate of over 93%, independently handling 46.1% of customer inquiries. Intelligent quality inspection has completed hundreds of thousands of checks, identifying and addressing nearly 3,000 service risk issues while improving user satisfaction by 20%. The intelligent knowledge base covers approximately 8,800 core SKUs under the MINISO brand and about 4,600 SKUs under the TOP TOY brand.
The implementation of large models is moving from single-point applications to widespread adoption, and many companies besides MINISO have conversational robot scenarios where large models can add value. The launch of JD's Yanxi AI Development Computing Platform empowers industrial companies across the entire chain, from underlying computing power and data management to no-code applications, offering a more accessible, lower-cost, shorter-cycle path to large model industrialization. Cases like MINISO's can be expected to become increasingly common.
Additionally, JD Cloud emphasizes that compared to other competitors, the Yanxi AI computing platform further lowers the development threshold for application developers through its low-code approach. It features fully autonomous high-performance storage and a complete technical system with high compatibility and efficiency.
With the popularization of new digital infrastructure, the implementation of large models across various industries will accelerate, expanding the possibilities for balancing cost efficiency and innovation.