Understanding AI Commodity Model Training Platforms in One Article

Posted by baoshi.rao
    The original intention of AI platforms is always to improve development efficiency and accelerate algorithm iteration cycles. By productizing AI technology, it enables operational personnel to get closer to technology, better guide and empower business scenarios, and provide customers with a better technical and product experience.

    This article is the first formal article of 2020, introducing information about deep learning platforms in the field of artificial intelligence. The content includes: a basic introduction to AI platforms, system architecture, implementation challenges, and related capabilities.

Next, drawing on the author's own experience building a commodity model training platform, it shares the relevant design decisions: business scenarios, planning of the training platform's system architecture, the data and model centers, return-on-investment questions, and related takeaways.

    AI model training platforms, based on different core modules and application scenarios, can also be called deep learning platforms, machine learning platforms, or artificial intelligence platforms (hereinafter collectively referred to as AI platforms).

    AI platforms provide end-to-end, online artificial intelligence application solutions from business to product, data to model.

    Users can use different deep learning frameworks on AI platforms for large-scale training, manage and iterate datasets and models, and integrate them into specific business scenarios through APIs and local deployments.

Simply put, AI platform = AI SaaS + (PaaS) + (IaaS).

    The following are introductions to Tencent's DI-X and Alibaba's PAI platforms:

    DI-X (Data Intelligence X) is a one-stop deep learning platform based on Tencent Cloud's powerful computing capabilities. Through a visual drag-and-drop interface, it combines various data sources, components, algorithms, models, and evaluation modules, allowing algorithm engineers and data scientists to conveniently perform model training, evaluation, and prediction.

    Alibaba Cloud's Machine Learning Platform PAI (Platform of Artificial Intelligence) provides a one-stop service for traditional machine learning and deep learning, from data processing, model training, service deployment to prediction.

    Using AI platforms can simplify developers' tedious code operations for data preprocessing and management, model training and deployment, accelerate algorithm development efficiency, and improve product iteration cycles. Moreover, AI platforms can integrate computing resources, data resources, and model resources, enabling users to reuse and schedule different resources.

    Opening AI platforms can also effectively commercialize them, promoting and providing feedback on the AI business ecosystem in the enterprise's field.

Relevant AI platforms exist both in China and internationally.

    From the perspective of an enterprise's overall system architecture, AI platforms can be regarded as one of the technical support middle platforms (parallel to data middle platforms), playing a connecting role (supporting business and connecting to the technical bottom layer).

    If an enterprise already has a data middle platform, the data middle platform can serve as the data input and output system object for the AI middle platform, while the AI middle platform serves as the model and algorithm supply platform for the business front-end. If the business front-end has AI needs (such as image recognition, semantic recognition, product recommendation, etc.), the algorithm operations team supports them by training and iterating models on the AI platform.

    Depending on the enterprise's scale, resources, and business scenarios, its AI platform will have different positioning.

    For example, AI and data can be part of the same middle platform, the AI platform can be considered part of the business middle platform, or the AI platform can be integrated into the technical middle or back-end. Smaller enterprises with limited resources often choose to use third-party AI platforms to serve their business rather than building their own AI platforms.

    Enterprise Architecture Example: AI Platform as an AI Middle Platform

    Regarding the architecture design of AI platforms themselves, various third-party platforms are largely similar, mainly differing in technical architecture, and there is no need to delve deeper for now.

    Here, we take JD.com's NeuFoundry project system architecture as an example for a preliminary exploration:

    NeuFoundry Platform Architecture Diagram

    The NeuFoundry infrastructure layer uses Docker containers for pooling computing resources, manages overall resources, resource allocation, task execution, and status monitoring through Kubernetes. The platform integrates various middleware services such as MySQL, Redis, and MQ, and generates customized AI capabilities through data annotation, model training, and model release, providing strong support for business services across industries.

    1. Big Data Processing Issues

    At the current stage, the underlying technical principles of AI determine that "the more data, the better the model's capabilities." At the same time, enterprises continuously generate new data during daily business operations.

    When both data requirements and actual data volumes are large, big data management and processing capabilities are the most fundamental for an AI platform. Developers need to combine AI model training tasks to formulate reasonable data scheduling plans and manage the data lifecycle (such as periodically deleting redundant and irregular data).
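As a sketch of the lifecycle-management idea, the following toy retention check selects datasets that are neither referenced by a model nor recently used; all names and the 90-day window are illustrative assumptions, not the platform's actual policy:

```python
from datetime import datetime, timedelta

# Hypothetical retention window; a real plan would be tuned per business scenario.
RETENTION_DAYS = 90

def select_expired(datasets, now, referenced_ids):
    """Return ids of datasets eligible for deletion."""
    cutoff = now - timedelta(days=RETENTION_DAYS)
    expired = []
    for ds in datasets:
        if ds["id"] in referenced_ids:
            continue  # still referenced by a deployed or training model
        if ds["last_used"] < cutoff:
            expired.append(ds["id"])
    return expired

now = datetime(2020, 6, 1)
datasets = [
    {"id": "ds-1", "last_used": datetime(2020, 1, 1)},   # stale
    {"id": "ds-2", "last_used": datetime(2020, 5, 20)},  # recently used
    {"id": "ds-3", "last_used": datetime(2019, 12, 1)},  # stale but referenced
]
print(select_expired(datasets, now, referenced_ids={"ds-3"}))  # ['ds-1']
```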

2. Distributed Computing

    Big data processing and model training are highly resource-intensive. If business scenarios are complex, model training times are long, or sample sizes are large, exceeding the capabilities of a single server, distributed training support is needed.

    The solution for Weibo's deep learning training cluster is as follows:

    Taking TensorFlow's distributed operation mode as an example, as shown in Figure 5.

    A TensorFlow distributed program corresponds to an abstract cluster, which consists of worker nodes and parameter servers. Worker nodes perform specific computational tasks such as matrix multiplication and vector addition, calculate corresponding parameters (weights and biases), and summarize them to the parameter server. The parameter server collects and summarizes parameters from multiple worker nodes, calculates them, and passes them back to the respective worker nodes for the next round of calculations, and so on.
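The worker/parameter-server round described above can be sketched serially in plain NumPy. This is only an illustration of the data flow (workers compute gradients on their shards, the parameter server averages them and broadcasts updated weights), not actual TensorFlow distributed code:

```python
import numpy as np

# Toy linear-regression task; in reality each worker would run concurrently.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)                              # parameters held by the parameter server
lr = 0.1
shards = np.array_split(np.arange(100), 4)   # one data shard per worker node

for step in range(200):
    grads = []
    for idx in shards:                       # each "worker": gradient on its shard
        err = X[idx] @ w - y[idx]
        grads.append(2 * X[idx].T @ err / len(idx))
    w -= lr * np.mean(grads, axis=0)         # "parameter server": aggregate, update

print(np.round(w, 2))  # approximately [1., -2., 0.5]
```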

3. The Biggest Bottleneck in AI Platform Implementation

The biggest bottleneck lies in the enterprise's trade-off on the return on investment of the AI platform (how its value is perceived by the top, middle, and execution layers), which is discussed in detail below.

AI platforms must provide not only the basic capabilities required by AI development workflows, but also corresponding services for different users (product managers, operational personnel, algorithm engineers, etc.) and different customers (large enterprises, small and medium-sized enterprises, traditional enterprises, tech enterprises, etc.).

    I categorize AI platform capabilities into the following five types:

Huawei ModelArts Platform Capabilities

    In daily operations, each new product requires data collection and annotation, followed by feeding the data into corresponding model files for training, involving many repetitive and tedious tasks.

    Platformizing the process from data collection and processing to model training and deployment can greatly improve development efficiency, allowing operational and algorithm personnel to better manage scenarios and models, respectively.

Moreover, data and production-usable models are an enterprise's most valuable technical assets, yet they initially sit in a black-box state, accessible only to algorithm personnel. Once business development reaches a certain stage, effective management of both becomes necessary.

    The AI platform in this article primarily serves the business scenario of commodity model training in the retail industry, hence referred to as the AI Commodity Model Training Platform.

    Considering resources, scenarios, service efficiency, commercialization, and other dimensions, the author designed the commodity model training platform mainly composed of two core subsystems: the data center and the model center. On one hand, it can cover the core and personalized processes required by current business with minimal development resources; on the other hand, it is conducive to the platform's future capability expansion and commercialization.

    AI Commodity Model Training Platform

    The data center mainly serves three data management business needs: data acquisition, data processing, and data evaluation. The involved capabilities include dataset acquisition, dataset management, data enhancement, enhancement strategy configuration, data annotation, annotation task systems, and semi-automatic annotation.

    The model center mainly serves three model management business needs: model training and validation, model management, and model deployment. The involved capabilities include model training, parameter configuration, training task management, training status visualization, model file management, model version management, model status management, model operations, model processing, model processing strategy management, model deployment, and deployment business management.

    Next, solutions for core business needs are explained one by one.

    2.3.1 Data Acquisition

    The first step in AI model training is data acquisition (here, the data refers to image data).

    Data can be collected by setting up the required environment for specific business scenarios for shooting, or obtained from existing data within the platform (online data, old data), third-party data (through open-source, paid purchases, web scraping, etc.).

    Since the dataset consists of image data and the model is built on deep learning technology, processes like data ETL and feature engineering are not immediately necessary. These can be added later based on business scenarios and technological advancements in the technical and platform architecture.

    After acquiring the dataset, the data can be stored and managed through the dataset management page, categorized by different dimensions:

    1. Standard vs. non-standard products
    2. Data source channels
    3. Data formats (images, videos, 2D, 3D, etc.)
    4. Data usability (basic datasets, training datasets with annotations, validation datasets, anomaly datasets, custom datasets).

    Datasets should have lifecycle management and notes to avoid clutter and redundancy over time.
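A dataset record covering those dimensions might look like the following; this is a hypothetical schema, with field names chosen for illustration rather than taken from the platform:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DatasetRecord:
    name: str
    is_standard_product: bool   # dimension 1: standard vs. non-standard products
    source_channel: str         # dimension 2: e.g. "shoot", "online", "third-party"
    data_format: str            # dimension 3: "image", "video", "2d", "3d"
    usability: str              # dimension 4: "basic", "training", "validation", ...
    created: date = field(default_factory=date.today)  # basis for lifecycle rules
    notes: str = ""             # free-form notes, to avoid clutter over time

ds = DatasetRecord("shelf-snacks-v1", True, "shoot", "image", "training",
                   notes="collected under lab lighting setup A")
print(ds.usability)  # training
```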

    2.3.2 Data Processing

    Before training models for certain scenarios, data may need enhancement to varying degrees and through different methods. Operators or algorithm engineers can select the corresponding dataset and enhancement strategy on the data enhancement page. The enhanced dataset will be displayed as a sub-file of the original data under the "Enhanced Dataset" type in dataset management.

To adapt to multiple business scenarios and speed up data enhancement experiments, multiple enhancement schemes can be configured from existing techniques.
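As an illustration, enhancement schemes can be expressed as composable transforms. This toy sketch operates on a tiny nested-list "image," and the strategy names are invented; real pipelines would operate on tensors:

```python
def hflip(img):
    """Mirror each row horizontally."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def apply_strategy(img, strategy):
    """Apply a configured sequence of enhancement operations."""
    for op in strategy:
        img = op(img)
    return img

# Hypothetical named strategies an operator could pick on the enhancement page.
STRATEGIES = {
    "mirror": [hflip],
    "mirror+rotate": [hflip, rot90],
}

img = [[1, 2],
       [3, 4]]
print(apply_strategy(img, STRATEGIES["mirror"]))  # [[2, 1], [4, 3]]
```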

    After dataset resource management is complete, data can be annotated on the platform. Operators can annotate existing datasets or import new datasets for annotation.

    Common annotation tasks include:

    • Image classification annotation
    • Bounding box annotation
    • Circular annotation
    • Polygon annotation
    • Semantic segmentation annotation
    • 3D annotation

    Annotations can cover standard and non-standard products, as well as other elements like hands or faces.
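One possible shape for an annotation record, here for a bounding-box task, is sketched below; the JSON schema and field names are assumptions, not the platform's actual format:

```python
import json

annotation = {
    "image_id": "img-0001",
    "task": "bounding_box",    # or classification, polygon, semantic_seg, 3d, ...
    "annotator": "op-17",
    "objects": [
        {"label": "bottle", "bbox": [34, 50, 120, 210]},  # x, y, width, height
        {"label": "hand",   "bbox": [200, 80, 90, 140]},  # non-product element
    ],
}
record = json.dumps(annotation)  # serialized form for storage/export
print(json.loads(record)["objects"][0]["label"])  # bottle
```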

    2.3.3 Data Evaluation

    Data evaluation runs through the entire process from data acquisition to processing. Its quality and rigor directly determine data quality and indirectly affect model performance.

    During data acquisition and enhancement, operators evaluate data usability based on general rules and experience, consulting algorithm engineers when uncertain. However, uncertain data rules depend on factors like products, current models, requirements, and algorithmic knowledge, often relying on "personal experience." This area has significant room for optimization as employees gain experience.

    Regular and quantitative checks on existing datasets are necessary to validate data and annotation quality. Annotation workflow (including task assignment, multi-level review, annotator performance tracking, and reward/penalty mechanisms) is also crucial for ensuring data quality.

    2.4.1 Model Training and Validation

    Once data is ready, operators or algorithm engineers can select the model, dataset, and training parameters (e.g., AI algorithm, network depth, training steps) on the model training page to start incremental or full training.

    If GPU server resources are a consideration, the appropriate server can be selected. Visualizing training progress helps operators monitor tasks and pause or cancel unexpected training (e.g., when loss stops decreasing), freeing up algorithm engineers' time.
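The "loss stops decreasing" condition can be sketched as a simple plateau check; the patience and threshold values here are illustrative assumptions:

```python
def should_pause(losses, patience=3, min_delta=1e-3):
    """True if the last `patience` steps brought no meaningful improvement
    over the best loss seen before them."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    return min(losses[-patience:]) > best_before - min_delta

print(should_pause([1.0, 0.8, 0.7, 0.70, 0.70, 0.71]))  # True  (plateaued)
print(should_pause([1.0, 0.8, 0.6, 0.4, 0.3, 0.2]))     # False (still improving)
```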

After training, model performance can be evaluated using metrics such as mAP (mean average precision), precision, and recall on the training set. Unannotated validation datasets can also be used to spot-check model quality.
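For a single class, precision and recall reduce to simple counts, as the toy example below shows; the labels are invented, and detection mAP (which additionally requires IoU matching across thresholds) is omitted:

```python
def precision_recall(pred, truth, positive):
    """Precision and recall for one class from parallel label lists."""
    tp = sum(p == positive and t == positive for p, t in zip(pred, truth))
    fp = sum(p == positive and t != positive for p, t in zip(pred, truth))
    fn = sum(p != positive and t == positive for p, t in zip(pred, truth))
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical commodity-recognition outputs vs. ground truth.
pred  = ["cola", "cola", "chips", "cola", "chips"]
truth = ["cola", "chips", "chips", "cola", "cola"]
p, r = precision_recall(pred, truth, "cola")
print(round(p, 2), round(r, 2))  # 0.67 0.67
```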

    2.4.2 Model Management

    Initial models can be imported from external files or generated through new training tasks.

    Models are often in a "used" state (online or updated), so management focuses on:

    • Versioning
    • Status (service, training)
    • Operation logs
    • Detailed parameters

    For optimization, replacement, or anomalies, operations like pausing, copying, deploying, or deleting models can be performed via "Model Management."
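Those management operations can be sketched as a tiny in-memory registry; the method names and status values are assumptions, not the platform's actual API:

```python
class ModelRegistry:
    """Toy registry tracking model versions, statuses, and operation logs."""

    def __init__(self):
        self.models = {}  # name -> {"version": int, "status": str, "log": [...]}

    def register(self, name):
        self.models[name] = {"version": 1, "status": "training",
                             "log": ["registered"]}

    def _op(self, name, status, note):
        m = self.models[name]
        m["status"] = status
        m["log"].append(note)          # operation log for auditing

    def deploy(self, name):
        self._op(name, "service", "deployed")

    def pause(self, name):
        self._op(name, "paused", "paused")

    def bump_version(self, name):
        m = self.models[name]
        m["version"] += 1              # versioning for iteration/rollback
        m["log"].append(f"new version {m['version']}")

reg = ModelRegistry()
reg.register("shelf-detector")
reg.deploy("shelf-detector")
reg.bump_version("shelf-detector")
m = reg.models["shelf-detector"]
print(m["status"], m["version"])  # service 2
```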

    For edge computing or resource-limited scenarios, one-click solutions for model compression and optimization can reduce engineering workload.

    2.4.3 Model Deployment

After training and validation, models can be deployed via "Model Deployment," typically moving from a gray (canary) release to full deployment.

    For edge scenarios, edge devices can periodically pull the latest model or deploy via edge nodes.
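The periodic pull can be sketched as a version check; the version endpoint and callbacks here are hypothetical stubs, not a real device API:

```python
def maybe_update(local_version, fetch_remote_version, download):
    """Download a new model only when the server reports a newer version."""
    remote = fetch_remote_version()
    if remote > local_version:
        download(remote)
        return remote
    return local_version

downloads = []
new_v = maybe_update(
    local_version=3,
    fetch_remote_version=lambda: 5,          # stub for an HTTP version check
    download=lambda v: downloads.append(v),  # stub for the actual model pull
)
print(new_v, downloads)  # 5 [5]
```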

    The biggest bottleneck in AI platform implementation is the trade-off between investment and return. After internal discussions, the team addressed three key questions before proceeding with the 1.0 development (primarily to validate utility):

    1. Can the AI platform truly support business needs? How much will it improve efficiency (development and operational)? Are there hidden post-implementation costs (e.g., training operators)? Can value be quantified, and at what additional cost?
    2. Can third-party AI platforms suffice for initial business incubation? Can manual processing handle custom data and training needs?
    3. Can the platform's commercial value be realized short-term (given data security concerns and reliance on brand reputation)? If not, when?

    There’s no standard answer—each enterprise must weigh resources, business needs, and stakeholder perspectives to determine if the ROI justifies developing an AI platform.

    Regardless of the scenario, the goal of an AI platform is to boost development efficiency and accelerate algorithmic iteration. By productizing AI, operators can better leverage technology to enhance business scenarios and customer experiences.

    Moreover, enterprises can refine their AI platforms internally before commercializing them, serving external clients who lack resources to adopt AI, thereby advancing the broader AI ecosystem.

    The development and use of AI platforms mark a critical milestone in applied AI, signifying its productization, practicality, and closer alignment with business needs—enabling enterprises to harness AI more efficiently.

    The cover image is from Unsplash, based on the CC0 license.
