Google's 'Big Move': How Many AI Annotation Companies Will Be Driven Out of Business?

baoshi.rao wrote:

    Handicraft workshops ultimately cannot compete with assembly lines.

    If current generative AI is likened to a growing child, then the continuous influx of data serves as the nourishment for its development. Data annotation is the process of preparing this 'nourishment.'

    However, the work is repetitive and exhausting. 'Annotators' must not only repeatedly identify objects, colors, shapes, and the like in images, but sometimes also clean and preprocess the data.

    With the continuous advancement of AI technology, the limitations of manual data annotation are becoming increasingly apparent. Manual annotation is not only time-consuming and labor-intensive but also sometimes fails to ensure quality.

    To address these issues, Google recently proposed a method called Reinforcement Learning from AI Feedback (RLAIF), which uses large models to replace human preference annotation.

    Research results show that RLAIF achieves improvements comparable to Reinforcement Learning from Human Feedback (RLHF) without relying on human annotation: in head-to-head comparisons, each method wins roughly 50% of the time, meaning the two are statistically tied. Additionally, both RLAIF and RLHF outperform the Supervised Fine-Tuning (SFT) baseline.

    These results indicate that RLAIF does not depend on manual annotation and is a viable alternative to RLHF.
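
    To make the idea concrete, here is a minimal sketch of AI-feedback preference labeling, the core of RLAIF. The prompt template and the `query_llm` helper are hypothetical stand-ins for whatever model and prompt a team actually uses; Google's real setup involves considerably more careful prompting than this.

    ```python
    # Minimal sketch of AI-feedback preference labeling (the core of RLAIF).
    # `query_llm` is a placeholder for any large-model completion call; the
    # prompt format is illustrative, not Google's actual template.

    def query_llm(prompt: str) -> str:
        """Stand-in for a call to an off-the-shelf large model."""
        raise NotImplementedError("wire this up to your model API")

    LABELING_PROMPT = """\
    A good summary is faithful, concise, and coherent.

    Text: {text}

    Summary A: {a}
    Summary B: {b}

    Which summary is better? Answer with exactly "A" or "B"."""

    def ai_preference(text: str, summary_a: str, summary_b: str) -> str:
        """Ask the labeler model which candidate it prefers. The answer plays
        the same role as a human annotator's choice in RLHF: it becomes a
        (preferred, rejected) training pair for a reward model."""
        answer = query_llm(LABELING_PROMPT.format(text=text, a=summary_a, b=summary_b))
        return "A" if answer.strip().upper().startswith("A") else "B"
    ```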

    So, if this technology is widely adopted in the future, will the many data annotation companies still relying on manual 'box-drawing' be driven to extinction?

    01 The Current State of Data Annotation

    To summarize the current state of the annotation industry in simple terms: the workload is heavy, but efficiency is relatively low, making it a thankless task.

    Annotation companies are referred to as data factories in the AI field and are typically concentrated in regions with abundant labor resources, such as Southeast Asia, Africa, or China's Henan, Shanxi, and Shandong provinces.

    To control costs, annotation company owners rent spaces in small towns, set up computers, and hire local part-time workers when orders come in. When there are no orders, they disband and take a break.

    In short, the job is somewhat akin to that of temporary construction workers hired off the roadside.

    At their workstations, 'annotators' are randomly assigned a set of data by the system, typically consisting of several questions and corresponding answers.

    The annotators first categorize each question by type, then score and rank its answers.
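
    For illustration, a single task might be shaped like the record below. The field names are hypothetical; real platforms differ.

    ```python
    # Illustrative shape of one annotation task as described above.
    # All field names are made up for the example.

    task = {
        "question": "How do I reset my router?",
        "question_type": None,  # step 1: the annotator picks a category
        "answers": [
            {"id": "a1", "text": "Hold the reset button for 10 seconds.", "score": None},
            {"id": "a2", "text": "Routers cannot be reset.", "score": None},
        ],
        "ranking": [],          # step 2: score each answer, then rank by quality
    }

    # A completed task might look like:
    task["question_type"] = "tech-support"
    task["answers"][0]["score"] = 5
    task["answers"][1]["score"] = 1
    task["ranking"] = ["a1", "a2"]
    ```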

    Previously, when discussing the gap between domestic large models and advanced models like GPT-4, the low quality of domestic data was identified as a key reason.

    But why is the data quality low? Part of the reason lies in the 'assembly line' of data annotation.

    Currently, Chinese large models rely on two main sources of data: one is open-source datasets, and the other is data scraped from the Chinese internet.

    One of the primary reasons for the underperformance of Chinese large models is the quality of internet data. For example, professionals generally do not use Baidu when searching for information.

    Therefore, when dealing with more specialized and vertical data issues, such as in healthcare or finance, collaboration with professional teams is necessary.

    However, this introduces another problem: for professional teams, the return on investment in data is not only slow but also risks being exploited by latecomers.

    For instance, an annotation team might invest significant time and money to compile extensive datasets, only for others to purchase them at a fraction of the cost.

    Faced with this 'free-rider dilemma,' domestic large models find themselves in a paradoxical situation where data quantity is high, but quality remains low.

    So, how do leading international AI companies like OpenAI address this issue?

    In fact, OpenAI also relies on cheap, intensive labor to reduce costs in data annotation.

    For example, it was previously reported that OpenAI hired Kenyan workers at $2 per hour to label toxic content.

    The key difference lies in how they address the issues of data quality and annotation efficiency.

    Specifically, OpenAI's approach differs from that of domestic companies in how it minimizes the impact of 'subjectivity' and 'instability' in manual annotation.

    02 OpenAI's Approach

    To mitigate the 'subjectivity' and 'instability' of human annotators, OpenAI primarily adopts two key strategies:

    1. Combining Human Feedback with Reinforcement Learning

    First, let's discuss the first point. In terms of annotation methods, OpenAI's human feedback differs from domestic practices in that it primarily involves ranking or scoring the behavior of intelligent systems rather than modifying or annotating their outputs.

    Intelligent system behavior refers to a series of actions or decisions made by the system in a complex environment based on its goals and strategies—such as playing a game, controlling a robot, or engaging in a conversation. On the other hand, intelligent system outputs refer to results or responses generated for simple tasks based on input data, such as writing an article or drawing a picture.

    Generally, intelligent system behavior is harder to judge as 'correct' or 'incorrect' and is better evaluated based on preferences or satisfaction. This preference- or satisfaction-based evaluation system reduces the impact of human subjectivity, knowledge levels, and other factors on the quality and accuracy of data annotation since it doesn't require modifying or annotating specific content.

    While domestic companies also use ranking and scoring during annotation, they lack a 'reward model' like OpenAI's, which converts those preferences into a reward function used to optimize the system's policy. Without that step, 'ranking' and 'scoring' remain, in essence, just another way of modifying or annotating outputs.
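
    For the curious, here is a minimal PyTorch sketch of that reward-model step: pairwise rankings become a scalar reward function that reinforcement learning can later optimize against. The architecture and sizes are placeholders, not OpenAI's actual model.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RewardModel(nn.Module):
        """Scores a response embedding with a single scalar reward. In
        practice this head sits on top of a pretrained transformer; a lone
        linear layer stands in for it here."""
        def __init__(self, embed_dim: int = 768):
            super().__init__()
            self.score = nn.Linear(embed_dim, 1)

        def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
            return self.score(response_embedding).squeeze(-1)

    def pairwise_ranking_loss(model: RewardModel,
                              preferred: torch.Tensor,
                              rejected: torch.Tensor) -> torch.Tensor:
        """Bradley-Terry style loss: push the preferred response's reward
        above the rejected one's, turning a ranking ("A beats B") into a
        trainable reward function."""
        margin = model(preferred) - model(rejected)
        return -F.logsigmoid(margin).mean()

    # Usage with random stand-in embeddings:
    rm = RewardModel()
    loss = pairwise_ranking_loss(rm, torch.randn(8, 768), torch.randn(8, 768))
    loss.backward()
    ```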

    2. Diverse, Large-Scale Data Sources

    Domestic data annotation primarily relies on third-party annotation companies or in-house teams at tech companies, often composed of undergraduates who lack sufficient expertise and experience, making it difficult to provide high-quality and efficient feedback.

    In contrast, OpenAI's human feedback comes from multiple channels and teams. OpenAI not only uses open-source datasets and web crawlers to gather data but also collaborates with various data companies and institutions, such as Scale AI, Appen, and Lionbridge AI, to obtain more diverse and high-quality data.

    Compared to domestic counterparts, these data companies and institutions employ much more 'automated' and 'intelligent' annotation methods. For example, Scale AI uses a technology called Snorkel, a weakly supervised learning-based data annotation method that generates high-quality labels from multiple imprecise data sources.

    Snorkel can combine diverse signals, such as rules, models, and knowledge bases, to label data without requiring a human annotation for every data point, significantly reducing the cost and time of labeling.
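
    A minimal weak-supervision sketch in the style of Snorkel's open-source library follows (API as of snorkel 0.9.x; the labeling rules and data are made up for illustration): several noisy labeling functions vote on each example, and a label model combines their votes into one denoised label.

    ```python
    import pandas as pd
    from snorkel.labeling import labeling_function, PandasLFApplier
    from snorkel.labeling.model import LabelModel

    ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

    @labeling_function()
    def lf_keyword(x):
        # Rule-based signal: a crude keyword heuristic.
        return POSITIVE if "great" in x.text.lower() else ABSTAIN

    @labeling_function()
    def lf_length(x):
        # Another weak signal: very short reviews skew negative here.
        return NEGATIVE if len(x.text) < 10 else ABSTAIN

    df_train = pd.DataFrame({"text": ["great product", "bad", "works great", "meh"]})

    applier = PandasLFApplier(lfs=[lf_keyword, lf_length])
    L_train = applier.apply(df=df_train)            # (n_examples, n_lfs) vote matrix

    label_model = LabelModel(cardinality=2)
    label_model.fit(L_train=L_train, n_epochs=100)  # learn LF accuracies
    labels = label_model.predict(L=L_train)         # labels without hand annotation
    ```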

    With reduced costs and shorter cycles in data annotation, competitive data companies can further enhance their core competitiveness and differentiation by focusing on high-value, high-difficulty, and high-barrier niche fields such as autonomous driving, large language models, and synthetic data.

    As a result, the "first-mover disadvantage" dilemma is mitigated by strong technological and industry barriers.

    03 Standardization vs. Small Workshops

    It’s clear that AI-powered annotation technology is only eliminating companies that still rely solely on manual labeling.

    Although data annotation may seem like a "labor-intensive" industry, delving into the details reveals that achieving high-quality data is no easy task.

    Take Scale AI, a unicorn in overseas data annotation, as an example. Scale AI not only leverages cheap labor resources in places like Africa but also employs dozens of Ph.D. holders to handle industry-specific data.

    Data annotation quality is the core value that Scale AI provides to major AI model companies like OpenAI.

    To maximize data quality assurance, besides AI-assisted annotation, Scale AI’s other major innovation is its unified data platform.

    These platforms include Scale Audit, Scale Analytics, and Scale Data Quality. Through these, clients can monitor and analyze various metrics during the annotation process, validate and optimize labeled data, and assess accuracy, consistency, and completeness.
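
    As a rough illustration, checks of this kind fit in a few lines. The metric choices below (accuracy against a small expert-labeled "gold" subset, Cohen's kappa for inter-annotator consistency, labeled fraction for completeness) are common ones, not necessarily Scale AI's actual formulas.

    ```python
    from sklearn.metrics import cohen_kappa_score

    def accuracy_vs_gold(labels, gold):
        """Accuracy: agreement with an expert-labeled 'gold' subset."""
        return sum(l == g for l, g in zip(labels, gold)) / len(gold)

    def consistency(annotator_a, annotator_b):
        """Consistency: inter-annotator agreement (Cohen's kappa)."""
        return cohen_kappa_score(annotator_a, annotator_b)

    def completeness(labels):
        """Completeness: share of items that actually received a label."""
        return sum(l is not None for l in labels) / len(labels)
    ```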

    Such standardized and unified tools and processes have become the key differentiator between "assembly-line factories" and "handcrafted small workshops" in the annotation industry.

    Currently, most annotation companies in China still rely on 'manual review' to assess data quality; only a few giants like Baidu have adopted advanced management and evaluation tools such as the EasyData intelligent data service platform.

    Without specialized tools to monitor and analyze annotation results and metrics, quality control remains at the level of a "master craftsman’s intuition"—a small-workshop approach.

    In response, more Chinese companies, such as Baidu and Longmao Data, are beginning to leverage machine learning and AI to improve annotation efficiency and quality, adopting a human-machine collaboration model.

    It can be seen that the emergence of AI annotation is not the end for domestic annotation companies, but rather the end of inefficient, low-cost, and low-tech labor-intensive annotation methods.
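
    One common form of that human-machine collaboration is confidence-based routing: the model pre-labels everything, and only uncertain items reach a human. The sketch below assumes a hypothetical `model.predict` that returns a label together with a confidence score.

    ```python
    # Hypothetical human-machine collaboration loop: accept confident
    # machine labels, route the rest to human annotators.

    CONFIDENCE_THRESHOLD = 0.9  # illustrative cutoff

    def annotate(items, model, human_review_queue):
        labeled = []
        for item in items:
            label, confidence = model.predict(item)   # machine pre-annotation
            if confidence >= CONFIDENCE_THRESHOLD:
                labeled.append((item, label))         # accept automatically
            else:
                human_review_queue.append(item)       # send to an annotator
        return labeled
    ```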
