Models Like ChatGPT: By 2026, High-Quality Training Data Will Be Exhausted
MIT Technology Review published an article stating that with the ongoing popularity of large models like ChatGPT, the demand for training data is increasing. Large models act like 'web black holes,' continuously absorbing data, which will eventually lead to insufficient data for training.
The renowned AI research organization Epoch AI addressed the training data question directly in a paper, pointing out that large models will exhaust high-quality data by 2026, and all low-quality data between 2030 and 2050.
By 2030–2060, they will exhaust all image training data. (Here, 'data' refers to raw, unlabeled, and unpolluted data.)
Paper link: https://arxiv.org/pdf/2211.04325.pdf
In fact, the issue with training data has already surfaced. OpenAI stated that the lack of high-quality training data will be one of the significant challenges in developing GPT-5. It's akin to human education—when your knowledge reaches a doctoral level, revisiting middle school material offers no learning benefit.
To enhance GPT-5's learning, reasoning, and AGI capabilities, OpenAI has launched a data partnerships program (OpenAI Data Partnerships) aimed at collecting large volumes of private, long-form text, video, and audio data, so that the model can learn to simulate human thought and work patterns in greater depth.
Currently, partners such as the Government of Iceland and the Free Law Project have joined the program, providing OpenAI with diverse data to help accelerate model development.
Moreover, as AI-generated content from models such as ChatGPT, Midjourney, and Gen-2 floods the public internet, it will contaminate the data pools that humans have built, making them more homogeneous and uniform in style and logic, and thereby accelerating the depletion of high-quality data.
High-quality training data is crucial for the development of large models
From a technical perspective, large language models can be seen as "language prediction machines." They learn from vast amounts of text data to establish patterns of association between words and then use these patterns to predict the next word or sentence in a text.
The Transformer is the most famous and widely used of these architectures, and models like ChatGPT are built on it (the 'T' in GPT stands for Transformer).
In simple terms, large language models "follow the pattern": they imitate how humans write and speak. That is why, when you generate text with models like ChatGPT, the narrative style can feel familiar.
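As a toy illustration of next-word prediction (not how production models actually work, which rely on neural networks over tokens rather than word counts), a simple bigram counter captures the basic idea of learning word-association patterns from text and using them to predict what comes next:

```python
from collections import Counter, defaultdict

# Tiny corpus standing in for "vast amounts of text data".
corpus = "the model predicts the next word the model learns patterns".split()

# Learn association patterns: how often each word follows each other word.
next_word_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word_counts[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Predict the most frequent continuation seen during 'training'."""
    candidates = next_word_counts.get(word)
    return candidates.most_common(1)[0][0] if candidates else "<unk>"

print(predict_next("the"))    # -> "model" ("model" follows "the" twice, "next" once)
print(predict_next("model"))  # -> "predicts" (tie with "learns", broken by first-seen order)
```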
Therefore, the quality of training data directly determines whether the structure learned by the model is accurate. If the data contains numerous grammatical errors, poor phrasing, incorrect sentence breaks, or false information, the model's predictions will naturally include these issues.
For example, if you train a translation model using poorly fabricated, low-quality data, the AI's translations will inevitably be very poor.
This is also why we often see models with fewer parameters outperform much larger ones; one of the main reasons is that they were trained on higher-quality data.
In the Era of Large Models, Data Reigns Supreme
Given the importance of data, high-quality training data has become a precious resource fiercely contested by companies such as OpenAI, Baidu, Anthropic, and Cohere, serving as the 'oil' of the large model era.
As early as March of this year, while the Chinese market was still in the research stage for large models, Baidu took the lead by launching its generative AI product, Wenxin Yiyan, positioning it as a competitor to ChatGPT.
Beyond Baidu's formidable R&D capabilities, the vast store of Chinese-language data it has accumulated over 20-plus years of running its search engine played a crucial role, powering the repeated iterations of Wenxin Yiyan and keeping it well ahead of other Chinese competitors.
High-quality data typically includes published books, literary works, academic papers, school textbooks, authoritative media reports, Wikipedia, Baidu Encyclopedia, etc. These are texts, videos, and audio that have been verified over time by humans.
However, research institutions have found that the growth of such high-quality data is very slow. Taking published books as an example, they require market research, drafting, editing, and review processes, taking months or even years to publish a single book. This rate of data production lags far behind the growing demand for training data in large models.
Looking at the development of large language models over the past four years, the volume of training data they use has grown by more than 50% per year. At that compounding rate, the data required to keep improving performance and functionality roughly doubles every two years, as the quick calculation below illustrates.
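A back-of-the-envelope check of that compounding, assuming a constant 50% annual growth rate (the figure cited above; the real rate varies by model family):

```python
# Compounding of a 50%-per-year growth rate in training data requirements.
growth_rate = 1.5   # >50% annual growth, as cited above
data_needed = 1.0   # data required in year 0 (arbitrary units)

for year in range(1, 11):
    data_needed *= growth_rate
    print(f"Year {year:2d}: {data_needed:5.1f}x the starting requirement")

# After 2 years the requirement is ~2.25x; after 10 years ~57.7x.
# At this rate, demand doubles roughly every 21 months.
```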
Therefore, you will see many countries and enterprises strictly protecting data privacy and establishing relevant regulations. On one hand, this is to safeguard user privacy from being collected by third-party institutions, preventing theft and misuse.
On the other hand, it is to prevent important data from being monopolized and hoarded by a few institutions, leaving no data available for technological research and development.
By 2026, high-quality training data may be exhausted
To study how quickly training data will be consumed, researchers at Epoch AI simulated the amount of language and image data generated globally each year from 2022 to 2100 and calculated the total stock of that data.
They also simulated the consumption rate of data by large models like ChatGPT. Finally, by comparing the growth rate of data with its consumption rate, they arrived at the following key conclusions:
Under the current rapid development trend of large models, all low-quality data will likely be exhausted between 2030 and 2050. High-quality data may be completely depleted as early as 2026.
Between 2030 and 2060, all image training data will be consumed. By 2040, due to the lack of training data, functional iteration of large models may show signs of slowing down.
Researchers used two models in their calculations. The first analyzed the growth trends of the datasets actually used to train large language and image models, extrapolating from historical statistics to project average and peak future data consumption.
The second model forecast the amount of new data generated globally each year. It was based on three variables: global population, the internet penetration rate, and the average amount of data generated by each internet user per year.
Additionally, researchers used United Nations data to fit population growth curves, applied an S-shaped function to model internet usage rates, and made the simple assumption that the amount of data generated per person annually remains constant. By multiplying these three factors, they estimated the annual global volume of new data.
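A minimal sketch of that three-factor structure, assuming a logistic (S-shaped) adoption curve and placeholder parameter values; the curves in the paper are fitted to UN population and internet-usage data, so the numbers here are illustrative only:

```python
import numpy as np

def internet_penetration(year, k=0.12, midpoint=2015.0, ceiling=0.95):
    """S-shaped (logistic) share of the population online; parameters are placeholders."""
    return ceiling / (1.0 + np.exp(-k * (year - midpoint)))

def population(year):
    """Crude linear stand-in for a UN-fitted population curve (people)."""
    return 7.9e9 + 7.0e7 * (year - 2022)

def annual_new_data(year, data_per_user=1.0):
    """New data per year = population x internet penetration x data per user (arbitrary units)."""
    return population(year) * internet_penetration(year) * data_per_user

years = np.arange(2022, 2101)
new_data = np.array([annual_new_data(y) for y in years])
total_stock = np.cumsum(new_data)  # accumulated stock of data produced from 2022 onward

print(f"New data generated in 2100 is about {new_data[-1] / new_data[0]:.1f}x the 2022 level")
```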
This model accurately predicted the monthly data output of Reddit (a well-known forum), which lends the approach credibility.
Finally, the researchers combined the two models, comparing projected data consumption against the projected stock of available data, to reach the conclusions above.
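In outline, that comparison comes down to finding the year in which projected demand overtakes the projected stock. The sketch below uses made-up starting values and growth rates purely for illustration, so the printed year should not be read as the study's estimate:

```python
import numpy as np

years = np.arange(2022, 2101)

# Illustrative starting values and growth rates (not the paper's fitted figures):
stock = 9.0e12 * 1.05 ** (years - 2022)    # usable high-quality data stock, growing slowly
demand = 1.0e12 * 1.50 ** (years - 2022)   # training data demand, growing >50% per year

crossed = years[demand > stock]
print("Demand overtakes stock in:", crossed[0] if crossed.size else "not before 2100")
```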
Researchers indicate that although this data is simulated and estimated, with some inherent uncertainty, it serves as a wake-up call for the AI community. Training data may soon become a significant bottleneck limiting the expansion and application of AI models.
AI companies need to proactively develop effective methods for data regeneration and synthesis to avoid abrupt data shortages during the development of large models.