Only Large Tech Companies Have the Capacity to Bear the Costs of AI Training Data

AI Insights
baoshi.rao wrote:
    The development of AI depends on data, and the cost of that data is rising to the point where few companies beyond the wealthiest tech firms can bear it. According to an article published last year by OpenAI researcher James Betker, training data is the key factor determining an AI model's capabilities. Traditional AI systems are essentially statistical machines: they use vast numbers of examples to guess the most 'reasonable' output given the data distribution. As a result, the larger the dataset a model is trained on, the better it tends to perform.
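    The "statistical machine" point above can be illustrated with a toy bigram model. This is a deliberately simplified sketch (nothing like how modern large models are actually built): the model is nothing but counts of word pairs seen in training text, and a larger corpus lets it answer queries that a smaller one cannot.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """The 'model' is just word-pair statistics gathered from examples."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1
    return counts

def predict_next(model, word):
    """Guess the most 'reasonable' continuation seen during training."""
    if word not in model:
        return None  # never saw this word: no basis for a guess
    return model[word].most_common(1)[0][0]

# A tiny corpus versus a (slightly) larger one
small = "the cat sat"
large = "the cat sat on the mat because the cat was tired the cat slept"

m_small = train_bigram(small)
m_large = train_bigram(large)

# The small model has never seen "on", so it cannot continue it;
# the larger model has, and predicts the continuation it observed.
print(predict_next(m_small, "on"))   # None
print(predict_next(m_large, "on"))   # "the"
```

    The mechanism is the same at any scale: more examples mean more of the data distribution is covered, which is why dataset size correlates with capability.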

    Kyle Lo, a senior research scientist at the nonprofit AI research organization AI2, pointed out that Meta's Llama 3 was trained on far more data than AI2's own OLMo model, which explains its advantage on many popular AI benchmarks. Larger datasets do not always yield linear improvements in performance, however: data quality and curation matter just as much, sometimes more than sheer quantity. Some AI models are trained on human-annotated data, and higher-quality annotations have a significant impact on model performance.

    However, Lo and other experts worry that the demand for large, high-quality training datasets is concentrating AI development in the hands of a few companies with multi-billion-dollar budgets. While questionable or even unlawful scraping practices raise doubts about how some data is acquired, tech giants can simply buy their way to licensed data thanks to their financial strength. These data deals do not foster a fair and open generative AI ecosystem, and they harm the broader AI research community.

    Some independent nonprofit organizations, such as EleutherAI and Hugging Face, are attempting to build open large-scale datasets, but whether they can keep pace with large tech companies remains uncertain. Only if research breakthroughs remove the technical barriers, and the cost of collecting and curating data stops being an obstacle, will these open datasets have a chance to compete with those of the tech giants.
