Google Introduces New Method: Training Data Reduced by 10,000 Times While Improving Model Accuracy
Recently, Google proposed a novel active learning screening process in its research, aiming to significantly reduce the amount of training data required for fine-tuning large language models. Experimental results show that the method can reduce the training data volume to 1/10,000 of the original while improving the alignment between model predictions and human expert judgments by 65%. In practical applications such as ad content classification and financial data security analysis, high-fidelity training data is in constant demand, yet curating qualified examples is both difficult and extremely costly.
This new method starts with a zero-shot or few-shot initial model, where users define target content through prompts, such as asking whether an ad is "clickbait." The initial model then labels the ads as clickbait or benign, generating a large labeled dataset. However, the initial dataset often suffers from severe class imbalance, resulting in weak model accuracy.
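The bootstrap step described above can be sketched as follows. Note that `zero_shot_label`, the prompt template, and the keyword heuristic are hypothetical stand-ins for the actual LLM call; a real system would send the prompt to the model (e.g., Gemini) and parse its reply.

```python
# Minimal sketch of the zero-shot labeling step, assuming a hypothetical
# labeling function in place of a real LLM call.

PROMPT_TEMPLATE = (
    "Is the following ad clickbait? Answer 'clickbait' or 'benign'.\n\nAd: {ad}"
)

def zero_shot_label(ad_text: str) -> str:
    """Stand-in for the LLM: a real system would send PROMPT_TEMPLATE
    filled with `ad_text` to the model and parse its answer."""
    # Toy heuristic purely for illustration.
    triggers = ("you won't believe", "shocking", "click here")
    return "clickbait" if any(t in ad_text.lower() for t in triggers) else "benign"

ads = [
    "You won't believe this one weird trick!",
    "Quarterly report: revenue up 3%",
    "SHOCKING photos inside - click here",
    "New library branch opens Monday",
]

# The initial model's labels form the (typically imbalanced) seed dataset.
labeled = [(ad, zero_shot_label(ad)) for ad in ads]
```

In practice this seed set is large and skewed toward the majority class, which is exactly the imbalance the next step addresses.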
To address this issue, the researchers clustered the examples the model had labeled as clickbait and those it had labeled as benign, and found that some clusters from the two groups overlapped, indicating regions where the model was prone to misjudgment. From these overlapping clusters they could select sample pairs for expert review, which controls auditing costs while prioritizing samples that cover diverse scenarios. This yields valuable samples that span the model's most likely error cases.
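One way to picture the overlap-based selection is a nearest-cross-label-pair search over embeddings: the closest pairs with conflicting model labels sit where the two classes blur together. The 2-D points, labels, and the choice of the two closest pairs below are illustrative assumptions, not Google's actual clustering procedure.

```python
from math import dist

# Toy 2-D "embeddings" for six ads, each with the initial model's label.
items = [
    ((0.10, 0.20), "clickbait"),
    ((0.15, 0.25), "clickbait"),
    ((0.90, 0.80), "clickbait"),
    ((0.85, 0.75), "benign"),
    ((0.20, 0.15), "benign"),
    ((0.95, 0.90), "benign"),
]

click = [(i, e) for i, (e, y) in enumerate(items) if y == "clickbait"]
benign = [(i, e) for i, (e, y) in enumerate(items) if y == "benign"]

# Rank every cross-label pair by embedding distance: the closest pairs lie
# in overlapping regions where the initial model is most likely wrong, so
# those are the ones sent to human experts for review.
cross = sorted(
    (dist(ea, eb), ia, ib) for ia, ea in click for ib, eb in benign
)
review_pairs = [(ia, ib) for _, ia, ib in cross[:2]]
```

Spending the expert budget only on these ambiguous pairs is what keeps the total number of hand-labeled examples in the hundreds rather than the tens of thousands.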
During the model fine-tuning process, expert-provided labels are divided into two groups: one for evaluating model consistency and the other for fine-tuning the model. This process is repeated until the model's performance approaches that of human experts.
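The iterate-until-expert-parity loop might look like the sketch below. The `model_predict` and `fine_tune` callables and the use of Cohen's kappa as the agreement metric are assumptions for illustration; the article only states that expert labels are split into an evaluation group and a fine-tuning group and that the loop repeats until performance approaches expert level.

```python
import random

def cohen_kappa(a, b):
    """Agreement between two label sequences, corrected for chance."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    cats = set(a) | set(b)
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)  # chance
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

def iterate(model_predict, fine_tune, expert_labeled,
            target_kappa=0.8, max_rounds=10):
    """Hypothetical outer loop: split the expert labels into an evaluation
    half and a fine-tuning half, fine-tune, and stop once model/expert
    agreement reaches the target."""
    for _ in range(max_rounds):
        random.shuffle(expert_labeled)
        half = len(expert_labeled) // 2
        eval_set, tune_set = expert_labeled[:half], expert_labeled[half:]
        preds = [model_predict(x) for x, _ in eval_set]
        truth = [y for _, y in eval_set]
        if cohen_kappa(preds, truth) >= target_kappa:
            break
        fine_tune(tune_set)   # assumed to update the model in place
```

Holding out half of the expert labels for evaluation keeps the stopping criterion honest: the model is never scored on the same labels it was just tuned on.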
Google's experiments used the Gemini Nano-1 and Nano-2 models and tested them on two tasks of varying complexity. Each task utilized approximately 100,000 crowdsourced labeled data points, despite severe data imbalance. The results showed high consistency among expert judgments, while crowdsourced labels aligned relatively poorly with expert judgments. Using the new method, a 3.25-billion-parameter model achieved significant alignment improvements on low-difficulty tasks with only 250–450 data points—a drastic reduction from the original 100,000—while still delivering strong performance.
In summary, Google's new method demonstrates that large models can achieve excellent performance during training with only a small amount of high-quality data, provided expert label consistency exceeds 0.8.
Key Takeaways:
Training data volume can be reduced to 1/10,000 of the original while improving model accuracy.
The new method relies on expert judgment and iterative model refinement to ensure sample quality.
Experiments show that a small amount of high-quality data can match or even surpass the results achieved with traditional large-scale datasets.