Whoever Masters Hallucination Problem-Solving Capabilities Will Dominate Large Model Industrial Applications

    With the continuous advancement of deep learning and the expansion of model scale, 'emergent intelligence' has given large models strong generalization ability, but it has also made hallucination issues increasingly prominent.

    First, we need to consider a question: What is 'hallucination' in large models?

    So-called large model hallucination mainly refers to situations where the model outputs content inconsistent with the real world, such as fabricating facts, confusing fiction with reality, or believing rumors and legends—what we commonly call 'confident nonsense.' In practical application scenarios, hallucination issues not only affect the accuracy and stability of models but also constrain the reliability of large models in real-world applications. Therefore, solving large model hallucination has always been a key challenge in achieving comprehensive industrial applications of large models.

    Recently, Fudan University collaborated with the Shanghai AI Laboratory to construct the HalluQA dataset for evaluating large model hallucinations. The dataset covers 30 domains and focuses on potential issues large models may face in real-world applications. The 'Misleading' and 'Knowledge' sections specifically examine imitative falsehoods and factual errors in models.

    In the HalluQA evaluation dataset, the Fudan University team uses the 'hallucination-free rate' as a metric to assess model capabilities. The hallucination-free rate (the higher, the better) not only directly reflects the strengths and weaknesses of different models in addressing hallucination issues—closely related to factual accuracy—but also provides guidance for the feasibility of models in practical applications.
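
For intuition, the metric itself is simple: the share of evaluated answers judged to contain no hallucination. The sketch below is an illustrative calculation only, not the HalluQA team's actual scoring code; the record format and the judgment labels are assumptions standing in for whatever human or automated judge the benchmark uses.

```python
# Illustrative only: computing a hallucination-free rate from judged answers.
# The `judged` records and their fields are hypothetical, not HalluQA's format.
from collections import defaultdict

judged = [
    {"domain": "history",  "hallucinated": False},
    {"domain": "history",  "hallucinated": True},
    {"domain": "medicine", "hallucinated": False},
]

def hallucination_free_rate(records):
    """Fraction of answers judged free of hallucination (higher is better)."""
    if not records:
        return 0.0
    clean = sum(1 for r in records if not r["hallucinated"])
    return clean / len(records)

# Overall rate plus a per-domain breakdown.
print(f"overall: {hallucination_free_rate(judged):.2%}")

by_domain = defaultdict(list)
for r in judged:
    by_domain[r["domain"]].append(r)
for domain, recs in by_domain.items():
    print(f"{domain}: {hallucination_free_rate(recs):.2%}")
```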

    The emergence of the HalluQA evaluation set gives the industry a more rigorous way to tell which models are genuinely 'well-prepared,' which are merely 'rote learners,' and which are 'swimming naked' as large model technology evolves.

    Notably, the HalluQA team conducted a large-scale hallucination evaluation of 24 mainstream AI models in their report. The results show that only 6 models achieved a hallucination-free rate above 50%, just one-quarter of all tested models. Among them, Baidu's ERNIE-Bot topped the list with a hallucination-free rate of 69.33%, while GPT-4 ranked sixth at 53.11%.

    The primary causes of model hallucinations stem from model complexity and data quality. As model scale expands, increased complexity often leads to overfitting, making models overly reliant on training data details and noise while struggling to generalize to new data. Additionally, data quality significantly impacts model performance. Biased, noisy, or insufficient training data covering real-world diversity makes it difficult for models to learn accurate data distributions, resulting in hallucinations.
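
As a toy illustration of the overfitting point above (an analogy only, not a claim about how any particular large model was trained), fitting an overly complex model to a small, noisy sample reproduces the noise rather than the underlying pattern and fails on new data; at language-model scale, the analogous failure surfaces as confidently wrong output. The example uses NumPy polynomial fitting purely to make the train/test gap visible.

```python
# Toy analogy for overfitting: a high-degree polynomial memorizes noisy
# training points but generalizes poorly, while a simpler fit does not.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=x_train.size)

x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)  # the true, noise-free signal

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # The degree-9 fit drives training error toward zero yet shows the larger
    # test error: it has learned the noise, not the signal.
    print(f"degree {degree}: train MSE={train_err:.4f}, test MSE={test_err:.4f}")
```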

    Currently, the industry primarily addresses model hallucinations through two approaches beyond pretraining and fine-tuning: First, alignment techniques can significantly reduce hallucinations when answering misleading questions. Second, retrieval-augmented generation substantially improves hallucination-free rates for long-tail knowledge questions.
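
The sketch below shows the retrieval-augmented idea in minimal form, under stated assumptions: the small in-memory corpus, the keyword-overlap retriever, and the call_llm() stub are illustrative stand-ins for a real document index and a real model API, which the article does not specify.

```python
# Minimal retrieval-augmented generation sketch. The corpus, the
# keyword-overlap retriever, and call_llm() are all illustrative stand-ins.

CORPUS = [
    "HalluQA is a Chinese hallucination evaluation benchmark covering 30 domains.",
    "Retrieval-augmented generation grounds model answers in retrieved documents.",
    "The hallucination-free rate measures the share of answers with no fabricated facts.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank corpus passages by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        CORPUS,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def call_llm(prompt: str) -> str:
    """Placeholder for an actual large-model API call."""
    return f"[model answer conditioned on a prompt of {len(prompt)} chars]"

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    # Injecting retrieved passages into the prompt gives the model grounded
    # facts to rely on, which is what reduces hallucination on long-tail questions.
    prompt = (
        "Answer using only the context below; say 'unknown' if it is not there.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt)

print(answer("What does the hallucination-free rate measure?"))
```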

    How exactly can models reduce hallucinations? According to Fudan University's research, models first need strong foundational capabilities. For example, ERNIE-Bot 4.0 is pretrained on trillions of data points and hundreds of billions of knowledge elements to enhance generalization and factual accuracy. Supervised fine-tuning and RLHF then further align the model with human judgment to mitigate hallucination risks. Retrieval-augmented techniques also play a crucial role: ERNIE-Bot's semantic understanding architecture improves the accuracy and timeliness of its outputs, further reducing hallucinations.

    The root of large model hallucination issues lies in the complexity of language models. These models generate text by training on vast amounts of data and knowledge. However, in practical applications, if a large model confidently provides incorrect answers, it may produce inaccurate or misleading results. This poses significant challenges in fields requiring high precision and expertise, such as customer service, financial services, legal decision-making, and medical diagnosis, making large models difficult to deploy effectively in real-world scenarios.

    It can be said that the large model industry has long suffered from 'hallucination.' The first model to significantly reduce the impact of hallucinations will gain a competitive edge in the current 'Hundred Models Battle.'

    Lower hallucination rates are becoming a crucial factor for enterprises when selecting large models.

    Large Model Home believes that with the introduction of hallucination evaluation standards, a new dimension will open for the development of large model capabilities, enabling more trustworthy and reliable AI to be widely adopted across industries. We are also witnessing increasing technological innovations aimed at addressing hallucination issues, driving progress in this area. With the parallel advancement of technology and standards, large models will unlock greater value in various fields, bringing more convenience and surprises to human life.
