How to Evaluate an Intelligent Dialogue System (Part 2)
-
In this chapter, we introduce the currently popular evaluation methods for intelligent dialogue systems in the industry, including manual evaluation and automated evaluation. We elaborate on the advantages and disadvantages of different evaluation methods and explain the importance and necessity of manual evaluation in the task of assessing intelligent dialogue systems. Enjoy~
In the previous chapter, we discussed the classification of intelligent dialogue systems and outlined the objectives of the different types. We transformed the question 'how do we evaluate an intelligent dialogue system?' into 'how do we define a good intelligent dialogue system?' By examining dialogue context, dialogue scenario, and dialogue intent, we defined what an intelligent dialogue system should do and thereby established its evaluation criteria and benchmarks.
At the same time, we also mentioned that the evaluation of intelligent dialogue systems is an open and hot topic. Professionals in related fields have proposed many evaluation methods over time. Next, let’s take a look at the mainstream evaluation methods currently in use.
Generally speaking, the evaluation methods for intelligent dialogue systems fall into two broad categories: manual annotation evaluation and automated algorithm scoring.
Manual evaluation involves hiring testers to manually annotate the results generated by the dialogue system. Humans use their common sense and experience to judge how well the AI performs in conversation. Testers interact with the system within predefined task domains or scenarios and score its performance during the interaction.
Manual evaluation is the primary method for assessing intelligent dialogue systems. This approach allows humans to personally test the conversational abilities of machines to determine whether they perform as well as humans. If humans think it’s good, it’s good; if not, it’s not.
Currently, many crowdsourcing platforms handle AI-related tasks. These platforms can quickly pool large numbers of workers over the internet to carry out manual evaluations of intelligent dialogue systems; Amazon Mechanical Turk (AMT) is one such platform.
However, manual evaluation has two fatal flaws.
The first flaw is that manual evaluation is very costly, summed up in one word: 'expensive.' Evaluating a dialogue system requires testers to invest significant time and effort, and to ensure that the results generalize, a tester pool of sufficient scale must be organized for the task. Manual evaluation therefore consumes substantial human resources.
The second flaw is that manual evaluation inevitably introduces uncontrollable error: each tester judges by their own subjective standards, so scores vary from person to person and are hard to reproduce.
Automated evaluation generally refers to assessing an intelligent dialogue system using predefined computer algorithms or rules. The results of automated evaluation are often presented as scores or thresholds.
Currently, two families of automated dialogue evaluation methods are widely recognized in the industry. The first scores a system by the word-overlap rate between its responses and reference answers: BLEU and METEOR are widely used in machine translation, while ROUGE has achieved good results in automatic text summarization. The second judges the relevance of a response by modeling the meaning of each word; word embeddings such as Word2Vec are the foundation of this family of methods.
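To make the two families concrete, here is a minimal sketch in Python: a BLEU-1-style unigram-precision score for the overlap family and an average-embedding cosine similarity for the embedding family. The toy `EMBEDDINGS` table is an illustrative assumption standing in for vectors from a trained Word2Vec model; this is a sketch of the ideas, not the reference implementation of any of the metrics named above.

```python
from collections import Counter
import math

def unigram_overlap(candidate, reference):
    """BLEU-1-style precision: the fraction of candidate words that
    also appear in the reference, clipped by reference counts."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return matched / max(len(cand), 1)

# Toy embedding table; in practice these vectors would come from Word2Vec.
EMBEDDINGS = {
    "paris": [0.9, 0.1], "france": [0.8, 0.2],
    "capital": [0.7, 0.3], "pizza": [0.1, 0.9],
}

def sentence_vector(sentence):
    """Average the word vectors of all in-vocabulary words."""
    vecs = [EMBEDDINGS[w] for w in sentence.lower().split() if w in EMBEDDINGS]
    if not vecs:
        return None
    return [sum(dims) / len(vecs) for dims in zip(*vecs)]

def embedding_similarity(candidate, reference):
    """Cosine similarity between averaged sentence vectors."""
    a, b = sentence_vector(candidate), sentence_vector(reference)
    if a is None or b is None:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

print(unigram_overlap("paris is the capital", "the capital of france is paris"))
print(embedding_similarity("paris capital", "france pizza"))
```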
Although the evaluation methods mentioned above are effective, they are limited to specific experimental scenarios.
In recent years, with continuous breakthroughs in AI algorithms, many new evaluation methods have been proposed. These include GAN-inspired models, which judge how similar the system's responses are to human responses, and learned evaluation models such as ADEM, trained with recurrent neural networks (RNNs), which predict the score a human evaluator would give a system response and thus achieve more accurate evaluation with fewer manual annotations.
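As a rough illustration of how a learned metric in the ADEM style turns a context, a reference response, and a system response into a score, the sketch below follows the published ADEM scoring form: two learned matrices project the encoded context and reference response onto the encoded system response, and two constants rescale the result toward a human-style 1-to-5 rating. The placeholder encoder and the randomly initialised parameters here are assumptions standing in for ADEM's trained hierarchical RNN encoder and learned weights.

```python
import numpy as np

DIM = 4  # embedding dimensionality of the (placeholder) encoder

def encode(text):
    """Placeholder for ADEM's hierarchical RNN encoder: a deterministic
    pseudo-random vector per text, used only to make the sketch runnable."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(DIM)

def adem_score(context, reference, response, M, N, alpha=3.0, beta=2.0):
    """ADEM-style score: (c^T M r_hat + r^T N r_hat - alpha) / beta,
    intended to approximate a human rating of the response."""
    c, r, r_hat = encode(context), encode(reference), encode(response)
    return float((c @ M @ r_hat + r @ N @ r_hat - alpha) / beta)

# Randomly initialised parameters stand in for the learned M and N.
rng = np.random.default_rng(0)
M = rng.standard_normal((DIM, DIM))
N = rng.standard_normal((DIM, DIM))

print(adem_score("how was your weekend?",
                 "it was great, I went hiking",
                 "pretty good, thanks for asking", M, N))
```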
Automated evaluation methods can save human resources and complete dialogue system evaluations quickly and efficiently. However, these methods are mostly used as reference indicators in specific experimental scenarios. In real-world applications, they cannot replace manual annotations to achieve objective and comprehensive dialogue evaluations. Much of the communication and interaction between humans in daily life cannot be constrained by predefined rules—it often has a 'you know it when you see it' quality.
Although automated evaluation methods are convenient, they still fall short of solving the practical problem. The real world is not a predefined laboratory setting, and every automated evaluation method runs into numerous counterexamples. Language is a uniquely human ability, so evaluating dialogue capability must ultimately fall to humans, relying on intuition that only humans possess.
Rather than painstakingly researching an idealized automated evaluation algorithm, it’s better to focus on optimizing manual evaluation tasks. Thus, reducing subjective judgment factors and human resource costs in manual evaluation has become our goal.
After clarifying the evaluation criteria and benchmarks for dialogue systems, we next attempt to break down the task into smaller units to assess the performance of an intelligent dialogue system from different dimensions. We call this evaluation method 'distributed evaluation.'
Distributed evaluation aims to break down the evaluation task into the smallest possible units for processing.
Step 1: We divide a dialogue system’s performance into the smallest unit—a single round of dialogue (a question-answer pair). By aggregating the performance of many rounds of dialogue, we can objectively reflect the overall performance of a dialogue system.
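As a minimal sketch of this first step, a transcript of alternating user and system turns can be split into single-round question-answer pairs whose scores are later averaged into a system-level figure. The `QAPair` structure and field names below are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str   # the user's turn
    answer: str     # the system's reply
    score: float    # score assigned to this single round, e.g. in [0, 1]

def split_into_pairs(turns):
    """Split an alternating [user, system, user, system, ...] transcript
    into single-round question-answer pairs (scores filled in later)."""
    return [QAPair(turns[i], turns[i + 1], 0.0)
            for i in range(0, len(turns) - 1, 2)]

def system_score(pairs):
    """Aggregate: the mean of the per-round scores reflects the system."""
    return sum(p.score for p in pairs) / len(pairs) if pairs else 0.0

dialogue = ["hi there", "hello! how can I help?",
            "what's the weather like?", "it looks sunny today"]
pairs = split_into_pairs(dialogue)
```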
Step 2: We further break down the evaluation task for each round of dialogue, assessing the system from different dimensions to judge the performance of each question-answer pair in terms of dialogue context, dialogue scenario, and dialogue intent. After multiple attempts and explorations, we defined six evaluation dimensions for dialogue systems.
The six evaluation dimensions for intelligent dialogue systems include: grammatical quality, content quality, content relevance, logical relevance, emotional intensity, and divergence.
Grammatical quality: This dimension focuses on the basic grammar usage in the system’s responses. The replies generated by an intelligent dialogue system should conform to general language grammar, with correct and standardized word usage, coherent and complete sentences. This dimension is relatively objective, as each language has its own grammatical rules.
Content quality: Content quality can be judged from three perspectives. First, the length of the system’s response should be appropriate—neither too long nor too short. Second, the dialogue content should be substantive, containing concrete information without linguistic ambiguity. Third, the content should avoid violent, obscene, or negative material, as well as politically sensitive topics, while expressing correct stances and viewpoints.
Content relevance: This refers to the relevance of the system’s answer to the question. It involves judging whether the system’s reply and the user’s question are discussing the same topic and whether the preceding and following content are about the same subject. Generally, question-answer pairs containing the same entities can be considered content-relevant.
Logical relevance: This refers to the logical connection between the system’s response and the preceding dialogue. This logic includes temporal logic, comparative logic, and objective laws. For example, if the preceding content is about the size of an object, the reply should also relate to the object’s size. If the reply naturally follows the preceding content, we can say the question-answer pair is logically relevant.
Emotional intensity: This dimension assesses whether the system’s response expresses emotion. The definition of emotional intensity varies by person, making it hard to standardize. However, based on reasonable content and accurate logic, it can be judged by factors like whether the reply includes modal particles or onomatopoeia. Emotion in dialogue is also reflected in whether the reply is perfunctory, expresses strong subjective attitudes or intentions, or is humorous, funny, sad, or somber.
Divergence: Divergence refers to the topic expansion capability of machine responses during human-machine dialogue. It evaluates whether the current reply can stimulate further rounds of conversation. In other words, it assesses whether users feel inclined to continue the dialogue after seeing the system's response. Generally, if the machine generates an open-ended question, the conversation is more likely to proceed naturally. This involves interactive techniques such as content recommendation and proactive questioning.
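Putting the six dimensions together, the sketch below shows one way the per-pair annotations might be recorded and consolidated. The binary scoring, the field names, and the entity-overlap heuristic for content relevance are illustrative assumptions rather than a prescribed annotation scheme.

```python
from dataclasses import dataclass, asdict

DIMENSIONS = ["grammar", "content_quality", "content_relevance",
              "logical_relevance", "emotional_intensity", "divergence"]

@dataclass
class Annotation:
    """One annotator's judgement of a single question-answer pair,
    scored 0/1 on each of the six dimensions."""
    grammar: int
    content_quality: int
    content_relevance: int
    logical_relevance: int
    emotional_intensity: int
    divergence: int

def entity_overlap(question_entities, answer_entities):
    """Rough heuristic for content relevance: the pair shares at least
    one entity (see the definition of content relevance above)."""
    return int(bool(set(question_entities) & set(answer_entities)))

def consolidate(annotations):
    """Average each dimension over all annotated question-answer pairs
    to obtain the system-level profile."""
    return {dim: sum(asdict(a)[dim] for a in annotations) / len(annotations)
            for dim in DIMENSIONS}

example = [Annotation(1, 1, 1, 1, 0, 1), Annotation(1, 0, 1, 1, 1, 0)]
print(consolidate(example))
```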
This chapter introduces popular intelligent dialogue evaluation methods in the industry, including manual and automated evaluation. It explains the advantages and disadvantages of different evaluation approaches and emphasizes the importance and necessity of manual evaluation in intelligent dialogue assessment tasks. We believe that proposing more effective manual testing methods is more practical than pursuing idealized automated evaluation approaches for intelligent dialogue system assessment.
We introduce a distributed evaluation method that breaks the evaluation problem down into six distinct, clearly defined dimensions. The dialogue system is decomposed into individual question-answer pairs, and the evaluation task focuses on each system response. We evaluate every response generated by a dialogue system across these six dimensions and then consolidate all the results to derive the final assessment.
In the next article, we will explain the specific implementation steps of the distributed evaluation method. We will share insights on designing annotation tasks for dialogue evaluation and analyze the technical principles behind each annotation question.