How to Evaluate an Intelligent Dialogue System (Part 1)
Natural language dialogue, as a new generation of human-computer interaction medium, has given rise to a wide range of applications. For a long time, researchers have been exploring various methods for machines to generate natural responses, including retrieval-based responses, end-to-end generated responses, as well as question-answering and recommendation systems. From smart home devices to smartphone assistants, from customer service to emotional companionship, various chatbots have emerged around us. However, the performance of intelligent dialogue systems often varies depending on different application scenarios and objectives, so there has been no unified standard for evaluating dialogue quality in the industry.
In recent years, the open question of 'how to evaluate an intelligent dialogue system' has attracted significant attention from researchers in related fields. Over the past few years, I have been dedicated to exploring evaluation methods for intelligent dialogue systems. The dialogue evaluation method I designed has been validated in multiple intelligent dialogue products, effectively driving continuous optimization and iteration of these products. At the same time, this evaluation method was selected as the standard for the open-domain dialogue system competition at NLPCC2019 and has been recognized by experts in the field.
So, how did I approach designing a solution for such a seemingly unsolvable problem?
First, dialogue evaluation is a very broad concept: it draws on knowledge from many different fields and is highly subjective, so it cannot be judged by a single unified standard. Simply put, the problem is to evaluate how well someone speaks, except that the "someone" being evaluated is a machine. That said, the task is not entirely without rules. Through focus and decomposition, we can break this big problem down into smaller, quantifiable sub-problems.
To effectively evaluate a dialogue system, we first need to understand the objectives of the dialogue system being evaluated. In other words, what value do we expect the dialogue system to deliver? Once the objectives are clear, we can establish standards around them and derive evaluation methods from these standards.
Classification of Intelligent Dialogue Systems
When discussing the objectives of dialogue systems, it is essential to mention their classification. Generally, human-computer dialogue scenarios fall into three major categories: task-oriented dialogues, question-answering dialogues, and chit-chat dialogues. This is a widely accepted classification in the industry, as the core technologies and implementation methods behind these three types of dialogue systems are fundamentally different.
However, in real-world applications, almost every dialogue-based product exhibits features of at least two of these dialogue system types. Most dialogue systems on the market today combine the ability to solve tasks, answer questions, and engage in chit-chat. Therefore, we cannot simply design evaluation methods for dialogue systems based solely on this classification. Instead, we should step outside the technical framework and identify common characteristics shared by all intelligent dialogue systems from an application perspective, using these characteristics as criteria for designing evaluation methods. I summarize these characteristics as the dialogue context, dialogue scenario, and dialogue purpose.
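To make these shared characteristics a bit more concrete, here is a minimal sketch, purely as an illustration and not part of any standardized framework, of how the three dialogue types and the three characteristics (context, scenario, purpose) could be represented; all class, field, and value names are assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class DialogueType(Enum):
    """The three broadly accepted categories of human-computer dialogue."""
    TASK_ORIENTED = "task-oriented"
    QUESTION_ANSWERING = "question-answering"
    CHIT_CHAT = "chit-chat"


@dataclass
class DialogueSample:
    """One exchange to be evaluated, described by the shared characteristics
    (context, scenario, purpose) rather than by a single system type.
    Field names are hypothetical, for illustration only."""
    context: List[str]             # preceding turns of the conversation
    response: str                  # the system response under evaluation
    scenario: str                  # e.g. "home", "customer service", "in-car"
    purpose: Optional[str] = None  # explicit intention, or None for purposeless chit-chat
    types: Set[DialogueType] = field(default_factory=set)  # real products usually mix several types
```

In practice a single product would populate `types` with more than one value, which is exactly why evaluation has to rest on the three shared characteristics rather than on the type taxonomy itself.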
Considerations for Evaluating Intelligent Dialogue Systems
Dialogue Context – Content of the Conversation
In a dialogue system, the quality of a response is directly tied to what came before it: the preceding question is the primary constraint on how the response should be judged. To evaluate a system-generated response fairly and accurately, the evaluator must therefore consider the preceding content, assessing not only the quality of the current turn but also the logical consistency and emotional coherence of the conversation as a whole. Context plays a crucial role in multi-turn dialogue generation, and the same dialogue content can produce entirely different effects in different contexts. For this reason, evaluating a set of dialogue content requires fully understanding the context in which it occurs.
Dialogue Scenario – The Role Played by the Robot
In different application scenarios, a dialogue system needs to play different roles to meet users' specific needs and intentions. Mainstream application scenarios currently include home settings, early education, customer service, and in-car environments. Dialogue in a specific scenario often involves domain-specific terminology and patterns, draws on relevant knowledge bases or knowledge graphs, and typically calls for conventional responses or standard solutions. Before evaluating a dialogue system, the evaluator should place themselves in the scenario: understanding the role the system is trying to play helps us evaluate it more objectively.
Dialogue Purpose – Topics and Intentions
In real life, natural language conversations between people can be divided into two categories: purposeful and purposeless. Purposeful dialogues can be guided by the questioner or the initiator of the conversation. At the end of the dialogue, we can judge its quality by determining whether the purpose was achieved. In practice, however, the purpose of a dialogue is not always clearly defined. When evaluating dialogues, we cannot focus solely on those with clear purposes and ignore purposeless ones. Even purposeless dialogues involve information exchange and emotional interaction. Therefore, regardless of whether the dialogue has a clear topic or intention, we should pay attention to the information and emotions it conveys.
Evaluation Objectives of Dialogue Systems
The general considerations described above are the prerequisites for evaluating an intelligent dialogue system. With these prerequisites in place, we can define what good dialogue content looks like and use that definition to evaluate a system.
First, good dialogue content should be semantically coherent, with turns that are closely related to their context and logically consistent. Next, it should meet the requirements of the specific application scenario, with clear, unambiguous expression that aligns with user expectations. Finally, whether the topic is open-domain or vertical, and whether or not an intention is achieved, good dialogue content should always convey information and emotion.
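As a rough sketch only, the three criteria above could be organized into a simple scoring rubric; the dimension names, the 0-1 scale, and the weights below are illustrative assumptions, not the evaluation method actually used (that method is the subject of the next article).

```python
from dataclasses import dataclass


@dataclass
class RubricScores:
    """Illustrative per-dimension scores on a 0-1 scale; names are assumptions."""
    coherence: float            # semantically coherent, logically consistent with the context
    scenario_fit: float         # clear, unambiguous, matches the expectations of the scenario
    informativeness: float      # how much information the dialogue conveys
    emotional_resonance: float  # how well it conveys or responds to emotion


def overall_score(s: RubricScores,
                  weights=(0.4, 0.3, 0.2, 0.1)) -> float:
    """Combine the dimensions into a single score. The weights are placeholders
    that a real evaluation task would tune or replace entirely."""
    dims = (s.coherence, s.scenario_fit, s.informativeness, s.emotional_resonance)
    return sum(w * d for w, d in zip(weights, dims))


# Example: an annotator (or an automated metric) fills in the dimensions.
print(overall_score(RubricScores(0.9, 0.8, 0.6, 0.7)))  # roughly 0.79
```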
Summary
At this point, we have transformed the question of 'how to evaluate an intelligent dialogue system' into 'how to define a good intelligent dialogue system.' By examining how dialogue-based products are actually used in real-world scenarios, we have derived the considerations and standards for evaluating dialogue systems. With clear standards in place, the design of intelligent dialogue evaluation tasks becomes more structured.
Generally, dialogue evaluation work is handled from two perspectives: automated evaluation and human evaluation. In the next article, I will introduce the current mainstream automated dialogue evaluation tasks and human annotation methods. I will analyze the shortcomings of these tasks and methods and explain how I combine automated evaluation with human annotation to design intelligent dialogue evaluation methods.