How to Evaluate an Intelligent Dialogue System (Part 3)

    This article will focus on the implementation details of distributed evaluation methods, introducing the sampling of labeled data for evaluation systems, the design of annotation questions, and the technical principles behind them.

    In the previous chapter, we introduced the popular evaluation methods for intelligent dialogue systems in the industry, including manual evaluation and automated evaluation. We discussed the advantages and disadvantages of different evaluation methods and explained the importance and necessity of manual evaluation in intelligent dialogue assessment tasks.

    Subsequently, we introduced the distributed evaluation method, which breaks a problem down into six dimensions: grammatical quality, content quality, content relevance, logical relevance, emotional intensity, and divergence. These six dimensions serve as the perspectives from which a dialogue system's responses are evaluated.

    The distributed evaluation method involves splitting a problem into multiple dimensions, decomposing it into actionable questions, and answering them separately. The results are then consolidated to compute a meaningful outcome. Complex problems often require nuanced answers, and the distributed statistical approach is particularly effective at handling large-scale, intricate issues.
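    The consolidation step can be illustrated with a short sketch. The dimension names below come from the article, but the aggregation scheme (averaging True/False answers per dimension, then averaging dimensions with equal weight) is an assumption made for illustration, not the author's exact formula.

    ```python
    from collections import defaultdict
    from statistics import mean

    # The six dimensions of the distributed evaluation method.
    DIMENSIONS = [
        "grammatical_quality", "content_quality", "content_relevance",
        "logical_relevance", "emotional_intensity", "divergence",
    ]

    def consolidate(annotations):
        """Aggregate True/False judgments into dimension scores and an overall score.

        `annotations` is a list of dicts such as
        {"dimension": "content_quality", "answer": True}, one per judgment.
        Equal weighting across dimensions is an illustrative assumption.
        """
        answers_by_dim = defaultdict(list)
        for item in annotations:
            answers_by_dim[item["dimension"]].append(1.0 if item["answer"] else 0.0)

        dimension_scores = {}
        for dim in DIMENSIONS:
            values = answers_by_dim[dim]
            dimension_scores[dim] = mean(values) if values else 0.0
        return dimension_scores, mean(dimension_scores.values())
    ```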

    Next, let’s focus on the implementation details of the distributed evaluation method. This chapter will introduce the sampling of labeled data for evaluation systems, the design of annotation questions, and the technical principles behind them.

    To create an effective evaluation annotation task, we first need to prepare a dataset—a set of questions (or queries, as the input is not limited to interrogative sentences) for evaluating the dialogue system.

    The query set for dialogue evaluation typically consists of a collection of natural language sentences in various forms and categories. Currently, major NLP competitions and research reports provide open-source datasets for researchers. However, these datasets are mostly in English. Therefore, we need to compile a targeted Chinese sample dataset for evaluation tasks.

    Chinese is a rich and complex language. Objectively speaking, no small dataset can cover every dialogue intent and usage scenario in Chinese, so in theory the larger the sample, the better (training corpora for language models often run to hundreds of millions of entries).

    However, to facilitate the execution of evaluation annotation tasks, we aim to keep the number of annotation questions as small as possible (so annotators can complete the task quickly). Hence, the key challenge lies in covering as many language topics as possible with minimal data.
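    Purely as an illustration of this coverage goal (not the author's actual procedure), a greedy pass over candidate queries can keep the annotation set small while still touching every target topic:

    ```python
    def select_minimal_cover(candidates, target_topics):
        """Greedily pick queries until every target topic is covered at least once.

        `candidates` is an iterable of (query, topic) pairs; the greedy strategy
        and the one-query-per-topic goal are both illustrative assumptions.
        """
        selected, uncovered = [], set(target_topics)
        for query, topic in candidates:
            if topic in uncovered:
                selected.append(query)
                uncovered.discard(topic)
            if not uncovered:
                break
        return selected
    ```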

    Here, the author shares their method for collecting and organizing sample data, hoping readers can adapt it to create their own evaluation datasets.

    For confidentiality reasons, the author cannot fully disclose the data here (interested readers can search for "NLPCC2019 – Open Domain Dialogue System Evaluation Task" for more details).

    Of course, the dataset created by the author may not be the best. If better datasets are available, the author welcomes sharing and discussion.

    Data Source: Real user logs and publicly available data from social media.

    Collection Method: Script filtering and manual annotation to extract a data pool from billions of raw data entries.
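    The exact filtering rules are not disclosed, but a first scripted pass over raw logs might look like the sketch below (the length bounds, link removal, and exact-duplicate removal are assumptions for illustration), with manual annotation applied to whatever survives:

    ```python
    import re

    def filter_raw_logs(lines, min_len=2, max_len=50):
        """Rule-based first pass over raw log entries before manual annotation.

        The concrete rules here are illustrative; the article does not list
        the filters actually used.
        """
        seen, pool = set(), []
        for line in lines:
            text = line.strip()
            if not (min_len <= len(text) <= max_len):
                continue                          # drop empty or overly long entries
            if re.search(r"https?://", text):
                continue                          # drop entries that are mostly links
            if text in seen:
                continue                          # remove exact duplicates
            seen.add(text)
            pool.append(text)
        return pool
    ```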

    Topic Classification:

    Sample Data:

    Data Allocation: Out of 1,700 questions, 200 are used as test questions, and the remaining 1,500 are used for actual evaluation.
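    A minimal sketch of this allocation step, assuming a simple seeded random split (a topic-stratified split would follow the same pattern):

    ```python
    import random

    def allocate(questions, n_test=200, seed=42):
        """Hold out `n_test` test questions; the rest form the evaluation set."""
        shuffled = list(questions)
        random.Random(seed).shuffle(shuffled)
        return shuffled[:n_test], shuffled[n_test:]

    # With the article's 1,700 collected questions this yields 200 / 1,500.
    ```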

    With the evaluation dataset prepared, we next need to design the specific annotation tasks. To accurately and efficiently evaluate a dialogue system’s performance, annotation tasks should adhere to two principles: objectivity and conciseness.

    The underlying methodology for dialogue evaluation involves assessing a dialogue system across six dimensions. During evaluation, we primarily judge whether the system meets the information characteristics of these dimensions. To make the judgment more intuitive, we decompose the six-dimensional evaluation into 12 closed-ended questions (True or False). Closed-ended questions help evaluators minimize subjective thinking and provide rational judgments more quickly.

    Below are the 12 questions designed by the author for dialogue evaluation, along with their corresponding dimensions:

    Grammatical Quality:

    Content Quality:

    Content Relevance:

    Logical Relevance:

    Emotional Intensity:

    Divergence:

    Annotation tasks include basic annotations and special annotations.

    Typically, when evaluating an answer, we first determine whether its content is acceptable. If the response is acceptable, we proceed to evaluate it across multiple dimensions. If not, we skip further questions and label the Q&A pair as unqualified.

    We combine "Does the response conform to correct grammar?" and "Is the response content unacceptable?" into a special annotation type, while all other evaluation questions are basic annotation types.

    Although most of the annotation questions above are derived from the distributed dialogue evaluation method, they still need solid technical theory to support them.

    On one hand, classical technical theories lend credibility to the evaluation method. On the other hand, mathematical models from these theories can partially automate the evaluation of dialogue systems.

    Beyond effectively assessing the performance of intelligent dialogue products on the market, this approach also has value for scientific research.

    Grammar and Content Quality: draws on common NLP evaluation metrics such as perplexity (PPL), BLEU, and Distinct.
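    Of these, Distinct-n is the simplest to show: the ratio of unique n-grams to total n-grams across a set of responses, where higher values indicate less repetitive output. A minimal sketch (whitespace tokenisation is a simplification; Chinese text would normally be segmented into characters or words first):

    ```python
    def distinct_n(responses, n=2):
        """Distinct-n: unique n-grams divided by total n-grams across responses."""
        total, unique = 0, set()
        for response in responses:
            tokens = response.split()
            ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            total += len(ngrams)
            unique.update(ngrams)
        return len(unique) / total if total else 0.0

    print(distinct_n(["i like tea", "i like coffee", "see you later"]))  # 5/6 ≈ 0.83
    ```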

    Relevance and Divergence: statistical analysis of named-entity changes (via NER) across turns, together with LSTM-based deep learning models for computing multi-turn dialogue probabilities.
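    As a rough illustration of the entity-change statistic (the LSTM probability model is not sketched, and entity extraction is left abstract), the overlap between the entities of consecutive turns can serve as a simple relevance-versus-divergence signal:

    ```python
    def entity_shift(prev_entities, curr_entities):
        """Compare named entities across turns as a relevance/divergence proxy.

        High overlap suggests the turn stays on topic (relevance); many newly
        introduced entities suggest the dialogue is diverging.
        """
        prev, curr = set(prev_entities), set(curr_entities)
        union = prev | curr
        overlap = len(prev & curr) / len(union) if union else 1.0
        return {"entity_overlap": overlap, "new_entities": sorted(curr - prev)}
    ```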

    Emotional Intensity: grounded in sentiment analysis algorithms and their underlying theory.
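    One simple, purely illustrative way to approximate emotional intensity is a lexicon-based score; the tiny lexicon below is invented for the example, and production systems would use trained sentiment models instead:

    ```python
    # Invented mini-lexicon: word -> sentiment weight in [-1, 1].
    SENTIMENT_LEXICON = {"great": 1.0, "love": 0.8, "fine": 0.2,
                         "bad": -0.8, "hate": -1.0}

    def emotional_intensity(response):
        """Mean absolute lexicon weight: 0 is neutral, near 1 is strongly emotional."""
        weights = [abs(SENTIMENT_LEXICON[w]) for w in response.lower().split()
                   if w in SENTIMENT_LEXICON]
        return sum(weights) / len(weights) if weights else 0.0

    print(emotional_intensity("i love this it is great"))  # -> 0.9
    ```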

    This chapter introduced the implementation details of the distributed dialogue system evaluation method, including data classification and sampling, annotation question design, and the underlying technical principles. We elaborated on methods for obtaining labeled data and identifying language data topic types. Additionally, we introduced 12 closed-ended questions derived from six-dimensional information characteristics. By mapping data to questions, we create an actionable and statistically viable dialogue evaluation annotation task.

    Different dialogue systems have different focuses. Some excel at single-turn Q&A dialogues, while others perform better in multi-turn dialogue scenarios. In the next two articles, we will explain how to use the distributed dialogue evaluation method to assess single-turn and multi-turn dialogue systems, respectively.
