Deconstruction of Intelligent Q&A Technology in Chatbot Platforms
-
The author analyzes intelligent Q&A chatbot technology, starting from practical scenarios to explain its functional logic.
The story begins with the personal computer era of the late 1990s, when people used keyboards and mice to send requests to shopping websites and buy the products they wanted. Over the following decade, the internet developed rapidly and smartphones emerged, and people grew accustomed to touch interaction, operating phones and tablets with even greater flexibility. Products built around these interaction styles steadily found their way into millions of apps, all designed around the GUI (Graphical User Interface).
Since humans can issue commands not only with manual actions (keyboards, mice, and touch) but also by speaking, voice-driven human-computer interaction has likewise become an area of active exploration.
However, enabling machines to understand human language is a highly challenging task. In recent years, the rise of deep learning and the rapid development of speech recognition and natural language understanding have made this interaction model increasingly viable. It is believed that in the near future, humans will gradually enter the era of CUI (Conversational User Interface).
In daily life, intelligent dialogue is widely used in scenarios full of repetitive conversations, such as customer service and marketing, or as a complement to the GUI that gives users a more efficient, personalized experience. It is even built directly into hardware such as smart speakers, smart home devices, and smart navigation systems, where it carries human-computer interaction entirely on its own.
Based on the intelligence level of dialogue, we can categorize intelligent Q&A into five stages: single-turn Q&A, multi-turn conversations, intent reasoning, personalization, and emotional interaction.
From the perspective of Q&A types, we can further divide them into five major categories: Community QA (matching against community question-answer pairs), KBQA (question answering over knowledge bases), TableQA (over tables), PassageQA (over text passages), and VQA (visual question answering).
Given the widespread application of intelligent Q&A, this article takes the approach of how to build an intelligent Q&A system as a starting point and provides an in-depth yet accessible introduction to attempts made in Community QA.
Let’s look at a few practical examples to understand the usage scenarios of Q&A:
With these examples in mind, we now have a more concrete picture of Q&A scenarios. The natural question is how to proceed, and we begin by modeling the problem.
First, we consider the structured representation of knowledge:
Why do we manage knowledge this way rather than with knowledge graphs or tables?
In practice, this way of managing knowledge stems from real business scenarios. It is highly maintainable: multiple synonymous questions are grouped under a single knowledge point, which reduces the workload of maintaining answers, and the questions under one knowledge point also make excellent training data, as the sketch below illustrates.
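As a concrete illustration, here is a minimal sketch of how such a knowledge point might be represented; the field names and example data are assumptions for illustration, not the production schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class KnowledgePoint:
    """One knowledge point: a standard question, its synonymous
    variants, and a single answer maintained in one place."""
    kp_id: str
    standard_question: str
    similar_questions: List[str] = field(default_factory=list)
    answer: str = ""

# Several synonymous questions share one maintained answer, and the
# questions double as labeled training data for the models below.
kp = KnowledgePoint(
    kp_id="kp_001",
    standard_question="What should I do if the baby has a fever?",
    similar_questions=[
        "How do I bring down a baby's fever?",
        "The baby is running a temperature, what now?",
    ],
    answer="...",  # maintained once per knowledge point
)
```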
Now, with a domain-specific knowledge base, when a user asks a question, we also need a Q&A model to find the question that best matches the user’s query and provide the corresponding answer. Here, we adopt a retrieval + matching + ranking architecture.
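The following sketch shows how the retrieval + matching + ranking flow might fit together; `retriever`, `matcher`, and `kb` are hypothetical placeholders for the components discussed in the rest of this article, and the threshold value is illustrative:

```python
def answer_query(query, kb, retriever, matcher, top_k=20, threshold=0.5):
    """Schematic retrieval + matching + ranking flow."""
    candidates = retriever.search(query, top_k=top_k)        # retrieval
    if not candidates:
        return None
    scored = [(c, matcher.score(query, c.question))          # matching
              for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)      # ranking
    best, best_score = scored[0]
    # Reject when even the best candidate is not relevant enough.
    return kb.answer_of(best.kp_id) if best_score >= threshold else None
```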
The following is an example of applying the knowledge base and Q&A model in a maternal and child scenario:
Before formally building the Q&A model, we need to consider what data is currently available and what data is required to support subsequent work.
General-domain Q&A data from platforms like Tieba, Douban, Weibo, and Zhidao can be used to train word vectors or to calculate co-occurrence and mutual information dictionaries. Manually annotated Q/P pairs can be used to train supervised classification models. Vertical-domain knowledge bases can be used to train domain-specific classification models or fine-tune word vectors, and they also serve as effective evaluation data.
The overall QA architecture is shown in Figure 6. Below, we briefly introduce the ideas behind each iteration.
1. BoW+LR
The first iteration introduced only the Bag of Words (BoW) model, with five representative feature dimensions fed into a logistic regression (LR) classifier.
This feature set is derived from statistics over large-scale text and has only weak capacity to represent the semantics of a sentence.
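Since the concrete feature dimensions appear only in a figure, the sketch below uses plausible stand-in statistical features (overlap counts, length differences, and the like), which are assumptions for illustration; the scikit-learn classifier matches the LR in the iteration's name:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bow_features(q_tokens, p_tokens):
    """Stand-in statistical features for a Q/P pair (hypothetical;
    the article's actual five dimensions appear only in a figure)."""
    q, p = set(q_tokens), set(p_tokens)
    overlap = len(q & p)
    return np.array([
        overlap,                              # shared word count
        overlap / max(len(q), 1),             # coverage of Q
        overlap / max(len(p), 1),             # coverage of P
        abs(len(q_tokens) - len(p_tokens)),   # length difference
        len(q | p),                           # joint vocabulary size
    ])

# X: feature rows for labeled Q/P pairs; y: 1 = same intent, 0 = not.
# clf = LogisticRegression().fit(X, y)
# relevance = clf.predict_proba(X_new)[:, 1]
```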
2. BoW+WE+LR
The second iteration gave the model some semantic representation capability. Readers familiar with deep learning and natural language processing will know that word2vec has made outstanding contributions to many tasks. Its two training schemes capture the relationship between a word and its surrounding words from different directions: CBOW predicts the current word from its context, while skip-gram predicts the context from the current word. After training, we obtain a vector representation for each word in the vocabulary.
Based on the trained word vectors, we take an IDF-weighted average of the word vectors to obtain sentence-level vector representations of Q and P, and then use cosine similarity to measure the semantic relevance between the two.
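A minimal sketch of this weighted-average-plus-cosine feature, assuming `w2v` is a mapping from word to vector (e.g., trained word2vec embeddings) and `idf` is a precomputed IDF dictionary:

```python
import numpy as np

def sentence_vector(tokens, w2v, idf, dim=300):
    """IDF-weighted average of word vectors as a sentence embedding."""
    vecs, weights = [], []
    for t in tokens:
        if t in w2v:                         # skip OOV tokens
            vecs.append(w2v[t])
            weights.append(idf.get(t, 1.0))  # default weight for rare words
    if not vecs:
        return np.zeros(dim)
    return np.average(vecs, axis=0, weights=weights)

def cosine(u, v, eps=1e-8):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

# relevance = cosine(sentence_vector(q_tokens, w2v, idf),
#                    sentence_vector(p_tokens, w2v, idf))
```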
Cosine similarity over word2vec vectors essentially describes the distributional consistency of two words. Since word2vec does not consider word order, a high cosine similarity indicates that the two words tend to co-occur, or to co-occur with the same set of words. The drawback of this feature is that highly similar words are often interchangeable in context yet not semantically identical.
For example, Q = "What to do if the baby has a cold?" and P = "What to do if the baby has a fever?" The words "cold" and "fever" can be interchanged, and the sentences remain reasonable. Moreover, since they often co-occur with the same set of words, they have similar distribution features and high relevance, but their semantics are not the same.
Next, we introduce another semantic measure: WMD (Word Mover's Distance). It describes the minimum cost of matching each word in Q to the words in P, with each word carrying its own weight. This is a way of capturing semantics through interactions between the two sentences.
When computing the relevance of a Q/P pair with WMD, we first segment the sentences and remove stop words. Each word in Q then "transfers" its semantic mass to words in P, with the transfer cost measured by the distance between the two words' word2vec vectors (e.g., cosine distance).
When two words are semantically similar, we can transfer more; if they are semantically distant, we can choose to transfer less or not at all. The goal is to minimize the weighted sum of all transfers, with word frequency used as the weighting feature. A specific example is shown in Figure 10.
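If gensim is available, WMD can be computed directly from pretrained vectors. Note that gensim's implementation uses Euclidean distance between normalized vectors as the transfer cost, which differs slightly from the cosine formulation sketched above, and the file path here is illustrative:

```python
from gensim.models import KeyedVectors

# Pretrained word2vec vectors; the path is illustrative.
w2v = KeyedVectors.load_word2vec_format("w2v_general.bin", binary=True)

# Tokenized sentences with stop words already removed.
q = ["baby", "cold", "do"]
p = ["baby", "fever", "do"]

# Lower distance = cheaper to "move" Q's words onto P's words.
# gensim solves the underlying optimal-transport problem internally.
distance = w2v.wmdistance(q, p)
print(distance)
```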
Since WMD also relies heavily on word2vec vectors, the shortcomings of the word2vec cosine feature noted above carry over to the WMD feature: it ignores word order, handles OOV (out-of-vocabulary) words poorly, generalizes weakly in semantics, and struggles to distinguish similar intents.
3. BoW+WE+SE+fine-tune
The first two iterations did not consider data within the knowledge base, making them more suitable for knowledge bases with little or no corpus. When the knowledge base reaches a certain scale, as mentioned earlier, similar questions under the same knowledge point serve as excellent training data.
We adopted the fastText model, leveraging the fact that questions under the same knowledge point are semantically identical or similar as a supervisory signal to train a classification model. This model directly uses the words of a question to predict the knowledge point to which the question belongs.
fastText, proposed by Tomas Mikolov and colleagues, improves on some shortcomings of word2vec. Unlike word2vec's CBOW model, its input is not a context window but the entire sentence, and it additionally accepts subword and n-gram features, capturing more information, such as word prefixes and suffixes and some word-order signal. Compared with other deep learning models, fastText has a simple structure and is essentially a linear model. It is less suited to long texts or linearly inseparable samples, but fits short-text classification tasks that are close to linearly separable, and it reaches good results quickly even with limited training data.
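A minimal sketch using the fastText Python bindings, where every similar question is labeled with its knowledge point ID; the file name and hyperparameters are illustrative:

```python
import fasttext

# train.txt holds one example per line: "__label__<kp_id> <question>",
# i.e., every similar question labeled with its knowledge point.
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.5,
    epoch=25,
    wordNgrams=2,     # word n-grams capture some word-order signal
    minn=2, maxn=4,   # subword (character n-gram) features
)

labels, probs = model.predict("What to do if the baby has a fever?")
print(labels[0], probs[0])  # predicted knowledge point and confidence
```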
Additionally, fastText produces a set of word vectors as a by-product of training. Our experiments showed that using fastText vectors trained on the knowledge base to fine-tune word2vec vectors trained on large general corpora improves in-domain Q&A performance to some extent.
4. BoW+WE+SE+DM+fine-tune
Previously, we used the semantically similar questions in the knowledge base as a supervision signal to extract the semantic information of each knowledge point. Our ultimate goal, however, is to judge the relevance between a user's input and the similar questions. Here we introduce a deep matching model, ESIM (Enhanced LSTM for Natural Language Inference), which is trained with supervision over the very pair of sentences being compared, letting us observe how interaction between the sentence pair affects the model.
The figure above shows the network structure of ESIM, which performs well on many short-text matching tasks. The key is that the two input sentences each pass through an embedding layer and a BiLSTM. After the BiLSTM has learned the relationship between each word and its context within a sentence, an attention operation computes word-by-word similarity between the two sentences and uses it to update the representations before inference. The two sentences therefore interact during training, allowing the model to learn more global information than similar networks that only compute a distance in the final layer.
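A condensed PyTorch sketch of the ESIM architecture just described (Chen et al., 2017); the hyperparameters are illustrative, and details such as masking and dropout are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ESIM(nn.Module):
    """Sketch of ESIM: encode both sentences with a BiLSTM, soft-align
    them with attention, enhance the representations, compose with a
    second BiLSTM, pool, and classify."""

    def __init__(self, vocab_size, embed_dim=300, hidden=300, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden, bidirectional=True, batch_first=True)
        self.composer = nn.LSTM(8 * hidden, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(8 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, a, b):  # a, b: (batch, len) token-id tensors
        a_enc, _ = self.encoder(self.embed(a))   # (batch, len_a, 2h)
        b_enc, _ = self.encoder(self.embed(b))

        # Soft alignment: each word attends over the other sentence.
        e = torch.bmm(a_enc, b_enc.transpose(1, 2))       # (batch, len_a, len_b)
        a_tilde = torch.bmm(F.softmax(e, dim=2), b_enc)   # b aligned to a
        b_tilde = torch.bmm(F.softmax(e, dim=1).transpose(1, 2), a_enc)

        # Enhancement: concatenate encodings with difference and product.
        a_m = torch.cat([a_enc, a_tilde, a_enc - a_tilde, a_enc * a_tilde], dim=-1)
        b_m = torch.cat([b_enc, b_tilde, b_enc - b_tilde, b_enc * b_tilde], dim=-1)

        a_comp, _ = self.composer(a_m)
        b_comp, _ = self.composer(b_m)

        # Pooling: average and max over time for both sentences.
        v = torch.cat([a_comp.mean(1), a_comp.max(1).values,
                       b_comp.mean(1), b_comp.max(1).values], dim=-1)
        return self.classifier(v)
```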
5. BERT+MTL+fine-tune
Of course, the academic field is constantly evolving. For models with good performance, we also need to experiment to find suitable conditions and scenarios for industrial application.
After BERT swept 11 NLP benchmark tasks, we applied it to the knowledge-point classification task, hoping its pretrained representations would improve classification performance. In addition, we fine-tuned BERT with multi-task learning (MTL) on the knowledge-base data, and then further fine-tuned the resulting BERT-MTL model for each individual task.
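A minimal fine-tuning sketch with the Hugging Face transformers library; the checkpoint name and label count are assumptions for illustration (in our setting the starting point would be the BERT-MTL checkpoint rather than the public base model):

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese",
    num_labels=50,  # e.g., one classifier over 50 knowledge points
)

batch = tokenizer(
    ["What to do if the baby has a fever?"],
    padding=True, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1))  # predicted knowledge point index
```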
Did performance improve, and by how much? Eyeballing the results is obviously not reliable feedback, so we evaluated the improvements scientifically. For evaluation, we selected six different domains, each with 50 knowledge points; each knowledge point had 12 similar questions for training and 3 for evaluation. Example data is shown in Figure 15. On this dataset we measured precision, recall, and F1. The results are shown in Figure 16: without threshold constraints, accuracy improved from 0.8 to 0.968.
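For reference, these metrics can be computed with scikit-learn; the labels below are illustrative stand-ins for the held-out similar questions:

```python
from sklearn.metrics import precision_recall_fscore_support

# Illustrative stand-ins: true vs. predicted knowledge points for the
# held-out similar questions (3 per knowledge point in our evaluation).
y_true = ["kp_001", "kp_002", "kp_001", "kp_003"]
y_pred = ["kp_001", "kp_002", "kp_003", "kp_003"]

p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```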
Iteration is a gradual process. Some might wonder how to decide which features to use or which model to choose. Across our iterations, bad-case analysis proved crucial. Feature design comes partly from an intuitive understanding of the problem: for example, we need to represent sentence semantics at multiple levels (statistics, word representations, sentence representations, inter-sentence interaction, and so on). It also comes from compensating for the model's existing bad cases, designing targeted features based on the patterns and tendencies those cases reveal.
Meanwhile, the academic field continues to update new models. Technologies popular three years ago are highly likely to be completely replaced today, so we need to keep up with the times.
The entire upgrade of the intelligent Q&A system revolves around four steps: first, understand the essence of the task and model the problem appropriately; then survey and select suitable tools and models to implement it; iterate steadily from shallow to deep, forming a closed loop of data, model, and feedback; and finally, keep learning, embracing change and new technology.
References:
[1] Nan Duan. Building Informational Bot (InfoBot) with Question Answering & Generation. In ACL, 2017.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, 2013, pages 3111–3119.
[3] Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for Natural Language Inference. In ACL, 2017.
[4] Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. In ICML, 2014.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.
[6] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding. arXiv preprint arXiv:1904.09482, 2019.
[7] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-Task Deep Neural Networks for Natural Language Understanding. arXiv preprint arXiv:1901.11504, 2019.