Dialogue systems, also called chatbots, are now used in a wide range of applications. However, they still have some major weaknesses. One key weakness is that they are typically trained from manually-labeled data and/or written with handcrafted rules, and their knowledge bases (KBs) are also compiled by human experts. Due to the huge amount of manual effort involved, they are difficult to scale and also tend to produce many errors ought to their limited ability to understand natural language and the limited knowledge in their KBs. Thus, the level of user satisfactory is often low. In this paper, we propose to dramatically improve this situation by endowing the system the ability to continually learn (1) new world knowledge, (2) new language expressions to ground them to actions, and (3) new conversational skills, during conversation or "on the job" by themselves so that as the systems chat more and more with users, they become more and more knowledgeable and are better and better able to understand diverse natural language expressions and improve their conversational skills. A key approach to achieving these is to exploit the multi-user environment of such systems to self-learn through interactions with users via verb and non-verb means. The paper discusses not only key challenges and promising directions to learn from users during conversation but also how to ensure the correctness of the learned knowledge.
This paper proposes a dually interactive matching network (DIM) for presenting the personalities of dialogue agents in retrieval-based chatbots. This model develops from the interactive matching network (IMN) which models the matching degree between a context composed of multiple utterances and a response candidate. Compared with previous persona fusion approaches which enhance the representation of a context by calculating its similarity with a given persona, the DIM model adopts a dual matching architecture, which performs interactive matching between responses and contexts and between responses and personas respectively for ranking response candidates. Experimental results on PERSONA-CHAT dataset show that the DIM model outperforms its baseline model, i.e., IMN with persona fusion, by a margin of 14.5% and outperforms the current state-of-the-art model by a margin of 27.7% in terms of top-1 accuracy [email protected]
In this paper, we study the problem of employing pre-trained language models for multi-turn response selection in retrieval-based chatbots. A new model, named Speaker-Aware BERT (SA-BERT), is proposed in order to make the model aware of the speaker change information, which is an important and intrinsic property of multi-turn dialogues. Furthermore, a speaker-aware disentanglement strategy is proposed to tackle the entangled dialogues. This strategy selects a small number of most important utterances as the filtered context according to the speakers' information in them. Finally, domain adaptation is performed in order to incorporate the in-domain knowledge into pre-trained language models. Experiments on five public datasets show that our proposed model outperforms the present models on all metrics by large margins and achieves new state-of-the-art performances for multi-turn response selection.
Popular conversational agents frameworks such as Alexa Skills Kit (ASK) and Google Actions (gActions) offer unprecedented opportunities for facilitating the development and deployment of voice-enabled AI solutions in various verticals. Nevertheless, understanding user utterances with high accuracy remains a challenging task with these frameworks. Particularly, when building chatbots with large volume of domain-specific entities. In this paper, we describe the challenges and lessons learned from building a large scale virtual assistant for understanding and responding to equipment-related complaints. In the process, we describe an alternative scalable framework for: 1) extracting the knowledge about equipment components and their associated problem entities from short texts, and 2) learning to identify such entities in user utterances. We show through evaluation on a real dataset that the proposed framework, compared to off-the-shelf popular ones, scales better with large volume of entities being up to 30% more accurate, and is more effective in understanding user utterances with domain-specific entities.
We study the problem of response selection for multi-turn conversation in retrieval-based chatbots. The task requires matching a response candidate with a conversation context, whose challenges include how to recognize important parts of the context, and how to model the relationships among utterances in the context. Existing matching methods may lose important information in contexts as we can interpret them with a unified framework in which contexts are transformed to fixed-length vectors without any interaction with responses before matching. The analysis motivates us to propose a new matching framework that can sufficiently carry the important information in contexts to matching and model the relationships among utterances at the same time. The new framework, which we call a sequential matching framework (SMF), lets each utterance in a context interacts with a response candidate at the first step and transforms the pair to a matching vector. The matching vectors are then accumulated following the order of the utterances in the context with a recurrent neural network (RNN) which models the relationships among the utterances. The context-response matching is finally calculated with the hidden states of the RNN. Under SMF, we propose a sequential convolutional network and sequential attention network and conduct experiments on two public data sets to test their performance. Experimental results show that both models can significantly outperform the state-of-the-art matching methods. We also show that the models are interpretable with visualizations that provide us insights on how they capture and leverage the important information in contexts for matching.
Recently, open domain multi-turn chatbots have attracted much interest from lots of researchers in both academia and industry. The dominant retrieval-based methods use context-response matching mechanisms for multi-turn response selection. Specifically, the state-of-the-art methods perform the context-response matching by word or segment similarity. However, these models lack a full exploitation of the sentence-level semantic information, and make simple mistakes that humans can easily avoid. In this work, we propose a matching network, called sequential sentence matching network (S2M), to use the sentence-level semantic information to address the problem. Firstly and most importantly, we find that by using the sentence-level semantic information, the network successfully addresses the problem and gets a significant improvement on matching, resulting in a state-of-the-art performance. Furthermore, we integrate the sentence matching we introduced here and the usual word similarity matching reported in the current literature, to match at different semantic levels. Experiments on three public data sets show that such integration further improves the model performance.
We study response selection for multi-turn conversation in retrieval-based chatbots. Existing work either concatenates utterances in context or matches a response with a highly abstract context vector finally, which may lose relationships among utterances or important contextual information. We propose a sequential matching network (SMN) to address both problems. SMN first matches a response with each utterance in the context on multiple levels of granularity, and distills important matching information from each pair as a vector with convolution and pooling operations. The vectors are then accumulated in a chronological order through a recurrent neural network (RNN) which models relationships among utterances. The final matching score is calculated with the hidden states of the RNN. An empirical study on two public data sets shows that SMN can significantly outperform state-of-the-art methods for response selection in multi-turn conversation.
Spoken language understanding (SLU) systems, such as goal-oriented chatbots or personal assistants, rely on an initial natural language understanding (NLU) module to determine the intent and to extract the relevant information from the user queries they take as input. SLU systems usually help users to solve problems in relatively narrow domains and require a large amount of in-domain training data. This leads to significant data availability issues that inhibit the development of successful systems. To alleviate this problem, we propose a technique of data selection in the low-data regime that enables us to train with fewer labeled sentences, thus smaller labelling costs. We propose a submodularity-inspired data ranking function, the ratio-penalty marginal gain, for selecting data points to label based only on the information extracted from the textual embedding space. We show that the distances in the embedding space are a viable source of information that can be used for data selection. Our method outperforms two known active learning techniques and enables cost-efficient training of the NLU unit. Moreover, our proposed selection technique does not need the model to be retrained in between the selection steps, making it time efficient as well.
We consider matching with pre-trained contextualized word vectors for multi-turn response selection in retrieval-based chatbots. When directly applied to the task, state-of-the-art models, such as CoVe and ELMo, do not work as well as they do on other tasks, due to the hierarchical nature, casual language, and domain-specific word use of conversations. To tackle the challenges, we propose pre-training a sentence-level and a session-level contextualized word vectors by learning a dialogue generation model from large-scale human-human conversations with a hierarchical encoder-decoder architecture. The two levels of vectors are then integrated into the input layer and the output layer of a matching model respectively. Experimental results on two benchmark datasets indicate that the proposed contextualized word vectors can significantly and consistently improve the performance of existing matching models for response selection.