Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"chatbots": models, code, and papers

When to (or not to) trust intelligent machines: Insights from an evolutionary game theory analysis of trust in repeated games

Jul 22, 2020
The Anh Han, Cedric Perret, Simon T. Powers

The actions of intelligent agents, such as chatbots, recommender systems, and virtual assistants are typically not fully transparent to the user. Consequently, using such an agent involves the user exposing themselves to the risk that the agent may act in a way opposed to the user's goals. It is often argued that people use trust as a cognitive shortcut to reduce the complexity of such interactions. Here we formalise this by using the methods of evolutionary game theory to study the viability of trust-based strategies in repeated games. These are reciprocal strategies that cooperate as long as the other player is observed to be cooperating. Unlike classic reciprocal strategies, once mutual cooperation has been observed for a threshold number of rounds they stop checking their co-player's behaviour every round, and instead only check with some probability. By doing so, they reduce the opportunity cost of verifying whether the action of their co-player was actually cooperative. We demonstrate that these trust-based strategies can outcompete strategies that are always conditional, such as Tit-for-Tat, when the opportunity cost is non-negligible. We argue that this cost is likely to be greater when the interaction is between people and intelligent agents, because of the reduced transparency of the agent. Consequently, we expect people to use trust-based strategies more frequently in interactions with intelligent agents. Our results provide new, important insights into the design of mechanisms for facilitating interactions between humans and intelligent agents, where trust is an essential factor.


Understanding User Satisfaction with Task-oriented Dialogue Systems

Apr 26, 2022
Clemencia Siro, Mohammad Aliannejadi, Maarten de Rijke

$ $Dialogue systems are evaluated depending on their type and purpose. Two categories are often distinguished: (1) task-oriented dialogue systems (TDS), which are typically evaluated on utility, i.e., their ability to complete a specified task, and (2) open domain chatbots, which are evaluated on the user experience, i.e., based on their ability to engage a person. What is the influence of user experience on the user satisfaction rating of TDS as opposed to, or in addition to, utility? We collect data by providing an additional annotation layer for dialogues sampled from the ReDial dataset, a widely used conversational recommendation dataset. Unlike prior work, we annotate the sampled dialogues at both the turn and dialogue level on six dialogue aspects: relevance, interestingness, understanding, task completion, efficiency, and interest arousal. The annotations allow us to study how different dialogue aspects influence user satisfaction. We introduce a comprehensive set of user experience aspects derived from the annotators' open comments that can influence users' overall impression. We find that the concept of satisfaction varies across annotators and dialogues, and show that a relevant turn is significant for some annotators, while for others, an interesting turn is all they need. Our analysis indicates that the proposed user experience aspects provide a fine-grained analysis of user satisfaction that is not captured by a monolithic overall human rating.

* To appear in SIGIR 2022 short paper track 

Dialogue Act Segmentation for Vietnamese Human-Human Conversational Texts

Aug 16, 2017
Thi Lan Ngo, Khac Linh Pham, Minh Son Cao, Son Bao Pham, Xuan Hieu Phan

Dialog act identification plays an important role in understanding conversations. It has been widely applied in many fields such as dialogue systems, automatic machine translation, automatic speech recognition, and especially useful in systems with human-computer natural language dialogue interfaces such as virtual assistants and chatbots. The first step of identifying dialog act is identifying the boundary of the dialog act in utterances. In this paper, we focus on segmenting the utterance according to the dialog act boundaries, i.e. functional segments identification, for Vietnamese utterances. We investigate carefully functional segment identification in two approaches: (1) machine learning approach using maximum entropy (ME) and conditional random fields (CRFs); (2) deep learning approach using bidirectional Long Short-Term Memory (LSTM) with a CRF layer (Bi-LSTM-CRF) on two different conversational datasets: (1) Facebook messages (Message data); (2) transcription from phone conversations (Phone data). To the best of our knowledge, this is the first work that applies deep learning based approach to dialog act segmentation. As the results show, deep learning approach performs appreciably better as to compare with traditional machine learning approaches. Moreover, it is also the first study that tackles dialog act and functional segment identification for Vietnamese.

* 6 pages, 2 figures 

Automatic Detection of Sexist Statements Commonly Used at the Workplace

Jul 08, 2020
Dylan Grosz, Patricia Conde-Cespedes

Detecting hate speech in the workplace is a unique classification task, as the underlying social context implies a subtler version of conventional hate speech. Applications regarding a state-of the-art workplace sexism detection model include aids for Human Resources departments, AI chatbots and sentiment analysis. Most existing hate speech detection methods, although robust and accurate, focus on hate speech found on social media, specifically Twitter. The context of social media is much more anonymous than the workplace, therefore it tends to lend itself to more aggressive and "hostile" versions of sexism. Therefore, datasets with large amounts of "hostile" sexism have a slightly easier detection task since "hostile" sexist statements can hinge on a couple words that, regardless of context, tip the model off that a statement is sexist. In this paper we present a dataset of sexist statements that are more likely to be said in the workplace as well as a deep learning model that can achieve state-of-the art results. Previous research has created state-of-the-art models to distinguish "hostile" and "benevolent" sexism based simply on aggregated Twitter data. Our deep learning methods, initialized with GloVe or random word embeddings, use LSTMs with attention mechanisms to outperform those models on a more diverse, filtered dataset that is more targeted towards workplace sexism, leading to an F1 score of 0.88.

* Published at the PAKDD 2020 Workshop on Learning Data Representation for Clustering 

Inflection-Tolerant Ontology-Based Named Entity Recognition for Real-Time Applications

Dec 05, 2018
Christian Jilek, Markus Schröder, Rudolf Novik, Sven Schwarz, Heiko Maus, Andreas Dengel

A growing number of applications users daily interact with have to operate in (near) real-time: chatbots, digital companions, knowledge work support systems -- just to name a few. To perform the services desired by the user, these systems have to analyze user activity logs or explicit user input extremely fast. In particular, text content (e.g. in form of text snippets) needs to be processed in an information extraction task. Regarding the aforementioned temporal requirements, this has to be accomplished in just a few milliseconds, which limits the number of methods that can be applied. Practically, only very fast methods remain, which on the other hand deliver worse results than slower but more sophisticated Natural Language Processing (NLP) pipelines. In this paper, we investigate and propose methods for real-time capable Named Entity Recognition (NER). As a first improvement step we address are word variations induced by inflection, for example present in the German language. Our approach is ontology-based and makes use of several language information sources like Wiktionary. We evaluated it using the German Wikipedia (about 9.4B characters), for which the whole NER process took considerably less than an hour. Since precision and recall are higher than with comparably fast methods, we conclude that the quality gap between high speed methods and sophisticated NLP pipelines can be narrowed a bit more without losing too much runtime performance.

* 14 pages, 11 figures 

Human-like informative conversations: Better acknowledgements using conditional mutual information

Apr 16, 2021
Ashwin Paranjape, Christopher D. Manning

This work aims to build a dialogue agent that can weave new factual content into conversations as naturally as humans. We draw insights from linguistic principles of conversational analysis and annotate human-human conversations from the Switchboard Dialog Act Corpus to examine humans strategies for acknowledgement, transition, detail selection and presentation. When current chatbots (explicitly provided with new factual content) introduce facts into a conversation, their generated responses do not acknowledge the prior turns. This is because models trained with two contexts - new factual content and conversational history - generate responses that are non-specific w.r.t. one of the contexts, typically the conversational history. We show that specificity w.r.t. conversational history is better captured by Pointwise Conditional Mutual Information ($\text{pcmi}_h$) than by the established use of Pointwise Mutual Information ($\text{pmi}$). Our proposed method, Fused-PCMI, trades off $\text{pmi}$ for $\text{pcmi}_h$ and is preferred by humans for overall quality over the Max-PMI baseline 60% of the time. Human evaluators also judge responses with higher $\text{pcmi}_h$ better at acknowledgement 74% of the time. The results demonstrate that systems mimicking human conversational traits (in this case acknowledgement) improve overall quality and more broadly illustrate the utility of linguistic principles in improving dialogue agents.

* NAACL 2021 

ValueNet: A New Dataset for Human Value Driven Dialogue System

Dec 12, 2021
Liang Qiu, Yizhou Zhao, Jinchao Li, Pan Lu, Baolin Peng, Jianfeng Gao, Song-Chun Zhu

Building a socially intelligent agent involves many challenges, one of which is to teach the agent to speak guided by its value like a human. However, value-driven chatbots are still understudied in the area of dialogue systems. Most existing datasets focus on commonsense reasoning or social norm modeling. In this work, we present a new large-scale human value dataset called ValueNet, which contains human attitudes on 21,374 text scenarios. The dataset is organized in ten dimensions that conform to the basic human value theory in intercultural research. We further develop a Transformer-based value regression model on ValueNet to learn the utility distribution. Comprehensive empirical results show that the learned value model could benefit a wide range of dialogue tasks. For example, by teaching a generative agent with reinforcement learning and the rewards from the value model, our method attains state-of-the-art performance on the personalized dialog generation dataset: Persona-Chat. With values as additional features, existing emotion recognition models enable capturing rich human emotions in the context, which further improves the empathetic response generation performance in the EmpatheticDialogues dataset. To the best of our knowledge, ValueNet is the first large-scale text dataset for human value modeling, and we are the first one trying to incorporate a value model into emotionally intelligent dialogue systems. The dataset is available at

* Paper accepted by AAAI 2022 

Your instruction may be crisp, but not clear to me!

Aug 23, 2020
Pradip Pramanick, Chayan Sarkar, Indrajit Bhattacharya

The number of robots deployed in our daily surroundings is ever-increasing. Even in the industrial set-up, the use of coworker robots is increasing rapidly. These cohabitant robots perform various tasks as instructed by co-located human beings. Thus, a natural interaction mechanism plays a big role in the usability and acceptability of the robot, especially by a non-expert user. The recent development in natural language processing (NLP) has paved the way for chatbots to generate an automatic response for users' query. A robot can be equipped with such a dialogue system. However, the goal of human-robot interaction is not focused on generating a response to queries, but it often involves performing some tasks in the physical world. Thus, a system is required that can detect user intended task from the natural instruction along with the set of pre- and post-conditions. In this work, we develop a dialogue engine for a robot that can classify and map a task instruction to the robot's capability. If there is some ambiguity in the instructions or some required information is missing, which is often the case in natural conversation, it asks an appropriate question(s) to resolve it. The goal is to generate minimal and pin-pointed queries for the user to resolve an ambiguity. We evaluate our system for a telepresence scenario where a remote user instructs the robot for various tasks. Our study based on 12 individuals shows that the proposed dialogue strategy can help a novice user to effectively interact with a robot, leading to satisfactory user experience.


Using Voice and Biofeedback to Predict User Engagement during Requirements Interviews

Apr 06, 2021
Alessio Ferrari, Thaide Huichapa, Paola Spoletini, Nicole Novielli, Davide Fucci, Daniela Girardi

Capturing users engagement is crucial for gathering feedback about the features of a software product. In a market-driven context, current approaches to collect and analyze users feedback are based on techniques leveraging information extracted from product reviews and social media. These approaches are hardly applicable in bespoke software development, or in contexts in which one needs to gather information from specific users. In such cases, companies need to resort to face-to-face interviews to get feedback on their products. In this paper, we propose to utilize biometric data, in terms of physiological and voice features, to complement interviews with information about the engagement of the user on the discussed product-relevant topics. We evaluate our approach by interviewing users while gathering their physiological data (i.e., biofeedback) using an Empatica E4 wristband, and capturing their voice through the default audio-recorder of a common laptop. Our results show that we can predict users' engagement by training supervised machine learning algorithms on biometric data, and that voice features alone can be sufficiently effective. The performance of the prediction algorithms is maximised when pre-processing the training data with the synthetic minority oversampling technique (SMOTE). The results of our work suggest that biofeedback and voice analysis can be used to facilitate prioritization of requirements oriented to product improvement, and to steer the interview based on users' engagement. Furthermore, the usage of voice features can be particularly helpful for emotion-aware requirements elicitation in remote communication, either performed by human analysts or voice-based chatbots.

* 44 pages, submitted for peer-review to Empirical Software Engineering Journal 

Can Language Models Make Fun? A Case Study in Chinese Comical Crosstalk

Jul 02, 2022
Benyou Wang, Xiangbo Wu, Xiaokang Liu, Jianquan Li, Prayag Tiwari, Qianqian Xie

Language is the principal tool for human communication, in which humor is one of the most attractive parts. Producing natural language like humans using computers, a.k.a, Natural Language Generation (NLG), has been widely used for dialogue systems, chatbots, machine translation, as well as computer-aid creation e.g., idea generations, scriptwriting. However, the humor aspect of natural language is relatively under-investigated, especially in the age of pre-trained language models. In this work, we aim to preliminarily test whether NLG can generate humor as humans do. We build a new dataset consisting of numerous digitized Chinese Comical Crosstalk scripts (called C$^3$ in short), which is for a popular Chinese performing art called `Xiangsheng' since 1800s. (For convenience for non-Chinese speakers, we called `crosstalk' for `Xiangsheng' in this paper.) We benchmark various generation approaches including training-from-scratch Seq2seq, fine-tuned middle-scale PLMs, and large-scale PLMs (with and without fine-tuning). Moreover, we also conduct a human assessment, showing that 1) large-scale pretraining largely improves crosstalk generation quality; and 2) even the scripts generated from the best PLM is far from what we expect, with only 65% quality of human-created crosstalk. We conclude, humor generation could be largely improved using large-scaled PLMs, but it is still in its infancy. The data and benchmarking code is publicly available in \url{}.

* Submitted to 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks