
Alexandros Papangelis


"What do others think?": Task-Oriented Conversational Modeling with Subjective Knowledge

May 20, 2023
Chao Zhao, Spandana Gella, Seokhwan Kim, Di Jin, Devamanyu Hazarika, Alexandros Papangelis, Behnam Hedayatnia, Mahdi Namazifar, Yang Liu, Dilek Hakkani-Tur

Task-oriented Dialogue (TOD) systems assist users in accomplishing specific goals, such as booking a hotel or a restaurant. Traditional TOD systems rely on domain-specific APIs/DBs or external factual knowledge to generate responses, which cannot accommodate subjective user requests (e.g., "Is the WIFI reliable?" or "Does the restaurant have a good atmosphere?"). To address this issue, we propose a novel task of subjective-knowledge-based TOD (SK-TOD). We also propose the first corresponding dataset, which contains subjective knowledge-seeking dialogue contexts and manually annotated responses grounded in subjective knowledge sources. When evaluated with existing TOD approaches, we find that this task poses new challenges such as aggregating diverse opinions from multiple knowledge snippets. We hope this task and dataset can promote further research on TOD and subjective content understanding. The code and the dataset are available at https://github.com/alexa/dstc11-track5.
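
To make the task structure concrete, below is a minimal sketch of an SK-TOD-style pipeline: detect a subjective knowledge-seeking turn, retrieve relevant review snippets, and generate a response that aggregates the opinions they express. The heuristics and names are illustrative stand-ins, not the models or data format used in the dstc11-track5 repository.

from collections import Counter

# Toy subjective-knowledge base: entity -> list of (review snippet, polarity).
# In the real SK-TOD setting these snippets come from annotated user reviews.
REVIEWS = {
    "hotel_1": [
        ("The WIFI was fast and stable during my whole stay.", "pos"),
        ("WIFI kept dropping every evening.", "neg"),
        ("Reliable internet, I worked remotely without issues.", "pos"),
    ]
}

def is_knowledge_seeking(user_turn: str) -> bool:
    # Stand-in for a learned binary classifier over the dialogue context.
    subjective_cues = ("reliable", "good", "atmosphere", "clean", "noisy", "comfortable")
    return user_turn.strip().endswith("?") and any(c in user_turn.lower() for c in subjective_cues)

def select_snippets(user_turn: str, entity: str, top_k: int = 3):
    # Stand-in for a retrieval model: rank snippets by word overlap with the question.
    question_words = set(user_turn.lower().split())
    scored = [
        (len(question_words & set(text.lower().split())), text, polarity)
        for text, polarity in REVIEWS[entity]
    ]
    return [(t, p) for s, t, p in sorted(scored, reverse=True)[:top_k] if s > 0]

def generate_response(snippets) -> str:
    # Stand-in for a grounded generator that aggregates diverse opinions.
    if not snippets:
        return "I could not find any guest opinions about that."
    counts = Counter(polarity for _, polarity in snippets)
    return (f"{counts['pos']} of {len(snippets)} reviewers were positive about this, "
            f"while {counts['neg']} reported problems.")

turn = "Is the WIFI reliable?"
if is_knowledge_seeking(turn):
    print(generate_response(select_snippets(turn, "hotel_1")))

In the actual task each of these stand-ins would be a learned component, and the aggregation step is where the challenge of combining diverse opinions, noted in the abstract, shows up.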

PLACES: Prompting Language Models for Social Conversation Synthesis

Feb 17, 2023
Maximillian Chen, Alexandros Papangelis, Chenyang Tao, Seokhwan Kim, Andy Rosenbaum, Yang Liu, Zhou Yu, Dilek Hakkani-Tur

Collecting high-quality conversational data can be very expensive for most applications and infeasible for others due to privacy, ethical, or similar concerns. A promising direction to tackle this problem is to generate synthetic dialogues by prompting large language models. In this work, we use a small set of expert-written conversations as in-context examples to synthesize a social conversation dataset using prompting. We perform several thorough evaluations of our synthetic conversations compared to human-collected conversations. These include human evaluation of various dimensions of conversation quality directly on the synthesized conversations, and interactive human evaluation of chatbots fine-tuned on the synthetically generated dataset. We additionally demonstrate that this prompting approach is generalizable to multi-party conversations, providing potential to create new synthetic data for multi-party tasks. Our synthetic multi-party conversations were rated more favorably across all measured dimensions compared to conversation excerpts sampled from a human-collected multi-party dataset.

* In Findings of EACL 2023. 25 pages, 4 figures, 26 tables. Code available at https://github.com/alexa/PLACES 
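
As a rough illustration of the prompting recipe described above (expert-written conversations used as in-context examples), here is a minimal sketch. The prompt wording, the topic conditioning, and all function names are assumptions for illustration, and the generate callable stands in for whatever large language model is used; the actual prompts are in the PLACES repository.

from typing import Callable, List

def build_places_style_prompt(examples: List[str], topic: str) -> str:
    """Assemble a few-shot prompt from expert-written conversations.

    The exact prompt wording used in PLACES may differ; this only
    illustrates the in-context prompting recipe the abstract describes.
    """
    header = "The following are friendly conversations between two people.\n\n"
    shots = "\n\n".join(examples)
    target = f"\n\nThe following is a conversation about {topic}.\nAlice:"
    return header + shots + target

def synthesize_dialogue(generate: Callable[[str], str],
                        examples: List[str], topic: str) -> str:
    # `generate` is any language-model completion function supplied by the
    # caller (e.g., a Hugging Face pipeline or an API call).
    prompt = build_places_style_prompt(examples, topic)
    return generate(prompt)

# Example usage with a dummy generator so the sketch runs end to end.
expert_examples = [
    "The following is a conversation about cooking.\n"
    "Alice: Have you tried baking bread at home?\n"
    "Bob: Yes! My first loaf was dense, but it got better.",
]
print(synthesize_dialogue(lambda p: "Alice: I just got back from a hiking trip.",
                          expert_examples, "hiking"))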

Selective In-Context Data Augmentation for Intent Detection using Pointwise V-Information

Feb 10, 2023
Yen-Ting Lin, Alexandros Papangelis, Seokhwan Kim, Sungjin Lee, Devamanyu Hazarika, Mahdi Namazifar, Di Jin, Yang Liu, Dilek Hakkani-Tur

This work focuses on in-context data augmentation for intent detection. Having found that augmentation via in-context prompting of large pre-trained language models (PLMs) alone does not improve performance, we introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that can measure the usefulness of a datapoint for training a model. Our method first fine-tunes a PLM on a small seed of training data and then synthesizes new datapoints - utterances that correspond to given intents. It then employs intent-aware filtering, based on PVI, to remove datapoints that are not helpful to the downstream intent classifier. Our method is thus able to leverage the expressive power of large language models to produce diverse training data. Empirical results demonstrate that our method can produce synthetic training data that achieve state-of-the-art performance on three challenging intent detection datasets under few-shot settings (1.28% absolute improvement in 5-shot and 1.18% absolute in 10-shot, on average) and perform on par with the state-of-the-art in full-shot settings (within 0.01% absolute, on average).

* Accepted at EACL 2023 
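
The filtering step can be sketched directly from the definition of pointwise V-information: PVI(x -> y) compares how much more probable the label y is given the utterance x than given no input, under classifiers fine-tuned on the seed data. The sketch below assumes two such probability functions are available and applies a per-intent threshold; the thresholding scheme and all names are illustrative, not the paper's exact implementation.

import math
from typing import Callable, Dict, List, Tuple

def pvi(p_with_input: Callable[[str, str], float],
        p_null_input: Callable[[str], float],
        utterance: str, intent: str) -> float:
    """Pointwise V-information of an (utterance, intent) pair.

    PVI(x -> y) = -log2 q(y | null) + log2 q(y | x), where both q's are
    classifiers fine-tuned on the small seed set (one seeing the inputs,
    one seeing only empty inputs). Placeholder callables stand in for them.
    """
    return -math.log2(p_null_input(intent)) + math.log2(p_with_input(utterance, intent))

def filter_synthetic(data: List[Tuple[str, str]],
                     thresholds: Dict[str, float],
                     p_with_input, p_null_input) -> List[Tuple[str, str]]:
    # Keep only synthetic utterances whose PVI clears the per-intent threshold.
    return [(x, y) for x, y in data
            if pvi(p_with_input, p_null_input, x, y) >= thresholds[y]]

# Dummy probability functions so the sketch is executable; a real setup would
# query two fine-tuned intent classifiers instead.
p_with = lambda x, y: 0.9 if "balance" in x and y == "check_balance" else 0.2
p_null = lambda y: 0.25  # roughly uniform prior over intents
synthetic = [("what is my account balance", "check_balance"),
             ("tell me a joke", "check_balance")]
print(filter_synthetic(synthetic, {"check_balance": 0.5}, p_with, p_null))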

Weakly Supervised Data Augmentation Through Prompting for Dialogue Understanding

Nov 02, 2022
Maximillian Chen, Alexandros Papangelis, Chenyang Tao, Andy Rosenbaum, Seokhwan Kim, Yang Liu, Zhou Yu, Dilek Hakkani-Tur

Dialogue understanding tasks often necessitate abundant annotated data to achieve good performance, which presents challenges in low-resource settings. To alleviate this barrier, we explore few-shot data augmentation for dialogue understanding by prompting large pre-trained language models and present a novel approach that iterates on augmentation quality by applying weakly supervised filters. We evaluate our methods on the emotion and act classification tasks in DailyDialog and the intent classification task in Facebook Multilingual Task-Oriented Dialogue. Models fine-tuned on our augmented data mixed with few-shot ground truth data are able to approach or surpass existing state-of-the-art performance on both datasets. For DailyDialog specifically, using 10% of the ground truth data, we outperform the current state-of-the-art model, which uses 100% of the data.

* To appear in SyntheticData4ML @ NeurIPS 2022. 16 pages, 10 figures, 3 tables 
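
A hedged sketch of the weakly supervised filtering idea: generated utterances are kept only if a weak classifier (fine-tuned on the few-shot ground truth) agrees with the label they were generated for. The agreement-plus-confidence rule and the names below are illustrative assumptions, not the paper's exact filters.

from typing import Callable, List, Tuple

def weak_filter(candidates: List[Tuple[str, str]],
                classify: Callable[[str], Tuple[str, float]],
                min_confidence: float = 0.7) -> List[Tuple[str, str]]:
    """Keep prompted augmentations only when a weak classifier agrees.

    `classify` stands in for a model fine-tuned on the few-shot ground
    truth; a candidate survives if the predicted label matches the label
    it was generated for and the prediction is confident enough.
    """
    kept = []
    for text, intended_label in candidates:
        predicted, confidence = classify(text)
        if predicted == intended_label and confidence >= min_confidence:
            kept.append((text, intended_label))
    return kept

# Dummy weak classifier so the sketch runs; a real filter would be, e.g.,
# a few-shot fine-tuned emotion or intent classifier.
def toy_classifier(text: str) -> Tuple[str, float]:
    return ("joy", 0.9) if "great" in text else ("neutral", 0.6)

generated = [("That sounds great, congratulations!", "joy"),
             ("Okay.", "joy")]
print(weak_filter(generated, toy_classifier))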

Knowledge-Grounded Conversational Data Augmentation with Generative Conversational Networks

Jul 22, 2022
Yen-Ting Lin, Alexandros Papangelis, Seokhwan Kim, Dilek Hakkani-Tur

While rich, open-domain textual data are generally available and may include interesting phenomena (humor, sarcasm, empathy, etc.), most are designed for language processing tasks and are usually in a non-conversational format. In this work, we take a step towards automatically generating conversational data using Generative Conversational Networks (GCN), aiming to benefit from the breadth of available language and knowledge data and to train open-domain social conversational agents. We evaluate our approach on conversations with and without knowledge on the Topical Chat dataset using automatic metrics and human evaluators. Our results show that for conversations without knowledge grounding, GCN can generalize from the seed data, producing novel conversations that are less relevant but more engaging, while for knowledge-grounded conversations it produces more knowledge-focused, fluent, and engaging conversations. Specifically, we show that for open-domain conversations with 10% of the seed data, our approach performs close to the baseline that uses 100% of the data, while for knowledge-grounded conversations it achieves the same using only 1% of the data, on human ratings of engagingness, fluency, and relevance.

* Accepted at SIGDial 2022 
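
The outer loop of Generative Conversational Networks can be sketched at a high level: generate conversations from seed data, train a learner on them, and use the learner's performance as a reward signal for the generator. The sketch below fixes only that loop structure; every callable is a placeholder, and the actual GCN components (generator, learner, reward, and update rule) are more involved.

from typing import Callable, List

def gcn_outer_loop(generate: Callable[[List[str], int], List[str]],
                   train_learner: Callable[[List[str]], object],
                   evaluate: Callable[[object], float],
                   update_generator: Callable[[float], None],
                   seed_conversations: List[str],
                   rounds: int = 3, n_samples: int = 8) -> None:
    """Outer loop of a Generative Conversational Network, roughly as the
    abstract describes it: the generator synthesizes conversations from
    seed data, a learner is trained on them, and the learner's performance
    is fed back as a reward to improve the generator."""
    for step in range(rounds):
        synthetic = generate(seed_conversations, n_samples)      # sample new dialogues
        learner = train_learner(seed_conversations + synthetic)  # train downstream model
        reward = evaluate(learner)                               # e.g., automatic metrics
        update_generator(reward)                                 # policy-gradient style update
        print(f"round {step}: reward={reward:.3f}")

# Minimal dummy components so the loop executes end to end.
gcn_outer_loop(
    generate=lambda seeds, n: [f"synthetic dialogue {i}" for i in range(n)],
    train_learner=lambda data: {"n_train": len(data)},
    evaluate=lambda learner: min(1.0, learner["n_train"] / 20),
    update_generator=lambda reward: None,
    seed_conversations=["seed dialogue 1", "seed dialogue 2"],
)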

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

Jun 24, 2022
Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, Genta Indra Winata, Hendrik Strobelt, Hiroaki Hayashi, Jekaterina Novikova, Jenna Kanerva, Jenny Chim, Jiawei Zhou, Jordan Clive, Joshua Maynez, João Sedoc, Juraj Juraska, Kaustubh Dhole, Khyathi Raghavi Chandu, Laura Perez-Beltrachini, Leonardo F. R. Ribeiro, Lewis Tunstall, Li Zhang, Mahima Pushkarna, Mathias Creutz, Michael White, Mihir Sanjay Kale, Moussa Kamal Eddine, Nico Daheim, Nishant Subramani, Ondrej Dusek, Paul Pu Liang, Pawan Sasanka Ammanamanchi, Qi Zhu, Ratish Puduppully, Reno Kriz, Rifat Shahriyar, Ronald Cardenas, Saad Mahamood, Salomey Osei, Samuel Cahyawijaya, Sanja Štajner, Sebastien Montella, Shailza, Shailza Jolly, Simon Mille, Tahmid Hasan, Tianhao Shen, Tosin Adewumi, Vikas Raunak, Vipul Raheja, Vitaly Nikolaev, Vivian Tsai, Yacine Jernite, Ying Xu, Yisi Sang, Yixin Liu, Yufang Hou

Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables comparison on an equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make it easier to follow best model evaluation practices, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
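
For a sense of what "a single line of code" means in practice, the sketch below loads one GEM dataset through the Hugging Face datasets library, assuming the GEMv2 data remain hosted under the GEM namespace on the Hub; the specific dataset name is only an example, not a canonical entry point.

# Illustrative only: dataset name and availability on the Hub are assumptions.
from datasets import load_dataset

web_nlg = load_dataset("GEM/web_nlg_en")  # one line to fetch a documented GEM dataset
print(web_nlg)                            # inspect the available splits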

Understanding How People Rate Their Conversations

Jun 01, 2022
Alexandros Papangelis, Nicole Chartier, Pankaj Rajan, Julia Hirschberg, Dilek Hakkani-Tur

User ratings play a significant role in spoken dialogue systems. Typically, such ratings tend to be averaged across all users and then utilized as feedback to improve the system or personalize its behavior. While this method can be useful to understand broad, general issues with the system and its behavior, it does not take into account differences between users that affect their ratings. In this work, we conduct a study to better understand how people rate their interactions with conversational agents. One macro-level characteristic that has been shown to correlate with how people perceive their interpersonal communication is personality. We specifically focus on agreeableness and extraversion as variables that may explain variation in ratings and therefore provide a more meaningful signal for training or personalization. In order to elicit those personality traits during an interaction with a conversational agent, we designed and validated a fictional story, grounded in prior work in psychology. We then implemented the story into an experimental conversational agent that allowed users to opt in to hearing the story. Our results suggest that for human-conversational agent interactions, extraversion may play a role in user ratings, but more data is needed to determine if the relationship is significant. Agreeableness, on the other hand, plays a statistically significant role in conversation ratings: users who are more agreeable are more likely to provide a higher rating for their interaction. In addition, we found that users who opted to hear the story were, in general, more likely to rate their conversational experience higher than those who did not.

* Published at IWSDS 2021 

What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation

Mar 25, 2022
Sarik Ghazarian, Behnam Hedayatnia, Alexandros Papangelis, Yang Liu, Dilek Hakkani-Tur

Accurate automatic evaluation metrics for open-domain dialogs are in high demand. Existing model-based metrics for system response evaluation are trained on human annotated data, which is cumbersome to collect. In this work, we propose to use information that can be automatically extracted from the next user utterance, such as its sentiment or whether the user explicitly ends the conversation, as a proxy to measure the quality of the previous system response. This allows us to train on a massive set of dialogs with weak supervision, without requiring manual system turn quality annotations. Experiments show that our model is comparable to models trained on human annotated data. Furthermore, our model generalizes across both spoken and written open-domain dialog corpora collected from real and paid users.

* Accepted at ACL Findings 2022. 11 pages, 8 figures, 5 tables 
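
A minimal sketch of the weak-supervision signal described above: the sentiment of the next user utterance (and an explicit attempt to end the conversation) is turned into a quality label for the preceding system response. The scoring rule, the phrase list, and the toy sentiment function are assumptions for illustration; the paper's models and label definitions may differ.

from typing import Callable, List, Tuple

def weak_quality_labels(turn_pairs: List[Tuple[str, str]],
                        sentiment: Callable[[str], float],
                        end_phrases=("stop", "goodbye")) -> List[Tuple[str, float]]:
    """Derive weak response-quality labels from the NEXT user utterance.

    Negative user sentiment or an explicit attempt to end the conversation
    counts against the preceding system response.
    """
    labeled = []
    for system_response, next_user_utterance in turn_pairs:
        score = sentiment(next_user_utterance)          # in [-1, 1]
        if any(p in next_user_utterance.lower() for p in end_phrases):
            score = min(score, -0.5)                    # early exit is a bad sign
        labeled.append((system_response, score))
    return labeled

# Toy lexicon-based sentiment stand-in; a real setup would use a trained
# sentiment model over the user's utterance.
def toy_sentiment(text: str) -> float:
    text = text.lower()
    return 1.0 if "love" in text or "thanks" in text else (-1.0 if "awful" in text else 0.0)

dialog = [("Here's a fun fact about space!", "Thanks, I love that!"),
          ("Let me repeat that again.", "This is awful, stop.")]
print(weak_quality_labels(dialog, toy_sentiment))

Weak labels produced this way could then train an automatic response-quality estimator without manual turn-level annotation, which is the central idea of the abstract.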