Emine Yilmaz

Improving Conversational Recommendation Systems via Bias Analysis and Language-Model-Enhanced Data Augmentation

Oct 25, 2023
Xi Wang, Hossein A. Rahmani, Jiqun Liu, Emine Yilmaz

Conversational Recommendation Systems (CRSs) form a rapidly growing research area that has gained significant attention alongside advancements in language modelling techniques. However, the current state of conversational recommendation faces numerous challenges due to its relative novelty and the limited number of existing contributions. In this study, we delve into the benchmark datasets used for developing CRS models and address potential biases arising from the feedback loop inherent in multi-turn interactions, including selection bias and multiple popularity bias variants. Drawing inspiration from the success of data generated with language models and of data augmentation techniques, we present two novel strategies, 'Once-Aug' and 'PopNudge', to enhance model performance while mitigating biases. Through extensive experiments on the ReDial and TG-ReDial benchmark datasets, we show a consistent improvement of CRS techniques with our data augmentation approaches and offer additional insights on addressing multiple newly formulated biases.
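
A minimal sketch of how a popularity-nudging augmentation could work, in the spirit of 'PopNudge'; the function name, neighbourhood size, and sampling scheme below are illustrative assumptions, not the paper's actual algorithm:

```python
# Hypothetical popularity-aware augmentation: for each (context, item) pair,
# add a copy whose target item is nudged to a close-but-less-popular item,
# counteracting popularity bias in the training signal.
import random
from collections import Counter

def popnudge_augment(dialogues, popularity, n_neighbours=3, seed=0):
    rng = random.Random(seed)
    items_by_pop = sorted(popularity, key=popularity.get)  # ascending popularity
    rank = {item: i for i, item in enumerate(items_by_pop)}
    augmented = list(dialogues)
    for context, item in dialogues:
        r = rank[item]
        candidates = items_by_pop[max(0, r - n_neighbours):r]  # less popular neighbours
        if candidates:
            augmented.append((context, rng.choice(candidates)))
    return augmented

corpus = [("I loved Inception, any suggestions?", "interstellar")]
pop = Counter({"interstellar": 120, "moon": 15, "primer": 8})
print(popnudge_augment(corpus, pop))
```

By contrast, an 'Once-Aug'-style strategy would presumably expose every candidate item to the training data once, which this sketch does not cover.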

* Accepted by EMNLP 2023 (Findings) 

Enhancing Conversational Search: Large Language Model-Aided Informative Query Rewriting

Oct 18, 2023
Fanghua Ye, Meng Fang, Shenghui Li, Emine Yilmaz

Query rewriting plays a vital role in enhancing conversational search by transforming context-dependent user queries into standalone forms. Existing approaches primarily leverage human-rewritten queries as labels to train query rewriting models. However, human rewrites may lack sufficient information for optimal retrieval performance. To overcome this limitation, we propose utilizing large language models (LLMs) as query rewriters, enabling the generation of informative query rewrites through well-designed instructions. We define four essential properties for well-formed rewrites and incorporate all of them into the instruction. In addition, we introduce the role of rewrite editors for LLMs when initial query rewrites are available, forming a "rewrite-then-edit" process. Furthermore, we propose distilling the rewriting capabilities of LLMs into smaller models to reduce rewriting latency. Our experimental evaluation on the QReCC dataset demonstrates that informative query rewrites can yield substantially improved retrieval performance compared to human rewrites, especially with sparse retrievers.
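
A hedged sketch of the "rewrite-then-edit" flow described above, assuming a generic `llm` callable that maps a prompt string to a completion; the prompt wording is illustrative, and the four properties are only paraphrased here, not quoted from the paper:

```python
# Illustrative prompts; the paper's four properties for well-formed rewrites
# are paraphrased as: self-contained, clear, informative, fluent.
REWRITE_PROMPT = (
    "Rewrite the final user question so that it is self-contained, clear, "
    "informative, and fluent.\nConversation:\n{history}\n"
    "Final question: {question}\nRewrite:"
)
EDIT_PROMPT = (
    "Edit the candidate rewrite of the final user question, keeping it "
    "self-contained and adding any useful missing context.\n"
    "Conversation:\n{history}\nFinal question: {question}\n"
    "Candidate rewrite: {draft}\nEdited rewrite:"
)

def rewrite_then_edit(llm, history, question, draft=None):
    if draft is None:  # no initial rewrite available: generate one first
        draft = llm(REWRITE_PROMPT.format(history=history, question=question))
    # the LLM acts as a rewrite editor over the available draft
    return llm(EDIT_PROMPT.format(history=history, question=question, draft=draft))
```

Distillation into a smaller rewriter, as the abstract proposes for latency, would then train a compact model on the LLM's outputs.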

* 22 pages, accepted to EMNLP Findings 2023 

Schema-Guided User Satisfaction Modeling for Task-Oriented Dialogues

May 26, 2023
Yue Feng, Yunlong Jiao, Animesh Prasad, Nikolaos Aletras, Emine Yilmaz, Gabriella Kazai

User Satisfaction Modeling (USM) is a popular choice for evaluating task-oriented dialogue systems, where user satisfaction typically depends on whether the user's task goals were fulfilled by the system. Task-oriented dialogue systems use a task schema, a set of task attributes, to encode the user's task goals. Existing studies on USM neglect to explicitly model the fulfillment of the user's task goals using the task schema. In this paper, we propose SG-USM, a novel schema-guided user satisfaction modeling framework. It explicitly models the degree to which the user's preferences regarding the task attributes are fulfilled by the system in order to predict the user's satisfaction level. SG-USM employs a pre-trained language model to encode the dialogue context and task attributes. Further, it employs a fulfillment representation layer to learn how many task attributes have been fulfilled in the dialogue and an importance predictor to estimate the importance of each task attribute. Finally, it predicts user satisfaction based on task attribute fulfillment and task attribute importance. Experimental results on benchmark datasets (i.e. MWOZ, SGD, ReDial, and JDDC) show that SG-USM consistently outperforms competitive existing methods. Our extensive analysis demonstrates that SG-USM improves the interpretability of user satisfaction modeling, scales well as it can effectively deal with unseen tasks, and works effectively in low-resource settings by leveraging unlabeled data.
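
A minimal numpy sketch of SG-USM's final step as described above: satisfaction predicted from per-attribute fulfillment scores weighted by predicted attribute importance. The encoders, the fulfillment layer, and the exact output classifier are elided, and this weighted-sum formulation is an assumption:

```python
import numpy as np

def predict_satisfaction(fulfillment, importance_logits, n_levels=5):
    """fulfillment: (n_attrs,) scores in [0, 1] from a fulfillment layer;
    importance_logits: (n_attrs,) raw scores from an importance predictor."""
    weights = np.exp(importance_logits - importance_logits.max())
    weights /= weights.sum()                      # softmax over attributes
    score = float(weights @ fulfillment)          # weighted degree of fulfillment
    return int(round(score * (n_levels - 1)))     # map to a discrete satisfaction level

print(predict_satisfaction(np.array([0.9, 0.2, 0.8]), np.array([2.0, 0.1, 1.0])))  # -> 3
```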

A Survey on Asking Clarification Questions Datasets in Conversational Systems

May 25, 2023
Hossein A. Rahmani, Xi Wang, Yue Feng, Qiang Zhang, Emine Yilmaz, Aldo Lipani

The ability to understand a user's underlying needs is critical for conversational systems, especially with limited input from users in a conversation. Thus, in this domain, Asking Clarification Questions (ACQs) to reveal users' true intent from their queries or utterances arises as an essential task. However, a key limitation of existing ACQ studies is their incomparability, stemming from inconsistent use of data, distinct experimental setups, and differing evaluation strategies. Therefore, in this paper, to assist the development of ACQ techniques, we comprehensively analyse the current state of ACQ research, offering a detailed comparison of publicly available datasets and a discussion of the applied evaluation metrics, together with benchmarks for multiple ACQ-related tasks. In particular, based on a thorough analysis of the ACQ task, we discuss a number of corresponding research directions for the investigation of ACQs as well as the development of conversational systems.

* ACL 2023, 17 pages 

Towards Asking Clarification Questions for Information Seeking on Task-Oriented Dialogues

May 23, 2023
Yue Feng, Hossein A. Rahmani, Aldo Lipani, Emine Yilmaz

Task-oriented dialogue systems aim at providing users with task-specific services. Users of such systems often do not know all the information about the task they are trying to accomplish, requiring them to seek information about the task. To provide accurate and personalized task-oriented information seeking results, task-oriented dialogue systems need to address two potential issues: 1) users' inability to describe their complex information needs in their requests; and 2) ambiguous/missing information the system has about the users. In this paper, we propose a new Multi-Attention Seq2Seq Network, named MAS2S, which can ask questions to clarify the user's information needs and the user's profile in task-oriented information seeking. We also extend an existing dataset for task-oriented information seeking, yielding a publicly available dataset of about 100k task-oriented information seeking dialogues (dataset and code: https://github.com/sweetalyssum/clarit). Experimental results on this dataset show that MAS2S outperforms baselines on both clarification question generation and answer prediction.
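
A speculative PyTorch sketch of one decoder step with two attention streams, in the spirit of a multi-attention seq2seq model: one attention over the encoded dialogue context and one over the encoded user profile, fused before generation. The paper's actual MAS2S architecture likely differs in detail:

```python
import torch
import torch.nn as nn

class MultiAttentionFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.ctx_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.prof_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, dec_state, context_enc, profile_enc):
        # attend separately to the dialogue context and the user profile
        c, _ = self.ctx_attn(dec_state, context_enc, context_enc)
        p, _ = self.prof_attn(dec_state, profile_enc, profile_enc)
        # fuse both views into a single state for the generation head
        return torch.tanh(self.fuse(torch.cat([c, p], dim=-1)))

m = MultiAttentionFusion()
out = m(torch.randn(1, 1, 256), torch.randn(1, 12, 256), torch.randn(1, 5, 256))
print(out.shape)  # torch.Size([1, 1, 256])
```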

Rethinking Semi-supervised Learning with Language Models

May 22, 2023
Zhengxiang Shi, Francesco Tonolini, Nikolaos Aletras, Emine Yilmaz, Gabriella Kazai, Yunlong Jiao

Semi-supervised learning (SSL) is a popular setting that aims to effectively utilize unlabelled data to improve model performance in downstream natural language processing (NLP) tasks. Currently, there are two popular approaches to making use of unlabelled data: self-training (ST) and task-adaptive pre-training (TAPT). ST uses a teacher model to assign pseudo-labels to the unlabelled data, while TAPT continues pre-training on the unlabelled data before fine-tuning. To the best of our knowledge, the effectiveness of TAPT in SSL tasks has not been systematically studied, and no previous work has directly compared TAPT and ST in terms of their ability to utilize the pool of unlabelled data. In this paper, we provide an extensive empirical study comparing five state-of-the-art ST approaches and TAPT across various NLP tasks and data sizes, including in- and out-of-domain settings. Surprisingly, we find that TAPT is a stronger and more robust SSL learner than more sophisticated ST approaches, even when using just a few hundred unlabelled samples or in the presence of domain shifts, and that it tends to bring greater improvements in SSL than in fully-supervised settings. Our further analysis demonstrates the risks of using ST approaches when the size of labelled or unlabelled data is small or when domain shifts exist. We offer a fresh perspective for future SSL research, suggesting the use of unsupervised pre-training objectives over reliance on pseudo-labels.
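
A schematic sketch of the generic ST recipe contrasted above, assuming placeholder `train_fn`/`predict_fn` callables and a confidence threshold `tau`; this mirrors the textbook loop, not any of the five specific ST approaches studied in the paper:

```python
# Generic self-training (ST): a teacher pseudo-labels confident unlabelled
# examples and the model is retrained on the union. TAPT, by contrast, would
# continue the unsupervised pre-training objective (e.g., masked language
# modelling) on the unlabelled pool before fine-tuning on the labelled set.
def self_training(train_fn, predict_fn, labelled, unlabelled, rounds=3, tau=0.9):
    model = train_fn(labelled)
    for _ in range(rounds):
        preds = [(x, predict_fn(model, x)) for x in unlabelled]
        pseudo = [(x, y) for x, (y, conf) in preds if conf >= tau]  # keep confident ones
        model = train_fn(labelled + pseudo)
    return model
```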

* Findings of ACL 2023. Code is available at https://github.com/amzn/pretraining-or-self-training 

Modeling User Satisfaction Dynamics in Dialogue via Hawkes Process

May 21, 2023
Fanghua Ye, Zhiyuan Hu, Emine Yilmaz

Dialogue systems have received increasing attention, yet automatically evaluating their performance remains challenging. User satisfaction estimation (USE) has been proposed as an alternative: it assumes that the performance of a dialogue system can be measured by user satisfaction, and it uses an estimator to simulate users. The effectiveness of USE therefore depends heavily on the estimator. Existing estimators independently predict user satisfaction at each turn and ignore satisfaction dynamics across turns within a dialogue. In order to fully simulate users, it is crucial to take satisfaction dynamics into account. To fill this gap, we propose a new estimator, ASAP (sAtisfaction eStimation via HAwkes Process), which treats user satisfaction across turns as an event sequence and employs a Hawkes process to effectively model the dynamics in this sequence. Experimental results on four benchmark dialogue datasets demonstrate that ASAP substantially outperforms state-of-the-art baseline estimators.
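
For readers unfamiliar with the underlying model, a classical Hawkes-process intensity is the ingredient ASAP builds on: a baseline rate plus exponentially decaying excitation from past events. The parameter values below are arbitrary, and ASAP itself learns its parameterisation rather than fixing it like this:

```python
# lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))
import math

def hawkes_intensity(t, event_times, mu=0.2, alpha=0.8, beta=1.5):
    """Baseline rate mu plus decaying excitation from past events
    (e.g., earlier turns where the user signalled dissatisfaction)."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in event_times if ti < t)

print(hawkes_intensity(4.0, [1.0, 3.5]))  # recent events excite the intensity more
```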

* To appear at ACL 2023 

Scalable Educational Question Generation with Pre-trained Language Models

May 13, 2023
Sahan Bulathwela, Hamze Muse, Emine Yilmaz

The automatic generation of educational questions will play a key role in scaling online education, enabling self-assessment at scale as a global population navigates personalised learning journeys. We develop EduQG, a novel educational question generation model built by adapting a large language model. Our extensive experiments demonstrate that EduQG can produce superior educational questions by further pre-training and fine-tuning a pre-trained language model on scientific text and science question data.

* To be published at the Int. Conf. on Artificial Intelligence in Education (Tokyo, 2023) 

Query-specific Variable Depth Pooling via Query Performance Prediction towards Reducing Relevance Assessment Effort

Apr 23, 2023
Debasis Ganguly, Emine Yilmaz

Due to the massive size of test collections, a standard practice in IR evaluation is to construct a 'pool' of candidate relevant documents comprised of the top-k documents retrieved by a wide range of different retrieval systems - a process called depth-k pooling. The depth (k) is typically set to a constant value for every query constituting the benchmark set. However, in this paper we argue that the annotation effort can be substantially reduced if the depth of the pool is made a variable quantity for each query, the rationale being that the number of documents relevant to the information need can vary widely across queries. Our hypothesis is that a lower depth for queries with few relevant documents and a higher depth for queries with many can potentially reduce the annotation effort without a significant change in retrieval effectiveness evaluation. We make use of standard query performance prediction (QPP) techniques to estimate the number of potentially relevant documents for each query, which is then used to determine the depth of the pool. Our experiments conducted on standard test collections demonstrate that this proposed method of employing query-specific variable depths is able to adequately reflect the relative effectiveness of IR systems with a substantially smaller annotation effort.
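
A hedged sketch of the core mechanism: a per-query QPP score is mapped to a pool depth between a minimum and maximum. The linear mapping and normalised scores below are illustrative assumptions, not the paper's exact formula:

```python
# qpp_scores: dict query_id -> predicted performance, normalised to [0, 1];
# a higher score suggests more relevant documents, hence a deeper pool.
def variable_depth(qpp_scores, k_min=10, k_max=100):
    return {q: round(k_min + s * (k_max - k_min)) for q, s in qpp_scores.items()}

print(variable_depth({"q1": 0.1, "q2": 0.85}))  # {'q1': 19, 'q2': 86}
```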

* To appear in SIGIR 2023 

Task2KB: A Public Task-Oriented Knowledge Base

Jan 24, 2023
Procheta Sen, Xi Wang, Ruiqing Xu, Emine Yilmaz

Search engines and conversational assistants are commonly used to help users complete everyday tasks such as booking travel, cooking, etc. While there are some existing datasets that can be used for this purpose, their coverage is limited to very few domains. In this paper, we propose a novel knowledge base, 'Task2KB', which is constructed using data crawled from WikiHow, an online knowledge resource offering instructional articles on a wide range of tasks. Task2KB encapsulates various types of task-related information and attributes, such as requirements, detailed step descriptions, and available methods to complete tasks. Due to its higher coverage compared to existing related knowledge graphs, Task2KB can be highly useful in the development of general-purpose task completion assistants.
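
A speculative sketch of what a Task2KB entry might look like, based only on the attributes named above (requirements, step descriptions, methods); the field names and example content are guesses, not the released schema:

```python
# Hypothetical Task2KB-style record for one WikiHow-derived task.
task_entry = {
    "task": "Make cold brew coffee",
    "source": "wikihow",
    "requirements": ["coarsely ground coffee", "water", "jar", "strainer"],
    "methods": [
        {
            "name": "Basic jar method",
            "steps": [
                "Combine coffee and water in the jar.",
                "Steep for 12-24 hours at room temperature.",
                "Strain and refrigerate.",
            ],
        }
    ],
}
print(task_entry["methods"][0]["steps"][1])
```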
