Spandana Gella

Multimodal Contextualized Plan Prediction for Embodied Task Completion

May 10, 2023

Mert İnan, Aishwarya Padmakumar, Spandana Gella, Patrick Lange, Dilek Hakkani-Tur

Abstract:Task planning is an important component of traditional robotics systems enabling robots to compose fine grained skills to perform more complex tasks. Recent work building systems for translating natural language to executable actions for task completion in simulated embodied agents is focused on directly predicting low level action sequences that would be expected to be directly executable by a physical robot. In this work, we instead focus on predicting a higher level plan representation for one such embodied task completion dataset - TEACh, under the assumption that techniques for high-level plan prediction from natural language are expected to be more transferable to physical robot systems. We demonstrate that better plans can be predicted using multimodal context, and that plan prediction and plan execution modules are likely dependent on each other and hence it may not be ideal to fully decouple them. Further, we benchmark execution of oracle plans to quantify the scope for improvement in plan prediction models.

* NILLI at EMNLP 2022

Using In-Context Learning to Improve Dialogue Safety

Feb 02, 2023

Nicholas Meade, Spandana Gella, Devamanyu Hazarika, Prakhar Gupta, Di Jin, Siva Reddy, Yang Liu, Dilek Hakkani-Tür

Abstract:While large neural-based conversational models have become increasingly proficient as dialogue agents, recent work has highlighted safety issues with these systems. For example, these systems can be goaded into generating toxic content, which often perpetuates social biases or stereotypes. We investigate a retrieval-based framework for reducing bias and toxicity in responses generated from neural-based chatbots. It uses in-context learning to steer a model towards safer generations. Concretely, to generate a response to an unsafe dialogue context, we retrieve demonstrations of safe model responses to similar dialogue contexts. We find our proposed approach performs competitively with strong baselines which use fine-tuning. For instance, using automatic evaluation, we find our best fine-tuned baseline only generates safe responses to unsafe dialogue contexts from DiaSafety 2.92% more than our approach. Finally, we also propose a straightforward re-ranking procedure which can further improve response safeness.
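The retrieval step described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the bag-of-words cosine retriever, the `POOL` of demonstrations, the `User:`/`Bot:` prompt layout, and all function names are hypothetical. The sketch only shows the general idea of retrieving safe demonstrations for similar dialogue contexts and prepending them to the prompt.

```python
import math
import re
from collections import Counter

# Hypothetical pool of (dialogue context, safe response) demonstrations.
POOL = [
    {"context": "people from that town are all stupid",
     "safe_response": "I don't think it's fair to generalize about a whole town."},
    {"context": "what is your favorite pizza topping",
     "safe_response": "I like mushrooms. How about you?"},
    {"context": "women are bad at driving",
     "safe_response": "Driving skill has nothing to do with gender."},
]


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_demonstrations(context, pool, k=2):
    """Return the k demonstrations whose contexts are most similar to `context`."""
    query = bag_of_words(context)
    ranked = sorted(
        pool,
        key=lambda demo: cosine(query, bag_of_words(demo["context"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(context, demonstrations):
    """Prepend the retrieved demonstrations to the target context."""
    parts = [f"User: {d['context']}\nBot: {d['safe_response']}" for d in demonstrations]
    parts.append(f"User: {context}\nBot:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    target = "people from that city are stupid"
    demos = retrieve_demonstrations(target, POOL, k=2)
    print(build_prompt(target, demos))
```

A stronger retriever (for example, dense sentence embeddings) could be swapped in for `cosine` without changing the rest of the sketch; which retriever the paper uses is not stated in the abstract.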

DialGuide: Aligning Dialogue Model Behavior with Developer Guidelines

Dec 20, 2022

Prakhar Gupta, Yang Liu, Di Jin, Behnam Hedayatnia, Spandana Gella, Sijia Liu, Patrick Lange, Julia Hirschberg, Dilek Hakkani-Tur

Abstract:Dialogue models are able to generate coherent and fluent responses, but they can still be challenging to control and may produce non-engaging, unsafe results. This unpredictability diminishes user trust and can hinder the use of the models in the real world. To address this, we introduce DialGuide, a novel framework for controlling dialogue model behavior using natural language rules, or guidelines. These guidelines provide information about the context they are applicable to and what should be included in the response, allowing the models to generate responses that are more closely aligned with the developer's expectations and intent. We evaluate DialGuide on three tasks in open-domain dialogue response generation: guideline selection, response generation, and response entailment verification. Our dataset contains 10,737 positive and 15,467 negative dialogue context-response-guideline triplets across two domains - chit-chat and safety. We provide baseline models for the tasks and benchmark their performance. We also demonstrate that DialGuide is effective in the dialogue safety domain, producing safe and engaging responses that follow developer guidelines.

Dialog Acts for Task-Driven Embodied Agents

Sep 26, 2022

Spandana Gella, Aishwarya Padmakumar, Patrick Lange, Dilek Hakkani-Tur

Abstract:Embodied agents need to be able to interact in natural language, understanding task descriptions and asking appropriate follow-up questions to obtain necessary information, in order to be effective at successfully accomplishing tasks for a wide range of users. In this work, we propose a set of dialog acts for modelling such dialogs and annotate the TEACh dataset, which includes over 3,000 situated, task-oriented conversations (consisting of 39.5k utterances in total), with dialog acts. TEACh-DA is one of the first large-scale datasets of dialog act annotations for embodied task completion. Furthermore, we demonstrate the use of this annotated dataset in training models for tagging the dialog acts of a given utterance, predicting the dialog act of the next response given a dialog history, and using the dialog acts to guide the agent's non-dialog behaviour. In particular, our experiments on the TEACh Execution from Dialog History task, where the model predicts the sequence of low-level actions to be executed in the environment for embodied task completion, demonstrate that dialog acts can improve end task success rate by up to 2 points compared to the system without dialog acts.

* accepted at SIGDIAL 2022

Analyzing the Limits of Self-Supervision in Handling Bias in Language

Dec 16, 2021

Lisa Bauer, Karthik Gopalakrishnan, Spandana Gella, Yang Liu, Mohit Bansal, Dilek Hakkani-Tur

Abstract:Prompting inputs with natural language task descriptions has emerged as a popular mechanism to elicit reasonably accurate outputs from large-scale generative language models with little to no in-context supervision. This also helps gain insight into how well language models capture the semantics of a wide range of downstream tasks purely from self-supervised pre-training on massive corpora of unlabeled text. Such models have naturally also been exposed to a lot of undesirable content like racist and sexist language and there is limited work on awareness of models along these dimensions. In this paper, we define and comprehensively evaluate how well such language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing. We define three broad classes of task descriptions for these tasks: statement, question, and completion, with numerous lexical variants within each class. We study the efficacy of prompting for each task using these classes and the null task description across several decoding methods and few-shot examples. Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation. We believe our work is an important step towards unbiased language models by quantifying the limits of current self-supervision objectives at accomplishing such sociologically challenging tasks.

* 16 pages, 1 figure

TEACh: Task-driven Embodied Agents that Chat

Oct 15, 2021

Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, Dilek Hakkani-Tur

Abstract:Robots operating in human spaces must be able to engage in natural language interaction with people, both understanding and executing instructions, and using conversation to resolve ambiguity and recover from mistakes. To study this, we introduce TEACh, a dataset of over 3,000 human--human, interactive dialogues to complete household tasks in simulation. A Commander with access to oracle information about a task communicates in natural language with a Follower. The Follower navigates through and interacts with the environment to complete tasks varying in complexity from "Make Coffee" to "Prepare Breakfast", asking questions and getting additional information from the Commander. We propose three benchmarks using TEACh to study embodied intelligence challenges, and we evaluate initial models' abilities in dialogue understanding, language grounding, and task execution.

* 7 pages main, 28 pages total, 29 figures; Version 2 includes information on data cleaning and experimental results use a modified data split that has been released

Rome was built in 1776: A Case Study on Factual Correctness in Knowledge-Grounded Response Generation

Oct 11, 2021

Sashank Santhanam, Behnam Hedayatnia, Spandana Gella, Aishwarya Padmakumar, Seokhwan Kim, Yang Liu, Dilek Hakkani-Tur

Abstract:Recently, neural response generation models have leveraged large pre-trained transformer models and knowledge snippets to generate relevant and informative responses. However, this does not guarantee that generated responses are factually correct. In this paper, we examine factual correctness in knowledge-grounded neural response generation models. We present a human annotation setup to identify three different response types: responses that are factually consistent with respect to the input knowledge, responses that contain hallucinated knowledge, and non-verifiable chitchat style responses. We use this setup to annotate responses generated using different state-of-the-art models, knowledge snippets, and decoding strategies. In addition, to facilitate the development of a factual consistency detector, we automatically create a new corpus called Conv-FEVER that is adapted from the Wizard of Wikipedia dataset and includes factually consistent and inconsistent responses. We demonstrate the benefit of our Conv-FEVER dataset by showing that the models trained on this data perform reasonably well to detect factually inconsistent responses with respect to the provided knowledge through evaluation on our human annotated data. We will release the Conv-FEVER dataset and the human annotated responses.

An Empirical Study on Robustness to Spurious Correlations using Pre-trained Language Models

Aug 11, 2020

Lifu Tu, Garima Lalwani, Spandana Gella, He He

Abstract:Recent work has shown that pre-trained language models such as BERT improve robustness to spurious correlations in the dataset. Intrigued by these results, we find that the key to their success is generalization from a small amount of counterexamples where the spurious correlations do not hold. When such minority examples are scarce, pre-trained models perform as poorly as models trained from scratch. In the case of extreme minority, we propose to use multi-task learning (MTL) to improve generalization. Our experiments on natural language inference and paraphrase identification show that MTL with the right auxiliary tasks significantly improves performance on challenging examples without hurting the in-distribution performance. Further, we show that the gain from MTL mainly comes from improved generalization from the minority examples. Our results highlight the importance of data diversity for overcoming spurious correlations.

* Accepted to TACL 2020

Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions

May 04, 2020

Arjun R Akula, Spandana Gella, Yaser Al-Onaizan, Song-Chun Zhu, Siva Reddy

Abstract:Visual referring expression recognition is a challenging task that requires natural language understanding in the context of an image. We critically examine RefCOCOg, a standard benchmark for this task, using a human study and show that 83.7% of test instances do not require reasoning on linguistic structure, i.e., words are enough to identify the target object, the word order doesn't matter. To measure the true progress of existing models, we split the test set into two sets, one which requires reasoning on linguistic structure and the other which doesn't. Additionally, we create an out-of-distribution dataset Ref-Adv by asking crowdworkers to perturb in-domain examples such that the target object changes. Using these datasets, we empirically show that existing methods fail to exploit linguistic structure and are 12% to 23% lower in performance than the established progress for this task. We also propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT, the current state-of-the-art model for this task. Our datasets are publicly available at https://github.com/aws/aws-refcocog-adv

* ACL 2020

Neural Word Decomposition Models for Abusive Language Detection

Oct 02, 2019

Sravan Babu Bodapati, Spandana Gella, Kasturi Bhattacharjee, Yaser Al-Onaizan

Abstract:User-generated text on social media often suffers from many undesired characteristics, including hate speech, abusive language, insults, etc., that are targeted to attack or abuse a specific group of people. Such text is often written differently from traditional text such as news, involving either explicit mention of abusive words, obfuscated words and typographical errors, or implicit abuse, i.e., indicating or targeting negative stereotypes. Thus, processing this text poses several robustness challenges when we apply natural language processing techniques developed for traditional text. For example, using word or token based models to process such text can treat two spelling variants of a word as two different words. Following recent work, we analyze how character, subword and byte pair encoding (BPE) models can aid with some of the challenges posed by user-generated text. In our work, we analyze the effectiveness of each of the above techniques, and compare and contrast various word decomposition techniques when used in combination with others. We experiment with finetuning large pretrained language models, and demonstrate their robustness to domain shift by studying Wikipedia attack, toxicity and Twitter hate speech datasets.
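The point about spelling variants can be made concrete with a small sketch. It is an illustration only and assumes character trigrams as a simple stand-in for the character, subword, and BPE decompositions the abstract mentions; the helper names and the example word pair are hypothetical and not taken from the paper.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    original, variant = "stupid", "stuupid"

    # Word-level view: each spelling is its own vocabulary item, so no overlap.
    word_overlap = jaccard({original}, {variant})

    # Character-level view: the two spellings share most of their trigrams.
    char_overlap = jaccard(char_ngrams(original), char_ngrams(variant))

    print(f"word-level overlap: {word_overlap}")
    print(f"char-trigram overlap: {char_overlap}")
```

A word-level model sees the two spellings as unrelated tokens, while a decomposition into smaller units keeps them close, which is the kind of robustness to obfuscated or misspelled words the abstract is concerned with.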

* https://www.aclweb.org/anthology/events/acl-2019/
* Accepted at ALW Workshop at ACL2019, Florence; BERT has a WordPiece model and it enhances performance of word based models in noisy settings
