Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Stephen Roller

Enhancing Performance on Seen and Unseen Dialogue Scenarios using Retrieval-Augmented End-to-End Task-Oriented System

Aug 16, 2023

Jianguo Zhang, Stephen Roller, Kun Qian, Zhiwei Liu, Rui Meng, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong

Figure 1 for Enhancing Performance on Seen and Unseen Dialogue Scenarios using Retrieval-Augmented End-to-End Task-Oriented System

Figure 2 for Enhancing Performance on Seen and Unseen Dialogue Scenarios using Retrieval-Augmented End-to-End Task-Oriented System

Figure 3 for Enhancing Performance on Seen and Unseen Dialogue Scenarios using Retrieval-Augmented End-to-End Task-Oriented System

Figure 4 for Enhancing Performance on Seen and Unseen Dialogue Scenarios using Retrieval-Augmented End-to-End Task-Oriented System

Abstract:End-to-end task-oriented dialogue (TOD) systems have achieved promising performance by leveraging sophisticated natural language understanding and natural language generation capabilities of pre-trained models. This work enables the TOD systems with more flexibility through a simple cache. The cache provides the flexibility to dynamically update the TOD systems and handle both existing and unseen dialogue scenarios. Towards this end, we first fine-tune a retrieval module to effectively retrieve the most relevant information entries from the cache. We then train end-to-end TOD models that can refer to and ground on both dialogue history and retrieved information during TOD generation. The cache is straightforward to construct, and the backbone models of TOD systems are compatible with existing pre-trained generative models. Extensive experiments demonstrate the superior performance of our framework, with a notable improvement in non-empty joint goal accuracy by 6.7% compared to strong baselines.

* Accepted by SIGDIAL 2023 as a long paper

Via

Access Paper or Ask Questions

Leveraging Implicit Feedback from Deployment Data in Dialogue

Jul 26, 2023

Richard Yuanzhe Pang, Stephen Roller, Kyunghyun Cho, He He, Jason Weston

Figure 1 for Leveraging Implicit Feedback from Deployment Data in Dialogue

Figure 2 for Leveraging Implicit Feedback from Deployment Data in Dialogue

Figure 3 for Leveraging Implicit Feedback from Deployment Data in Dialogue

Figure 4 for Leveraging Implicit Feedback from Deployment Data in Dialogue

Abstract:We study improving social conversational agents by learning from natural dialogue between users and a deployed model, without extra annotations. To implicitly measure the quality of a machine-generated utterance, we leverage signals like user response length, sentiment and reaction of the future human utterances in the collected dialogue episodes. Our experiments use the publicly released deployment data from BlenderBot (Xu et al., 2023). Human evaluation indicates improvements in our new models over baseline responses; however, we find that some proxy signals can lead to more generations with undesirable properties as well. For example, optimizing for conversation length can lead to more controversial or unfriendly generations compared to the baseline, whereas optimizing for positive sentiment or reaction can decrease these behaviors.

Via

Access Paper or Ask Questions

A Theory on Adam Instability in Large-Scale Machine Learning

Apr 25, 2023

Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva(+7 more)

Figure 1 for A Theory on Adam Instability in Large-Scale Machine Learning

Figure 2 for A Theory on Adam Instability in Large-Scale Machine Learning

Figure 3 for A Theory on Adam Instability in Large-Scale Machine Learning

Figure 4 for A Theory on Adam Instability in Large-Scale Machine Learning

Abstract:We present a theory for the previously unexplained divergent behavior noticed in the training of large language models. We argue that the phenomenon is an artifact of the dominant optimization algorithm used for training, called Adam. We observe that Adam can enter a state in which the parameter update vector has a relatively large norm and is essentially uncorrelated with the direction of descent on the training loss landscape, leading to divergence. This artifact is more likely to be observed in the training of a deep model with a large batch size, which is the typical setting of large-scale language model training. To argue the theory, we present observations from the training runs of the language models of different scales: 7 billion, 30 billion, 65 billion, and 546 billion parameters.

Via

Access Paper or Ask Questions

Scaling Laws for Generative Mixed-Modal Language Models

Jan 10, 2023

Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, Luke Zettlemoyer

Abstract:Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also find four empirical phenomena observed during the training, such as emergent coordinate-ascent style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models that have unique distributional properties.

Via

Access Paper or Ask Questions

BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage

Aug 10, 2022

Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane(+8 more)

Figure 1 for BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage

Figure 2 for BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage

Figure 3 for BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage

Figure 4 for BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage

Abstract:We present BlenderBot 3, a 175B parameter dialogue model capable of open-domain conversation with access to the internet and a long-term memory, and having been trained on a large number of user defined tasks. We release both the model weights and code, and have also deployed the model on a public web page to interact with organic users. This technical report describes how the model was built (architecture, model and training scheme), and details of its deployment, including safety mechanisms. Human evaluations show its superiority to existing open-domain dialogue agents, including its predecessors (Roller et al., 2021; Komeili et al., 2022). Finally, we detail our plan for continual learning using the data collected from deployment, which will also be publicly released. The goal of this research program is thus to enable the community to study ever-improving responsible agents that learn through interaction.

Via

Access Paper or Ask Questions

OPT: Open Pre-trained Transformer Language Models

May 05, 2022

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin(+9 more)

Figure 1 for OPT: Open Pre-trained Transformer Language Models

Figure 2 for OPT: Open Pre-trained Transformer Language Models

Figure 3 for OPT: Open Pre-trained Transformer Language Models

Figure 4 for OPT: Open Pre-trained Transformer Language Models

Abstract:Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.

Via

Access Paper or Ask Questions

Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion

Mar 29, 2022

Kurt Shuster, Mojtaba Komeili, Leonard Adolphs, Stephen Roller, Arthur Szlam, Jason Weston

Figure 1 for Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion

Figure 2 for Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion

Figure 3 for Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion

Figure 4 for Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion

Abstract:Language models (LMs) have recently been shown to generate more factual responses by employing modularity (Zhou et al., 2021) in combination with retrieval (Adolphs et al., 2021). We extend the recent approach of Adolphs et al. (2021) to include internet search as a module. Our SeeKeR (Search engine->Knowledge->Response) method thus applies a single LM to three modular tasks in succession: search, generating knowledge, and generating a final response. We show that, when using SeeKeR as a dialogue model, it outperforms the state-of-the-art model BlenderBot 2 (Chen et al., 2021) on open-domain knowledge-grounded conversations for the same number of parameters, in terms of consistency, knowledge and per-turn engagingness. SeeKeR applied to topical prompt completions as a standard language model outperforms GPT2 (Radford et al., 2019) and GPT3 (Brown et al., 2020) in terms of factuality and topicality, despite GPT3 being a vastly larger model. Our code and models are made publicly available.

Via

Access Paper or Ask Questions

Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

Jan 12, 2022

Eric Michael Smith, Orion Hsu, Rebecca Qian, Stephen Roller, Y-Lan Boureau, Jason Weston

Figure 1 for Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

Figure 2 for Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

Figure 3 for Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

Figure 4 for Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

Abstract:At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic metrics are well known (Liu et al., 2016, arXiv:1603.08023), with human evaluations still considered the gold standard. Unfortunately, how to perform human evaluations is also an open problem: differing data collection methods have varying levels of human agreement and statistical sensitivity, resulting in differing amounts of human annotation hours and labor costs. In this work we compare five different crowdworker-based human evaluation methods and find that different methods are best depending on the types of models compared, with no clear winner across the board. While this highlights the open problems in the area, our analysis leads to advice of when to use which one, and possible future directions.

Via

Access Paper or Ask Questions

Teaching Models new APIs: Domain-Agnostic Simulators for Task Oriented Dialogue

Oct 13, 2021

Moya Chen, Paul A. Crook, Stephen Roller

Figure 1 for Teaching Models new APIs: Domain-Agnostic Simulators for Task Oriented Dialogue

Figure 2 for Teaching Models new APIs: Domain-Agnostic Simulators for Task Oriented Dialogue

Figure 3 for Teaching Models new APIs: Domain-Agnostic Simulators for Task Oriented Dialogue

Figure 4 for Teaching Models new APIs: Domain-Agnostic Simulators for Task Oriented Dialogue

Abstract:We demonstrate that large language models are able to simulate Task Oriented Dialogues in novel domains, provided only with an API implementation and a list of goals. We show these simulations can formulate online, automatic metrics that correlate well with human evaluations. Furthermore, by checking for whether the User's goals are met, we can use simulation to repeatedly generate training data and improve the quality of simulations themselves. With no human intervention or domain-specific training data, our simulations bootstrap end-to-end models which achieve a 37\% error reduction in previously unseen domains. By including as few as 32 domain-specific conversations, bootstrapped models can match the performance of a fully-supervised model with $10\times$ more data. To our knowledge, this is the first time simulations have been shown to be effective at bootstrapping models without explicitly requiring any domain-specific training data, rule-engineering, or humans-in-the-loop.

Via

Access Paper or Ask Questions

Hash Layers For Large Sparse Models

Jun 16, 2021

Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston

Figure 1 for Hash Layers For Large Sparse Models

Figure 2 for Hash Layers For Large Sparse Models

Figure 3 for Hash Layers For Large Sparse Models

Figure 4 for Hash Layers For Large Sparse Models

Abstract:We investigate the training of sparse layers that use different parameters for different inputs based on hashing in large Transformer models. Specifically, we modify the feedforward layer to hash to different sets of weights depending on the current token, over all tokens in the sequence. We show that this procedure either outperforms or is competitive with learning-to-route mixture-of-expert methods such as Switch Transformers and BASE Layers, while requiring no routing parameters or extra terms in the objective function such as a load balancing loss, and no sophisticated assignment algorithm. We study the performance of different hashing techniques, hash sizes and input features, and show that balanced and random hashes focused on the most local features work best, compared to either learning clusters or using longer-range context. We show our approach works well both on large language modeling and dialogue tasks, and on downstream fine-tuning tasks.

Via

Access Paper or Ask Questions