Richard Yuanzhe Pang

Leveraging Implicit Feedback from Deployment Data in Dialogue

Jul 26, 2023
Richard Yuanzhe Pang, Stephen Roller, Kyunghyun Cho, He He, Jason Weston

We study improving social conversational agents by learning from natural dialogue between users and a deployed model, without extra annotations. To implicitly measure the quality of a machine-generated utterance, we leverage signals such as the length, sentiment, and reaction of the subsequent human utterances in the collected dialogue episodes. Our experiments use the publicly released deployment data from BlenderBot (Xu et al., 2023). Human evaluation indicates improvements in our new models over baseline responses; however, we find that some proxy signals can also lead to more generations with undesirable properties. For example, optimizing for conversation length can lead to more controversial or unfriendly generations compared to the baseline, whereas optimizing for positive sentiment or reaction can decrease these behaviors.
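
The kind of implicit signal described in the abstract can be made concrete with a small sketch. The episode layout, the lexicon-based sentiment scorer, and the signal weights below are illustrative assumptions, not the pipeline used in the paper.

```python
# Hypothetical sketch: turning the next human reply into an implicit reward
# for the preceding bot utterance. Lexicons and weights are made up.

POSITIVE = {"great", "thanks", "love", "nice", "cool", "haha"}
NEGATIVE = {"stop", "wrong", "boring", "rude", "no"}

def sentiment_signal(reply: str) -> float:
    """Crude lexicon-based stand-in for a sentiment/reaction classifier."""
    tokens = [t.strip(".,!?") for t in reply.lower().split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / max(len(tokens), 1)

def implicit_rewards(episode):
    """Score each bot utterance by the length and sentiment of the next human reply.

    `episode` is a list of (speaker, text) pairs in order.
    """
    scored = []
    for i, (speaker, text) in enumerate(episode[:-1]):
        if speaker != "bot":
            continue
        reply = episode[i + 1][1]
        length = min(len(reply.split()) / 20.0, 1.0)      # longer replies ~ more engagement
        reward = 0.5 * length + 0.5 * sentiment_signal(reply)
        scored.append((text, round(reward, 3)))
    return scored

episode = [("bot", "Do you enjoy hiking?"),
           ("human", "I love hiking, thanks for asking! Let's talk about trails."),
           ("bot", "Everyone finds politics boring."),
           ("human", "No.")]
print(implicit_rewards(episode))
```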

Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples

May 24, 2023
Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish Joshi, Seyed Mehran Kazemi, Najoung Kim, He He

Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs using modus ponens or of a specific size, and from the same distribution as the in-context examples. To measure the general deductive reasoning ability of LLMs, we test them on a broad set of deduction rules and measure their ability to generalize from simpler demonstrations to more complex proofs along three axes: depth, width, and compositional generalization. To facilitate systematic exploration, we construct a new synthetic and programmable reasoning dataset that enables control over deduction rules and proof complexity. Our experiments on four LLMs of various sizes and training objectives show that they are able to generalize to longer and compositional proofs. However, they require explicit demonstrations to produce hypothetical subproofs, specifically in proof by cases and proof by contradiction.
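
The "programmable" control over proof complexity can be illustrated with a toy generator. The fictional predicate names, the templating, and the restriction to modus ponens chains are assumptions for illustration, not the paper's dataset code.

```python
import random

# Toy sketch of a programmable proof generator: a modus ponens chain whose
# depth is an explicit parameter. Predicate names and templates are made up.

PREDICATES = ["is a wumpus", "is a yumpus", "is a zumpus", "is a dumpus", "is a rompus"]

def modus_ponens_example(depth: int, entity: str = "Alex"):
    """Return (premises, gold chain-of-thought, conclusion) for a proof of `depth` steps."""
    props = random.sample(PREDICATES, depth + 1)
    premises = [f"{entity} {props[0]}."]
    premises += [f"Everything that {props[i]} also {props[i + 1]}." for i in range(depth)]
    random.shuffle(premises)  # so the proof cannot simply be read off in order
    chain = [f"{entity} {props[i]}, therefore {entity} {props[i + 1]}." for i in range(depth)]
    conclusion = f"{entity} {props[depth]}."
    return premises, chain, conclusion

premises, chain, conclusion = modus_ponens_example(depth=3)
print("Premises:  ", " ".join(premises))
print("Proof:     ", " ".join(chain))
print("Conclusion:", conclusion)
```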

Extrapolative Controlled Sequence Generation via Iterative Refinement

Mar 08, 2023
Vishakh Padmakumar, Richard Yuanzhe Pang, He He, Ankur P. Parikh

We study the problem of extrapolative controlled generation, i.e., generating sequences with attribute values beyond the range seen in training. This task is of significant importance in automated design, especially drug discovery, where the goal is to design novel proteins that are "better" (e.g., more stable) than existing sequences. Thus, by definition, the target sequences and their attribute values are out of the training distribution, posing challenges to existing methods that aim to directly generate the target sequence. Instead, in this work, we propose Iterative Controlled Extrapolation (ICE) which iteratively makes local edits to a sequence to enable extrapolation. We train the model on synthetically generated sequence pairs that demonstrate small improvement in the attribute value. Results on one natural language task (sentiment analysis) and two protein engineering tasks (ACE2 stability and AAV fitness) show that ICE considerably outperforms state-of-the-art approaches despite its simplicity. Our code and models are available at: https://github.com/vishakhpk/iter-extrapolation.

* Preprint 
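
A minimal sketch of the iterative local-edit loop is below; the released code at the URL above is the authoritative implementation. The `editor` and `scorer` here are toy stand-ins for the trained editing model and attribute predictor.

```python
import random
import string

# Sketch of the ICE-style refinement loop: repeatedly propose local edits and
# keep the one that most improves the predicted attribute. The toy editor and
# scorer below are placeholders, not the trained models from the paper.

def iterative_refinement(seq, editor, scorer, max_steps=10, proposals=8):
    best, best_score = seq, scorer(seq)
    for _ in range(max_steps):
        candidates = [editor(best) for _ in range(proposals)]
        cand = max(candidates, key=scorer)
        cand_score = scorer(cand)
        if cand_score <= best_score:
            break  # no proposed local edit improves the attribute
        best, best_score = cand, cand_score
    return best, best_score

def toy_editor(s: str) -> str:
    """Mutate a single character (stand-in for a learned local edit)."""
    i = random.randrange(len(s))
    return s[:i] + random.choice(string.ascii_lowercase) + s[i + 1:]

toy_scorer = lambda s: s.count("a")  # stand-in attribute: number of 'a' characters
print(iterative_refinement("hello world", toy_editor, toy_scorer))
```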

Reward Gaming in Conditional Text Generation

Nov 16, 2022
Richard Yuanzhe Pang, Vishakh Padmakumar, Thibault Sellam, Ankur P. Parikh, He He

To align conditional text generation model outputs with desired behaviors, there has been an increasing focus on training the model using reinforcement learning (RL) with reward functions learned from human annotations. Under this framework, we identify three common cases where high rewards are incorrectly assigned to undesirable patterns: noise-induced spurious correlation, naturally occurring spurious correlation, and covariate shift. We show that even though learned metrics achieve high performance on the distribution of the data used to train the reward function, the undesirable patterns may be amplified during RL training of the text generation model. While there has been discussion about reward gaming in the RL or safety community, in this short discussion piece, we would like to highlight reward gaming in the NLG community using concrete conditional text generation examples and discuss potential fixes and areas for future work.
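
A toy illustration of the first failure mode (noise-induced spurious correlation) is sketched below; it is not the paper's experimental setup. A learned reward that leaks output length is gamed by a long, low-quality candidate.

```python
# Toy illustration of reward gaming via a spurious length feature. All
# strings and weights are invented for illustration.

def true_quality(text: str) -> float:
    """What we actually want: the output contains the correct fact."""
    return 1.0 if "correct fact" in text else 0.0

def learned_reward(text: str) -> float:
    """A reward model whose noisy training data made length look predictive."""
    return true_quality(text) + 0.05 * len(text.split())

candidates = [
    "correct fact",
    "correct fact, stated briefly and clearly",
    "a very long rambling answer " * 10,
]
best = max(candidates, key=learned_reward)  # selection against the learned reward
print("picked:", best[:40], "...")
print("learned reward:", learned_reward(best), "| true quality:", true_quality(best))
```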

What Do NLP Researchers Believe? Results of the NLP Community Metasurvey

Aug 26, 2022
Julian Michael, Ari Holtzman, Alicia Parrish, Aaron Mueller, Alex Wang, Angelica Chen, Divyam Madaan, Nikita Nangia, Richard Yuanzhe Pang, Jason Phang, Samuel R. Bowman

We present the results of the NLP Community Metasurvey. Run from May to June 2022, the survey elicited opinions on controversial issues, including industry influence in the field, concerns about AGI, and ethics. Our results put concrete numbers to several controversies: For example, respondents are split almost exactly in half on questions about the importance of artificial general intelligence, whether language models understand language, and the necessity of linguistic structure and inductive bias for solving NLP problems. In addition, the survey posed meta-questions, asking respondents to predict the distribution of survey responses. This allows us not only to gain insight into the spectrum of beliefs held by NLP researchers, but also to uncover false sociological beliefs where the community's predictions don't match reality. We find such mismatches on a wide range of issues. Among other results, the community greatly overestimates its own belief in the usefulness of benchmarks and the potential for scaling to solve real-world problems, while underestimating its own belief in the importance of linguistic structure, inductive bias, and interdisciplinary science.

* 31 pages, 19 figures, 3 tables; more information at https://nlpsurvey.net 

SQuALITY: Building a Long-Document Summarization Dataset the Hard Way

May 23, 2022
Alex Wang, Richard Yuanzhe Pang, Angelica Chen, Jason Phang, Samuel R. Bowman

Summarization datasets are often assembled either by scraping naturally occurring public-domain summaries -- which are nearly always in difficult-to-work-with technical domains -- or by using approximate heuristics to extract them from everyday text -- which frequently yields unfaithful summaries. In this work, we turn to a slower but more straightforward approach to developing summarization benchmark data: We hire highly qualified contractors to read stories and write original summaries from scratch. To amortize reading time, we collect five summaries per document, with the first giving an overview and the subsequent four addressing specific questions. We use this protocol to collect SQuALITY, a dataset of question-focused summaries built on the same public-domain short stories as the multiple-choice dataset QuALITY (Pang et al., 2021). Experiments with state-of-the-art summarization systems show that our dataset is challenging and that existing automatic evaluation metrics are weak indicators of quality.
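
The collection protocol amounts to a record layout roughly like the hypothetical one sketched below (one overview summary plus four question-focused summaries per story). The field names are illustrative and do not correspond to the released data files.

```python
# Hypothetical record layout reflecting the protocol described above;
# field names and contents are placeholders, not the released format.
squality_record = {
    "story_id": "example-0001",   # public-domain short story shared with QuALITY
    "summaries": [
        {"question": None,        # first summary: a general overview of the story
         "summary": "Overview summary written from scratch by a contractor ..."},
        {"question": "What is the relationship between the two main characters?",
         "summary": "Question-focused summary 1 ..."},
        {"question": "How does the setting change over the course of the story?",
         "summary": "Question-focused summary 2 ..."},
        {"question": "What motivates the protagonist's final decision?",
         "summary": "Question-focused summary 3 ..."},
        {"question": "How is the central conflict resolved?",
         "summary": "Question-focused summary 4 ..."},
    ],
}
print(len(squality_record["summaries"]), "summaries for", squality_record["story_id"])
```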

Token Dropping for Efficient BERT Pretraining

Mar 24, 2022
Le Hou, Richard Yuanzhe Pang, Tianyi Zhou, Yuexin Wu, Xinying Song, Xiaodan Song, Denny Zhou

Transformer-based models generally allocate the same amount of computation for each token in a given sequence. We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models, such as BERT, without degrading its performance on downstream tasks. In short, we drop unimportant tokens starting from an intermediate layer in the model to make the model focus on important tokens; the dropped tokens are later picked up by the last layer of the model so that the model still produces full-length sequences. We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead. In our experiments, this simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.

* ACL 2022 
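
A minimal PyTorch sketch of the idea is below: hidden states of low-importance tokens skip the middle layers and are merged back before the last layer, so the output is still full length. The layer split, the keep ratio, and the use of a precomputed per-token importance score are assumptions for illustration, not the paper's exact BERT configuration.

```python
import torch
import torch.nn as nn

# Sketch of token dropping: run all tokens through the early layers, run only
# the "important" tokens through the middle layers, then scatter them back so
# the last layer (and the output) covers the full sequence again.

class TokenDroppingEncoder(nn.Module):
    def __init__(self, d_model=64, n_layers=6, keep_ratio=0.5):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.early = nn.ModuleList([make_layer() for _ in range(n_layers // 2)])
        self.middle = nn.ModuleList([make_layer() for _ in range(n_layers // 2 - 1)])
        self.last = make_layer()
        self.keep_ratio = keep_ratio

    def forward(self, x, importance):
        # x: (batch, seq, d_model); importance: (batch, seq), e.g. a running
        # per-token MLM loss used as an importance estimate.
        for blk in self.early:
            x = blk(x)
        k = max(1, int(x.size(1) * self.keep_ratio))
        keep_idx = importance.topk(k, dim=1).indices                       # (batch, k)
        idx = keep_idx.unsqueeze(-1).expand(-1, -1, x.size(-1))
        kept = x.gather(1, idx)
        for blk in self.middle:                                            # full compute only on kept tokens
            kept = blk(kept)
        x = x.scatter(1, idx, kept)                                        # merge kept tokens back
        return self.last(x)                                                # last layer sees all tokens

enc = TokenDroppingEncoder()
out = enc(torch.randn(2, 16, 64), torch.rand(2, 16))
print(out.shape)  # torch.Size([2, 16, 64])
```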

Amortized Noisy Channel Neural Machine Translation

Dec 16, 2021
Richard Yuanzhe Pang, He He, Kyunghyun Cho

Noisy channel models have been especially effective in neural machine translation (NMT). However, recent approaches like "beam search and rerank" (BSR) incur significant computation overhead during inference, making real-world application infeasible. We aim to build an amortized noisy channel NMT model such that greedily decoding from it would generate translations that maximize the same reward as translations generated using BSR. We attempt three approaches: knowledge distillation, 1-step-deviation imitation learning, and Q learning. The first approach obtains the noisy channel signal from a pseudo-corpus, and the latter two approaches aim to optimize toward a noisy-channel MT reward directly. All three approaches speed up inference by 1-2 orders of magnitude. For all three approaches, the generated translations fail to achieve rewards comparable to BSR, but the translation quality approximated by BLEU is similar to the quality of BSR-produced translations.
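
The reranking rule that BSR applies, and that the amortized model is trained to match at greedy-decoding cost, has roughly the shape sketched below. The three scoring functions and interpolation weights are placeholders rather than the paper's exact configuration.

```python
# Sketch of noisy-channel "beam search and rerank" (BSR) scoring: combine the
# direct model log p(y|x), the channel model log p(x|y), and a language-model
# prior log p(y). Scorer interfaces and weights are illustrative placeholders.

def noisy_channel_score(src, hyp, log_p_direct, log_p_channel, log_p_prior,
                        w_channel=1.0, w_prior=0.3):
    return (log_p_direct(src, hyp)
            + w_channel * log_p_channel(hyp, src)
            + w_prior * log_p_prior(hyp))

def beam_search_and_rerank(src, beam, log_p_direct, log_p_channel, log_p_prior):
    """Rerank an existing beam with the noisy-channel score (the slow teacher)."""
    return max(beam, key=lambda hyp: noisy_channel_score(
        src, hyp, log_p_direct, log_p_channel, log_p_prior))

# Toy usage with a stand-in log-probability that only looks at its last argument.
fake_log_prob = lambda *texts: -0.1 * len(texts[-1].split())
print(beam_search_and_rerank("ein kurzer Satz",
                             ["a short sentence", "a sentence that is a bit longer"],
                             fake_log_prob, fake_log_prob, fake_log_prob))
```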

QuALITY: Question Answering with Long Input Texts, Yes!

Dec 16, 2021
Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, Samuel R. Bowman

To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Current models perform poorly on this task (55.4%) and significantly lag behind human performance (93.5%).

AgreeSum: Agreement-Oriented Multi-Document Summarization

Jun 04, 2021
Richard Yuanzhe Pang, Adam D. Lelkes, Vinh Q. Tran, Cong Yu

We aim to renew interest in a particular multi-document summarization (MDS) task which we call AgreeSum: agreement-oriented multi-document summarization. Given a cluster of articles, the goal is to provide abstractive summaries that represent information common and faithful to all input articles. Given the lack of existing datasets, we create a dataset for AgreeSum, and provide annotations on article-summary entailment relations for a subset of the clusters in the dataset. We aim to create strong baselines for the task by applying the top-performing pretrained single-document summarization model PEGASUS to AgreeSum, leveraging annotated clusters with supervised losses and unannotated clusters with T5-based entailment-related and language-related losses. Compared to other baselines, both automatic evaluation and human evaluation show better article-summary and cluster-summary entailment in generated summaries. On a separate note, we hope that our article-summary entailment annotations contribute to the community's effort in improving abstractive summarization faithfulness.

* Findings of ACL 2021 
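
The mixed objective described above can be sketched roughly as below. The loss names, weights, and batch layout are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of a mixed training objective: supervised loss on
# annotated clusters, entailment- and language-quality penalties on
# unannotated ones. All callables and weights are placeholders.

def agreesum_loss(batch, supervised_nll, entailment_penalty, lm_penalty,
                  w_entail=0.5, w_lm=0.1):
    total = 0.0
    for cluster in batch:
        if cluster.get("gold_summary") is not None:          # annotated cluster
            total += supervised_nll(cluster["articles"], cluster["gold_summary"])
        else:                                                 # unannotated cluster
            summary = cluster["model_summary"]
            total += w_entail * sum(entailment_penalty(article, summary)
                                    for article in cluster["articles"])
            total += w_lm * lm_penalty(summary)
    return total / len(batch)
```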