Greg Durrett

QUDEVAL: The Evaluation of Questions Under Discussion Discourse Parsing

Nov 01, 2023

Yating Wu, Ritika Mangla, Greg Durrett, Junyi Jessy Li

Abstract: Questions Under Discussion (QUD) is a versatile linguistic framework in which discourse progresses as continuously asking questions and answering them. Automatic parsing of a discourse to produce a QUD structure thus entails a complex question generation task: given a document and an answer sentence, generate a question that satisfies linguistic constraints of QUD and can be grounded in an anchor sentence in prior context. These questions are known to be curiosity-driven and open-ended. This work introduces the first framework for the automatic evaluation of QUD parsing, instantiating the theoretical constraints of QUD in a concrete protocol. We present QUDeval, a dataset of fine-grained evaluation of 2,190 QUD questions generated from both fine-tuned systems and LLMs. Using QUDeval, we show that satisfying all constraints of QUD is still challenging for modern LLMs, and that existing evaluation metrics poorly approximate parser quality. Encouragingly, human-authored QUDs are scored highly by our human evaluators, suggesting that there is headroom for further progress on language modeling to improve both QUD parsing and QUD evaluation.

* Camera Ready for EMNLP Main Conference

MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning

Oct 24, 2023

Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, Greg Durrett

Abstract: While large language models (LLMs) equipped with techniques like chain-of-thought prompting have demonstrated impressive capabilities, they still fall short in their ability to reason robustly in complex settings. However, evaluating LLM reasoning is challenging because system capabilities continue to grow while benchmark datasets for tasks like logical deduction have remained static. We introduce MuSR, a dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative. This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm, enabling the construction of complex reasoning instances that challenge GPT-4 (e.g., murder mysteries roughly 1000 words in length) and which can be scaled further as more capable LLMs are released. Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning; this makes it simultaneously much more challenging than other synthetically-crafted benchmarks while remaining realistic and tractable for human annotators to solve with high accuracy. We evaluate a range of LLMs and prompting techniques on this dataset and characterize the gaps that remain for techniques like chain-of-thought to perform robust reasoning.

A Long Way to Go: Investigating Length Correlations in RLHF

Oct 05, 2023

Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett

Abstract: Great successes have been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models. Open-source preference datasets and reward models have enabled wider experimentation beyond generic chat settings, particularly to make systems more "helpful" for tasks like web question answering, summarization, and multi-turn dialogue. When optimizing for helpfulness, RLHF has been consistently observed to drive models to produce longer outputs. This paper demonstrates that optimizing for response length is a significant factor behind RLHF's reported improvements in these settings. First, we study the relationship between reward and length for reward models trained on three open-source preference datasets for helpfulness. Here, length correlates strongly with reward, and improvements in reward score are driven in large part by shifting the distribution over output lengths. We then explore interventions during both RL and reward model learning to see if we can achieve the same downstream improvements as RLHF without increasing length. While our interventions mitigate length increases, they aren't uniformly effective across settings. Furthermore, we find that even running RLHF with a reward based solely on length can reproduce most of the downstream improvements over the initial policy model, showing that reward models in these settings have a long way to go.
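
As a rough illustration of the length analysis described in this abstract, the sketch below computes a Pearson correlation between response length and reward score. It is not the authors' code: the function names, the whitespace token count, and the toy responses and rewards are all assumptions made up for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def token_length(response):
    """Hypothetical length measure: number of whitespace-separated tokens."""
    return len(response.split())

# Toy data (invented for illustration): responses and reward-model scores.
responses = [
    "Yes.",
    "Yes, it is safe.",
    "Yes, it is safe, provided you follow the steps listed in the manual.",
]
rewards = [0.1, 0.4, 0.9]

lengths = [token_length(r) for r in responses]
length_reward_corr = pearson(lengths, rewards)
print(f"length/reward correlation: {length_reward_corr:.3f}")
```

A value near 1 on real data would indicate that the reward model's scores track output length closely, which is the kind of relationship the abstract reports.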

* 20 pages, 12 figures

X-PARADE: Cross-Lingual Textual Entailment and Information Divergence across Paragraphs

Sep 16, 2023

Juan Diego Rodriguez, Katrin Erk, Greg Durrett

Abstract: Understanding when two pieces of text convey the same information is a goal touching many subproblems in NLP, including textual entailment and fact-checking. This problem becomes more complex when those two pieces of text are in different languages. Here, we introduce X-PARADE (Cross-lingual Paragraph-level Analysis of Divergences and Entailments), the first cross-lingual dataset of paragraph-level information divergences. Annotators label a paragraph in a target language at the span level and evaluate it with respect to a corresponding paragraph in a source language, indicating whether a given piece of information is the same, new, or new but can be inferred. This last notion establishes a link with cross-language NLI. Aligned paragraphs are sourced from Wikipedia pages in different languages, reflecting real information divergences observed in the wild. Armed with our dataset, we investigate a diverse set of approaches for this problem, including classic token alignment from machine translation, textual entailment methods that localize their decisions, and prompting of large language models. Our results show that these methods vary in their capability to handle inferable information, but they all fall short of human performance.

Deductive Additivity for Planning of Natural Language Proofs

Jul 06, 2023

Zayne Sprague, Kaj Bostrom, Swarat Chaudhuri, Greg Durrett

Abstract: Current natural language systems designed for multi-step claim validation typically operate in two phases: retrieve a set of relevant premise statements using heuristics (planning), then generate novel conclusions from those statements using a large language model (deduction). The planning step often requires expensive Transformer operations and does not scale to arbitrary numbers of premise statements. In this paper, we investigate whether an efficient planning heuristic is possible via embedding spaces compatible with deductive reasoning. Specifically, we evaluate whether embedding spaces exhibit a property we call deductive additivity: the sum of premise statement embeddings should be close to embeddings of conclusions based on those premises. We explore multiple sources of off-the-shelf dense embeddings in addition to fine-tuned embeddings from GPT3 and sparse embeddings from BM25. We study embedding models both intrinsically, evaluating whether the property of deductive additivity holds, and extrinsically, using them to assist planning in natural language proof generation. Lastly, we create a dataset, Single-Step Reasoning Contrast (SSRC), to further probe performance on various reasoning types. Our findings suggest that while standard embedding methods frequently embed conclusions near the sums of their premises, they fall short of being effective heuristics and lack the ability to model certain categories of reasoning.
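
The deductive additivity property stated in this abstract can be sketched as a similarity check between the sum of premise embeddings and a candidate conclusion embedding. The snippet below is a minimal illustration only: the use of cosine similarity as the closeness measure, the function names, and the three-dimensional toy vectors are assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def additivity_score(premise_embeddings, conclusion_embedding):
    """Similarity between the element-wise sum of premises and a conclusion."""
    summed = [sum(components) for components in zip(*premise_embeddings)]
    return cosine(summed, conclusion_embedding)

# Toy embeddings (invented): two premises and two candidate conclusions.
premises = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
entailed_conclusion = [1.0, 1.0, 1.0]
unrelated_statement = [-1.0, 0.0, 0.2]

score_entailed = additivity_score(premises, entailed_conclusion)
score_unrelated = additivity_score(premises, unrelated_statement)
print(f"entailed: {score_entailed:.3f}, unrelated: {score_unrelated:.3f}")
```

Under this reading, a planner could rank candidate premise sets by such a score without running a Transformer over every combination, which is the efficiency motivation the abstract gives.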

Propagating Knowledge Updates to LMs Through Distillation

Jun 15, 2023

Shankar Padmanabhan, Yasumasa Onoe, Michael J. Q. Zhang, Greg Durrett, Eunsol Choi

Abstract: Modern language models have the capacity to store and use immense amounts of knowledge about real-world entities, but it remains unclear how to update their implicit "knowledge bases." While prior methods for updating knowledge in LMs successfully inject facts, updated LMs then fail to make inferences based on these injected facts. In this work, we demonstrate that a context distillation-based approach can both impart knowledge about entities and propagate that knowledge to enable broader inferences. Our approach consists of two stages: transfer set generation and distillation on the transfer set. We first generate a transfer set by simply prompting a language model to generate a continuation from the entity definition. Then, we update the model parameters so that the distribution of the LM (the student) matches the distribution of the LM conditioned on the definition (the teacher) on the transfer set. Our experiments demonstrate that this approach is more effective in propagating knowledge updates compared to fine-tuning and other gradient-based knowledge-editing methods without compromising performance in other contexts, even when injecting the definitions of up to 150 entities at once.
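
The distillation stage described in this abstract asks the student's next-token distribution to match the teacher's on the transfer set. One common way to express such a matching objective is a KL divergence averaged over token positions; the sketch below assumes that form. The KL direction, the function names, and the toy logits are assumptions for illustration and are not confirmed by the abstract.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits):
    """Mean per-position KL(teacher || student) over a token sequence."""
    per_position = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_position) / len(per_position)

# Toy logits (invented) over a 3-word vocabulary at 2 token positions.
# Teacher: the LM conditioned on the entity definition.
# Student: the same LM without the definition in context.
teacher = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
student = [[0.3, 0.4, 0.2], [0.1, 0.2, 1.0]]

loss_before = distillation_loss(teacher, student)
loss_matched = distillation_loss(teacher, teacher)
print(f"loss before update: {loss_before:.4f}, when matched: {loss_matched:.4f}")
```

Minimizing a loss of this kind with respect to the student's parameters pushes the unconditioned model toward the behavior it shows when the definition is in its context.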

EEL: Efficiently Encoding Lattices for Reranking

Jun 01, 2023

Prasann Singhal, Jiacheng Xu, Xi Ye, Greg Durrett

Abstract: Standard decoding approaches for conditional text generation tasks typically search for an output hypothesis with high model probability, but this may not yield the best hypothesis according to human judgments of quality. Reranking to optimize for "downstream" metrics can better optimize for quality, but many metrics of interest are computed with pre-trained language models, which are slow to apply to large numbers of hypotheses. We explore an approach for reranking hypotheses by using Transformers to efficiently encode lattices of generated outputs, a method we call EEL. With a single Transformer pass over the entire lattice, we can approximately compute a contextualized representation of each token as if it were only part of a single hypothesis in isolation. We combine this approach with a new class of token-factored rerankers (TFRs) that allow for efficient extraction of high reranker-scoring hypotheses from the lattice. Empirically, our approach incurs minimal degradation error compared to the exponentially slower approach of encoding each hypothesis individually. When applying EEL with TFRs across three text generation tasks, our results show both substantial speedup compared to naive reranking and often better performance on downstream metrics than comparable approaches.

* ACL 2023 (16 pages), code available at https://github.com/PrasannS/eel-reranking

Less Likely Brainstorming: Using Language Models to Generate Alternative Hypotheses

May 30, 2023

Liyan Tang, Yifan Peng, Yanshan Wang, Ying Ding, Greg Durrett, Justin F. Rousseau

Abstract: A human decision-maker benefits the most from an AI assistant that corrects for their biases. For problems such as generating interpretation of a radiology report given findings, a system predicting only highly likely outcomes may be less useful, where such outcomes are already obvious to the user. To alleviate biases in human decision-making, it is worth considering a broad differential diagnosis, going beyond the most likely options. We introduce a new task, "less likely brainstorming," that asks a model to generate outputs that humans think are relevant but less likely to happen. We explore the task in two settings: a brain MRI interpretation generation setting and an everyday commonsense reasoning setting. We found that a baseline approach of training with less likely hypotheses as targets generates outputs that humans evaluate as either likely or irrelevant nearly half of the time; standard MLE training is not effective. To tackle this problem, we propose a controlled text generation method that uses a novel contrastive learning strategy to encourage models to differentiate between generating likely and less likely outputs according to humans. We compare our method with several state-of-the-art controlled text generation models via automatic and human evaluations and show that our models' capability of generating less likely outputs is improved.

* Accepted to ACL (Findings) 2023

Coeditor: Leveraging Contextual Changes for Multi-round Code Auto-editing

May 29, 2023

Jiayi Wei, Greg Durrett, Isil Dillig

Abstract: Developers often dedicate significant time to maintaining and refactoring existing code. However, most prior work on generative models for code focuses solely on creating new code, neglecting the unique requirements of editing existing code. In this work, we explore a multi-round code auto-editing setting, aiming to predict edits to a code region based on recent changes within the same codebase. Our model, Coeditor, is a fine-tuned CodeT5 model with enhancements specifically designed for code editing tasks. We encode code changes using a line diff format and employ static analysis to form large customized model contexts, ensuring appropriate information for prediction. We collect a code editing dataset from the commit histories of 1650 open-source Python projects for training and evaluation. In a simplified single-round, single-edit task, Coeditor significantly outperforms the best code completion approach -- nearly doubling its exact-match accuracy, despite using a much smaller model -- demonstrating the benefits of incorporating editing history for code completion. In a multi-round, multi-edit setting, we observe substantial gains by iteratively prompting the model with additional user edits. We open-source our code, data, and model weights to encourage future research and release a VSCode extension powered by our model for interactive usage.
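
To give a sense of what a line-level diff encoding of a code change looks like, the sketch below uses Python's standard `difflib.ndiff`. This is a generic line diff, not Coeditor's actual encoding, which the abstract does not specify; the helper name and the before/after snippets are invented for the example.

```python
import difflib

def line_diff(before, after):
    """Return a line-level diff: '- ' removed, '+ ' added, '  ' unchanged."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # Drop ndiff's '? ' hint lines, which mark intra-line character changes.
    return [line for line in diff if not line.startswith("?")]

# Toy edit (invented): replace a hard-coded constant with math.pi.
before = "def area(r):\n    return 3.14 * r * r\n"
after = "import math\n\ndef area(r):\n    return math.pi * r * r\n"

change = line_diff(before, after)
for line in change:
    print(line)
```

A sequence like `change` shows both the old and new version of each edited line, which is the kind of signal an editing model could condition on when predicting related edits elsewhere in the codebase.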

Drafting Event Schemas using Language Models

May 24, 2023

Anisha Gunjal, Greg Durrett

Abstract: Past work has studied event prediction and event language modeling, sometimes mediated through structured representations of knowledge in the form of event schemas. Such schemas can lead to explainable predictions and forecasting of unseen events given incomplete information. In this work, we look at the process of creating such schemas to describe complex events. We use large language models (LLMs) to draft schemas directly in natural language, which can be further refined by human curators as necessary. Our focus is on whether we can achieve sufficient diversity and recall of key events and whether we can produce the schemas in a sufficiently descriptive style. We show that large language models are able to achieve moderate recall against schemas taken from two different datasets, with even better results when multiple prompts and multiple samples are combined. Moreover, we show that textual entailment methods can be used for both matching schemas to instances of events as well as evaluating overlap between gold and predicted schemas. Our method paves the way for easier distillation of event knowledge from large language models into schemas.
