Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Daniel Khashabi

Shammie

How Effective Is Self-Consistency for Long-Context Problems?

Nov 02, 2024

Adam Byerly, Daniel Khashabi

Abstract:Self-consistency (SC) has been demonstrated to enhance the performance of large language models (LLMs) across various tasks and domains involving short content. However, does this evidence support its effectiveness for long-context problems? This study examines the role of SC in long-context scenarios, where LLMs often struggle with position bias, hindering their ability to utilize information effectively from all parts of their long input context. We examine a range of design parameters, including different models, context lengths, prompt formats, and types of datasets and tasks. Our findings demonstrate that SC, while effective for short-context problems, fundamentally fails for long-context tasks -- not only does it fail to mitigate position bias, but it can also actively degrade performance. We observe that the effectiveness of SC varies with context length and model size but remains mainly unaffected by prompt format or task type. These results provide valuable insight into the limitations of current LLMs in long-context understanding and highlight the need for more sophisticated approaches to address position bias in these models.

* 12 pages, 4 figures

Via

Access Paper or Ask Questions

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Oct 11, 2024

Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, Benjamin Van Durme

Figure 1 for Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Figure 2 for Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Figure 3 for Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Figure 4 for Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Abstract:The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach: the model refuses to interact with any content deemed unsafe by the model provider. This approach lacks flexibility in the face of varying social norms across cultures and regions. In addition, users may have diverse safety needs, making a model with static safety standards too restrictive to be useful, as well as too costly to be re-aligned. We propose Controllable Safety Alignment (CoSA), a framework designed to adapt models to diverse safety requirements without re-training. Instead of aligning a fixed model, we align models to follow safety configs -- free-form natural language descriptions of the desired safety behaviors -- that are provided as part of the system prompt. To adjust model safety behavior, authorized users only need to modify such safety configs at inference time. To enable that, we propose CoSAlign, a data-centric method for aligning LLMs to easily adapt to diverse safety configs. Furthermore, we devise a novel controllability evaluation protocol that considers both helpfulness and configured safety, summarizing them into CoSA-Score, and construct CoSApien, a human-authored benchmark that consists of real-world LLM use cases with diverse safety requirements and corresponding evaluation prompts. We show that CoSAlign leads to substantial gains of controllability over strong baselines including in-context alignment. Our framework encourages better representation and adaptation to pluralistic human values in LLMs, and thereby increasing their practicality.

Via

Access Paper or Ask Questions

Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets

Oct 06, 2024

Tianjian Li, Haoran Xu, Weiting Tan, Dongwei Jiang, Kenton Murray, Daniel Khashabi

Figure 1 for Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets

Figure 2 for Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets

Figure 3 for Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets

Figure 4 for Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets

Abstract:Data availability across domains often follows a long-tail distribution: a few domains have abundant data, while most face data scarcity. This imbalance poses challenges in training language models uniformly across all domains. In our study, we focus on multilingual settings, where data sizes vary significantly between high- and low-resource languages. Common strategies to address this include upsampling low-resource languages (Temperature Sampling) or upweighting their loss (Scalarization). Although often considered equivalent, this assumption has not been proven, which motivates our study. Through both theoretical and empirical analysis, we identify the conditions under which these approaches are equivalent and when they diverge. Specifically, we demonstrate that these two methods are equivalent under full gradient descent, but this equivalence breaks down with stochastic gradient descent. Empirically, we observe that Temperature Sampling converges more quickly but is prone to overfitting. We argue that this faster convergence is likely due to the lower variance in gradient estimations, as shown theoretically. Based on these insights, we propose Cooldown, a strategy that reduces sampling temperature during training, accelerating convergence without overfitting to low-resource languages. Our method is competitive with existing data re-weighting and offers computational efficiency.

* 18 pages

Via

Access Paper or Ask Questions

RATIONALYST: Pre-training Process-Supervision for Improving Reasoning

Oct 01, 2024

Dongwei Jiang, Guoxuan Wang, Yining Lu, Andrew Wang, Jingyu Zhang, Chuyu Liu, Benjamin Van Durme, Daniel Khashabi

Figure 1 for RATIONALYST: Pre-training Process-Supervision for Improving Reasoning

Figure 2 for RATIONALYST: Pre-training Process-Supervision for Improving Reasoning

Figure 3 for RATIONALYST: Pre-training Process-Supervision for Improving Reasoning

Figure 4 for RATIONALYST: Pre-training Process-Supervision for Improving Reasoning

Abstract:The reasoning steps generated by LLMs might be incomplete, as they mimic logical leaps common in everyday communication found in their pre-training data: underlying rationales are frequently left implicit (unstated). To address this challenge, we introduce RATIONALYST, a model for process-supervision of reasoning based on pre-training on a vast collection of rationale annotations extracted from unlabeled data. We extract 79k rationales from web-scale unlabelled dataset (the Pile) and a combination of reasoning datasets with minimal human intervention. This web-scale pre-training for reasoning allows RATIONALYST to consistently generalize across diverse reasoning tasks, including mathematical, commonsense, scientific, and logical reasoning. Fine-tuned from LLaMa-3-8B, RATIONALYST improves the accuracy of reasoning by an average of 3.9% on 7 representative reasoning benchmarks. It also demonstrates superior performance compared to significantly larger verifiers like GPT-4 and similarly sized models fine-tuned on matching training sets.

* Our code, data, and model can be found at this repository: https://github.com/JHU-CLSP/Rationalyst

Via

Access Paper or Ask Questions

Benchmarking Language Model Creativity: A Case Study on Code Generation

Jul 12, 2024

Yining Lu, Dixuan Wang, Tianjian Li, Dongwei Jiang, Daniel Khashabi

Figure 1 for Benchmarking Language Model Creativity: A Case Study on Code Generation

Figure 2 for Benchmarking Language Model Creativity: A Case Study on Code Generation

Figure 3 for Benchmarking Language Model Creativity: A Case Study on Code Generation

Figure 4 for Benchmarking Language Model Creativity: A Case Study on Code Generation

Abstract:As LLMs become increasingly prevalent, it is interesting to consider how ``creative'' these models can be. From cognitive science, creativity consists of at least two key characteristics: \emph{convergent} thinking (purposefulness to achieve a given goal) and \emph{divergent} thinking (adaptability to new environments or constraints) \citep{runco2003critical}. In this work, we introduce a framework for quantifying LLM creativity that incorporates the two characteristics. This is achieved by (1) Denial Prompting pushes LLMs to come up with more creative solutions to a given problem by incrementally imposing new constraints on the previous solution, compelling LLMs to adopt new strategies, and (2) defining and computing the NeoGauge metric which examines both convergent and divergent thinking in the generated creative responses by LLMs. We apply the proposed framework on Codeforces problems, a natural data source for collecting human coding solutions. We quantify NeoGauge for various proprietary and open-source models and find that even the most creative model, GPT-4, still falls short of demonstrating human-like creativity. We also experiment with advanced reasoning strategies (MCTS, self-correction, etc.) and observe no significant improvement in creativity. As a by-product of our analysis, we release NeoCoder dataset for reproducing our results on future models.

Via

Access Paper or Ask Questions

WorldAPIs: The World Is Worth How Many APIs? A Thought Experiment

Jul 10, 2024

Jiefu Ou, Arda Uzunoglu, Benjamin Van Durme, Daniel Khashabi

Abstract:AI systems make decisions in physical environments through primitive actions or affordances that are accessed via API calls. While deploying AI agents in the real world involves numerous high-level actions, existing embodied simulators offer a limited set of domain-salient APIs. This naturally brings up the questions: how many primitive actions (APIs) are needed for a versatile embodied agent, and what should they look like? We explore this via a thought experiment: assuming that wikiHow tutorials cover a wide variety of human-written tasks, what is the space of APIs needed to cover these instructions? We propose a framework to iteratively induce new APIs by grounding wikiHow instruction to situated agent policies. Inspired by recent successes in large language models (LLMs) for embodied planning, we propose a few-shot prompting to steer GPT-4 to generate Pythonic programs as agent policies and bootstrap a universe of APIs by 1) reusing a seed set of APIs; and then 2) fabricate new API calls when necessary. The focus of this thought experiment is on defining these APIs rather than their executability. We apply the proposed pipeline on instructions from wikiHow tutorials. On a small fraction (0.5%) of tutorials, we induce an action space of 300+ APIs necessary for capturing the rich variety of tasks in the physical world. A detailed automatic and human analysis of the induction output reveals that the proposed pipeline enables effective reuse and creation of APIs. Moreover, a manual review revealed that existing simulators support only a small subset of the induced APIs (9 of the top 50 frequent APIs), motivating the development of action-rich embodied environments.

* ACL 2024 NLRSE, 8 pages

Via

Access Paper or Ask Questions

Core: Robust Factual Precision Scoring with Informative Sub-Claim Identification

Jul 04, 2024

Zhengping Jiang, Jingyu Zhang, Nathaniel Weir, Seth Ebner, Miriam Wanner, Kate Sanders, Daniel Khashabi, Anqi Liu, Benjamin Van Durme

Figure 1 for Core: Robust Factual Precision Scoring with Informative Sub-Claim Identification

Figure 2 for Core: Robust Factual Precision Scoring with Informative Sub-Claim Identification

Figure 3 for Core: Robust Factual Precision Scoring with Informative Sub-Claim Identification

Figure 4 for Core: Robust Factual Precision Scoring with Informative Sub-Claim Identification

Abstract:Hallucinations -- the generation of untrue claims -- pose a challenge to the application of large language models (LLMs) [1] thereby motivating the development of metrics to evaluate factual precision. We observe that popular metrics using the Decompose-Then-Verify framework, such as FActScore [2], can be manipulated by adding obvious or repetitive claims to artificially inflate scores. We expand the FActScore dataset to design and analyze factual precision metrics, demonstrating that models can be trained to achieve high scores under existing metrics through exploiting the issues we identify. This motivates our new customizable plug-and-play subclaim selection component called Core, which filters down individual subclaims according to their uniqueness and informativeness. Metrics augmented by Core are substantially more robust as shown in head-to-head comparisons. We release an evaluation framework supporting the modular use of Core (https://github.com/zipJiang/Core) and various decomposition strategies, and we suggest its adoption by the LLM community. [1] Hong et al., "The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models", arXiv:2404.05904v2 [cs.CL]. [2] Min et al., "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation", arXiv:2305.14251v2 [cs.CL].

Via

Access Paper or Ask Questions

LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression

Jun 28, 2024

Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille

Figure 1 for LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression

Figure 2 for LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression

Figure 3 for LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression

Figure 4 for LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression

Abstract:While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in large multi-modal models (LMMs) has remained a largely overlooked area. In this work, we present the study on the analysis of redundancy concerning visual tokens and efficient training within these models. Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage by simply average pooling only leads to a minimal 3% reduction in visual question answering accuracy on the GQA benchmark, indicating significant redundancy in visual context. Addressing this, we introduce Visual Context Compressor, which reduces the number of visual tokens during training to enhance training efficiency without sacrificing performance. To minimize information loss caused by the compression on visual tokens while maintaining training efficiency, we develop LLaVolta as a lite training scheme. LLaVolta incorporates stage-wise visual context compression to progressively compress the visual tokens from heavily to lightly, and finally no compression at the end of training, yielding no loss of information when testing. Extensive experiments demonstrate that our approach enhances the performance of MLLMs in both image-language and video-language understanding, while also significantly cutting training costs. Code is available at https://github.com/Beckschen/LLaVolta

* Code is available at https://github.com/Beckschen/LLaVolta

Via

Access Paper or Ask Questions

Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell

Jun 20, 2024

Taiming Lu, Muhan Gao, Kuai Yu, Adam Byerly, Daniel Khashabi

Abstract:Large Language Models (LLMs) exhibit positional bias, struggling to utilize information from the middle or end of long contexts. Our study explores LLMs' long-context reasoning by probing their hidden representations. We find that while LLMs encode the position of target information, they often fail to leverage this in generating accurate responses. This reveals a disconnect between information retrieval and utilization, a "know but don't tell" phenomenon. We further analyze the relationship between extraction time and final accuracy, offering insights into the underlying mechanics of transformer models.

Via

Access Paper or Ask Questions

DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation

May 22, 2024

Weiting Tan, Jingyu Zhang, Lingfeng Shen, Daniel Khashabi, Philipp Koehn

Figure 1 for DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation

Figure 2 for DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation

Figure 3 for DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation

Figure 4 for DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation

Abstract:Non-autoregressive Transformers (NATs) are recently applied in direct speech-to-speech translation systems, which convert speech across different languages without intermediate text data. Although NATs generate high-quality outputs and offer faster inference than autoregressive models, they tend to produce incoherent and repetitive results due to complex data distribution (e.g., acoustic and linguistic variations in speech). In this work, we introduce DiffNorm, a diffusion-based normalization strategy that simplifies data distributions for training NAT models. After training with a self-supervised noise estimation objective, DiffNorm constructs normalized target data by denoising synthetically corrupted speech features. Additionally, we propose to regularize NATs with classifier-free guidance, improving model robustness and translation quality by randomly dropping out source information during training. Our strategies result in a notable improvement of about +7 ASR-BLEU for English-Spanish (En-Es) and +2 ASR-BLEU for English-French (En-Fr) translations on the CVSS benchmark, while attaining over 14x speedup for En-Es and 5x speedup for En-Fr translations compared to autoregressive baselines.

Via

Access Paper or Ask Questions