John Heyer

Assessing Robustness to Spurious Correlations in Post-Training Language Models

May 09, 2025

Julia Shuieh, Prasann Singhal, Apaar Shanker, John Heyer, George Pu, Samuel Denton

Abstract: Supervised and preference-based fine-tuning techniques have become popular for aligning large language models (LLMs) with user intent and correctness criteria. However, real-world training data often exhibits spurious correlations -- arising from biases, dataset artifacts, or other "shortcut" features -- that can compromise a model's performance or generalization. In this paper, we systematically evaluate three post-training algorithms -- Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Kahneman-Tversky Optimization (KTO) -- across a diverse set of synthetic tasks and spuriousness conditions. Our tasks span mathematical reasoning, constrained instruction-following, and document-grounded question answering. We vary the degree of spurious correlation (10% vs. 90%) and investigate two forms of artifacts: "Feature Ambiguity" and "Distributional Narrowness." Our results show that the models often, but not always, degrade under higher spuriousness. The preference-based methods (DPO/KTO) can demonstrate relative robustness in mathematical reasoning tasks. By contrast, SFT maintains stronger performance in complex, context-intensive tasks. These findings highlight that no single post-training strategy outperforms the others in all scenarios; the best choice depends on the type of target task and the nature of the spurious correlations.

* ICLR '25 Workshop on Spurious Correlation and Shortcut Learning
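
To make the "degree of spurious correlation" in the abstract above concrete, here is a minimal sketch of one way such a condition could be constructed in a synthetic dataset. It is not the authors' procedure: the function names, the `" [cue]"` marker, and the choice to attach the marker only to positive-label examples are all assumptions made for illustration.

```python
import random


def inject_spurious_marker(examples, rate, marker=" [cue]", seed=0):
    """Attach a label-correlated marker to a fraction of positive examples.

    `examples` is a list of (text, label) pairs with label in {0, 1}.
    Hypothetical setup: the marker is appended to `rate` of the
    positive-label texts and never to negative-label texts, so a model
    could learn "marker present => label 1" as a shortcut.
    """
    rng = random.Random(seed)
    positive_idx = [i for i, (_, label) in enumerate(examples) if label == 1]
    n_marked = round(rate * len(positive_idx))
    marked = set(rng.sample(positive_idx, n_marked))
    return [
        (text + marker if i in marked else text, label)
        for i, (text, label) in enumerate(examples)
    ]


def marker_rate(examples, marker=" [cue]"):
    """Fraction of positive-label examples that carry the marker."""
    positives = [text for text, label in examples if label == 1]
    return sum(text.endswith(marker) for text in positives) / len(positives)


# Toy data: 20 examples, alternating labels (10 positives, 10 negatives).
data = [(f"question {i}", i % 2) for i in range(20)]
low = inject_spurious_marker(data, rate=0.1)
high = inject_spurious_marker(data, rate=0.9)
```

Training on `low` and `high` and then evaluating on data without the marker is one way to compare how strongly each post-training method leans on the shortcut.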

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Sep 29, 2024

Yung-Chieh Chan, George Pu, Apaar Shanker, Parth Suresh, Penn Jenks, John Heyer, Sam Denton

Abstract: As large language models (LLMs) are applied to more use cases, creating high-quality, task-specific datasets for fine-tuning becomes a bottleneck for model improvement. Using high-quality human data has been the most common approach to unlock model performance, but is prohibitively expensive in many scenarios. Several alternative methods have also emerged, such as generating synthetic or hybrid data, but the effectiveness of these approaches remains unclear, especially in resource-constrained scenarios and tasks that are not easily verified. To investigate this, we group various synthetic data generation strategies into three representative categories -- Answer Augmentation, Question Rephrase and New Question -- and study the performance of student LLMs trained under various constraints, namely seed instruction set size and query budget. We demonstrate that these strategies are not equally effective across settings. Notably, the optimal data generation strategy depends strongly on the ratio between the available teacher query budget and the size of the seed instruction set. When this ratio is low, generating new answers to existing questions proves most effective, but as this ratio increases, generating new questions becomes optimal. Across all tasks, we find that the choice of augmentation method and other design choices matter substantially more in low to mid data regimes than in high data regimes. We provide a practical framework for selecting the appropriate augmentation method across settings, taking into account additional factors such as the scalability of each method, the importance of verifying synthetic data, and the use of different LLMs for synthetic data generation.
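
The budget-to-seed ratio rule described in the abstract above can be sketched as a small helper. This is an illustration only: the function names are hypothetical, and the `threshold` default is a placeholder, since the abstract does not state where the crossover between strategies occurs.

```python
def budget_ratio(query_budget, seed_set_size):
    """Ratio of available teacher queries to seed instructions."""
    if seed_set_size <= 0:
        raise ValueError("seed_set_size must be positive")
    return query_budget / seed_set_size


def suggest_strategy(query_budget, seed_set_size, threshold=10.0):
    """Pick a data generation strategy from the budget-to-seed ratio.

    Below `threshold`, suggest generating new answers to existing
    questions; at or above it, suggest generating new questions.
    `threshold` is a hypothetical placeholder, not a value from the
    paper, and would need to be tuned per task.
    """
    ratio = budget_ratio(query_budget, seed_set_size)
    if ratio < threshold:
        return "answer_augmentation"
    return "new_question"


# Small budget relative to the seed set vs. a much larger budget.
few_queries = suggest_strategy(query_budget=2_000, seed_set_size=1_000)
many_queries = suggest_strategy(query_budget=50_000, seed_set_size=1_000)
```

The sketch covers only the two regimes the abstract names explicitly; it says nothing about where Question Rephrase fits, or about the verification and scalability factors the framework also weighs.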

SoftQE: Learned Representations of Queries Expanded by LLMs

Feb 20, 2024

Varad Pimpalkhute, John Heyer, Xusen Yin, Sameer Gupta

Abstract: We investigate the integration of Large Language Models (LLMs) into query encoders to improve dense retrieval without increasing latency and cost, by circumventing the dependency on LLMs at inference time. SoftQE incorporates knowledge from LLMs by mapping embeddings of input queries to those of the LLM-expanded queries. While improvements over various strong baselines on in-domain MS-MARCO metrics are marginal, SoftQE improves performance by 2.83 absolute percentage points on average on five out-of-domain BEIR tasks.

* To be published in ECIR 2024 proceedings
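
The idea of mapping a query's embedding to that of its LLM-expanded form can be illustrated with a toy distance objective. This sketch assumes mean squared error as the distance and uses made-up three-dimensional vectors; the abstract does not specify the loss, and the names below are hypothetical.

```python
def mse(student_embedding, target_embedding):
    """Mean squared error between two equal-length embedding vectors."""
    if len(student_embedding) != len(target_embedding):
        raise ValueError("embeddings must have the same dimension")
    squared = [
        (s - t) ** 2 for s, t in zip(student_embedding, target_embedding)
    ]
    return sum(squared) / len(squared)


# Toy stand-ins for encoder outputs (assumed values, not real embeddings):
# the student encodes the raw query, the target encodes the query after
# an LLM has expanded it offline.
raw_query_embedding = [0.0, 0.0, 1.0]
expanded_query_embedding = [3.0, 4.0, 1.0]

# Training would push this value toward zero, so that at inference time
# the raw query alone yields an embedding close to the expanded one and
# no LLM call is needed.
distillation_loss = mse(raw_query_embedding, expanded_query_embedding)
```

Because the expanded-query embeddings serve only as training targets, the LLM is needed when preparing training data but not when serving queries, which is the latency benefit the abstract describes.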
