Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Prakhar Gupta

Consistency Training Along the Transformer Stack

Jun 04, 2026

Sukrati Gautam, Neil Shah, Arav Dhoot, Bryan Maruyama, Caroline Wei, Rohan Kapoor, Robert Sidey, Prakhar Gupta, Zi Cheng Huang, David Demitri Africa

Abstract:Consistency training encourages models to behave similarly across different contexts, and has shown promise for reducing misalignment. We broaden the scope of consistency training in two ways. First, we introduce two new internal consistency targets: MLP Consistency Training (MLPCT), which matches post-activation MLP states, and Attention Consistency Training (AttCT), which matches per-head attention distributions. Second, we apply consistency training to four additional safety threats: persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment. Across several models and threat settings, we find that consistency training reduces misalignment well beyond the sycophancy and jailbreak settings studied in prior work. We also find cases of cross-threat generalization, where training against one failure mode improves robustness to another, and identify a shared residual-stream mechanism underlying ACT, MLPCT, and AttCT, while distinguishing BCT as mechanistically distinct. Our results suggest that consistency training is a flexible and extensible framework for alignment, capable of unifying defenses against a broader class of model pathologies.

* Submitted to EMNLP 2026

Via

Access Paper or Ask Questions

Consistency Training while Mitigating Obfuscation via Rate Matching

Jun 01, 2026

Sohaib Imran, Prakhar Gupta, Jannes Elstner, David Demitri Africa

Abstract:Large language models are often influenced by extraneous input features, such as cues revealing a user's preferred answer. Consistency training reduces this influence by training models to behave similarly across inputs with and without the extraneous feature. However, existing methods train for consistency over entire responses or internal activations, which also constrains whether the model verbalises said extraneous features. We show this leads to obfuscation, where the model learns not to mention a cue while remaining influenced by it, which may undermine monitorability. To address this, we introduce Rate Matching Consistency Training (RMCT), which trains for consistency over selected behavioural properties without constraining how this behaviour is expressed. RMCT matches the rate at which the model exhibits a target behaviour (e.g., following a bias cue) across input perturbations, rather than requiring paired inputs with and without the extraneous feature, extending consistency training to settings where the extraneous features cannot be removed. We evaluate RMCT on sycophancy reduction in two open-weight language models, achieving reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types, while largely preserving the model's tendency to verbalise the bias cue. Further, we find that RMCT is more data-efficient at the expense of being less compute-efficient in our experiments. Overall, RMCT shows that consistency training can improve behavioural robustness without directly trading off against monitorability.

Via

Access Paper or Ask Questions

Sparse Memory Finetuning as a Low-Forgetting Alternative to LoRA and Full Finetuning

May 04, 2026

Prakhar Gupta, Garv Shah, Satyam Goyal, Anirudh Kanchi

Abstract:Adapting a pretrained language model to a new task often hurts the general capabilities it already had, a problem known as catastrophic forgetting. Sparse Memory Finetuning (SMF) tries to avoid this by adding key-value memory layers to the model and, on each training step, updating only the small set of memory rows that the current batch reads most heavily. We re-implement SMF on Qwen-2.5-0.5B-Instruct and compare it with LoRA and full finetuning on MedMCQA, a 4-choice medical exam task, using WikiText perplexity and TriviaQA accuracy as forgetting probes. SMF improves MedMCQA by 2.5 percentage points while keeping both forgetting probes within roughly 1 point of the base model, whereas LoRA and full finetuning achieve larger gains but with clear drift on both. We also compare two row-selection rules (KL-divergence and TF-IDF), which balance the two forgetting metrics differently.

Via

Access Paper or Ask Questions

Self-Mined Hardness for Safety Fine-Tuning

May 04, 2026

Prakhar Gupta, Garv Shah, Donghua Zhang

Abstract:Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the hardest prompts paired with the model's own non-jailbroken rollouts. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this approach cuts the WildJailbreak attack success rate from 11.5% and 20.1% down to 1-3%, but pushes refusal on jailbreak-shaped benign prompts from 14-22% to 74-94%. Interleaving the same hard prompts 1:1 with adversarially-framed benign prompts (prompts that look like jailbreaks but have benign intent) cuts that refusal back down to 30-51% on 8B and 52-72% on 3B, at a cost of 2-6 percentage points of attack success rate. Within the mixed regime, training on the hardest half of the eligible pool rather than a random half cuts the remaining ASR by 35-50% (about 3 percentage points) on both models.

Via

Access Paper or Ask Questions

Improving Sparse Memory Finetuning

Apr 06, 2026

Satyam Goyal, Anirudh Kanchi, Garv Shah, Prakhar Gupta

Abstract:Large Language Models (LLMs) are typically static after training, yet real-world applications require continual adaptation to new knowledge without degrading existing capabilities. Standard approaches to updating models, like full finetuning or parameter-efficient methods (e.g., LoRA), face a fundamental trade-off: catastrophic forgetting. They modify shared dense representations, causing interference across tasks. Sparse Memory Finetuning (SMF) offers a promising alternative by localizing updates to a small subset of parameters in explicit memory layers. In this work, we present an open-source pipeline to retrofit existing pretrained models (Qwen-2.5-0.5B) with sparse memory modules, enabling effective continual learning on consumer hardware. We extend prior work by introducing a theoretically grounded slot-selection mechanism based on Kullback-Leibler (KL) divergence, which prioritizes memory updates for informationally "surprising" tokens relative to a background distribution. Our experiments demonstrate that our retrofitted models can acquire new factual knowledge with minimal forgetting of held-out capabilities, validating the sparse update hypothesis in a practical setting.

Via

Access Paper or Ask Questions

Revisiting In-Context Learning with Long Context Language Models

Dec 22, 2024

Jinheon Baek, Sun Jae Lee, Prakhar Gupta, Geunseob, Oh, Siddharth Dalmia, Prateek Kolhar

Figure 1 for Revisiting In-Context Learning with Long Context Language Models

Figure 2 for Revisiting In-Context Learning with Long Context Language Models

Figure 3 for Revisiting In-Context Learning with Long Context Language Models

Figure 4 for Revisiting In-Context Learning with Long Context Language Models

Abstract:In-Context Learning (ICL) is a technique by which language models make predictions based on examples provided in their input context. Previously, their context window size imposed a limit on the number of examples that can be shown, making example selection techniques crucial for identifying the maximally effective set of examples. However, the recent advent of Long Context Language Models (LCLMs) has significantly increased the number of examples that can be included in context, raising an important question of whether ICL performance in a many-shot regime is still sensitive to the method of sample selection. To answer this, we revisit these approaches in the context of LCLMs through extensive experiments on 18 datasets spanning 4 tasks. Surprisingly, we observe that sophisticated example selection techniques do not yield significant improvements over a simple random sample selection method. Instead, we find that the advent of LCLMs has fundamentally shifted the challenge of ICL from that of selecting the most effective examples to that of collecting sufficient examples to fill the context window. Specifically, in certain datasets, including all available examples does not fully utilize the context window; however, by augmenting the examples in context with a simple data augmentation approach, we substantially improve ICL performance by 5%.

Via

Access Paper or Ask Questions

Reinforcement Learning Compensated Model Predictive Control for Off-road Driving on Unknown Deformable Terrain

Aug 17, 2024

Prakhar Gupta, Jonathon M. Smereka, Yunyi Jia

Figure 1 for Reinforcement Learning Compensated Model Predictive Control for Off-road Driving on Unknown Deformable Terrain

Figure 2 for Reinforcement Learning Compensated Model Predictive Control for Off-road Driving on Unknown Deformable Terrain

Figure 3 for Reinforcement Learning Compensated Model Predictive Control for Off-road Driving on Unknown Deformable Terrain

Figure 4 for Reinforcement Learning Compensated Model Predictive Control for Off-road Driving on Unknown Deformable Terrain

Abstract:This study presents an Actor-Critic reinforcement learning Compensated Model Predictive Controller (AC2MPC) designed for high-speed, off-road autonomous driving on deformable terrains. Addressing the difficulty of modeling unknown tire-terrain interaction and ensuring real-time control feasibility and performance, this framework integrates deep reinforcement learning with a model predictive controller to manage unmodeled nonlinear dynamics. We evaluate the controller framework over constant and varying velocity profiles using high-fidelity simulator Project Chrono. Our findings demonstrate that our controller statistically outperforms standalone model-based and learning-based controllers over three unknown terrains that represent sandy deformable track, sandy and rocky track and cohesive clay-like deformable soil track. Despite varied and previously unseen terrain characteristics, this framework generalized well enough to track longitudinal reference speeds with the least error. Furthermore, this framework required significantly less training data compared to purely learning based controller, converging in fewer steps while delivering better performance. Even when under-trained, this controller outperformed the standalone controllers, highlighting its potential for safer and more efficient real-world deployment.

* Submitted to IEEE Transactions on Intelligent Vehicles as a Regular Paper

Via

Access Paper or Ask Questions

Leveraging Machine-Generated Rationales to Facilitate Social Meaning Detection in Conversations

Jun 27, 2024

Ritam Dutt, Zhen Wu, Kelly Shi, Divyanshu Sheth, Prakhar Gupta, Carolyn Penstein Rose

Figure 1 for Leveraging Machine-Generated Rationales to Facilitate Social Meaning Detection in Conversations

Figure 2 for Leveraging Machine-Generated Rationales to Facilitate Social Meaning Detection in Conversations

Figure 3 for Leveraging Machine-Generated Rationales to Facilitate Social Meaning Detection in Conversations

Figure 4 for Leveraging Machine-Generated Rationales to Facilitate Social Meaning Detection in Conversations

Abstract:We present a generalizable classification approach that leverages Large Language Models (LLMs) to facilitate the detection of implicitly encoded social meaning in conversations. We design a multi-faceted prompt to extract a textual explanation of the reasoning that connects visible cues to underlying social meanings. These extracted explanations or rationales serve as augmentations to the conversational text to facilitate dialogue understanding and transfer. Our empirical results over 2,340 experimental settings demonstrate the significant positive impact of adding these rationales. Our findings hold true for in-domain classification, zero-shot, and few-shot domain transfer for two different social meaning detection tasks, each spanning two different corpora.

* To appear at The Proceedings of the Association for Computational Linguistics, 2024

Via

Access Paper or Ask Questions

GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence

Feb 19, 2024

Kundan Krishna, Sanjana Ramprasad, Prakhar Gupta, Byron C. Wallace, Zachary C. Lipton, Jeffrey P. Bigham

Figure 1 for GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence

Figure 2 for GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence

Figure 3 for GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence

Figure 4 for GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence

Abstract:LLMs can generate factually incorrect statements even when provided access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit -- a tool intended to assist fact-checking LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that are not supported by the reference document, and also presents evidence from the reference for facts that do appear to have support. We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users. Comprehensive evaluation by human raters shows that GenAudit can detect errors in 8 different LLM outputs when summarizing documents from diverse domains. To ensure that most errors are flagged by the system, we propose a method that can increase the error recall while minimizing impact on precision. We will release our tool (GenAudit) and fact-checking model for public use.

Via

Access Paper or Ask Questions

USB: A Unified Summarization Benchmark Across Tasks and Domains

May 23, 2023

Kundan Krishna, Prakhar Gupta, Sanjana Ramprasad, Byron C. Wallace, Jeffrey P. Bigham, Zachary C. Lipton

Figure 1 for USB: A Unified Summarization Benchmark Across Tasks and Domains

Figure 2 for USB: A Unified Summarization Benchmark Across Tasks and Domains

Figure 3 for USB: A Unified Summarization Benchmark Across Tasks and Domains

Figure 4 for USB: A Unified Summarization Benchmark Across Tasks and Domains

Abstract:An abundance of datasets exist for training and evaluating models on the task of summary generation.However, these datasets are often derived heuristically, and lack sufficient annotations to support research into all aspects of summarization, such as evidence extraction and controllable summarization. We introduce a benchmark comprising 8 tasks that require multi-dimensional understanding of summarization, e.g., surfacing evidence for a summary, assessing its correctness, and gauging its relevance to different topics. We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models. For factuality related tasks, we also evaluate existing heuristics to create training data and find that training on them performs worse than training on $20\times$ less human-labeled data. Our benchmark consists of data from 6 different domains, allowing us to study cross-domain performance of trained models. We find that for some tasks, the amount of training data matters more than the domain where it comes from, while for other tasks training specifically on data from the target domain, even if limited, is more beneficial. Our work fulfills the need for a well-annotated summarization benchmark with diverse tasks, and provides useful insights about the impact of the quality, size and domain of training data.

Via

Access Paper or Ask Questions