Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Brian King

Can Editing 1 Neuron Fix Repetition Loops in LLMs?

Jun 09, 2026

Aristotelis Lazaridis, Aman Sharma, Dylan Bates, Brian King, Vincent Lu, Jack FitzGerald

Abstract:Yes. Can it cure doom loops? Probably not. The Gemma 4 instruction-tuned models share a reproducible failure: on long factual enumeration prompts, such as listing every episode of a TV series, the 88 IAU constellations, or the 151 original Pokemon, they collapse into repetition, either a tight verbatim loop or a list whose entries decay onto a single answer. These loops occur at rates as high as 95% and survive prompt rewording, inference-engine changes, and most sampling adjustments. In this paper we explore whether this behavior is localized enough to remove by weight edits. To localize the cause, we use per-layer ablation and per-neuron attribution, then confirm the strongest candidates with full-generation sweeps. The loops trace to a small set of MLP neurons (or, in the 26B-A4B Mixture-of-Experts model, a few routed experts) which we suppress with static weight edits. These "surgeries" can be as small as a single sign-inverted neuron (in the E2B model). The size of the effective edits grows with model scale, but in all cases, the loop patterns can be addressed at normal generation budgets while preserving general-purpose benchmark scores. However, the edits do not solve everything: we also study longer thinking budgets, where the two larger models most visibly enter doom looping, i.e. a non-convergent regime in which the model self-corrects in circles over a fact it cannot recall, exhausting the budget without committing to a final answer. We show this residual failure is reduced but not eliminated by the same edits, and argue it is fundamentally a knowledge-precision problem rather than a removable circuit; weight surgery can delete a loop, but it cannot supply a missing fact. Our results are both a feasibility demonstration, that is, evidence that a concrete generation pathology can be localized to a few parameters and edited out, and a delineation of where that approach stops.

Via

Access Paper or Ask Questions

EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

May 22, 2026

Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Brian King, Vincent Lu, Jack FitzGerald

Abstract:On-Policy Distillation (OPD) has gained wide attraction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift, and consequently, regression in general tasks. On-Policy Self-Distillation (OPSD) is an efficient use-case of OPD, which is appealing as it requires only a single model as a student and teacher, and it also has the benefit of providing privileged context that is a absent at inference time (e.g. a persona, a private fact, or a worked solution) to the teacher during the training process. The challenge in this approach is that the privileged information can change model behavior more than intended: it can modify reasoning, degrade general capabilities, and affect performance indicators like response length, style, or local token preferences. Consequently, OPSD may train the student on side effects rather than a desired, transferable behavior. In this paper, we study this problem in a rare-token/identity setting and propose EviDence GuidEd On-Policy Distillation (EDGE-OPD), a modification of OPSD with two distinct characteristics: a) it uses guided rollouts to inject privileged-context behavior to the student at sampling time, so that the rare target behavior is actually present in the on-policy data, and b) it applies an evidence mask: the student is updated only at token positions where the privileged context supports the sampled token, rather than on every token in the rollout. We empirically show that OPSD (and its variant RLSD, with and without a verifier) completely fail to learn a target identity, while the integration of guided rollouts allows them to succeed. Additionally, mask-region ablations show that the persona signal is localized to the positive-evidence tail, allows us to draw valuable insights about efficient knowledge transfer and preservation of general purpose capabilities.

Via

Access Paper or Ask Questions

Lookahead When It Matters: Adaptive Non-causal Transformers for Streaming Neural Transducers

May 09, 2023

Grant P. Strimel, Yi Xie, Brian King, Martin Radfar, Ariya Rastrow, Athanasios Mouchtaris

Figure 1 for Lookahead When It Matters: Adaptive Non-causal Transformers for Streaming Neural Transducers

Figure 2 for Lookahead When It Matters: Adaptive Non-causal Transformers for Streaming Neural Transducers

Figure 3 for Lookahead When It Matters: Adaptive Non-causal Transformers for Streaming Neural Transducers

Figure 4 for Lookahead When It Matters: Adaptive Non-causal Transformers for Streaming Neural Transducers

Abstract:Streaming speech recognition architectures are employed for low-latency, real-time applications. Such architectures are often characterized by their causality. Causal architectures emit tokens at each frame, relying only on current and past signal, while non-causal models are exposed to a window of future frames at each step to increase predictive accuracy. This dichotomy amounts to a trade-off for real-time Automatic Speech Recognition (ASR) system design: profit from the low-latency benefit of strictly-causal architectures while accepting predictive performance limitations, or realize the modeling benefits of future-context models accompanied by their higher latency penalty. In this work, we relax the constraints of this choice and present the Adaptive Non-Causal Attention Transducer (ANCAT). Our architecture is non-causal in the traditional sense, but executes in a low-latency, streaming manner by dynamically choosing when to rely on future context and to what degree within the audio stream. The resulting mechanism, when coupled with our novel regularization algorithms, delivers comparable accuracy to non-causal configurations while improving significantly upon latency, closing the gap with their causal counterparts. We showcase our design experimentally by reporting comparative ASR task results with measures of accuracy and latency on both publicly accessible and production-scale, voice-assistant datasets.

* Accepted to ICML 2023

Via

Access Paper or Ask Questions

Leveraging Redundancy in Multiple Audio Signals for Far-Field Speech Recognition

Mar 01, 2023

Feng-Ju Chang, Anastasios Alexandridis, Rupak Vignesh Swaminathan, Martin Radfar, Harish Mallidi, Maurizio Omologo, Athanasios Mouchtaris, Brian King, Roland Maas

Figure 1 for Leveraging Redundancy in Multiple Audio Signals for Far-Field Speech Recognition

Figure 2 for Leveraging Redundancy in Multiple Audio Signals for Far-Field Speech Recognition

Figure 3 for Leveraging Redundancy in Multiple Audio Signals for Far-Field Speech Recognition

Figure 4 for Leveraging Redundancy in Multiple Audio Signals for Far-Field Speech Recognition

Abstract:To achieve robust far-field automatic speech recognition (ASR), existing techniques typically employ an acoustic front end (AFE) cascaded with a neural transducer (NT) ASR model. The AFE output, however, could be unreliable, as the beamforming output in AFE is steered to a wrong direction. A promising way to address this issue is to exploit the microphone signals before the beamforming stage and after the acoustic echo cancellation (post-AEC) in AFE. We argue that both, post-AEC and AFE outputs, are complementary and it is possible to leverage the redundancy between these signals to compensate for potential AFE processing errors. We present two fusion networks to explore this redundancy and aggregate these multi-channel (MC) signals: (1) Frequency-LSTM based, and (2) Convolutional Neural Network based fusion networks. We augment the MC fusion networks to a conformer transducer model and train it in an end-to-end fashion. Our experimental results on commercial virtual assistant tasks demonstrate that using the AFE output and two post-AEC signals with fusion networks offers up to 25.9% word error rate (WER) relative improvement over the model using the AFE output only, at the cost of <= 2% parameter increase.

Via

Access Paper or Ask Questions

Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities

Jul 22, 2022

Pranav Dheram, Murugesan Ramakrishnan, Anirudh Raju, I-Fan Chen, Brian King, Katherine Powell, Melissa Saboowala, Karan Shetty, Andreas Stolcke

Figure 1 for Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities

Figure 2 for Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities

Figure 3 for Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities

Figure 4 for Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities

Abstract:As for other forms of AI, speech recognition has recently been examined with respect to performance disparities across different user cohorts. One approach to achieve fairness in speech recognition is to (1) identify speaker cohorts that suffer from subpar performance and (2) apply fairness mitigation measures targeting the cohorts discovered. In this paper, we report on initial findings with both discovery and mitigation of performance disparities using data from a product-scale AI assistant speech recognition system. We compare cohort discovery based on geographic and demographic information to a more scalable method that groups speakers without human labels, using speaker embedding technology. For fairness mitigation, we find that oversampling of underrepresented cohorts, as well as modeling speaker cohort membership by additional input variables, reduces the gap between top- and bottom-performing cohorts, without deteriorating overall recognition accuracy.

* Proc. Interspeech 2022

Via

Access Paper or Ask Questions

Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight Consolidation

Jul 16, 2022

Viet Anh Trinh, Pegah Ghahremani, Brian King, Jasha Droppo, Andreas Stolcke, Roland Maas

Figure 1 for Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight Consolidation

Figure 2 for Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight Consolidation

Abstract:We present an approach to reduce the performance disparity between geographic regions without degrading performance on the overall user population for ASR. A popular approach is to fine-tune the model with data from regions where the ASR model has a higher word error rate (WER). However, when the ASR model is adapted to get better performance on these high-WER regions, its parameters wander from the previous optimal values, which can lead to worse performance in other regions. In our proposed method, we utilize the elastic weight consolidation (EWC) regularization loss to identify directions in parameters space along which the ASR weights can vary to improve for high-error regions, while still maintaining performance on the speaker population overall. Our results demonstrate that EWC can reduce the word error rate (WER) in the region with highest WER by 3.2% relative while reducing the overall WER by 1.3% relative. We also evaluate the role of language and acoustic models in ASR fairness and propose a clustering algorithm to identify WER disparities based on geographic region.

* Accepted for publication at Interspeech 2022

Via

Access Paper or Ask Questions

Compute Cost Amortized Transformer for Streaming ASR

Jul 05, 2022

Yi Xie, Jonathan Macoskey, Martin Radfar, Feng-Ju Chang, Brian King, Ariya Rastrow, Athanasios Mouchtaris, Grant P. Strimel

Figure 1 for Compute Cost Amortized Transformer for Streaming ASR

Figure 2 for Compute Cost Amortized Transformer for Streaming ASR

Figure 3 for Compute Cost Amortized Transformer for Streaming ASR

Figure 4 for Compute Cost Amortized Transformer for Streaming ASR

Abstract:We present a streaming, Transformer-based end-to-end automatic speech recognition (ASR) architecture which achieves efficient neural inference through compute cost amortization. Our architecture creates sparse computation pathways dynamically at inference time, resulting in selective use of compute resources throughout decoding, enabling significant reductions in compute with minimal impact on accuracy. The fully differentiable architecture is trained end-to-end with an accompanying lightweight arbitrator mechanism operating at the frame-level to make dynamic decisions on each input while a tunable loss function is used to regularize the overall level of compute against predictive performance. We report empirical results from experiments using the compute amortized Transformer-Transducer (T-T) model conducted on LibriSpeech data. Our best model can achieve a 60% compute cost reduction with only a 3% relative word error rate (WER) increase.

Via

Access Paper or Ask Questions

Investigation of Training Label Error Impact on RNN-T

Dec 01, 2021

I-Fan Chen, Brian King, Jasha Droppo

Figure 1 for Investigation of Training Label Error Impact on RNN-T

Figure 2 for Investigation of Training Label Error Impact on RNN-T

Figure 3 for Investigation of Training Label Error Impact on RNN-T

Figure 4 for Investigation of Training Label Error Impact on RNN-T

Abstract:In this paper, we propose an approach to quantitatively analyze impacts of different training label errors to RNN-T based ASR models. The result shows deletion errors are more harmful than substitution and insertion label errors in RNN-T training data. We also examined label error impact mitigation approaches on RNN-T and found that, though all the methods mitigate the label-error-caused degradation to some extent, they could not remove the performance gap between the models trained with and without the presence of label errors. Based on the analysis results, we suggest to design data pipelines for RNN-T with higher priority on reducing deletion label errors. We also find that ensuring high-quality training labels remains important, despite of the existence of the label error mitigation approaches.

* 6 pages

Via

Access Paper or Ask Questions

Warped Dynamic Linear Models for Time Series of Counts

Oct 27, 2021

Brian King, Daniel R. Kowal

Figure 1 for Warped Dynamic Linear Models for Time Series of Counts

Figure 2 for Warped Dynamic Linear Models for Time Series of Counts

Figure 3 for Warped Dynamic Linear Models for Time Series of Counts

Figure 4 for Warped Dynamic Linear Models for Time Series of Counts

Abstract:Dynamic Linear Models (DLMs) are commonly employed for time series analysis due to their versatile structure, simple recursive updating, and probabilistic forecasting. However, the options for count time series are limited: Gaussian DLMs require continuous data, while Poisson-based alternatives often lack sufficient modeling flexibility. We introduce a novel methodology for count time series by warping a Gaussian DLM. The warping function has two components: a transformation operator that provides distributional flexibility and a rounding operator that ensures the correct support for the discrete data-generating process. Importantly, we develop conjugate inference for the warped DLM, which enables analytic and recursive updates for the state space filtering and smoothing distributions. We leverage these results to produce customized and efficient computing strategies for inference and forecasting, including Monte Carlo simulation for offline analysis and an optimal particle filter for online inference. This framework unifies and extends a variety of discrete time series models and is valid for natural counts, rounded values, and multivariate observations. Simulation studies illustrate the excellent forecasting capabilities of the warped DLM. The proposed approach is applied to a multivariate time series of daily overdose counts and demonstrates both modeling and computational successes.

Via

Access Paper or Ask Questions

Do You Listen with One or Two Microphones? A Unified ASR Model for Single and Multi-Channel Audio

Jun 28, 2021

Gokce Keskin, Minhua Wu, Brian King, Harish Mallidi, Yang Gao, Jasha Droppo, Ariya Rastrow, Roland Maas

Figure 1 for Do You Listen with One or Two Microphones? A Unified ASR Model for Single and Multi-Channel Audio

Figure 2 for Do You Listen with One or Two Microphones? A Unified ASR Model for Single and Multi-Channel Audio

Figure 3 for Do You Listen with One or Two Microphones? A Unified ASR Model for Single and Multi-Channel Audio

Figure 4 for Do You Listen with One or Two Microphones? A Unified ASR Model for Single and Multi-Channel Audio

Abstract:Automatic speech recognition (ASR) models are typically designed to operate on a single input data type, e.g. a single or multi-channel audio streamed from a device. This design decision assumes the primary input data source does not change and if an additional (auxiliary) data source is occasionally available, it cannot be used. An ASR model that operates on both primary and auxiliary data can achieve better accuracy compared to a primary-only solution; and a model that can serve both primary-only (PO) and primary-plus-auxiliary (PPA) modes is highly desirable. In this work, we propose a unified ASR model that can serve both modes. We demonstrate its efficacy in a realistic scenario where a set of devices typically stream a single primary audio channel, and two additional auxiliary channels only when upload bandwidth allows it. The architecture enables a unique methodology that uses both types of input audio during training time. Our proposed approach achieves up to 12.5% relative word-error-rate reduction (WERR) compared to a PO baseline, and up to 16.0% relative WERR in low-SNR conditions. The unique training methodology achieves up to 2.5% relative WERR compared to a PPA baseline.

Via

Access Paper or Ask Questions