Will Schwarzer

Training ML Models with Predictable Failures

May 14, 2026

Will Schwarzer, Scott Niekum

Abstract: Estimating how often an ML model will fail at deployment scale is central to pre-deployment safety assessment, but a feasible evaluation set is rarely large enough to observe the failures that matter. Jones et al. (2025) address this by extrapolating from the largest k failure scores in an evaluation set to predict deployment-scale failure rates. We give a finite-k decomposition of this estimator's forecast error and show that it has a built-in bias toward over-prediction in the typical case, which is the safety-favorable direction. This bias is offset when the evaluation set misses a rare high-failure mode that the deployment set contains, leaving the forecast to under-predict at deployment scale. We propose a fine-tuning objective, the forecastability loss, that addresses this failure mode. In two proof-of-concept experiments, a language-model password game and an RL gridworld, fine-tuning substantially reduces held-out forecast error while preserving primary-task capability and achieving safety similar to that of supervised baselines.

* 32 pages, 9 figures

Are Deep Speech Denoising Models Robust to Adversarial Noise?

Mar 14, 2025

Will Schwarzer, Philip S. Thomas, Andrea Fanelli, Xiaoyu Liu

Abstract: Deep noise suppression (DNS) models enjoy widespread use throughout a variety of high-stakes speech applications. However, in this paper, we show that four recent DNS models can each be reduced to outputting unintelligible gibberish through the addition of imperceptible adversarial noise. Furthermore, our results show the near-term plausibility of targeted attacks, which could induce models to output arbitrary utterances, and over-the-air attacks. While the success of these attacks varies by model and setting, and attacks appear to be strongest when model-specific (i.e., white-box and non-transferable), our results highlight a pressing need for practical countermeasures in DNS systems.

* 13 pages, 5 figures

Supervised Reward Inference

Feb 25, 2025

Will Schwarzer, Jordan Schneider, Philip S. Thomas, Scott Niekum

Abstract: Existing approaches to reward inference from behavior typically assume that humans provide demonstrations according to specific models of behavior. However, humans often indicate their goals through a wide range of behaviors, from actions that are suboptimal due to poor planning or execution to behaviors that are intended to communicate goals rather than achieve them. We propose that supervised learning offers a unified framework to infer reward functions from any class of behavior, and show that such an approach is asymptotically Bayes-optimal under mild assumptions. Experiments on simulated robotic manipulation tasks show that our method can efficiently infer rewards from a wide variety of arbitrarily suboptimal demonstrations.

* 16 pages, 4 figures
