Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Aaron Broukhim

Same Words, Different Judgments: Modality Effects on Preference Alignment

Feb 26, 2026

Aaron Broukhim, Nadir Weibel, Eshin Jolly

Abstract:Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences, but its application to speech remains underexplored. We present a controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. Audio preferences prove as reliable as text, with inter-rater agreement reaching good levels (ICC(2,k) $\approx$ .80) at $\sim$9 raters -- the first ICC-based reliability characterization in the preference annotation literature for either modality. However, modality reshapes how people judge: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, with near-chance cross-modality agreement. Synthetic ratings further align with human judgments and predict inter-rater agreement, supporting their use both for triaging ambiguous pairs and as full replacements for human annotations.

* Submitted to Interspeech 2026 for review

Via

Access Paper or Ask Questions

Preference-Based Learning in Audio Applications: A Systematic Analysis

Nov 17, 2025

Aaron Broukhim, Yiran Shen, Prithviraj Ammanabrolu, Nadir Weibel

Figure 1 for Preference-Based Learning in Audio Applications: A Systematic Analysis

Figure 2 for Preference-Based Learning in Audio Applications: A Systematic Analysis

Figure 3 for Preference-Based Learning in Audio Applications: A Systematic Analysis

Figure 4 for Preference-Based Learning in Audio Applications: A Systematic Analysis

Abstract:Despite the parallel challenges that audio and text domains face in evaluating generative model outputs, preference learning remains remarkably underexplored in audio applications. Through a PRISMA-guided systematic review of approximately 500 papers, we find that only 30 (6%) apply preference learning to audio tasks. Our analysis reveals a field in transition: pre-2021 works focused on emotion recognition using traditional ranking methods (rankSVM), while post-2021 studies have pivoted toward generation tasks employing modern RLHF frameworks. We identify three critical patterns: (1) the emergence of multi-dimensional evaluation strategies combining synthetic, automated, and human preferences; (2) inconsistent alignment between traditional metrics (WER, PESQ) and human judgments across different contexts; and (3) convergence on multi-stage training pipelines that combine reward signals. Our findings suggest that while preference learning shows promise for audio, particularly in capturing subjective qualities like naturalness and musicality, the field requires standardized benchmarks, higher-quality datasets, and systematic investigation of how temporal factors unique to audio impact preference learning frameworks.

Via

Access Paper or Ask Questions

What Did My Car Say? Impact of Autonomous Vehicle Explanation Errors and Driving Context On Comfort, Reliance, Satisfaction, and Driving Confidence

Sep 10, 2024

Robert Kaufman, Aaron Broukhim, David Kirsh, Nadir Weibel

Figure 1 for What Did My Car Say? Impact of Autonomous Vehicle Explanation Errors and Driving Context On Comfort, Reliance, Satisfaction, and Driving Confidence

Figure 2 for What Did My Car Say? Impact of Autonomous Vehicle Explanation Errors and Driving Context On Comfort, Reliance, Satisfaction, and Driving Confidence

Figure 3 for What Did My Car Say? Impact of Autonomous Vehicle Explanation Errors and Driving Context On Comfort, Reliance, Satisfaction, and Driving Confidence

Figure 4 for What Did My Car Say? Impact of Autonomous Vehicle Explanation Errors and Driving Context On Comfort, Reliance, Satisfaction, and Driving Confidence

Abstract:Explanations for autonomous vehicle (AV) decisions may build trust, however, explanations can contain errors. In a simulated driving study (n = 232), we tested how AV explanation errors, driving context characteristics (perceived harm and driving difficulty), and personal traits (prior trust and expertise) affected a passenger's comfort in relying on an AV, preference for control, confidence in the AV's ability, and explanation satisfaction. Errors negatively affected all outcomes. Surprisingly, despite identical driving, explanation errors reduced ratings of the AV's driving ability. Severity and potential harm amplified the negative impact of errors. Contextual harm and driving difficulty directly impacted outcome ratings and influenced the relationship between errors and outcomes. Prior trust and expertise were positively associated with outcome ratings. Results emphasize the need for accurate, contextually adaptive, and personalized AV explanations to foster trust, reliance, satisfaction, and confidence. We conclude with design, research, and deployment recommendations for trustworthy AV explanation systems.

* 23 pages, 4 figures

Via

Access Paper or Ask Questions