Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hongyu Jin

Scaling Ambiguity: Augmenting Human Annotation in Speech Emotion Recognition with Audio-Language Models

Jan 21, 2026

Wenda Zhang, Hongyu Jin, Siyi Wang, Zhiqiang Wei, Ting Dang

Abstract:Speech Emotion Recognition models typically use single categorical labels, overlooking the inherent ambiguity of human emotions. Ambiguous Emotion Recognition addresses this by representing emotions as probability distributions, but progress is limited by unreliable ground-truth distributions inferred from sparse human annotations. This paper explores whether Large Audio-Language Models (ALMs) can mitigate the annotation bottleneck by generating high-quality synthetic annotations. We introduce a framework leveraging ALMs to create Synthetic Perceptual Proxies, augmenting human annotations to improve ground-truth distribution reliability. We validate these proxies through statistical analysis of their alignment with human distributions and evaluate their impact by fine-tuning ALMs with the augmented emotion distributions. Furthermore, to address class imbalance and enable unbiased evaluation, we propose DiME-Aug, a Distribution-aware Multimodal Emotion Augmentation strategy. Experiments on IEMOCAP and MSP-Podcast show that synthetic annotations enhance emotion distribution, especially in low-ambiguity regions where annotation agreement is high. However, benefits diminish for highly ambiguous emotions with greater human disagreement. This work provides the first evidence that ALMs could address annotation scarcity in ambiguous emotion recognition, but highlights the need for more advanced prompting or generation strategies to handle highly ambiguous cases.

* Accepted by ICASSP 2026

Via

Access Paper or Ask Questions

Deformable-Heatmap-Segmentation for Automobile Visual Perception

Jul 10, 2024

Hongyu Jin

Figure 1 for Deformable-Heatmap-Segmentation for Automobile Visual Perception

Figure 2 for Deformable-Heatmap-Segmentation for Automobile Visual Perception

Figure 3 for Deformable-Heatmap-Segmentation for Automobile Visual Perception

Figure 4 for Deformable-Heatmap-Segmentation for Automobile Visual Perception

Abstract:Semantic segmentation of road elements in 2D images is a crucial task in the recognition of some static objects such as lane lines and free space. In this paper, we propose DHSNet,which extracts the objects features with a end-to-end architecture along with a heatmap proposal. Deformable convolutions are also utilized in the proposed network. The DHSNet finely combines low-level feature maps with high-level ones by using upsampling operators as well as downsampling operators in a U-shape manner. Besides, DHSNet also aims to capture static objects of various shapes and scales. We also predict a proposal heatmap to detect the proposal points for more accurate target aiming in the network.

Via

Access Paper or Ask Questions