Naoki Sawada

PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation

Sep 04, 2025

Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda

Abstract: Automatic speech recognition (ASR) systems struggle with domain-specific named entities, especially homophones. Contextual ASR improves recognition but often fails to capture fine-grained phoneme variations due to limited entity diversity. Moreover, prior methods treat entities as independent tokens, leading to incomplete multi-token biasing. To address these issues, we propose Phoneme-Augmented Robust Contextual ASR via COntrastive entity disambiguation (PARCO), which integrates phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering. These components enhance phonetic discrimination, ensure complete entity retrieval, and reduce false positives under uncertainty. Experiments show that PARCO achieves a CER of 4.22% on Chinese AISHELL-1 and a WER of 11.14% on English DATA2 under 1,000 distractors, significantly outperforming baselines. PARCO also demonstrates robust gains on out-of-domain datasets such as THCHS-30 and LibriSpeech.

* Accepted by ASRU 2025
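
The abstract reports results as character error rate (CER) and word error rate (WER). As a rough illustration of how such figures are usually computed, the sketch below uses the standard definition: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. The function names and example strings are hypothetical, and the paper's actual scoring tool and text normalization are not stated in the abstract.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by the number of reference tokens."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens (assumed tokenization)."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars: List[str] = list(reference.replace(" ", ""))
    hyp_chars: List[str] = list(hypothesis.replace(" ", ""))
    return error_rate(ref_chars, hyp_chars)
```

Under this definition, a single substituted word in a three-word reference gives a WER of one third. Real evaluations typically also normalize case and punctuation before scoring, which this sketch leaves out.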

Audio Classification of Bit-Representation Waveform

Apr 08, 2019

Masaki Okawa, Takuya Saito, Naoki Sawada, Hiromitsu Nishizaki

Abstract: This paper investigates waveform representations for audio signal classification. Studies on audio waveform classification, such as acoustic event detection and music genre classification, have recently been increasing, and most of them use a deep learning (neural network) framework. Generally, a frequency analysis method such as the Fourier transform is applied to extract frequency or spectral information from the input audio waveform before it is fed to a neural network. In contrast to these previous studies, we propose a novel waveform representation method for audio classification in which audio waveforms are represented as bit sequences. In our experiments, we compare the proposed bit-representation waveform, which is given directly to a neural network, with other representations of audio waveforms, such as the raw audio waveform and the power spectrum, on two classification tasks: an acoustic event classification task and a sound/music classification task. The experimental results show that the bit-representation waveform achieved the best classification performance on both tasks.

* Submitted to INTERSPEECH 2019
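
The abstract describes representing audio waveforms as bit sequences but does not give the encoding details. A minimal sketch of one plausible reading follows, assuming signed 16-bit PCM samples, two's-complement encoding, and most-significant-bit-first ordering. These choices and the function names are assumptions for illustration, not the authors' confirmed method.

```python
from typing import Iterable, List


def sample_to_bits(sample: int, bit_depth: int = 16) -> List[int]:
    """Expand one signed PCM sample into a list of 0/1 values, MSB first."""
    low = -(1 << (bit_depth - 1))
    high = (1 << (bit_depth - 1)) - 1
    if not low <= sample <= high:
        raise ValueError(f"sample {sample} does not fit in {bit_depth} bits")
    # Masking maps a negative value to its two's-complement bit pattern.
    unsigned = sample & ((1 << bit_depth) - 1)
    return [(unsigned >> shift) & 1 for shift in range(bit_depth - 1, -1, -1)]


def waveform_to_bits(samples: Iterable[int], bit_depth: int = 16) -> List[List[int]]:
    """Turn a waveform of N samples into an N x bit_depth matrix of bits."""
    return [sample_to_bits(s, bit_depth) for s in samples]
```

With this encoding, each time step becomes a `bit_depth`-dimensional binary vector rather than a single amplitude value, so a network receives every bit as a separate input feature.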
