Abstract:Speaker diarization systems often struggle with high intrinsic intra-speaker variability, such as shifts in emotion, health, or content. This can cause segments from the same speaker to be misclassified as different individuals, for example, when a speaker raises their voice or speaks faster during a conversation. To address this, we propose a style-controllable speech generation model that augments speech across diverse styles while preserving the target speaker's identity. The proposed system starts with diarized segments from a conventional diarizer. For each diarized segment, it generates augmented speech samples enriched with phonetic and stylistic diversity. Speaker embeddings from the original and generated audio are then blended to enhance the system's robustness in grouping segments with high intrinsic intra-speaker variability. We validate our approach on a simulated emotional speech dataset and the truncated AMI dataset, demonstrating significant improvements, with error rate reductions of 49% and 35%, respectively.
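A minimal sketch of the embedding-blending step described above, assuming a pretrained speaker-embedding extractor `embed(wav)`, a style-controllable generator `augment(wav, style)`, a list of style labels, and a simple weighted average with weight `alpha`; these names and the blending rule are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: blend embeddings of a diarized segment with embeddings
# of its style-augmented copies before re-clustering. The extractor, generator,
# styles, and alpha are all assumed interfaces, not the authors' code.
import numpy as np

def blended_embedding(wav, embed, augment, styles, alpha=0.5):
    """Return a speaker embedding made robust to intra-speaker style variation."""
    e_orig = embed(wav)                                                 # embedding of the original segment
    e_aug = np.mean([embed(augment(wav, s)) for s in styles], axis=0)   # mean over style-augmented copies
    e = alpha * e_orig + (1.0 - alpha) * e_aug                          # blend original and augmented embeddings
    return e / np.linalg.norm(e)                                        # length-normalize for cosine clustering
```

In such a setup, the blended embeddings would replace the per-segment embeddings fed to the clustering stage of the conventional diarizer.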
Abstract:In this paper, we propose a novel end-to-end user-defined keyword spotting method that utilizes linguistically corresponding patterns between speech and text sequences. Unlike previous approaches that require speech keyword enrollment, our method compares input queries with an enrolled text keyword sequence. To place the audio and text representations within a common latent space, we adopt an attention-based cross-modal matching approach that is trained end-to-end with a monotonic matching loss and a keyword classification loss. We also utilize a de-noising loss for the acoustic embedding network to improve robustness in noisy environments. Additionally, we introduce LibriPhrase, a new short-phrase dataset based on LibriSpeech, for efficiently training keyword spotting models. Our proposed method achieves competitive results on various evaluation sets compared to other single-modal and cross-modal baselines.
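An illustrative sketch of attention-based cross-modal matching with a matching loss and a keyword classification loss. The encoder outputs, tensor shapes, soft-diagonal alignment prior (a simplified stand-in for the monotonic matching loss), and classifier head are all assumptions for illustration, not the paper's definitions.

```python
# Hypothetical sketch: audio-to-text attention over an enrolled text keyword,
# plus a simplified alignment loss and a binary keyword classification loss.
import torch
import torch.nn.functional as F

def cross_modal_losses(audio, text, labels, classifier):
    """audio: (B, Ta, D) acoustic embeddings; text: (B, Tt, D) text-keyword embeddings;
    labels: (B,) 1 if the query contains the enrolled keyword, else 0."""
    d = audio.size(-1)
    attn = torch.softmax(audio @ text.transpose(1, 2) / d ** 0.5, dim=-1)   # (B, Ta, Tt)
    # Simplified stand-in for the monotonic matching loss: pull each audio frame's
    # attention toward a soft diagonal alignment between frames and text tokens.
    Ta, Tt = attn.size(1), attn.size(2)
    ta = torch.linspace(0, 1, Ta, device=attn.device).unsqueeze(1)          # (Ta, 1)
    tt = torch.linspace(0, 1, Tt, device=attn.device).unsqueeze(0)          # (1, Tt)
    band = torch.exp(-((ta - tt) ** 2) / 0.05)                              # soft diagonal band
    target = band / band.sum(dim=1, keepdim=True)                           # per-frame target distribution
    match_loss = F.kl_div(attn.clamp_min(1e-8).log(),
                          target.expand_as(attn), reduction="batchmean")
    # Keyword classification on the text-aligned audio context.
    ctx = attn.transpose(1, 2) @ audio                                      # (B, Tt, D)
    logits = classifier(ctx.mean(dim=1)).squeeze(-1)                        # (B,)
    cls_loss = F.binary_cross_entropy_with_logits(logits, labels.float())
    return match_loss + cls_loss
```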
Abstract:Modern neural speech enhancement models usually include various forms of phase information in their training loss terms, either explicitly or implicitly. However, these loss terms are typically designed to reduce the distortion of phase spectrum values only at specific frequencies, which means they do not significantly improve the quality of the enhanced speech. In this paper, we propose an effective phase reconstruction strategy for neural speech enhancement that operates robustly in noisy environments. Specifically, we introduce a phase continuity loss that considers relative phase variations across the time and frequency axes. By including this phase continuity loss in a state-of-the-art neural speech enhancement system trained with a reconstruction loss and a number of magnitude spectral losses, we show that our proposed method further improves the quality of enhanced speech signals over the baseline, especially when training is done jointly with a magnitude spectrum loss.
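One plausible form of such a phase continuity loss, sketched under the assumption that it penalizes mismatches between enhanced and clean speech in wrapped phase differences along the time and frequency axes of the STFT; the paper's exact definition may differ.

```python
# Hypothetical sketch: compare relative phase variations of enhanced and clean
# phase spectrograms along time and frequency, with differences wrapped to (-pi, pi].
import torch

def wrap(x):
    """Wrap phase values or differences to the principal interval (-pi, pi]."""
    return torch.atan2(torch.sin(x), torch.cos(x))

def phase_continuity_loss(phase_enh, phase_clean):
    """phase_enh, phase_clean: (B, F, T) phase spectrograms from an STFT."""
    # Relative phase variation along the time axis (adjacent frames, same frequency bin).
    dt_enh = wrap(phase_enh[..., 1:] - phase_enh[..., :-1])
    dt_cln = wrap(phase_clean[..., 1:] - phase_clean[..., :-1])
    # Relative phase variation along the frequency axis (adjacent bins, same frame).
    df_enh = wrap(phase_enh[:, 1:, :] - phase_enh[:, :-1, :])
    df_cln = wrap(phase_clean[:, 1:, :] - phase_clean[:, :-1, :])
    return wrap(dt_enh - dt_cln).abs().mean() + wrap(df_enh - df_cln).abs().mean()
```

A term like this would be added, with a weighting factor, to the system's existing reconstruction and magnitude spectral losses during training.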