Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jan Cernocky

Endpoint Anticipation for Low-Latency Spoken Dialogue

Jun 11, 2026

Sathvik Udupa, Shinji Watanabe, Petr Schwarz, Jan Cernocky

Abstract:While low-latency interaction is critical for spoken dialogue, cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, shifting from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based model anticipates endpoints upto 2.56 seconds in advance, enabling speculative execution of LLM and TTS pipelines on partial context. We introduce metrics to quantify the trade-off between realized latency reduction and computational redundancy. Evaluation across conversational and task-oriented datasets shows our model consistently outperforms competitive VAP-based baselines. Integration with the Unmute framework demonstrates a 505 ms average latency reduction with a 28.4% increase in speculative computation, effectively masking sequential bottlenecks to enable complex reasoning in real-time speech-to-speech interaction.

* Accepted at Interspeech 2026

Via

Access Paper or Ask Questions

BUT Systems for Environmental Sound Deepfake Detection in the ESDD 2026 Challenge

Dec 09, 2025

Junyi Peng, Lin Zhang, Jin Li, Oldrich Plchot, Jan Cernocky

Abstract:This paper describes the BUT submission to the ESDD 2026 Challenge, specifically focusing on Track 1: Environmental Sound Deepfake Detection with Unseen Generators. To address the critical challenge of generalizing to audio generated by unseen synthesis algorithms, we propose a robust ensemble framework leveraging diverse Self-Supervised Learning (SSL) models. We conduct a comprehensive analysis of general audio SSL models (including BEATs, EAT, and Dasheng) and speech-specific SSLs. These front-ends are coupled with a lightweight Multi-Head Factorized Attention (MHFA) back-end to capture discriminative representations. Furthermore, we introduce a feature domain augmentation strategy based on distribution uncertainty modeling to enhance model robustness against unseen spectral distortions. All models are trained exclusively on the official EnvSDD data, without using any external resources. Experimental results demonstrate the effectiveness of our approach: our best single system achieved Equal Error Rates (EER) of 0.00\%, 4.60\%, and 4.80\% on the Development, Progress (Track 1), and Final Evaluation sets, respectively. The fusion system further improved generalization, yielding EERs of 0.00\%, 3.52\%, and 4.38\% across the same partitions.

Via

Access Paper or Ask Questions

Streaming Endpointer for Spoken Dialogue using Neural Audio Codecs and Label-Delayed Training

Jun 08, 2025

Sathvik Udupa, Shinji Watanabe, Petr Schwarz, Jan Cernocky

Abstract:Accurate, low-latency endpointing is crucial for effective spoken dialogue systems. While traditional endpointers often rely on spectrum-based audio features, this work proposes real-time speech endpointing for multi-turn dialogues using streaming, low-bitrate Neural Audio Codec (NAC) features, building upon recent advancements in neural audio codecs. To further reduce cutoff errors, we introduce a novel label delay training scheme. At a fixed median latency of 160 ms, our combined NAC and label delay approach achieves significant relative cutoff error reductions: 42.7% for a single-stream endpointer and 37.5% for a two-stream configuration, compared to baseline methods. Finally, we demonstrate efficient integration with a codec-based pretrained speech large language model, improving its median response time by 1200 ms and reducing its cutoff error by 35%.

Via

Access Paper or Ask Questions

Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization

May 30, 2025

Jiangyu Han, Federico Landini, Johan Rohdin, Anna Silnova, Mireia Diez, Jan Cernocky, Lukas Burget

Figure 1 for Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization

Figure 2 for Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization

Figure 3 for Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization

Figure 4 for Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization

Abstract:Self-supervised learning (SSL) models like WavLM can be effectively utilized when building speaker diarization systems but are often large and slow, limiting their use in resource constrained scenarios. Previous studies have explored compression techniques, but usually for the price of degraded performance at high pruning ratios. In this work, we propose to compress SSL models through structured pruning by introducing knowledge distillation. Different from the existing works, we emphasize the importance of fine-tuning SSL models before pruning. Experiments on far-field single-channel AMI, AISHELL-4, and AliMeeting datasets show that our method can remove redundant parameters of WavLM Base+ and WavLM Large by up to 80% without any performance degradation. After pruning, the inference speeds on a single GPU for the Base+ and Large models are 4.0 and 2.6 times faster, respectively. Our source code is publicly available.

* Accepted by INTERSPEECH 2025

Via

Access Paper or Ask Questions

Probing Self-supervised Learning Models with Target Speech Extraction

Feb 17, 2024

Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Takanori Ashihara, Shoko Araki, Jan Cernocky

Abstract:Large-scale pre-trained self-supervised learning (SSL) models have shown remarkable advancements in speech-related tasks. However, the utilization of these models in complex multi-talker scenarios, such as extracting a target speaker in a mixture, is yet to be fully evaluated. In this paper, we introduce target speech extraction (TSE) as a novel downstream task to evaluate the feature extraction capabilities of pre-trained SSL models. TSE uniquely requires both speaker identification and speech separation, distinguishing it from other tasks in the Speech processing Universal PERformance Benchmark (SUPERB) evaluation. Specifically, we propose a TSE downstream model composed of two lightweight task-oriented modules based on the same frozen SSL model. One module functions as a speaker encoder to obtain target speaker information from an enrollment speech, while the other estimates the target speaker's mask to extract its speech from the mixture. Experimental results on the Libri2mix datasets reveal the relevance of the TSE downstream task to probe SSL models, as its performance cannot be simply deduced from other related tasks such as speaker verification and separation.

* Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop

Via

Access Paper or Ask Questions

Target Speech Extraction with Pre-trained Self-supervised Learning Models

Feb 17, 2024

Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Cernocky

Figure 1 for Target Speech Extraction with Pre-trained Self-supervised Learning Models

Figure 2 for Target Speech Extraction with Pre-trained Self-supervised Learning Models

Figure 3 for Target Speech Extraction with Pre-trained Self-supervised Learning Models

Figure 4 for Target Speech Extraction with Pre-trained Self-supervised Learning Models

Abstract:Pre-trained self-supervised learning (SSL) models have achieved remarkable success in various speech tasks. However, their potential in target speech extraction (TSE) has not been fully exploited. TSE aims to extract the speech of a target speaker in a mixture guided by enrollment utterances. We exploit pre-trained SSL models for two purposes within a TSE framework, i.e., to process the input mixture and to derive speaker embeddings from the enrollment. In this paper, we focus on how to effectively use SSL models for TSE. We first introduce a novel TSE downstream task following the SUPERB principles. This simple experiment shows the potential of SSL models for TSE, but extraction performance remains far behind the state-of-the-art. We then extend a powerful TSE architecture by incorporating two SSL-based modules: an Adaptive Input Enhancer (AIE) and a speaker encoder. Specifically, the proposed AIE utilizes intermediate representations from the CNN encoder by adjusting the time resolution of CNN encoder and transformer blocks through progressive upsampling, capturing both fine-grained and hierarchical features. Our method outperforms current TSE systems achieving a SI-SDR improvement of 14.0 dB on LibriMix. Moreover, we can further improve performance by 0.7 dB by fine-tuning the whole model including the SSL model parameters.

* Accepted to ICASSP 2024

Via

Access Paper or Ask Questions

DiaCorrect: Error Correction Back-end For Speaker Diarization

Sep 15, 2023

Jiangyu Han, Federico Landini, Johan Rohdin, Mireia Diez, Lukas Burget, Yuhang Cao, Heng Lu, Jan Cernocky

Figure 1 for DiaCorrect: Error Correction Back-end For Speaker Diarization

Figure 2 for DiaCorrect: Error Correction Back-end For Speaker Diarization

Figure 3 for DiaCorrect: Error Correction Back-end For Speaker Diarization

Figure 4 for DiaCorrect: Error Correction Back-end For Speaker Diarization

Abstract:In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transform-based decoder. By exploiting the interactions between the input recording and the initial system's outputs, DiaCorrect can automatically correct the initial speaker activities to minimize the diarization errors. Experiments on 2-speaker telephony data show that the proposed DiaCorrect can effectively improve the initial model's results. Our source code is publicly available at https://github.com/BUTSpeechFIT/diacorrect.

* Submitted to ICASSP 2024

Via

Access Paper or Ask Questions

End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations

Aug 15, 2023

Bolaji Yusuf, Jan Cernocky, Murat Saraclar

Figure 1 for End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations

Figure 2 for End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations

Figure 3 for End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations

Figure 4 for End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations

Abstract:Conventional keyword search systems operate on automatic speech recognition (ASR) outputs, which causes them to have a complex indexing and search pipeline. This has led to interest in ASR-free approaches to simplify the search procedure. We recently proposed a neural ASR-free keyword search model which achieves competitive performance while maintaining an efficient and simplified pipeline, where queries and documents are encoded with a pair of recurrent neural network encoders and the encodings are combined with a dot-product. In this article, we extend this work with multilingual pretraining and detailed analysis of the model. Our experiments show that the proposed multilingual training significantly improves the model performance and that despite not matching a strong ASR-based conventional keyword search system for short queries and queries comprising in-vocabulary words, the proposed model outperforms the ASR-based system for long queries and queries that do not appear in the training data.

* in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 3070-3080, 2023
* Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 2023

Via

Access Paper or Ask Questions

Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations

Oct 15, 2022

Themos Stafylakis, Ladislav Mosner, Sofoklis Kakouros, Oldrich Plchot, Lukas Burget, Jan Cernocky

Figure 1 for Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations

Figure 2 for Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations

Figure 3 for Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations

Figure 4 for Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations

Abstract:Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. Aggregating these speech representations across time is typically approached by using descriptive statistics, and in particular, using the first- and second-order statistics of representation coefficients. In this paper, we examine an alternative way of extracting speaker and emotion information from self-supervised trained models, based on the correlations between the coefficients of the representations - correlation pooling. We show improvements over mean pooling and further gains when the pooling methods are combined via fusion. The code is available at github.com/Lamomal/s3prl_correlation.

* Accepted at IEEE-SLT 2022

Via

Access Paper or Ask Questions

An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Oct 03, 2022

Junyi Peng, Oldrich Plchot, Themos Stafylakis, Ladislav Mosner, Lukas Burget, Jan Cernocky

Figure 1 for An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Figure 2 for An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Figure 3 for An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Figure 4 for An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Abstract:In recent years, self-supervised learning paradigm has received extensive attention due to its great success in various down-stream tasks. However, the fine-tuning strategies for adapting those pre-trained models to speaker verification task have yet to be fully explored. In this paper, we analyze several feature extraction approaches built on top of a pre-trained model, as well as regularization and learning rate schedule to stabilize the fine-tuning process and further boost performance: multi-head factorized attentive pooling is proposed to factorize the comparison of speaker representations into multiple phonetic clusters. We regularize towards the parameters of the pre-trained model and we set different learning rates for each layer of the pre-trained model during fine-tuning. The experimental results show our method can significantly shorten the training time to 4 hours and achieve SOTA performance: 0.59%, 0.79% and 1.77% EER on Vox1-O, Vox1-E and Vox1-H, respectively.

* Accepted by SLT2022

Via

Access Paper or Ask Questions