Elio Gruttadauria

IP Paris, LTCI, IDS, S2A

O-EENC-SD: Efficient Online End-to-End Neural Clustering for Speaker Diarization

Dec 17, 2025

Elio Gruttadauria, Mathieu Fontaine, Jonathan Le Roux, Slim Essid

Abstract: We introduce O-EENC-SD, an end-to-end online speaker diarization system based on EEND-EDA that features a novel RNN-based stitching mechanism for online prediction. In particular, we develop a novel centroid refinement decoder whose usefulness is assessed through a rigorous ablation study. Our system offers two key advantages over existing methods: it is hyperparameter-free, unlike unsupervised clustering approaches, and it is more efficient than current online end-to-end methods, which are computationally costly. We demonstrate that O-EENC-SD is competitive with the state of the art in the two-speaker conversational telephone speech domain, as tested on the CallHome dataset. Our results show that O-EENC-SD provides a favorable trade-off between DER and complexity, even when working on independent chunks with no overlap, which makes the system extremely efficient.

* IEEE International Conference on Acoustics, Speech, and Signal Processing, Apr 2025, Hyderabad, India
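
The abstract above reports results in terms of DER (diarization error rate). As a reminder of what that metric measures, here is a minimal frame-level sketch, not the scoring tool used in the paper: it assumes each frame is given as a set of active speaker labels, searches all speaker mappings by brute force (practical only for a handful of speakers), and applies no collar. The function name `frame_der` and the input layout are illustrative assumptions.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Frame-level DER: (missed + false alarm + confusion) / reference speech.

    reference, hypothesis: lists of sets of speaker labels, one set per frame.
    Hypothesis labels are arbitrary, so the best one-to-one mapping onto
    reference labels is found by exhaustive search.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")
    total_ref = sum(len(r) for r in reference)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    ref_spk = sorted(set().union(*reference))
    hyp_spk = sorted(set().union(*hypothesis))
    # Pad with None so that surplus hypothesis speakers can stay unmatched.
    slots = ref_spk + [None] * max(0, len(hyp_spk) - len(ref_spk))

    best_errors = None
    for perm in permutations(slots, len(hyp_spk)):
        mapping = dict(zip(hyp_spk, perm))
        errors = 0
        for ref, hyp in zip(reference, hypothesis):
            mapped = {mapping[h] for h in hyp}
            correct = len(ref & mapped)
            # miss + false alarm + confusion collapses to max(N_ref, N_hyp) - N_correct
            errors += max(len(ref), len(hyp)) - correct
        if best_errors is None or errors < best_errors:
            best_errors = errors
    return best_errors / total_ref


reference = [{"A"}, {"A"}, {"B"}, {"A", "B"}, set()]
hypothesis = [{"x"}, {"y"}, {"y"}, {"x"}, {"x"}]
print(frame_der(reference, hypothesis))
```

In this toy example the best mapping is x to A and y to B, which leaves one confusion, one missed speaker in the overlapped frame, and one false alarm, over five reference speaker-frames.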

Online speaker diarization of meetings guided by speech separation

Jan 30, 2024

Elio Gruttadauria, Mathieu Fontaine, Slim Essid

Abstract: Overlapped speech is notoriously problematic for speaker diarization systems. Consequently, the use of speech separation has recently been proposed to improve their performance. Although promising, speech separation models struggle with realistic data because they are trained on simulated mixtures with a fixed number of speakers. In this work, we introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings with a variable number of speakers, as present in the AMI corpus. We consider ConvTasNet and DPRNN as alternative separation networks, with two or three output sources. To obtain the speaker diarization result, voice activity detection is applied to each estimated source. The final model is fine-tuned end-to-end, after first adapting the separation to real data using AMI. The system operates on short segments, and inference is performed by stitching the local predictions using speaker embeddings and incremental clustering. The results show that our system improves on the state of the art on the AMI headset mix, using no oracle information and under full evaluation (no collar and including overlapped speech). Finally, we show that our system is particularly strong on overlapped speech sections.

* IEEE International Conference on Acoustics, Speech, and Signal Processing, Apr 2024, Seoul, South Korea
* Accepted at ICASSP 2024
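
The abstract above says that local predictions on short segments are stitched using speaker embeddings and incremental clustering. The sketch below illustrates that general idea only; it is not the authors' implementation. It assumes one embedding per locally detected speaker, cosine similarity against running centroids, a greedy one-to-one assignment within each segment, and a hypothetical similarity threshold of 0.5. The class name `IncrementalStitcher` is likewise an illustrative choice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


class IncrementalStitcher:
    """Assigns local speakers of each segment to global speaker indices."""

    def __init__(self, threshold=0.5):  # threshold value is an assumption
        self.threshold = threshold
        self.sums = []    # running sum of embeddings per global speaker
        self.counts = []  # number of embeddings accumulated per global speaker

    def _centroid(self, g):
        return [s / self.counts[g] for s in self.sums[g]]

    def assign(self, segment_embeddings):
        """Return one global speaker index per local embedding of a segment."""
        # Score every (local speaker, existing global speaker) pair.
        pairs = sorted(
            ((cosine(e, self._centroid(g)), i, g)
             for i, e in enumerate(segment_embeddings)
             for g in range(len(self.sums))),
            reverse=True,
        )
        labels = [None] * len(segment_embeddings)
        used = set()
        # Greedy matching: two local speakers of one segment never share a global speaker.
        for sim, i, g in pairs:
            if sim < self.threshold:
                break
            if labels[i] is None and g not in used:
                labels[i] = g
                used.add(g)
        # Unmatched local speakers open new clusters; all centroids are then updated.
        for i, e in enumerate(segment_embeddings):
            if labels[i] is None:
                self.sums.append([0.0] * len(e))
                self.counts.append(0)
                labels[i] = len(self.sums) - 1
            g = labels[i]
            self.sums[g] = [s + x for s, x in zip(self.sums[g], e)]
            self.counts[g] += 1
        return labels


stitcher = IncrementalStitcher(threshold=0.5)
first = stitcher.assign([[1.0, 0.0], [0.0, 1.0]])    # two new speakers
second = stitcher.assign([[0.1, 0.9], [0.9, 0.1]])   # same speakers, local order swapped
third = stitcher.assign([[-1.0, -1.0]])              # dissimilar to both, new speaker
print(first, second, third)
```

Because the global speaker set grows only when no existing centroid is similar enough, this kind of scheme can handle a variable number of speakers without knowing the count in advance, at the cost of being sensitive to the chosen threshold.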
