Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Reinhold Haeb-Umbach

On the Application of Diffusion Models for Simultaneous Denoising and Dereverberation

Aug 26, 2025

Adrian Meise, Tobias Cord-Landwehr, Reinhold Haeb-Umbach

Abstract:Diffusion models have been shown to achieve natural-sounding enhancement of speech degraded by noise or reverberation. However, their simultaneous denoising and dereverberation capability has so far not been studied much, although this is arguably the most common scenario in a practical application. In this work, we investigate different approaches to enhance noisy and/or reverberant speech. We examine the cascaded application of models, each trained on only one of the distortions, and compare it with a single model, trained either solely on data that is both noisy and reverberated, or trained on data comprising subsets of purely noisy, of purely reverberated, and of noisy reverberant speech. Tests are performed both on artificially generated and real recordings of noisy and/or reverberant data. The results show that, when using the cascade of models, satisfactory results are only achieved if they are applied in the order of the dominating distortion. If only a single model is desired that can operate on all distortion scenarios, the best compromise appears to be a model trained on the aforementioned three subsets of degraded speech data.

* Accepted at 16th ITG Conference on Speech Communication 2025

Via

Access Paper or Ask Questions

Towards Frame-level Quality Predictions of Synthetic Speech

Aug 14, 2025

Michael Kuhlmann, Fritz Seebauer, Petra Wagner, Reinhold Haeb-Umbach

Abstract:While automatic subjective speech quality assessment has witnessed much progress, an open question is whether an automatic quality assessment at frame resolution is possible. This would be highly desirable, as it adds explainability to the assessment of speech synthesis systems. Here, we take first steps towards this goal by identifying issues of existing quality predictors that prevent sensible frame-level prediction. Further, we define criteria that a frame-level predictor should fulfill. We also suggest a chunk-based processing that avoids the impact of a localized distortion on the score of neighboring frames. Finally, we measure in experiments with localized artificial distortions the localization performance of a set of frame-level quality predictors and show that they can outperform detection performance of human annotations obtained from a crowd-sourced perception experiment.

* Accepted at Interspeech 2025

Via

Access Paper or Ask Questions

30+ Years of Source Separation Research: Achievements and Future Challenges

Jan 21, 2025

Shoko Araki, Nobutaka Ito, Reinhold Haeb-Umbach, Gordon Wichern, Zhong-Qiu Wang, Yuki Mitsufuji

Abstract:Source separation (SS) of acoustic signals is a research field that emerged in the mid-1990s and has flourished ever since. On the occasion of ICASSP's 50th anniversary, we review the major contributions and advancements in the past three decades in the speech, audio, and music SS research field. We will cover both single- and multi-channel SS approaches. We will also look back on key efforts to foster a culture of scientific evaluation in the research field, including challenges, performance metrics, and datasets. We will conclude by discussing current trends and future research directions.

* Accepted by IEEE ICASSP 2025

Via

Access Paper or Ask Questions

Speech Synthesis along Perceptual Voice Quality Dimensions

Jan 15, 2025

Frederik Rautenberg, Michael Kuhlmann, Fritz Seebauer, Jana Wiechmann, Petra Wagner, Reinhold Haeb-Umbach

Figure 1 for Speech Synthesis along Perceptual Voice Quality Dimensions

Figure 2 for Speech Synthesis along Perceptual Voice Quality Dimensions

Figure 3 for Speech Synthesis along Perceptual Voice Quality Dimensions

Figure 4 for Speech Synthesis along Perceptual Voice Quality Dimensions

Abstract:While expressive speech synthesis or voice conversion systems mainly focus on controlling or manipulating abstract prosodic characteristics of speech, such as emotion or accent, we here address the control of perceptual voice qualities (PVQs) recognized by phonetic experts, which are speech properties at a lower level of abstraction. The ability to manipulate PVQs can be a valuable tool for teaching speech pathologists in training or voice actors. In this paper, we integrate a Conditional Continuous-Normalizing-Flow-based method into a Text-to-Speech system to modify perceptual voice attributes on a continuous scale. Unlike previous approaches, our system avoids direct manipulation of acoustic correlates and instead learns from examples. We demonstrate the system's capability by manipulating four voice qualities: Roughness, breathiness, resonance and weight. Phonetic experts evaluated these modifications, both for seen and unseen speaker conditions. The results highlight both the system's strengths and areas for improvement.

* Accepted by ICASSP 2025

Via

Access Paper or Ask Questions

Microphone Array Signal Processing and Deep Learning for Speech Enhancement

Jan 13, 2025

Reinhold Haeb-Umbach, Tomohiro Nakatani, Marc Delcroix, Christoph Boeddeker, Tsubasa Ochiai

Figure 1 for Microphone Array Signal Processing and Deep Learning for Speech Enhancement

Figure 2 for Microphone Array Signal Processing and Deep Learning for Speech Enhancement

Figure 3 for Microphone Array Signal Processing and Deep Learning for Speech Enhancement

Figure 4 for Microphone Array Signal Processing and Deep Learning for Speech Enhancement

Abstract:Multi-channel acoustic signal processing is a well-established and powerful tool to exploit the spatial diversity between a target signal and non-target or noise sources for signal enhancement. However, the textbook solutions for optimal data-dependent spatial filtering rest on the knowledge of second-order statistical moments of the signals, which have traditionally been difficult to acquire. In this contribution, we compare model-based, purely data-driven, and hybrid approaches to parameter estimation and filtering, where the latter tries to combine the benefits of model-based signal processing and data-driven deep learning to overcome their individual deficiencies. We illustrate the underlying design principles with examples from noise reduction, source separation, and dereverberation.

* IEEE Signal Processing Magazine, vol. 41, no. 6, pp. 12-23, Nov. 2024
* Accepted for IEEE Signal Processing Magazine

Via

Access Paper or Ask Questions

Simultaneous Diarization and Separation of Meetings through the Integration of Statistical Mixture Models

Oct 28, 2024

Tobias Cord-Landwehr, Christoph Boeddeker, Reinhold Haeb-Umbach

Figure 1 for Simultaneous Diarization and Separation of Meetings through the Integration of Statistical Mixture Models

Figure 2 for Simultaneous Diarization and Separation of Meetings through the Integration of Statistical Mixture Models

Figure 3 for Simultaneous Diarization and Separation of Meetings through the Integration of Statistical Mixture Models

Abstract:We propose an approach for simultaneous diarization and separation of meeting data. It consists of a complex Angular Central Gaussian Mixture Model (cACGMM) for speech source separation, and a von-Mises-Fisher Mixture Model (VMFMM) for diarization in a joint statistical framework. Through the integration, both spatial and spectral information are exploited for diarization and separation. We also develop a method for counting the number of active speakers in a segment of a meeting to support block-wise processing. While the total number of speakers in a meeting may be known, it is usually not known on a per-segment level. With the proposed speaker counting, joint diarization and source separation can be done segment-by-segment, and the permutation problem across segments is solved, thus allowing for block-online processing in the future. Experimental results on the LibriCSS meeting corpus show that the integrated approach outperforms a cascaded approach of diarization and speech enhancement in terms of WER, both on a per-segment and on a per-meeting level.

* Submitted to ICASSP2025

Via

Access Paper or Ask Questions

Speaker and Style Disentanglement of Speech Based on Contrastive Predictive Coding Supported Factorized Variational Autoencoder

Sep 05, 2024

Yuying Xie, Michael Kuhlmann, Frederik Rautenberg, Zheng-Hua Tan, Reinhold Haeb-Umbach

Abstract:Speech signals encompass various information across multiple levels including content, speaker, and style. Disentanglement of these information, although challenging, is important for applications such as voice conversion. The contrastive predictive coding supported factorized variational autoencoder achieves unsupervised disentanglement of a speech signal into speaker and content embeddings by assuming speaker info to be temporally more stable than content-induced variations. However, this assumption may introduce other temporal stable information into the speaker embeddings, like environment or emotion, which we call style. In this work, we propose a method to further disentangle non-content features into distinct speaker and style features, notably by leveraging readily accessible and well-defined speaker labels without the necessity for style labels. Experimental results validate the proposed method's effectiveness on extracting disentangled features, thereby facilitating speaker, style, or combined speaker-style conversion.

* Accepted by EUSIPCO 2024

Via

Access Paper or Ask Questions

Diminishing Domain Mismatch for DNN-Based Acoustic Distance Estimation via Stochastic Room Reverberation Models

Aug 26, 2024

Tobias Gburrek, Adrian Meise, Joerg Schmalenstroeer, Reinhold Haeb-Umbach

Figure 1 for Diminishing Domain Mismatch for DNN-Based Acoustic Distance Estimation via Stochastic Room Reverberation Models

Figure 2 for Diminishing Domain Mismatch for DNN-Based Acoustic Distance Estimation via Stochastic Room Reverberation Models

Figure 3 for Diminishing Domain Mismatch for DNN-Based Acoustic Distance Estimation via Stochastic Room Reverberation Models

Figure 4 for Diminishing Domain Mismatch for DNN-Based Acoustic Distance Estimation via Stochastic Room Reverberation Models

Abstract:The room impulse response (RIR) encodes, among others, information about the distance of an acoustic source from the sensors. Deep neural networks (DNNs) have been shown to be able to extract that information for acoustic distance estimation. Since there exists only a very limited amount of annotated data, e.g., RIRs with distance information, training a DNN for acoustic distance estimation has to rely on simulated RIRs, resulting in an unavoidable mismatch to RIRs of real rooms. In this contribution, we show that this mismatch can be reduced by a novel combination of geometric and stochastic modeling of RIRs, resulting in a significantly improved distance estimation accuracy.

* Accepted for publication IWAENC 2024

Via

Access Paper or Ask Questions

Once more Diarization: Improving meeting transcription systems through segment-level speaker reassignment

Jun 05, 2024

Christoph Boeddeker, Tobias Cord-Landwehr, Reinhold Haeb-Umbach

Abstract:Diarization is a crucial component in meeting transcription systems to ease the challenges of speech enhancement and attribute the transcriptions to the correct speaker. Particularly in the presence of overlapping or noisy speech, these systems have problems reliably assigning the correct speaker labels, leading to a significant amount of speaker confusion errors. We propose to add segment-level speaker reassignment to address this issue. By revisiting, after speech enhancement, the speaker attribution for each segment, speaker confusion errors from the initial diarization stage are significantly reduced. Through experiments across different system configurations and datasets, we further demonstrate the effectiveness and applicability in various domains. Our results show that segment-level speaker reassignment successfully rectifies at least 40% of speaker confusion word errors, highlighting its potential for enhancing diarization accuracy in meeting transcription systems.

* Accepted for Interspeech 2024

Via

Access Paper or Ask Questions

Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios

Jan 08, 2024

Tobias Cord-Landwehr, Christoph Boeddeker, Cătălin Zorilă, Rama Doddipatla, Reinhold Haeb-Umbach

Figure 1 for Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios

Figure 2 for Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios

Figure 3 for Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios

Figure 4 for Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios

Abstract:We propose a modified teacher-student training for the extraction of frame-wise speaker embeddings that allows for an effective diarization of meeting scenarios containing partially overlapping speech. To this end, a geodesic distance loss is used that enforces the embeddings computed from regions with two active speakers to lie on the shortest path on a sphere between the points given by the d-vectors of each of the active speakers. Using those frame-wise speaker embeddings in clustering-based diarization outperforms segment-level clustering-based diarization systems such as VBx and Spectral Clustering. By extending our approach to a mixture-model-based diarization, the performance can be further improved, approaching the diarization error rates of diarization systems that use a dedicated overlap detection, and outperforming these systems when also employing an additional overlap detection.

* Accepted at ICASSP 2024

Via

Access Paper or Ask Questions