Abstract: This study evaluates the Extreme Bandwidth Extension Network (EBEN) model on body-conduction sensors through listening tests. Using the Vibravox dataset, we assess intelligibility with a French Modified Rhyme Test, speech quality with a MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) protocol, and speaker identity preservation with an A/B identification task. The experiments involved male and female speakers recorded with a forehead accelerometer, a rigid in-ear microphone and a throat microphone. The results confirm that EBEN enhances both speech quality and intelligibility, although it slightly degrades speaker identification performance when applied to female speakers' throat microphone recordings. The findings also demonstrate a correlation between Short-Time Objective Intelligibility (STOI) and perceived quality in body-conducted speech, while speaker verification using ECAPA2-TDNN aligns well with identification performance. No tested metric reliably predicts EBEN's effect on intelligibility.
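To illustrate how the objective metric mentioned above is typically computed, here is a minimal sketch using the third-party pystoi and soundfile packages; the file names and the implied sampling rate are placeholders, not the study's actual material.

```python
# Minimal sketch: computing STOI between a reference and an enhanced signal.
# Assumes the third-party packages `soundfile` and `pystoi` are installed;
# the file paths are illustrative placeholders.
import soundfile as sf
from pystoi import stoi

reference, fs = sf.read("airborne_reference.wav")  # clean airborne reference
enhanced, _ = sf.read("eben_enhanced.wav")         # enhanced output to evaluate

# Both signals must share the same sampling rate and length.
n = min(len(reference), len(enhanced))
score = stoi(reference[:n], enhanced[:n], fs, extended=False)
print(f"STOI: {score:.3f}")  # ranges roughly from 0 (unintelligible) to 1
```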
Abstract: Neural audio codecs have grown in popularity thanks to their potential for efficiently modeling audio with transformers. Such codecs map a densely sampled, continuous waveform to discrete units at a low sampling rate. In contrast to semantic units, these acoustic units may lack interpretability because their training objectives primarily target reconstruction performance. This paper proposes a two-step approach to explore how speech information is encoded within the codec tokens. The analysis stage aims to gain deeper insight into how speech attributes such as content, speaker identity, and pitch are encoded. The synthesis stage then trains an AnCoGen network for post-hoc explanation of codecs, extracting speech attributes directly from the corresponding tokens.
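A hypothetical sketch of the kind of probing such an analysis stage relies on: a linear classifier is trained to predict a speech attribute (here, speaker identity) from codec token embeddings. The data, shapes and attribute are placeholders, not the paper's actual setup.

```python
# Hypothetical probing sketch: can a linear model recover a speech attribute
# (e.g. speaker identity) from codec token embeddings? The data here is random
# and only stands in for real (embedding, label) pairs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_frames, embed_dim, n_speakers = 2000, 128, 10
X = rng.normal(size=(n_frames, embed_dim))      # stand-in codec embeddings
y = rng.integers(0, n_speakers, size=n_frames)  # stand-in speaker labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# High probe accuracy would suggest the attribute is linearly decodable from
# the tokens; chance level here is 1 / n_speakers.
print("probe accuracy:", probe.score(X_te, y_te))
```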
Abstract: Vibravox is a dataset compliant with the General Data Protection Regulation (GDPR) containing audio recordings made with five different body-conduction audio sensors: two in-ear microphones, two bone-conduction vibration pickups and a laryngophone. The dataset also includes audio data from an airborne microphone used as a reference. The Vibravox corpus contains 38 hours of speech samples and physiological sounds recorded by 188 participants under different acoustic conditions imposed by a high-order ambisonics 3D spatializer. Annotations about the recording conditions and linguistic transcriptions are also included in the corpus. We conducted a series of experiments on various speech-related tasks, including speech recognition, speech enhancement and speaker verification. These experiments were carried out using state-of-the-art models to evaluate and compare their performance on signals captured by the different audio sensors offered by the Vibravox dataset, with the aim of gaining a better understanding of their individual characteristics.
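For readers who want to experiment with the corpus, a minimal loading sketch follows; it assumes the dataset is published on the Hugging Face Hub under an identifier such as "Cnam-LMSSC/vibravox" with a "speech_clean" configuration, which should be checked against the actual dataset card.

```python
# Minimal sketch for loading Vibravox with the `datasets` library.
# The repository id "Cnam-LMSSC/vibravox", the "speech_clean" configuration
# and the available columns are assumptions to verify on the dataset card.
from datasets import load_dataset

ds = load_dataset("Cnam-LMSSC/vibravox", "speech_clean",
                  split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())  # inspect available sensor channels and annotations
```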
Abstract: This paper presents a configurable version of the Extreme Bandwidth Extension Network (EBEN), a Generative Adversarial Network (GAN) designed to improve audio captured with body-conduction microphones. We show that these microphones significantly reduce environmental noise, but that this insensitivity to ambient noise comes at the expense of the bandwidth of the voice signal acquired from the wearer of the devices. The captured signals therefore require signal enhancement techniques to recover full-bandwidth speech. EBEN leverages a configurable multiband decomposition of the raw captured signal. This decomposition reduces the time-domain dimensionality of the data and gives better control over the full-band signal. The multiband representation of the captured signal is processed by a U-Net-like model, which combines feature and adversarial losses to generate an enhanced speech signal. We also benefit from this original representation in the proposed configurable discriminator architecture. The configurable EBEN approach achieves state-of-the-art enhancement results on synthetic data with a lightweight generator that allows real-time processing.
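To make the multiband idea concrete, here is a simplified NumPy/SciPy sketch of a cosine-modulated filterbank analysis (PQMF-like): a length-T waveform becomes M subband signals of length roughly T/M. The filter order, Kaiser beta and number of bands are illustrative choices, not the paper's configuration.

```python
# Simplified cosine-modulated filterbank (PQMF-like) analysis, illustrative only.
# A length-T waveform becomes n_bands subband signals of length ~T // n_bands,
# which is the dimensionality reduction exploited by the multiband representation.
import numpy as np
from scipy.signal import firwin, lfilter

def multiband_analysis(x, n_bands=4, taps=62, beta=9.0):
    n = np.arange(taps + 1)
    # Prototype low-pass filter with cutoff at 1/(2*n_bands) of Nyquist
    proto = firwin(taps + 1, 1.0 / (2 * n_bands), window=("kaiser", beta))
    subbands = []
    for k in range(n_bands):
        # Cosine modulation shifts the prototype to the k-th frequency band
        h_k = 2 * proto * np.cos(
            (2 * k + 1) * np.pi / (2 * n_bands) * (n - taps / 2)
            + (-1) ** k * np.pi / 4
        )
        # Filter, then decimate by n_bands
        subbands.append(lfilter(h_k, [1.0], x)[::n_bands])
    return np.stack(subbands)  # shape: (n_bands, len(x) // n_bands)

bands = multiband_analysis(np.random.randn(16000), n_bands=4)
print(bands.shape)  # (4, 4000)
```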
Abstract: In this paper, we present the Extreme Bandwidth Extension Network (EBEN), a generative adversarial network (GAN) that enhances audio measured with noise-resilient microphones. This type of capture equipment suppresses ambient noise at the expense of speech bandwidth, thereby requiring signal enhancement techniques to recover the wideband speech signal. EBEN leverages a multiband decomposition of the raw captured speech to reduce the time-domain dimensionality of the data and give better control over the full-band signal. This multiband representation is fed to a U-Net-like model, which adopts a combination of feature and adversarial losses to recover an enhanced audio signal. We also benefit from this original representation in the proposed discriminator architecture. Our approach achieves state-of-the-art results with a lightweight generator and real-time-compatible operation.
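A minimal PyTorch-style sketch of how feature-matching and adversarial terms are commonly combined to train such a generator; the hinge-style adversarial term, the L1 feature matching and the weighting factor are generic assumptions, not EBEN's exact objective.

```python
# Illustrative combination of adversarial and feature-matching losses for a
# GAN generator. The hinge-style adversarial term, L1 feature matching and the
# weighting factor are common choices, assumed here rather than taken from EBEN.
import torch

def generator_loss(disc_fake_outputs, disc_fake_features, disc_real_features,
                   feature_weight=10.0):
    # Adversarial term: push discriminator scores on generated audio towards "real"
    adv = sum(torch.mean(torch.relu(1.0 - out)) for out in disc_fake_outputs)
    # Feature-matching term: match intermediate discriminator activations
    feat = sum(
        torch.mean(torch.abs(f_fake - f_real.detach()))
        for f_fake, f_real in zip(disc_fake_features, disc_real_features)
    )
    return adv + feature_weight * feat
```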
Abstract: Sound source localization using multichannel signal processing has been a subject of active research for decades. In recent years, the use of deep learning in audio signal processing has drastically improved performance in machine hearing, which has motivated the scientific community to develop machine learning strategies for source localization as well. In this paper, we present BeamLearning, a multi-resolution deep learning approach that encodes the relevant information contained in unprocessed time-domain acoustic signals captured by microphone arrays. The use of raw data aims at avoiding the simplifying hypotheses that most traditional model-based localization methods rely on. The benefits of this approach are shown for real-time 2D sound source localization tasks in reverberant and noisy environments. Since supervised machine learning approaches require large, physically realistic, precisely labelled datasets, we also developed a fast GPU-based computation of room impulse responses using fractional delays for image source models. A thorough analysis of the network representation and extensive performance tests are carried out using the BeamLearning network with synthetic and experimental datasets. The obtained results demonstrate that the BeamLearning approach significantly outperforms the wideband MUSIC and SRP-PHAT methods in terms of localization accuracy and computational efficiency in the presence of heavy measurement noise and reverberation.
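The fractional-delay image source idea can be sketched as follows: each image source contributes an attenuated impulse at a non-integer sample delay, spread here over two samples by linear interpolation. All numbers are placeholders; a real implementation derives distances and amplitudes from the room geometry and wall absorption, and runs on GPU with higher-order fractional-delay filters.

```python
# Toy sketch of building a room impulse response from image sources with
# fractional delays (linear interpolation between the two nearest samples).
# Distances and amplitudes are placeholders; a real image source model derives
# them from the room geometry, reflection coefficients and source/receiver
# positions.
import numpy as np

fs, c = 16000, 343.0                   # sampling rate (Hz), speed of sound (m/s)
distances = np.array([3.2, 5.7, 8.4])  # source-to-receiver path lengths (m)
amplitudes = 1.0 / distances           # 1/r spherical spreading (no absorption)

rir = np.zeros(int(fs * max(distances) / c) + 2)
for d, a in zip(distances, amplitudes):
    delay = d / c * fs                 # delay in (fractional) samples
    i, frac = int(np.floor(delay)), delay % 1.0
    rir[i] += a * (1.0 - frac)         # split the impulse between the two
    rir[i + 1] += a * frac             # neighbouring samples
```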