Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

David Doukhan

Data Selection Effects on Self-Supervised Learning of Audio Representations for French Audiovisual Broadcasts

Apr 10, 2026

Valentin Pelloin, Lina Bekkali, Reda Dehak, David Doukhan

Abstract:Audio and speech self-supervised encoder models are now widely used for a lot of different tasks. Many of these models are often trained on clean segmented speech content such as LibriSpeech. In this paper, we look into how the pretraining datasets of such SSL (Self-Supervised Learning) models impact their downstream results. We build a large pretraining corpus of highly diverse TV and Radio broadcast audio content, which we describe with automatic tools. We use these annotations to build smaller subsets, which we use to train audio SSL models. Then, we evaluate the models on multiple downstream tasks such as automatic speech recognition, voice activity and music detection, or speaker recognition. The results show the potential of pretraining SSL models on diverse audio content without restricting it to speech. We also perform a membership inference attack to evaluate the encoder ability to memorize their training datasets, which highlight the importance of data deduplication. This unified training could bridge speech and music machine learning communities.

* To be published in the Fifteenth International Conference on Language Resources and Evaluation (LREC 2026)

Via

Access Paper or Ask Questions

spINAch: A Diachronic Corpus of French Broadcast Speech Controlled for Speakers' Age and Gender

Mar 16, 2026

Simon Devauchelle, David Doukhan, Rémi Uro, Lucas Ondel Yang, Valentin Pelloin, Olympia Imbert-Brégégère, Véronique Lefort, Kévin Picard, Emeline Seignobos, Albert Rilliard

Abstract:We present spINAch, a large diachronic corpus of French speech from radio and television archives, balanced by speakers' gender, age (20-95 years old), and spanning 60 years from 1955 to 2015. The dataset includes over 320 hours of recordings from more than two thousand speakers. The methodology for building the corpus is described, focusing on the quality of collected samples in acoustic terms. The data were automatically transcribed and phonetically aligned to allow studies at a phonemic level. More than 3 million oral vowels have been analyzed to propose their fundamental frequency and formants. The corpus, available to the community for research purposes, is valuable for describing the evolution of Parisian French through the representation of gender and age. The presented analyses also demonstrate that the diachronic nature of the corpus allows the observation of various phonetic phenomena, such as the evolution of voice pitch over time (which does not differ by gender in our data) and the neutralization of the /a/-/$a$/ opposition in Parisian French during this period.

* 16 pages, 3 figures, to be published in the Fifteenth International Conference on Language Resources and Evaluation (LREC 2026)

Via

Access Paper or Ask Questions

Articulatory Configurations across Genders and Periods in French Radio and TV archives

Aug 08, 2024

Benjamin Elie, David Doukhan, Rémi Uro, Lucas Ondel-Yang, Albert Rilliard, Simon Devauchelle

Abstract:This paper studies changes in articulatory configurations across genders and periods using an inversion from acoustic to articulatory parameters. From a diachronic corpus based on French media archives spanning 60 years from 1955 to 2015, automatic transcription and forced alignment allowed extracting the central frame of each vowel. More than one million frames were obtained from over a thousand speakers across gender and age categories. Their formants were used from these vocalic frames to fit the parameters of Maeda's articulatory model. Evaluations of the quality of these processes are provided. We focus here on two parameters of Maeda's model linked to total vocal tract length: the relative position of the larynx (higher for females) and the lips protrusion (more protruded for males). Implications for voice quality across genders are discussed. The effect across periods seems gender independent; thus, the assertion that females lowered their pitch with time is not supported.

* accepted to InterSpeech 2024, Kos Island, Greece keywords : acoustic to articulatory inversion, diachrony, gender, French, media

Via

Access Paper or Ask Questions

Automatic Classification of News Subjects in Broadcast News: Application to a Gender Bias Representation Analysis

Jul 19, 2024

Valentin Pelloin, Lena Dodson, Émile Chapuis, Nicolas Hervé, David Doukhan

Abstract:This paper introduces a computational framework designed to delineate gender distribution biases in topics covered by French TV and radio news. We transcribe a dataset of 11.7k hours, broadcasted in 2023 on 21 French channels. A Large Language Model (LLM) is used in few-shot conversation mode to obtain a topic classification on those transcriptions. Using the generated LLM annotations, we explore the finetuning of a specialized smaller classification model, to reduce the computational cost. To evaluate the performances of these models, we construct and annotate a dataset of 804 dialogues. This dataset is made available free of charge for research purposes. We show that women are notably underrepresented in subjects such as sports, politics and conflicts. Conversely, on topics such as weather, commercials and health, women have more speaking time than their overall average across all subjects. We also observe representations differences between private and public service channels.

* Accepted to Interspeech 2024

Via

Access Paper or Ask Questions

Gender Representation in TV and Radio: Automatic Information Extraction methods versus Manual Analyses

Jun 14, 2024

David Doukhan, Lena Dodson, Manon Conan, Valentin Pelloin, Aurélien Clamouse, Mélina Lepape, Géraldine Van Hille, Cécile Méadel, Marlène Coulomb-Gully

Figure 1 for Gender Representation in TV and Radio: Automatic Information Extraction methods versus Manual Analyses

Figure 2 for Gender Representation in TV and Radio: Automatic Information Extraction methods versus Manual Analyses

Figure 3 for Gender Representation in TV and Radio: Automatic Information Extraction methods versus Manual Analyses

Figure 4 for Gender Representation in TV and Radio: Automatic Information Extraction methods versus Manual Analyses

Abstract:This study investigates the relationship between automatic information extraction descriptors and manual analyses to describe gender representation disparities in TV and Radio. Automatic descriptors, including speech time, facial categorization and speech transcriptions are compared with channel reports on a vast 32,000-hour corpus of French broadcasts from 2023. Findings reveal systemic gender imbalances, with women underrepresented compared to men across all descriptors. Notably, manual channel reports show higher women's presence than automatic estimates and references to women are lower than their speech time. Descriptors share common dynamics during high and low audiences, war coverage, or private versus public channels. While women are more visible than audible in French TV, this trend is inverted in news with unseen journalists depicting male protagonists. A statistical test shows 3 main effects influencing references to women: program category, channel and speaker gender.

* keywords : Gender representation, computational humanities, TV, Radio, face classification, speaker traits, ASR, media, SLU. Accepted to InterSpeech 2024, Kos Island, Greece, september 2024

Via

Access Paper or Ask Questions

Detecting the terminality of speech-turn boundary for spoken interactions in French TV and Radio content

Jun 14, 2024

Rémi Uro, Marie Tahon, David Doukhan, Antoine Laurent, Albert Rilliard

Figure 1 for Detecting the terminality of speech-turn boundary for spoken interactions in French TV and Radio content

Figure 2 for Detecting the terminality of speech-turn boundary for spoken interactions in French TV and Radio content

Figure 3 for Detecting the terminality of speech-turn boundary for spoken interactions in French TV and Radio content

Figure 4 for Detecting the terminality of speech-turn boundary for spoken interactions in French TV and Radio content

Abstract:Transition Relevance Places are defined as the end of an utterance where the interlocutor may take the floor without interrupting the current speaker --i.e., a place where the turn is terminal. Analyzing turn terminality is useful to study the dynamic of turn-taking in spontaneous conversations. This paper presents an automatic classification of spoken utterances as Terminal or Non-Terminal in multi-speaker settings. We compared audio, text, and fusions of both approaches on a French corpus of TV and Radio extracts annotated with turn-terminality information at each speaker change. Our models are based on pre-trained self-supervised representations. We report results for different fusion strategies and varying context sizes. This study also questions the problem of performance variability by analyzing the differences in results for multiple training runs with random initialization. The measured accuracy would allow the use of these models for large-scale analysis of turn-taking.

* keywords : Spoken interaction, Media, TV, Radio, Transition-Relevance Places, Turn Taking, Interruption. Accepted to InterSpeech 2024, Kos Island, Greece

Via

Access Paper or Ask Questions

InaGVAD : a Challenging French TV and Radio Corpus Annotated for Speech Activity Detection and Speaker Gender Segmentation

Jun 06, 2024

David Doukhan, Christine Maertens, William Le Personnic, Ludovic Speroni, Reda Dehak

Figure 1 for InaGVAD : a Challenging French TV and Radio Corpus Annotated for Speech Activity Detection and Speaker Gender Segmentation

Figure 2 for InaGVAD : a Challenging French TV and Radio Corpus Annotated for Speech Activity Detection and Speaker Gender Segmentation

Figure 3 for InaGVAD : a Challenging French TV and Radio Corpus Annotated for Speech Activity Detection and Speaker Gender Segmentation

Figure 4 for InaGVAD : a Challenging French TV and Radio Corpus Annotated for Speech Activity Detection and Speaker Gender Segmentation

Abstract:InaGVAD is an audio corpus collected from 10 French radio and 18 TV channels categorized into 4 groups: generalist radio, music radio, news TV, and generalist TV. It contains 277 1-minute-long annotated recordings aimed at representing the acoustic diversity of French audiovisual programs and was primarily designed to build systems able to monitor men's and women's speaking time in media. inaGVAD is provided with Voice Activity Detection (VAD) and Speaker Gender Segmentation (SGS) annotations extended with overlap, speaker traits (gender, age, voice quality), and 10 non-speech event categories. Annotation distributions are detailed for each channel category. This dataset is partitioned into a 1h development and a 3h37 test subset, allowing fair and reproducible system evaluation. A benchmark of 6 freely available VAD software is presented, showing diverse abilities based on channel and non-speech event categories. Two existing SGS systems are evaluated on the corpus and compared against a baseline X-vector transfer learning strategy, trained on the development subset. Results demonstrate that our proposal, trained on a single - but diverse - hour of data, achieved competitive SGS results. The entire inaGVAD package; including corpus, annotations, evaluation scripts, and baseline training code; is made freely accessible, fostering future advancement in the domain.

* Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 8963-8974, Torino, Italia. ELRA and ICCL
* Voice Activity Detection (VAD), Speaker Gender Segmentation, Audiovisual Speech Resource, Speaker Traits, Speech Overlap, Benchmark, X-vector, Gender Representation in the Media, Dataset

Via

Access Paper or Ask Questions

A Semi-Automatic Approach to Create Large Gender- and Age-Balanced Speaker Corpora: Usefulness of Speaker Diarization & Identification

Apr 26, 2024

Rémi Uro, David Doukhan, Albert Rilliard, Laëtitia Larcher, Anissa-Claire Adgharouamane, Marie Tahon, Antoine Laurent

Abstract:This paper presents a semi-automatic approach to create a diachronic corpus of voices balanced for speaker's age, gender, and recording period, according to 32 categories (2 genders, 4 age ranges and 4 recording periods). Corpora were selected at French National Institute of Audiovisual (INA) to obtain at least 30 speakers per category (a total of 960 speakers; only 874 have be found yet). For each speaker, speech excerpts were extracted from audiovisual documents using an automatic pipeline consisting of speech detection, background music and overlapped speech removal and speaker diarization, used to present clean speaker segments to human annotators identifying target speakers. This pipeline proved highly effective, cutting down manual processing by a factor of ten. Evaluation of the quality of the automatic processing and of the final output is provided. It shows the automatic processing compare to up-to-date process, and that the output provides high quality speech for most of the selected excerpts. This method shows promise for creating large corpora of known target speakers.

* Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pages 3271-3280, Marseille, 20-25 June 2022. European Language Resources Association (ELRA)
* Keywords:, semi-automatic processing, corpus creation, diarization, speaker identification, gender-balanced, age-balanced, speaker corpus, diachrony

Via

Access Paper or Ask Questions

Evolution of Voices in French Audiovisual Media Across Genders and Age in a Diachronic Perspective

Apr 24, 2024

Albert Rilliard, David Doukhan, Rémi Uro, Simon Devauchelle

Abstract:We present a diachronic acoustic analysis of the voice of 1023 speakers from French media archives. The speakers are spread across 32 categories based on four periods (years 1955/56, 1975/76, 1995/96, 2015/16), four age groups (20-35; 36-50; 51-65, >65), and two genders. The fundamental frequency ($F_0$) and the first four formants (F1-4) were estimated. Procedures used to ensure the quality of these estimations on heterogeneous data are described. From each speaker's $F_0$ distribution, the base-$F_0$ value was calculated to estimate the register. Average vocal tract length was estimated from formant frequencies. Base-$F_0$ and vocal tract length were fit by linear mixed models to evaluate how they may have changed across time periods and genders, corrected for age effects. Results show an effect of the period with a tendency to lower voices, independently of gender. A lowering of pitch is observed with age for female but not male speakers.

* Radek Skarnitzl & Jan Vol\'in (Eds.), Proceedings of the 20th International Congress of Phonetic Sciences (ICPhS), Prague 2023, pp. 753-757. Guarant International. ISBN 978-80-908 114-2-3
* 5 pages, 2 figures, keywords:, Gender, Diachrony, Vocal Tract Resonance, Vocal register, Broadcast speech

Via

Access Paper or Ask Questions

Voice Passing : a Non-Binary Voice Gender Prediction System for evaluating Transgender voice transition

Apr 23, 2024

David Doukhan, Simon Devauchelle, Lucile Girard-Monneron, Mía Chávez Ruz, V. Chaddouk, Isabelle Wagner, Albert Rilliard

Abstract:This paper presents a software allowing to describe voices using a continuous Voice Femininity Percentage (VFP). This system is intended for transgender speakers during their voice transition and for voice therapists supporting them in this process. A corpus of 41 French cis- and transgender speakers was recorded. A perceptual evaluation allowed 57 participants to estimate the VFP for each voice. Binary gender classification models were trained on external gender-balanced data and used on overlapping windows to obtain average gender prediction estimates, which were calibrated to predict VFP and obtained higher accuracy than $F_0$ or vocal track length-based models. Training data speaking style and DNN architecture were shown to impact VFP estimation. Accuracy of the models was affected by speakers' age. This highlights the importance of style, age, and the conception of gender as binary or not, to build adequate statistical representations of cultural concepts.

* Proc. INTERSPEECH 2023, 5207-5211
* 5 pages, 1 figure, keywords: Transgender voice, Gender perception, Speaker gender classification, CNN, X-Vector

Via

Access Paper or Ask Questions