Emmanuel Dupoux

LSCP, CoML

Evaluating the reliability of acoustic speech embeddings

Jul 27, 2020

Robin Algayres, Mohamed Salah Zaiem, Benoit Sagot, Emmanuel Dupoux

Abstract: Speech embeddings are fixed-size acoustic representations of variable-length speech sequences. They are increasingly used for a variety of tasks ranging from information retrieval to unsupervised term discovery and speech segmentation. However, there is currently no clear methodology to compare or optimise the quality of these embeddings in a task-neutral way. Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods, ranging from supervised to fully unsupervised, and using different loss functions (autoencoders, correspondence autoencoders, siamese). We then use ABX and MAP to predict performance on a new downstream task: the unsupervised estimation of the frequencies of speech segments in a given corpus. We find that overall, ABX and MAP correlate with one another and with frequency estimation. However, substantial discrepancies appear in the fine-grained distinctions across languages and/or embedding methods. This makes it unrealistic at present to propose a task-independent silver bullet method for computing the intrinsic quality of speech embeddings. There is a need for more detailed analysis of the metrics currently used to evaluate such embeddings.

* Conference paper at Interspeech 2020
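
As a rough illustration of one of the two metrics named in the abstract above, the following Python sketch computes Mean Average Precision in a same-word retrieval setting: each embedding serves as a query, the remaining embeddings are ranked by cosine similarity, and precision is averaged over the ranks at which items sharing the query's label appear. The function names, the choice of cosine similarity, and the toy data are assumptions made for illustration; the paper's exact evaluation protocol is not given on this page and may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_average_precision(embeddings, labels):
    """MAP over all queries that have at least one other item with the same label.

    embeddings: list of equal-length numeric vectors (hypothetical toy input)
    labels:     list of hashable labels, e.g. word identities (assumed available)
    """
    average_precisions = []
    for i, query in enumerate(embeddings):
        # Rank every other item by decreasing similarity to the query.
        others = [j for j in range(len(embeddings)) if j != i]
        others.sort(key=lambda j: cosine(query, embeddings[j]), reverse=True)

        hits = 0
        precisions = []
        for rank, j in enumerate(others, start=1):
            if labels[j] == labels[i]:
                hits += 1
                # Precision at the rank where a relevant item is retrieved.
                precisions.append(hits / rank)

        # Queries with no same-label item are skipped: AP is undefined for them.
        if precisions:
            average_precisions.append(sum(precisions) / len(precisions))

    if not average_precisions:
        raise ValueError("no query has a matching item; MAP is undefined")
    return sum(average_precisions) / len(average_precisions)


if __name__ == "__main__":
    # Toy data: two well-separated "word types" in a 2-D embedding space.
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    toy_labels = ["a", "a", "b", "b"]
    print(mean_average_precision(toy_embeddings, toy_labels))
```

A score of 1 means every query ranks all of its same-label items ahead of everything else; lower values indicate that differently labelled items intrude into the top of the ranking.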

Data Augmenting Contrastive Learning of Speech Representations in the Time Domain

Jul 02, 2020

Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, Emmanuel Dupoux

Abstract: Contrastive Predictive Coding (CPC), based on predicting future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of speech signals. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library, and find that applying augmentation in the past is generally more efficient and yields better performance than other methods. We find that a combination of pitch modification, additive noise and reverberation substantially increases the performance of CPC (relative improvement of 18-22%), beating the reference Libri-light results with 600 times less data. Using an out-of-domain dataset, time-domain data augmentation can push CPC to be on par with the state of the art on the Zero Speech Benchmark 2017. We also show that time-domain data augmentation consistently improves downstream limited-supervision phoneme classification tasks by 12-15% relative.

Vocal markers from sustained phonation in Huntington's Disease

Jun 09, 2020

Rachid Riad, Hadrien Titeux, Laurie Lemoine, Justine Montillot, Jennifer Hamet Bagnou, Xuan Nga Cao, Emmanuel Dupoux, Anne-Catherine Bachoud-Lévi

Abstract: Disease-modifying treatments are currently being assessed in neurodegenerative diseases. Huntington's Disease represents a unique opportunity to design automatic sub-clinical markers, even in premanifest gene carriers. We investigated phonatory impairments as potential clinical markers and propose them for both diagnosis and follow-up of gene carriers. We used two sets of features: Phonatory features and Modulation Power Spectrum features. We found that phonation is not sufficient for the identification of sub-clinical disorders in premanifest gene carriers. According to our regression results, Phonatory features are suitable for predicting clinical performance in Huntington's Disease.

* Submitted to INTERSPEECH 2020. 1 page of supplementary material appears only in the arXiv version

Occlusion resistant learning of intuitive physics from videos

Apr 30, 2020

Ronan Riochet, Josef Sivic, Ivan Laptev, Emmanuel Dupoux

Abstract: To reach human performance on complex tasks, a key ability for artificial systems is to understand physical interactions between objects and predict future outcomes of a situation. This ability, often referred to as intuitive physics, has recently received attention, and several methods have been proposed to learn these physical rules from video sequences. Yet most of these methods are restricted to the case where no, or only limited, occlusions occur. In this work we propose a probabilistic formulation of learning intuitive physics in 3D scenes with significant inter-object occlusions. In our formulation, object positions are modeled as latent variables enabling the reconstruction of the scene. We then propose a series of approximations that make this problem tractable. Object proposals are linked across frames using a combination of a recurrent interaction network, modeling the physics in object space, and a compositional renderer, modeling the way in which objects project onto pixel space. We demonstrate significant improvements over the state of the art on the IntPhys intuitive physics benchmark. We apply our method to a second dataset with increasing levels of occlusions, showing it realistically predicts segmentation masks up to 30 frames in the future. Finally, we also show results on predicting motion of objects in real videos.

Compositionality and Generalization in Emergent Languages

Apr 20, 2020

Rahma Chaabouni, Eugene Kharitonov, Diane Bouchacourt, Emmanuel Dupoux, Marco Baroni

Abstract: Natural language allows us to refer to novel composite concepts by combining expressions denoting their parts according to systematic rules, a property known as compositionality. In this paper, we study whether the language emerging in deep multi-agent simulations possesses a similar ability to refer to novel primitive combinations, and whether it accomplishes this feat by strategies akin to human-language compositionality. Equipped with new ways to measure compositionality in emergent languages inspired by disentanglement in representation learning, we establish three main results. First, given sufficiently large input spaces, the emergent language will naturally develop the ability to refer to novel composite concepts. Second, there is no correlation between the degree of compositionality of an emergent language and its ability to generalize. Third, while compositionality is not necessary for generalization, it provides an advantage in terms of language transmission: The more compositional a language is, the more easily it will be picked up by new learners, even when the latter differ in architecture from the original agents. We conclude that compositionality does not arise from simple generalization pressure, but if an emergent language does chance upon it, it will be more likely to survive and thrive.

Seshat: A tool for managing and verifying annotation campaigns of audio data

Mar 03, 2020

Hadrien Titeux, Rachid Riad, Xuan-Nga Cao, Nicolas Hamilakis, Kris Madden, Alejandrina Cristia, Anne-Catherine Bachoud-Lévi, Emmanuel Dupoux

Abstract: We introduce Seshat, a new, simple and open-source software to efficiently manage annotations of speech corpora. The Seshat software allows users to easily customise and manage annotations of large audio corpora while ensuring compliance with the formatting and naming conventions of the annotated output files. In addition, it includes procedures for checking the content of annotations following specific rules that are implemented in personalised parsers. Finally, we propose a double-annotation mode, for which Seshat automatically computes an associated inter-annotator agreement with the γ measure, taking into account the categorisation and segmentation discrepancies.

* LREC, May 2020, Marseilles, France

Identification of primary and collateral tracks in stuttered speech

Mar 02, 2020

Rachid Riad, Anne-Catherine Bachoud-Lévi, Frank Rudzicz, Emmanuel Dupoux

Abstract: Disfluent speech has previously been addressed from two main perspectives: the clinical perspective, focusing on diagnosis, and the Natural Language Processing (NLP) perspective, aiming at modeling these events and detecting them for downstream tasks. In addition, previous works often used different metrics depending on whether the input features are text or speech, making it difficult to compare the different contributions. Here, we introduce a new evaluation framework for disfluency detection inspired by the clinical and NLP perspectives together with the theory of performance from Clark (1996), which distinguishes between primary and collateral tracks. We introduce a novel forced-aligned disfluency dataset from a corpus of semi-directed interviews, and present baseline results directly comparing the performance of text-based features (word and span information) and speech-based features (acoustic-prosodic information). Finally, we introduce new audio features inspired by the word-based span features. We show experimentally that using these features outperforms the baselines for speech-based predictions on the present dataset.

* To be published in LREC 2020

Unsupervised pretraining transfers well across languages

Feb 07, 2020

Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, Emmanuel Dupoux

Abstract: Cross-lingual and multi-lingual training of Automatic Speech Recognition (ASR) has been extensively investigated in the supervised setting. This assumes the existence of a parallel corpus of speech and orthographic transcriptions. Recently, contrastive predictive coding (CPC) algorithms have been proposed to pretrain ASR systems with unlabelled data. In this work, we investigate whether unsupervised pretraining transfers well across languages. We show that a slight modification of the CPC pretraining extracts features that transfer well to other languages, being on par with or even outperforming supervised pretraining. This shows the potential of unsupervised methods for languages with few linguistic resources.

* ICASSP 2020
* 6 pages. Accepted at ICASSP 2020. The 2 pages of supplementary material appear only in the arXiv version

Libri-Light: A Benchmark for ASR with Limited or No Supervision

Dec 17, 2019

Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen (+5 more)

Abstract: We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.

Modeling German Verb Argument Structures: LSTMs vs. Humans

Nov 30, 2019

Charlotte Rochereau, Benoît Sagot, Emmanuel Dupoux

Abstract: LSTMs have proven very successful at language modeling. However, it remains unclear to what extent they are able to capture complex morphosyntactic structures. In this paper, we examine whether LSTMs are sensitive to verb argument structures. We introduce a German grammaticality dataset in which ungrammatical sentences are constructed by manipulating case assignments (e.g., substituting nominative with accusative or dative). We find that LSTMs are better than chance at detecting incorrect argument structures and slightly worse than humans tested on the same dataset. Surprisingly, LSTMs are contaminated by heuristics not found in humans, like a preference toward nominative noun phrases. In other respects they show human-like results, such as biases for particular orders of case assignments.
