In order to reach human performance on complex visual tasks, artificial systems need to incorporate a significant amount of understanding of the world in terms of macroscopic objects, movements, forces, etc. Inspired by work on intuitive physics in infants, we propose an evaluation framework which diagnoses how much a given system understands about physics by testing whether it can tell apart well-matched videos of possible versus impossible events. The test requires systems to compute a physical plausibility score over an entire video. It is free of bias and can test a range of specific physical reasoning skills. We then describe the first release of a benchmark dataset aimed at learning intuitive physics in an unsupervised way, using videos constructed with a game engine. We describe two Deep Neural Network baseline systems trained with a future-frame prediction objective and tested on the possible-versus-impossible discrimination task. The analysis of their results, compared to human data, gives novel insights into the potential and limitations of next-frame prediction architectures.
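As an illustration of the evaluation protocol, the following sketch (not the benchmark's actual implementation; the predictor and the error-to-score mapping are assumptions) shows how prediction errors from a future-frame model could be aggregated into a per-video plausibility score and used for the possible-versus-impossible decision.

```python
# Minimal sketch (not the benchmark's code): scoring a video's physical
# plausibility from the errors of a hypothetical next-frame predictor.
import numpy as np

def predict_next_frame(past_frames):
    # Placeholder predictor: repeat the last frame (a trivial baseline).
    return past_frames[-1]

def plausibility_score(frames):
    """Higher score = more physically plausible.

    The score is the negated maximum prediction error over the video,
    so a single 'impossible' moment is enough to lower plausibility.
    """
    errors = []
    for t in range(1, len(frames)):
        pred = predict_next_frame(frames[:t])
        errors.append(np.mean((frames[t] - pred) ** 2))
    return -max(errors)

def classify_pair(possible_video, impossible_video):
    # The task asks which of two matched videos is the possible one.
    return plausibility_score(possible_video) > plausibility_score(impossible_video)

possible = np.cumsum(np.random.randn(20, 64, 64), axis=0)  # smooth synthetic video
impossible = possible.copy()
impossible[10] += 5.0                                       # abrupt, physics-violating jump
print(classify_pair(possible, impossible))                  # expected: True
```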
State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture. The first one is inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al., 2015), and the second one by the scattering transform (Zeghidour et al., 2017). We propose two modifications to these architectures and systematically compare them to mel-filterbanks on the Wall Street Journal dataset. The first modification is the addition of an instance normalization layer, which greatly improves the gammatone-based trainable filterbanks and speeds up the training of the scattering-based filterbanks. The second one relates to the low-pass filter used in these approaches. These modifications consistently improve performance for both approaches and remove the need for a careful initialization in scattering-based trainable filterbanks. In particular, we show a consistent improvement in word error rate of the trainable filterbanks relative to comparable mel-filterbanks. This is the first time end-to-end models trained from the raw signal significantly outperform mel-filterbanks on a large-vocabulary task under clean recording conditions.
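A minimal sketch of the first modification, assuming a generic convolutional front-end (the filter sizes and the stand-in filterbank are illustrative, not the paper's architecture): instance normalization is applied to the learnable filterbank output before the rest of the acoustic model.

```python
# Illustrative only: a stand-in trainable filterbank followed by the
# per-utterance instance normalization discussed above.
import torch
import torch.nn as nn

frontend = nn.Sequential(
    nn.Conv1d(1, 40, kernel_size=400, stride=160, bias=False),  # stand-in trainable filterbank
    nn.InstanceNorm1d(40),  # normalizes each channel per utterance
)
features = frontend(torch.randn(4, 1, 16000))  # (batch, channels, frames), ~1 s at 16 kHz
```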
We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We get our best performance by learning all front-end steps, from pre-emphasis up to averaging. Finally, we observe that the filters at convergence have an asymmetric impulse response, and that some of them remain almost analytic.
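The following sketch illustrates the complex-filter idea under our own assumptions (filter lengths, hop size, and log compression are illustrative choices, not the exact TD-filterbank parameters): each filter is stored as real and imaginary convolution channels, the squared modulus is taken, and the result is low-pass averaged.

```python
# Minimal sketch, not the authors' implementation: complex filters on the
# raw waveform via paired real/imaginary channels, squared modulus, and
# average-pooling as the low-pass/decimation step.
import torch
import torch.nn as nn

class ComplexConvFrontEnd(nn.Module):
    def __init__(self, n_filters=40, filter_len=400, hop=160):
        super().__init__()
        # 2 * n_filters real-valued channels: (real, imaginary) per filter.
        self.conv = nn.Conv1d(1, 2 * n_filters, filter_len, bias=False)
        self.pool = nn.AvgPool1d(filter_len, stride=hop)  # averaging / decimation

    def forward(self, wav):                     # wav: (batch, 1, samples)
        z = self.conv(wav)
        real, imag = z[:, ::2], z[:, 1::2]
        modulus_sq = real ** 2 + imag ** 2      # squared complex modulus
        return torch.log1p(self.pool(modulus_sq))

feats = ComplexConvFrontEnd()(torch.randn(1, 1, 16000))  # (1, 40, frames)
```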
Developing speech technologies for low-resource languages has become a very active research field over the last decade. Among others, Bayesian models have shown promising results on artificial examples but still lack in situ experiments. Our work applies state-of-the-art Bayesian models to unsupervised Acoustic Unit Discovery (AUD) in a real low-resource language scenario. We also show that Bayesian models can naturally integrate information from other, more resourceful languages by means of an informative prior, leading to more consistent discovered units. Finally, the discovered acoustic units are used, either as the 1-best sequence or as a lattice, to perform word segmentation. Word segmentation results show that this Bayesian approach clearly outperforms a Segmental-DTW baseline on the same corpus.
During their first years of life, infants learn the language(s) of their environment at an amazing speed, despite large cross-cultural variations in the amount and complexity of the available language input. Understanding this simple fact still escapes current cognitive and linguistic theories. Recently, spectacular progress in the engineering sciences, notably machine learning and wearable technology, offers the promise of revolutionizing the study of cognitive development. Machine learning offers powerful learning algorithms that can achieve human-like performance on many linguistic tasks. Wearable sensors can capture vast amounts of data, which enable the reconstruction of the sensory experience of infants in their natural environment. The project of 'reverse engineering' language development, i.e., of building an effective system that mimics infants' achievements, therefore appears to be within reach. Here, we analyze the conditions under which such a project can contribute to our scientific understanding of early language development. We argue that, instead of defining a sub-problem or simplifying the data, computational models should address the full complexity of the learning situation and take as input the raw sensory signals available to infants. This implies that (1) accessible but privacy-preserving repositories of home data be set up and widely shared, (2) models be evaluated at different linguistic levels through a benchmark of psycholinguistic tests that can be passed by machines and humans alike, and (3) linguistically and psychologically plausible learning architectures be scaled up to real data using probabilistic/optimization principles from machine learning. We discuss the feasibility of this approach and present preliminary results.
We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography. We study the replacement of orthographic transcriptions by images and/or translated text in a well-resourced language to help unsupervised discovery from raw speech.
We investigate whether infant-directed speech (IDS) could facilitate word form learning when compared to adult-directed speech (ADS). To study this, we examine the distribution of word forms at two levels, acoustic and phonological, using a large database of spontaneous speech in Japanese. At the acoustic level we show that, as has been documented before for phonemes, the realizations of words are more variable and less discriminable in IDS than in ADS. At the phonological level, we find an effect in the opposite direction: the IDS lexicon contains more distinctive words (such as onomatopoeias) than the ADS counterpart. Combining the acoustic and phonological metrics into a global discriminability score reveals that the greater separation of lexical categories in the phonological space does not compensate for the opposite effect observed at the acoustic level. As a result, IDS word forms are still globally less discriminable than ADS word forms, even though the effect is numerically small. We discuss the implications of these findings for the view that the functional role of IDS is to improve language learnability.
We describe a new challenge aimed at discovering subword and word units from raw speech. This challenge is the follow-up to the Zero Resource Speech Challenge 2015. It aims at constructing systems that generalize across languages and adapt to new speakers. The design features and evaluation metrics of the challenge are presented, and the results of seventeen models are discussed.
Recent works have explored deep architectures for learning multimodal speech representations (e.g. audio and images, articulation and audio) in a supervised way. Here we investigate the role of combining different speech modalities, i.e. audio and visual information representing lip movements, in a weakly supervised way using Siamese networks and lexical same-different side information. In particular, we ask whether one modality can benefit from the other to provide a richer representation for phone recognition in a weakly supervised setting. We introduce mono-task and multi-task methods for merging speech and visual modalities for phone recognition. Mono-task learning consists of applying a Siamese network to the concatenation of the two modalities, while multi-task learning receives several different combinations of modalities at training time. We show that multi-task learning enhances discriminability for visual and multimodal inputs while minimally impacting auditory inputs. Furthermore, we present a qualitative analysis of the obtained phone embeddings and show that cross-modal visual input can improve the discriminability of phonological features which are visually discernible (rounding, open/close, labial place of articulation), resulting in representations that are closer to abstract linguistic features than those based on audio only.
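A minimal sketch of the mono-task setting, with hypothetical feature dimensions and a generic cosine-based same-different loss standing in for the actual training objective: a Siamese embedder is applied to the concatenation of audio and visual (lip) features.

```python
# Illustrative only: Siamese network over concatenated audio + visual
# features, trained from lexical same/different side information.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEmbedder(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=20, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, audio, visual):
        # Mono-task setup: concatenate the two modalities before embedding.
        return self.net(torch.cat([audio, visual], dim=-1))

def same_different_loss(emb_a, emb_b, same):
    # Pull 'same' pairs together, push 'different' pairs apart (cosine similarity).
    sim = F.cosine_similarity(emb_a, emb_b)
    return torch.where(same, 1.0 - sim, F.relu(sim)).mean()

model = SiameseEmbedder()
a1, v1 = torch.randn(8, 40), torch.randn(8, 20)
a2, v2 = torch.randn(8, 40), torch.randn(8, 20)
labels = torch.rand(8) > 0.5                      # True = pair labeled 'same word'
loss = same_different_loss(model(a1, v1), model(a2, v2), labels)
loss.backward()
```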
Phonemic segmentation of speech is a critical step of speech recognition systems. We propose a novel unsupervised algorithm based on sequence prediction models such as Markov chains and recurrent neural networks. Our approach consists of analyzing the error profile of a model trained to predict speech features frame by frame. Specifically, we learn the dynamics of speech in the MFCC space and hypothesize boundaries from local maxima in the prediction error. We evaluate our system on the TIMIT dataset and obtain improvements over similar methods.
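A minimal sketch of the boundary-detection step, with a trivial placeholder in place of the trained Markov-chain or recurrent predictor: boundaries are hypothesized at local maxima of the frame-by-frame prediction error.

```python
# Illustrative only: hypothesize phone boundaries at local maxima of the
# prediction error computed in MFCC space.
import numpy as np

def prediction_errors(mfcc, predict):
    # predict(history) -> next-frame estimate; any callable will do here.
    return np.array([np.mean((mfcc[t] - predict(mfcc[:t])) ** 2)
                     for t in range(1, len(mfcc))])

def hypothesized_boundaries(errors, min_gap=2):
    """Return frame indices whose error is a local maximum."""
    boundaries = []
    for t in range(1, len(errors) - 1):
        if errors[t] > errors[t - 1] and errors[t] > errors[t + 1]:
            if not boundaries or t - boundaries[-1] >= min_gap:
                boundaries.append(t)
    return boundaries

mfcc = np.random.randn(100, 13)                        # 100 frames of 13-d MFCCs
errs = prediction_errors(mfcc, lambda hist: hist[-1])  # trivial 'repeat last frame' predictor
print(hypothesized_boundaries(errs))
```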