Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Gabriel Synnaeve

Jack

Polygames: Improved Zero Learning

Jan 27, 2020

Tristan Cazenave, Yen-Chi Chen, Guan-Wei Chen, Shi-Yu Chen, Xian-Dong Chiu, Julien Dehos, Maria Elsa, Qucheng Gong, Hengyuan Hu, Vasil Khalidov(+14 more)

Figure 1 for Polygames: Improved Zero Learning

Figure 2 for Polygames: Improved Zero Learning

Abstract:Since DeepMind's AlphaZero, Zero learning quickly became the state-of-the-art method for many board games. It can be improved using a fully convolutional structure (no fully connected layer). Using such an architecture plus global pooling, we can create bots independent of the board size. The training can be made more robust by keeping track of the best checkpoints during the training and by training against them. Using these features, we release Polygames, our framework for Zero learning, with its library of games and its checkpoints. We won against strong humans at the game of Hex in 19x19, which was often said to be untractable for zero learning; and in Havannah. We also won several first places at the TAAI competitions.

Via

Access Paper or Ask Questions

Scaling Up Online Speech Recognition Using ConvNets

Jan 27, 2020

Vineel Pratap, Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana Likhomanenko, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

Figure 1 for Scaling Up Online Speech Recognition Using ConvNets

Figure 2 for Scaling Up Online Speech Recognition Using ConvNets

Figure 3 for Scaling Up Online Speech Recognition Using ConvNets

Figure 4 for Scaling Up Online Speech Recognition Using ConvNets

Abstract:We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC). We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy. The system has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency and a better word error rate. Also important to the efficiency of the recognizer is our highly optimized beam search decoder. To show the impact of our design choices, we analyze throughput, latency, accuracy, and discuss how these metrics can be tuned based on the user requirements.

Via

Access Paper or Ask Questions

Libri-Light: A Benchmark for ASR with Limited or No Supervision

Dec 17, 2019

Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen(+5 more)

Figure 1 for Libri-Light: A Benchmark for ASR with Limited or No Supervision

Figure 2 for Libri-Light: A Benchmark for ASR with Limited or No Supervision

Figure 3 for Libri-Light: A Benchmark for ASR with Limited or No Supervision

Figure 4 for Libri-Light: A Benchmark for ASR with Limited or No Supervision

Abstract:We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.

Via

Access Paper or Ask Questions

End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures

Nov 19, 2019

Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Edouard Grave, Tatiana Likhomanenko, Vineel Pratap, Anuroop Sriram, Vitaliy Liptchinsky, Ronan Collobert

Figure 1 for End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures

Figure 2 for End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures

Abstract:We study ResNet-, Time-Depth Separable ConvNets-, and Transformer-based acoustic models, trained with CTC or Seq2Seq criterions. We perform experiments on the LibriSpeech dataset, with and without LM decoding, optionally with beam rescoring. We reach 5.18% WER with external language models for decoding and rescoring. Additionally, we leverage the unlabeled data from LibriVox by doing semi-supervised training and show that it is possible to reach 5.29% WER on test-other without decoding, and 4.11% WER with decoding and rescoring, with only the standard 960 hours from LibriSpeech as labeled data.

Via

Access Paper or Ask Questions

Deja-vu: Double Feature Presentation in Deep Transformer Networks

Oct 23, 2019

Andros Tjandra, Chunxi Liu, Frank Zhang, Xiaohui Zhang, Yongqiang Wang, Gabriel Synnaeve, Satoshi Nakamura, Geoffrey Zweig

Figure 1 for Deja-vu: Double Feature Presentation in Deep Transformer Networks

Figure 2 for Deja-vu: Double Feature Presentation in Deep Transformer Networks

Figure 3 for Deja-vu: Double Feature Presentation in Deep Transformer Networks

Figure 4 for Deja-vu: Double Feature Presentation in Deep Transformer Networks

Abstract:Deep acoustic models typically receive features in the first layer of the network, and process increasingly abstract representations in the subsequent layers. Here, we propose to feed the input features at multiple depths in the acoustic model. As our motivation is to allow acoustic models to re-examine their input features in light of partial hypotheses we introduce intermediate model heads and loss function. We study this architecture in the context of deep Transformer networks, and we use an attention mechanism over both the previous layer activations and the input features. To train this model's intermediate output hypothesis, we apply the objective function at each layer right before feature re-use. We find that the use of such intermediate losses significantly improves performance by itself, as well as enabling input feature re-use. We present results on both Librispeech, and a large scale video dataset, with relative improvements of 10 - 20% for Librispeech and 3.2 - 13% for videos.

Via

Access Paper or Ask Questions

A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning

Oct 19, 2019

Nicolas Carion, Gabriel Synnaeve, Alessandro Lazaric, Nicolas Usunier

Figure 1 for A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning

Figure 2 for A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning

Figure 3 for A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning

Figure 4 for A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning

Abstract:Effective coordination is crucial to solve multi-agent collaborative (MAC) problems. While centralized reinforcement learning methods can optimally solve small MAC instances, they do not scale to large problems and they fail to generalize to scenarios different from those seen during training. In this paper, we consider MAC problems with some intrinsic notion of locality (e.g., geographic proximity) such that interactions between agents and tasks are locally limited. By leveraging this property, we introduce a novel structured prediction approach to assign agents to tasks. At each step, the assignment is obtained by solving a centralized optimization problem (the inference procedure) whose objective function is parameterized by a learned scoring model. We propose different combinations of inference procedures and scoring models able to represent coordination patterns of increasing complexity. The resulting assignment policy can be efficiently learned on small problem instances and readily reused in problems with more agents and tasks (i.e., zero-shot generalization). We report experimental results on a toy search and rescue problem and on several target selection scenarios in StarCraft: Brood War, in which our model significantly outperforms strong rule-based baselines on instances with 5 times more agents and tasks than those seen during training.

* NeurIPS 2019

Via

Access Paper or Ask Questions

Why Build an Assistant in Minecraft?

Jul 25, 2019

Arthur Szlam, Jonathan Gray, Kavya Srinet, Yacine Jernite, Armand Joulin, Gabriel Synnaeve, Douwe Kiela, Haonan Yu, Zhuoyuan Chen, Siddharth Goyal(+4 more)

Abstract:In this document we describe a rationale for a research program aimed at building an open "assistant" in the game Minecraft, in order to make progress on the problems of natural language understanding and learning from dialogue.

Via

Access Paper or Ask Questions

Growing Action Spaces

Jun 28, 2019

Gregory Farquhar, Laura Gustafson, Zeming Lin, Shimon Whiteson, Nicolas Usunier, Gabriel Synnaeve

Abstract:In complex tasks, such as those with large combinatorial action spaces, random exploration may be too inefficient to achieve meaningful learning progress. In this work, we use a curriculum of progressively growing action spaces to accelerate learning. We assume the environment is out of our control, but that the agent may set an internal curriculum by initially restricting its action space. Our approach uses off-policy reinforcement learning to estimate optimal value functions for multiple action spaces simultaneously and efficiently transfers data, value estimates, and state representations from restricted action spaces to the full task. We show the efficacy of our approach in proof-of-concept control tasks and on challenging large-scale StarCraft micromanagement tasks with large, multi-agent action spaces.

Via

Access Paper or Ask Questions

Word-level Speech Recognition with a Dynamic Lexicon

Jun 10, 2019

Ronan Collobert, Awni Hannun, Gabriel Synnaeve

Figure 1 for Word-level Speech Recognition with a Dynamic Lexicon

Figure 2 for Word-level Speech Recognition with a Dynamic Lexicon

Figure 3 for Word-level Speech Recognition with a Dynamic Lexicon

Figure 4 for Word-level Speech Recognition with a Dynamic Lexicon

Abstract:We propose a direct-to-word sequence model with a dynamic lexicon. Our word network constructs word embeddings dynamically from the character level tokens. The word network can be integrated seamlessly with arbitrary sequence models including Connectionist Temporal Classification and encoder-decoder models with attention. Sub-word units are commonly used in speech recognition yet are generated without the use of acoustic context. We show our direct-to-word model can achieve word error rate gains over sub-word level models for speech recognition. Furthermore, we empirically validate that the word-level embeddings we learn contain significant acoustic information, making them more suitable for use in speech recognition. We also show that our direct-to-word approach retains the ability to predict words not seen at training time without any retraining.

Via

Access Paper or Ask Questions

Who Needs Words? Lexicon-Free Speech Recognition

Apr 09, 2019

Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

Figure 1 for Who Needs Words? Lexicon-Free Speech Recognition

Figure 2 for Who Needs Words? Lexicon-Free Speech Recognition

Figure 3 for Who Needs Words? Lexicon-Free Speech Recognition

Figure 4 for Who Needs Words? Lexicon-Free Speech Recognition

Abstract:Lexicon-free speech recognition naturally deals with the problem of out-of-vocabulary (OOV) words. In this paper, we show that character-based language models (LM) can perform as well as word-based LMs for speech recognition, in word error rates (WER), even without restricting the decoding to a lexicon. We study character-based LMs and show that convolutional LMs can effectively leverage large (character) contexts, which is key for good speech recognition performance downstream. We specifically show that the lexicon-free decoding performance (WER) on utterances with OOV words using character-based LMs is better than lexicon-based decoding, both with character or word-based LMs.

* 8 pages, 1 figure

Via

Access Paper or Ask Questions