Christian Puhrsch

TorchAudio: Building Blocks for Audio and Speech Processing

Oct 28, 2021
Yao-Yuan Yang, Moto Hira, Zhaoheng Ni, Anjali Chourdia, Artyom Astafurov, Caroline Chen, Ching-Feng Yeh, Christian Puhrsch, David Pollack, Dmitriy Genzel, Donny Greenberg, Edward Z. Yang, Jason Lian, Jay Mahadeokar, Jeff Hwang, Ji Chen, Peter Goldsborough, Prabhat Roy, Sean Narenthiran, Shinji Watanabe, Soumith Chintala, Vincent Quenneville-Bélair, Yangyang Shi

This document describes version 0.10 of torchaudio: building blocks for machine learning applications in the audio and speech processing domain. The objective of torchaudio is to accelerate the development and deployment of machine learning applications for researchers and engineers by providing off-the-shelf building blocks. The building blocks are designed to be GPU-compatible, automatically differentiable, and production-ready. torchaudio can be easily installed from the Python Package Index (PyPI), and the source code is publicly available under a BSD-2-Clause License (as of September 2021) at https://github.com/pytorch/audio. In this document, we provide an overview of the design principles, functionalities, and benchmarks of torchaudio. We benchmark our implementations of several audio and speech operations and models and verify that they are valid and perform similarly to other publicly available implementations.

* Submitted to ICASSP 2022 
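
The snippet below is a minimal sketch of the kind of off-the-shelf, GPU-compatible, differentiable building blocks the abstract describes: it loads a waveform, applies a mel-spectrogram transform, and backpropagates through it. The audio file path is a placeholder and the parameter choices are illustrative, not taken from the paper.

```python
# Minimal sketch of torchaudio's building blocks: load a waveform, apply a
# GPU-compatible, differentiable transform, and backpropagate through it.
# "speech.wav" is a placeholder path, not a file shipped with the library.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("speech.wav")    # (channels, time)

mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)
if torch.cuda.is_available():                            # transforms are nn.Modules,
    waveform, mel = waveform.cuda(), mel.cuda()          # so they move to GPU like a model

waveform.requires_grad_(True)                            # autograd flows through the transform
spec = mel(waveform)
spec.sum().backward()
print(spec.shape, waveform.grad.shape)
```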

Advances in Pre-Training Distributed Word Representations

Dec 26, 2017
Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, Armand Joulin

Many Natural Language Processing applications nowadays rely on pre-trained word representations estimated from large text corpora such as news collections, Wikipedia and Web Crawl. In this paper, we show how to train high-quality word vector representations by using a combination of known tricks that are, however, rarely used together. The main result of our work is a new set of publicly available pre-trained models that outperform the current state of the art by a large margin on a number of tasks.
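
As a hedged illustration of how the released pre-trained models can be consumed, the sketch below loads one of the published .vec files with gensim and queries nearest neighbours; the filename is an assumption, so substitute whichever vector set you downloaded.

```python
# Hedged sketch: load a publicly released pre-trained vector set with gensim
# and query nearest neighbours. The filename is an assumption.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("crawl-300d-2M.vec", binary=False)
print(vectors.most_similar("speech", topn=5))
```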

Wav2Letter: an End-to-End ConvNet-based Speech Recognition System

Sep 13, 2016
Ronan Collobert, Christian Puhrsch, Gabriel Synnaeve

This paper presents a simple end-to-end model for speech recognition, combining a convolutional network-based acoustic model and graph decoding. It is trained to output letters directly from transcribed speech, without the need for forced alignment of phonemes. We introduce an automatic segmentation criterion for training from sequence annotation without alignment that is on par with CTC while being simpler. We show competitive word error rates on the Librispeech corpus with MFCC features, and promising results from the raw waveform.

* 8 pages, 4 figures (7 plots/schemas), 2 tables (4 tabulars) 
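
The following is an illustrative sketch, not the paper's exact architecture: a small 1D-convolutional acoustic model over MFCC frames that emits per-frame letter scores, trained here with PyTorch's CTC loss as a stand-in for the paper's ASG criterion (which core PyTorch does not provide). Layer sizes, the alphabet, and the dummy data are assumptions.

```python
# Illustrative sketch (not the paper's exact architecture): a 1D-convolutional
# acoustic model over MFCC frames that emits per-frame letter scores, trained
# here with CTC as a stand-in for the paper's ASG criterion.
import torch
import torch.nn as nn

n_mfcc, n_letters = 40, 29           # 26 letters + apostrophe + space + blank (assumed alphabet)

model = nn.Sequential(
    nn.Conv1d(n_mfcc, 256, kernel_size=11, stride=2, padding=5), nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=11, padding=5), nn.ReLU(),
    nn.Conv1d(256, n_letters, kernel_size=1),      # per-frame letter scores
)

features = torch.randn(8, n_mfcc, 400)             # (batch, features, frames)
log_probs = model(features).log_softmax(dim=1)     # (batch, letters, frames)
log_probs = log_probs.permute(2, 0, 1)             # CTC expects (frames, batch, letters)

targets = torch.randint(1, n_letters, (8, 50))     # dummy letter transcriptions
input_lengths = torch.full((8,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((8,), 50, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```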

Very Deep Multilingual Convolutional Neural Networks for LVCSR

Jan 23, 2016
Tom Sercu, Christian Puhrsch, Brian Kingsbury, Yann LeCun

Convolutional neural networks (CNNs) are a standard component of many current state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems. However, CNNs in LVCSR have not kept pace with recent advances in other domains, where deeper neural networks provide superior performance. In this paper we propose a number of architectural advances in CNNs for LVCSR. First, we introduce a very deep convolutional network architecture with up to 14 weight layers, with multiple convolutional layers before each pooling layer and small 3x3 kernels, inspired by the VGG ImageNet 2014 architecture. Then, we introduce multilingual CNNs with multiple untied layers. Finally, we introduce multi-scale input features aimed at exploiting more context at negligible computational cost. We evaluate the improvements first on a Babel task for low-resource speech recognition, obtaining an absolute 5.77% WER improvement over the baseline PLP DNN by training our CNN on the combined data of six different languages. We then evaluate the very deep CNNs on the Hub5'00 benchmark (using the 262 hours of SWB-1 training data), achieving a word error rate of 11.8% after cross-entropy training, a 1.4% WER improvement (10.6% relative) over the best published CNN result to date.

* Accepted for publication at ICASSP 2016 
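
The sketch below illustrates the VGG-style pattern described in the abstract: several 3x3 convolutions before each pooling step, applied to a time-frequency representation of the utterance. Channel widths, layer counts, and the input shape are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the VGG-style pattern: several 3x3 convolutions before each pooling
# step, applied to a time-frequency "image" of the utterance. Layer counts and
# channel widths are illustrative, not the paper's exact recipe.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1), nn.ReLU()]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

model = nn.Sequential(
    conv_block(3, 64, 2),       # 3 input channels, e.g. multi-scale feature maps
    conv_block(64, 128, 2),
    conv_block(128, 256, 3),
    conv_block(256, 512, 3),    # 10 convolutional weight layers in total
)

x = torch.randn(4, 3, 40, 64)   # (batch, channels, mel bands, frames) -- assumed input shape
print(model(x).shape)
```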

Learning Invariants using Decision Trees

Jan 20, 2015
Siddharth Krishna, Christian Puhrsch, Thomas Wies

The problem of inferring an inductive invariant for verifying program safety can be formulated in terms of binary classification. This is a standard problem in machine learning: given a sample of good and bad points, one is asked to find a classifier that generalizes from the sample and separates the two sets. Here, the good points are the reachable states of the program, and the bad points are those that reach a safety property violation. Thus, a learned classifier is a candidate invariant. In this paper, we propose a new algorithm that uses decision trees to learn candidate invariants in the form of arbitrary Boolean combinations of numerical inequalities. We have used our algorithm to verify C programs taken from the literature. The algorithm is able to infer safe invariants for a range of challenging benchmarks and compares favorably to other ML-based invariant inference techniques. In particular, it scales well to large sample sets.

* 15 pages, 2 figures 
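
To make the classification view concrete, here is a toy sketch with a made-up program and features, not the paper's algorithm or benchmarks: reachable and error-reaching states become labelled points, and a decision tree over numerical features (including a derived feature x - y) yields a candidate invariant as a Boolean combination of inequalities.

```python
# Toy sketch of the classification view: reachable ("good") and error-reaching
# ("bad") program states become labelled points, and a decision tree over
# numerical features yields a candidate invariant as a Boolean combination of
# inequalities. The tiny dataset and features are made up.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# States (x, y) of a hypothetical loop; label 1 = reachable, 0 = leads to a violation.
good = np.array([[0, 5], [3, 5], [5, 5], [7, 9], [9, 12]])
bad = np.array([[6, 5], [8, 4], [10, 9], [13, 12]])

X = np.vstack([good, bad]).astype(float)
X = np.hstack([X, (X[:, 0] - X[:, 1]).reshape(-1, 1)])   # derived feature x - y
y = np.array([1] * len(good) + [0] * len(bad))

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["x", "y", "x - y"]))
# Each root-to-leaf path labelled "class: 1" reads as a conjunction of
# inequalities; their disjunction is the candidate invariant (here essentially x - y <= 0).
```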

Depth Map Prediction from a Single Image using a Multi-Scale Deep Network

Jun 09, 2014
David Eigen, Christian Puhrsch, Rob Fergus

Predicting depth is an essential component in understanding the 3D geometry of a scene. While for stereo images local correspondence suffices for estimation, finding depth relations from a single image is less straightforward, requiring integration of both global and local information from various cues. Moreover, the task is inherently ambiguous, with a large source of uncertainty coming from the overall scale. In this paper, we present a new method that addresses this task by employing two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally. We also apply a scale-invariant error to help measure depth relations rather than scale. By leveraging the raw datasets as large sources of training data, our method achieves state-of-the-art results on both NYU Depth and KITTI, and matches detailed depth boundaries without the need for superpixelation.
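
A minimal sketch of a scale-invariant log-depth error in the spirit of the one the abstract mentions is shown below: it penalizes per-pixel differences in log depth while discounting the mean log difference, i.e. a global scale factor. The exact weighting used for training is a detail of the paper, so the constant here should be read as an assumption.

```python
# Sketch of a scale-invariant log-depth error: per-pixel log differences with
# the mean log difference (a global scale factor) discounted. lam = 1.0 is
# fully scale-invariant; the weighting used for training in the paper may
# differ, so treat the value here as an assumption.
import torch

def scale_invariant_error(pred_depth, gt_depth, lam=1.0, eps=1e-6):
    d = torch.log(pred_depth + eps) - torch.log(gt_depth + eps)  # per-pixel log difference
    n = d.numel()
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / (n ** 2)

pred = torch.rand(1, 1, 240, 320) * 10 + 0.5        # fake depth maps in metres
gt = pred * 2.0                                     # same scene at twice the scale
print(scale_invariant_error(pred, gt).item())       # ~0 with lam=1: scale is ignored
```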
