Christian Puhrsch

TorchAudio: Building Blocks for Audio and Speech Processing

Oct 28, 2021

Yao-Yuan Yang, Moto Hira, Zhaoheng Ni, Anjali Chourdia, Artyom Astafurov, Caroline Chen, Ching-Feng Yeh, Christian Puhrsch, David Pollack, Dmitriy Genzel(+13 more)

Abstract: This document describes version 0.10 of torchaudio: building blocks for machine learning applications in the audio and speech processing domain. The objective of torchaudio is to accelerate the development and deployment of machine learning applications for researchers and engineers by providing off-the-shelf building blocks. The building blocks are designed to be GPU-compatible, automatically differentiable, and production-ready. torchaudio can be easily installed from the Python Package Index repository, and the source code is publicly available under a BSD-2-Clause License (as of September 2021) at https://github.com/pytorch/audio. In this document, we provide an overview of the design principles, functionalities, and benchmarks of torchaudio. We also benchmark our implementation of several audio and speech operations and models. We verify through the benchmarks that our implementations of various operations and models are valid and perform similarly to other publicly available implementations.

* Submitted to ICASSP 2022

Advances in Pre-Training Distributed Word Representations

Dec 26, 2017

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, Armand Joulin

Abstract: Many Natural Language Processing applications nowadays rely on pre-trained word representations estimated from large text corpora such as news collections, Wikipedia and Web Crawl. In this paper, we show how to train high-quality word vector representations by using a combination of known tricks that are, however, rarely used together. The main result of our work is a new set of publicly available pre-trained models that outperform the current state of the art by a large margin on a number of tasks.

Wav2Letter: an End-to-End ConvNet-based Speech Recognition System

Sep 13, 2016

Ronan Collobert, Christian Puhrsch, Gabriel Synnaeve

Abstract: This paper presents a simple end-to-end model for speech recognition, combining a convolutional network based acoustic model and graph decoding. It is trained on transcribed speech to output letters, without the need for forced alignment of phonemes. We introduce an automatic segmentation criterion for training from sequence annotation without alignment that is on par with CTC while being simpler. We show competitive results in word error rate on the Librispeech corpus with MFCC features, and promising results from raw waveform.

* 8 pages, 4 figures (7 plots/schemas), 2 tables (4 tabulars)

Very Deep Multilingual Convolutional Neural Networks for LVCSR

Jan 23, 2016

Tom Sercu, Christian Puhrsch, Brian Kingsbury, Yann LeCun

Abstract: Convolutional neural networks (CNNs) are a standard component of many current state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems. However, CNNs in LVCSR have not kept pace with recent advances in other domains where deeper neural networks provide superior performance. In this paper we propose a number of architectural advances in CNNs for LVCSR. First, we introduce a very deep convolutional network architecture with up to 14 weight layers. There are multiple convolutional layers before each pooling layer, with small 3x3 kernels, inspired by the VGG ImageNet 2014 architecture. Then, we introduce multilingual CNNs with multiple untied layers. Finally, we introduce multi-scale input features aimed at exploiting more context at negligible computational cost. We evaluate the improvements first on a Babel task for low resource speech recognition, obtaining an absolute 5.77% WER improvement over the baseline PLP DNN by training our CNN on the combined data of six different languages. We then evaluate the very deep CNNs on the Hub5'00 benchmark (using the 262 hours of SWB-1 training data) achieving a word error rate of 11.8% after cross-entropy training, a 1.4% WER improvement (10.6% relative) over the best published CNN result so far.

* Accepted for publication at ICASSP 2016
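
The absolute and relative WER figures quoted in the abstract above can be checked against each other with a few lines of arithmetic. This is only an illustrative sketch: the function name `relative_improvement` is made up here, and the baseline of 13.2% is inferred by adding the stated 1.4% absolute improvement to the stated 11.8% result rather than taken from the paper.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction: absolute gain divided by the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


# Values from the abstract: 11.8% WER after a 1.4% absolute improvement.
new_wer = 11.8
absolute_gain = 1.4
baseline_wer = new_wer + absolute_gain  # inferred baseline, about 13.2%

relative = relative_improvement(baseline_wer, new_wer)
print(f"{100 * relative:.1f}% relative")  # prints "10.6% relative"
```

The result agrees with the 10.6% relative improvement stated in the abstract.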

Learning Invariants using Decision Trees

Jan 20, 2015

Siddharth Krishna, Christian Puhrsch, Thomas Wies

Abstract: The problem of inferring an inductive invariant for verifying program safety can be formulated in terms of binary classification. This is a standard problem in machine learning: given a sample of good and bad points, one is asked to find a classifier that generalizes from the sample and separates the two sets. Here, the good points are the reachable states of the program, and the bad points are those that reach a safety property violation. Thus, a learned classifier is a candidate invariant. In this paper, we propose a new algorithm that uses decision trees to learn candidate invariants in the form of arbitrary Boolean combinations of numerical inequalities. We have used our algorithm to verify C programs taken from the literature. The algorithm is able to infer safe invariants for a range of challenging benchmarks and compares favorably to other ML-based invariant inference techniques. In particular, it scales well to large sample sets.

* 15 pages, 2 figures
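
The classification view described in the abstract above can be illustrated with a toy decision-tree learner. This is a minimal sketch under stated assumptions, not the paper's algorithm: it only considers axis-aligned thresholds of the form `x_i <= c`, it picks splits by misclassification count instead of whatever criterion the authors use, and the sample points and all function names (`learn_tree`, `holds`, `to_formula`) are hypothetical.

```python
def learn_tree(good, bad):
    """Build a tree that separates good from bad points (lists of tuples)."""
    if not bad:
        return True   # leaf: every remaining sample is a good state
    if not good:
        return False  # leaf: every remaining sample is a bad state
    points = good + bad
    best = None
    for i in range(len(points[0])):
        values = sorted({p[i] for p in points})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            c = (lo + hi) / 2
            gl = [p for p in good if p[i] <= c]
            bl = [p for p in bad if p[i] <= c]
            gr = [p for p in good if p[i] > c]
            br = [p for p in bad if p[i] > c]
            # Samples left misclassified if each side predicted its majority.
            cost = min(len(gl), len(bl)) + min(len(gr), len(br))
            if best is None or cost < best[0]:
                best = (cost, i, c, gl, bl, gr, br)
    if best is None:
        raise ValueError("good and bad samples coincide; not separable")
    _, i, c, gl, bl, gr, br = best
    return (i, c, learn_tree(gl, bl), learn_tree(gr, br))


def holds(tree, point):
    """Evaluate the candidate invariant encoded by the tree at a point."""
    while not isinstance(tree, bool):
        i, c, left, right = tree
        tree = left if point[i] <= c else right
    return tree


def to_formula(tree, names):
    """Render the tree as a Boolean combination of inequalities."""
    if isinstance(tree, bool):
        return "true" if tree else "false"
    i, c, left, right = tree
    return (f"(({names[i]} <= {c} and {to_formula(left, names)}) or "
            f"({names[i]} > {c} and {to_formula(right, names)}))")


# Hypothetical samples over two program variables (x, y).
good = [(0, 0), (1, 1), (2, 2), (3, 3)]   # reachable states
bad = [(5, 0), (0, 5), (6, 6)]            # states leading to a violation
tree = learn_tree(good, bad)
print(to_formula(tree, ["x", "y"]))
```

The printed formula is only a candidate: as the abstract notes, a learned classifier still has to be checked for inductiveness and safety by a verifier, which this sketch does not attempt.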

Depth Map Prediction from a Single Image using a Multi-Scale Deep Network

Jun 09, 2014

David Eigen, Christian Puhrsch, Rob Fergus

Abstract: Predicting depth is an essential component in understanding the 3D geometry of a scene. While for stereo images local correspondence suffices for estimation, finding depth relations from a single image is less straightforward, requiring integration of both global and local information from various cues. Moreover, the task is inherently ambiguous, with a large source of uncertainty coming from the overall scale. In this paper, we present a new method that addresses this task by employing two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally. We also apply a scale-invariant error to help measure depth relations rather than scale. By leveraging the raw datasets as large sources of training data, our method achieves state-of-the-art results on both NYU Depth and KITTI, and matches detailed depth boundaries without the need for superpixelation.
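
The abstract above does not spell out the scale-invariant error, so the following is a sketch that assumes the log-difference form: with d_i = log(pred_i) - log(target_i), the error is mean(d_i^2) - mean(d_i)^2. The function name and sample depths are illustrative. The point of the example is that multiplying every predicted depth by a constant shifts all d_i by the same amount and therefore leaves the error unchanged.

```python
import math


def scale_invariant_log_error(pred, target):
    """Variance of per-pixel log-depth differences (assumed formulation)."""
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and equal length")
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    mean_of_squares = sum(x * x for x in d) / n
    square_of_mean = (sum(d) / n) ** 2
    return mean_of_squares - square_of_mean


# Hypothetical depths in metres for three pixels.
target = [1.2, 2.0, 5.0]
pred = [1.0, 2.5, 4.0]
scaled_pred = [3.0 * p for p in pred]  # same relative structure, wrong scale

error = scale_invariant_log_error(pred, target)
error_scaled = scale_invariant_log_error(scaled_pred, target)
```

Here `error` and `error_scaled` agree up to floating-point rounding, while a prediction that is an exact multiple of the target gives an error of approximately zero.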
