Fei Jia

Efficient Sequence Transduction by Jointly Predicting Tokens and Durations

Apr 13, 2023
Hainan Xu, Fei Jia, Somshubra Majumdar, He Huang, Shinji Watanabe, Boris Ginsburg

This paper introduces a novel Token-and-Duration Transducer (TDT) architecture for sequence-to-sequence tasks. TDT extends conventional RNN-Transducer architectures by jointly predicting both a token and its duration, i.e., the number of input frames covered by the emitted token. This is achieved with a joint network that has two outputs, which are independently normalized to produce distributions over tokens and durations. During inference, TDT models can skip input frames guided by the predicted duration, which makes them significantly faster than conventional Transducers, which process the encoder output frame by frame. TDT models achieve both better accuracy and significantly faster inference than conventional Transducers across different sequence transduction tasks. TDT models for Speech Recognition achieve better accuracy and up to 2.82X faster inference than RNN-Transducers. TDT models for Speech Translation achieve an absolute gain of over 1 BLEU on the MUST-C test set compared with conventional Transducers, with 2.27X faster inference. On Speech Intent Classification and Slot Filling tasks, TDT models improve intent accuracy by up to over 1% (absolute) over conventional Transducers while running up to 1.28X faster.
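
The two-head joint network and duration-guided frame skipping can be summarized in a short sketch. The following is a minimal illustration under assumed interfaces (a `predictor` object with `initial_state`/`step`/`output` methods and a duration set {0..4} are inventions for this sketch); it is not the NeMo implementation:

```python
import torch
import torch.nn.functional as F

class TDTJoint(torch.nn.Module):
    """Joint network with two independently normalized output heads."""
    def __init__(self, enc_dim, pred_dim, hidden, vocab_size, durations=(0, 1, 2, 3, 4)):
        super().__init__()
        self.durations = durations
        self.proj = torch.nn.Linear(enc_dim + pred_dim, hidden)
        self.token_head = torch.nn.Linear(hidden, vocab_size + 1)   # +1 for blank
        self.duration_head = torch.nn.Linear(hidden, len(durations))

    def forward(self, enc_t, pred_u):
        h = torch.tanh(self.proj(torch.cat([enc_t, pred_u], dim=-1)))
        # Two separate softmaxes: P(token) and P(duration), normalized independently.
        return F.log_softmax(self.token_head(h), -1), F.log_softmax(self.duration_head(h), -1)

def greedy_decode(joint, encoder_out, predictor, blank_id):
    """Advance by the predicted duration instead of one frame at a time."""
    hyp, t, T = [], 0, encoder_out.size(0)
    state = predictor.initial_state()            # assumed predictor interface
    while t < T:
        tok_lp, dur_lp = joint(encoder_out[t], predictor.output(state))
        token = tok_lp.argmax(-1).item()
        d = joint.durations[dur_lp.argmax(-1).item()]
        if token != blank_id:
            hyp.append(token)
            state = predictor.step(state, token)
        t += max(d, 1) if token == blank_id else d   # blank must make progress
    return hyp
```

Because the loop advances by `d` frames per emission rather than one, the number of joint-network evaluations drops roughly in proportion to the average predicted duration, which is where the reported speedups come from.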

Accidental Learners: Spoken Language Identification in Multilingual Self-Supervised Models

Nov 09, 2022
Travis M. Bartley, Fei Jia, Krishna C. Puvvada, Samuel Kriman, Boris Ginsburg

In this paper, we extend previous self-supervised approaches for language identification by experimenting with Conformer-based architectures in a multilingual pre-training paradigm. We find that pre-trained speech models encode language-discriminative information most strongly in their lower layers. Further, we demonstrate that the embeddings obtained from these layers are remarkably robust for classifying unseen languages and different acoustic environments without additional training. After fine-tuning a pre-trained Conformer model on the VoxLingua107 dataset, we achieve results similar to current state-of-the-art systems for language identification. Moreover, our model accomplishes this with 5x fewer parameters. We open-source the model through the NVIDIA NeMo toolkit.

* Submitted to ICASSP 2023 
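
As a rough sketch of the probing setup described above: mean-pool hidden states from a lower encoder layer and train only a linear classifier on top. The `return_all_layers` flag and the layer index are illustrative assumptions, not the NeMo API:

```python
import torch

@torch.no_grad()
def layer_embedding(model, features, layer_idx=4):
    """Mean-pool hidden states from one lower encoder layer into a
    single utterance-level embedding."""
    # Assumed flag; real backbones expose per-layer states differently.
    hidden_states = model(features, return_all_layers=True)  # list of (T, D)
    return hidden_states[layer_idx].mean(dim=0)              # (D,)

# A frozen backbone plus a linear probe on these embeddings suffices to
# separate languages, e.g. 107 classes for VoxLingua107.
probe = torch.nn.Linear(512, 107)
```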

Multi-blank Transducers for Speech Recognition

Nov 04, 2022
Hainan Xu, Fei Jia, Somshubra Majumdar, Shinji Watanabe, Boris Ginsburg

This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols that consume two or more input frames when emitted. We refer to the added symbols as big blanks, and to the method as multi-blank RNN-T. To train multi-blank RNN-Ts, we propose a novel logit under-normalization method that prioritizes the emission of big blanks. With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods bring relative inference speedups of over 90% and 139% on the English LibriSpeech and German Multilingual LibriSpeech datasets, respectively. The multi-blank RNN-T method also consistently improves ASR accuracy. We will release our implementation of the method in the NeMo (\url{https://github.com/NVIDIA/NeMo}) toolkit.

* Submitted to ICASSP 2023 
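
A minimal sketch of the inference-side idea: a greedy decoding loop where each blank symbol skips its whole duration. The symbol ids, the `blank_durations` mapping, and the predictor interface are placeholders, not the released code, and the training-side logit under-normalization (which the abstract describes as prioritizing big-blank emissions) is omitted:

```python
def greedy_decode_multiblank(joint, encoder_out, predictor, blank_durations):
    """`blank_durations` maps each blank id to the frames it consumes,
    e.g. {1024: 1, 1025: 2, 1026: 4}: a standard blank plus two big blanks."""
    hyp, t, T = [], 0, encoder_out.size(0)
    state = predictor.initial_state()
    while t < T:
        log_probs = joint(encoder_out[t], predictor.output(state))
        token = log_probs.argmax(-1).item()
        if token in blank_durations:     # any blank: skip its whole duration
            t += blank_durations[token]
        else:                            # regular token: emit, stay on frame
            hyp.append(token)
            state = predictor.step(state, token)
    return hyp
```

The more often the model chooses a big blank over a run of standard blanks, the fewer joint-network calls inference needs, which is the source of the reported speedups.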

AmberNet: A Compact End-to-End Model for Spoken Language Identification

Oct 27, 2022
Fei Jia, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg

We present AmberNet, a compact end-to-end neural network for Spoken Language Identification. AmberNet consists of 1D depthwise-separable convolutions and Squeeze-and-Excitation layers with global context, followed by statistics pooling and linear layers. AmberNet achieves performance similar to state-of-the-art (SOTA) models on the VoxLingua107 dataset while being 10x smaller. AmberNet can be adapted to unseen languages and new acoustic conditions with simple fine-tuning, attaining SOTA accuracy of 75.8% on the FLEURS benchmark. We show the model scales easily to achieve a better trade-off between accuracy and speed. We further inspect the model's sensitivity to input length and show that AmberNet performs well even on short utterances.

* Submitted to ICASSP 2023 
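
The described building blocks map naturally onto a few lines of PyTorch. This is an illustrative reconstruction from the abstract (channel counts, kernel sizes, reduction ratio, and block counts are invented), not the released AmberNet:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """1D depthwise-separable conv with a global-context SE gate."""
    def __init__(self, ch, kernel=5):
        super().__init__()
        self.depthwise = nn.Conv1d(ch, ch, kernel, padding=kernel // 2, groups=ch)
        self.pointwise = nn.Conv1d(ch, ch, 1)
        self.se = nn.Sequential(  # squeeze-and-excitation over global context
            nn.Linear(ch, ch // 8), nn.ReLU(), nn.Linear(ch // 8, ch), nn.Sigmoid())

    def forward(self, x):                 # x: (B, C, T)
        y = torch.relu(self.pointwise(self.depthwise(x)))
        gate = self.se(y.mean(dim=2))     # global average over time -> channel gates
        return y * gate.unsqueeze(-1)

class StatsPoolClassifier(nn.Module):
    """Conv blocks, then mean+std statistics pooling, then a linear classifier."""
    def __init__(self, ch, n_langs):
        super().__init__()
        self.blocks = nn.Sequential(*[DepthwiseSeparableConv1d(ch) for _ in range(3)])
        self.fc = nn.Linear(2 * ch, n_langs)   # 2x: mean and std per channel

    def forward(self, x):                 # x: (B, C, T) acoustic features
        h = self.blocks(x)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.fc(stats)
```

Depthwise-separable convolutions are what keep the parameter count an order of magnitude below full convolutions at similar receptive field, consistent with the 10x size claim.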

Lessons from the AdKDD'21 Privacy-Preserving ML Challenge

Jan 31, 2022
Eustache Diemert, Romain Fabre, Alexandre Gilotte, Fei Jia, Basile Leparmentier, Jérémie Mary, Zhonghua Qu, Ugo Tanielian, Hui Yang

Designing data-sharing mechanisms that provide both performance and strong privacy guarantees is a hot topic for the online advertising industry. Notably, a prominent proposal discussed under the Improving Web Advertising Business Group at the W3C only allows sharing advertising signals through aggregated, differentially private reports of past displays. To study this proposal extensively, an open Privacy-Preserving Machine Learning Challenge took place at AdKDD'21, a premier workshop on advertising science, with data provided by the advertising company Criteo. In this paper, we describe the challenge tasks and the structure of the available datasets, report the challenge results, and enable their full reproducibility. A key finding is that learning models on large, aggregated data in the presence of a small set of unaggregated data points can be surprisingly efficient and cheap. We also run additional experiments to measure the sensitivity of the winning methods to parameters such as the privacy budget or the quantity of available privileged side information. We conclude that the industry needs either alternate designs for private data sharing or a breakthrough in learning with aggregated data only to keep ad relevance at a reasonable level.
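
To make the key finding concrete, one possible training objective mixes a supervised loss on the few unaggregated examples with a consistency loss against the aggregated, noised counts. This is a hypothetical sketch; the bucket structure, names, and weighting are assumptions, not a method from the challenge report:

```python
import torch

def combined_loss(model, x_small, y_small, x_groups, agg_counts, alpha=1.0):
    """x_small/y_small: the small unaggregated dataset (features, 0/1 labels);
    x_groups: list of feature tensors, one per aggregation bucket;
    agg_counts: the (noisy) positive-label count reported for each bucket."""
    # Supervised loss on the few granular examples.
    ce = torch.nn.functional.binary_cross_entropy_with_logits(
        model(x_small).squeeze(-1), y_small)
    # Match the expected positives per bucket to the reported aggregate.
    agg = torch.stack([torch.sigmoid(model(xg)).sum() for xg in x_groups])
    mse = torch.mean((agg - agg_counts) ** 2)
    return ce + alpha * mse
```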
