Xutai Ma

A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks

Oct 21, 2020
Yun Tang, Juan Pino, Changhan Wang, Xutai Ma, Dmitriy Genzel

Attention-based sequence-to-sequence modeling provides a powerful and elegant solution for applications that need to map one sequence to a different sequence. Its success relies heavily on the availability of large amounts of training data. This presents a challenge for speech applications where labelled speech data is very expensive to obtain, such as automatic speech recognition (ASR) and speech translation (ST). In this study, we propose a general multi-task learning framework to leverage text data for ASR and ST tasks. Two auxiliary tasks, a denoising autoencoder task and a machine translation task, are proposed to be co-trained with the ASR and ST tasks respectively. We demonstrate that representing text input as phoneme sequences can reduce the difference between speech and text inputs, and enhance the knowledge transfer from text corpora to the speech-to-text tasks. Our experiments show that the proposed method achieves a relative 10~15% word error rate reduction on the English Librispeech task, and improves the speech translation quality on the MuST-C tasks by 4.2~11.1 BLEU.
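
To make the "relative word error rate reduction" figure concrete, here is a minimal sketch of how word error rate (WER) and a relative reduction are commonly computed. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `word_error_rate` and `relative_reduction` are hypothetical, and the text is tokenized by plain whitespace splitting.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline
```

As an illustrative case (the numbers are not from the paper), a baseline WER of 4.0 that drops to 3.5 corresponds to `relative_reduction(4.0, 3.5)`, a 12.5% relative reduction.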

fairseq S2T: Fast Speech-to-Text Modeling with fairseq

Oct 11, 2020
Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino

We introduce fairseq S2T, a fairseq extension for speech-to-text (S2T) modeling tasks such as end-to-end speech recognition and speech-to-text translation. It follows fairseq's careful design for scalability and extensibility. We provide end-to-end workflows covering data pre-processing, model training, and offline (as well as online) inference. We implement state-of-the-art RNN-based as well as Transformer-based models and open-source detailed training recipes. Fairseq's machine translation models and language models can be seamlessly integrated into S2T workflows for multi-task learning or transfer learning. Fairseq S2T documentation and examples are available at https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text.

* Accepted to AACL 2020 Demo

SimulEval: An Evaluation Toolkit for Simultaneous Translation

Jul 31, 2020
Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, Juan Pino

Simultaneous translation on both text and speech focuses on a real-time and low-latency scenario where the model starts translating before reading the complete source input. Evaluating simultaneous translation models is more complex than evaluating offline models because latency is another factor to consider in addition to translation quality. The research community, despite its growing focus on novel modeling approaches to simultaneous translation, currently lacks a universal evaluation procedure. Therefore, we present SimulEval, an easy-to-use and general evaluation toolkit for both simultaneous text and speech translation. A server-client scheme is introduced to create a simultaneous translation scenario, where the server sends source input and receives predictions for evaluation and the client executes customized policies. Given a policy, it automatically performs simultaneous decoding and collectively reports several popular latency metrics. We also adapt latency metrics from text simultaneous translation to the speech task. Additionally, SimulEval is equipped with a visualization interface to provide a better understanding of the simultaneous decoding process of a system. SimulEval has already been extensively used for the IWSLT 2020 shared task on simultaneous speech translation. Code will be released upon publication.
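
The abstract does not name the latency metrics it reports. As an illustration of what such a metric looks like for text input, the sketch below computes Average Lagging, a latency measure commonly used in the simultaneous translation literature, from a list of delays (the number of source tokens read before each target token is written). This is a minimal sketch, not SimulEval's implementation; the function names and the wait-k helper used to generate example delays are hypothetical.

```python
from typing import List


def wait_k_delays(k: int, src_len: int, tgt_len: int) -> List[int]:
    """Delays of a wait-k policy: read k tokens, then alternate write/read."""
    return [min(k + i, src_len) for i in range(tgt_len)]


def average_lagging(delays: List[int], src_len: int, tgt_len: int) -> float:
    """Average Lagging over target positions up to the first one
    written after the whole source has been read."""
    if not delays or src_len <= 0 or tgt_len <= 0:
        raise ValueError("delays, src_len and tgt_len must be non-empty/positive")
    gamma = tgt_len / src_len  # target-to-source length ratio
    # tau: 1-based index of the first target token emitted with full source context
    tau = len(delays)
    for i, g in enumerate(delays, start=1):
        if g >= src_len:
            tau = i
            break
    total = sum(delays[i - 1] - (i - 1) / gamma for i in range(1, tau + 1))
    return total / tau
```

With equal source and target lengths, a wait-k policy lags by exactly k tokens under this definition, and an offline system that reads the whole source first lags by the full source length.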

Self-Training for End-to-End Speech Translation

Jun 03, 2020
Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, Yun Tang

One of the main challenges for end-to-end speech translation is data scarcity. We leverage pseudo-labels generated from unlabeled audio by a cascade and an end-to-end speech translation model. This provides 8.3 and 5.7 BLEU gains over a strong semi-supervised baseline on the MuST-C English-French and English-German datasets, reaching state-of-the-art performance. The effect of the quality of the pseudo-labels is investigated. Our approach is shown to be more effective than simply pre-training the encoder on the speech recognition task. Finally, we demonstrate the effectiveness of self-training by directly generating pseudo-labels with an end-to-end model instead of a cascade model.

* Submitted to INTERSPEECH 2020

Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade

Oct 22, 2019
Juan Pino, Liezl Puzon, Jiatao Gu, Xutai Ma, Arya D. McCarthy, Deepak Gopinath

For automatic speech translation (AST), end-to-end approaches are outperformed by cascaded models that transcribe with automatic speech recognition (ASR), then translate with machine translation (MT). A major cause of the performance gap is that, while existing AST corpora are small, massive datasets exist for both the ASR and MT subsystems. In this work, we evaluate several data augmentation and pretraining approaches for AST, by comparing all on the same datasets. Simple data augmentation by translating ASR transcripts proves most effective on the English-French augmented LibriSpeech dataset, closing the performance gap from 8.2 to 1.4 BLEU, compared to a very strong cascade that could directly utilize copious ASR and MT data. The same end-to-end approach plus fine-tuning closes the gap on the English-Romanian MuST-C dataset from 6.7 to 3.7 BLEU. In addition to these results, we present practical recommendations for augmentation and pretraining approaches. Finally, we decrease the performance gap to 0.01 BLEU using a Transformer-based architecture.

* IWSLT 2019

Monotonic Multihead Attention

Sep 26, 2019
Xutai Ma, Juan Pino, James Cross, Liezl Puzon, Jiatao Gu

Simultaneous machine translation models start generating a target sequence before they have encoded or read the source sequence. Recent approaches for this task either apply a fixed policy on a state-of-the-art Transformer model, or a learnable monotonic attention on a weaker recurrent neural network-based structure. In this paper, we propose a new attention mechanism, Monotonic Multihead Attention (MMA), which extends the monotonic attention mechanism to multihead attention. We also introduce two novel and interpretable approaches for latency control that are specifically designed for multiple attention heads. We apply MMA to the simultaneous machine translation task and demonstrate better latency-quality tradeoffs compared to MILk, the previous state-of-the-art approach. We also analyze how the latency controls affect the attention span and we motivate the introduction of our model by analyzing the effect of the number of decoder layers and heads on quality and latency.
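
For readers unfamiliar with the monotonic attention mechanism that MMA extends, the sketch below computes the expected alignment for a single attention head using the standard recurrence from earlier monotonic attention work: each target step starts scanning where the previous step stopped, and at each source position either stops there with probability p or moves on. This is a minimal pure-Python illustration under stated assumptions, not the paper's multihead implementation or its latency controls; the function name `expected_alignment` and the input layout are hypothetical.

```python
from typing import List


def expected_alignment(p: List[List[float]]) -> List[List[float]]:
    """p[i][j]: probability that target step i stops at source position j.

    Returns alpha[i][j], the probability that step i attends to position j,
    using alpha[i][j] = p[i][j] * q[i][j] with
    q[i][j] = (1 - p[i][j-1]) * q[i][j-1] + alpha[i-1][j].
    """
    if not p:
        return []
    src_len = len(p[0])
    # Before the first target step, attention sits on the first source position.
    prev_alpha = [1.0] + [0.0] * (src_len - 1)
    alphas: List[List[float]] = []
    for p_i in p:
        alpha_i: List[float] = []
        q = 0.0  # probability of reaching position j without having stopped
        for j in range(src_len):
            if j == 0:
                q = prev_alpha[0]
            else:
                q = (1.0 - p_i[j - 1]) * q + prev_alpha[j]
            alpha_i.append(p_i[j] * q)
        alphas.append(alpha_i)
        prev_alpha = alpha_i
    return alphas
```

With a stopping probability of 0.5 at every position, the first target step attends to the first three source positions with probabilities 0.5, 0.25, and 0.125; the remaining mass corresponds to never stopping within the source.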

Leveraging Out-of-Task Data for End-to-End Automatic Speech Translation

Sep 14, 2019
Juan Pino, Liezl Puzon, Jiatao Gu, Xutai Ma, Arya D. McCarthy, Deepak Gopinath

For automatic speech translation (AST), end-to-end approaches are outperformed by cascaded models that transcribe with automatic speech recognition (ASR), then translate with machine translation (MT). A major cause of the performance gap is that, while existing AST corpora are small, massive datasets exist for both the ASR and MT subsystems. In this work, we evaluate several data augmentation and pretraining approaches for AST, comparing all on the same datasets. Simple data augmentation by translating ASR transcripts proves most effective on the English-French augmented LibriSpeech dataset, closing the performance gap from 8.2 to 1.4 BLEU, compared to a very strong cascade that could directly utilize copious ASR and MT data. The same end-to-end approach plus fine-tuning closes the gap on the English-Romanian MuST-C dataset from 6.7 to 3.7 BLEU. In addition to these results, we present practical recommendations for augmentation and pretraining approaches. Finally, we decrease the performance gap to 0.01 BLEU using a Transformer-based architecture.

Broad-Coverage Semantic Parsing as Transduction

Sep 05, 2019
Sheng Zhang, Xutai Ma, Kevin Duh, Benjamin Van Durme

We unify different broad-coverage semantic parsing tasks under a transduction paradigm, and propose an attention-based neural framework that incrementally builds a meaning representation via a sequence of semantic relations. By leveraging multiple attention mechanisms, the transducer can be effectively trained without relying on a pre-trained aligner. Experiments conducted on three separate broad-coverage semantic parsing tasks (AMR, SDP, and UCCA) demonstrate that our attention-based neural transducer improves the state of the art on both AMR and UCCA, and is competitive with the state of the art on SDP.

* Accepted at EMNLP 2019

AMR Parsing as Sequence-to-Graph Transduction

May 21, 2019
Sheng Zhang, Xutai Ma, Kevin Duh, Benjamin Van Durme

We propose an attention-based model that treats AMR parsing as sequence-to-graph transduction. Unlike most AMR parsers that rely on pre-trained aligners, external semantic resources, or data augmentation, our proposed parser is aligner-free, and it can be effectively trained with limited amounts of labeled AMR data. Our experimental results outperform all previously reported SMATCH scores, on both AMR 2.0 (76.3% F1 on LDC2017T10) and AMR 1.0 (70.2% F1 on LDC2014T12).

* Accepted at ACL 2019
