Rong Ye

Recent Advances in Direct Speech-to-text Translation

Jun 20, 2023
Chen Xu, Rong Ye, Qianqian Dong, Chengqi Zhao, Tom Ko, Mingxuan Wang, Tong Xiao, Jingbo Zhu


Recently, speech-to-text translation has attracted increasing attention, and many studies have emerged rapidly. In this paper, we present a comprehensive survey on direct speech translation aiming to summarize the current state-of-the-art techniques. First, we categorize the existing research work into three directions based on the main challenges -- modeling burden, data scarcity, and application issues. To tackle the problem of modeling burden, two main structures have been proposed: the encoder-decoder framework (Transformer and its variants) and multitask frameworks. For the challenge of data scarcity, recent work resorts to many sophisticated techniques, such as data augmentation, pre-training, knowledge distillation, and multilingual modeling. We analyze and summarize the application issues, which include real-time translation, segmentation, named entities, gender bias, and code-switching. Finally, we discuss some promising directions for future work.

* An expanded version of the paper accepted to the IJCAI 2023 survey track

Improving speech translation by fusing speech and text

May 23, 2023
Wenbiao Yin, Zhicheng Liu, Chengqi Zhao, Tao Wang, Jian Tong, Rong Ye


In speech translation, leveraging multimodal data to improve model performance and to address the limitations of individual modalities has proven highly effective. In this paper, we harness the complementary strengths of speech and text, two disparate modalities. We observe three levels of modality gap between them: input representation, semantics, and hidden states. To bridge these gaps, we propose Fuse-Speech-Text (FST), a cross-modal model that supports three distinct input modalities for translation: speech, text, and fused speech-text. We leverage multiple techniques for cross-modal alignment and conduct a comprehensive analysis of their impact on speech translation, machine translation, and fused speech-text translation. We evaluate FST on the MuST-C, GigaST, and newstest benchmarks. Experiments show that the proposed FST achieves an average of 34.0 BLEU on MuST-C En-De/Es/Fr (+1.1 BLEU over the previous SOTA). Further experiments demonstrate that, unlike prior work, FST does not degrade on the MT task; instead, it yields an average improvement of 3.2 BLEU over the pre-trained MT model.
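
The abstract does not spell out how the fused speech-text input is formed. As a rough, hypothetical sketch of the "three input modalities" idea, the snippet below shows a shared encoder that can consume a speech sequence, a text sequence, or a combined sequence; fusion-by-concatenation is an assumption for illustration, not necessarily how FST combines the modalities.

```python
# Sketch only: a shared encoder accepting speech, text, or a fused sequence.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, speech: torch.Tensor = None, text: torch.Tensor = None):
        if speech is not None and text is not None:
            x = torch.cat([speech, text], dim=1)   # fused speech-text input (assumed)
        else:
            x = speech if speech is not None else text
        return self.encoder(x)

enc = SharedEncoder()
speech = torch.randn(1, 50, 256)   # toy pre-extracted speech features
text = torch.randn(1, 12, 256)     # toy embedded source text
print(enc(speech=speech).shape, enc(text=text).shape, enc(speech, text).shape)
```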


DUB: Discrete Unit Back-translation for Speech Translation

May 19, 2023
Dong Zhang, Rong Ye, Tom Ko, Mingxuan Wang, Yaqian Zhou


How can speech-to-text translation (ST) perform as well as machine translation (MT)? The key is to bridge the modality gap between speech and text so that useful MT techniques can be applied to ST. Recently, representing speech with unsupervised discrete units has offered a new way to ease the modality problem. This motivates us to propose Discrete Unit Back-translation (DUB) to answer two questions: (1) Is it better to represent speech with discrete units than with continuous features in direct ST? (2) How much benefit can useful MT techniques bring to ST? With DUB, the back-translation technique can be successfully applied to direct ST, obtaining an average boost of 5.5 BLEU on MuST-C En-De/Fr/Es. In the low-resource language scenario, our method achieves performance comparable to existing methods that rely on large-scale external data. Code and models are available at https://github.com/0nutation/DUB.
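
As a minimal sketch of the back-translation idea on discrete units (the helper functions are toy stand-ins, not the DUB implementation): speech is quantized to unit IDs, a reverse text-to-unit model turns monolingual target text into pseudo unit sequences, and the pseudo pairs are mixed into the supervised unit-to-text training data.

```python
# Minimal sketch of back-translation with discrete speech units (hypothetical helpers).
from typing import List, Tuple

def quantize_speech(waveform: List[float]) -> List[int]:
    """Stand-in for an unsupervised quantizer (e.g., SSL features + k-means)."""
    return [int(abs(x) * 100) % 500 for x in waveform]  # toy 500-unit codebook

def back_translate(target_text: str) -> List[int]:
    """Stand-in for the reverse text-to-unit model."""
    return [hash(tok) % 500 for tok in target_text.split()]

def build_training_data(
    labeled: List[Tuple[List[float], str]],   # (speech, translation) pairs
    monolingual_text: List[str],              # target-language-only text
) -> List[Tuple[List[int], str]]:
    data = [(quantize_speech(wav), text) for wav, text in labeled]
    # Augment with pseudo (units, text) pairs produced by back-translation.
    data += [(back_translate(text), text) for text in monolingual_text]
    return data

if __name__ == "__main__":
    labeled = [([0.1, -0.2, 0.3], "Hallo Welt")]
    mono = ["Guten Morgen", "Wie geht es dir"]
    print(build_training_data(labeled, mono))
```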

* Accepted to Findings of ACL 2023 

WACO: Word-Aligned Contrastive Learning for Speech Translation

Dec 19, 2022
Siqi Ouyang, Rong Ye, Lei Li


End-to-end Speech Translation (E2E ST) aims to translate source speech into the target language without generating an intermediate transcript. However, existing approaches for E2E ST degrade considerably when only limited ST data are available. We observe that an ST model's performance strongly correlates with the similarity between its speech and transcript embeddings. In this paper, we propose Word-Aligned COntrastive learning (WACO), a novel method for few-shot speech-to-text translation. Our key idea is to bridge word-level representations of the two modalities via contrastive learning. We evaluate WACO and other methods on the MuST-C dataset, a widely used ST benchmark. Our experiments demonstrate that WACO outperforms the best baseline methods by 0.7-8.5 BLEU points with only 1 hour of parallel data. Code is available at https://anonymous.4open.science/r/WACO.
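
A rough PyTorch sketch of a word-level contrastive (InfoNCE-style) loss in the spirit described above; the span boundaries, mean pooling, and temperature are assumptions for illustration, not the authors' exact implementation.

```python
# Sketch: word-aligned contrastive loss between speech frames and text tokens.
import torch
import torch.nn.functional as F

def pool_spans(feats: torch.Tensor, spans: list[tuple[int, int]]) -> torch.Tensor:
    """Average feature vectors inside each (start, end) span -> (num_words, dim)."""
    return torch.stack([feats[s:e].mean(dim=0) for s, e in spans])

def word_contrastive_loss(speech_feat, text_feat, speech_spans, text_spans, tau=0.1):
    # Pool frame-level speech features and token-level text features to word level.
    s = F.normalize(pool_spans(speech_feat, speech_spans), dim=-1)  # (W, D)
    t = F.normalize(pool_spans(text_feat, text_spans), dim=-1)      # (W, D)
    logits = s @ t.T / tau                                          # (W, W)
    labels = torch.arange(s.size(0))
    # Matched word pairs are positives; all other words act as negatives.
    return F.cross_entropy(logits, labels)

# Toy usage: 20 speech frames / 6 text tokens grouped into 3 words each.
speech = torch.randn(20, 256)
text = torch.randn(6, 256)
loss = word_contrastive_loss(speech, text,
                             [(0, 7), (7, 13), (13, 20)],
                             [(0, 2), (2, 4), (4, 6)])
print(loss.item())
```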


On the Impact of Noises in Crowd-Sourced Data for Speech Translation

Jul 01, 2022
Siqi Ouyang, Rong Ye, Lei Li


Training speech translation (ST) models requires large, high-quality datasets. MuST-C is one of the most widely used ST benchmark datasets. It contains around 400 hours of speech-transcript-translation data for each of eight translation directions. The dataset passes several quality-control filters during creation. However, we find that MuST-C still suffers from three major quality issues: audio-text misalignment, inaccurate translation, and unnecessary speaker names. What are the impacts of these data quality issues on model development and evaluation? In this paper, we propose an automatic method to fix or filter out the above quality issues, using English-German (En-De) translation as an example. Our experiments show that ST models perform better on clean test sets, and that the ranking of the proposed models remains consistent across different test sets. Moreover, simply removing misaligned data points from the training set does not lead to a better ST model.
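
For intuition, here are hypothetical examples of the kind of filters such cleaning can involve (not the paper's exact rules): stripping leading speaker-name prefixes from transcripts, and flagging audio-text misalignment when the speaking rate implied by the pair is implausible.

```python
# Hypothetical cleaning heuristics for (audio, transcript) pairs.
import re

SPEAKER_PREFIX = re.compile(r"^[A-Z][\w .'-]{0,30}:\s+")  # e.g. "Chris Anderson: ..."

def strip_speaker_name(text: str) -> str:
    return SPEAKER_PREFIX.sub("", text)

def is_roughly_aligned(duration_sec: float, transcript: str,
                       min_cps: float = 4.0, max_cps: float = 30.0) -> bool:
    """Flag misalignment via characters-per-second bounds (assumed thresholds)."""
    cps = len(transcript) / max(duration_sec, 1e-6)
    return min_cps <= cps <= max_cps

print(strip_speaker_name("Chris Anderson: Welcome to the stage."))
print(is_roughly_aligned(2.5, "Welcome to the stage."))   # True
print(is_roughly_aligned(60.0, "Hi."))                    # False: likely misaligned
```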

* Accepted to IWSLT 2022 as a scientific paper 

Cross-modal Contrastive Learning for Speech Translation

May 05, 2022
Rong Ye, Mingxuan Wang, Lei Li


How can we learn unified representations for spoken utterances and their written text? Learning similar representations for semantically similar speech and text is important for speech translation. To this end, we propose ConST, a cross-modal contrastive learning method for end-to-end speech-to-text translation. We evaluate ConST and a variety of previous baselines on the popular MuST-C benchmark. Experiments show that the proposed ConST consistently outperforms the previous methods and achieves an average BLEU of 29.4. Further analysis verifies that ConST indeed closes the representation gap between modalities -- its learned representations improve the accuracy of cross-modal speech-text retrieval from 4% to 88%. Code and models are available at https://github.com/ReneeYe/ConST.
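
A rough sketch of how the reported cross-modal retrieval accuracy can be measured: for each speech embedding, retrieve the nearest text embedding by cosine similarity and check whether it is the paired transcript. The pooling and toy data below are assumptions for illustration.

```python
# Sketch: speech-to-text retrieval accuracy from paired sentence embeddings.
import torch
import torch.nn.functional as F

def retrieval_accuracy(speech_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    """speech_emb, text_emb: (N, D) paired sentence-level embeddings."""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = s @ t.T                       # (N, N) cosine similarities
    pred = sim.argmax(dim=-1)           # nearest text for each utterance
    gold = torch.arange(s.size(0))
    return (pred == gold).float().mean().item()

# Toy check: well-aligned embeddings give high retrieval accuracy.
text = torch.randn(100, 512)
speech = text + 0.05 * torch.randn_like(text)   # nearly aligned modalities
print(retrieval_accuracy(speech, text))          # close to 1.0
```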

* NAACL 2022 main conference (Long Paper) 

GigaST: A 10,000-hour Pseudo Speech Translation Corpus

Apr 08, 2022
Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, Jun Cao


This paper introduces GigaST, a large-scale pseudo speech translation (ST) corpus. We create the corpus by translating the text in GigaSpeech, an English ASR corpus, into German and Chinese. The training set is translated by a strong machine translation system, and the test set is translated by humans. ST models trained with the addition of our corpus obtain new state-of-the-art results on the MuST-C English-German benchmark test set. We provide a detailed description of the translation process and verify its quality. We make the translated text data public in the hope of facilitating research in speech translation. We also release the training scripts on NeurST to make it easy to replicate our systems. The GigaST dataset is available at https://st-benchmark.github.io/resources/GigaST.
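
A minimal sketch of how such a pseudo ST corpus can be assembled: pair each (audio, transcript) example from an ASR corpus with a machine translation of its transcript. The `translate` callable is a placeholder, not the MT system used in the paper.

```python
# Sketch: build pseudo (audio, source, target) triples from an ASR corpus + MT.
from typing import Callable, Dict, List

def build_pseudo_st(asr_corpus: List[Dict], translate: Callable[[str], str]) -> List[Dict]:
    return [
        {"audio": ex["audio"], "source": ex["text"], "target": translate(ex["text"])}
        for ex in asr_corpus
    ]

# Toy usage with a dummy "MT system".
asr = [{"audio": "utt0001.wav", "text": "good morning everyone"}]
print(build_pseudo_st(asr, translate=lambda s: f"<de> {s}"))
```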

* Submitted to Interspeech 2022. GigaST dataset is available at https://st-benchmark.github.io/resources/GigaST 

STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

Mar 20, 2022
Qingkai Fang, Rong Ye, Lei Li, Yang Feng, Mingxuan Wang


How can we learn a better speech representation for end-to-end speech-to-text translation (ST) with limited labeled data? Existing techniques often attempt to transfer powerful machine translation (MT) capabilities to ST, but neglect the representation discrepancy across modalities. In this paper, we propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate this discrepancy. Specifically, we mix up the representation sequences of the two modalities, feed both the unimodal speech sequences and the multimodal mixed sequences to the translation model in parallel, and regularize their output predictions with a self-learning framework. Experiments on the MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy and achieves significant improvements over a strong baseline on eight translation directions.
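
A rough sketch of the mixup idea at the embedding level: build a mixed sequence by randomly choosing, per word position, between the speech-side and the text-side representation. The word-level pooling and the mixing ratio below are assumptions for illustration, not the exact STEMM recipe.

```python
# Sketch: mix aligned word-level speech and text representations.
import torch

def speech_text_mixup(speech_words: torch.Tensor, text_words: torch.Tensor,
                      p_text: float = 0.5) -> torch.Tensor:
    """speech_words, text_words: (num_words, dim) aligned word-level representations."""
    take_text = torch.rand(speech_words.size(0), 1) < p_text
    return torch.where(take_text, text_words, speech_words)

speech_w = torch.randn(8, 256)
text_w = torch.randn(8, 256)
mixed = speech_text_mixup(speech_w, text_w, p_text=0.4)
# Both the pure speech sequence and the mixed sequence would then be fed to the
# translation model, with their predictions regularized to agree (self-learning).
print(mixed.shape)
```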

* ACL 2022 main conference 

The Volctrans Neural Speech Translation System for IWSLT 2021

May 16, 2021
Chengqi Zhao, Zhicheng Liu, Jian Tong, Tao Wang, Mingxuan Wang, Rong Ye, Qianqian Dong, Jun Cao, Lei Li


This paper describes the systems submitted to IWSLT 2021 by the Volctrans team. We participate in the offline speech translation and text-to-text simultaneous translation tracks. For offline speech translation, our best end-to-end model achieves an 8.1 BLEU improvement over the benchmark on the MuST-C test set and even approaches the results of a strong cascade solution. For text-to-text simultaneous translation, we explore best practices for optimizing the wait-k model. As a result, our final submitted systems exceed the benchmark by around 7 BLEU in the same latency regime. We will publish our code and models to facilitate both future research and industrial applications.
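
For context, the wait-k policy mentioned above follows a fixed read/write schedule: read k source tokens before emitting the first target token, then alternate one read with one write. The sketch below illustrates that schedule with toy token streams; it is not the submitted system's code.

```python
# Sketch: the standard wait-k read/write schedule for simultaneous translation.
from typing import Iterator, List

def wait_k_schedule(source: List[str], target_len: int, k: int) -> Iterator[str]:
    read, written = 0, 0
    while written < target_len:
        # READ until k tokens ahead of the output (or the source is exhausted).
        while read < min(written + k, len(source)):
            read += 1
            yield f"READ  src[{read - 1}]={source[read - 1]!r}"
        written += 1
        yield f"WRITE tgt[{written - 1}] (having seen {read} source tokens)"

for step in wait_k_schedule(["wir", "werden", "den", "code", "veröffentlichen"],
                            target_len=5, k=3):
    print(step)
```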
