Renjie Zheng

Decoupling recognition and transcription in Mandarin ASR

Aug 02, 2021

Jiahong Yuan, Xingyu Cai, Dongji Gao, Renjie Zheng, Liang Huang, Kenneth Church

Abstract: Much of the recent literature on automatic speech recognition (ASR) takes an end-to-end approach. Unlike English, where the writing system is closely related to sound, Chinese characters (Hanzi) represent meaning, not sound. We propose factoring audio -> Hanzi into two sub-tasks: (1) audio -> Pinyin and (2) Pinyin -> Hanzi, where Pinyin is a system of phonetic transcription of standard Chinese. Factoring the audio -> Hanzi task in this way achieves 3.9% CER (character error rate) on the Aishell-1 corpus, the best result reported on this dataset so far.

* submitted to ASRU 2021
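
The result above is reported as a character error rate. As a minimal sketch of how CER is commonly computed (character-level edit distance divided by the reference length), assuming this standard definition rather than the authors' own scoring script; the function names `edit_distance` and `cer` are illustrative only:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character out of six reference characters.
print(cer("今天天气很好", "今天天汽很好"))
```

Because Hanzi are scored per character, a single wrong character in a six-character reference already gives a CER of about 16.7%.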

Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR

Jun 11, 2021

Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang

Abstract: Simultaneous speech-to-text translation is widely useful in many scenarios. The conventional cascaded approach uses a pipeline of streaming ASR followed by simultaneous MT, but suffers from error propagation and extra latency. To alleviate these issues, recent efforts attempt to directly translate the source speech into target text simultaneously, but this is much harder due to the combination of two separate tasks. We instead propose a new paradigm with the advantages of both cascaded and end-to-end approaches. The key idea is to use two separate, but synchronized, decoders on streaming ASR and direct speech-to-text translation (ST), respectively, where the intermediate results of ASR guide the decoding policy of (but are not fed as input to) ST. During training, we use multitask learning to jointly learn these two tasks with a shared encoder. En-to-De and En-to-Es experiments on the MuST-C dataset demonstrate that our proposed technique achieves substantially better translation quality at similar levels of latency.

* accepted by Findings of ACL 2021

Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation

Feb 10, 2021

Renjie Zheng, Junkun Chen, Mingbo Ma, Liang Huang

Abstract: Recently, text and speech representation learning has successfully improved many language-related tasks. However, all existing methods only learn from one input modality, while a unified acoustic and text representation is desired by many speech-related tasks such as speech translation. We propose a Fused Acoustic and Text Masked Language Model (FAT-MLM) which jointly learns a unified representation for both acoustic and text input. Within this cross-modal representation learning framework, we further present an end-to-end model for Fused Acoustic and Text Speech Translation (FAT-ST). Experiments on three translation directions show that our proposed speech translation models fine-tuned from FAT-MLM substantially improve translation quality (+5.90 BLEU).

MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation

Oct 22, 2020

Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang

Abstract: End-to-end Speech-to-text Translation (E2E-ST), which directly translates source language speech to target language text, is widely useful in practice, but traditional cascaded approaches (ASR+MT) often suffer from error propagation in the pipeline. On the other hand, existing end-to-end solutions heavily depend on the source language transcriptions for pre-training or multi-task training with Automatic Speech Recognition (ASR). We instead propose a simple technique to learn a robust speech encoder in a self-supervised fashion only on the speech side, which can utilize speech data without transcription. This technique, termed Masked Acoustic Modeling (MAM), can also perform pre-training, for the first time, on any acoustic signals (including non-speech ones) without annotation. Compared with current state-of-the-art models on ST, our technique achieves +1.4 BLEU improvement without using transcriptions, and +1.2 BLEU using transcriptions. The pre-training of MAM with arbitrary acoustic signals also boosts the downstream speech-related tasks.

* 10 pages

Fluent and Low-latency Simultaneous Speech-to-Speech Translation with Self-adaptive Training

Oct 21, 2020

Renjie Zheng, Mingbo Ma, Baigong Zheng, Kaibo Liu, Jiahong Yuan, Kenneth Church, Liang Huang

Abstract: Simultaneous speech-to-speech translation is widely useful but extremely challenging, since it needs to generate target-language speech concurrently with the source-language speech, with only a few seconds' delay. In addition, it needs to continuously translate a stream of sentences, but all recent solutions merely focus on the single-sentence scenario. As a result, current approaches accumulate latencies progressively when the speaker talks faster, and introduce unnatural pauses when the speaker talks slower. To overcome these issues, we propose Self-Adaptive Translation (SAT), which flexibly adjusts the length of translations to accommodate different source speech rates. At similar levels of translation quality (as measured by BLEU), our method generates more fluent target speech (as measured by the naturalness metric MOS) with substantially lower latency than the baseline, in both Zh <-> En directions.

* 10 pages, accepted by Findings of EMNLP 2020

Improving Simultaneous Translation with Pseudo References

Oct 21, 2020

Junkun Chen, Renjie Zheng, Atsuhito Kita, Mingbo Ma, Liang Huang

Abstract: Simultaneous translation is vastly different from full-sentence translation, in the sense that it starts translation before the source sentence ends, with only a few words' delay. However, due to the lack of large-scale, publicly available simultaneous translation datasets, most simultaneous translation systems are still trained on ordinary full-sentence parallel corpora, which are not suitable for the simultaneous scenario because they contain unnecessary long-distance reorderings. Instead of expensive, time-consuming annotation, we propose a novel method that rewrites the target side of existing full-sentence corpora into simultaneous-style translation. Experiments on Chinese-to-English translation demonstrate improvements of about +2.7 BLEU with the addition of newly generated pseudo references.

* 6 pages

Simultaneous Translation Policies: From Fixed to Adaptive

May 02, 2020

Baigong Zheng, Kaibo Liu, Renjie Zheng, Mingbo Ma, Hairong Liu, Liang Huang

Abstract: Adaptive policies are better than fixed policies for simultaneous translation, since they can flexibly balance the tradeoff between translation quality and latency based on the current context information. But previous methods for obtaining adaptive policies either rely on a complicated training process or underperform simple fixed policies. We design an algorithm to achieve adaptive policies via a simple heuristic composition of a set of fixed policies. Experiments on Chinese -> English and German -> English show that our adaptive policies can outperform fixed ones by up to 4 BLEU points for the same latency, and more surprisingly, they even surpass the BLEU score of full-sentence translation in the greedy mode (and come very close to beam mode), but with much lower latency.

Opportunistic Decoding with Timely Correction for Simultaneous Translation

May 02, 2020

Renjie Zheng, Mingbo Ma, Baigong Zheng, Kaibo Liu, Liang Huang

Abstract: Simultaneous translation has many important application scenarios and has recently attracted much attention from both academia and industry. Most existing frameworks, however, have difficulties in balancing translation quality and latency, i.e., the decoding policy is usually either too aggressive or too conservative. We propose an opportunistic decoding technique with timely correction ability, which always (over-)generates a certain amount of extra words at each step to keep the audience on track with the latest information. At the same time, it also corrects, in a timely fashion, the mistakes in the previously overgenerated words when observing more source context, to ensure high translation quality. Experiments show our technique achieves substantial reduction in latency and up to +3.1 increase in BLEU, with revision rate under 8% in Chinese-to-English and English-to-Chinese translation.

* accepted by ACL 2020

Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework

Nov 07, 2019

Mingbo Ma, Baigong Zheng, Kaibo Liu, Renjie Zheng, Hairong Liu, Kainan Peng, Kenneth Church, Liang Huang

Abstract: Text-to-speech synthesis (TTS) has witnessed rapid progress in recent years, where neural methods became capable of producing audio with near human-level naturalness. However, these efforts still suffer from two types of latencies: (a) the computational latency (synthesis time), which grows linearly with the sentence length even with parallel approaches, and (b) the input latency in scenarios where the input text is incrementally generated (such as in simultaneous translation, dialog generation, and assistive technologies). To reduce these latencies, we devise the first neural incremental TTS approach based on the recently proposed prefix-to-prefix framework. We synthesize speech in an online fashion, playing a segment of audio while generating the next, resulting in an O(1) rather than O(n) latency. Experiments on English TTS show that our approach achieves similar speech naturalness compared to full-sentence methods, but using only a fraction of the time and a constant (1-2 words) latency.

* 11 pages
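
The O(1) versus O(n) latency claim above can be illustrated with a toy timing model. This is a simplified sketch, not the paper's system: it assumes segments are synthesized back to back and that playback of a segment can start as soon as it is ready and the previous one has finished. All function names and the timing numbers are hypothetical.

```python
from typing import List


def full_sentence_delay(synth_times: List[float]) -> float:
    """Delay before any audio plays if the whole sentence is synthesized first."""
    return sum(synth_times)


def incremental_delay(synth_times: List[float]) -> float:
    """Delay before audio plays if playback starts after the first segment."""
    return synth_times[0] if synth_times else 0.0


def playback_stalls(synth_times: List[float], play_times: List[float]) -> bool:
    """True if some segment is not ready when the previous one stops playing."""
    ready = 0.0     # time at which the current segment finishes synthesis
    play_end = 0.0  # time at which the previous segment finishes playing
    for i, (synth, play) in enumerate(zip(synth_times, play_times)):
        ready += synth
        if i > 0 and ready > play_end:
            return True
        play_end = max(ready, play_end) + play
    return False


synth = [0.2, 0.2, 0.2]  # seconds to synthesize each segment (made-up values)
play = [0.5, 0.5, 0.5]   # seconds of audio in each segment (made-up values)
print(full_sentence_delay(synth), incremental_delay(synth), playback_stalls(synth, play))
```

In this model the initial delay of the incremental scheme depends only on the first segment, while the full-sentence delay grows with the number of segments; playback stays gap-free as long as each segment is synthesized faster than the preceding audio plays.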

Simpler and Faster Learning of Adaptive Policies for Simultaneous Translation

Sep 12, 2019

Baigong Zheng, Renjie Zheng, Mingbo Ma, Liang Huang

Abstract: Simultaneous translation is widely useful but remains challenging. Previous work falls into two main categories: (a) fixed-latency policies such as Ma et al. (2019) and (b) adaptive policies such as Gu et al. (2017). The former are simple and effective, but have to aggressively predict future content due to diverging source-target word order; the latter do not anticipate, but suffer from unstable and inefficient training. To combine the merits of both approaches, we propose a simple supervised-learning framework to learn an adaptive policy from oracle READ/WRITE sequences generated from parallel text. At each step, such an oracle sequence chooses to WRITE the next target word if the available source sentence context provides enough information to do so, otherwise READ the next source word. Experiments on German <-> English show that our method, without retraining the underlying NMT model, can learn flexible policies with better BLEU scores and similar latencies compared to previous work.

* EMNLP 2019
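
The READ/WRITE formulation in the abstract above can be sketched as a small simulator. This is an illustrative sketch only: the fixed-latency rule shown is the wait-k rule (read k source words, then alternate WRITE and READ), which is assumed here to be what the Ma et al. (2019) citation refers to; it is not the paper's learned adaptive policy. The names `wait_k_policy` and `run_policy` are hypothetical, and fixing the target length in advance is a simplification (a real decoder stops at an end-of-sentence token).

```python
from typing import Callable, List

READ, WRITE = "READ", "WRITE"

# A policy maps (source words read, target words written, source length) to an action.
Policy = Callable[[int, int, int], str]


def wait_k_policy(k: int) -> Policy:
    """Fixed-latency policy: stay k source words ahead of the target."""
    def policy(num_read: int, num_written: int, src_len: int) -> str:
        if num_read < src_len and num_read - num_written < k:
            return READ
        return WRITE
    return policy


def run_policy(policy: Policy, src_len: int, tgt_len: int) -> List[str]:
    """Roll out a policy until tgt_len target words have been written."""
    actions: List[str] = []
    num_read = num_written = 0
    while num_written < tgt_len:
        action = policy(num_read, num_written, src_len)
        if action == READ and num_read < src_len:
            num_read += 1
        else:
            # Once the source is exhausted, the only valid action is WRITE.
            action = WRITE
            num_written += 1
        actions.append(action)
    return actions


print(run_policy(wait_k_policy(2), src_len=4, tgt_len=4))
```

An adaptive policy fits the same `Policy` signature once it is given access to the actual context (for example, by closing over the source words read so far); only the decision rule changes, not the rollout loop.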
