Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Peidong Wang

Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach

Oct 06, 2023

Junkun Chen, Jian Xue, Peidong Wang, Jing Pan, Jinyu Li

Figure 1 for Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach

Figure 2 for Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach

Figure 3 for Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach

Figure 4 for Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach

Abstract:Simultaneous Speech-to-Text translation serves a critical role in real-time crosslingual communication. Despite the advancements in recent years, challenges remain in achieving stability in the translation process, a concern primarily manifested in the flickering of partial results. In this paper, we propose a novel revision-controllable method designed to address this issue. Our method introduces an allowed revision window within the beam search pruning process to screen out candidate translations likely to cause extensive revisions, leading to a substantial reduction in flickering and, crucially, providing the capability to completely eliminate flickering. The experiments demonstrate the proposed method can significantly improve the decoding stability without compromising substantially on the translation quality.

* accepted by ASRU 2023

Via

Access Paper or Ask Questions

DiariST: Streaming Speech Translation with Speaker Diarization

Sep 14, 2023

Mu Yang, Naoyuki Kanda, Xiaofei Wang, Junkun Chen, Peidong Wang, Jian Xue, Jinyu Li, Takuya Yoshioka

Abstract:End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges such as speaker diarization (SD) without accurate word time stamps and handling of overlapping speech in a streaming fashion. In this work, we propose DiariST, the first streaming ST and SD solution. It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector, which were originally developed for multi-talker speech recognition. Due to the absence of evaluation benchmarks in this area, we develop a new evaluation dataset, DiariST-AliMeeting, by translating the reference Chinese transcriptions of the AliMeeting corpus into English. We also propose new metrics, called speaker-agnostic BLEU and speaker-attributed BLEU, to measure the ST quality while taking SD accuracy into account. Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech. To facilitate the research in this new direction, we release the evaluation data, the offline baseline systems, and the evaluation code.

* Submitted to ICASSP 2024

Via

Access Paper or Ask Questions

Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

Mar 01, 2023

Eric Sun, Jinyu Li, Yuxuan Hu, Yimeng Zhu, Long Zhou, Jian Xue, Peidong Wang, Linquan Liu, Shujie Liu, Edward Lin(+1 more)

Figure 1 for Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

Figure 2 for Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

Figure 3 for Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

Figure 4 for Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

Abstract:We propose gated language experts to improve multilingual transformer transducer models without any language identification (LID) input from users during inference. We define gating mechanism and LID loss to let transformer encoders learn language-dependent information, construct the multilingual transformer block with gated transformer experts and shared transformer layers for compact models, and apply linear experts on joint network output to better regularize speech acoustic and token label joint information. Furthermore, a curriculum training scheme is proposed to let LID guide the gated language experts for better serving their corresponding languages. Evaluated on the English and Spanish bilingual task, our methods achieve average 12.5% and 7.3% relative word error reductions over the baseline bilingual model and monolingual models, respectively, obtaining similar results to the upper bound model trained and inferred with oracle LID. We further explore our method on trilingual, quadrilingual, and pentalingual models, and observe similar advantages as in the bilingual models, which demonstrates the easy extension to more languages.

Via

Access Paper or Ask Questions

Self-supervised learning with bi-label masked speech prediction for streaming multi-talker speech recognition

Nov 10, 2022

Zili Huang, Zhuo Chen, Naoyuki Kanda, Jian Wu, Yiming Wang, Jinyu Li, Takuya Yoshioka, Xiaofei Wang, Peidong Wang

Abstract:Self-supervised learning (SSL), which utilizes the input data itself for representation learning, has achieved state-of-the-art results for various downstream speech tasks. However, most of the previous studies focused on offline single-talker applications, with limited investigations in multi-talker cases, especially for streaming scenarios. In this paper, we investigate SSL for streaming multi-talker speech recognition, which generates transcriptions of overlapping speakers in a streaming fashion. We first observe that conventional SSL techniques do not work well on this task due to the poor representation of overlapping speech. We then propose a novel SSL training objective, referred to as bi-label masked speech prediction, which explicitly preserves representations of all speakers in overlapping speech. We investigate various aspects of the proposed system including data configuration and quantizer selection. The proposed SSL setup achieves substantially better word error rates on the LibriSpeechMix dataset.

* submitted to ICASSP 2023

Via

Access Paper or Ask Questions

LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers

Nov 05, 2022

Peidong Wang, Eric Sun, Jian Xue, Yu Wu, Long Zhou, Yashesh Gaur, Shujie Liu, Jinyu Li

Figure 1 for LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers

Figure 2 for LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers

Figure 3 for LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers

Figure 4 for LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers

Abstract:End-to-end formulation of automatic speech recognition (ASR) and speech translation (ST) makes it easy to use a single model for both multilingual ASR and many-to-many ST. In this paper, we propose streaming language-agnostic multilingual speech recognition and translation using neural transducers (LAMASSU). To enable multilingual text generation in LAMASSU, we conduct a systematic comparison between specified and unified prediction and joint networks. We leverage a language-agnostic multilingual encoder that substantially outperforms shared encoders. To enhance LAMASSU, we propose to feed target LID to encoders. We also apply connectionist temporal classification regularization to transducer training. Experimental results show that LAMASSU not only drastically reduces the model size but also outperforms monolingual ASR and bilingual ST models.

* Submitted to ICASSP 2023

Via

Access Paper or Ask Questions

A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

Nov 04, 2022

Jian Xue, Peidong Wang, Jinyu Li, Eric Sun

Abstract:In this paper, we introduce our work of building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into texts of the target language. The backbone of SM2 is Transformer Transducer, which has high streaming capability. Instead of human labeled speech translation (ST) data, SM2 models are trained using weakly supervised data generated by converting the transcriptions in speech recognition corpora with a machine translation service. With 351 thousand hours of anonymized speech training data from 25 languages, SM2 models achieve comparable or even better ST quality than some recent popular large-scale non-streaming speech models. More importantly, we show that SM2 has the truly zero-shot capability when expanding to new target languages, yielding high quality ST results for {source-speech, target-text} pairs that are not seen during training.

* submitted to ICASSP 2023

Via

Access Paper or Ask Questions

Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Apr 27, 2022

Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Zhuo Chen, Peidong Wang, Gang Liu, Jinyu Li, Jian Wu, Xiangzhan Yu(+1 more)

Figure 1 for Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Figure 2 for Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Figure 3 for Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Figure 4 for Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Abstract:Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even if the pre-training objective is designed for speech recognition. In this paper, we study which factor leads to the success of self-supervised learning on speaker-related tasks, e.g. speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the Voxceleb-1 dataset suggest that the benefit of SSL to SV task is from a combination of mask speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition performance.

* Submitted to INTERSPEECH 2022

Via

Access Paper or Ask Questions

Large-Scale Streaming End-to-End Speech Translation with Neural Transducers

Apr 11, 2022

Jian Xue, Peidong Wang, Jinyu Li, Matt Post, Yashesh Gaur

Figure 1 for Large-Scale Streaming End-to-End Speech Translation with Neural Transducers

Figure 2 for Large-Scale Streaming End-to-End Speech Translation with Neural Transducers

Figure 3 for Large-Scale Streaming End-to-End Speech Translation with Neural Transducers

Figure 4 for Large-Scale Streaming End-to-End Speech Translation with Neural Transducers

Abstract:Neural transducers have been widely used in automatic speech recognition (ASR). In this paper, we introduce it to streaming end-to-end speech translation (ST), which aims to convert audio signals to texts in other languages directly. Compared with cascaded ST that performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduces inference latency, exploits speech information, and avoids error propagation from ASR to MT. To improve the modeling capacity, we propose attention pooling for the joint network in TT. In addition, we extend TT-based ST to multilingual ST, which generates texts of multiple languages at the same time. Experimental results on a large-scale 50 thousand (K) hours pseudo-labeled training set show that TT-based ST not only significantly reduces inference time but also outperforms non-streaming cascaded ST for English-German translation.

* The paper was submitted to Interspeech 2022

Via

Access Paper or Ask Questions

A Conformer Based Acoustic Model for Robust Automatic Speech Recognition

Mar 20, 2022

Yufeng Yang, Peidong Wang, DeLiang Wang

Figure 1 for A Conformer Based Acoustic Model for Robust Automatic Speech Recognition

Figure 2 for A Conformer Based Acoustic Model for Robust Automatic Speech Recognition

Figure 3 for A Conformer Based Acoustic Model for Robust Automatic Speech Recognition

Figure 4 for A Conformer Based Acoustic Model for Robust Automatic Speech Recognition

Abstract:This study addresses robust automatic speech recognition (ASR) by introducing a Conformer-based acoustic model. The proposed model builds on a state-of-the-art recognition system using a bi-directional long short-term memory (BLSTM) model with utterance-wise dropout and iterative speaker adaptation, but employs a Conformer encoder instead of the BLSTM network. The Conformer encoder uses a convolution-augmented attention mechanism for acoustic modeling. The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus. Coupled with utterance-wise normalization and speaker adaptation, our model achieves $6.25\%$ word error rate, which outperforms the previous best system by $8.4\%$ relatively. In addition, the proposed Conformer-based model is $18.3\%$ smaller in model size and reduces total training time by $79.6\%$.

* 5 pages, 2 figures, submitted to Interspeech 2022

Via

Access Paper or Ask Questions

Predicting Atlantic Multidecadal Variability

Oct 29, 2021

Glenn Liu, Peidong Wang, Matthew Beveridge, Young-Oh Kwon, Iddo Drori

Figure 1 for Predicting Atlantic Multidecadal Variability

Figure 2 for Predicting Atlantic Multidecadal Variability

Abstract:Atlantic Multidecadal Variability (AMV) describes variations of North Atlantic sea surface temperature with a typical cycle of between 60 and 70 years. AMV strongly impacts local climate over North America and Europe, therefore prediction of AMV, especially the extreme values, is of great societal utility for understanding and responding to regional climate change. This work tests multiple machine learning models to improve the state of AMV prediction from maps of sea surface temperature, salinity, and sea level pressure in the North Atlantic region. We use data from the Community Earth System Model 1 Large Ensemble Project, a state-of-the-art climate model with 3,440 years of data. Our results demonstrate that all of the models we use outperform the traditional persistence forecast baseline. Predicting the AMV is important for identifying future extreme temperatures and precipitation, as well as hurricane activity, in Europe and North America up to 25 years in advance.

* Tackling Climate Change with Machine Learning workshop at NeurIPS 2021
* 7 pages, 3 figures

Via

Access Paper or Ask Questions