Yosuke Higuchi

End-to-End Speech Recognition with Pre-trained Masked Language Model

Oct 01, 2024

Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

Abstract: We present a novel approach to end-to-end automatic speech recognition (ASR) that utilizes pre-trained masked language models (LMs) to facilitate the extraction of linguistic information. The proposed models, BERT-CTC and BECTRA, are specifically designed to effectively integrate pre-trained LMs (e.g., BERT) into end-to-end ASR models. BERT-CTC adapts BERT for connectionist temporal classification (CTC) by addressing the constraint of the conditional independence assumption between output tokens. This enables explicit conditioning of BERT's contextualized embeddings in the ASR process, seamlessly merging audio and linguistic information through an iterative refinement algorithm. BECTRA extends BERT-CTC to the transducer framework and trains the decoder network using a vocabulary suitable for ASR training. This aims to bridge the gap between the text processed in end-to-end ASR and BERT, as these models have distinct vocabularies with varying text formats and styles, such as the presence of punctuation. Experimental results on various ASR tasks demonstrate that the proposed models improve over both the CTC and transducer-based baselines, owing to the incorporation of BERT knowledge. Moreover, our in-depth analysis and investigation verify the effectiveness of the proposed formulations and architectural designs.

Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems

Sep 30, 2024

Oswald Zink, Yosuke Higuchi, Carlos Mullov, Alexander Waibel, Tetsunori Kobayashi

Abstract: Effective spoken dialog systems should facilitate natural interactions with quick and rhythmic timing, mirroring human communication patterns. To reduce response times, previous efforts have focused on minimizing the latency in automatic speech recognition (ASR) to optimize system efficiency. However, this approach still has to wait until the speaker has finished speaking before ASR can complete its processing, which limits the time available for natural language processing (NLP) to formulate accurate responses. As humans, we continuously anticipate and prepare responses even while the other party is still speaking. This allows us to respond appropriately without missing the optimal time to speak. In this work, as a pioneering study toward a conversational system that simulates such human anticipatory behavior, we aim to realize a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance (EOU), using the middle portion of an utterance. To achieve this, we propose a training strategy for an encoder-decoder-based ASR system, which involves masking future segments of an utterance and prompting the decoder to predict the words in the masked audio. Additionally, we develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information to accurately detect the EOU. The experimental results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300 ms prior to the actual EOU. Moreover, the proposed training strategy exhibits general improvements in ASR performance.

* Submitted to ICASSP2025
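
The idea of masking future segments of an utterance during training can be illustrated with a minimal sketch. The abstract does not say how the masking is implemented, so the function name `mask_future_frames`, the `keep_ratio` parameter, and the choice of filling masked frames with a constant are all assumptions made for illustration only.

```python
from typing import List


def mask_future_frames(
    features: List[List[float]], keep_ratio: float, mask_value: float = 0.0
) -> List[List[float]]:
    """Keep the first part of an utterance and mask every later frame.

    `features` is a list of acoustic frames (one list of floats per frame).
    Frames past the cut point are replaced with `mask_value`; this filling
    strategy is an assumption, not the paper's confirmed procedure.
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    n_keep = int(len(features) * keep_ratio)
    return [
        list(frame) if i < n_keep else [mask_value] * len(frame)
        for i, frame in enumerate(features)
    ]


# Toy utterance: four frames with two feature dimensions each.
utterance = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
masked = mask_future_frames(utterance, keep_ratio=0.5)
```

In such a setup the decoder would still be trained against the full transcript, so it has to predict the words that fall inside the masked region.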

Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference

Oct 01, 2023

Masao Someki, Nicholas Eng, Yosuke Higuchi, Shinji Watanabe

Abstract: Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy. However, they often suffer from slow inference. This is primarily attributed to the incremental calculation of the decoder. This work proposes a partially AR framework, which employs segment-level vectorized beam search for improving the inference speed of an ASR model based on the hybrid connectionist temporal classification (CTC) attention-based architecture. The method first generates an initial hypothesis using greedy CTC decoding, identifying low-confidence tokens based on their output probabilities. We then utilize the decoder to perform segment-level vectorized beam search on these tokens, re-predicting in parallel with minimal decoder calculations. Experimental results show that our method is 12 to 13 times faster in inference than AR decoding on the LibriSpeech corpus while preserving high accuracy.

* Accepted at ASRU 2023
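
The first stage described in the abstract, greedy CTC decoding followed by flagging low-confidence tokens, can be sketched roughly as follows. This is not the authors' implementation: the blank index, the `threshold` value, and the use of the highest frame posterior as a token's confidence are assumptions chosen for illustration.

```python
from typing import List, Tuple

BLANK = 0  # assumed index of the CTC blank symbol


def greedy_ctc_with_confidence(
    frame_probs: List[List[float]], threshold: float = 0.9
) -> Tuple[List[int], List[float], List[int]]:
    """Greedy CTC decoding that also reports low-confidence positions.

    Returns (tokens, confidences, low_confidence_positions).
    """
    tokens: List[int] = []
    confidences: List[float] = []
    prev = BLANK
    for probs in frame_probs:
        # Pick the most probable symbol for this frame.
        best = max(range(len(probs)), key=probs.__getitem__)
        if best != BLANK:
            if best != prev:
                # A new token starts here.
                tokens.append(best)
                confidences.append(probs[best])
            else:
                # Repeated frame of the same token: keep the highest
                # posterior as its confidence (an assumed heuristic).
                confidences[-1] = max(confidences[-1], probs[best])
        prev = best
    low = [i for i, c in enumerate(confidences) if c < threshold]
    return tokens, confidences, low


# Four frames over a vocabulary of {blank, 1, 2}.
frames = [
    [0.10, 0.80, 0.10],
    [0.10, 0.85, 0.05],
    [0.90, 0.05, 0.05],
    [0.20, 0.30, 0.50],
]
tokens, confidences, low_positions = greedy_ctc_with_confidence(frames, threshold=0.6)
```

In the framework described above, only the flagged positions would be handed to the decoder for re-prediction, which is where the speed-up over fully AR decoding would come from.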

Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition

Sep 19, 2023

Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi

Abstract: We present a novel integration of an instruction-tuned large language model (LLM) and end-to-end automatic speech recognition (ASR). Modern LLMs can perform a wide range of linguistic tasks in a zero-shot setting when provided with a precise instruction or a prompt to guide the text generation process towards the desired task. We explore using this zero-shot capability of LLMs to extract linguistic information that can contribute to improving ASR performance. Specifically, we direct an LLM to correct grammatical errors in an ASR hypothesis and harness the embedded linguistic knowledge to conduct end-to-end ASR. The proposed model is built on the hybrid connectionist temporal classification (CTC) and attention architecture, where an instruction-tuned LLM (i.e., Llama2) is employed as a front-end of the decoder. An ASR hypothesis, subject to correction, is obtained from the encoder via CTC decoding, which is then fed into the LLM along with an instruction. The decoder subsequently takes as input the LLM embeddings to perform sequence generation, incorporating acoustic information from the encoder output. Experimental results and analyses demonstrate that the proposed integration yields promising performance improvements, and our approach largely benefits from LLM-based rescoring.

* Submitted to ICASSP2024

Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition

Sep 09, 2023

Huaibo Zhao, Yosuke Higuchi, Yusuke Kida, Tetsuji Ogawa, Tetsunori Kobayashi

Abstract: Achieving high accuracy with low latency has always been a challenge in streaming end-to-end automatic speech recognition (ASR) systems. By attending to more future contexts, a streaming ASR model achieves higher accuracy but incurs higher latency, which hurts the streaming performance. In the Mask-CTC framework, an encoder network is trained to learn the feature representation that anticipates long-term contexts, which is desirable for streaming ASR. Mask-CTC-based encoder pre-training has been shown to be beneficial in achieving low latency and high accuracy for triggered attention-based ASR. However, the effectiveness of this method has not been demonstrated for various model architectures, nor has it been verified that the encoder has the expected look-ahead capability to reduce latency. This study, therefore, examines the effectiveness of Mask-CTC-based pre-training for models with different architectures, such as Transformer-Transducer and contextual block streaming ASR. We also discuss the effect of the proposed pre-training method on obtaining accurate output spike timing.

* Accepted to EUSIPCO 2023

A Study on the Integration of Pre-trained SSL, ASR, LM and SLU Models for Spoken Language Understanding

Nov 10, 2022

Yifan Peng, Siddhant Arora, Yosuke Higuchi, Yushi Ueda, Sujay Kumar, Karthik Ganesan, Siddharth Dalmia, Xuankai Chang, Shinji Watanabe

Abstract: Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we aim to ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LMs) pre-trained on large quantities of unpaired data to extract strong speech and text representations. We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora. We conduct extensive experiments on the SLU Evaluation (SLUE) benchmark and observe self-supervised pre-trained models to be more powerful, with pre-trained LM and speech models being most beneficial for the Sentiment Analysis and Named Entity Recognition tasks, respectively.

* Accepted at SLT 2022

BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder

Nov 02, 2022

Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

Abstract: We present BERT-CTC-Transducer (BECTRA), a novel end-to-end automatic speech recognition (E2E-ASR) model formulated by the transducer with a BERT-enhanced encoder. Integrating a large-scale pre-trained language model (LM) into E2E-ASR has been actively studied, aiming to utilize versatile linguistic knowledge for generating accurate text. One crucial factor that makes this integration challenging is the vocabulary mismatch; the vocabulary constructed for a pre-trained LM is generally too large for E2E-ASR training and is likely to be mismatched with a target ASR domain. To overcome this issue, we propose BECTRA, an extended version of our previous BERT-CTC, that realizes BERT-based E2E-ASR using a vocabulary of interest. BECTRA is a transducer-based model, which adopts BERT-CTC for its encoder and trains an ASR-specific decoder using a vocabulary suitable for a target task. With the combination of the transducer and BERT-CTC, we also propose a novel inference algorithm for taking advantage of both autoregressive and non-autoregressive decoding. Experimental results on several ASR tasks, varying in amounts of data, speaking styles, and languages, demonstrate that BECTRA outperforms BERT-CTC by effectively dealing with the vocabulary mismatch while exploiting BERT knowledge.

* Submitted to ICASSP2023

InterMPL: Momentum Pseudo-Labeling with Intermediate CTC Loss

Nov 02, 2022

Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

Abstract: This paper presents InterMPL, a semi-supervised learning method for end-to-end automatic speech recognition (ASR) that performs pseudo-labeling (PL) with intermediate supervision. Momentum PL (MPL) trains a connectionist temporal classification (CTC)-based model on unlabeled data by continuously generating pseudo-labels on the fly and improving their quality. In contrast to autoregressive formulations, such as the attention-based encoder-decoder and transducer, CTC is well suited for MPL, or PL-based semi-supervised ASR in general, owing to its simple/fast inference algorithm and robustness against generating collapsed labels. However, CTC generally yields performance inferior to that of the autoregressive models due to the conditional independence assumption, thereby limiting the performance of MPL. We propose to enhance MPL by introducing intermediate loss, inspired by the recent advances in CTC-based modeling. Specifically, we focus on self-conditional and hierarchical conditional CTC, which apply auxiliary CTC losses to intermediate layers such that the conditional independence assumption is explicitly relaxed. We also explore how pseudo-labels should be generated and used as supervision for intermediate losses. Experimental results in different semi-supervised settings demonstrate that the proposed approach outperforms MPL and improves an ASR model by up to a 12.1% absolute performance gain. In addition, our detailed analysis validates the importance of the intermediate loss.

* Submitted to ICASSP2023
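
To make the notion of auxiliary CTC losses on intermediate layers concrete, here is a small sketch of how a final-layer loss and intermediate-layer losses might be combined into one training objective. The abstract does not give the exact weighting, so the interpolation form, the function name `combine_ctc_losses`, and the default `weight` are assumptions; the loss values are plain floats standing in for what a CTC loss function would return.

```python
from typing import Sequence


def combine_ctc_losses(
    final_loss: float, intermediate_losses: Sequence[float], weight: float = 0.5
) -> float:
    """Interpolate the final-layer CTC loss with averaged intermediate losses.

    `weight` is a hypothetical hyperparameter controlling how much the
    intermediate supervision contributes to the total objective.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    if not intermediate_losses:
        # No intermediate supervision: fall back to the plain CTC loss.
        return final_loss
    auxiliary = sum(intermediate_losses) / len(intermediate_losses)
    return (1.0 - weight) * final_loss + weight * auxiliary


# Toy values: one final-layer loss and two intermediate-layer losses.
total = combine_ctc_losses(2.0, [4.0, 6.0], weight=0.5)
```

In a pseudo-labeling setup, each of these losses could be computed against pseudo-labels rather than ground-truth transcripts for the unlabeled portion of the data.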

BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model

Oct 29, 2022

Yosuke Higuchi, Brian Yan, Siddhant Arora, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

Abstract: This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that adapts BERT for connectionist temporal classification (CTC). Our formulation relaxes the conditional independence assumptions used in conventional CTC and incorporates linguistic knowledge through the explicit output dependency obtained by BERT contextual embedding. BERT-CTC attends to the full contexts of the input and hypothesized output sequences via the self-attention mechanism. This mechanism encourages a model to learn inner/inter-dependencies between the audio and token representations while maintaining CTC's training efficiency. During inference, BERT-CTC combines a mask-predict algorithm with CTC decoding, which iteratively refines an output sequence. The experimental results reveal that BERT-CTC improves over conventional approaches across variations in speaking styles and languages. Finally, we show that the semantic representations in BERT-CTC are beneficial towards downstream spoken language understanding tasks.

* Accepted to Findings of EMNLP2022
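
The iterative refinement mentioned in the abstract follows the general mask-predict pattern: at each iteration, the least confident tokens are masked again and re-predicted. The sketch below shows only the re-masking step. The linearly decaying number of masks is borrowed from the generic mask-predict recipe and is an assumption here, as are the function name `remask_lowest` and the toy inputs; how BERT-CTC couples this step with CTC decoding is not shown.

```python
from typing import List

MASK = "[MASK]"  # BERT's mask token


def remask_lowest(
    tokens: List[str], confidences: List[float], iteration: int, total_iterations: int
) -> List[str]:
    """Re-mask the least confident tokens for the next refinement pass.

    The number of masked tokens decays linearly with the iteration index
    (an assumed schedule), reaching zero at the final iteration.
    """
    n_mask = int(len(tokens) * (total_iterations - iteration) / total_iterations)
    # Indices of the n_mask lowest-confidence tokens.
    lowest = set(sorted(range(len(tokens)), key=lambda i: confidences[i])[:n_mask])
    return [MASK if i in lowest else tok for i, tok in enumerate(tokens)]


hypothesis = ["a", "b", "c", "d"]
scores = [0.9, 0.2, 0.6, 0.4]
halfway = remask_lowest(hypothesis, scores, iteration=2, total_iterations=4)
final = remask_lowest(hypothesis, scores, iteration=4, total_iterations=4)
```

A full decoding loop would alternate this step with a model call that fills in the masked positions and returns updated confidences.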

CTC Alignments Improve Autoregressive Translation

Oct 11, 2022

Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, Shinji Watanabe

Abstract: Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment. However, for translation, CTC exhibits clear limitations due to the contextual and non-monotonic nature of the task and thus lags behind attentional decoder approaches in terms of translation quality. In this work, we argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework wherein CTC's core properties can counteract several key weaknesses of pure-attention models during training and decoding. To validate this conjecture, we modify the Hybrid CTC/Attention model originally proposed for ASR to support text-to-text translation (MT) and speech-to-text translation (ST). Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
