Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Muhammad Shakeel

CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR

Jan 30, 2026

Muhammad Shakeel, Yosuke Fukumoto, Chikara Maeda, Chyi-Jiunn Lin, Shinji Watanabe

Abstract:We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning with contextual biasing in overlapping conversations. CALM implements this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated English (LibriSpeechMix) and Japanese (Corpus of Spontaneous Japanese mixtures, CSJMix). On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.

* Accepted to IEEE ICASSP 2026

Via

Access Paper or Ask Questions

Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

Jun 23, 2024

Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe

Figure 1 for Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

Figure 2 for Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

Figure 3 for Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

Figure 4 for Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

Abstract:Contextualized end-to-end automatic speech recognition has been an active research area, with recent efforts focusing on the implicit learning of contextual phrases based on the final loss objective. However, these approaches ignore the useful contextual knowledge encoded in the intermediate layers. We hypothesize that employing explicit biasing loss as an auxiliary task in the encoder intermediate layers may better align text tokens or audio frames with the desired objectives. Our proposed intermediate biasing loss brings more regularization and contextualization to the network. Our method outperforms a conventional contextual biasing baseline on the LibriSpeech corpus, achieving a relative improvement of 22.5% in biased word error rate (B-WER) and up to 44% compared to the non-contextual baseline with a biasing list size of 100. Moreover, employing RNN-transducer-driven joint decoding further reduces the unbiased word error rate (U-WER), resulting in a more robust network.

* Accepted to INTERSPEECH 2024

Via

Access Paper or Ask Questions

4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

Jun 05, 2024

Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Brian Yan, Jiatong Shi, Yifan Peng, Shinji Watanabe

Figure 1 for 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

Figure 2 for 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

Figure 3 for 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

Figure 4 for 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

Abstract:End-to-end automatic speech recognition (E2E-ASR) can be classified into several network architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and mask-predict models. Each network architecture has advantages and disadvantages, leading practitioners to switch between these different models depending on application requirements. Instead of building separate models, we propose a joint modeling scheme where four decoders (CTC, RNN-T, attention, and mask-predict) share the same encoder -- we refer to this as 4D modeling. The 4D model is trained using multitask learning, which will bring model regularization and maximize the model robustness thanks to their complementary properties. To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning. In addition, we propose three novel one-pass beam search algorithms by combining three decoders (CTC, RNN-T, and attention) to further improve performance. These three beam search algorithms differ in which decoder is used as the primary decoder. We carefully evaluate the performance and computational tradeoffs associated with each algorithm. Experimental results demonstrate that the jointly trained 4D model outperforms the E2E-ASR models trained with only one individual decoder. Furthermore, we demonstrate that the proposed one-pass beam search algorithm outperforms the previously proposed CTC/attention decoding.

* submitted to IEEE/ACM Transactions on Audio Speech and Language Processing

Via

Access Paper or Ask Questions

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

May 22, 2024

Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe

Figure 1 for Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

Figure 2 for Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

Figure 3 for Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

Figure 4 for Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

Abstract:End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes: streaming and non-streaming, each with its pros and cons. Streaming ASR processes the speech frames in real-time as it is being received, while non-streaming ASR waits for the entire speech utterance; thus, professionals may have to operate in either mode to satisfy their application. In this work, we present joint optimization of streaming and non-streaming ASR based on multi-decoder and knowledge distillation. Primarily, we study 1) the encoder integration of these ASR modules, followed by 2) separate decoders to make the switching mode flexible, and enhancing performance by 3) incorporating similarity-preserving knowledge distillation between the two modular encoders and decoders. Evaluation results show 2.6%-5.3% relative character error rate reductions (CERR) on CSJ for streaming ASR, and 8.3%-9.7% relative CERRs for non-streaming ASR within a single model compared to multiple standalone modules.

* Accepted to IEEE ICASSP 2024 workshop Hands-free Speech Communication and Microphone Arrays (HSCMA 2024)

Via

Access Paper or Ask Questions

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

Feb 20, 2024

Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe

Figure 1 for OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

Figure 2 for OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

Figure 3 for OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

Figure 4 for OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

Abstract:There has been an increasing interest in large speech models that can perform multiple speech processing tasks in a single model. Such models usually adopt the encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. However, autoregressive models can be slower during inference compared to non-autoregressive models and also have potential risks of hallucination. Though prior studies observed promising results of non-autoregressive models for certain tasks at small scales, it remains unclear if they can be scaled to speech-to-text generation in diverse languages and tasks. Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR), speech translation (ST), and language identification (LID). Compared to encoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up to 25% relative improvement on ST, while it is more robust and 3 to 4 times faster for inference. OWSM-CTC also improves the long-form ASR result with 20x speed-up. We will publicly release our codebase, pre-trained model, and training logs to promote open science in speech foundation models.

* 18 pages, 2 figures

Via

Access Paper or Ask Questions

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

Jan 30, 2024

Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang(+2 more)

Figure 1 for OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

Figure 2 for OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

Figure 3 for OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

Figure 4 for OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

Abstract:Recent studies have advocated for fully open foundation models to promote transparency and open science. As an initial step, the Open Whisper-style Speech Model (OWSM) reproduced OpenAI's Whisper using publicly available data and open-source toolkits. With the aim of reproducing Whisper, the previous OWSM v1 through v3 models were still based on Transformer, which might lead to inferior performance compared to other state-of-the-art speech encoders. In this work, we aim to improve the performance and efficiency of OWSM without extra training data. We present E-Branchformer based OWSM v3.1 models at two scales, i.e., 100M and 1B. The 1B model is the largest E-Branchformer based speech model that has been made publicly available. It outperforms the previous OWSM v3 in a vast majority of evaluation benchmarks, while demonstrating up to 25% faster inference speed. We publicly release the data preparation scripts, pre-trained models and training logs.

* Project webpage: https://www.wavlab.org/activities/2024/owsm/

Via

Access Paper or Ask Questions

Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

Jan 19, 2024

Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Yifan Peng, Shinji Watanabe

Figure 1 for Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

Figure 2 for Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

Figure 3 for Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

Figure 4 for Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

Abstract:End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since the performance of such methods is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualized by the user or developer. This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list). The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data. In addition, to improve the contextualization performance during inference further, we propose a bias phrase boosted (BPB) beam search algorithm based on the bias phrase index probability. Experimental results demonstrate that the proposed method consistently improves the word error rate and the character error rate of the target phrases in the bias list on both the Librispeech-960 (English) and our in-house (Japanese) dataset, respectively.

* accepted by ICASSP20224

Via

Access Paper or Ask Questions

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

Oct 02, 2023

Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma(+6 more)

Abstract:Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessible, which makes it difficult for researchers to further improve its performance and address training-related issues such as efficiency, robustness, fairness, and bias. This work presents an Open Whisper-style Speech Model (OWSM), which reproduces Whisper-style training using an open-source toolkit and publicly available data. OWSM even supports more translation directions and can be more efficient to train. We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pre-trained models and training logs to promote open science.

* Accepted at ASRU 2023

Via

Access Paper or Ask Questions

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

Dec 21, 2022

Yui Sudo, Muhammad Shakeel, Brian Yan, Jiatong Shi, Shinji Watanabe

Figure 1 for 4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

Figure 2 for 4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

Figure 3 for 4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

Abstract:The network architecture of end-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and non-autoregressive mask-predict models. Since each of these network architectures has pros and cons, a typical use case is to switch these separate models depending on the application requirement, resulting in the increased overhead of maintaining all models. Several methods for integrating two of these complementary models to mitigate the overhead issue have been proposed; however, if we integrate more models, we will further benefit from these complementary models and realize broader applications with a single system. This paper proposes four-decoder joint modeling (4D) of CTC, attention, RNN-T, and mask-predict, which has the following three advantages: 1) The four decoders are jointly trained so that they can be easily switched depending on the application scenarios. 2) Joint training may bring model regularization and improve the model robustness thanks to their complementary properties. 3) Novel one-pass joint decoding methods using CTC, attention, and RNN-T further improves the performance. The experimental results showed that the proposed model consistently reduced the WER.

* Submitted to ICASSP 2023

Via

Access Paper or Ask Questions

Metric-based multimodal meta-learning for human movement identification via footstep recognition

Nov 15, 2021

Muhammad Shakeel, Katsutoshi Itoyama, Kenji Nishida, Kazuhiro Nakadai

Figure 1 for Metric-based multimodal meta-learning for human movement identification via footstep recognition

Figure 2 for Metric-based multimodal meta-learning for human movement identification via footstep recognition

Figure 3 for Metric-based multimodal meta-learning for human movement identification via footstep recognition

Figure 4 for Metric-based multimodal meta-learning for human movement identification via footstep recognition

Abstract:We describe a novel metric-based learning approach that introduces a multimodal framework and uses deep audio and geophone encoders in siamese configuration to design an adaptable and lightweight supervised model. This framework eliminates the need for expensive data labeling procedures and learns general-purpose representations from low multisensory data obtained from omnipresent sensing systems. These sensing systems provide numerous applications and various use cases in activity recognition tasks. Here, we intend to explore the human footstep movements from indoor environments and analyze representations from a small self-collected dataset of acoustic and vibration-based sensors. The core idea is to learn plausible similarities between two sensory traits and combining representations from audio and geophone signals. We present a generalized framework to learn embeddings from temporal and spatial features extracted from audio and geophone signals. We then extract the representations in a shared space to maximize the learning of a compatibility function between acoustic and geophone features. This, in turn, can be used effectively to carry out a classification task from the learned model, as demonstrated by assigning high similarity to the pairs with a human footstep movement and lower similarity to pairs containing no footstep movement. Performance analyses show that our proposed multimodal framework achieves a 19.99\% accuracy increase (in absolute terms) and avoided overfitting on the evaluation set when the training samples were increased from 200 pairs to just 500 pairs while satisfactorily learning the audio and geophone representations. Our results employ a metric-based contrastive learning approach for multi-sensor data to mitigate the impact of data scarcity and perform human movement identification with limited data size.

Via

Access Paper or Ask Questions