Aparna Khare

Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

Mar 28, 2024

Yash Jain, David Chan, Pranav Dheram, Aparna Khare, Olabanji Shonibare, Venkatesh Ravichandran, Shalini Ghosh

Abstract: Recent advances in machine learning have demonstrated that multi-modal pre-training can improve automatic speech recognition (ASR) performance compared to randomly initialized models, even when models are fine-tuned on uni-modal tasks. Existing multi-modal pre-training methods for the ASR task have primarily focused on single-stage pre-training where a single unsupervised task is used for pre-training followed by fine-tuning on the downstream task. In this work, we introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach. We empirically demonstrate that such a multi-stage approach leads to relative word error rate (WER) improvements of up to 38.45% over baselines on both LibriSpeech and SUPERB. Additionally, we share several important findings for choosing pre-training methods and datasets.

* Accepted in LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
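
The abstract reports gains as a relative WER improvement. As a rough illustration of how that metric is usually computed, here is a minimal sketch: WER as word-level edit distance divided by reference length, and the relative improvement as the percentage reduction against a baseline WER. The function names and example sentences are hypothetical and not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Percentage reduction of WER relative to the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical sentences: one substitution and one deletion over four words.
print(word_error_rate("the cat sat down", "the cat sit"))
print(relative_wer_improvement(10.0, 5.0))
```

A relative figure depends on the baseline: the same absolute WER drop counts for more when the baseline WER is already low.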

Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

Jan 26, 2024

Jinhan Wang, Long Chen, Aparna Khare, Anirudh Raju, Pranav Dheram, Di He, Minhua Wu, Andreas Stolcke, Venkatesh Ravichandran

Abstract: We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge for understanding the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combined LLMs and acoustic models for a more natural and conversational interaction between humans and speech-enabled AI agents.

* To appear in IEEE ICASSP 2024

Two-pass Endpoint Detection for Speech Recognition

Jan 17, 2024

Anirudh Raju, Aparna Khare, Di He, Ilya Sklyar, Long Chen, Sam Alptekin, Viet Anh Trinh, Zhe Zhang, Colin Vaz, Venkatesh Ravichandran (+2 more)

Abstract: Endpoint (EP) detection is a key component of far-field speech recognition systems that assist the user through voice commands. The endpoint detector has to trade off accuracy against latency, since waiting longer reduces the cases of users being cut off early. We propose a novel two-pass solution for endpointing, where the utterance endpoint detected by a first-pass endpointer is verified by a second-pass model termed EP Arbitrator. Our method improves the trade-off between early cut-offs and latency over a baseline endpointer, as tested on datasets including voice-assistant transactional queries, conversational speech, and the public SLURP corpus. We demonstrate that our method shows improvements regardless of the first-pass EP model used.

* ASRU 2023
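
The two-pass control flow described in the abstract can be sketched roughly as follows. This is an illustration under strong assumptions, not the paper's system: the first pass here is a plain trailing-silence counter over hypothetical per-frame speech flags, and the second pass (`arbitrator`) is any callable that accepts or rejects a candidate frame index. All names are made up for the example.

```python
from typing import Callable, Iterator, Optional, Sequence


def first_pass_candidates(is_speech: Sequence[bool], min_silence: int) -> Iterator[int]:
    """Yield frame indices where trailing silence first reaches min_silence frames."""
    silence_run = 0
    heard_speech = False
    for t, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silence_run = 0
        elif heard_speech:
            silence_run += 1
            if silence_run == min_silence:
                yield t


def two_pass_endpoint(
    is_speech: Sequence[bool],
    min_silence: int,
    arbitrator: Callable[[int], bool],
) -> Optional[int]:
    """Return the first candidate endpoint that the second-pass check accepts."""
    for t in first_pass_candidates(is_speech, min_silence):
        if arbitrator(t):
            return t  # second pass confirms the endpoint
        # otherwise keep listening for the next candidate
    return None


# Hypothetical frames: a short pause at frames 2-3, then a real ending.
frames = [True, True, False, False, True, False, False, False]
print(two_pass_endpoint(frames, 2, lambda t: t >= 5))
```

The design point is that an aggressive first pass keeps latency low, while the second pass can veto candidates that would cut the user off early.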

Cross-utterance ASR Rescoring with Graph-based Label Propagation

Mar 27, 2023

Srinath Tankasala, Long Chen, Andreas Stolcke, Anirudh Raju, Qianli Deng, Chander Chandak, Aparna Khare, Roland Maas, Venkatesh Ravichandran

Abstract: We propose a novel approach for ASR N-best hypothesis rescoring with graph-based label propagation by leveraging cross-utterance acoustic similarity. In contrast to conventional neural language model (LM) based ASR rescoring/reranking models, our approach focuses on acoustic information and conducts the rescoring collaboratively among utterances, instead of individually. Experiments on the VCTK dataset demonstrate that our approach consistently improves ASR performance, as well as fairness across speaker groups with different accents. Our approach provides a low-cost solution for mitigating the majoritarian bias of ASR systems, without the need to train new domain- or accent-specific models.

* To appear in IEEE ICASSP 2023
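
For readers unfamiliar with graph-based label propagation, the sketch below shows the generic textbook iteration F ← αSF + (1 − α)Y, where S is the symmetrically normalised similarity matrix and Y holds the initial scores. It is not the paper's formulation: how nodes, edge weights, and initial scores are derived from N-best hypotheses and acoustic similarity is not specified here, and the tiny graph is hypothetical.

```python
import math


def propagate_labels(weights, initial_scores, alpha=0.8, iterations=50):
    """Iterate F <- alpha * S @ F + (1 - alpha) * Y on a small dense graph."""
    n = len(weights)
    num_classes = len(initial_scores[0])
    degree = [sum(row) for row in weights]
    # Symmetric normalisation: S = D^{-1/2} W D^{-1/2}
    norm = [
        [
            weights[i][j] / math.sqrt(degree[i] * degree[j]) if weights[i][j] else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]
    scores = [row[:] for row in initial_scores]
    for _ in range(iterations):
        scores = [
            [
                alpha * sum(norm[i][j] * scores[j][c] for j in range(n))
                + (1 - alpha) * initial_scores[i][c]
                for c in range(num_classes)
            ]
            for i in range(n)
        ]
    return scores


# Hypothetical graph: node 1 has no initial score, is strongly similar
# to node 0 (class 0) and weakly similar to node 2 (class 1).
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.1],
     [0.0, 0.1, 0.0]]
Y = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
print(propagate_labels(W, Y)[1])
```

Because scores flow along similarity edges, a node with a weak or missing initial score borrows evidence from its nearest neighbours, which is the sense in which rescoring becomes collaborative.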

ASR-Aware End-to-end Neural Diarization

Feb 02, 2022

Aparna Khare, Eunjung Han, Yuguang Yang, Andreas Stolcke

Abstract: We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model. Two categories of features are explored: features derived directly from ASR output (phones, position-in-word and word boundaries) and features derived from a lexical speaker change detection model, trained by fine-tuning a pretrained BERT model on the ASR output. Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features. First, ASR features are concatenated with acoustic features. Second, we propose a new attention mechanism called contextualized self-attention that utilizes ASR features to build robust speaker representations. Finally, multi-task learning is used to train the model to minimize classification loss for the ASR features along with diarization loss. Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features, reducing the diarization error rate (DER) by 20% relative to the baseline.

* To appear in ICASSP 2022
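
The multi-task objective mentioned in the abstract can be pictured as a weighted sum of a main loss and an auxiliary classification loss. The sketch below is a simplification under stated assumptions: the diarization term is a per-speaker binary cross-entropy for one frame (permutation handling is omitted), the auxiliary term is a categorical cross-entropy over a hypothetical position-in-word label set, and `aux_weight` is an invented hyperparameter.

```python
import math


def binary_cross_entropy(probs, targets):
    """Mean binary cross-entropy over speakers for a single frame."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


def cross_entropy(class_probs, target_index):
    """Negative log-likelihood of the target class."""
    return -math.log(class_probs[target_index])


def multitask_loss(speaker_probs, speaker_targets, aux_probs, aux_target, aux_weight=0.5):
    """Diarization loss plus a weighted auxiliary ASR-feature classification loss."""
    main = binary_cross_entropy(speaker_probs, speaker_targets)
    aux = cross_entropy(aux_probs, aux_target)
    return main + aux_weight * aux


# Hypothetical frame: speaker 1 active, speaker 2 silent; the auxiliary
# label is class 1 of a three-way position-in-word tag set.
loss = multitask_loss([0.9, 0.2], [1, 0], [0.1, 0.7, 0.2], 1)
print(loss)
```

Setting `aux_weight` to zero recovers plain diarization training, which makes the auxiliary task easy to ablate.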

Audiovisual Highlight Detection in Videos

Feb 11, 2021

Karel Mundnich, Alexandra Fenster, Aparna Khare, Shiva Sundaram

Abstract: In this paper, we test the hypothesis that interesting events in unstructured videos are inherently audiovisual. We combine deep image representations for object recognition and scene understanding with representations from an audiovisual affect recognition model. To this set, we add content-agnostic audio-visual synchrony representations and mel-frequency cepstral coefficients to capture other intrinsic properties of audio. These features are used in a modular supervised model. We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time. For the video summarization task, our results indicate that the visual features carry the most information, and including audiovisual features improves over visual-only information. To better study the task of highlight detection, we run a pilot experiment with highlight annotations for a small subset of video clips and fine-tune our best model on it. Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.

* 5 pages, 2 figures, conference paper

Self-Supervised learning with cross-modal transformers for emotion recognition

Nov 20, 2020

Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram

Abstract: Emotion recognition is a challenging task due to the limited availability of in-the-wild labeled datasets. Self-supervised learning has shown improvements on tasks with limited labeled datasets in domains like speech and natural language. Models such as BERT learn to incorporate context in word embeddings, which translates to improved performance in downstream tasks like question answering. In this work, we extend self-supervised training to multi-modal applications. We learn multi-modal representations using a transformer trained on the masked language modeling task with audio, visual and text features. This model is fine-tuned on the downstream task of emotion recognition. Our results on the CMU-MOSEI dataset show that this pre-training technique can improve the emotion recognition performance by up to 3% compared to the baseline.

* To appear in SLT 2020
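
Masked pre-training on feature sequences can be illustrated with a simple masking step: hide a random subset of positions and keep their indices as reconstruction targets. The sketch below assumes a non-empty list of equal-length feature vectors, a 15% masking rate, and a constant mask value; these choices are illustrative and are not the paper's recipe.

```python
import random


def mask_frames(features, mask_prob=0.15, mask_value=0.0, seed=0):
    """Replace a random subset of feature vectors with a constant mask vector.

    Returns the masked copy and the sorted masked positions, i.e. the
    positions a model would be trained to reconstruct.
    """
    rng = random.Random(seed)
    num_masked = max(1, round(mask_prob * len(features)))
    positions = sorted(rng.sample(range(len(features)), num_masked))
    masked = [list(vec) for vec in features]  # copy so the input is untouched
    for pos in positions:
        masked[pos] = [mask_value] * len(features[pos])
    return masked, positions


# Hypothetical sequence of 20 two-dimensional feature frames.
frames = [[float(i) + 1.0, float(i) + 1.5] for i in range(20)]
masked, positions = mask_frames(frames)
print(positions)
```

The same routine could be applied per modality, so that audio, visual, and text streams each contribute masked positions to the training objective.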

Multi-modal embeddings using multi-task learning for emotion recognition

Sep 10, 2020

Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram

Abstract: General embeddings like word2vec, GloVe and ELMo have shown a lot of success in natural language tasks. The embeddings are typically extracted from models that are built on general tasks such as skip-gram models and natural language generation. In this paper, we extend the work from natural language understanding to multi-modal architectures that use audio, visual and textual information for machine learning tasks. The embeddings in our network are extracted using the encoder of a transformer model trained using multi-task training. We use person identification and automatic speech recognition as the tasks in our embedding generation framework. We tune and evaluate the embeddings on the downstream task of emotion recognition and demonstrate that on the CMU-MOSEI dataset, the embeddings can be used to improve over previous state-of-the-art results.

* To appear in Interspeech 2020

Multiresolution and Multimodal Speech Recognition with Transformers

Apr 29, 2020

Georgios Paraskevopoulos, Srinivas Parthasarathy, Aparna Khare, Shiva Sundaram

Abstract: This paper presents an audio-visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture. We particularly focus on the scene context provided by the visual information, to ground the ASR. We extract representations for audio features in the encoder layers of the transformer and fuse video features using an additional crossmodal multihead attention layer. Additionally, we incorporate a multitask training criterion for multiresolution ASR, where we train the model to generate both character and subword level transcriptions. Experimental results on the How2 dataset indicate that multiresolution training can speed up convergence by around 50% and yields relative word error rate (WER) improvements of up to 18% over subword prediction models. Further, incorporating visual information improves performance with relative gains of up to 3.76% over audio-only models. Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.

* Accepted for ACL 2020
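
The crossmodal attention idea in the abstract, audio representations querying video features, can be sketched with plain scaled dot-product attention. This is a single-head version with no learned projections, so it only illustrates the data flow; the vectors are hypothetical and both modalities are assumed to share one dimensionality.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(audio, video):
    """Each audio vector (query) attends over all video vectors (keys and values)."""
    dim = len(video[0])
    outputs, all_weights = [], []
    for query in audio:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in video
        ]
        weights = softmax(scores)
        fused = [
            sum(w * value[d] for w, value in zip(weights, video))
            for d in range(dim)
        ]
        outputs.append(fused)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical inputs: two audio frames and three video frames of size 2.
audio = [[1.0, 0.0], [0.0, 1.0]]
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = cross_modal_attention(audio, video)
print(weights[0])
```

The output has one fused vector per audio frame, so the layer can be added to an audio encoder without changing the sequence length it passes on.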

Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning

Feb 01, 2020

Sanna Wager, Aparna Khare, Minhua Wu, Kenichi Kumatani, Shiva Sundaram

Abstract: In this work, we investigated the teacher-student training paradigm to train a fully learnable multi-channel acoustic model for far-field automatic speech recognition (ASR). Using a large offline teacher model trained on beamformed audio, we trained a simpler multi-channel student acoustic model used in the speech recognition system. For the student, both multi-channel feature extraction layers and the higher classification layers were jointly trained using the logits from the teacher model. In our experiments, compared to a baseline model trained on about 600 hours of transcribed data, a relative word error rate (WER) reduction of about 27.3% was achieved when using an additional 1800 hours of untranscribed data. We also investigated the benefit of pre-training the multi-channel front end to output the beamformed log-mel filter bank energies (LFBE) using L2 loss. We find that pre-training improves the word error rate by 10.7% when compared to a multi-channel model directly initialized with a beamformer and mel-filter bank coefficients for the front end. Finally, combining pre-training and teacher-student training produces a WER reduction of 31% compared to our baseline.

* To appear in ICASSP 2020
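
Teacher-student training from logits is commonly implemented by matching the student's output distribution to the teacher's. The sketch below uses a KL divergence between softmax distributions for a single frame; the temperature parameter and the example logits are assumptions for illustration, and the abstract does not state which exact loss was used.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL divergence from the teacher's soft targets to the student's distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical logits over three output classes for one frame.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
print(distillation_loss(teacher, student))
```

Because the targets come from the teacher rather than from transcripts, this kind of loss can be computed on untranscribed audio, which is what makes the additional unlabeled data usable.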
