Kazuhito Koishida

Automatic Disfluency Detection from Untranscribed Speech

Nov 01, 2023
Amrit Romana, Kazuhito Koishida, Emily Mower Provost

Speech disfluencies, such as filled pauses or repetitions, are disruptions in the typical flow of speech. Stuttering is a speech disorder characterized by a high rate of disfluencies, but all individuals speak with some disfluencies, and disfluency rates may be increased by factors such as cognitive load. Clinically, automatic disfluency detection may help in treatment planning for individuals who stutter. Outside of the clinic, automatic disfluency detection may serve as a pre-processing step to improve natural language understanding in downstream applications. With this wide range of applications in mind, we investigate language, acoustic, and multimodal methods for frame-level automatic disfluency detection and categorization. Each of these methods relies on audio as an input. First, we evaluate several automatic speech recognition (ASR) systems in terms of their ability to transcribe disfluencies, measured using disfluency error rates. We then use these ASR transcripts as input to a language-based disfluency detection model. We find that disfluency detection performance is largely limited by the quality of the transcripts and alignments. In contrast, an acoustic-based approach that does not require transcription as an intermediate step outperforms the ASR-based language approach. Finally, we present multimodal architectures, which we find improve disfluency detection performance over the unimodal approaches. Ultimately, this work introduces novel approaches for automatic frame-level disfluency detection and categorization. In the long term, this will help researchers incorporate automatic disfluency detection into a range of applications.
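
A minimal sketch of how a frame-level multimodal detector of this kind could be wired up, assuming wav2vec-style acoustic frame features and ASR-aligned word embeddings as inputs; the dimensions, fusion strategy, and label set below are illustrative assumptions rather than the paper's exact architecture:

```python
# Hypothetical frame-level multimodal disfluency classifier (illustrative only).
import torch
import torch.nn as nn

class FrameDisfluencyClassifier(nn.Module):
    def __init__(self, acoustic_dim=768, text_dim=768, hidden=256, num_classes=5):
        super().__init__()
        self.fusion = nn.Linear(acoustic_dim + text_dim, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)  # per-frame disfluency category

    def forward(self, acoustic_frames, aligned_text_frames):
        # acoustic_frames:     (batch, T, acoustic_dim), e.g. wav2vec-style features
        # aligned_text_frames: (batch, T, text_dim), word embeddings repeated over the
        #                      frames each word spans, taken from ASR alignments
        x = torch.cat([acoustic_frames, aligned_text_frames], dim=-1)
        x = torch.relu(self.fusion(x))
        x, _ = self.rnn(x)
        return self.head(x)  # (batch, T, num_classes) frame-level logits
```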

Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

Sep 19, 2023
Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, Somayeh Sojoudi

Diffusion models power a vast majority of text-to-audio (TTA) generation methods. Unfortunately, these models suffer from slow inference due to iterative queries to the underlying denoising network, making them unsuitable for scenarios with inference-time or computational constraints. This work modifies the recently proposed consistency distillation framework to train TTA models that require only a single neural network query. In addition to incorporating classifier-free guidance into the distillation process, we leverage the availability of generated audio during distillation training to fine-tune the consistency TTA model with novel loss functions in the audio space, such as the CLAP score. Our objective and subjective evaluation results on the AudioCaps dataset show that consistency models retain the high generation quality and diversity of diffusion models while reducing the number of queries by a factor of 400.
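
A hedged sketch of what one consistency-distillation training step with classifier-free guidance and an audio-space loss might look like; `teacher_eps`, `solver_step`, `decode_audio`, and `clap_loss` are hypothetical placeholders for components the abstract assumes, not a real API:

```python
# Sketch of a consistency-distillation step with classifier-free guidance
# folded into the teacher, plus an optional audio-space (e.g. CLAP-based) loss.
import torch
import torch.nn.functional as F

def distill_step(student, student_ema, teacher_eps, latents, text_emb, null_emb,
                 t, t_prev, guidance_scale=3.0, lam_audio=0.1,
                 solver_step=None, decode_audio=None, clap_loss=None):
    # Classifier-free guidance on the teacher's noise prediction.
    eps_cond = teacher_eps(latents, t, text_emb)
    eps_uncond = teacher_eps(latents, t, null_emb)
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # One ODE solver step with the guided teacher: x_t -> x_{t_prev}.
    latents_prev = solver_step(latents, eps, t, t_prev)

    # Consistency target: both points should map to the same clean sample.
    pred = student(latents, t, text_emb)
    with torch.no_grad():
        target = student_ema(latents_prev, t_prev, text_emb)
    loss = F.mse_loss(pred, target)

    # Audio-space fine-tuning term on the student's one-step generation.
    if decode_audio is not None and clap_loss is not None:
        audio = decode_audio(pred)
        loss = loss + lam_audio * clap_loss(audio, text_emb)
    return loss
```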

Progressive Knowledge Distillation: Building Ensembles for Efficient Inference

Feb 20, 2023
Don Kurian Dennis, Abhishek Shetty, Anish Sevekari, Kazuhito Koishida, Virginia Smith

We study the problem of progressive distillation: given a large, pre-trained teacher model $g$, we seek to decompose the model into an ensemble of smaller, low-inference-cost student models $f_i$. The resulting ensemble allows for flexibly tuning accuracy vs. inference cost, which is useful for a number of applications in on-device inference. The method we propose, B-DISTIL, relies on an algorithmic procedure that uses function composition over intermediate activations to construct expressive ensembles with similar performance as $g$, but with much smaller student models. We demonstrate the effectiveness of B-DISTIL by decomposing pretrained models across standard image, speech, and sensor datasets. We also provide theoretical guarantees for our method in terms of convergence and generalization.
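
As a rough illustration of the progressive idea, the sketch below trains each new student to fit what the current ensemble still misses relative to the teacher's logits; this is a simplified, boosting-style reading, and the paper's B-DISTIL procedure additionally composes functions over intermediate activations:

```python
# Simplified progressive-ensemble distillation sketch (not the exact B-DISTIL algorithm).
import torch
import torch.nn.functional as F

def train_progressive_ensemble(teacher, make_student, loader, num_students,
                               epochs=1, lr=1e-3, device="cpu"):
    teacher.eval()
    ensemble = []
    for _ in range(num_students):
        student = make_student().to(device)
        opt = torch.optim.Adam(student.parameters(), lr=lr)
        for _ in range(epochs):
            for x, _ in loader:
                x = x.to(device)
                with torch.no_grad():
                    target = teacher(x)
                    for s in ensemble:          # logits of students trained so far
                        target = target - s(x)  # residual the new student must cover
                loss = F.mse_loss(student(x), target)
                opt.zero_grad(); loss.backward(); opt.step()
        student.eval()
        ensemble.append(student)
    return ensemble  # predict with sum(s(x) for s in ensemble)
```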

SCP-GAN: Self-Correcting Discriminator Optimization for Training Consistency Preserving Metric GAN on Speech Enhancement Tasks

Oct 26, 2022
Vasily Zadorozhnyy, Qiang Ye, Kazuhito Koishida

In recent years, Generative Adversarial Networks (GANs) have produced significantly improved results in speech enhancement (SE) tasks; however, they are difficult to train. In this work, we introduce several improvements to GAN training schemes that can be applied to most GAN-based SE models. We propose using consistency loss functions, which target the inconsistency in the time and time-frequency domains caused by the Fourier and inverse Fourier transforms. We also present self-correcting optimization for training a GAN discriminator on SE tasks, which helps avoid "harmful" training directions for parts of the discriminator loss function. We have tested our proposed methods on several state-of-the-art GAN-based SE models and obtained consistent improvements, including new state-of-the-art results on the Voice Bank+DEMAND dataset.

* 5 pages (4 - manuscript and 1 - references), 2 figures, 2 tables 
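
A small sketch of a consistency penalty of the kind described above: the enhanced spectrogram is passed through the inverse STFT and back through the STFT, and any mismatch introduced by the round trip is penalized. The STFT parameters are illustrative assumptions:

```python
# Illustrative time / time-frequency consistency penalty for spectrogram-domain SE models.
import torch
import torch.nn.functional as F

N_FFT, HOP, WIN = 512, 128, 512

def consistency_loss(enhanced_spec, window=None):
    # enhanced_spec: complex STFT produced by the SE model, shape (batch, freq, frames)
    if window is None:
        window = torch.hann_window(WIN, device=enhanced_spec.device)
    wav = torch.istft(enhanced_spec, n_fft=N_FFT, hop_length=HOP,
                      win_length=WIN, window=window)
    respec = torch.stft(wav, n_fft=N_FFT, hop_length=HOP, win_length=WIN,
                        window=window, return_complex=True)
    frames = min(enhanced_spec.shape[-1], respec.shape[-1])
    return F.l1_loss(torch.view_as_real(respec[..., :frames]),
                     torch.view_as_real(enhanced_spec[..., :frames]))
```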

Augmented Contrastive Self-Supervised Learning for Audio Invariant Representations

Dec 21, 2021
Melikasadat Emami, Dung Tran, Kazuhito Koishida

Improving generalization is a major challenge in audio classification due to labeled data scarcity. Self-supervised learning (SSL) methods tackle this by leveraging unlabeled data to learn useful features for downstream classification tasks. In this work, we propose an augmented contrastive SSL framework to learn invariant representations from unlabeled data. Our method applies various perturbations to the unlabeled input data and utilizes contrastive learning to learn representations robust to such perturbations. Experimental results on the AudioSet and DESED datasets show that our framework significantly outperforms state-of-the-art SSL and supervised learning methods on sound/event classification tasks.

* 4 pages, 4 figures 
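
The contrastive objective over two perturbed views of the same clip can be sketched as a standard NT-Xent (SimCLR-style) loss; the specific perturbations and encoder are left out and are assumptions of this illustration:

```python
# NT-Xent contrastive loss over two augmented views of each audio clip.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    # z1, z2: (batch, dim) embeddings of two perturbed views of each clip
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # (2B, dim)
    sim = z @ z.t() / temperature                            # cosine similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float("-inf"))                    # drop self-similarity
    # The positive for sample i is its other view: i + n (mod 2n).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```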

A Training Framework for Stereo-Aware Speech Enhancement using Deep Neural Networks

Dec 09, 2021
Bahareh Tolooshams, Kazuhito Koishida

Deep learning-based speech enhancement has shown unprecedented performance in recent years. The most popular mono speech enhancement frameworks are end-to-end networks that map the noisy mixture to an estimate of the clean speech. With growing computational power and the availability of multichannel microphone recordings, prior works have aimed to incorporate spatial statistics along with spectral information to boost performance. Despite improvements in mono enhancement performance, spatial image preservation and subjective evaluation have not received much attention in the literature. This paper proposes a novel stereo-aware framework for speech enhancement, i.e., a training loss for deep learning-based speech enhancement that preserves the spatial image while enhancing the stereo mixture. The proposed framework is model-independent, so it can be applied to any deep learning-based architecture. We provide an extensive objective and subjective evaluation of the trained models through a listening test. We show that by regularizing with an image-preservation loss, the overall performance is improved and the stereo aspect of the speech is better preserved.

* Submitted to ICASSP 2022 
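
A hedged sketch of what a stereo-aware training loss could look like: a per-channel enhancement term plus a penalty that asks the enhanced stereo pair to preserve the interchannel level and phase relations of the clean signal. The exact image-preservation terms used in the paper may differ:

```python
# Illustrative stereo-aware loss: enhancement term + spatial-image preservation term.
import torch
import torch.nn.functional as F

def stereo_aware_loss(est, clean, alpha=0.1, eps=1e-8):
    # est, clean: (batch, 2, time) stereo waveforms (channel 0 = L, 1 = R)
    enhancement = F.l1_loss(est, clean)

    def ild(x):
        # interchannel level difference in dB per example
        energy = x.pow(2).sum(dim=-1) + eps              # (batch, 2)
        return 10 * torch.log10(energy[:, 0] / energy[:, 1])

    def cross_spectrum(x):
        # interchannel cross-spectrum; encodes level and phase relations
        window = torch.hann_window(512, device=x.device)
        spec = torch.stft(x.reshape(-1, x.shape[-1]), n_fft=512, hop_length=128,
                          window=window, return_complex=True)
        spec = spec.view(x.shape[0], 2, *spec.shape[1:])  # (batch, 2, freq, frames)
        return spec[:, 0] * spec[:, 1].conj()

    image = F.l1_loss(ild(est), ild(clean)) + F.l1_loss(
        torch.view_as_real(cross_spectrum(est)),
        torch.view_as_real(cross_spectrum(clean)))
    return enhancement + alpha * image
```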

Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features

Dec 08, 2021
Trung Dang, Dung Tran, Peter Chin, Kazuhito Koishida

Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristics of an utterance to match an unseen target speaker without relying on parallel training data. Recently, self-supervised learning of speech representations has been shown to produce useful linguistic units without using transcripts, which can be passed directly to a VC model. In this paper, we show that high-quality audio samples can be achieved by using a length-resampling decoder, which enables the VC model to work in conjunction with different linguistic feature extractors and vocoders without requiring them to operate on the same sequence length. We show that our method outperforms many baselines on the VCTK dataset. Without modifying the architecture, we further demonstrate that a) using pairs of different audio segments from the same speaker, b) adding a cycle consistency loss, and c) adding a speaker classification loss can help learn a better speaker embedding. Our model trained on LibriTTS using these techniques achieves the best performance, producing audio samples that transfer well to the target speaker's voice while preserving the linguistic content, with a Character Error Rate comparable to that of actual human utterances.
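
The length-resampling idea can be illustrated as a simple interpolation of the linguistic feature sequence to the decoder's frame rate; linear interpolation and the frame rates below are illustrative choices, not the paper's exact design:

```python
# Resample linguistic features along time so the decoder/vocoder can run at its own frame rate.
import torch
import torch.nn.functional as F

def resample_length(features, target_len):
    # features: (batch, T_in, dim) linguistic units from a self-supervised extractor
    x = features.transpose(1, 2)                          # (batch, dim, T_in)
    x = F.interpolate(x, size=target_len, mode="linear", align_corners=False)
    return x.transpose(1, 2)                              # (batch, target_len, dim)

# Example: 50 features/s resampled to a ~86 frames/s mel grid for a vocoder.
feats = torch.randn(1, 100, 256)          # ~2 s at 50 feature frames per second
mel_frames = resample_length(feats, 172)  # ~2 s at ~86 mel frames per second
```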

Interspeech 2021 Deep Noise Suppression Challenge

Jan 10, 2021
Chandan K A Reddy, Harishchandra Dubey, Kazuhito Koishida, Arun Nair, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, Sriram Srinivasan

The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. We recently organized a DNS challenge special session at INTERSPEECH and ICASSP 2020. We open-sourced training and test datasets for the wideband scenario. We also open-sourced a subjective evaluation framework based on ITU-T standard P.808, which was also used to evaluate participants of the challenge. Many researchers from academia and industry made significant contributions to push the field forward, yet even the best noise suppressor was far from achieving superior speech quality in challenging scenarios. In this version of the challenge, organized at INTERSPEECH 2021, we are expanding both our training and test datasets to accommodate full-band scenarios. The two tracks in this challenge will focus on real-time denoising for (i) wideband and (ii) full-band scenarios. We are also making available a reliable non-intrusive objective speech quality metric called DNSMOS for participants to use during their development phase.

* arXiv admin note: substantial text overlap with arXiv:2009.06122 

Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning"

Jun 30, 2020
Saeed Amizadeh, Hamid Palangi, Oleksandr Polozov, Yichen Huang, Kazuhito Koishida

Visual reasoning tasks such as visual question answering (VQA) require an interplay of visual perception with reasoning about the question semantics grounded in perception. Various benchmarks for reasoning across language and vision like VQA, VCR and more recently GQA for compositional question answering facilitate scientific progress from perception models to visual reasoning. However, recent advances are still primarily driven by perception improvements (e.g. scene graph generation) rather than reasoning. Neuro-symbolic models such as Neural Module Networks bring the benefits of compositional reasoning to VQA, but they are still entangled with visual representation learning, and thus neural reasoning is hard to improve and assess on its own. To address this, we propose (1) a framework to isolate and evaluate the reasoning aspect of VQA separately from its perception, and (2) a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception. To this end, we introduce a differentiable first-order logic formalism for VQA that explicitly decouples question answering from visual perception. On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models leading to informative insights regarding the participating models as well as the task.

* To be published in Proceedings of the 37th International Conference on Machine Learning (ICML), Vienna, Austria, PMLR 119, 2020 
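
For flavor, a generic sketch of differentiable first-order-logic style operators over per-object probabilities (product t-norm conjunction, probabilistic-sum disjunction, soft existential quantifier); this illustrates the decoupling of reasoning from perception but is not the paper's exact formalism:

```python
# Soft logic operators over per-object probabilities (generic fuzzy-logic sketch).
import torch

def soft_and(p, q):            # both predicates hold
    return p * q

def soft_or(p, q):             # at least one predicate holds
    return p + q - p * q

def soft_exists(p, dim=-1):    # "some object satisfies the predicate"
    return 1 - torch.prod(1 - p, dim=dim)

# Example: hypothetical per-object probabilities from a perception module,
# answering "is there a red sphere?" under imperfect perception.
p_red = torch.tensor([0.9, 0.2, 0.05])
p_sphere = torch.tensor([0.8, 0.9, 0.1])
answer = soft_exists(soft_and(p_red, p_sphere))   # scalar probability of "yes"
```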