Tatsuya Kawahara

Zero- and Few-shot Sound Event Localization and Detection

Sep 17, 2023
Kazuki Shimada, Kengo Uchida, Yuichiro Koyama, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji, Tatsuya Kawahara

Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well on various sets of target classes, but they only output the DOA and temporal activation of preset classes that are trained before inference. To customize target classes after training, we tackle zero- and few-shot SELD tasks, in which we set new classes with a text sample or a few audio samples. While zero-shot sound classification tasks are achievable with embeddings from contrastive language-audio pretraining (CLAP), zero-shot SELD tasks require assigning an activity and a DOA to each embedding, especially in overlapping cases. To tackle the assignment problem in overlapping cases, we propose an embed-ACCDOA model, which is trained to output track-wise CLAP embeddings and the corresponding activity-coupled Cartesian direction-of-arrival (ACCDOA). In our experimental evaluations on zero- and few-shot SELD tasks, the embed-ACCDOA model showed better location-dependent scores than a straightforward combination of the CLAP audio encoder and a DOA estimation model. Moreover, the proposed combination of the embed-ACCDOA model and the CLAP audio encoder with zero- or few-shot samples performed comparably to an official baseline system trained with complete training data on an evaluation dataset.
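
As a minimal sketch of the track-wise formulation, the snippet below assumes a small CRNN backbone, a 512-dimensional CLAP-style embedding, three output tracks, and activity read from the ACCDOA vector norm; all sizes, layer choices, and the activity threshold are illustrative assumptions, not the authors' configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    EMB_DIM, N_TRACKS = 512, 3  # assumed CLAP embedding size and maximum overlapping events

    class EmbedACCDOA(nn.Module):
        """Per frame and per track, predict a CLAP-style embedding plus an
        activity-coupled Cartesian DOA vector whose norm encodes activity."""
        def __init__(self, n_mels=64, hidden=256):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(), nn.AvgPool2d((1, 4)),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AvgPool2d((1, 4)),
            )
            self.rnn = nn.GRU(128 * (n_mels // 16), hidden, batch_first=True, bidirectional=True)
            self.emb_head = nn.Linear(2 * hidden, N_TRACKS * EMB_DIM)  # track-wise embeddings
            self.doa_head = nn.Linear(2 * hidden, N_TRACKS * 3)        # track-wise ACCDOA vectors

        def forward(self, feat):                     # feat: (batch, 4 mic channels, time, mel)
            z = self.cnn(feat)                       # (batch, 128, time, mel // 16)
            z = z.permute(0, 2, 1, 3).flatten(2)     # (batch, time, 128 * mel // 16)
            z, _ = self.rnn(z)
            b, t, _ = z.shape
            return (self.emb_head(z).view(b, t, N_TRACKS, EMB_DIM),
                    self.doa_head(z).view(b, t, N_TRACKS, 3))

    def assign_classes(emb, accdoa, prototypes, act_threshold=0.5):
        """Zero-/few-shot assignment: a track is active when its ACCDOA norm exceeds
        the threshold; its class is the closest prototype by cosine similarity."""
        activity = accdoa.norm(dim=-1) > act_threshold                    # (batch, time, tracks)
        sim = F.cosine_similarity(emb.unsqueeze(-2), prototypes, dim=-1)  # (batch, time, tracks, classes)
        return activity, sim.argmax(dim=-1)

    model = EmbedACCDOA()
    emb, accdoa = model(torch.randn(1, 4, 100, 64))
    prototypes = F.normalize(torch.randn(5, EMB_DIM), dim=-1)  # stand-in for 5 new-class embeddings
    active, classes = assign_classes(emb, accdoa, prototypes)

Consistent with the abstract, the prototypes could be CLAP text embeddings of the new class labels (zero-shot) or averaged CLAP audio embeddings of a few samples per class (few-shot).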

* 5 pages, 4 figures 

Towards Objective Evaluation of Socially-Situated Conversational Robots: Assessing Human-Likeness through Multimodal User Behaviors

Aug 21, 2023
Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara, Gabriel Skantze

This paper tackles the challenging task of evaluating socially situated conversational robots and presents a novel objective evaluation approach that relies on multimodal user behaviors. In this study, our main focus is on assessing the robot's human-likeness as the primary evaluation metric. While previous research has often relied on subjective evaluations from users, our approach evaluates the robot's human-likeness indirectly, based on observable user behaviors, thus enhancing objectivity and reproducibility. To begin, we created an annotated dataset of human-likeness scores, utilizing user behaviors found in an attentive listening dialogue corpus. We then conducted an analysis to determine the correlation between multimodal user behaviors and human-likeness scores, demonstrating the feasibility of our proposed behavior-based evaluation method.
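
As a rough illustration of the behavior-based analysis, one could correlate per-session behavior statistics with annotated human-likeness scores as below; the feature names and the toy data are hypothetical, not the corpus's actual annotations.

    import pandas as pd
    from scipy.stats import spearmanr

    # Hypothetical per-session behavior statistics and annotated human-likeness scores.
    df = pd.DataFrame({
        "backchannel_rate":     [0.12, 0.30, 0.25, 0.08, 0.40],
        "laughter_count":       [1, 4, 3, 0, 6],
        "gaze_at_robot_ratio":  [0.55, 0.80, 0.70, 0.45, 0.85],
        "human_likeness_score": [2.0, 4.5, 4.0, 1.5, 5.0],
    })

    for col in ["backchannel_rate", "laughter_count", "gaze_at_robot_ratio"]:
        rho, p = spearmanr(df[col], df["human_likeness_score"])   # rank correlation per feature
        print(f"{col:>22s}  rho={rho:+.2f}  p={p:.3f}")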

* Accepted by 25th ACM International Conference on Multimodal Interaction (ICMI '23), Late-Breaking Results 

Reasoning before Responding: Integrating Commonsense-based Causality Explanation for Empathetic Response Generation

Jul 28, 2023
Yahui Fu, Koji Inoue, Chenhui Chu, Tatsuya Kawahara

Recent approaches to empathetic response generation try to incorporate commonsense knowledge or reasoning about the causes of emotions to better understand the user's experiences and feelings. However, these approaches mainly focus on understanding the causalities of the context from the user's perspective, ignoring the system's perspective. In this paper, we propose a commonsense-based causality explanation approach for diverse empathetic response generation that considers both the user's perspective (the user's desires and reactions) and the system's perspective (the system's intentions and reactions). We enhance ChatGPT's ability to reason from the system's perspective by integrating in-context learning with commonsense knowledge. We then integrate the commonsense-based causality explanation with both ChatGPT and a T5-based model. Experimental evaluations demonstrate that our method outperforms other comparable methods on both automatic and human evaluations.
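
A minimal sketch of the overall idea, with an invented explanation template and prompt builder; the actual prompts, commonsense relations, and model calls used in the paper may differ.

    # Illustrative only: combine a two-perspective causality explanation with the
    # dialogue context before generating the empathetic response.
    def causality_explanation(user_desire, user_reaction, sys_intention, sys_reaction):
        return (
            f"User perspective: the user desires {user_desire} and reacts with {user_reaction}. "
            f"System perspective: the system intends {sys_intention} and reacts with {sys_reaction}."
        )

    def build_prompt(dialogue_history, explanation):
        turns = "\n".join(dialogue_history)
        return ("Reason about the emotional causes below, then reply empathetically.\n"
                f"{explanation}\n{turns}\nSystem:")

    history = ["User: I failed my driving test again.",
               "System: Oh no, that sounds rough."]
    expl = causality_explanation("to pass the test", "frustration",
                                 "to comfort the user", "sympathy")
    print(build_prompt(history, expl))

The resulting prompt would then be given to ChatGPT via in-context learning or prepended to the input of a T5-based generator.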

Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

May 18, 2023
Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji

Diffusion-based speech enhancement (SE) has been investigated recently, but its decoding is very time-consuming. One solution is to initialize the decoding process with the enhanced feature estimated by a predictive SE system. However, this two-stage method ignores the complementarity between predictive and diffusion SE. In this paper, we propose a unified system that integrates these two SE modules. The system encodes both generative and predictive information, and then applies both generative and predictive decoders, whose outputs are fused. Specifically, the two SE modules are fused in the first and final diffusion steps: the first-step fusion initializes the diffusion process with the predictive SE output to improve convergence, and the final-step fusion combines the two complementary SE outputs to improve SE performance. Experiments on the Voice-Bank dataset show that the diffusion score estimation can benefit from the predictive information and speed up decoding.
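
The two fusion points can be sketched schematically as follows; predictive_se and score_model are stand-in callables, and the Euler-style update, noise scale, and fusion weight are assumptions rather than the paper's actual sampler.

    import torch

    def enhance(noisy, predictive_se, score_model, n_steps=30, alpha=0.5):
        x = predictive_se(noisy)              # first-step fusion: start from the predictive estimate
        x = x + 0.1 * torch.randn_like(x)     # small noise so the reverse process has room to refine
        for t in reversed(range(1, n_steps + 1)):
            tau = torch.full((x.shape[0],), t / n_steps)             # diffusion time for this step
            x = x + (1.0 / n_steps) * score_model(x, noisy, tau)     # one reverse-diffusion update
        return alpha * x + (1 - alpha) * predictive_se(noisy)        # final-step fusion of both outputs

    # Toy usage with dummy modules:
    dummy_predictive = lambda y: 0.9 * y
    dummy_score = lambda x, y, tau: y - x
    wav = torch.randn(2, 16000)
    print(enhance(wav, dummy_predictive, dummy_score).shape)         # torch.Size([2, 16000])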

Time-domain Speech Enhancement Assisted by Multi-resolution Frequency Encoder and Decoder

Mar 26, 2023
Hao Shi, Masato Mimura, Longbiao Wang, Jianwu Dang, Tatsuya Kawahara

Time-domain speech enhancement (SE) has recently been intensively investigated. Among recent works, DEMUCS introduces a multi-resolution STFT loss to enhance performance. However, some of the resolutions used for the STFT contain non-stationary signals, and it is challenging to learn multi-resolution frequency losses simultaneously with only one output. To make better use of multi-resolution frequency information, we supplement the time-domain encoders with multiple spectrograms computed with different frame lengths, which extract stationary frequency information at both narrowband and wideband resolutions. We also adopt multiple decoder outputs, each of which computes its corresponding resolution's frequency loss. Experimental results show that (1) it is more effective to fuse stationary frequency features than non-stationary features in the encoder, and (2) multiple outputs consistent with the frequency losses improve performance. Experiments on the Voice-Bank dataset show that the proposed method obtained a 0.14 PESQ improvement.
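
A minimal sketch of pairing each decoder output with its own STFT resolution, assuming three frame lengths and an L1 magnitude loss; the exact resolutions and loss form in the paper may differ.

    import torch

    def stft_mag(x, n_fft, hop):
        win = torch.hann_window(n_fft, device=x.device)
        return torch.stft(x, n_fft, hop, window=win, return_complex=True).abs()

    def multi_res_loss(decoder_outputs, clean,
                       resolutions=((512, 128), (1024, 256), (2048, 512))):
        # Each decoder output is paired with its own STFT resolution (frame length, hop).
        loss = 0.0
        for est, (n_fft, hop) in zip(decoder_outputs, resolutions):
            loss = loss + torch.nn.functional.l1_loss(stft_mag(est, n_fft, hop),
                                                      stft_mag(clean, n_fft, hop))
        return loss / len(resolutions)

    outs = [torch.randn(2, 16000) for _ in range(3)]   # one enhanced waveform batch per decoder output
    print(multi_res_loss(outs, torch.randn(2, 16000)))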

I Know Your Feelings Before You Do: Predicting Future Affective Reactions in Human-Computer Dialogue

Mar 17, 2023
Yuanchao Li, Koji Inoue, Leimin Tian, Changzeng Fu, Carlos Ishi, Hiroshi Ishiguro, Tatsuya Kawahara, Catherine Lai

Current Spoken Dialogue Systems (SDSs) often serve as passive listeners that respond only after receiving user speech. To achieve human-like dialogue, we propose a novel future prediction architecture that allows an SDS to anticipate future affective reactions based on its current behaviors before the user speaks. In this work, we investigate two scenarios: speech and laughter. In speech, we propose to predict the user's future emotion based on its temporal relationship with the system's current emotion and its causal relationship with the system's current Dialogue Act (DA). In laughter, we propose to predict the occurrence and type of the user's laughter using the system's laughter behaviors in the current turn. Preliminary analysis of human-robot dialogue demonstrated synchronicity in the emotions and laughter displayed by the human and robot, as well as DA-emotion causality in their dialogue. This verifies that our architecture can contribute to the development of an anticipatory SDS.
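
Purely as an illustration of the task framing in the speech scenario (not the paper's model), predicting the user's next-turn emotion from the system's current emotion and dialogue act could be set up as below, with made-up labels and toy data.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical system-side features per turn and the user's next-turn emotion.
    turns = [
        {"sys_emotion": "joy", "sys_da": "question"},
        {"sys_emotion": "neutral", "sys_da": "statement"},
        {"sys_emotion": "joy", "sys_da": "backchannel"},
        {"sys_emotion": "sadness", "sys_da": "statement"},
    ]
    user_next_emotion = ["joy", "neutral", "joy", "sadness"]

    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(turns, user_next_emotion)
    print(model.predict([{"sys_emotion": "joy", "sys_da": "question"}]))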

* Accepted to CHI2023 Late-Breaking Work 

Alzheimer's Dementia Detection through Spontaneous Dialogue with Proactive Robotic Listeners

Nov 15, 2022
Yuanchao Li, Catherine Lai, Divesh Lala, Koji Inoue, Tatsuya Kawahara

As the aging of society continues to accelerate, Alzheimer's Disease (AD) has received increasing attention over the past decade, not only from medicine but also from other fields such as computer science. Since speech is considered an effective way to diagnose cognitive decline, AD detection from speech has emerged as a hot topic. Nevertheless, such approaches fail to tackle several key issues: 1) AD is a complex neurocognitive disorder, so it is inappropriate to conduct AD detection using utterance information alone while ignoring dialogue information; 2) utterances of AD patients contain many disfluencies that affect speech recognition yet are helpful for diagnosis; 3) AD patients tend to speak less, causing dialogue breakdown as the disease progresses, which leads to a small number of utterances and may bias detection. Therefore, in this paper, we propose a novel AD detection architecture consisting of two major modules: an ensemble AD detector and a proactive listener. This architecture can be embedded in the dialogue system of conversational robots for healthcare.
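
A heavily simplified sketch of the ensemble idea, fusing an utterance-level score with a dialogue-level score into one AD decision; both scorers, the cues, and the fusion weight are placeholders rather than the paper's detector.

    # Illustrative only: combine an utterance-level detector (e.g., per-utterance
    # disfluency cues) with a dialogue-level detector (e.g., turn-taking and
    # breakdown statistics). Both scorers return AD probabilities in [0, 1].
    def ensemble_ad_score(utterances, dialogue_stats, utt_model, dlg_model, w=0.5):
        utt_score = sum(utt_model(u) for u in utterances) / max(len(utterances), 1)
        return w * utt_score + (1 - w) * dlg_model(dialogue_stats)

    # Toy usage with dummy scorers:
    dummy_utt = lambda u: 0.8 if "..." in u else 0.2       # hypothetical disfluency cue
    dummy_dlg = lambda s: min(1.0, s["pause_ratio"])       # hypothetical long-pause cue
    print(ensemble_ad_score(["I went to the ... the", "yes"], {"pause_ratio": 0.6},
                            dummy_utt, dummy_dlg))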

* Accepted for HRI2022 Late-Breaking Report 

Non-autoregressive Error Correction for CTC-based ASR with Phone-conditioned Masked LM

Sep 08, 2022
Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

Connectionist temporal classification (CTC)-based models are attractive for automatic speech recognition (ASR) because of their non-autoregressive nature. To take advantage of text-only data, language model (LM) integration approaches such as rescoring and shallow fusion have been widely used for CTC. However, they lose CTC's non-autoregressive nature because of the need for beam search, which slows down inference. In this study, we propose an error correction method with a phone-conditioned masked LM (PC-MLM). In the proposed method, less confident word tokens in the greedy decoded output from CTC are masked. The PC-MLM then predicts these masked word tokens given the unmasked words and phones supplementally predicted from CTC. We further extend it to a Deletable PC-MLM to address insertion errors. Since both CTC and PC-MLM are non-autoregressive models, the method enables fast LM integration. Experimental evaluations on the Corpus of Spontaneous Japanese (CSJ) and TED-LIUM2 in a domain adaptation setting show that the proposed method outperforms rescoring and shallow fusion in terms of inference speed, and also in terms of recognition accuracy on CSJ.
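
A conceptual sketch of the correction step, where pc_mlm stands in for the phone-conditioned masked LM and the confidence threshold is an assumption.

    # Mask low-confidence words in the greedy CTC hypothesis and let the masked LM
    # fill all masked positions in one non-autoregressive pass, conditioned on the
    # unmasked words and the phone sequence predicted by CTC.
    def correct(words, word_confidences, phones, pc_mlm, threshold=0.9):
        masked = [w if c >= threshold else "[MASK]" for w, c in zip(words, word_confidences)]
        filled = pc_mlm(masked, phones)
        return [f if m == "[MASK]" else m for m, f in zip(masked, filled)]

    # Toy usage with a dummy LM that replaces the mask with a phone-based guess:
    dummy_lm = lambda masked, phones: ["weather" if w == "[MASK]" else w for w in masked]
    print(correct(["the", "whether", "is", "nice"], [0.95, 0.42, 0.97, 0.93],
                  ["dh ah", "w eh dh er", "ih z", "n ay s"], dummy_lm))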

* Accepted at Interspeech 2022 