Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuki Mitsufuji

Zero- and Few-shot Sound Event Localization and Detection

Sep 17, 2023
Kazuki Shimada, Kengo Uchida, Yuichiro Koyama, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji, Tatsuya Kawahara

Figure 1 for Zero- and Few-shot Sound Event Localization and Detection

Figure 2 for Zero- and Few-shot Sound Event Localization and Detection

Figure 3 for Zero- and Few-shot Sound Event Localization and Detection

Figure 4 for Zero- and Few-shot Sound Event Localization and Detection

Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and temporal activation of preset classes that are trained before inference. To customize target classes after training, we tackle zero- and few-shot SELD tasks, in which we set new classes with a text sample or a few audio samples. While zero-shot sound classification tasks are achievable by embedding from contrastive language-audio pretraining (CLAP), zero-shot SELD tasks require assigning an activity and a DOA to each embedding, especially in overlapping cases. To tackle the assignment problem in overlapping cases, we propose an embed-ACCDOA model, which is trained to output track-wise CLAP embedding and corresponding activity-coupled Cartesian direction-of-arrival (ACCDOA). In our experimental evaluations on zero- and few-shot SELD tasks, the embed-ACCDOA model showed a better location-dependent scores than a straightforward combination of the CLAP audio encoder and a DOA estimation model. Moreover, the proposed combination of the embed-ACCDOA model and CLAP audio encoder with zero- or few-shot samples performed comparably to an official baseline system trained with complete train data in an evaluation dataset.

* 5 pages, 4 figures

Via

Access Paper or Ask Questions

VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

Sep 13, 2023
Carlos Hernandez-Olivan, Koichi Saito, Naoki Murata, Chieh-Hsin Lai, Marco A. Martínez-Ramirez, Wei-Hsiang Liao, Yuki Mitsufuji

Figure 1 for VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

Figure 2 for VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

Figure 3 for VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

Figure 4 for VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

Restoring degraded music signals is essential to enhance audio quality for downstream music manipulation. Recent diffusion-based music restoration methods have demonstrated impressive performance, and among them, diffusion posterior sampling (DPS) stands out given its intrinsic properties, making it versatile across various restoration tasks. In this paper, we identify that there are potential issues which will degrade current DPS-based methods' performance and introduce the way to mitigate the issues inspired by diverse diffusion guidance techniques including the RePaint (RP) strategy and the Pseudoinverse-Guided Diffusion Models ($\Pi$GDM). We demonstrate our methods for the vocal declipping and bandwidth extension tasks under various levels of distortion and cutoff frequency, respectively. In both tasks, our methods outperform the current DPS-based music restoration benchmarks. We refer to \url{http://carlosholivan.github.io/demos/audio-restoration-2023.html} for examples of the restored audio samples.

Via

Access Paper or Ask Questions

BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

Sep 06, 2023
Takashi Shibuya, Yuhta Takida, Yuki Mitsufuji

Figure 1 for BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

Figure 2 for BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

Figure 3 for BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

Figure 4 for BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. In the literature, it has been demonstrated that slicing adversarial network (SAN), an improved GAN training framework that can find the optimal projection, is effective in the image generation task. In this paper, we investigate the effectiveness of SAN in the vocoding task. For this purpose, we propose a scheme to modify least-squares GAN, which most GAN-based vocoders adopt, so that their loss functions satisfy the requirements of SAN. Through our experiments, we demonstrate that SAN can improve the performance of GAN-based vocoders, including BigVGAN, with small modifications. Our code is available at https://github.com/sony/bigvsan.

* Submitted to ICASSP 2024

Via

Access Paper or Ask Questions

Enhancing Semantic Communication with Deep Generative Models -- An ICASSP Special Session Overview

Sep 05, 2023
Eleonora Grassucci, Yuki Mitsufuji, Ping Zhang, Danilo Comminiello

Semantic communication is poised to play a pivotal role in shaping the landscape of future AI-driven communication systems. Its challenge of extracting semantic information from the original complex content and regenerating semantically consistent data at the receiver, possibly being robust to channel corruptions, can be addressed with deep generative models. This ICASSP special session overview paper discloses the semantic communication challenges from the machine learning perspective and unveils how deep generative models will significantly enhance semantic communication frameworks in dealing with real-world complex data, extracting and exploiting semantic information, and being robust to channel corruptions. Alongside establishing this emerging field, this paper charts novel research pathways for the next generative semantic communication frameworks.

* Submitted to IEEE ICASSP

Via

Access Paper or Ask Questions

The Sound Demixing Challenge 2023 $\unicode{x2013}$ Cinematic Demixing Track

Aug 14, 2023
Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Mikhail Sukhovei, Yuki Mitsufuji

$Figure 1 for The Sound Demixing Challenge 2023 $\unicode{x2013}$ Cinematic Demixing Track$

$Figure 2 for The Sound Demixing Challenge 2023 $\unicode{x2013}$ Cinematic Demixing Track$

$Figure 3 for The Sound Demixing Challenge 2023 $\unicode{x2013}$ Cinematic Demixing Track$

$Figure 4 for The Sound Demixing Challenge 2023 $\unicode{x2013}$ Cinematic Demixing Track$

This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. Especially, we detail CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most successful approaches employed by participants. Compared to the cocktail-fork baseline, the best-performing system trained exclusively on the simulated Divide and Remaster (DnR) dataset achieved an improvement of 1.8dB in SDR whereas the top performing system on the open leaderboard, where any data could be used for training, saw a significant improvement of 5.7dB.

* under review

Via

Access Paper or Ask Questions

The Sound Demixing Challenge 2023 $\unicode{x2013}$ Music Demixing Track

Aug 14, 2023
Giorgio Fabbro, Stefan Uhlich, Chieh-Hsin Lai, Woosung Choi, Marco Martínez-Ramírez, Weihsiang Liao, Igor Gadelha, Geraldo Ramos, Eddie Hsu, Hugo Rodrigues, Fabian-Robert Stöter, Alexandre Défossez, Yi Luo, Jianwei Yu, Dipam Chakraborty, Sharada Mohanty, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Nabarun Goswami, Tatsuya Harada, Minseok Kim, Jun Hyung Lee, Yuanliang Dong, Xinran Zhang, Jiafeng Liu, Yuki Mitsufuji

$Figure 1 for The Sound Demixing Challenge 2023 $\unicode{x2013}$ Music Demixing Track$

$Figure 2 for The Sound Demixing Challenge 2023 $\unicode{x2013}$ Music Demixing Track$

$Figure 3 for The Sound Demixing Challenge 2023 $\unicode{x2013}$ Music Demixing Track$

$Figure 4 for The Sound Demixing Challenge 2023 $\unicode{x2013}$ Music Demixing Track$

This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge (SDX'23). We provide a summary of the challenge setup and introduce the task of robust music source separation (MSS), i.e., training MSS models in the presence of errors in the training data. We propose a formalization of the errors that can occur in the design of a training dataset for MSS systems and introduce two new datasets that simulate such errors: SDXDB23_LabelNoise and SDXDB23_Bleeding1. We describe the methods that achieved the highest scores in the competition. Moreover, we present a direct comparison with the previous edition of the challenge (the Music Demixing Challenge 2021): the best performing system under the standard MSS formulation achieved an improvement of over 1.6dB in signal-to-distortion ratio over the winner of the previous competition, when evaluated on MDXDB21. Besides relying on the signal-to-distortion ratio as objective metric, we also performed a listening test with renowned producers/musicians to study the perceptual quality of the systems and report here the results. Finally, we provide our insights into the organization of the competition and our prospects for future editions.

* under review

Via

Access Paper or Ask Questions

Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

Jul 10, 2023
Keisuke Toyama, Taketo Akama, Yukara Ikemiya, Yuhta Takida, Wei-Hsiang Liao, Yuki Mitsufuji

Figure 1 for Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

Figure 2 for Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

Figure 3 for Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

Figure 4 for Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

Taking long-term spectral and temporal dependencies into account is essential for automatic piano transcription. This is especially helpful when determining the precise onset and offset for each note in the polyphonic piano content. In this case, we may rely on the capability of self-attention mechanism in Transformers to capture these long-term dependencies in the frequency and time axes. In this work, we propose hFT-Transformer, which is an automatic music transcription method that uses a two-level hierarchical frequency-time Transformer architecture. The first hierarchy includes a convolutional block in the time axis, a Transformer encoder in the frequency axis, and a Transformer decoder that converts the dimension in the frequency axis. The output is then fed into the second hierarchy which consists of another Transformer encoder in the time axis. We evaluated our method with the widely used MAPS and MAESTRO v3.0.0 datasets, and it demonstrated state-of-the-art performance on all the F1-scores of the metrics among Frame, Note, Note with Offset, and Note with Offset and Velocity estimations.

* 8 pages, 6 figures, to be published in ISMIR2023

Via

Access Paper or Ask Questions

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Jun 15, 2023
Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji

Figure 1 for STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Figure 2 for STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Figure 3 for STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Figure 4 for STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also serves human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results show that the audio-visual SELD system achieves lower localization error than the audio-only system. The data is available at https://zenodo.org/record/7880637.

* 25 pages, 8 figures

Via

Access Paper or Ask Questions

On the Equivalence of Consistency-Type Models: Consistency Models, Consistent Diffusion Models, and Fokker-Planck Regularization

Jun 01, 2023
Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Naoki Murata, Yuki Mitsufuji, Stefano Ermon

Figure 1 for On the Equivalence of Consistency-Type Models: Consistency Models, Consistent Diffusion Models, and Fokker-Planck Regularization

Figure 2 for On the Equivalence of Consistency-Type Models: Consistency Models, Consistent Diffusion Models, and Fokker-Planck Regularization

The emergence of various notions of ``consistency'' in diffusion models has garnered considerable attention and helped achieve improved sample quality, likelihood estimation, and accelerated sampling. Although similar concepts have been proposed in the literature, the precise relationships among them remain unclear. In this study, we establish theoretical connections between three recent ``consistency'' notions designed to enhance diffusion models for distinct objectives. Our insights offer the potential for a more comprehensive and encompassing framework for consistency-type models.

Via

Access Paper or Ask Questions

Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

May 18, 2023
Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji

Figure 1 for Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

Figure 2 for Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

Figure 3 for Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

Figure 4 for Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

Diffusion-based speech enhancement (SE) has been investigated recently, but its decoding is very time-consuming. One solution is to initialize the decoding process with the enhanced feature estimated by a predictive SE system. However, this two-stage method ignores the complementarity between predictive and diffusion SE. In this paper, we propose a unified system that integrates these two SE modules. The system encodes both generative and predictive information, and then applies both generative and predictive decoders, whose outputs are fused. Specifically, the two SE modules are fused in the first and final diffusion steps: the first step fusion initializes the diffusion process with the predictive SE for improving the convergence, and the final step fusion combines the two complementary SE outputs to improve the SE performance. Experiments on the Voice-Bank dataset show that the diffusion score estimation can benefit from the predictive information and speed up the decoding.

Via

Access Paper or Ask Questions