Abstract: Diffusion models have recently achieved impressive results in reconstructing images from noisy inputs, and similar ideas have been applied to speech enhancement by treating time-frequency representations as images. With the ubiquity of multi-microphone devices, we extend state-of-the-art diffusion-based methods to exploit multichannel inputs for improved performance. Multichannel diffusion-based enhancement remains in its infancy, with prior work making limited use of advanced mechanisms such as attention for spatial modeling, a gap this paper addresses. We propose AMDM-SE, an Attention-based Multichannel Diffusion Model for Speech Enhancement, designed specifically for noise reduction. AMDM-SE leverages spatial inter-channel information through a novel cross-channel time-frequency attention block, enabling faithful reconstruction of fine-grained signal details within a generative diffusion framework. On the CHiME-3 benchmark, AMDM-SE outperforms a single-channel diffusion baseline, a multichannel diffusion model without attention, and a strong DNN-based predictive method. Experiments on simulated data further underscore the importance of the proposed multichannel attention mechanism. Overall, our results show that incorporating targeted multichannel attention into diffusion models substantially improves noise reduction. While multichannel diffusion-based speech enhancement is still an emerging field, our work contributes a new and complementary approach to the growing body of research in this direction.
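
To make the cross-channel attention idea concrete, the following is a minimal PyTorch sketch of a cross-channel time-frequency attention block: for every TF bin, the microphone channels attend to one another before being fed back into the enhancement network. The tensor layout, embedding size, number of heads, and residual arrangement are illustrative assumptions, not the exact AMDM-SE design.

import torch
import torch.nn as nn


class CrossChannelTFAttention(nn.Module):
    """Attend across microphone channels independently at each TF bin (sketch)."""

    def __init__(self, embed_dim: int = 64, num_heads: int = 4):
        super().__init__()
        # Project each channel's (real, imag) STFT value to a shared embedding.
        self.proj_in = nn.Linear(2, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.proj_out = nn.Linear(embed_dim, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, mics, freq, time, 2) with real/imag STFT parts in the last dim.
        b, m, f, t, _ = x.shape
        # Treat the microphones of each TF bin as a short token sequence.
        tokens = x.permute(0, 2, 3, 1, 4).reshape(b * f * t, m, 2)
        h = self.proj_in(tokens)                     # (b*f*t, mics, embed_dim)
        attn_out, _ = self.attn(h, h, h)             # cross-channel self-attention
        h = self.norm(h + attn_out)                  # residual connection + norm
        out = self.proj_out(h).reshape(b, f, t, m, 2).permute(0, 3, 1, 2, 4)
        return x + out                               # residual over the input STFT


if __name__ == "__main__":
    stft = torch.randn(2, 6, 257, 100, 2)            # toy 6-microphone input
    print(CrossChannelTFAttention()(stft).shape)     # torch.Size([2, 6, 257, 100, 2])

In a diffusion-based enhancer, such a block would typically be interleaved with the score network's existing layers so that spatial cues are injected at every denoising step; that placement is likewise an assumption here.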



Abstract: In this paper, we propose a model that generates a singing voice from a normal speech utterance by harnessing zero-shot, many-to-many style transfer learning. Our goal is to give anyone the opportunity to sing any song in a timely manner. We present a system comprising several readily available building blocks together with a modified auto-encoder, and show how this highly complex challenge can be met by tailoring rather simple solutions together. We demonstrate the applicability of the proposed system in a listening test with 25 non-expert listeners. Samples generated by our model are provided.
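
The sketch below illustrates the kind of modified auto-encoder commonly used for zero-shot, many-to-many style transfer: a content encoder with a narrow bottleneck that squeezes out source-style information, and a decoder conditioned on a target-style embedding. All module sizes, the GRU backbone, and the conditioning scheme are illustrative assumptions and are not taken from the paper.

import torch
import torch.nn as nn


class StyleTransferAE(nn.Module):
    """Bottleneck auto-encoder conditioned on a target-style embedding (sketch)."""

    def __init__(self, n_mels: int = 80, bottleneck: int = 32, style_dim: int = 256):
        super().__init__()
        # Content encoder: the narrow bottleneck discourages copying style/timbre.
        self.encoder = nn.GRU(n_mels, bottleneck, num_layers=2, batch_first=True)
        # Decoder: reconstructs mel frames from content plus the target-style embedding.
        self.decoder = nn.GRU(bottleneck + style_dim, 512, num_layers=2, batch_first=True)
        self.out = nn.Linear(512, n_mels)

    def forward(self, mel: torch.Tensor, style_emb: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels); style_emb: (batch, style_dim), e.g. from a
        # pretrained speaker/singer encoder, so unseen targets are handled zero-shot.
        content, _ = self.encoder(mel)                               # (B, T, bottleneck)
        style = style_emb.unsqueeze(1).expand(-1, mel.size(1), -1)   # broadcast over time
        hidden, _ = self.decoder(torch.cat([content, style], dim=-1))
        return self.out(hidden)                                      # converted mel frames


if __name__ == "__main__":
    mel = torch.randn(4, 200, 80)           # mel-spectrogram of the source speech
    target = torch.randn(4, 256)            # embedding of the target singing style
    print(StyleTransferAE()(mel, target).shape)   # torch.Size([4, 200, 80])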




Abstract: Speech separation involves extracting an individual speaker's voice from a multi-speaker audio signal. The increasing complexity of real-world environments, where multiple speakers may converse simultaneously, underscores the importance of effective speech separation techniques. This work presents a single-microphone speaker separation network with time-frequency (TF) attention, aimed at noisy and reverberant environments. We dub this new architecture the Separation TF Attention Network (Sep-TFAnet). In addition, we present a variant of the separation network, dubbed $\text{Sep-TFAnet}^{\text{VAD}}$, which incorporates a voice activity detector (VAD) into the separation network. The separation module is based on a temporal convolutional network (TCN) backbone inspired by the Conv-TasNet architecture, with multiple modifications. Rather than a learned encoder and decoder, we use the short-time Fourier transform (STFT) and its inverse (iSTFT) for analysis and synthesis, respectively. Our system is specifically developed for human-robot interaction and is designed to support online processing. The separation capabilities of $\text{Sep-TFAnet}^{\text{VAD}}$ and Sep-TFAnet were evaluated and extensively analyzed under several acoustic conditions, demonstrating their advantages over competing methods. Since separation networks trained on simulated data tend to perform poorly on real recordings, we also demonstrate the ability of the proposed scheme to generalize better to realistic examples recorded in our acoustic lab by a humanoid robot. Project page: https://Sep-TFAnet.github.io
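
As a rough illustration of the overall recipe (STFT analysis, a TCN backbone, mask estimation, an optional frame-wise VAD head, and iSTFT synthesis), the following PyTorch sketch shows a mask-based separator in that spirit. Layer sizes, the sigmoid-mask strategy, the non-causal convolutions, and the VAD head design are illustrative assumptions rather than the published Sep-TFAnet architecture, and causality constraints needed for online operation are omitted.

import torch
import torch.nn as nn


class TCNBlock(nn.Module):
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.PReLU(),
            nn.BatchNorm1d(channels),
        )

    def forward(self, x):
        return x + self.net(x)                        # residual dilated conv block


class MaskSeparator(nn.Module):
    """STFT-domain mask estimation with a TCN backbone and a VAD head (sketch)."""

    def __init__(self, n_fft: int = 512, num_spk: int = 2, channels: int = 256):
        super().__init__()
        self.n_fft, self.num_spk = n_fft, num_spk
        n_bins = n_fft // 2 + 1
        self.encoder = nn.Conv1d(n_bins, channels, kernel_size=1)
        self.tcn = nn.Sequential(*[TCNBlock(channels, 2 ** d) for d in range(6)])
        self.mask_head = nn.Conv1d(channels, num_spk * n_bins, kernel_size=1)
        self.vad_head = nn.Conv1d(channels, num_spk, kernel_size=1)  # frame-wise VAD logits

    def forward(self, wav: torch.Tensor):
        # wav: (batch, samples); analysis with the STFT instead of a learned encoder.
        window = torch.hann_window(self.n_fft, device=wav.device)
        spec = torch.stft(wav, self.n_fft, hop_length=self.n_fft // 2,
                          window=window, return_complex=True)        # (B, F, T)
        feats = self.tcn(self.encoder(spec.abs()))                    # (B, C, T)
        masks = torch.sigmoid(self.mask_head(feats))                  # (B, S*F, T)
        masks = masks.view(wav.size(0), self.num_spk, -1, spec.size(-1))
        vad = torch.sigmoid(self.vad_head(feats))                     # (B, S, T)
        # Apply per-speaker masks to the mixture spectrum, then synthesize with the iSTFT.
        est_specs = masks * spec.unsqueeze(1)
        est_wavs = torch.istft(est_specs.reshape(-1, *spec.shape[1:]),
                               self.n_fft, hop_length=self.n_fft // 2,
                               window=window, length=wav.size(-1))
        return est_wavs.view(wav.size(0), self.num_spk, -1), vad


if __name__ == "__main__":
    mix = torch.randn(2, 16000)                       # 1 s of 16 kHz audio
    wavs, vad = MaskSeparator()(mix)
    print(wavs.shape, vad.shape)                      # (2, 2, 16000), (2, 2, T)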