Minje Kim

Native Multi-Band Audio Coding within Hyper-Autoencoded Reconstruction Propagation Networks

Mar 14, 2023
Darius Petermann, Inseon Jang, Minje Kim

Spectral sub-bands do not carry the same perceptual relevance. In audio coding, it is therefore desirable to have independent control over each of the constituent bands so that bitrate assignment and signal reconstruction can be carried out efficiently. In this work, we present a novel neural audio coding network that natively supports a multi-band coding paradigm. Our model extends the idea of compressed skip connections in the U-Net-based codec, allowing for independent control over the core- and high-band-specific reconstructions and their bit allocation. Our system reconstructs the full-band signal mainly from the condensed core-band code, exploiting its bandwidth extension capability to the fullest. Meanwhile, the low-bitrate high-band code assists the high-band reconstruction, similarly to spectral band replication in MPEG audio codecs. MUSHRA tests show that the proposed model not only improves the quality of the core band by explicitly assigning more bits to it but also retains good quality in the high band.

* Accepted to ICASSP 2023. For resources and examples, see https://saige.sice.indiana.edu/research-projects/HARP-Net/ 
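
As a rough illustration of the multi-band paradigm described in the abstract, the sketch below splits the input into a core band and a high band, compresses each through its own bottleneck with a different bit budget, and fuses the two reconstructions. All module names, layer sizes, and the straight-through rounding quantizer are assumptions for illustration, not the paper's actual architecture.

```python
# Hypothetical multi-band codec sketch: more bits for the core band, a small
# helper code for the high band, and a fusion of the two reconstructions.
import torch
import torch.nn as nn

class BandCodec(nn.Module):
    def __init__(self, code_dim: int, bits: int):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv1d(1, 32, 9, stride=4, padding=4),
                                 nn.ReLU(),
                                 nn.Conv1d(32, code_dim, 9, stride=4, padding=4))
        self.dec = nn.Sequential(nn.ConvTranspose1d(code_dim, 32, 8, stride=4, padding=2),
                                 nn.ReLU(),
                                 nn.ConvTranspose1d(32, 1, 8, stride=4, padding=2))
        self.levels = 2 ** bits  # coarser grid -> fewer bits per code value

    def quantize(self, z):
        # Straight-through rounding onto a uniform grid in [-1, 1].
        zq = torch.clamp(z, -1, 1)
        zq = torch.round((zq + 1) / 2 * (self.levels - 1)) / (self.levels - 1) * 2 - 1
        return z + (zq - z).detach()

class MultiBandCodec(nn.Module):
    def __init__(self):
        super().__init__()
        self.core = BandCodec(code_dim=64, bits=6)   # most bits go to the core band
        self.high = BandCodec(code_dim=16, bits=3)   # low-bitrate helper code
        self.fuse = nn.Conv1d(2, 1, 1)               # merge the two reconstructions

    def forward(self, x_core, x_high):               # each band: (B, 1, T)
        zc = self.core.quantize(self.core.enc(x_core))
        zh = self.high.quantize(self.high.enc(x_high))
        yc = self.core.dec(zc)                       # core band drives bandwidth extension
        yh = self.high.dec(zh)                       # refinement for the high band
        return self.fuse(torch.cat([yc, yh], dim=1))
```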

The Potential of Neural Speech Synthesis-based Data Augmentation for Personalized Speech Enhancement

Nov 14, 2022
Anastasia Kuznetsova, Aswin Sivaraman, Minje Kim

With the advances in deep learning, speech enhancement systems have benefited from large neural network architectures and achieved state-of-the-art quality. However, speaker-agnostic methods are not always desirable, both in terms of quality and complexity, when they are to be used in a resource-constrained environment. One promising alternative is personalized speech enhancement (PSE), a smaller and easier speech enhancement problem for small models to solve, because it focuses on a particular test-time user. To achieve the personalization goal while dealing with the typical lack of personal data, we investigate the effect of data augmentation based on neural speech synthesis (NSS). We show that the quality of the NSS system's synthetic data matters: if the synthetic data are good enough, the augmented dataset can be used to train a PSE system that outperforms the speaker-agnostic baseline. The proposed PSE systems show significant complexity reduction while preserving the enhancement quality.
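
A minimal sketch of the augmentation idea, assuming a speaker-adapted text-to-speech model and a small denoising model are available as generic callables (the paper does not tie the method to a specific NSS or PSE implementation); the mixing at a target SNR and the L1 fine-tuning objective are illustrative choices.

```python
# Synthesize personal "clean" speech, corrupt it with noise, fine-tune the PSE model.
from typing import Callable, Sequence
import torch
import torch.nn.functional as F

def augment_and_personalize(tts: Callable[[str], torch.Tensor],
                            scripts: Sequence[str],
                            noises: Sequence[torch.Tensor],
                            student: torch.nn.Module,
                            snr_db: float = 5.0,
                            epochs: int = 10) -> torch.nn.Module:
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    for _ in range(epochs):
        for text, noise in zip(scripts, noises):
            clean = tts(text)                           # synthetic personal clean speech (1-D)
            noise = noise[: clean.numel()].reshape_as(clean)
            gain = (clean.norm() / (noise.norm() + 1e-8)) * 10 ** (-snr_db / 20)
            noisy = clean + gain * noise                # mix at the requested SNR
            loss = F.l1_loss(student(noisy), clean)     # synthetic clean acts as the target
            opt.zero_grad(); loss.backward(); opt.step()
    return student
```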

Neural Feature Predictor and Discriminative Residual Coding for Low-Bitrate Speech Coding

Nov 04, 2022
Haici Yang, Wootaek Lim, Minje Kim

Low- and ultra-low-bitrate neural speech coding achieves unprecedented coding gain by generating speech signals from compact speech features. This paper introduces additional coding efficiency in neural speech coding by reducing the temporal redundancy in the frame-level feature sequence via a recurrent neural predictor. The prediction yields a low-entropy residual representation, which we code discriminatively based on its contribution to the signal reconstruction. The harmonization of feature prediction and discriminative coding results in a dynamic bit allocation algorithm that spends more bits on unpredictable but rare events. As a result, we develop a scalable, lightweight, low-latency, and low-bitrate neural speech coding system. We demonstrate the advantage of the proposed methods using LPCNet as the neural vocoder. While the proposed method guarantees causality in its prediction, subjective tests and feature-space analysis show that our model achieves superior coding efficiency compared to LPCNet and Lyra V2 at very low bitrates.
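
The sketch below illustrates the general recipe of causal feature prediction followed by residual coding with a crude dynamic bit allocation rule; the GRU size, the 0.1 threshold, and the uniform quantizer are hypothetical stand-ins rather than the paper's configuration.

```python
# Predict each frame's features from decoded history, code only the residual,
# and spend more bits on frames whose residuals are large/unpredictable.
import torch
import torch.nn as nn

class FeaturePredictorCoder(nn.Module):
    def __init__(self, feat_dim: int = 20, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, hidden)        # causal: only past frames are used
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, feats: torch.Tensor, coarse_bits: int = 2, fine_bits: int = 5):
        # feats: (T, feat_dim) frame-level speech features (e.g., cepstra + pitch).
        h = feats.new_zeros(1, self.rnn.hidden_size)
        prev = feats.new_zeros(1, feats.shape[1])
        decoded = []
        for t in range(feats.shape[0]):
            h = self.rnn(prev, h)
            pred = self.proj(h)                        # prediction from decoded history
            residual = feats[t:t + 1] - pred
            # Dynamic bit allocation: unpredictable frames get the finer grid.
            bits = fine_bits if residual.abs().mean() > 0.1 else coarse_bits
            step = 2.0 / (2 ** bits)
            residual_q = torch.round(residual / step) * step
            prev = pred + residual_q                   # decoder-side reconstruction
            decoded.append(prev)
        return torch.cat(decoded, dim=0)
```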

Upmixing via style transfer: a variational autoencoder for disentangling spatial images and musical content

Mar 22, 2022
Haici Yang, Sanna Wager, Spencer Russell, Mike Luo, Minje Kim, Wontak Kim

In the stereo-to-multichannel upmixing problem for music, one of the main tasks is to set the directionality of the instrument sources in the multichannel rendering results. In this paper, we propose a modified variational autoencoder model that learns a latent space to describe the spatial images in multichannel music. We seek to disentangle the spatial images and music content, so the learned latent variables are invariant to the music. At test time, we use the latent variables to control the panning of sources. We propose two upmixing use cases: transferring the spatial images from one song to another and blind panning based on the generative model. We report objective and subjective evaluation results to empirically show that our model captures spatial images separately from music content and achieves transfer-based interactive panning.
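
A hedged sketch of the style-transfer use case: a latent vector summarizes the spatial image of a multichannel excerpt, and at test time one song's content can be rendered with another song's latent. The fully connected layers, the five-channel target, the mono-content decoder input, and the use of magnitude-spectrogram frames are assumptions for illustration.

```python
# VAE-like model: encode multichannel frames into a spatial latent, decode the
# (spatially uninformative) mono content together with that latent.
import torch
import torch.nn as nn

class SpatialVAE(nn.Module):
    def __init__(self, n_ch: int = 5, n_freq: int = 513, z_dim: int = 16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_ch * n_freq, 256), nn.ReLU())
        self.mu, self.logvar = nn.Linear(256, z_dim), nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(n_freq + z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_ch * n_freq))

    def encode(self, multich):                      # multich: (B, n_ch * n_freq)
        h = self.enc(multich)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return z, mu, logvar

    def decode(self, mono, z):                      # mono: (B, n_freq)
        return self.dec(torch.cat([mono, z], dim=-1))

# Style transfer: render song B's content with song A's spatial image.
# z_a, _, _ = model.encode(frames_a_multichannel)
# upmix_b   = model.decode(frames_b_mono, z_a)
```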

SpaIn-Net: Spatially-Informed Stereophonic Music Source Separation

Feb 15, 2022
Darius Petermann, Minje Kim

With recent advances in data-driven approaches using deep neural networks, music source separation has been formulated as an instrument-specific supervised problem. While existing deep learning models implicitly absorb the spatial information conveyed by the multi-channel input signals, we argue that a more explicit and active use of spatial information could not only improve the separation process but also provide an entry point for many user-interaction-based tools. To this end, we introduce a control method based on the stereophonic location of the sources of interest, expressed as the panning angle. We present various conditioning mechanisms, including the use of the raw angle and its derived feature representations, and show that spatial information helps. Our proposed approaches improve the separation performance over location-agnostic architectures by 1.8 dB SI-SDR in our Slakh-based simulated experiments. Furthermore, the proposed methods allow for the disentanglement of same-class instruments, for example, in mixtures containing two guitar tracks. Finally, we also demonstrate that our approach is robust to the incorrect source panning information that the proposed user interaction can introduce.

* To appear in Proc. ICASSP 2022 
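
One plausible way to read "raw angle and its derived feature representations" is a small conditioning network that maps the panning angle to scale-and-shift parameters for intermediate separator features (FiLM-style). The sketch below is an assumption-laden illustration, not the paper's exact conditioning mechanism.

```python
# Condition separator features on a panning angle via learned scale and shift.
import math
import torch
import torch.nn as nn

class AngleConditioner(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 32), nn.ReLU(),
                                 nn.Linear(32, 2 * channels))

    def forward(self, feats: torch.Tensor, angle_deg: torch.Tensor):
        # feats: (B, C, T) intermediate separator features; angle_deg: (B,)
        rad = angle_deg * math.pi / 180.0
        cond = torch.stack([angle_deg / 90.0, torch.sin(rad), torch.cos(rad)], dim=-1)
        gamma, beta = self.mlp(cond).chunk(2, dim=-1)          # (B, C) each
        return feats * gamma.unsqueeze(-1) + beta.unsqueeze(-1)
```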

BLOOM-Net: Blockwise Optimization for Masking Networks Toward Scalable and Efficient Speech Enhancement

Nov 17, 2021
Sunwoo Kim, Minje Kim

In this paper, we present a blockwise optimization method for masking-based networks (BLOOM-Net) for training scalable speech enhancement networks. We design our network with a residual learning scheme and train the internal separator blocks sequentially to obtain a scalable masking-based deep neural network for speech enhancement. Its scalability lets it adjust the run-time complexity to the test-time resource constraints: once deployed, the model can alter its complexity dynamically depending on the test-time environment. To this end, we modularize our models so that they can flexibly accommodate varying needs for enhancement performance and resource constraints, incurring minimal memory or training overhead for the added scalability. Our experiments on speech enhancement demonstrate that the proposed blockwise optimization method achieves the desired scalability with only a slight performance degradation compared to the corresponding models trained end-to-end.

* 5 pages, 3 figures, under review 
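
The sketch below captures the sequential, blockwise training schedule in its simplest form: blocks are trained one at a time with earlier blocks frozen, so inference can stop after any block to trade quality for compute. The block internals, the multiplicative refinement wiring, and the MSE loss are simplified placeholders rather than the BLOOM-Net architecture itself.

```python
# Train mask-estimation blocks one at a time; earlier blocks stay frozen.
import torch
import torch.nn as nn

def make_block(dim: int = 257) -> nn.Module:
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                         nn.Linear(dim, dim), nn.Sigmoid())    # mask in [0, 1]

def train_blockwise(blocks, loader, epochs_per_block: int = 5):
    for i, block in enumerate(blocks):
        for prev in blocks[:i]:                                # freeze already-trained blocks
            for p in prev.parameters():
                p.requires_grad_(False)
        opt = torch.optim.Adam(block.parameters(), lr=1e-3)
        for _ in range(epochs_per_block):
            for noisy_mag, clean_mag in loader:                # magnitude spectrogram frames
                feats = noisy_mag
                for prev in blocks[:i]:
                    feats = feats * prev(feats)                # earlier blocks' refinements
                est = feats * block(feats)
                loss = torch.mean((est - clean_mag) ** 2)
                opt.zero_grad(); loss.backward(); opt.step()
    return blocks

# Usage: blocks = train_blockwise([make_block() for _ in range(3)], loader)
```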

Neural Remixer: Learning to Remix Music with Interactive Control

Jul 28, 2021
Haici Yang, Shivani Firodiya, Nicholas J. Bryan, Minje Kim

The task of manipulating the level and/or effects of individual instruments to recompose a mixture recording, or remixing, is common across a variety of applications such as music production, audio-visual post-production, podcasts, and more. This process, however, traditionally requires access to the individual source recordings, restricting the creative process. To work around this, source separation algorithms can separate a mixture into its respective components, after which a user can adjust their levels and mix them back together. This two-step approach, however, still suffers from audible artifacts and motivates further work. In this work, we seek to learn to remix music directly. To do this, we propose two neural remixing architectures that extend Conv-TasNet to remix either via a) the source estimates directly or b) their latent representations. Both methods leverage a remixing data augmentation scheme as well as a mixture reconstruction loss to achieve an end-to-end separation and remixing process. We evaluate our methods using the Slakh and MUSDB datasets and report both source separation performance and remixing quality. Our results suggest that learning to remix significantly outperforms a strong separation baseline, is particularly useful for small changes, and can provide interactive user controls.
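
A hedged sketch of the first variant (remixing via source estimates): a separator, treated here as a generic module rather than Conv-TasNet itself, produces per-source estimates that are rescaled by user-chosen gains and re-summed, with supervision from a remix built from the ground-truth stems. The second variant, which rescales latent representations instead, is omitted here.

```python
# Remix by scaling estimated sources and summing; supervise against a
# gain-adjusted mix of the ground-truth stems.
import torch
import torch.nn as nn

class Remixer(nn.Module):
    def __init__(self, separator: nn.Module, n_sources: int):
        super().__init__()
        self.separator = separator                   # maps (B, T) -> (B, n_sources, T)
        self.n_sources = n_sources

    def forward(self, mixture: torch.Tensor, gains: torch.Tensor):
        sources = self.separator(mixture)            # estimated stems
        return (gains.view(-1, self.n_sources, 1) * sources).sum(dim=1)

def remix_loss(model, mixture, true_sources, gains):
    # Target remix is built from the ground-truth stems with the same gains.
    target = (gains.view(gains.shape[0], -1, 1) * true_sources).sum(dim=1)
    return torch.mean(torch.abs(model(mixture, gains) - target))
```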

HARP-Net: Hyper-Autoencoded Reconstruction Propagation for Scalable Neural Audio Coding

Jul 23, 2021
Darius Petermann, Seungkwon Beack, Minje Kim

An autoencoder-based codec employs quantization to turn its bottleneck layer activation into bitstrings, a process that hinders information flow between the encoder and decoder parts. To circumvent this issue, we employ additional skip connections between the corresponding pairs of encoder and decoder layers. The assumption is that, in a mirrored autoencoder topology, a decoder layer reconstructs the intermediate feature representation of its corresponding encoder layer. Hence, any additional information directly propagated from the corresponding encoder layer helps the reconstruction. We implement these skip connections in the form of additional autoencoders, each of which is a small codec that compresses the massive data transfer between the paired encoder and decoder layers. We empirically verify that the proposed hyper-autoencoded architecture improves perceptual audio quality compared to an ordinary autoencoder baseline.

* Accepted to the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2021, Mohonk Mountain House, New Paltz, NY 
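
A simplified sketch of the hyper-autoencoded skip connection: each mirrored encoder-decoder pair exchanges its feature map through a tiny bottleneck autoencoder instead of an uncompressed skip. The single skip level, layer sizes, and the tanh bottleneck standing in for quantization are assumptions for illustration.

```python
# One compressed skip connection inside a mirrored encoder-decoder codec.
import torch
import torch.nn as nn

class SkipCodec(nn.Module):
    """Tiny autoencoder compressing one skip connection's feature map."""
    def __init__(self, ch: int, code_ch: int):
        super().__init__()
        self.enc = nn.Conv1d(ch, code_ch, 1)
        self.dec = nn.Conv1d(code_ch, ch, 1)

    def forward(self, feat):
        code = torch.tanh(self.enc(feat))            # stands in for quantization to a bitstring
        return self.dec(code)

class HyperSkipAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv1d(1, 32, 9, stride=2, padding=4)
        self.enc2 = nn.Conv1d(32, 64, 9, stride=2, padding=4)   # main bottleneck
        self.skip = SkipCodec(32, 8)                            # compressed skip for layer 1
        self.dec2 = nn.ConvTranspose1d(64, 32, 8, stride=2, padding=3)
        self.dec1 = nn.ConvTranspose1d(32, 1, 8, stride=2, padding=3)

    def forward(self, x):                            # x: (B, 1, T), T divisible by 4
        h1 = torch.relu(self.enc1(x))
        z = torch.tanh(self.enc2(h1))                # quantized in a real codec
        d2 = torch.relu(self.dec2(z))
        d2 = d2 + self.skip(h1)                      # propagate the compressed encoder feature
        return self.dec1(d2)
```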

Test-Time Adaptation Toward Personalized Speech Enhancement: Zero-Shot Learning with Knowledge Distillation

May 08, 2021
Sunwoo Kim, Minje Kim

In realistic speech enhancement settings for end-user devices, we often encounter only a few speakers and noise types that tend to reoccur in the specific acoustic environment. We propose a novel personalized speech enhancement method to adapt a compact denoising model to this test-time specificity. Our goal in this test-time adaptation is to use no clean speech target of the test speaker, thus fulfilling the requirement for zero-shot learning. To compensate for the lack of clean utterances, we employ the knowledge distillation framework. Instead of the missing clean utterance target, we distill the more advanced denoising results from an overly large teacher model and use them as the pseudo target to train the small student model. This zero-shot learning procedure circumvents the process of collecting users' clean speech, a process with which users are reluctant to comply due to privacy concerns and the technical difficulty of recording clean voice. Experiments on various test-time conditions show that the proposed personalization method achieves significant performance gains compared to larger baseline networks trained on large speaker- and noise-agnostic datasets. In addition, since the compact personalized models can outperform larger general-purpose models, we claim that the proposed method performs model compression with no loss of denoising performance.

* 5 pages, 5 figures, under review 
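
The zero-shot distillation loop described above reduces to a few lines: the large teacher enhances the user's noisy recordings, and its outputs serve as pseudo-clean targets for fine-tuning the compact student. The optimizer settings and L1 objective below are assumptions; only the teacher-as-pseudo-target idea comes from the abstract.

```python
# Test-time adaptation of a small student denoiser with teacher pseudo targets.
import torch
import torch.nn.functional as F

def adapt_student(student, teacher, noisy_batches, steps: int = 100, lr: float = 1e-4):
    teacher.eval()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    it = iter(noisy_batches)
    for _ in range(steps):
        try:
            noisy = next(it)
        except StopIteration:
            it = iter(noisy_batches); noisy = next(it)
        with torch.no_grad():
            pseudo_clean = teacher(noisy)            # teacher's estimate replaces the missing target
        loss = F.l1_loss(student(noisy), pseudo_clean)
        opt.zero_grad(); loss.backward(); opt.step()
    return student
```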