Shlomo Dubnov

TONet: Tone-Octave Network for Singing Melody Extraction from Polyphonic Music

Feb 02, 2022

Ke Chen, Shuai Yu, Cheng-i Wang, Wei Li, Taylor Berg-Kirkpatrick, Shlomo Dubnov

Abstract: Singing melody extraction is an important problem in the field of music information retrieval. Existing methods typically rely on frequency-domain representations to estimate the sung frequencies. However, this design does not lead to human-level performance in the perception of melody information for both tone (pitch-class) and octave. In this paper, we propose TONet, a plug-and-play model that improves both tone and octave perception by leveraging a novel input representation and a novel network architecture. First, we present an improved input representation, the Tone-CFP, that explicitly groups harmonics via a rearrangement of frequency bins. Second, we introduce an encoder-decoder architecture that is designed to obtain a salience feature map, a tone feature map, and an octave feature map. Third, we propose a tone-octave fusion mechanism to improve the final salience feature map. Experiments verify the capability of TONet with various baseline backbone models. Our results show that tone-octave fusion with Tone-CFP can significantly improve the singing voice extraction performance across various datasets, with substantial gains in octave and tone accuracy.
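
The distinction between tone (pitch class) and octave that the abstract relies on can be illustrated with a small sketch. This is not TONet's representation or code; it only assumes standard 12-tone equal temperament with A4 = 440 Hz, and the function names and error labels are hypothetical, chosen for illustration.

```python
import math
from typing import Tuple

A4_HZ = 440.0   # assumed reference tuning
A4_MIDI = 69    # MIDI note number of A4


def tone_and_octave(freq_hz: float) -> Tuple[int, int]:
    """Split a frequency into (pitch class 0-11, octave number)."""
    midi = round(A4_MIDI + 12 * math.log2(freq_hz / A4_HZ))
    return midi % 12, midi // 12 - 1


def classify_error(reference_hz: float, estimate_hz: float) -> str:
    """Label an estimate relative to the reference (hypothetical labels)."""
    ref_tone, ref_octave = tone_and_octave(reference_hz)
    est_tone, est_octave = tone_and_octave(estimate_hz)
    if ref_tone != est_tone:
        return "tone error"
    if ref_octave != est_octave:
        return "octave error"
    return "correct"


print(tone_and_octave(440.0))          # A4
print(classify_error(440.0, 880.0))    # same pitch class, one octave up
print(classify_error(440.0, 466.16))   # neighbouring pitch class
```

An estimate of 880 Hz against a 440 Hz reference keeps the pitch class but misses the octave, which is the kind of error a separate octave accuracy measure is meant to expose.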

* Preprint Version for ICASSP 2022, Singapore

HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

Feb 02, 2022

Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov

Abstract: Audio classification is an important task of mapping audio samples into their corresponding labels. Recently, the transformer model with self-attention mechanisms has been adopted in this field. However, existing audio transformers require large amounts of GPU memory and long training times, while relying on pretrained vision models to achieve high performance, which limits the model's scalability in audio tasks. To combat these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure to reduce the model size and training time. It is further combined with a token-semantic module to map final outputs into class feature maps, thus enabling the model to perform audio event detection (i.e. localization in time). We evaluate HTS-AT on three datasets of audio classification where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and equals the SOTA on Speech Command V2. It also achieves better performance in event localization than the previous CNN-based models. Moreover, HTS-AT requires only 35% of the model parameters and 15% of the training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.

* Preprint version for ICASSP 2022, Singapore

Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data

Jan 12, 2022

Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov

Abstract: Deep learning techniques for separating audio into different sound sources face several challenges. Standard architectures require training separate models for different types of audio sources. Although some universal separators employ a single model to target multiple sources, they have difficulty generalizing to unseen sources. In this paper, we propose a three-component pipeline to train a universal audio source separator from a large but weakly-labeled dataset: AudioSet. First, we propose a transformer-based sound event detection system for processing weakly-labeled training data. Second, we devise a query-based audio separation model that leverages this data for model training. Third, we design a latent embedding processor to encode queries that specify audio targets for separation, allowing for zero-shot generalization. Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled data for training. In addition, the proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training. To evaluate the separation performance, we test our model on MUSDB18, while training on the disjoint AudioSet. We further verify the zero-shot performance by conducting another experiment on audio source types that are held out from training. The model achieves comparable Source-to-Distortion Ratio (SDR) performance to current supervised models in both cases.

* 9 pages, 3 figures, 5 tables, preprint version for Association for the Advancement of Artificial Intelligence Conference, AAAI 2022

Towards Cross-Cultural Analysis using Music Information Dynamics

Nov 24, 2021

Shlomo Dubnov, Kevin Huang, Cheng-i Wang

Abstract: A music piece is both comprehended hierarchically, from sonic events to melodies, and sequentially, in the form of repetition and variation. Music from different cultures establishes different aesthetics by having different style conventions on these two aspects. We propose a framework that could be used to quantitatively compare music from different cultures by looking at these two aspects. The framework is based on a Music Information Dynamics model, a Variable Markov Oracle (VMO), and is extended with a variational representation learning of audio. A variational autoencoder (VAE) is trained to map audio fragments into a latent representation. The latent representation is fed into a VMO. The VMO then learns a clustering of the latent representation via a threshold that maximizes the information rate of the quantized latent representation sequence. This threshold effectively controls the sensibility of the predictive step to acoustic changes, which determines the framework's ability to track repetitions on longer time scales. This approach allows characterization of the overall information contents of a musical signal at each level of acoustic sensibility. Our findings under this framework show that sensibility to subtle acoustic changes is higher for East-Asian musical traditions, while the Western works exhibit longer motivic structures at higher thresholds of differences in the latent space. This suggests that a profile of information contents, analyzed as a function of the level of acoustic detail, can serve as a possible cultural characteristic.

Comparison and Analysis of Deep Audio Embeddings for Music Emotion Recognition

Apr 13, 2021

Eunjeong Koh, Shlomo Dubnov

Abstract: Emotion is a complicated notion present in music that is hard to capture even with fine-tuned feature engineering. In this paper, we investigate the utility of state-of-the-art pre-trained deep audio embedding methods for the Music Emotion Recognition (MER) task. Deep audio embedding methods allow us to efficiently capture high-dimensional features in a compact representation. We implement several multi-class classifiers with deep audio embeddings to predict emotion semantics in music. We investigate the effectiveness of L3-Net and VGGish deep audio embedding methods for music emotion inference over four music datasets. The experiments with several classifiers on the task show that the deep audio embedding solutions can improve the performance of the previous baseline MER models. We conclude that deep audio embeddings represent musical emotion semantics for the MER task without expert human engineering.

* AAAI 2021
* AAAI Workshop on Affective Content Analysis 2021 Camera Ready Version

Bias-Free FedGAN

Mar 17, 2021

Vaikkunth Mugunthan, Vignesh Gokul, Lalana Kagal, Shlomo Dubnov

Abstract: Federated Generative Adversarial Network (FedGAN) is a communication-efficient approach to train a GAN across distributed clients without clients having to share their sensitive training data. In this paper, we experimentally show that FedGAN generates biased data points under non-independent-and-identically-distributed (non-iid) settings. Also, we propose Bias-Free FedGAN, an approach to generate bias-free synthetic datasets using FedGAN. Bias-Free FedGAN has the same communication cost as that of FedGAN. Experimental results on image datasets (MNIST and FashionMNIST) validate our claims.

WaveGuard: Understanding and Mitigating Audio Adversarial Examples

Mar 04, 2021

Shehzeen Hussain, Paarth Neekhara, Shlomo Dubnov, Julian McAuley, Farinaz Koushanfar

Abstract: There has been a recent surge in adversarial attacks on deep-learning-based automatic speech recognition (ASR) systems. These attacks pose new challenges to deep learning security and have raised significant concerns in deploying ASR systems in safety-critical applications. In this work, we introduce WaveGuard: a framework for detecting adversarial inputs that are crafted to attack ASR systems. Our framework incorporates audio transformation functions and analyses the ASR transcriptions of the original and transformed audio to detect adversarial inputs. We demonstrate that our defense framework is able to reliably detect adversarial examples constructed by four recent audio adversarial attacks, with a variety of audio transformation functions. With careful regard for best practices in defense evaluations, we analyze our proposed defense and its strength to withstand adaptive and robust attacks in the audio domain. We empirically demonstrate that audio transformations that recover audio from perceptually informed representations can lead to a strong defense that is robust against an adaptive adversary even in a complete white-box setting. Furthermore, WaveGuard can be used out of the box and integrated directly with any ASR model to efficiently detect audio adversarial examples, without the need for model retraining.
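
The detection idea described above (transcribe the original and the transformed audio, then compare the two transcriptions) can be sketched as follows. This is a minimal illustration, not the WaveGuard implementation: the character error rate metric, the 0.5 threshold, the rounding transform, and the toy ASR stub are all assumptions made for the example.

```python
from typing import Callable, List

Audio = List[float]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def is_adversarial(audio: Audio,
                   transcribe: Callable[[Audio], str],
                   transform: Callable[[Audio], Audio],
                   threshold: float = 0.5) -> bool:
    """Flag the input if its transcription changes a lot after the transform."""
    original_text = transcribe(audio)
    transformed_text = transcribe(transform(audio))
    return char_error_rate(original_text, transformed_text) > threshold


# Toy stand-ins (hypothetical): a "quantisation" transform and an ASR stub
# that is fooled by tiny sub-quantisation perturbations.
def quantize(audio: Audio) -> Audio:
    return [round(s, 1) for s in audio]


def toy_asr(audio: Audio) -> str:
    perturbed = any(abs(s - round(s, 1)) > 1e-6 for s in audio)
    return "open the door" if perturbed else "play music"


print(is_adversarial([0.1, 0.2, 0.3], toy_asr, quantize))     # benign input
print(is_adversarial([0.1004, 0.2, 0.3], toy_asr, quantize))  # perturbed input
```

Because the detector only calls `transcribe` and `transform`, any ASR model can be plugged in without retraining, which matches the out-of-the-box usage the abstract describes.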

* Published as a conference paper at Usenix Security 2021

Cross-modal Adversarial Reprogramming

Feb 15, 2021

Paarth Neekhara, Shehzeen Hussain, Jinglong Du, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley

Abstract: With the abundance of large-scale deep learning models, it has become possible to repurpose pre-trained networks for new tasks. Recent works on adversarial reprogramming have shown that it is possible to repurpose neural networks for alternate tasks without modifying the network architecture or parameters. However, these works only consider original and target tasks within the same data domain. In this work, we broaden the scope of adversarial reprogramming beyond the data modality of the original task. We analyze the feasibility of adversarially repurposing image classification neural networks for Natural Language Processing (NLP) and other sequence classification tasks. We design an efficient adversarial program that maps a sequence of discrete tokens into an image which can be classified to the desired class by an image classification model. We demonstrate that by using highly efficient adversarial programs, we can reprogram image classifiers to achieve competitive performance on a variety of text and sequence classification benchmarks without retraining the network.
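
One way to picture a mapping from a sequence of discrete tokens to an image is a lookup table that assigns each token a small patch and tiles the patches on a grid. The sketch below is an assumption-laden illustration of that idea, not the paper's code: the patch layout, the `tanh` squashing, the random initialisation, and all names are hypothetical, and the training of the table against a frozen image classifier is omitted.

```python
import math
import random
from typing import List

Image = List[List[List[float]]]  # height x width x 3 channels


def init_theta(vocab_size: int, patch: int, seed: int = 0) -> List[List[float]]:
    """One flat patch (patch * patch * 3 values) per vocabulary token."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(patch * patch * 3)]
            for _ in range(vocab_size)]


def tokens_to_image(tokens: List[int], theta: List[List[float]],
                    patch: int, grid: int) -> Image:
    """Tile one patch per token, row by row, into a square image."""
    size = grid * patch
    # Grid cells without a token stay at 0.0.
    image = [[[0.0, 0.0, 0.0] for _ in range(size)] for _ in range(size)]
    for idx, tok in enumerate(tokens[: grid * grid]):
        top, left = (idx // grid) * patch, (idx % grid) * patch
        flat = theta[tok]
        for r in range(patch):
            for c in range(patch):
                base = (r * patch + c) * 3
                # tanh keeps pixel values in [-1, 1].
                image[top + r][left + c] = [math.tanh(v)
                                            for v in flat[base:base + 3]]
    return image


theta = init_theta(vocab_size=5, patch=2)
img = tokens_to_image([0, 3, 3, 1], theta, patch=2, grid=2)
print(len(img), len(img[0]), len(img[0][0]))  # image height, width, channels
```

In this sketch only `theta` would be trainable; the resulting image would be passed to an unchanged image classifier, whose output classes are then mapped to the labels of the sequence task.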

* 12 pages, 3 figures

Deep Music Information Dynamics

Feb 01, 2021

Shlomo Dubnov

Abstract: Music comprises a set of complex simultaneous events organized in time. In this paper we introduce a novel framework that we call Deep Musical Information Dynamics, which combines two parallel streams: a low rate latent representation stream that is assumed to capture the dynamics of a thought process, contrasted with a higher rate information dynamics derived from the musical data itself. Motivated by rate-distortion theories of human cognition, we propose a framework for exploring possible relations between imaginary anticipations existing in the listener's mind and information dynamics of the musical surface itself. This model is demonstrated for the case of symbolic (MIDI) data, as accounting for acoustic surface would require many more layers to capture instrument properties and performance expressive inflections. The mathematical framework is based on variational encoding that first establishes a high rate representation of the musical observations, which is then reduced using a bit-allocation method into a parallel low rate data stream. The combined loss considered here includes both the information rate in terms of time evolution for each stream, and the fidelity of encoding measured in terms of mutual information between the high and low rate representations. In the simulations presented in the paper we are able to juxtapose aspects of latent/imaginary surprisal versus surprisal of the music surface in a manner that is quantifiable and computationally tractable. The set of computational tools is discussed in the paper, suggesting that a trade-off between compression and prediction is an important factor in the analysis and design of time-based music generative models.

* The 2020 Joint Conference on AI Music Creativity, October 19-23, 2020, Royal Institute of Technology (KTH), Stockholm, Sweden

Expressive Neural Voice Cloning

Jan 30, 2021

Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley

Abstract: Voice cloning is the task of learning to synthesize the voice of an unseen speaker from a few samples. While current voice cloning methods achieve promising results in Text-to-Speech (TTS) synthesis for a new voice, these approaches lack the ability to control the expressiveness of synthesized audio. In this work, we propose a controllable voice cloning method that allows fine-grained control over various style aspects of the synthesized speech for an unseen speaker. We achieve this by explicitly conditioning the speech synthesis model on a speaker encoding, pitch contour and latent style tokens during training. Through both quantitative and qualitative evaluations, we show that our framework can be used for various expressive voice cloning tasks using only a few transcribed or untranscribed speech samples for a new speaker. These cloning tasks include style transfer from a reference speech, synthesizing speech directly from text, and fine-grained style control by manipulating the style conditioning variables during inference.

* 12 pages, 2 figures, 2 tables
