"music generation": models, code, and papers

Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis

Apr 28, 2021
Erica Cooper, Xin Wang, Junichi Yamagishi

Speech synthesis and music audio generation from symbolic input differ in many aspects but share some similarities. In this study, we investigate how text-to-speech synthesis techniques can be used for piano MIDI-to-audio synthesis tasks. Our investigation includes Tacotron and neural source-filter waveform models as the basic components, with which we build MIDI-to-audio synthesis systems in similar ways to TTS frameworks. We also include reference systems using conventional sound modeling techniques such as sample-based and physical-modeling-based methods. The subjective experimental results demonstrate that the investigated TTS components can be applied to piano MIDI-to-audio synthesis with minor modifications. The results also reveal the performance bottleneck -- while the waveform model can synthesize high quality piano sound given natural acoustic features, the conversion from MIDI to acoustic features is challenging. The full MIDI-to-audio synthesis system is still inferior to the sample-based or physical-modeling-based approaches, but we encourage TTS researchers to test their TTS models for this new task and improve the performance.

* Submitted to ISCA Speech Synthesis Workshop 2021 

SoundStream: An End-to-End Neural Audio Codec

Jul 07, 2021
Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, Marco Tagliasacchi

We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed of a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and speech enhancement, which combine adversarial and reconstruction losses to allow the generation of high-quality audio content from quantized embeddings. By training with structured dropout applied to quantizer layers, a single model can operate across variable bitrates from 3 kbps to 18 kbps, with a negligible quality loss when compared with models trained at fixed bitrates. In addition, the model is amenable to a low-latency implementation, which supports streamable inference and runs in real time on a smartphone CPU. In subjective evaluations using audio at 24 kHz sampling rate, SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps. Moreover, we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech.
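
As a rough illustration of the residual vector quantizer mentioned in this abstract, the sketch below quantizes a vector in stages, with each stage encoding the residual left by the previous one. Using only the first `n_active` stages is what lets one set of codebooks serve several bitrates. This is a minimal pure-Python sketch, not the SoundStream implementation: the function names, the toy codebooks, and the frame rate and codebook size passed to `bitrate_bps` (75 frames per second, 1024 entries) are assumptions chosen for illustration and are not stated in the abstract.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks, n_active):
    """Quantize vec with the first n_active codebooks, stage by stage."""
    residual = list(vec)
    indices = []
    for codebook in codebooks[:n_active]:
        i = nearest(codebook, residual)
        indices.append(i)
        # The next stage only sees what this stage failed to capture.
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected codewords from each active stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


def bitrate_bps(frame_rate_hz, n_active, codebook_size):
    """Bits per second when each frame stores one index per active stage."""
    bits_per_index = (codebook_size - 1).bit_length()
    return frame_rate_hz * n_active * bits_per_index


# Toy two-stage example (hypothetical codebooks): coarse grid, then finer grid.
coarse = [[0, 0], [1, 0], [0, 1], [1, 1]]
fine = [[0, 0], [0.5, 0], [0, 0.5], [0.5, 0.5]]
codes = rvq_encode([1.4, 0.1], [coarse, fine], n_active=2)
approx = rvq_decode(codes, [coarse, fine])
```

Under the assumed 75 frames per second and 1024-entry codebooks, 4 active stages give 3000 bits per second and 24 stages give 18000, which matches the 3 kbps to 18 kbps range quoted in the abstract.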


Audio representations for deep learning in sound synthesis: A review

Jan 07, 2022
Anastasia Natsiou, Sean O'Leary

The rise of deep learning algorithms has led many researchers to move away from classic signal processing methods for sound generation. Deep learning models have achieved expressive voice synthesis, realistic sound textures, and musical notes from virtual instruments. However, the most suitable deep learning architecture is still under investigation. The choice of architecture is tightly coupled to the audio representations. A sound's original waveform can be too dense and rich for deep learning models to handle efficiently, and this complexity increases training time and computational cost. Also, it does not represent sound in the manner in which it is perceived. Therefore, in many cases, the raw audio has been transformed into a compressed and more meaningful form using upsampling, feature-extraction, or even by adopting a higher level illustration of the waveform. Furthermore, conditional on the form chosen, additional conditioning representations, different model architectures, and numerous metrics for evaluating the reconstructed sound have been investigated. This paper provides an overview of audio representations applied to sound synthesis using deep learning. Additionally, it presents the most significant methods for developing and evaluating a sound synthesis architecture using deep learning models, always depending on the audio representation.


Policy Gradients for General Contextual Bandits

May 22, 2018
Feiyang Pan, Qingpeng Cai, Pingzhong Tang, Fuzhen Zhuang, Qing He

Contextual bandit algorithms have been successfully deployed in various industrial applications for the trade-off between exploration and exploitation and for their state-of-the-art performance on minimizing online costs. However, their applicability is limited by over-simplified assumptions about the problem, such as assuming that the rewards depend linearly on the contexts, or assuming a static environment where the states are not affected by previous actions. In this work, we put forward an alternative method for general contextual bandits using actor-critic neural networks to directly optimize in the policy space, coined policy gradient for contextual bandits (PGCB). It optimizes over a class of policies in which the marginal probability of choosing an arm (in expectation of other arms) has a simple closed form so that the objective is differentiable. In particular, the gradient of this class of policies is in a succinct form. Moreover, we propose two useful heuristic techniques called Time-Dependent Greed and Actor-Dropout. The former ensures that PGCB is empirically greedy in the limit, while the latter balances the trade-off between exploration and exploitation by using the actor-network with dropout as a Bayesian approximation. PGCB can solve contextual bandits in the standard case as well as the Markov Decision Process generalization, where there is a state that decides the distribution of contexts of arms and affects the immediate reward when choosing an arm, and can therefore be applied to a wide range of realistic settings such as personalized recommender systems and natural language generation. We evaluate PGCB on toy datasets as well as a music recommender dataset. Experiments show that PGCB has fast convergence and low regret and outperforms both classic contextual-bandit methods and vanilla policy gradient methods.


Graphical Contrastive Losses for Scene Graph Generation

Mar 28, 2019
Ji Zhang, Kevin J. Shih, Ahmed Elgammal, Andrew Tao, Bryan Catanzaro

Most scene graph generators use a two-stage pipeline to detect visual relationships: the first stage detects entities, and the second predicts the predicate for each entity pair using a softmax distribution. We find that such pipelines, trained with only a cross entropy loss over predicate classes, suffer from two common errors. The first, Entity Instance Confusion, occurs when the model confuses multiple instances of the same type of entity (e.g. multiple cups). The second, Proximal Relationship Ambiguity, arises when multiple subject-predicate-object triplets appear in close proximity with the same predicate, and the model struggles to infer the correct subject-object pairings (e.g. mis-pairing musicians and their instruments). We propose a set of contrastive loss formulations that specifically target these types of errors within the scene graph generation problem, collectively termed the Graphical Contrastive Losses. These losses explicitly force the model to disambiguate related and unrelated instances through margin constraints specific to each type of confusion. We further construct a relationship detector, called RelDN, using the aforementioned pipeline to demonstrate the efficacy of our proposed losses. Our model outperforms the winning method of the OpenImages Relationship Detection Challenge by 4.7% (16.5% relative) on the test set. We also show improved results over the best previous methods on the Visual Genome and Visual Relationship Detection datasets.
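
The margin constraints described in this abstract can be pictured with a generic hinge-style contrastive loss. The sketch below is not the paper's exact formulation: it assumes a hypothetical setup in which one subject has affinity scores toward its truly related objects (`pos_scores`) and toward unrelated ones (`neg_scores`), and it penalizes the model whenever the weakest related score fails to exceed the strongest unrelated score by at least `margin`. The function name and the default margin are illustrative assumptions.

```python
def margin_contrastive_loss(pos_scores, neg_scores, margin=0.2):
    """Hinge loss on the gap between related and unrelated affinity scores.

    The loss is zero once every related pair outscores every unrelated pair
    by at least `margin`; otherwise it grows linearly with the shortfall.
    """
    # Worst case on each side: weakest positive vs. strongest negative.
    gap = min(pos_scores) - max(neg_scores)
    return max(0.0, margin - gap)


# Well-separated instances incur no loss; confusable ones are penalized.
separated = margin_contrastive_loss([0.9, 0.8], [0.1, 0.3])
confused = margin_contrastive_loss([0.5], [0.4])
```

Taking the minimum over positives and the maximum over negatives targets exactly the hardest pairing, which is the kind of case the abstract names (several cups, or several musicians and instruments close together).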


Using Deep learning methods for generation of a personalized list of shuffled songs

Dec 17, 2017
Rushin Gindra, Srushti Kotak, Asmita Natekar, Grishma Sharma

The shuffle mode, where songs are played in a randomized order that is decided upon for all tracks at once, is widely found in music player systems. Only a few music enthusiasts use this mode, since it is either too random to suit their mood or keeps repeating the same list every time. In this paper, we propose to build a convolutional deep belief network (CDBN) that is trained to perform genre recognition based on audio features retrieved from the records of the Million Song Dataset. The learned parameters shall be used to initialize a multi-layer perceptron which takes extracted features of the user's playlist as input alongside the metadata and classifies the songs into various categories. These categories will be shuffled retrospectively based on the metadata to autonomously provide a list that is effective in playing songs that are desired by humans in normal conditions.

* 4 pages, 3 figures, submitted to IEEE Xplore, 12th INDIACom 2018 5th International Conference on Computing for Sustainable Global Development 

Jurassic is (almost) All You Need: Few-Shot Meaning-to-Text Generation for Open-Domain Dialogue

Nov 10, 2021
Lena Reed, Cecilia Li, Angela Ramirez, Liren Wu, Marilyn Walker

One challenge with open-domain dialogue systems is the need to produce truthful, high-quality responses on any topic. We aim to improve the quality and coverage of Athena, an Alexa Prize dialogue system. We experiment with few-shot prompt-based learning, comparing GPT-Neo to Jurassic-1, for the movies, music, TV, sports, and video game domains, both within and cross-domain, with different prompt set sizes (2, 3, 10), formats, and meaning representations consisting of either sets of WikiData KG triples or dialogue acts. Our evaluation uses BLEURT and human metrics, and shows that with 10-shot prompting, Athena-Jurassic's performance is significantly better for coherence and semantic accuracy. Experiments with 2-shot cross-domain prompts result in a huge performance drop for Athena-GPT-Neo, whose semantic accuracy falls to 0.41, and whose untrue hallucination rate increases to 12%. Experiments with dialogue acts for video games show that with 10-shot prompting, both models learn to control dialogue acts, but Athena-Jurassic has significantly higher coherence, and only 4% untrue hallucinations. Our results suggest that Athena-Jurassic produces high enough quality outputs to be useful in live systems with real users. To our knowledge, these are the first results demonstrating that few-shot semantic prompt-based learning can create NLGs that generalize to new domains, and produce high-quality, semantically-controlled, conversational responses directly from meaning representations.

* The 12th International Workshop on Spoken Dialog System Technology, IWSDS 2021 
* Final Conference Proceedings version
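
The few-shot meaning-to-text setup in the abstract above (a prompt holding k example pairs of meaning representation and response, followed by a new meaning representation for the model to complete) might be assembled roughly as follows. The prompt layout, the `Meaning:` and `Response:` labels, the function names, and the placeholder triples are all hypothetical; the abstract does not give the format actually used with Athena.

```python
def format_triples(triples):
    """Render (subject, relation, object) triples as one line of text."""
    return "; ".join(f"({s} | {p} | {o})" for s, p, o in triples)


def build_few_shot_prompt(examples, query_triples, k):
    """Concatenate k worked examples, then the query awaiting a response."""
    lines = []
    for triples, response in examples[:k]:
        lines.append(f"Meaning: {format_triples(triples)}")
        lines.append(f"Response: {response}")
        lines.append("")  # blank line between examples
    lines.append(f"Meaning: {format_triples(query_triples)}")
    lines.append("Response:")
    return "\n".join(lines)


# Placeholder data for illustration only (not taken from the paper).
examples = [
    ([("Song A", "performer", "Artist B")], "Did you know Artist B performed Song A?"),
    ([("Film C", "director", "Person D")], "Film C was directed by Person D."),
]
prompt = build_few_shot_prompt(examples, [("Album E", "genre", "jazz")], k=2)
```

Changing `k` corresponds to the prompt set sizes compared in the abstract (2, 3, 10); the resulting string would then be sent to whichever language model is being evaluated.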