"music generation": models, code, and papers

Weakly Supervised Deep Recurrent Neural Networks for Basic Dance Step Generation

Jul 03, 2018
Nelson Yalta, Shinji Watanabe, Kazuhiro Nakadai, Tetsuya Ogata

A deep recurrent neural network with audio input is applied to model basic dance steps. The proposed model employs multilayered Long Short-Term Memory (LSTM) layers and convolutional layers to process the audio power spectrum. Then, another deep LSTM layer decodes the target dance sequence. This end-to-end approach has an auto-conditioned decode configuration that reduces accumulation of feedback error. Experimental results demonstrate that, after training using a small dataset, the model generates basic dance steps with low cross entropy and maintains a motion beat F-measure score similar to that of a baseline dancer. In addition, we investigate the use of a contrastive cost function for music-motion regulation. This cost function targets motion direction and maps similarities between music frames. Experimental results demonstrate that the cost function improves the motion beat F-score.

* 5 pages, 5 figures 
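
The motion beat F-measure mentioned in the abstract can be illustrated with a generic beat-matching sketch. The function name, the 70 ms tolerance window, and the greedy matching rule are assumptions chosen for illustration and are not taken from the paper's evaluation protocol.

```python
def beat_f_measure(reference, estimated, tolerance=0.07):
    """F-measure between two lists of beat times, in seconds.

    An estimated beat counts as a hit when it lies within `tolerance`
    seconds of a reference beat that has not been matched yet.
    The tolerance value is an illustrative assumption.
    """
    unmatched = sorted(reference)
    hits = 0
    for t in sorted(estimated):
        # Closest reference beat that is still unmatched, if any.
        best = min(unmatched, key=lambda r: abs(r - t), default=None)
        if best is not None and abs(best - t) <= tolerance:
            unmatched.remove(best)
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(estimated)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy example: music beats versus beats extracted from generated motion.
music_beats = [0.5, 1.0, 1.5, 2.0]
motion_beats = [0.52, 1.0, 1.8]
score = beat_f_measure(music_beats, motion_beats)
```

Here two of the three motion beats fall inside the window, so precision is 2/3 and recall is 2/4.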

Codified audio language modeling learns useful representations for music information retrieval

Jul 12, 2021
Rodrigo Castellon, Chris Donahue, Percy Liang

We demonstrate that language models pre-trained on codified (discretely-encoded) music audio learn representations that are useful for downstream MIR tasks. Specifically, we explore representations from Jukebox (Dhariwal et al. 2020): a music generation system containing a language model trained on codified audio from 1M songs. To determine if Jukebox's representations contain useful information for MIR, we use them as input features to train shallow models on several MIR tasks. Relative to representations from conventional MIR models which are pre-trained on tagging, we find that using representations from Jukebox as input features yields 30% stronger performance on average across four MIR tasks: tagging, genre classification, emotion recognition, and key detection. For key detection, we observe that representations from Jukebox are considerably stronger than those from models pre-trained on tagging, suggesting that pre-training via codified audio language modeling may address blind spots in conventional approaches. We interpret the strength of Jukebox's representations as evidence that modeling audio instead of tags provides richer representations for MIR.

* To appear in the proceedings of ISMIR 2021 
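
Training a shallow model on frozen representations, as the abstract describes, can be sketched as a linear probe. The toy two-dimensional vectors below merely stand in for pre-computed features; they are not Jukebox outputs, and the logistic-regression probe and its hyperparameters are assumptions rather than the authors' setup.

```python
import math


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe on fixed feature vectors with SGD.

    The feature extractor is never updated; only the weights `w`
    and the bias `b` of the probe are learned.
    """
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # derivative of the log loss with respect to z
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    """Return the predicted binary label for one feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Hypothetical frozen features for four clips and their binary labels.
features = [[0.0, 1.0], [0.2, 0.9], [1.0, 0.1], [0.9, 0.0]]
labels = [0, 0, 1, 1]
w, b = train_linear_probe(features, labels)
predictions = [predict(w, b, x) for x in features]
```

Because the probe is so small, its accuracy mostly reflects how much task-relevant information the frozen features already contain.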

Deep Interactive Evolution

Jan 24, 2018
Philip Bontrager, Wending Lin, Julian Togelius, Sebastian Risi

This paper describes an approach that combines generative adversarial networks (GANs) with interactive evolutionary computation (IEC). While GANs can be trained to produce lifelike images, they are normally sampled randomly from the learned distribution, providing limited control over the resulting output. On the other hand, interactive evolution has shown promise in creating various artifacts such as images, music and 3D objects, but traditionally relies on a hand-designed evolvable representation of the target domain. The main insight in this paper is that a GAN trained on a specific target domain can act as a compact and robust genotype-to-phenotype mapping (i.e. most produced phenotypes do resemble valid domain artifacts). Once such a GAN is trained, the latent vector given as input to the GAN's generator network can be put under evolutionary control, allowing controllable and high-quality image generation. In this paper, we demonstrate the advantage of this novel approach through a user study in which participants were able to evolve images that strongly resemble specific target images.

* 16 pages, 5 figures, Published at EvoMUSART EvoStar 2018 
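
Putting the generator's latent vector under evolutionary control can be sketched as one breeding step over latent vectors. The uniform crossover, the Gaussian mutation, and every name below are illustrative assumptions; the trained generator that would turn each vector into an image is omitted.

```python
import random


def next_generation(selected, population_size, mutation_std=0.5, rng=random):
    """Breed new latent vectors from the ones the user selected.

    `selected` is a list of equal-length latent vectors (lists of floats).
    Each child mixes two parents gene by gene and is then perturbed
    with Gaussian noise.
    """
    children = []
    while len(children) < population_size:
        a, b = rng.choice(selected), rng.choice(selected)
        # Uniform crossover: every gene is copied from one of the parents.
        child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
        # Gaussian mutation explores the neighbourhood of the parents.
        child = [g + rng.gauss(0.0, mutation_std) for g in child]
        children.append(child)
    return children


# Hypothetical round: the user picked two of the displayed images.
rng = random.Random(0)
picked = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(2)]
population = next_generation(picked, population_size=6, rng=rng)
```

In an interactive loop, each vector in `population` would be passed through the generator, shown to the user, and the newly picked vectors fed back into `next_generation`.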

Smart Home Appliances: Chat with Your Fridge

Dec 19, 2019
Denis Gudovskiy, Gyuri Han, Takuya Yamaguchi, Sotaro Tsukizawa

Current home appliances are capable of executing a limited number of voice commands such as turning devices on or off, adjusting music volume or light conditions. Recent progress in machine reasoning gives an opportunity to develop new types of conversational user interfaces for home appliances. In this paper, we apply a state-of-the-art visual reasoning model and demonstrate that it is feasible to ask a smart fridge about its contents and various properties of the food with a close-to-natural conversation experience. Our visual reasoning model answers user questions about existence, count, category and freshness of each product by analyzing photos taken by the image sensor inside the smart fridge. Users may chat with their fridge using an off-the-shelf phone messenger while being away from home, for example, when shopping in the supermarket. We generate a visually realistic synthetic dataset to train a machine learning reasoning model that achieves 95% answer accuracy on test data. We present the results of initial user tests and discuss how we modify the distribution of generated questions for model training based on human-in-the-loop guidance. We open source code for the whole system including dataset generation, reasoning model and demonstration scripts.

* NeurIPS 2019 demo track 

MP3net: coherent, minute-long music generation from raw audio with a simple convolutional GAN

Jan 12, 2021
Korneel van den Broek

We present a deep convolutional GAN which leverages techniques from MP3/Vorbis audio compression to produce long, high-quality audio samples with long-range coherence. The model uses a Modified Discrete Cosine Transform (MDCT) data representation, which includes all phase information. Phase generation is hence an integral part of the model. We leverage the auditory masking and psychoacoustic perception limit of the human ear to widen the true distribution and stabilize the training process. The model architecture is a deep 2D convolutional network, where each subsequent generator model block increases the resolution along the time axis and adds a higher octave along the frequency axis. The deeper layers are connected with all parts of the output and have the context of the full track. This enables generation of samples which exhibit long-range coherence. We use MP3net to create 95s stereo tracks with a 22kHz sample rate after training for 250h on a single Cloud TPUv2. An additional benefit of the CNN-based model architecture is that generation of new songs is almost instantaneous.

* 11 pages, 8 figures, samples and source code available on 
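
The claim that an MDCT representation carries all phase information can be illustrated with a minimal, unoptimized MDCT round trip. This is a textbook sketch, not MP3net's code: the sine window, the frame size `N` (assumed even), the zero padding, and the helper names are all assumptions.

```python
import math


def sine_window(N):
    """Sine window of length 2N; satisfies w[n]^2 + w[n+N]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]


def mdct(frame):
    """Map 2N windowed samples to N real coefficients."""
    N = len(frame) // 2
    return [sum(frame[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]


def imdct(coeffs):
    """Map N coefficients back to 2N time-aliased samples."""
    N = len(coeffs)
    return [(2.0 / N) * sum(coeffs[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                            for k in range(N))
            for n in range(2 * N)]


def analyze(signal, N):
    """Split a signal (length a multiple of N) into 50%-overlapping MDCT frames."""
    w = sine_window(N)
    padded = [0.0] * N + list(signal) + [0.0] * N
    frames = []
    for start in range(0, len(padded) - N, N):
        block = padded[start:start + 2 * N]
        frames.append(mdct([s * wn for s, wn in zip(block, w)]))
    return frames


def synthesize(frames, N):
    """Window and overlap-add the inverse frames; the aliasing cancels."""
    w = sine_window(N)
    out = [0.0] * (N * (len(frames) + 1))
    for i, coeffs in enumerate(frames):
        block = imdct(coeffs)
        for n in range(2 * N):
            out[i * N + n] += block[n] * w[n]
    return out[N:-N]  # drop the zero padding added in analyze()


signal = [math.sin(0.3 * n) for n in range(16)]
frames = analyze(signal, N=4)
reconstructed = synthesize(frames, N=4)
error = max(abs(a - b) for a, b in zip(signal, reconstructed))
```

Each hop of `N` new samples yields exactly `N` real coefficients, and the waveform is recovered up to floating-point error, so no separate phase channel is needed.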

Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription

Jun 27, 2012
Nicolas Boulanger-Lewandowski, Yoshua Bengio, Pascal Vincent

We investigate the problem of modeling symbolic sequences of polyphonic music in a completely general piano-roll representation. We introduce a probabilistic model based on distribution estimators conditioned on a recurrent neural network that is able to discover temporal dependencies in high-dimensional sequences. Our approach outperforms many traditional models of polyphonic music on a variety of realistic datasets. We show how our musical language model can serve as a symbolic prior to improve the accuracy of polyphonic transcription.

* Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012) 
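
A piano-roll representation of polyphonic music can be sketched as a binary time-by-pitch matrix. The `(pitch, onset, duration)` note format, the 88-key MIDI range 21 to 108, and the function name are assumptions made for illustration, not the paper's data pipeline.

```python
def to_piano_roll(notes, n_steps, low=21, high=108):
    """Convert notes to a binary matrix with one row per time step.

    `notes` holds (midi_pitch, onset_step, duration_steps) tuples.
    roll[t][p - low] is 1 when pitch p sounds at time step t.
    """
    roll = [[0] * (high - low + 1) for _ in range(n_steps)]
    for pitch, onset, duration in notes:
        for t in range(onset, min(onset + duration, n_steps)):
            roll[t][pitch - low] = 1
    return roll


# A C major triad whose top note enters one step late and rings on.
notes = [(60, 0, 2), (64, 0, 2), (67, 1, 3)]
roll = to_piano_roll(notes, n_steps=4)
active_per_step = [sum(row) for row in roll]
```

Each row is a high-dimensional binary vector in which several entries can be active at once, which is why the per-step distribution has to model joint note probabilities rather than a single class.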

Investigating the usefulness of Quantum Blur

Nov 27, 2021
James R. Wootton, Marcel Pfaffhauser

Though some years remain before quantum computation can outperform conventional computation, it already provides resources that can be used for exploratory purposes in various fields. This includes certain tasks for procedural generation in computer games, music and art. The Quantum Blur method was introduced as a proof-of-principle example, to show that it can be useful to design methods for procedural generation using the principles of quantum software. Here we analyse the effects of the method and compare it to conventional blur effects. We also determine how the effects seen derive from the manipulation of quantum superposition and entanglement.

VaPar Synth -- A Variational Parametric Model for Audio Synthesis

Mar 30, 2020
Krishna Subramani, Preeti Rao, Alexandre D'Hooge

With the advent of data-driven statistical modeling and abundant computing power, researchers are turning increasingly to deep learning for audio synthesis. These methods try to model audio signals directly in the time or frequency domain. In the interest of more flexible control over the generated sound, it could be more useful to work with a parametric representation of the signal which corresponds more directly to the musical attributes such as pitch, dynamics and timbre. We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation. We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.

* Accepted in ICASSP 2020

Progressive Generative Adversarial Binary Networks for Music Generation

Mar 12, 2019
Manan Oza, Himanshu Vaghela, Kriti Srivastava

Recent improvements in generative adversarial network (GAN) training techniques prove that progressively training a GAN drastically stabilizes the training and improves the quality of outputs produced. Adding layers after the previous ones have converged has proven to help in better overall convergence and stability of the model as well as reducing the training time by a sufficient amount. Thus, we use this training technique to train the model progressively in the time and pitch domain, i.e., starting from a very small time value and pitch range, we gradually expand the matrix sizes until the end result is a completely trained model giving outputs having tensor sizes [4 (bar) x 96 (time steps) x 84 (pitch values) x 8 (tracks)]. As proven in previously proposed models, deterministic binary neurons also help in improving the results. Thus, we make use of a layer of deterministic binary neurons at the end of the generator to get binary-valued outputs instead of fractional values existing between 0 and 1.
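
A deterministic binary neuron can be sketched as a sigmoid followed by a hard threshold. The 0.5 threshold and the function name are assumptions for illustration; the paper's generator and its training procedure are not reproduced here.

```python
import math


def deterministic_binary_neuron(pre_activation, threshold=0.5):
    """Return 1 if the sigmoid of the input exceeds the threshold, else 0."""
    prob = 1.0 / (1.0 + math.exp(-pre_activation))
    return 1 if prob > threshold else 0


# Hypothetical pre-activations for five pitches at one time step.
pre_activations = [2.0, -1.5, 0.3, -0.2, 4.0]
binary_notes = [deterministic_binary_neuron(v) for v in pre_activations]

# Number of binary entries in one output of the stated final size.
n_entries = 4 * 96 * 84 * 8
```

The same input always produces the same bit, so the output is a note-on/note-off grid rather than values between 0 and 1.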

Modeling Baroque Two-Part Counterpoint with Neural Machine Translation

Jun 29, 2020
Eric P. Nichols, Stefano Kalonaris, Gianluca Micchi, Anna Aljanaki

We propose a system for contrapuntal music generation based on a Neural Machine Translation (NMT) paradigm. We consider Baroque counterpoint and are interested in modeling the interaction between any two given parts as a mapping between a given source material and an appropriate target material. As in translation, the former imposes some constraints on the latter, but doesn't define it completely. We collate and edit a bespoke dataset of Baroque pieces, use it to train an attention-based neural network model, and evaluate the generated output via BLEU score and musicological analysis. We show that our model is able to respond with some idiomatic trademarks, such as imitation and appropriate rhythmic offset, although it falls short of having learned stylistically correct contrapuntal motion (e.g., avoidance of parallel fifths) or stricter imitative rules, such as canon.

* International Computer Music Conference 2020, 5 pages 
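
The BLEU score used for evaluation is built on clipped n-gram precision, which can be sketched over note tokens. This shows only that building block; the full metric also combines several n-gram orders and applies a brevity penalty. The token format and the function name are assumptions, not the authors' evaluation code.

```python
from collections import Counter


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate sequence against one reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference.
    """
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


# Hypothetical generated part versus the part from the dataset.
generated = ["C4", "D4", "E4", "C4"]
target = ["C4", "D4", "E4", "F4"]
unigram_precision = modified_precision(generated, target, 1)
bigram_precision = modified_precision(generated, target, 2)
```

N-gram overlap rewards local agreement with the reference part, which is one reason to pair it with musicological analysis as the abstract does.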