"music generation": models, code, and papers

Generation of Multimedia Artifacts: An Extractive Summarization-based Approach

Aug 13, 2015
Paulo Figueiredo, Marta Aparício, David Martins de Matos, Ricardo Ribeiro

We explore methods for content selection and address the issue of coherence in the context of the generation of multimedia artifacts. We use audio and video to present two case studies: generation of film tributes, and lecture-driven science talks. For content selection, we use centrality-based and diversity-based summarization, along with topic analysis. To establish coherence, we use the emotional content of music, for film tributes, and ensure topic similarity between lectures and documentaries, for science talks. Composition techniques for the production of multimedia artifacts are addressed as a means of organizing content, in order to improve coherence. We discuss our results considering the above aspects.

* 7 pages, 2 figures 
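
The abstract above names centrality-based and diversity-based summarization for content selection. As a rough illustration only, and not the authors' method, the sketch below scores text passages by mean cosine similarity to the others (centrality) and then picks passages greedily while penalizing redundancy, in the style of maximal marginal relevance. The bag-of-words features, the `diversity` weight, and all function names are assumptions made for this example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a deliberately crude stand-in for real features."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centrality_scores(passages):
    """Score each passage by its mean similarity to all other passages."""
    vecs = [bag_of_words(p) for p in passages]
    scores = []
    for i in range(len(vecs)):
        others = [cosine(vecs[i], vecs[j]) for j in range(len(vecs)) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


def select(passages, k, diversity=0.5):
    """Greedy selection trading centrality against redundancy (hypothetical weight)."""
    vecs = [bag_of_words(p) for p in passages]
    central = centrality_scores(passages)
    chosen = []
    while len(chosen) < min(k, len(passages)):
        best, best_score = None, -math.inf
        for i in range(len(passages)):
            if i in chosen:
                continue
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((cosine(vecs[i], vecs[j]) for j in chosen), default=0.0)
            score = (1 - diversity) * central[i] - diversity * redundancy
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen


passages = [
    "the hero fights the villain",
    "the hero fights the villain again",
    "a quiet sunset over the sea",
]
print(select(passages, 2, diversity=0.0))  # centrality only
print(select(passages, 2, diversity=0.5))  # centrality with a redundancy penalty
```

With `diversity=0.0` the two near-duplicate passages are picked; raising it swaps the second pick for the dissimilar passage.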

NONOTO: A Model-agnostic Web Interface for Interactive Music Composition by Inpainting

Jul 23, 2019
Théis Bazin, Gaëtan Hadjeres

Inpainting-based generative modeling allows for stimulating human-machine interactions by letting users perform stylistically coherent local edits to an object using a statistical model. We present NONOTO, a new interface for interactive music generation based on inpainting models. It is aimed both at researchers, by offering a simple and flexible API allowing them to connect their own models with the interface, and at musicians, by providing industry-standard features such as audio playback, real-time MIDI output and straightforward synchronization with DAWs using Ableton Link.

* 3 pages, 1 figure. Published as a conference paper at the 10th International Conference on Computational Creativity (ICCC 2019), UNC Charlotte, North Carolina 

RAVE: A variational autoencoder for fast and high-quality neural audio synthesis

Nov 09, 2021
Antoine Caillon, Philippe Esling

Deep generative models applied to audio have improved the state of the art by a large margin in many speech- and music-related tasks. However, as raw waveform modelling remains an inherently difficult task, audio generative models are either computationally intensive, rely on low sampling rates, are complicated to control or restrict the nature of possible signals. Among those models, Variational AutoEncoders (VAEs) give control over the generation by exposing latent variables, although they usually suffer from low synthesis quality. In this paper, we introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis. We introduce a novel two-stage training procedure, namely representation learning and adversarial fine-tuning. We show that using a post-training analysis of the latent space allows direct control over the trade-off between reconstruction fidelity and representation compactness. By leveraging a multi-band decomposition of the raw waveform, we show that our model is the first able to generate 48 kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU. We evaluate synthesis quality using both quantitative and qualitative subjective experiments and show the superiority of our approach compared to existing models. Finally, we present applications of our model for timbre transfer and signal compression. All of our source code and audio examples are publicly available.
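
To make the speed claim in the abstract above concrete, here is a small worked calculation. It takes only the two figures the abstract states (a 48 kHz sampling rate and a 20x real-time speedup). The function names are made up for this illustration and do not come from the RAVE code base.

```python
def samples_per_second(sample_rate_hz, speedup):
    """Audio samples produced per wall-clock second at a given real-time speedup."""
    return sample_rate_hz * speedup


def seconds_to_render(audio_seconds, speedup):
    """Wall-clock time needed to synthesize a clip of the given duration."""
    return audio_seconds / speedup


SAMPLE_RATE_HZ = 48_000  # stated in the abstract
SPEEDUP = 20             # "20 times faster than real-time"

print(samples_per_second(SAMPLE_RATE_HZ, SPEEDUP))  # 960000 samples per second
print(seconds_to_render(10, SPEEDUP))               # 0.5 seconds for a 10 s clip
```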


Audio-to-symbolic Arrangement via Cross-modal Music Representation Learning

Dec 30, 2021
Ziyu Wang, Dejing Xu, Gus Xia, Ying Shan

Could we automatically derive the score of a piano accompaniment based on the audio of a pop song? This is the audio-to-symbolic arrangement problem we tackle in this paper. A good arrangement model should not only consider the audio content but also have prior knowledge of piano composition (so that the generation "sounds like" the audio while maintaining musicality). To this end, we contribute a cross-modal representation-learning model, which 1) extracts chord and melodic information from the audio, and 2) learns texture representation from both audio and a corrupted ground truth arrangement. We further introduce a tailored training strategy that gradually shifts the source of texture information from corrupted score to audio. In the end, the score-based texture posterior is reduced to a standard normal distribution, and only audio is needed for inference. Experiments show that our model captures major audio information and outperforms baselines in generation quality.


DeepDrummer : Generating Drum Loops using Deep Learning and a Human in the Loop

Aug 26, 2020
Guillaume Alain, Maxime Chevalier-Boisvert, Frederic Osterrath, Remi Piche-Taillefer

DeepDrummer is a drum loop generation tool that uses active learning to learn the preferences (or current artistic intentions) of a human user from a small number of interactions. The principal goal of this tool is to enable an efficient exploration of new musical ideas. We train a deep neural network classifier on audio data and show how it can be used as the core component of a system that generates drum loops based on few prior beliefs as to how these loops should be structured. We aim to build a system that can converge to meaningful results even with a limited number of interactions with the user. This property enables our method to be used from a cold start situation (no pre-existing dataset), or starting from a collection of audio samples provided by the user. In a proof of concept study with 25 participants, we empirically demonstrate that DeepDrummer is able to converge towards the preference of our subjects after a small number of interactions.


Weakly Supervised Deep Recurrent Neural Networks for Basic Dance Step Generation

Jul 03, 2018
Nelson Yalta, Shinji Watanabe, Kazuhiro Nakadai, Tetsuya Ogata

A deep recurrent neural network with audio input is applied to model basic dance steps. The proposed model employs multilayered Long Short-Term Memory (LSTM) layers and convolutional layers to process the audio power spectrum. Then, another deep LSTM layer decodes the target dance sequence. This end-to-end approach has an auto-conditioned decode configuration that reduces accumulation of feedback error. Experimental results demonstrate that, after training using a small dataset, the model generates basic dance steps with low cross entropy and maintains a motion beat F-measure score similar to that of a baseline dancer. In addition, we investigate the use of a contrastive cost function for music-motion regulation. This cost function targets motion direction and maps similarities between music frames. Experimental results demonstrate that the cost function improves the motion beat F-score.

* 5 pages, 5 figures 
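
The abstract above reports a motion beat F-measure. The sketch below shows one common way a beat F-measure can be computed: predicted beat times are matched one-to-one with reference beat times within a tolerance window, and the F-measure is the harmonic mean of precision and recall. The paper's exact definition is not given here, so the matching rule, the 70 ms tolerance, and the function name are assumptions.

```python
def beat_f_measure(predicted, reference, tolerance=0.07):
    """F-measure between two lists of beat times in seconds.

    A predicted beat counts as a hit when it lies within `tolerance`
    seconds of a reference beat that has not been matched yet.
    """
    predicted = sorted(predicted)
    reference = sorted(reference)
    matched = 0
    i = j = 0
    # Walk both sorted lists at once, pairing beats greedily.
    while i < len(predicted) and j < len(reference):
        diff = predicted[i] - reference[j]
        if abs(diff) <= tolerance:
            matched += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # predicted beat is too early to match anything later
        else:
            j += 1  # reference beat was missed
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)


music_beats = [0.5, 1.0, 1.5, 2.0]      # hypothetical reference beats
motion_beats = [0.52, 1.0, 1.8, 2.03]   # hypothetical detected motion beats
print(beat_f_measure(motion_beats, music_beats))
```

In this toy input three of the four motion beats fall within the window, so precision and recall are both 3/4.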

Codified audio language modeling learns useful representations for music information retrieval

Jul 12, 2021
Rodrigo Castellon, Chris Donahue, Percy Liang

We demonstrate that language models pre-trained on codified (discretely-encoded) music audio learn representations that are useful for downstream MIR tasks. Specifically, we explore representations from Jukebox (Dhariwal et al. 2020): a music generation system containing a language model trained on codified audio from 1M songs. To determine if Jukebox's representations contain useful information for MIR, we use them as input features to train shallow models on several MIR tasks. Relative to representations from conventional MIR models which are pre-trained on tagging, we find that using representations from Jukebox as input features yields 30% stronger performance on average across four MIR tasks: tagging, genre classification, emotion recognition, and key detection. For key detection, we observe that representations from Jukebox are considerably stronger than those from models pre-trained on tagging, suggesting that pre-training via codified audio language modeling may address blind spots in conventional approaches. We interpret the strength of Jukebox's representations as evidence that modeling audio instead of tags provides richer representations for MIR.

* To appear in the proceedings of ISMIR 2021 

Deep Interactive Evolution

Jan 24, 2018
Philip Bontrager, Wending Lin, Julian Togelius, Sebastian Risi

This paper describes an approach that combines generative adversarial networks (GANs) with interactive evolutionary computation (IEC). While GANs can be trained to produce lifelike images, they are normally sampled randomly from the learned distribution, providing limited control over the resulting output. On the other hand, interactive evolution has shown promise in creating various artifacts such as images, music and 3D objects, but traditionally relies on a hand-designed evolvable representation of the target domain. The main insight in this paper is that a GAN trained on a specific target domain can act as a compact and robust genotype-to-phenotype mapping (i.e. most produced phenotypes do resemble valid domain artifacts). Once such a GAN is trained, the latent vector given as input to the GAN's generator network can be put under evolutionary control, allowing controllable and high-quality image generation. In this paper, we demonstrate the advantage of this novel approach through a user study in which participants were able to evolve images that strongly resemble specific target images.

* 16 pages, 5 figures, Published at EvoMUSART EvoStar 2018 
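
The abstract above describes putting the generator's latent vector under evolutionary control. A minimal sketch of one generation step follows, assuming uniform crossover and Gaussian mutation as the variation operators. Those operators, the rates, and every name here are illustrative choices, not the paper's reported configuration. `generator` is a placeholder for a trained GAN generator, which would render each latent vector as an image for the user to judge.

```python
import random


def mutate(z, rng, rate=0.5, scale=1.0):
    """Add Gaussian noise to each gene with probability `rate`."""
    return [g + rng.gauss(0.0, scale) if rng.random() < rate else g for g in z]


def crossover(a, b, rng):
    """Uniform crossover: each gene is copied from one parent at random."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def next_generation(population, selected, rng):
    """Build a same-sized population from the indices the user selected."""
    dim = len(population[0])
    parents = [population[i] for i in selected]
    if not parents:
        # Nothing selected: resample every latent vector from the prior.
        return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in population]
    children = []
    while len(children) < len(population):
        a, b = rng.choice(parents), rng.choice(parents)
        children.append(mutate(crossover(a, b, rng), rng))
    return children


def generator(z):
    """Placeholder for a trained GAN generator (latent vector -> image)."""
    return z


rng = random.Random(0)
population = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]
selected = [1, 4]  # indices the user clicked on, hypothetical
population = next_generation(population, selected, rng)
images = [generator(z) for z in population]
print(len(images), len(images[0]))
```

The human stands in for the fitness function: the loop repeats with a fresh selection until the displayed images are close enough to what the user wants.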

Smart Home Appliances: Chat with Your Fridge

Dec 19, 2019
Denis Gudovskiy, Gyuri Han, Takuya Yamaguchi, Sotaro Tsukizawa

Current home appliances are capable of executing a limited number of voice commands such as turning devices on or off, adjusting music volume or light conditions. Recent progress in machine reasoning gives an opportunity to develop new types of conversational user interfaces for home appliances. In this paper, we apply a state-of-the-art visual reasoning model and demonstrate that it is feasible to ask a smart fridge about its contents and various properties of the food with a close-to-natural conversation experience. Our visual reasoning model answers user questions about existence, count, category and freshness of each product by analyzing photos taken by the image sensor inside the smart fridge. Users may chat with their fridge using an off-the-shelf phone messenger while being away from home, for example, when shopping in the supermarket. We generate a visually realistic synthetic dataset to train a machine learning reasoning model that achieves 95% answer accuracy on test data. We present the results of initial user tests and discuss how we modify the distribution of generated questions for model training based on human-in-the-loop guidance. We open-source the code for the whole system, including dataset generation, reasoning model and demonstration scripts.

* NeurIPS 2019 demo track