Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"music generation": models, code, and papers

Comparing Probabilistic Models for Melodic Sequences

Sep 30, 2011
Athina Spiliopoulou, Amos Storkey

Modelling the real world complexity of music is a challenge for machine learning. We address the task of modeling melodic sequences from the same music genre. We perform a comparative analysis of two probabilistic models; a Dirichlet Variable Length Markov Model (Dirichlet-VMM) and a Time Convolutional Restricted Boltzmann Machine (TC-RBM). We show that the TC-RBM learns descriptive music features, such as underlying chords and typical melody transitions and dynamics. We assess the models for future prediction and compare their performance to a VMM, which is the current state of the art in melody generation. We show that both models perform significantly better than the VMM, with the Dirichlet-VMM marginally outperforming the TC-RBM. Finally, we evaluate the short order statistics of the models, using the Kullback-Leibler divergence between test sequences and model samples, and show that our proposed methods match the statistics of the music genre significantly better than the VMM.

* Proceedings of the ECML-PKDD 2011. Lecture Notes in Computer Science, vol. 6913, pp. 289-304. Springer (2011) 
* in Proceedings of the ECML-PKDD 2011. Lecture Notes in Computer Science, vol. 6913, pp. 289-304. Springer (2011) 
Access Paper or Ask Questions

Hierarchical Timbre-Painting and Articulation Generation

Sep 07, 2020
Michael Michelashvili, Lior Wolf

We present a fast and high-fidelity method for music generation, based on specified f0 and loudness, such that the synthesized audio mimics the timbre and articulation of a target instrument. The generation process consists of learned source-filtering networks, which reconstruct the signal at increasing resolutions. The model optimizes a multi-resolution spectral loss as the reconstruction loss, an adversarial loss to make the audio sound more realistic, and a perceptual f0 loss to align the output to the desired input pitch contour. The proposed architecture enables high-quality fitting of an instrument, given a sample that can be as short as a few minutes, and the method demonstrates state-of-the-art timbre transfer capabilities. Code and audio samples are shared at

* accepted in Proc. of the 21st International Society for Music Information Retrieval (ISMIR2020) 
Access Paper or Ask Questions

Making grains tangible: microtouch for microsound

Jul 27, 2021
Staas de Jong

This paper proposes a new research direction for the large family of instrumental musical interfaces where sound is generated using digital granular synthesis, and where interaction and control involve the (fine) operation of stiff, flat contact surfaces. First, within a historical context, a general absence of, and clear need for, tangible output that is dynamically instantiated by the grain-generating process itself is identified. Second, to fill this gap, a concrete general approach is proposed based on the careful construction of non-vibratory and vibratory force pulses, in a one-to-one relationship with sonic grains. An informal pilot psychophysics experiment initiating the approach was conducted, which took into account the two main cases for applying forces to the human skin: perpendicular, and lateral. Initial results indicate that the force pulse approach can enable perceivably multidimensional, tangible display of the ongoing grain-generating process. Moreover, it was found that this can be made to meaningfully happen (in real time) in the same timescale of basic sonic grain generation. This is not a trivial property, and provides an important and positive fundament for further developing this type of enhanced display. It also leads to the exciting prospect of making arbitrary sonic grains actual physical manipulanda.

* Proceedings of the International Conference on New Interfaces for Musical Expression, 2011 
Access Paper or Ask Questions

Generation of Multimedia Artifacts: An Extractive Summarization-based Approach

Aug 13, 2015
Paulo Figueiredo, Marta Aparício, David Martins de Matos, Ricardo Ribeiro

We explore methods for content selection and address the issue of coherence in the context of the generation of multimedia artifacts. We use audio and video to present two case studies: generation of film tributes, and lecture-driven science talks. For content selection, we use centrality-based and diversity-based summarization, along with topic analysis. To establish coherence, we use the emotional content of music, for film tributes, and ensure topic similarity between lectures and documentaries, for science talks. Composition techniques for the production of multimedia artifacts are addressed as a means of organizing content, in order to improve coherence. We discuss our results considering the above aspects.

* 7 pages, 2 figures 
Access Paper or Ask Questions

NONOTO: A Model-agnostic Web Interface for Interactive Music Composition by Inpainting

Jul 23, 2019
Théis Bazin, Gaëtan Hadjeres

Inpainting-based generative modeling allows for stimulating human-machine interactions by letting users perform stylistically coherent local editions to an object using a statistical model. We present NONOTO, a new interface for interactive music generation based on inpainting models. It is aimed both at researchers, by offering a simple and flexible API allowing them to connect their own models with the interface, and at musicians by providing industry-standard features such as audio playback, real-time MIDI output and straightforward synchronization with DAWs using Ableton Link.

* 3 pages, 1 figure. Published as a conference paper at the 10th International Conference on Computational Creativity (ICCC 2019), UNC Charlotte, North Carolina 
Access Paper or Ask Questions

RAVE: A variational autoencoder for fast and high-quality neural audio synthesis

Nov 09, 2021
Antoine Caillon, Philippe Esling

Deep generative models applied to audio have improved by a large margin the state-of-the-art in many speech and music related tasks. However, as raw waveform modelling remains an inherently difficult task, audio generative models are either computationally intensive, rely on low sampling rates, are complicated to control or restrict the nature of possible signals. Among those models, Variational AutoEncoders (VAE) give control over the generation by exposing latent variables, although they usually suffer from low synthesis quality. In this paper, we introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis. We introduce a novel two-stage training procedure, namely representation learning and adversarial fine-tuning. We show that using a post-training analysis of the latent space allows a direct control between the reconstruction fidelity and the representation compactness. By leveraging a multi-band decomposition of the raw waveform, we show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU. We evaluate synthesis quality using both quantitative and qualitative subjective experiments and show the superiority of our approach compared to existing models. Finally, we present applications of our model for timbre transfer and signal compression. All of our source code and audio examples are publicly available.

Access Paper or Ask Questions

Audio-to-symbolic Arrangement via Cross-modal Music Representation Learning

Dec 30, 2021
Ziyu Wang, Dejing Xu, Gus Xia, Ying Shan

Could we automatically derive the score of a piano accompaniment based on the audio of a pop song? This is the audio-to-symbolic arrangement problem we tackle in this paper. A good arrangement model should not only consider the audio content but also have prior knowledge of piano composition (so that the generation "sounds like" the audio and meanwhile maintains musicality.) To this end, we contribute a cross-modal representation-learning model, which 1) extracts chord and melodic information from the audio, and 2) learns texture representation from both audio and a corrupted ground truth arrangement. We further introduce a tailored training strategy that gradually shifts the source of texture information from corrupted score to audio. In the end, the score-based texture posterior is reduced to a standard normal distribution, and only audio is needed for inference. Experiments show that our model captures major audio information and outperforms baselines in generation quality.

Access Paper or Ask Questions

DeepDrummer : Generating Drum Loops using Deep Learning and a Human in the Loop

Aug 26, 2020
Guillaume Alain, Maxime Chevalier-Boisvert, Frederic Osterrath, Remi Piche-Taillefer

DeepDrummer is a drum loop generation tool that uses active learning to learn the preferences (or current artistic intentions) of a human user from a small number of interactions. The principal goal of this tool is to enable an efficient exploration of new musical ideas. We train a deep neural network classifier on audio data and show how it can be used as the core component of a system that generates drum loops based on few prior beliefs as to how these loops should be structured. We aim to build a system that can converge to meaningful results even with a limited number of interactions with the user. This property enables our method to be used from a cold start situation (no pre-existing dataset), or starting from a collection of audio samples provided by the user. In a proof of concept study with 25 participants, we empirically demonstrate that DeepDrummer is able to converge towards the preference of our subjects after a small number of interactions.

Access Paper or Ask Questions