Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"music generation": models, code, and papers

Generation of musical patterns through operads

Apr 26, 2021
Samuele Giraudo

We introduce the notion of multi-pattern, a combinatorial abstraction of polyphonic musical phrases. The interest of this approach lies in the fact that this offers a way to compose two multi-patterns in order to produce a longer one. This dives musical phrases into an algebraic context since the set of multi-patterns has the structure of an operad; operads being structures offering a formalization of the notion of operators and their compositions. Seeing musical phrases as operators allows us to perform computations on phrases and admits applications in generative music: given a set of short patterns, we propose various algorithms to randomly generate a new and longer phrase inspired by the inputted patterns.

* Journ\'ees d'informatique musicale, 2020 
* 10 pages 

EMOPIA: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation

Aug 03, 2021
Hsiao-Tzu Hung, Joann Ching, Seungheon Doh, Nabin Kim, Juhan Nam, Yi-Hsuan Yang

While there are many music datasets with emotion labels in the literature, they cannot be used for research on symbolic-domain music analysis or generation, as there are usually audio files only. In this paper, we present the EMOPIA (pronounced `yee-m\`{o}-pi-uh') dataset, a shared multi-modal (audio and MIDI) database focusing on perceived emotion in pop piano music, to facilitate research on various tasks related to music emotion. The dataset contains 1,087 music clips from 387 songs and clip-level emotion labels annotated by four dedicated annotators. Since the clips are not restricted to one clip per song, they can also be used for song-level analysis. We present the methodology for building the dataset, covering the song list curation, clip selection, and emotion annotation processes. Moreover, we prototype use cases on clip-level music emotion classification and emotion-based symbolic music generation by training and evaluating corresponding models using the dataset. The result demonstrates the potential of EMOPIA for being used in future exploration on piano emotion-related MIR tasks.

* The paper has been accepted for publication at ISMIR 2021 

QuiKo: A Quantum Beat Generation Application

Apr 09, 2022
Scott Oshiro

In this chapter a quantum music generation application called QuiKo will be discussed. It combines existing quantum algorithms with data encoding methods from quantum machine learning to build drum and audio sample patterns from a database of audio tracks. QuiKo leverages the physical properties and characteristics of quantum computers to generate what can be referred to as Soft Rules proposed by Alexis Kirke. These rules take advantage of the noise produced by quantum devices to develop flexible rules and grammars for quantum music generation. These properties include qubit decoherence and phase kickback due controlled quantum gates within the quantum circuit. QuiKo builds upon the concept of soft rules in quantum music generation and takes it a step further. It attempts to mimic and react to an external musical inputs, similar to the way that human musicians play and compose with one another. Audio signals are used as inputs into the system. Feature extraction is then performed on the signal to identify the harmonic and percussive elements. This information is then encoded onto the quantum circuit. Measurements of the quantum circuit are then taken providing results in the form of probability distributions for external music applications to use to build the new drum patterns.

* Pre-publication draft, to appear in the book "Quantum Computer Music", E. R. Miranda (Ed.) 

MG-VAE: Deep Chinese Folk Songs Generation with Specific Regional Style

Sep 29, 2019
Jing Luo, Xinyu Yang, Shulei Ji, Juan Li

Regional style in Chinese folk songs is a rich treasure that can be used for ethnic music creation and folk culture research. In this paper, we propose MG-VAE, a music generative model based on VAE (Variational Auto-Encoder) that is capable of capturing specific music style and generating novel tunes for Chinese folk songs (Min Ge) in a manipulatable way. Specifically, we disentangle the latent space of VAE into four parts in an adversarial training way to control the information of pitch and rhythm sequence, as well as of music style and content. In detail, two classifiers are used to separate style and content latent space, and temporal supervision is utilized to disentangle the pitch and rhythm sequence. The experimental results show that the disentanglement is successful and our model is able to create novel folk songs with controllable regional styles. To our best knowledge, this is the first study on applying deep generative model and adversarial training for Chinese music generation.

* Accepted by the 7th Conference on Sound and Music Technology, 2019, Harbin, China 

Composing Music with Grammar Argumented Neural Networks and Note-Level Encoding

Dec 07, 2016
Zheng Sun, Jiaqi Liu, Zewang Zhang, Jingwen Chen, Zhao Huo, Ching Hua Lee, Xiao Zhang

Creating aesthetically pleasing pieces of art, including music, has been a long-term goal for artificial intelligence research. Despite recent successes of long-short term memory (LSTM) recurrent neural networks (RNNs) in sequential learning, LSTM neural networks have not, by themselves, been able to generate natural-sounding music conforming to music theory. To transcend this inadequacy, we put forward a novel method for music composition that combines the LSTM with Grammars motivated by music theory. The main tenets of music theory are encoded as grammar argumented (GA) filters on the training data, such that the machine can be trained to generate music inheriting the naturalness of human-composed pieces from the original dataset while adhering to the rules of music theory. Unlike previous approaches, pitches and durations are encoded as one semantic entity, which we refer to as note-level encoding. This allows easy implementation of music theory grammars, as well as closer emulation of the thinking pattern of a musician. Although the GA rules are applied to the training data and never directly to the LSTM music generation, our machine still composes music that possess high incidences of diatonic scale notes, small pitch intervals and chords, in deference to music theory.

* 6 pages, 4 figures 

Assisted Sound Sample Generation with Musical Conditioning in Adversarial Auto-Encoders

Apr 12, 2019
Adrien Bitton, Philippe Esling, Antoine Caillon, Martin Fouilleul

Generative models have thrived in computer vision, enabling unprecedented image processes. Yet the results in audio remain less advanced. Our project targets real-time sound synthesis from a reduced set of high-level parameters, including semantic controls that can be adapted to different sound libraries and specific tags. These generative variables should allow expressive modulations of target musical qualities and continuously mix into new styles. To this extent we train AEs on an orchestral database of individual note samples, along with their intrinsic attributes: note class, timbre domain and extended playing techniques. We condition the decoder for control over the rendered note attributes and use latent adversarial training for learning expressive style parameters that can ultimately be mixed. We evaluate both generative performances and latent representation. Our ablation study demonstrates the effectiveness of the musical conditioning mechanisms. The proposed model generates notes as magnitude spectrograms from any probabilistic latent code samples, with expressive control of orchestral timbres and playing styles. Its training data subsets can directly be visualized in the 3D latent representation. Waveform rendering can be done offline with GLA. In order to allow real-time interactions, we fine-tune the decoder with a pretrained MCNN and embed the full waveform generation pipeline in a plugin. Moreover the encoder could be used to process new input samples, after manipulating their latent attribute representation, the decoder can generate sample variations as an audio effect would. Our solution remains rather fast to train, it can directly be applied to other sound domains, including an user's libraries with custom sound tags that could be mapped to specific generative controls. As a result, it fosters creativity and intuitive audio style experimentations.

* this article was submitted to the 22nd International Conference on Digital Audio Effects (DAFx 2019), the non-anonymous review is in progress ; we also work on a companion repository 

Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory

Mar 25, 2022
Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, Ziwei Liu

Driving 3D characters to dance following a piece of music is highly challenging due to the spatial constraints applied to poses by choreography norms. In addition, the generated dance sequence also needs to maintain temporal coherency with different music genres. To tackle these challenges, we propose a novel music-to-dance framework, Bailando, with two powerful components: 1) a choreographic memory that learns to summarize meaningful dancing units from 3D pose sequence to a quantized codebook, 2) an actor-critic Generative Pre-trained Transformer (GPT) that composes these units to a fluent dance coherent to the music. With the learned choreographic memory, dance generation is realized on the quantized units that meet high choreography standards, such that the generated dancing sequences are confined within the spatial constraints. To achieve synchronized alignment between diverse motion tempos and music beats, we introduce an actor-critic-based reinforcement learning scheme to the GPT with a newly-designed beat-align reward function. Extensive experiments on the standard benchmark demonstrate that our proposed framework achieves state-of-the-art performance both qualitatively and quantitatively. Notably, the learned choreographic memory is shown to discover human-interpretable dancing-style poses in an unsupervised manner.

* Accepted by CVPR 2022. Code and video link: 

Conditional LSTM-GAN for Melody Generation from Lyrics

Aug 15, 2019
Yi Yu, Simon Canales

Melody generation from lyrics has been a challenging research issue in the field of artificial intelligence and music, which enables to learn and discover latent relationship between interesting lyrics and accompanying melody. Unfortunately, the limited availability of paired lyrics-melody dataset with alignment information has hindered the research progress. To address this problem, we create a large dataset consisting of 12,197 MIDI songs each with paired lyrics and melody alignment through leveraging different music sources where alignment relationship between syllables and music attributes is extracted. Most importantly, we propose a novel deep generative model, conditional Long Short-Term Memory - Generative Adversarial Network (LSTM-GAN) for melody generation from lyrics, which contains a deep LSTM generator and a deep LSTM discriminator both conditioned on lyrics. In particular, lyrics-conditioned melody and alignment relationship between syllables of given lyrics and notes of predicted melody are generated simultaneously. Experimental results have proved the effectiveness of our proposed lyrics-to-melody generative model, where plausible and tuneful sequences can be inferred from lyrics.


RoboJam: A Musical Mixture Density Network for Collaborative Touchscreen Interaction

Nov 29, 2017
Charles P. Martin, Jim Torresen

RoboJam is a machine-learning system for generating music that assists users of a touchscreen music app by performing responses to their short improvisations. This system uses a recurrent artificial neural network to generate sequences of touchscreen interactions and absolute timings, rather than high-level musical notes. To accomplish this, RoboJam's network uses a mixture density layer to predict appropriate touch interaction locations in space and time. In this paper, we describe the design and implementation of RoboJam's network and how it has been integrated into a touchscreen music app. A preliminary evaluation analyses the system in terms of training, musical generation and user interaction.


Maximum entropy models for generation of expressive music

Oct 12, 2016
Simon Moulieras, François Pachet

In the context of contemporary monophonic music, expression can be seen as the difference between a musical performance and its symbolic representation, i.e. a musical score. In this paper, we show how Maximum Entropy (MaxEnt) models can be used to generate musical expression in order to mimic a human performance. As a training corpus, we had a professional pianist play about 150 melodies of jazz, pop, and latin jazz. The results show a good predictive power, validating the choice of our model. Additionally, we set up a listening test whose results reveal that on average, people significantly prefer the melodies generated by the MaxEnt model than the ones without any expression, or with fully random expression. Furthermore, in some cases, MaxEnt melodies are almost as popular as the human performed ones.