Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"music generation": models, code, and papers

Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer

Nov 07, 2021
Yi-Jen Shih, Shih-Lun Wu, Frank Zalkow, Meinard Müller, Yi-Hsuan Yang

Attention-based Transformer models have been increasingly employed for automatic music generation. To condition the generation process of such a model with a user-specified sequence, a popular approach is to take that conditioning sequence as a priming sequence and ask a Transformer decoder to generate a continuation. However, this prompt-based conditioning cannot guarantee that the conditioning sequence would develop or even simply repeat itself in the generated continuation. In this paper, we propose an alternative conditioning approach, called theme-based conditioning, that explicitly trains the Transformer to treat the conditioning sequence as a thematic material that has to manifest itself multiple times in its generation result. This is achieved with two main technical contributions. First, we propose a deep learning-based approach that uses contrastive representation learning and clustering to automatically retrieve thematic materials from music pieces in the training data. Second, we propose a novel gated parallel attention module to be used in a sequence-to-sequence (seq2seq) encoder/decoder architecture to more effectively account for a given conditioning thematic material in the generation process of the Transformer decoder. We report on objective and subjective evaluations of variants of the proposed Theme Transformer and the conventional prompt-based baseline, showing that our best model can generate, to some extent, polyphonic pop piano music with repetition and plausible variations of a given condition.


MMM : Exploring Conditional Multi-Track Music Generation with the Transformer

Aug 20, 2020
Jeff Ens, Philippe Pasquier

We propose the Multi-Track Music Machine (MMM), a generative system based on the Transformer architecture that is capable of generating multi-track music. In contrast to previous work, which represents musical material as a single time-ordered sequence, where the musical events corresponding to different tracks are interleaved, we create a time-ordered sequence of musical events for each track and concatenate several tracks into a single sequence. This takes advantage of the Transformer's attention-mechanism, which can adeptly handle long-term dependencies. We explore how various representations can offer the user a high degree of control at generation time, providing an interactive demo that accommodates track-level and bar-level inpainting, and offers control over track instrumentation and note density.


Scorpiano -- A System for Automatic Music Transcription for Monophonic Piano Music

Aug 24, 2021
Bojan Sofronievski, Branislav Gerazov

Music transcription is the process of transcribing music audio into music notation. It is a field in which the machines still cannot beat human performance. The main motivation for automatic music transcription is to make it possible for anyone playing a musical instrument, to be able to generate the music notes for a piece of music quickly and accurately. It does not matter if the person is a beginner and simply struggles to find the music score by searching, or an expert who heard a live jazz improvisation and would like to reproduce it without losing time doing manual transcription. We propose Scorpiano -- a system that can automatically generate a music score for simple monophonic piano melody tracks using digital signal processing. The system integrates multiple digital audio processing methods: notes onset detection, tempo estimation, beat detection, pitch detection and finally generation of the music score. The system has proven to give good results for simple piano melodies, comparable to commercially available neural network based systems.


The music box operad: Random generation of musical phrases from patterns

Apr 27, 2021
Samuele Giraudo

We introduce the notion of multi-pattern, a combinatorial abstraction of polyphonic musical phrases. The interest of this approach to encode musical phrases lies in the fact that it becomes possible to compose multi-patterns in order to produce new ones. This dives the set of musical phrases into an algebraic framework since the set of multi-patterns has the structure of an operad. Operads are algebraic structures offering a formalization of the notion of operators and their compositions. Seeing musical phrases as operators allows us to perform computations on phrases and admits applications in generative music. Indeed, given a set of short patterns, we propose various algorithms to randomly generate a new and longer phrase inspired by the inputted patterns.

* 31 pages. Extended version of arXiv:2104.12432 

Lets Play Music: Audio-driven Performance Video Generation

Nov 05, 2020
Hao Zhu, Yi Li, Feixia Zhu, Aihua Zheng, Ran He

We propose a new task named Audio-driven Per-formance Video Generation (APVG), which aims to synthesizethe video of a person playing a certain instrument guided bya given music audio clip. It is a challenging task to gener-ate the high-dimensional temporal consistent videos from low-dimensional audio modality. In this paper, we propose a multi-staged framework to achieve this new task to generate realisticand synchronized performance video from given music. Firstly,we provide both global appearance and local spatial informationby generating the coarse videos and keypoints of body and handsfrom a given music respectively. Then, we propose to transformthe generated keypoints to heatmap via a differentiable spacetransformer, since the heatmap offers more spatial informationbut is harder to generate directly from audio. Finally, wepropose a Structured Temporal UNet (STU) to extract bothintra-frame structured information and inter-frame temporalconsistency. They are obtained via graph-based structure module,and CNN-GRU based high-level temporal module respectively forfinal video generation. Comprehensive experiments validate theeffectiveness of our proposed framework.

* ICPR 2020 

Music Generation Using an LSTM

Mar 23, 2022
Michael Conner, Lucas Gral, Kevin Adams, David Hunger, Reagan Strelow, Alexander Neuwirth

Over the past several years, deep learning for sequence modeling has grown in popularity. To achieve this goal, LSTM network structures have proven to be very useful for making predictions for the next output in a series. For instance, a smartphone predicting the next word of a text message could use an LSTM. We sought to demonstrate an approach of music generation using Recurrent Neural Networks (RNN). More specifically, a Long Short-Term Memory (LSTM) neural network. Generating music is a notoriously complicated task, whether handmade or generated, as there are a myriad of components involved. Taking this into account, we provide a brief synopsis of the intuition, theory, and application of LSTMs in music generation, develop and present the network we found to best achieve this goal, identify and address issues and challenges faced, and include potential future improvements for our network.

* Published in MICS 2022 

Convolutional Generative Adversarial Networks with Binary Neurons for Polyphonic Music Generation

Oct 06, 2018
Hao-Wen Dong, Yi-Hsuan Yang

It has been shown recently that deep convolutional generative adversarial networks (GANs) can learn to generate music in the form of piano-rolls, which represent music by binary-valued time-pitch matrices. However, existing models can only generate real-valued piano-rolls and require further post-processing, such as hard thresholding (HT) or Bernoulli sampling (BS), to obtain the final binary-valued results. In this paper, we study whether we can have a convolutional GAN model that directly creates binary-valued piano-rolls by using binary neurons. Specifically, we propose to append to the generator an additional refiner network, which uses binary neurons at the output layer. The whole network is trained in two stages. Firstly, the generator and the discriminator are pretrained. Then, the refiner network is trained along with the discriminator to learn to binarize the real-valued piano-rolls the pretrained generator creates. Experimental results show that using binary neurons instead of HT or BS indeed leads to better results in a number of objective measures. Moreover, deterministic binary neurons perform better than stochastic ones in both objective measures and a subjective test. The source code, training data and audio examples of the generated results can be found at .

* A preliminary version of this paper appeared in ISMIR 2018. In this version, we added an appendix to provide figures of sample results and remarks on the end-to-end models 

Interpretable Melody Generation from Lyrics with Discrete-Valued Adversarial Training

Jun 30, 2022
Wei Duan, Zhe Zhang, Yi Yu, Keizo Oyama

Generating melody from lyrics is an interesting yet challenging task in the area of artificial intelligence and music. However, the difficulty of keeping the consistency between input lyrics and generated melody limits the generation quality of previous works. In our proposal, we demonstrate our proposed interpretable lyrics-to-melody generation system which can interact with users to understand the generation process and recreate the desired songs. To improve the reliability of melody generation that matches lyrics, mutual information is exploited to strengthen the consistency between lyrics and generated melodies. Gumbel-Softmax is exploited to solve the non-differentiability problem of generating discrete music attributes by Generative Adversarial Networks (GANs). Moreover, the predicted probabilities output by the generator is utilized to recommend music attributes. Interacting with our lyrics-to-melody generation system, users can listen to the generated AI song as well as recreate a new song by selecting from recommended music attributes.

* 3 pages, 3 figures 

DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer

Mar 25, 2021
Buyu Li, Yongchi Zhao, Lu Sheng

In this work, we propose a novel deep learning framework that can generate a vivid dance from a whole piece of music. In contrast to previous works that define the problem as generation of frames of motion state parameters, we formulate the task as a prediction of motion curves between key poses, which is inspired by the animation industry practice. The proposed framework, named DanceNet3D, first generates key poses on beats of the given music and then predicts the in-between motion curves. DanceNet3D adopts the encoder-decoder architecture and the adversarial schemes for training. The decoders in DanceNet3D are constructed on MoTrans, a transformer tailored for motion generation. In MoTrans we introduce the kinematic correlation by the Kinematic Chain Networks, and we also propose the Learned Local Attention module to take the temporal local correlation of human motion into consideration. Furthermore, we propose PhantomDance, the first large-scale dance dataset produced by professional animatiors, with accurate synchronization with music. Extensive experiments demonstrate that the proposed approach can generate fluent, elegant, performative and beat-synchronized 3D dances, which significantly surpasses previous works quantitatively and qualitatively.

* Add project link in abstract