"music generation": models, code, and papers

Deep Cross-Modal Audio-Visual Generation

Apr 26, 2017
Lele Chen, Sudhanshu Srivastava, Zhiyao Duan, Chenliang Xu

Cross-modal audio-visual perception has been a long-lasting topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite prior work in computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances of different instruments. Our experiments using both classification and human evaluations demonstrate that our model has the ability to generate one modality, i.e., audio/visual, from the other modality, i.e., visual/audio, to a good extent. Our experiments on various design choices along with the datasets will facilitate future research in this new problem space.
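
As a rough illustration of the conditional-GAN setup described above, the sketch below conditions both a generator and a discriminator on an encoded audio clip. The layer sizes, image resolution, and audio encoding are placeholder assumptions, not the paper's architecture.

```python
# Minimal sketch of a conditional GAN for sound-to-image generation.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim=100, audio_dim=128, img_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + audio_dim, 512), nn.ReLU(),
            nn.Linear(512, img_pixels), nn.Tanh(),
        )

    def forward(self, z, audio_feat):
        # Condition the generator by concatenating noise with the audio encoding.
        return self.net(torch.cat([z, audio_feat], dim=1))

class Discriminator(nn.Module):
    def __init__(self, audio_dim=128, img_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_pixels + audio_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1), nn.Sigmoid(),
        )

    def forward(self, img, audio_feat):
        # Real/fake score for an image given the audio condition it should match.
        return self.net(torch.cat([img, audio_feat], dim=1))

G, D = Generator(), Discriminator()
z = torch.randn(4, 100)
audio = torch.randn(4, 128)          # stand-in for an encoded sound clip
fake_images = G(z, audio)
scores = D(fake_images, audio)
```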


Melody-Conditioned Lyrics Generation with SeqGANs

Oct 28, 2020
Yihao Chen, Alexander Lerch

Automatic lyrics generation has received attention from both music and AI communities for years. Early rule-based approaches have, owing to increases in computational power and the evolution of data-driven models, mostly been replaced with deep-learning-based systems. Many existing approaches, however, either rely heavily on prior knowledge in music and lyrics writing or oversimplify the task by largely discarding melodic information and its relationship with the text. We propose an end-to-end melody-conditioned lyrics generation system based on Sequence Generative Adversarial Networks (SeqGAN), which generates a line of lyrics given the corresponding melody as the input. Furthermore, we investigate the performance of the generator with an additional input condition: the theme or overarching topic of the lyrics to be generated. We show that the input conditions have no negative impact on the evaluation metrics while enabling the network to produce more meaningful results.
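
A minimal sketch of the melody-conditioned generator side of such a system: an RNN whose initial state is set from a melody encoding samples one lyric line token by token. The vocabulary, dimensions, and melody features are assumptions; in SeqGAN-style training the sampled lines would additionally be scored by a discriminator, with the score used as a policy-gradient reward.

```python
import torch
import torch.nn as nn

class LyricGenerator(nn.Module):
    def __init__(self, vocab_size=5000, melody_dim=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_h = nn.Linear(melody_dim, hidden)   # melody sets the initial state
        self.rnn = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def sample(self, melody_feat, max_len=12, bos_id=0):
        h = torch.tanh(self.init_h(melody_feat))
        tok = torch.full((melody_feat.size(0),), bos_id, dtype=torch.long)
        tokens = []
        for _ in range(max_len):
            h = self.rnn(self.embed(tok), h)
            probs = torch.softmax(self.out(h), dim=-1)
            tok = torch.multinomial(probs, 1).squeeze(1)  # sample the next word
            tokens.append(tok)
        return torch.stack(tokens, dim=1)

gen = LyricGenerator()
melody = torch.randn(2, 64)       # stand-in for an encoded melody line
lyric_ids = gen.sample(melody)    # adversarial training would reward these samples
```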


Maximum entropy models capture melodic styles

Oct 11, 2016
Jason Sakellariou, Francesca Tria, Vittorio Loreto, François Pachet

We introduce a Maximum Entropy model able to capture the statistics of melodies in music. The model can be used to generate new melodies that emulate the style of the musical corpus which was used to train it. Instead of using the $n$-body interactions of $(n-1)$-order Markov models, traditionally used in automatic music generation, we use a $k$-nearest-neighbour model with pairwise interactions only. In that way, we keep the number of parameters low and avoid the over-fitting problems typical of Markov models. We show that long-range musical phrases do not need to be explicitly enforced using high-order Markov interactions, but can instead emerge from multiple, competing, pairwise interactions. We validate our Maximum Entropy model by assessing how well the generated sequences capture the style of the original corpus without plagiarizing it. To this end we use a data-compression approach to discriminate the levels of borrowing and innovation featured by the artificial sequences. The results show that our modelling scheme outperforms both fixed-order and variable-order Markov models. This shows that, despite being based only on pairwise interactions, this Maximum Entropy scheme opens the possibility to generate musically sensible alterations of the original phrases, providing a way to generate innovation.
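
A small sketch of the pairwise maximum-entropy idea: an energy function with couplings only between notes at distance at most $k$, which can then be sampled (e.g., by Gibbs sampling) to produce new melodies. The coupling values below are random placeholders, not parameters fit to a corpus.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pitches, k = 12, 4                            # pitch alphabet size, neighbourhood radius
J = rng.normal(size=(k, n_pitches, n_pitches))  # J[d-1, a, b]: coupling at distance d

def energy(melody):
    """Lower energy = more compatible with the learned pairwise statistics."""
    e = 0.0
    for i in range(len(melody)):
        for d in range(1, k + 1):
            if i + d < len(melody):
                e -= J[d - 1, melody[i], melody[i + d]]
    return e

melody = rng.integers(0, n_pitches, size=16)    # toy melody as pitch-class indices
print(energy(melody))
# New melodies can be drawn by resampling one note at a time with probabilities
# proportional to exp(-energy) of the full sequence (Gibbs sampling).
```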


DadaGP: A Dataset of Tokenized GuitarPro Songs for Sequence Models

Jul 30, 2021
Pedro Sarmento, Adarsh Kumar, CJ Carr, Zack Zukowski, Mathieu Barthet, Yi-Hsuan Yang

Originating in the Renaissance and burgeoning in the digital era, tablature is a commonly used music notation system that provides explicit representations of instrument fingerings rather than pitches. GuitarPro has established itself as a widely used tablature format and software enabling musicians to edit and share songs for musical practice, learning, and composition. In this work, we present DadaGP, a new symbolic music dataset comprising 26,181 song scores in the GuitarPro format covering 739 musical genres, along with an accompanying tokenized format well-suited for generative sequence models such as the Transformer. The tokenized format is inspired by event-based MIDI encodings, often used in symbolic music generation models. The dataset is released with an encoder/decoder which converts GuitarPro files to tokens and back. We present results of a use case in which DadaGP is used to train a Transformer-based model to generate new songs in GuitarPro format. We discuss other relevant use cases for the dataset (guitar-bass transcription, music style transfer, and artist/genre classification) as well as ethical implications. DadaGP opens up the possibility to train GuitarPro score generators, fine-tune models on custom data, create new styles of music, build AI-powered songwriting apps, and support human-AI improvisation.
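
To make the event-based tokenization concrete, here is a hypothetical token stream for a couple of tablature notes. These token names are invented for illustration and are not the actual DadaGP vocabulary.

```python
# Illustrative event-style token stream in the spirit of event-based MIDI encodings.
events = [
    "artist:unknown", "tempo:120",
    "new_measure",
    "distorted0:note:s3:f5",   # hypothetical: string 3, fret 5 on a distorted guitar
    "wait:480",                # hypothetical: advance time by 480 ticks
    "distorted0:note:s2:f7",
    "wait:480",
    "new_measure",
]

# A sequence model (e.g. a Transformer) is trained on such streams and sampled
# autoregressively; a decoder then maps the tokens back to a GuitarPro file.
vocab = {tok: i for i, tok in enumerate(sorted(set(events)))}
token_ids = [vocab[tok] for tok in events]
print(token_ids)
```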


AccoMontage: Accompaniment Arrangement via Phrase Selection and Style Transfer

Aug 25, 2021
Jingwei Zhao, Gus Xia

Accompaniment arrangement is a difficult music generation task involving intertwined constraints of melody, harmony, texture, and music structure. Existing models are not yet able to capture all these constraints effectively, especially for long-term music generation. To address this problem, we propose AccoMontage, an accompaniment arrangement system for whole pieces of music through unifying phrase selection and neural style transfer. We focus on generating piano accompaniments for folk/pop songs based on a lead sheet (i.e., melody with chord progression). Specifically, AccoMontage first retrieves phrase montages from a database while recombining them structurally using dynamic programming. Second, chords of the retrieved phrases are manipulated to match the lead sheet via style transfer. Lastly, the system offers controls over the generation process. In contrast to pure learning-based approaches, AccoMontage introduces a novel hybrid pathway, in which rule-based optimization and deep learning are both leveraged to complement each other for high-quality generation. Experiments show that our model generates well-structured accompaniment with delicate texture, significantly outperforming the baselines.

* Accepted by ISMIR 2021 
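
A compact sketch of the phrase-selection stage as a Viterbi-style dynamic program: choose one database phrase per segment so that per-segment fit plus phrase-to-phrase transition smoothness is maximized. The scoring matrices are random placeholders standing in for AccoMontage's actual features.

```python
import numpy as np

rng = np.random.default_rng(1)
n_segments, n_candidates = 4, 5
fit = rng.random((n_segments, n_candidates))      # how well phrase j fits segment t
trans = rng.random((n_candidates, n_candidates))  # smoothness of phrase i -> phrase j

score = fit[0].copy()
back = np.zeros((n_segments, n_candidates), dtype=int)
for t in range(1, n_segments):
    total = score[:, None] + trans + fit[t][None, :]  # best previous + transition + fit
    back[t] = total.argmax(axis=0)
    score = total.max(axis=0)

# Backtrack the best phrase sequence.
path = [int(score.argmax())]
for t in range(n_segments - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
path.reverse()
print(path)   # indices of selected phrases; their chords are then adapted via style transfer
```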

A Framework for Automated Pop-song Melody Generation with Piano Accompaniment Arrangement

Dec 28, 2018
Ziyu Wang, Gus Xia

We contribute a pop-song automation framework for lead melody generation and accompaniment arrangement. The framework reflects the major procedures of human music composition, generating both lead melody and piano accompaniment by a unified strategy. Specifically, we take a chord progression as input and propose three models to generate a structured melody with piano accompaniment textures. First, the harmony alternation model transforms a raw input chord progression into an altered one that better fits the specified music style. Second, the melody generation model generates the lead melody and the other voices (melody lines) of the accompaniment using seasonal ARMA (Autoregressive Moving Average) processes. Third, the melody integration model integrates the melody lines (voices) together as the final piano accompaniment. We evaluate the proposed framework using subjective listening tests. Experimental results show that the generated melodies are rated significantly higher than those generated by a bi-directional LSTM, and our accompaniment arrangement result is comparable with that of a state-of-the-art commercial product, Band in a Box.

* In Proceedings of the 6th Conference on Sound and Music Technology, 2018, Xiamen, China 
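
A toy sketch of generating a melody contour with a seasonal autoregressive process: each step depends on the previous step and on the step one bar earlier, plus noise, and the continuous contour is then rounded onto MIDI pitches. The coefficients and bar length are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
season, n_steps = 8, 32              # 8 steps per bar, 4 bars
phi, Phi, sigma = 0.5, 0.4, 1.0      # short-lag AR, seasonal AR, noise scale
x = np.zeros(n_steps)
x[:season] = rng.normal(scale=sigma, size=season)
for t in range(season, n_steps):
    x[t] = phi * x[t - 1] + Phi * x[t - season] + rng.normal(scale=sigma)

# Map the continuous contour onto pitches around middle C to get notes.
pitches = 60 + np.clip(np.round(x), -12, 12).astype(int)
print(pitches)
```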

Automated Composition of Picture-Synched Music Soundtracks for Movies

Oct 19, 2019
Vansh Dassani, Jon Bird, Dave Cliff

We describe the implementation of and early results from a system that automatically composes picture-synched musical soundtracks for videos and movies. We use the phrase "picture-synched" to mean that the structure of the automatically composed music is determined by visual events in the input movie, i.e. the final music is synchronised to visual events and features such as cut transitions or within-shot key-frame events. Our system combines automated video analysis and computer-generated music-composition techniques to create unique soundtracks in response to the video input, and can be thought of as an initial step in creating a computerised replacement for a human composer writing music to fit the picture-locked edit of a movie. Working only from the video information in the movie, the system extracts key features from the input video using video-analysis techniques and feeds them into a machine-learning-based music generation tool to compose a piece of music from scratch. The resulting soundtrack is tied to video features, such as scene transition markers and scene-level energy values, and is unique to the input video. Although the system we describe here is only a preliminary proof-of-concept, user evaluations of the output of the system have been positive.

* To be presented at the 16th ACM SIGGRAPH European Conference on Visual Media Production. London, England: 17th-18th December 2019. 10 pages, 9 figures 
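
A simplified sketch of the video-analysis step: estimate cut transitions from frame-to-frame differences and convert them into section boundaries that a music generator could align to. The frames and threshold rule below are stand-ins, not the system's actual feature extraction.

```python
import numpy as np

rng = np.random.default_rng(3)
fps, n_frames = 24, 240
frames = rng.random((n_frames, 36, 64))       # toy grayscale frames; a real pipeline reads the video

diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))  # mean change per frame pair
threshold = diffs.mean() + 3 * diffs.std()                  # simple outlier rule for cuts
cut_frames = np.flatnonzero(diffs > threshold) + 1

section_boundaries = cut_frames / fps         # seconds where the music should change
print(section_boundaries)
# Each section's length and a per-section "energy" estimate can then be passed
# to a music generation tool to compose matching material.
```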

TeleMelody: Lyric-to-Melody Generation with a Template-Based Two-Stage Method

Sep 20, 2021
Zeqian Ju, Peiling Lu, Xu Tan, Rui Wang, Chen Zhang, Songruoyao Wu, Kejun Zhang, Xiangyang Li, Tao Qin, Tie-Yan Liu

Lyric-to-melody generation is an important task in automatic songwriting. Previous lyric-to-melody generation systems usually adopt end-to-end models that directly generate melodies from lyrics, which suffer from several issues: 1) lack of paired lyric-melody training data; 2) lack of control over the generated melodies. In this paper, we develop TeleMelody, a two-stage lyric-to-melody generation system with a music template (e.g., tonality, chord progression, rhythm pattern, and cadence) to bridge the gap between lyrics and melodies (i.e., the system consists of a lyric-to-template module and a template-to-melody module). TeleMelody has two advantages. First, it is data efficient. The template-to-melody module is trained in a self-supervised way (i.e., the source template is extracted from the target melody) and therefore does not need any lyric-melody paired data. The lyric-to-template module consists of rule-based components and a lyric-to-rhythm model, which is trained with paired lyric-rhythm data that is easier to obtain than paired lyric-melody data. Second, it is controllable. The design of the template ensures that the generated melodies can be controlled by adjusting the musical elements in the template. Both subjective and objective experimental evaluations demonstrate that TeleMelody generates melodies with higher quality, better controllability, and less requirement on paired lyric-melody data than previous generation systems.
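
A schematic sketch of the two-stage pipeline, with the template fields taken from the elements listed in the abstract. Both stage functions are placeholders for the trained modules, not real APIs from the paper.

```python
from dataclasses import dataclass

@dataclass
class Template:
    tonality: str   # e.g. "C major"
    chords: list    # chord progression, one chord per bar
    rhythm: list    # note durations aligned to the lyric syllables
    cadence: str    # e.g. "half" or "full" cadence at the phrase end

def lyric_to_template(lyrics: str) -> Template:
    # Stage 1 (assumed behaviour): rules pick tonality/chords/cadence, and a
    # learned lyric-to-rhythm model maps syllables to durations.
    syllables = lyrics.split()
    return Template("C major", ["C", "Am", "F", "G"], [0.5] * len(syllables), "full")

def template_to_melody(t: Template) -> list:
    # Stage 2 (placeholder): the real module is trained self-supervised, with
    # templates extracted from existing melodies as the source side.
    scale = [60, 62, 64, 65, 67, 69, 71]
    return [(scale[i % len(scale)], dur) for i, dur in enumerate(t.rhythm)]

melody = template_to_melody(lyric_to_template("shine on me to night"))
print(melody)   # (MIDI pitch, duration) pairs; controllable by editing the template
```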


Symbolic Music Loop Generation with VQ-VAE

Nov 15, 2021
Sangjun Han, Hyeongrae Ihm, Woohyung Lim

Music is a repetition of patterns and rhythms. It can be composed by repeating a certain number of bars in a structured way. In this paper, the objective is to generate a loop of 8 bars that can be used as a building block of music. Even allowing for musical diversity, we assume that the music patterns familiar to humans can be defined in a finite set. Using explicit rules to extract loops from music, we found that discrete representations are sufficient to model symbolic music sequences. Among the VAE family, the desired musical properties are better preserved by VQ-VAE than by the other models. Further, to emphasize musical structure, we manipulate the discrete latent features to be repetitive so that these properties are further strengthened. Quantitative and qualitative experiments are extensively conducted to verify our assumptions.
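
A small sketch of the two ideas above: nearest-codebook vector quantization of per-bar latents, followed by making the discrete code sequence repetitive before decoding. Sizes and the specific repetition rule are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_bars, latent_dim, codebook_size = 8, 16, 32
codebook = rng.normal(size=(codebook_size, latent_dim))
z = rng.normal(size=(n_bars, latent_dim))        # encoder output, one latent vector per bar

# (1) Quantise: pick the closest codebook entry for each bar.
dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = dists.argmin(axis=1)

# (2) Impose repetition: tile the first 4-bar code pattern over all 8 bars.
repetitive_codes = np.tile(codes[:4], n_bars // 4)
z_q = codebook[repetitive_codes]                 # quantised latents fed to the decoder
print(codes, repetitive_codes)
```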


It's Raw! Audio Generation with State-Space Models

Feb 20, 2022
Karan Goel, Albert Gu, Chris Donahue, Christopher Ré

Developing architectures suitable for modeling raw audio is a challenging problem due to the high sampling rates of audio waveforms. Standard sequence modeling approaches like RNNs and CNNs have previously been tailored to fit the demands of audio, but the resultant architectures make undesirable computational tradeoffs and struggle to model waveforms effectively. We propose SaShiMi, a new multi-scale architecture for waveform modeling built around the recently introduced S4 model for long sequence modeling. We identify that S4 can be unstable during autoregressive generation, and provide a simple improvement to its parameterization by drawing connections to Hurwitz matrices. SaShiMi yields state-of-the-art performance for unconditional waveform generation in the autoregressive setting. Additionally, SaShiMi improves non-autoregressive generation performance when used as the backbone architecture for a diffusion model. Compared to prior architectures in the autoregressive generation setting, SaShiMi generates piano and speech waveforms which humans find more musical and coherent respectively, e.g. 2x better mean opinion scores than WaveNet on an unconditional speech generation task. On a music generation task, SaShiMi outperforms WaveNet on density estimation and speed at both training and inference even when using 3x fewer parameters. Code and audio samples are available online.

* 23 pages, 7 figures, 7 tables
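
A toy illustration of the stability point above: for stable autoregressive generation the recurrent state matrix should be Hurwitz (every eigenvalue has negative real part), and one generic way to guarantee this is to parameterize the real parts to be strictly negative. This is a generic construction, not SaShiMi's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
A_raw = rng.normal(size=(n, n))

def is_hurwitz(A):
    # Hurwitz: all eigenvalues lie strictly in the left half-plane,
    # so the recurrence driven by A decays instead of blowing up.
    return bool(np.all(np.linalg.eigvals(A).real < 0))

print(is_hurwitz(A_raw))             # an arbitrary matrix usually is not Hurwitz

# A diagonal state matrix with strictly negative real parts (imaginary parts free)
# is Hurwitz by construction.
log_neg, imag = rng.normal(size=n), rng.normal(size=n)
A_stable = np.diag(-np.exp(log_neg) + 1j * imag)
print(is_hurwitz(A_stable))          # True
```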