"music generation": models, code, and papers

MelNet: A Generative Model for Audio in the Frequency Domain

Jun 04, 2019
Sean Vasquez, Mike Lewis

Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps. While long-range dependencies are difficult to model directly in the time domain, we show that they can be more tractably modelled in two-dimensional time-frequency representations such as spectrograms. By leveraging this representational advantage, in conjunction with a highly expressive probabilistic model and a multiscale generation procedure, we design a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve. We apply our model to a variety of audio generation tasks, including unconditional speech generation, music generation, and text-to-speech synthesis---showing improvements over previous approaches in both density estimates and human judgments.
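
As a rough illustration of the modelling domain only (a toy sketch under stated assumptions, not the authors' MelNet, which uses multidimensional RNNs over both time and frequency plus a coarse-to-fine multiscale procedure), the snippet below treats a log-mel spectrogram as a 2-D sequence and fits an autoregressive LSTM over its time frames; the number of mel bins is an assumed value.

```python
# Toy sketch (not the authors' MelNet code): model a mel spectrogram
# autoregressively over time frames with an LSTM, illustrating the idea of
# treating the spectrogram rather than the waveform as the modelling domain.
import torch
import torch.nn as nn

N_MELS = 80  # assumed number of mel bins

class FrameAutoregressor(nn.Module):
    def __init__(self, n_mels=N_MELS, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, frames):
        # frames: (batch, time, n_mels); predict frame t+1 from frames <= t
        h, _ = self.rnn(frames[:, :-1])
        return self.out(h)

model = FrameAutoregressor()
spec = torch.rand(4, 100, N_MELS)          # fake log-mel spectrograms
pred = model(spec)                          # (4, 99, N_MELS)
loss = nn.functional.mse_loss(pred, spec[:, 1:])
loss.backward()
```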

DAWSON: A Domain Adaptive Few Shot Generation Framework

Jan 02, 2020
Weixin Liang, Zixuan Liu, Can Liu

Training a Generative Adversarial Network (GAN) for a new domain from scratch requires an enormous amount of training data and days of training time. To address this, we propose DAWSON, a Domain Adaptive Few-Shot generation framework for GANs based on meta-learning. A major challenge in applying meta-learning to GANs is obtaining gradients for the generator from its evaluation on development sets, due to the likelihood-free nature of GANs. To address this challenge, we propose an alternative GAN training procedure that naturally combines the two-step training procedure of GANs with the two-step training procedure of meta-learning algorithms. DAWSON is a plug-and-play framework that supports a broad family of meta-learning algorithms and various GANs with architectural variants. Based on DAWSON, we also propose MUSIC MATINEE, the first few-shot music generation model. Our experiments show that MUSIC MATINEE can quickly adapt to new domains with only tens of songs from the target domain. We also show that DAWSON can learn to generate new digits from only four samples of the MNIST dataset. We release source-code implementations of DAWSON in both PyTorch and TensorFlow, generated music samples in two genres, and a lightning video.
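
To make the combination of the two two-step procedures concrete, here is a hedged sketch of one plausible arrangement (a Reptile-style meta-update wrapped around standard non-saturating GAN updates). It is not DAWSON's actual algorithm, and `G`, `D`, the optimizers, `G.z_dim`, and `sample_task_batches` are hypothetical placeholders.

```python
# Hedged sketch of the general idea only: wrap ordinary GAN updates in a
# Reptile-style meta-update over tasks (domains). DAWSON's actual procedure
# for obtaining generator meta-gradients differs in detail; the helper
# `sample_task_batches` is a hypothetical placeholder.
import copy
import torch
import torch.nn.functional as F

def reptile_gan_step(G, D, g_opt, d_opt, sample_task_batches,
                     inner_steps=5, meta_lr=0.1):
    g_init = copy.deepcopy(G.state_dict())
    d_init = copy.deepcopy(D.state_dict())
    for real in sample_task_batches(inner_steps):        # batches from one domain
        z = torch.randn(real.size(0), G.z_dim)
        # standard GAN discriminator step (non-saturating logistic loss)
        d_opt.zero_grad()
        d_loss = (F.softplus(-D(real)).mean()
                  + F.softplus(D(G(z).detach())).mean())
        d_loss.backward()
        d_opt.step()
        # standard GAN generator step
        g_opt.zero_grad()
        g_loss = F.softplus(-D(G(z))).mean()
        g_loss.backward()
        g_opt.step()
    # Reptile meta-update: move the initial weights toward the adapted weights.
    for net, init in ((G, g_init), (D, d_init)):
        adapted = net.state_dict()
        net.load_state_dict({k: init[k] + meta_lr * (adapted[k] - init[k])
                             for k in init})
```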

MG-VAE: Deep Chinese Folk Songs Generation with Specific Regional Style

Sep 29, 2019
Jing Luo, Xinyu Yang, Shulei Ji, Juan Li

Regional style in Chinese folk songs is a rich treasure that can be used for ethnic music creation and folk-culture research. In this paper, we propose MG-VAE, a generative music model based on the Variational Auto-Encoder (VAE) that is capable of capturing specific music styles and generating novel tunes for Chinese folk songs (Min Ge) in a controllable way. Specifically, we disentangle the latent space of the VAE into four parts through adversarial training to control the information in the pitch and rhythm sequences as well as the music style and content. In detail, two classifiers are used to separate the style and content latent spaces, and temporal supervision is used to disentangle the pitch and rhythm sequences. Experimental results show that the disentanglement is successful and that our model is able to create novel folk songs with controllable regional styles. To the best of our knowledge, this is the first study to apply deep generative models and adversarial training to Chinese music generation.

* Accepted by the 7th Conference on Sound and Music Technology, 2019, Harbin, China 
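
The following is a minimal sketch of the style/content separation idea only, not the MG-VAE implementation: a latent vector is split into four assumed parts, a classifier predicts the regional style from the style part, and a gradient-reversal layer is used as one possible stand-in for the paper's adversarial training on the content part. All dimensions and the number of regions are assumptions.

```python
# Illustrative sketch only: split a VAE latent vector into pitch, rhythm,
# style and content parts, and use a style classifier adversarially (via
# gradient reversal) so the content part carries no regional-style signal.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad                              # flip the gradient sign

z = torch.randn(8, 64, requires_grad=True)        # latent from the encoder
z_pitch, z_rhythm, z_style, z_content = z.split(16, dim=1)

style_clf = nn.Linear(16, 5)                      # 5 hypothetical regions
labels = torch.randint(0, 5, (8,))

# the style part should predict the region; the content part should not
loss = (nn.functional.cross_entropy(style_clf(z_style), labels)
        + nn.functional.cross_entropy(style_clf(GradReverse.apply(z_content)), labels))
loss.backward()
```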

Lead Sheet Generation and Arrangement by Conditional Generative Adversarial Network

Jul 30, 2018
Hao-Min Liu, Yi-Hsuan Yang

Research on automatic music generation has seen great progress thanks to the development of deep neural networks. However, generating multi-instrument music of arbitrary genres remains a challenge. Existing research works either on lead sheets or on the multi-track piano-rolls found in MIDI files, but both notations have their limits. In this work, we propose a new task, called lead sheet arrangement, to avoid these limits. We propose a new recurrent convolutional generative model for the task, along with three new symbolic-domain harmonic features that facilitate learning from unpaired lead sheets and MIDIs. Our model can generate eight-bar-long lead sheets and their arrangements. Audio samples of the generated results can be found at https://drive.google.com/open?id=1c0FfODTpudmLvuKBbc23VBCgQizY6-Rk

* 7 pages, 7 figures and 4 tables 
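
Below is a minimal conditional-generation sketch illustrating how a generator can be conditioned on a lead sheet; it is not the paper's recurrent convolutional model, and the tensor shapes (timesteps, pitches, number of accompaniment tracks) are assumptions chosen only for illustration.

```python
# Minimal conditional-GAN sketch of the arrangement idea (not the authors'
# model): a generator takes noise plus a lead-sheet piano-roll as the
# condition and emits a multi-track piano-roll.
import torch
import torch.nn as nn

T, P, TRACKS = 96, 84, 5            # timesteps, pitches, accompaniment tracks

class CondGenerator(nn.Module):
    def __init__(self, z_dim=128):
        super().__init__()
        self.cond = nn.Conv2d(1, 16, 3, padding=1)       # encode the lead sheet
        self.z_proj = nn.Linear(z_dim, T * P)
        self.net = nn.Sequential(
            nn.Conv2d(16 + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, TRACKS, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, z, lead_sheet):
        # lead_sheet: (batch, 1, T, P) binary piano-roll of melody and chords
        zc = self.z_proj(z).view(-1, 1, T, P)
        c = torch.relu(self.cond(lead_sheet))
        return self.net(torch.cat([c, zc], dim=1))       # (batch, TRACKS, T, P)

G = CondGenerator()
roll = G(torch.randn(2, 128), torch.rand(2, 1, T, P).round())
print(roll.shape)  # torch.Size([2, 5, 96, 84])
```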

Melody Generation using an Interactive Evolutionary Algorithm

Jul 07, 2019
Majid Farzaneh, Rahil Mahdian Toroghi

Music generation with the aid of computers has recently grabbed the attention of many researchers in artificial intelligence, and deep learning techniques have advanced sequence-generation methods for this purpose. A remaining challenge, however, is how to evaluate music generated by a machine. In this paper, we develop a methodology based on an interactive evolutionary optimization method, in which the generated melodies are scored primarily by human experts during training. This music-quality scoring is modeled with a Bi-LSTM recurrent neural network, and melodies subsequently generated by a genetic algorithm are evaluated with this Bi-LSTM network. The results clearly show that the proposed method is able to create pleasurable melodies with desired styles and pieces. The method is also quite fast compared with state-of-the-art data-oriented evolutionary systems.

* 5 pages, 4 images, submitted to MEDPRAI2019 conference 
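
A rough sketch of the loop described above follows, with assumptions throughout: a tiny genetic algorithm evolves note sequences, and an untrained Bi-LSTM scorer stands in for the human-trained quality model; the crossover and mutation operators are the simplest possible versions, not those of the paper.

```python
# Sketch: genetic algorithm over note sequences with a Bi-LSTM fitness proxy.
import random
import torch
import torch.nn as nn

VOCAB, LEN, POP = 48, 32, 20          # pitch range, melody length, population

class Scorer(nn.Module):              # proxy for the human-trained quality model
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 32)
        self.rnn = nn.LSTM(32, 32, batch_first=True, bidirectional=True)
        self.out = nn.Linear(64, 1)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h.mean(dim=1)).squeeze(-1)

scorer = Scorer().eval()

def fitness(pop):
    with torch.no_grad():
        return scorer(torch.tensor(pop)).tolist()

pop = [[random.randrange(VOCAB) for _ in range(LEN)] for _ in range(POP)]
for _ in range(10):                                       # GA generations
    scores = fitness(pop)
    ranked = [m for _, m in sorted(zip(scores, pop), reverse=True)]
    parents = ranked[:POP // 2]
    children = []
    for _ in range(POP - len(parents)):                   # crossover + mutation
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, LEN - 1)
        child = a[:cut] + b[cut:]
        child[random.randrange(LEN)] = random.randrange(VOCAB)
        children.append(child)
    pop = parents + children
```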

Convolutional Generative Adversarial Networks with Binary Neurons for Polyphonic Music Generation

Oct 06, 2018
Hao-Wen Dong, Yi-Hsuan Yang

It has been shown recently that deep convolutional generative adversarial networks (GANs) can learn to generate music in the form of piano-rolls, which represent music by binary-valued time-pitch matrices. However, existing models can only generate real-valued piano-rolls and require further post-processing, such as hard thresholding (HT) or Bernoulli sampling (BS), to obtain the final binary-valued results. In this paper, we study whether we can have a convolutional GAN model that directly creates binary-valued piano-rolls by using binary neurons. Specifically, we propose to append to the generator an additional refiner network, which uses binary neurons at the output layer. The whole network is trained in two stages. Firstly, the generator and the discriminator are pretrained. Then, the refiner network is trained along with the discriminator to learn to binarize the real-valued piano-rolls the pretrained generator creates. Experimental results show that using binary neurons instead of HT or BS indeed leads to better results in a number of objective measures. Moreover, deterministic binary neurons perform better than stochastic ones in both objective measures and a subjective test. The source code, training data and audio examples of the generated results can be found at https://salu133445.github.io/bmusegan/ .

* A preliminary version of this paper appeared in ISMIR 2018. In this version, we added an appendix to provide figures of sample results and remarks on the end-to-end models 
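
The key ingredient can be illustrated with a deterministic binary neuron that hard-thresholds in the forward pass and passes gradients straight through in the backward pass. The sketch below is an assumption-based illustration of that idea rather than the released code linked above.

```python
# Sketch: a deterministic binary neuron with a straight-through estimator,
# so a refiner network can emit binary piano-rolls yet still be trained
# with backpropagation.
import torch

class DeterministicBinaryNeuron(torch.autograd.Function):
    @staticmethod
    def forward(ctx, p):
        return (p > 0.5).float()             # hard 0/1 output
    @staticmethod
    def backward(ctx, grad):
        return grad                          # straight-through gradient

p = torch.sigmoid(torch.randn(2, 96, 84, requires_grad=True))  # refiner probabilities
binary_roll = DeterministicBinaryNeuron.apply(p)
binary_roll.sum().backward()                 # gradients still reach the refiner
```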

Semi-Recurrent CNN-based VAE-GAN for Sequential Data Generation

Jun 01, 2018
Mohammad Akbari, Jie Liang

A semi-recurrent hybrid VAE-GAN model for generating sequential data is introduced. To capture the spatial correlation of the data in each frame of the generated sequence, CNNs are used in the encoder, generator, and discriminator. Each subsequent frame is sampled from the latent distribution obtained by encoding the previous frame, so the dependencies between frames are maintained. Two testing frameworks for synthesizing a sequence with any number of frames are also proposed. Promising experimental results on piano music generation indicate the potential of the proposed framework for modeling other sequential data such as video.

* 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2321-2325  
* 5 pages, 6 figures, ICASSP 2018 
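
A toy sketch of the frame-to-frame sampling idea is given below; the discriminator, the losses, and the paper's two testing frameworks are omitted, and all layer sizes are assumptions. Each new frame is decoded from a latent sampled from the encoding of the previous frame, so generated frames depend on their history.

```python
# Toy sketch (not the paper's full VAE-GAN): sample each frame's latent
# from the encoding of the previous frame, then decode the next frame.
import torch
import torch.nn as nn

class FrameVAE(nn.Module):
    def __init__(self, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 8, 4, 2, 1), nn.ReLU(), nn.Flatten())
        self.mu = nn.Linear(8 * 16 * 16, z_dim)
        self.logvar = nn.Linear(8 * 16 * 16, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 32 * 32), nn.Sigmoid())

    def step(self, prev_frame):
        h = self.enc(prev_frame)                               # encode previous frame
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        return self.dec(z).view(-1, 1, 32, 32)                 # decode next frame

model = FrameVAE()
frame = torch.rand(1, 1, 32, 32)
sequence = [frame]
for _ in range(8):                                             # unroll a short sequence
    frame = model.step(frame)
    sequence.append(frame)
```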

Neural Shuffle-Exchange Networks -- Sequence Processing in O(n log n) Time

Jul 23, 2019
Kārlis Freivalds, Emīls Ozoliņš, Agris Šostaks

A key requirement in sequence-to-sequence processing is the modeling of long-range dependencies. To this end, the vast majority of state-of-the-art models use an attention mechanism, which has O($n^2$) complexity and leads to slow execution on long sequences. We introduce a new Shuffle-Exchange neural network model for sequence-to-sequence tasks that has O($\log n$) depth and O($n \log n$) total complexity. We show that this model is powerful enough to infer efficient algorithms for common algorithmic benchmarks, including sorting, addition, and multiplication. We evaluate our architecture on the challenging LAMBADA question answering dataset and compare it with state-of-the-art models that use attention. Our model achieves competitive accuracy and scales to sequences with more than a hundred thousand elements. We are confident that the proposed model has the potential for building more efficient architectures for processing large interrelated data in language modeling, music generation, and other application domains.
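
The routing pattern behind the complexity claim can be sketched compactly: alternate a learned operation on adjacent pairs with the perfect-shuffle permutation, so that about log2(n) such layers connect every pair of positions at O(n log n) total cost. The code below is only such a sketch; the real model's gated switch units and residual connections are omitted.

```python
# Compact sketch of the shuffle-exchange routing pattern (assumptions: a
# plain linear "switch" stands in for the paper's gated switch unit).
import math
import torch
import torch.nn as nn

class ShuffleExchangeSketch(nn.Module):
    def __init__(self, d=16):
        super().__init__()
        self.switch = nn.Linear(2 * d, 2 * d)          # shared pairwise unit

    def forward(self, x):                              # x: (batch, n, d), n = 2^k
        b, n, d = x.shape
        for _ in range(int(math.log2(n))):
            pairs = x.reshape(b, n // 2, 2 * d)        # operate on adjacent pairs
            x = torch.tanh(self.switch(pairs)).reshape(b, n, d)
            # perfect shuffle: interleave the first and second halves
            x = x.reshape(b, 2, n // 2, d).transpose(1, 2).reshape(b, n, d)
        return x

out = ShuffleExchangeSketch()(torch.randn(3, 64, 16))
print(out.shape)  # torch.Size([3, 64, 16])
```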

Automated Composition of Picture-Synched Music Soundtracks for Movies

Oct 19, 2019
Vansh Dassani, Jon Bird, Dave Cliff

We describe the implementation of and early results from a system that automatically composes picture-synched musical soundtracks for videos and movies. We use the phrase "picture-synched" to mean that the structure of the automatically composed music is determined by visual events in the input movie, i.e. the final music is synchronised to visual events and features such as cut transitions or within-shot key-frame events. Our system combines automated video analysis and computer-generated music-composition techniques to create unique soundtracks in response to the video input, and can be thought of as an initial step in creating a computerised replacement for a human composer writing music to fit the picture-locked edit of a movie. Working only from the video information in the movie, key features are extracted from the input video, using video analysis techniques, which are then fed into a machine-learning-based music generation tool, to compose a piece of music from scratch. The resulting soundtrack is tied to video features, such as scene transition markers and scene-level energy values, and is unique to the input video. Although the system we describe here is only a preliminary proof-of-concept, user evaluations of the output of the system have been positive.

* To be presented at the 16th ACM SIGGRAPH European Conference on Visual Media Production. London, England: 17th-18th December 2019. 10 pages, 9 figures 
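
As a very rough sketch of the first stage of such a pipeline (under the assumptions that the video has already been decoded into grayscale frames and that a fixed tempo is used; the paper's actual analysis and composition components are not reproduced here), cut transitions can be detected from large inter-frame differences and quantised to beat times for a downstream composer module.

```python
# Sketch: detect cut transitions from inter-frame differences and align them
# to beat times; the real system's video analysis is more sophisticated.
import numpy as np

def detect_cuts(frames, fps=25.0, threshold=30.0):
    """frames: (n_frames, height, width) uint8 grayscale array."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    cut_frames = np.where(diffs > threshold)[0] + 1
    return cut_frames / fps                        # cut times in seconds

def quantize_to_beats(cut_times, bpm=120.0):
    beat = 60.0 / bpm
    return np.round(np.asarray(cut_times) / beat) * beat

frames = (np.random.rand(250, 72, 128) * 255).astype(np.uint8)   # fake 10 s clip
print(quantize_to_beats(detect_cuts(frames)))
```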

This Time with Feeling: Learning Expressive Musical Performance

Aug 10, 2018
Sageev Oore, Ian Simon, Sander Dieleman, Douglas Eck, Karen Simonyan

Music generation has generally been focused on either creating scores or interpreting them. We discuss differences between these two problems and propose that, in fact, it may be valuable to work in the space of direct performance generation: jointly predicting the notes and also their expressive timing and dynamics. We consider the significance and qualities of the data set needed for this. Having identified both a problem domain and characteristics of an appropriate data set, we show an LSTM-based recurrent network model that subjectively performs quite well on this task. Critically, we provide generated examples. We also include feedback from professional composers and musicians about some of these examples.

* Includes links to urls for audio samples 
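
A minimal sketch of joint note/timing/dynamics modelling is shown below. The event vocabulary (note-on, note-off, time-shift, and velocity tokens) and the layer sizes are assumptions made for illustration, not the authors' exact representation; the point is simply that a single LSTM language model can predict all three aspects from one flattened event stream.

```python
# Sketch: MIDI-like performance events flattened into one token stream and
# modelled with an LSTM language model over next-event prediction.
import torch
import torch.nn as nn

# assumed vocabulary: 128 note-on + 128 note-off + 100 time-shift + 32 velocity
VOCAB = 128 + 128 + 100 + 32

class PerformanceLM(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        self.rnn = nn.LSTM(d, d, num_layers=2, batch_first=True)
        self.out = nn.Linear(d, VOCAB)
    def forward(self, tokens):
        h, _ = self.rnn(self.emb(tokens))
        return self.out(h)                       # next-event logits

model = PerformanceLM()
tokens = torch.randint(0, VOCAB, (4, 512))       # fake performance streams
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB),
                                   tokens[:, 1:].reshape(-1))
loss.backward()
```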