In this paper, we present a transformer-based learning framework for 3D dance generation conditioned on music. We carefully design our network architecture and empirically study the keys for obtaining qualitatively pleasing results. The critical components include a deep cross-modal transformer, which well learns the correlation between the music and dance motion; and the full-attention with future-N supervision mechanism which is essential in producing long-range non-freezing motion. In addition, we propose a new dataset of paired 3D motion and music called AIST++, which we reconstruct from the AIST multi-view dance videos. This dataset contains 1.1M frames of 3D dance motion in 1408 sequences, covering 10 genres of dance choreographies and accompanied with multi-view camera parameters. To our knowledge it is the largest dataset of this kind. Rich experiments on AIST++ demonstrate our method produces much better results than the state-of-the-art methods both qualitatively and quantitatively.
Automatic melody generation has been a long-time aspiration for both AI researchers and musicians. However, learning to generate euphonious melodies has turned out to be highly challenging. This paper introduces 1) a new variant of variational autoencoder (VAE), where the model structure is designed in a modularized manner in order to model polyphonic and dynamic music with domain knowledge, and 2) a hierarchical encoding/decoding strategy, which explicitly models the dependency between melodic features. The proposed framework is capable of generating distinct melodies that sounds natural, and the experiments for evaluating generated music clips show that the proposed model outperforms the baselines in human evaluation.
In recent years, artificial neural networks (ANNs) have become a universal tool for tackling real-world problems. ANNs have also shown great success in music-related tasks including music summarization and classification, similarity estimation, computer-aided or autonomous composition, and automatic music analysis. As structure is a fundamental characteristic of Western music, it plays a role in all these tasks. Some structural aspects are particularly challenging to learn with current ANN architectures. This is especially true for mid- and high-level self-similarity, tonal and rhythmic relationships. In this thesis, I explore the application of ANNs to different aspects of musical structure modeling, identify some challenges involved and propose strategies to address them. First, using probability estimations of a Restricted Boltzmann Machine (RBM), a probabilistic bottom-up approach to melody segmentation is studied. Then, a top-down method for imposing a high-level structural template in music generation is presented, which combines Gibbs sampling using a convolutional RBM with gradient-descent optimization on the intermediate solutions. Furthermore, I motivate the relevance of musical transformations in structure modeling and show how a connectionist model, the Gated Autoencoder (GAE), can be employed to learn transformations between musical fragments. For learning transformations in sequences, I propose a special predictive training of the GAE, which yields a representation of polyphonic music as a sequence of intervals. Furthermore, the applicability of these interval representations to a top-down discovery of repeated musical sections is shown. Finally, a recurrent variant of the GAE is proposed, and its efficacy in music prediction and modeling of low-level repetition structure is demonstrated.
A representation technique that allows encoding music in a way that contains musical meaning would improve the results of any model trained for computer music tasks like generation of melodies and harmonies of better quality. The field of natural language processing has done a lot of work in finding a way to capture the semantic meaning of words and sentences, and word embeddings have successfully shown the capabilities for such a task. In this paper, we experiment with embeddings to represent musical notes from 3 different variations of a dataset and analyze if the model can capture useful musical patterns. To do this, the resulting embeddings are visualized in projections using the t-SNE technique.
Deep learning models are typically evaluated to measure and compare their performance on a given task. The metrics that are commonly used to evaluate these models are standard metrics that are used for different tasks. In the field of music composition or generation, the standard metrics used in other fields have no clear meaning in terms of music theory. In this paper, we propose a subjective method to evaluate AI-based music composition systems by asking questions related to basic music principles to different levels of users based on their musical experience and knowledge. We use this method to compare state-of-the-art models for music composition with deep learning. We give the results of this evaluation method and we compare the responses of each user level for each evaluated model.
The dominant approach for music representation learning involves the deep unsupervised model family variational autoencoder (VAE). However, most, if not all, viable attempts on this problem have largely been limited to monophonic music. Normally composed of richer modality and more complex musical structures, the polyphonic counterpart has yet to be addressed in the context of music representation learning. In this work, we propose the PianoTree VAE, a novel tree-structure extension upon VAE aiming to fit the polyphonic music learning. The experiments prove the validity of the PianoTree VAE via (i)-semantically meaningful latent code for polyphonic segments; (ii)-more satisfiable reconstruction aside of decent geometry learned in the latent space; (iii)-this model's benefits to the variety of the downstream music generation.
We introduce a new artificial intelligence (AI) approach called, the 'Digital Synaptic Neural Substrate' (DSNS). It uses selected attributes from objects in various domains (e.g. chess problems, classical music, renowned artworks) and recombines them in such a way as to generate new attributes that can then, in principle, be used to create novel objects of creative value to humans relating to any one of the source domains. This allows some of the burden of creative content generation to be passed from humans to machines. The approach was tested in the domain of chess problem composition. We used it to automatically compose numerous sets of chess problems based on attributes extracted and recombined from chess problems and tournament games by humans, renowned paintings, computer-evolved abstract art, photographs of people, and classical music tracks. The quality of these generated chess problems was then assessed automatically using an existing and experimentally-validated computational chess aesthetics model. They were also assessed by human experts in the domain. The results suggest that attributes collected and recombined from chess and other domains using the DSNS approach can indeed be used to automatically generate chess problems of reasonably high aesthetic quality. In particular, a low quality chess source (i.e. tournament game sequences between weak players) used in combination with actual photographs of people was able to produce three-move chess problems of comparable quality or better to those generated using a high quality chess source (i.e. published compositions by human experts), and more efficiently as well. Why information from a foreign domain can be integrated and functional in this way remains an open question for now. The DSNS approach is, in principle, scalable and applicable to any domain in which objects have attributes that can be represented using real numbers.
With rapid development of neural networks, deep-learning has been extended to various natural language generation fields, such as machine translation, dialogue generation and even literature creation. In this paper, we propose a theme-aware language generation model for Chinese music lyrics, which improves the theme-connectivity and coherence of generated paragraphs greatly. A multi-channel sequence-to-sequence (seq2seq) model encodes themes and previous sentences as global and local contextual information. Moreover, attention mechanism is incorporated for sequence decoding, enabling to fuse context into predicted next texts. To prepare appropriate train corpus, LDA (Latent Dirichlet Allocation) is applied for theme extraction. Generated lyrics is grammatically correct and semantically coherent with selected themes, which offers a valuable modelling method in other fields including multi-turn chatbots, long paragraph generation and etc.
Deep generative models have recently achieved impressive performance in speech synthesis and music generation. However, compared to the generation of those domain-specific sounds, the generation of general sounds (such as car horn, dog barking, and gun shot) has received less attention, despite their wide potential applications. In our previous work, sounds are generated in the time domain using SampleRNN. However, it is difficult to capture long-range dependencies within sound recordings using this method. In this work, we propose to generate sounds conditioned on sound classes via neural discrete time-frequency representation learning. This offers an advantage in modelling long-range dependencies and retaining local fine-grained structure within a sound clip. We evaluate our proposed approach on the UrbanSound8K dataset, as compared to a SampleRNN baseline, with the performance metrics measuring the quality and diversity of the generated sound samples. Experimental results show that our proposed method offers significantly better performance in diversity and comparable performance in quality, as compared to the baseline method.