"music generation": models, code, and papers

GTN-Bailando: Genre Consistent Long-Term 3D Dance Generation based on Pre-trained Genre Token Network

Apr 25, 2023
Haolin Zhuang, Shun Lei, Long Xiao, Weiqin Li, Liyang Chen, Sicheng Yang, Zhiyong Wu, Shiyin Kang, Helen Meng

Music-driven 3D dance generation has become an intensive research topic in recent years with great potential for real-world applications. Most existing methods do not consider genre, which results in genre inconsistency in the generated dance movements. In addition, the correlation between the dance genre and the music has not been investigated. To address these issues, we propose a genre-consistent dance generation framework, GTN-Bailando. First, we propose the Genre Token Network (GTN), which infers the genre from music to enhance the genre consistency of long-term dance generation. Second, to improve the generalization capability of the model, the strategy of pre-training and fine-tuning is adopted. Experimental results on the AIST++ dataset show that the proposed dance generation framework outperforms state-of-the-art methods in terms of motion quality and genre consistency.

* Accepted by ICASSP 2023. Demo page: https://im1eon.github.io/ICASSP23-GTNB-DG/

The Effect of Explicit Structure Encoding of Deep Neural Networks for Symbolic Music Generation

Nov 20, 2018
Ke Chen, Weilin Zhang, Shlomo Dubnov, Gus Xia

With recent breakthroughs in artificial neural networks, deep generative models have become one of the leading techniques for computational creativity. Despite very promising progress on image and short sequence generation, symbolic music generation remains a challenging problem since the structure of compositions is usually complicated. In this study, we attempt to solve the melody generation problem constrained by a given chord progression. This music meta-creation problem can also be incorporated into a plan recognition system with user inputs and predictive structural outputs. In particular, we explore the effect of explicit architectural encoding of musical structure by comparing two sequential generative models: LSTM (a type of RNN) and WaveNet (dilated temporal-CNN). As far as we know, this is the first study applying WaveNet to symbolic music generation, as well as the first systematic comparison between temporal-CNN and RNN for music generation. We conduct a survey to evaluate our generations and implement Variable Markov Oracle for music pattern discovery. Experimental results show that encoding structure more explicitly using a stack of dilated convolution layers improves performance significantly, and that a global encoding of the underlying chord progression into the generation procedure yields further gains.

* 8 pages, 13 figures
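
The abstract above credits a stack of dilated convolution layers with encoding structure more explicitly. One way to see why is to compute how many past time steps a single output can observe. The sketch below is illustrative only: the kernel size of 2, the six layers, and the doubling dilation schedule are assumptions in the style of WaveNet, not values reported in the abstract.

```python
def receptive_field(kernel_size, dilations):
    """Input time steps visible to one output of a stack of causal dilated convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Assumed schedule: the dilation doubles at every layer (1, 2, 4, 8, 16, 32).
dilations = [2 ** i for i in range(6)]

# Same depth and kernel size, with and without dilation.
dilated_rf = receptive_field(2, dilations)
plain_rf = receptive_field(2, [1] * len(dilations))

print(dilated_rf)  # 64
print(plain_rf)    # 7
```

With the same number of layers, the dilated stack covers 64 steps instead of 7, so each layer operates at a fixed, progressively longer time scale.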

Dance Generation by Sound Symbolic Words

Jun 06, 2023
Miki Okamura, Naruya Kondo, Tatsuki Fushimi, Maki Sakamoto, Yoichi Ochiai

This study introduces a novel approach to generate dance motions using onomatopoeia as input, with the aim of enhancing creativity and diversity in dance generation. Unlike text and music, onomatopoeia conveys rhythm and meaning through abstract word expressions without constraints on expression and without need for specialized knowledge. We adapt the AI Choreographer framework and employ the Sakamoto system, a feature extraction method for onomatopoeia focusing on phonemes and syllables. Additionally, we present a new dataset of 40 onomatopoeia-dance motion pairs collected through a user survey. Our results demonstrate that the proposed method enables more intuitive dance generation and can create dance motions using sound-symbolic words from a variety of languages, including those without onomatopoeia. This highlights the potential for diverse dance creation across different languages and cultures, accessible to a wider audience. Qualitative samples from our model can be found at: https://sites.google.com/view/onomatopoeia-dance/home/.

Multi-Genre Music Transformer -- Composing Full Length Musical Piece

Jan 06, 2023
Abhinav Kaushal Keshari

In the task of generating music, the art factor plays a big role and is a great challenge for AI. Previous work involving adversarial training to produce new music pieces and modeling the compatibility of variety in music (beats, tempo, musical stems) demonstrated great examples of learning this task, though it was limited to generating mashups or learning features from tempo and key distributions to produce similar patterns. Compound Word Transformer was able to represent the music generation task as a sequence generation challenge involving musical events defined by compound words. These musical events give a more accurate description of note progression, chord change, harmony, and the art factor. The objective of the project is to implement a Multi-Genre Transformer that learns to produce music pieces through a more adaptive learning process involving a more challenging task in which the genre or form of the composition is also considered. We built a multi-genre compound word dataset and implemented a linear transformer, which was trained on this dataset. We call this the Multi-Genre Transformer; it was able to generate full-length new musical pieces that are diverse and comparable to original tracks. The model trains 2-5 times faster than other models discussed.

Symbolic music generation conditioned on continuous-valued emotions

Mar 30, 2022
Serkan Sulun, Matthew E. P. Davies, Paula Viana

In this paper we present a new approach for the generation of multi-instrument symbolic music driven by musical emotion. The principal novelty of our approach centres on conditioning a state-of-the-art transformer based on continuous-valued valence and arousal labels. In addition, we provide a new large-scale dataset of symbolic music paired with emotion labels in terms of valence and arousal. We evaluate our approach in a quantitative manner in two ways, first by measuring its note prediction accuracy, and second via a regression task in the valence-arousal plane. Our results demonstrate that our proposed approaches outperform conditioning using control tokens, which is representative of the current state of the art.
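
The contrast between control tokens and continuous-valued conditioning can be illustrated with a small sketch. It is not the authors' implementation: the label range of [-1, 1], the five bins, the helper names, and the hand-picked projection weights are all assumptions made for illustration. It only shows that two nearby emotion labels can collapse into the same discrete token while still producing distinct continuous condition vectors.

```python
def to_control_token(value, n_bins=5):
    """Discretize a label assumed to lie in [-1, 1] into one of n_bins token ids."""
    clipped = max(-1.0, min(1.0, value))
    index = int((clipped + 1.0) / 2.0 * n_bins)
    return min(index, n_bins - 1)  # value == 1.0 falls into the last bin


def to_condition_vector(valence, arousal, weights, bias):
    """Linearly project (valence, arousal) to a d-dimensional conditioning vector."""
    return [w_v * valence + w_a * arousal + b
            for (w_v, w_a), b in zip(weights, bias)]


# Hypothetical 3-dimensional projection; in a real model these would be learned.
weights = [(1.0, 0.0), (0.0, 1.0), (0.5, -0.5)]
bias = [0.0, 0.0, 0.0]

a = to_condition_vector(0.10, 0.5, weights, bias)
b = to_condition_vector(0.18, 0.5, weights, bias)  # slightly more positive valence

# Both valence values land in the same bin, yet the continuous vectors differ.
same_token = to_control_token(0.10) == to_control_token(0.18)
print(same_token, a != b)
```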

MARBLE: Music Audio Representation Benchmark for Universal Evaluation

Jun 21, 2023
Ruibin Yuan, Yinghao Ma, Yizhi Li, Ge Zhang, Xingran Chen, Hanzhi Yin, Le Zhuo, Yiqi Liu, Jiawen Huang, Zeyue Tian, Binyue Deng, Ningzhi Wang, Chenghua Lin, Emmanouil Benetos, Anton Ragni, Norbert Gyenge, Roger Dannenberg, Wenhu Chen, Gus Xia, Wei Xue, Si Liu, Shi Wang, Ruibo Liu, Yike Guo, Jie Fu

In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines. In addition, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on copyright issues on datasets. Results suggest that recently proposed large-scale pre-trained musical language models perform the best in most tasks, with room for further improvement. The leaderboard and toolkit repository are published at https://marble-bm.shef.ac.uk to promote future music AI research.

Comparision Of Adversarial And Non-Adversarial LSTM Music Generative Models

Nov 01, 2022
Moseli Mots'oehli, Anna Sergeevna Bosman, Johan Pieter De Villiers

Algorithmic music composition is a way of composing musical pieces with minimal to no human intervention. While recurrent neural networks are traditionally applied to many sequence-to-sequence prediction tasks, including successful implementations of music composition, their standard supervised learning approach based on input-to-output mapping leads to a lack of note variety. These models can therefore be seen as potentially unsuitable for tasks such as music generation. Generative adversarial networks learn the generative distribution of data and lead to varied samples. This work implements and compares adversarial and non-adversarial training of recurrent neural network music composers on MIDI data. The resulting music samples are evaluated by human listeners, their preferences recorded. The evaluation indicates that adversarial training produces more aesthetically pleasing music.

* Submitted to a 2023 conference, 20 pages, 13 figures

Generative Disco: Text-to-Video Generation for Music Visualization

Apr 17, 2023
Vivian Liu, Tao Long, Nathan Raw, Lydia Chilton

Visuals are a core part of our experience of music, owing to the way they can amplify the emotions and messages conveyed through the music. However, creating music visualization is a complex, time-consuming, and resource-intensive process. We introduce Generative Disco, a generative AI system that helps generate music visualizations with large language models and text-to-image models. Users select intervals of music to visualize and then parameterize that visualization by defining start and end prompts. These prompts are warped between and generated according to the beat of the music for audioreactive video. We introduce design patterns for improving generated videos: "transitions", which express shifts in color, time, subject, or style, and "holds", which encourage visual emphasis and consistency. A study with professionals showed that the system was enjoyable, easy to explore, and highly expressive. We conclude on use cases of Generative Disco for professionals and how AI-generated content is changing the landscape of creative work.
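
The idea of moving from a start prompt to an end prompt in step with the beat can be sketched as follows. This is a minimal illustration, not the system's actual method: it assumes each prompt has already been turned into a numeric vector, uses plain linear interpolation, and takes beat times as given. The function names and the toy two-dimensional vectors are hypothetical.

```python
def lerp(start, end, t):
    """Linear interpolation between two equal-length vectors, t in [0, 1]."""
    return [(1.0 - t) * s + t * e for s, e in zip(start, end)]


def frames_on_beats(start_vec, end_vec, interval_start, interval_end, beat_times):
    """Return one (time, vector) pair for each beat inside the chosen interval."""
    span = interval_end - interval_start
    if span <= 0:
        raise ValueError("interval_end must be greater than interval_start")
    frames = []
    for beat in beat_times:
        if interval_start <= beat <= interval_end:
            t = (beat - interval_start) / span  # progress through the interval
            frames.append((beat, lerp(start_vec, end_vec, t)))
    return frames


# Toy stand-ins for the embeddings of a start prompt and an end prompt.
start_vec = [0.0, 1.0]
end_vec = [1.0, 0.0]
beats = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]  # beat times in seconds (assumed known)

frames = frames_on_beats(start_vec, end_vec, 0.0, 2.0, beats)
for time, vec in frames:
    print(time, vec)
```

Each beat inside the selected interval receives its own interpolated vector, so visual changes land on the beat rather than at arbitrary frame times.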

Unsupervised Melody-to-Lyric Generation

May 30, 2023
Yufei Tian, Anjali Narayan-Chen, Shereen Oraby, Alessandra Cervone, Gunnar Sigurdsson, Chenyang Tao, Wenbo Zhao, Tagyoung Chung, Jing Huang, Nanyun Peng

Automatic melody-to-lyric generation is a task in which song lyrics are generated to go with a given melody. It is of significant practical interest and more challenging than unconstrained lyric generation as the music imposes additional constraints onto the lyrics. The training data is limited as most songs are copyrighted, resulting in models that underfit the complicated cross-modal relationship between melody and lyrics. In this work, we propose a method for generating high-quality lyrics without training on any aligned melody-lyric data. Specifically, we design a hierarchical lyric generation framework that first generates a song outline and second the complete lyrics. The framework enables disentanglement of training (based purely on text) from inference (melody-guided text generation) to circumvent the shortage of parallel data. We leverage the segmentation and rhythm alignment between melody and lyrics to compile the given melody into decoding constraints as guidance during inference. The two-step hierarchical design also enables content control via the lyric outline, a much-desired feature for democratizing collaborative song creation. Experimental results show that our model can generate high-quality lyrics that are more on-topic, singable, intelligible, and coherent than strong baselines, for example SongMASS, a SOTA model trained on a parallel dataset, with a 24% relative overall quality improvement based on human ratings.

* Accepted to ACL 23. arXiv admin note: substantial text overlap with arXiv:2305.07760

YM2413-MDB: A Multi-Instrumental FM Video Game Music Dataset with Emotion Annotations

Nov 14, 2022
Eunjin Choi, Yoonjin Chung, Seolhee Lee, JongIk Jeon, Taegyun Kwon, Juhan Nam

Existing multi-instrumental datasets tend to be biased toward pop and classical music. In addition, they generally lack high-level annotations such as emotion tags. In this paper, we propose YM2413-MDB, an 80s FM video game music dataset with multi-label emotion annotations. It includes 669 audio and MIDI files of music from Sega and MSX PC games in the 80s using YM2413, a programmable sound generator based on FM. The collected game music is arranged with a subset of 15 monophonic instruments and one drum instrument. They were converted from binary commands of the YM2413 sound chip. Each song was labeled with 19 emotion tags by two annotators and validated by three verifiers to obtain refined tags. We provide the baseline models and results for emotion recognition and emotion-conditioned symbolic music generation using YM2413-MDB.

* The paper has been accepted for publication at ISMIR 2022
