Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Topic:music generation

What is music generation? Music generation is the task of generating music or music-like sounds from a model or algorithm.

MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models

Dec 09, 2024

Shansong Liu, Atin Sakkeer Hussain, Qilong Wu, Chenshuo Sun, Ying Shan

Abstract:Research on large language models has advanced significantly across text, speech, images, and videos. However, multi-modal music understanding and generation remain underexplored due to the lack of well-annotated datasets. To address this, we introduce a dataset with 167.69 hours of multi-modal data, including text, images, videos, and music annotations. Based on this dataset, we propose MuMu-LLaMA, a model that leverages pre-trained encoders for music, images, and videos. For music generation, we integrate AudioLDM 2 and MusicGen. Our evaluation across four tasks--music understanding, text-to-music generation, prompt-based music editing, and multi-modal music generation--demonstrates that MuMu-LLaMA outperforms state-of-the-art models, showing its potential for multi-modal music applications.

Via

Access Paper or Ask Questions

Detecting Machine-Generated Music with Explainability -- A Challenge and Early Benchmarks

Dec 18, 2024

Yupei Li, Qiyang Sun, Hanqian Li, Lucia Specia, Björn W. Schuller

Figure 1 for Detecting Machine-Generated Music with Explainability -- A Challenge and Early Benchmarks

Figure 2 for Detecting Machine-Generated Music with Explainability -- A Challenge and Early Benchmarks

Figure 3 for Detecting Machine-Generated Music with Explainability -- A Challenge and Early Benchmarks

Figure 4 for Detecting Machine-Generated Music with Explainability -- A Challenge and Early Benchmarks

Abstract:Machine-generated music (MGM) has become a groundbreaking innovation with wide-ranging applications, such as music therapy, personalised editing, and creative inspiration within the music industry. However, the unregulated proliferation of MGM presents considerable challenges to the entertainment, education, and arts sectors by potentially undermining the value of high-quality human compositions. Consequently, MGM detection (MGMD) is crucial for preserving the integrity of these fields. Despite its significance, MGMD domain lacks comprehensive benchmark results necessary to drive meaningful progress. To address this gap, we conduct experiments on existing large-scale datasets using a range of foundational models for audio processing, establishing benchmark results tailored to the MGMD task. Our selection includes traditional machine learning models, deep neural networks, Transformer-based architectures, and State Space Models (SSM). Recognising the inherently multimodal nature of music, which integrates both melody and lyrics, we also explore fundamental multimodal models in our experiments. Beyond providing basic binary classification outcomes, we delve deeper into model behaviour using multiple explainable Aritificial Intelligence (XAI) tools, offering insights into their decision-making processes. Our analysis reveals that ResNet18 performs the best according to in-domain and out-of-domain tests. By providing a comprehensive comparison of benchmark results and their interpretability, we propose several directions to inspire future research to develop more robust and effective detection methods for MGM.

Via

Access Paper or Ask Questions

Watermarking Training Data of Music Generation Models

Dec 12, 2024

Pascal Epple, Igor Shilov, Bozhidar Stevanoski, Yves-Alexandre de Montjoye

Figure 1 for Watermarking Training Data of Music Generation Models

Figure 2 for Watermarking Training Data of Music Generation Models

Figure 3 for Watermarking Training Data of Music Generation Models

Figure 4 for Watermarking Training Data of Music Generation Models

Abstract:Generative Artificial Intelligence (Gen-AI) models are increasingly used to produce content across domains, including text, images, and audio. While these models represent a major technical breakthrough, they gain their generative capabilities from being trained on enormous amounts of human-generated content, which often includes copyrighted material. In this work, we investigate whether audio watermarking techniques can be used to detect an unauthorized usage of content to train a music generation model. We compare outputs generated by a model trained on watermarked data to a model trained on non-watermarked data. We study factors that impact the model's generation behaviour: the watermarking technique, the proportion of watermarked samples in the training set, and the robustness of the watermarking technique against the model's tokenizer. Our results show that audio watermarking techniques, including some that are imperceptible to humans, can lead to noticeable shifts in the model's outputs. We also study the robustness of a state-of-the-art watermarking technique to removal techniques.

Via

Access Paper or Ask Questions

CoheDancers: Enhancing Interactive Group Dance Generation through Music-Driven Coherence Decomposition

Dec 26, 2024

Kaixing Yang, Xulong Tang, Haoyu Wu, Qinliang Xue, Biao Qin, Hongyan Liu, Zhaoxin Fan

Figure 1 for CoheDancers: Enhancing Interactive Group Dance Generation through Music-Driven Coherence Decomposition

Figure 2 for CoheDancers: Enhancing Interactive Group Dance Generation through Music-Driven Coherence Decomposition

Figure 3 for CoheDancers: Enhancing Interactive Group Dance Generation through Music-Driven Coherence Decomposition

Figure 4 for CoheDancers: Enhancing Interactive Group Dance Generation through Music-Driven Coherence Decomposition

Abstract:Dance generation is crucial and challenging, particularly in domains like dance performance and virtual gaming. In the current body of literature, most methodologies focus on Solo Music2Dance. While there are efforts directed towards Group Music2Dance, these often suffer from a lack of coherence, resulting in aesthetically poor dance performances. Thus, we introduce CoheDancers, a novel framework for Music-Driven Interactive Group Dance Generation. CoheDancers aims to enhance group dance generation coherence by decomposing it into three key aspects: synchronization, naturalness, and fluidity. Correspondingly, we develop a Cycle Consistency based Dance Synchronization strategy to foster music-dance correspondences, an Auto-Regressive-based Exposure Bias Correction strategy to enhance the fluidity of the generated dances, and an Adversarial Training Strategy to augment the naturalness of the group dance output. Collectively, these strategies enable CohdeDancers to produce highly coherent group dances with superior quality. Furthermore, to establish better benchmarks for Group Music2Dance, we construct the most diverse and comprehensive open-source dataset to date, I-Dancers, featuring rich dancer interactions, and create comprehensive evaluation metrics. Experimental evaluations on I-Dancers and other extant datasets substantiate that CoheDancers achieves unprecedented state-of-the-art performance. Code will be released.

Via

Access Paper or Ask Questions

VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music

Dec 23, 2024

Jiatong Shi, Hye-jin Shim, Jinchuan Tian, Siddhant Arora, Haibin Wu, Darius Petermann, Jia Qi Yip, You Zhang, Yuxun Tang, Wangyou Zhang(+7 more)

Abstract:In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 63 metrics with 711 metric variations based on different configurations. These metrics encompass evaluations utilizing diverse external resources, including matching and non-matching reference audio, text transcriptions, and text captions. As a lightweight yet comprehensive toolkit, VERSA is versatile to support the evaluation of a wide range of downstream scenarios. To demonstrate its capabilities, this work highlights example use cases for VERSA, including audio coding, speech synthesis, speech enhancement, singing synthesis, and music generation. The toolkit is available at https://github.com/shinjiwlab/versa.

Via

Access Paper or Ask Questions

M6: Multi-generator, Multi-domain, Multi-lingual and cultural, Multi-genres, Multi-instrument Machine-Generated Music Detection Databases

Dec 08, 2024

Yupei Li, Hanqian Li, Lucia Specia, Björn W. Schuller

Abstract:Machine-generated music (MGM) has emerged as a powerful tool with applications in music therapy, personalised editing, and creative inspiration for the music community. However, its unregulated use threatens the entertainment, education, and arts sectors by diminishing the value of high-quality human compositions. Detecting machine-generated music (MGMD) is, therefore, critical to safeguarding these domains, yet the field lacks comprehensive datasets to support meaningful progress. To address this gap, we introduce \textbf{M6}, a large-scale benchmark dataset tailored for MGMD research. M6 is distinguished by its diversity, encompassing multiple generators, domains, languages, cultural contexts, genres, and instruments. We outline our methodology for data selection and collection, accompanied by detailed data analysis, providing all WAV form of music. Additionally, we provide baseline performance scores using foundational binary classification models, illustrating the complexity of MGMD and the significant room for improvement. By offering a robust and multifaceted resource, we aim to empower future research to develop more effective detection methods for MGM. We believe M6 will serve as a critical step toward addressing this societal challenge. The dataset and code will be freely available to support open collaboration and innovation in this field.

Via

Access Paper or Ask Questions

Interpreting Graphic Notation with MusicLDM: An AI Improvisation of Cornelius Cardew's Treatise

Dec 12, 2024

Tornike Karchkhadze, Keren Shao, Shlomo Dubnov

Figure 1 for Interpreting Graphic Notation with MusicLDM: An AI Improvisation of Cornelius Cardew's Treatise

Abstract:This work presents a novel method for composing and improvising music inspired by Cornelius Cardew's Treatise, using AI to bridge graphic notation and musical expression. By leveraging OpenAI's ChatGPT to interpret the abstract visual elements of Treatise, we convert these graphical images into descriptive textual prompts. These prompts are then input into MusicLDM, a pre-trained latent diffusion model designed for music generation. We introduce a technique called "outpainting," which overlaps sections of AI-generated music to create a seamless and cohesive composition. We demostrate a new perspective on performing and interpreting graphic scores, showing how AI can transform visual stimuli into sound and expand the creative possibilities in contemporary/experimental music composition. Musical pieces are available at https://bit.ly/TreatiseAI

* 2024 IEEE International Conference on Big Data (Big Data)

Via

Access Paper or Ask Questions

Zema Dataset: A Comprehensive Study of Yaredawi Zema with a Focus on Horologium Chants

Dec 25, 2024

Mequanent Argaw Muluneh, Yan-Tsung Peng, Worku Abebe Degife, Nigussie Abate Tadesse, Aknachew Mebreku Demeku, Li Su

Figure 1 for Zema Dataset: A Comprehensive Study of Yaredawi Zema with a Focus on Horologium Chants

Figure 2 for Zema Dataset: A Comprehensive Study of Yaredawi Zema with a Focus on Horologium Chants

Figure 3 for Zema Dataset: A Comprehensive Study of Yaredawi Zema with a Focus on Horologium Chants

Figure 4 for Zema Dataset: A Comprehensive Study of Yaredawi Zema with a Focus on Horologium Chants

Abstract:Computational music research plays a critical role in advancing music production, distribution, and understanding across various musical styles worldwide. Despite the immense cultural and religious significance, the Ethiopian Orthodox Tewahedo Church (EOTC) chants are relatively underrepresented in computational music research. This paper contributes to this field by introducing a new dataset specifically tailored for analyzing EOTC chants, also known as Yaredawi Zema. This work provides a comprehensive overview of a 10-hour dataset, 369 instances, creation, and curation process, including rigorous quality assurance measures. Our dataset has a detailed word-level temporal boundary and reading tone annotation along with the corresponding chanting mode label of audios. Moreover, we have also identified the chanting options associated with multiple chanting notations in the manuscript by annotating them accordingly. Our goal in making this dataset available to the public 1 is to encourage more research and study of EOTC chants, including lyrics transcription, lyric-to-audio alignment, and music generation tasks. Such research work will advance knowledge and efforts to preserve this distinctive liturgical music, a priceless cultural artifact for the Ethiopian people.

* 2024 International Conference on Information and Communication Technology for Development for Africa (ICT4DA)
* 6 pages

Via

Access Paper or Ask Questions

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

Feb 07, 2025

Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov(+3 more)

Figure 1 for Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

Figure 2 for Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

Figure 3 for Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

Figure 4 for Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

Abstract:The quantification of audio aesthetics remains a complex challenge in audio processing, primarily due to its subjective nature, which is influenced by human perception and cultural context. Traditional methods often depend on human listeners for evaluation, leading to inconsistencies and high resource demands. This paper addresses the growing need for automated systems capable of predicting audio aesthetics without human intervention. Such systems are crucial for applications like data filtering, pseudo-labeling large datasets, and evaluating generative audio models, especially as these models become more sophisticated. In this work, we introduce a novel approach to audio aesthetic evaluation by proposing new annotation guidelines that decompose human listening perspectives into four distinct axes. We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality. Our models are evaluated against human mean opinion scores (MOS) and existing methods, demonstrating comparable or superior performance. This research not only advances the field of audio aesthetics but also provides open-source models and datasets to facilitate future work and benchmarking. We release our code and pre-trained model at: https://github.com/facebookresearch/audiobox-aesthetics

* Repository: https://github.com/facebookresearch/audiobox-aesthetics Website: https://ai.meta.com/research/publications/meta-audiobox-aesthetics-unified-automatic-quality-assessment-for-speech-music-and-sound/

Via

Access Paper or Ask Questions

A2SB: Audio-to-Audio Schrodinger Bridges

Jan 20, 2025

Zhifeng Kong, Kevin J Shih, Weili Nie, Arash Vahdat, Sang-gil Lee, Joao Felipe Santos, Ante Jukic, Rafael Valle, Bryan Catanzaro

Abstract:Audio in the real world may be perturbed due to numerous factors, causing the audio quality to be degraded. The following work presents an audio restoration model tailored for high-res music at 44.1kHz. Our model, Audio-to-Audio Schrodinger Bridges (A2SB), is capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB is end-to-end without need of a vocoder to predict waveform outputs, able to restore hour-long audio inputs, and trained on permissively licensed music data. A2SB is capable of achieving state-of-the-art bandwidth extension and inpainting quality on several out-of-distribution music test sets. Our demo website is https: //research.nvidia.com/labs/adlr/A2SB/.

Via

Access Paper or Ask Questions

Topic:music generation

Papers and Code