Music generation is the task of generating music or music-like sounds from a model or algorithm.
This report presents the work done over 22 weeks of internship within the Sound Perception and Design team of the Sciences and Technologies of Music and Sound (STMS) laboratory at the Institute for Research and Coordination in Acoustics/Music (IRCAM). As part of the launch of the project Reducing Noise with Augmented Reality (ReNAR), which aims to create a tool to reduce, in real time, the cognitive impact of sounds perceived as unpleasant or annoying in indoor environments, an initial study was conducted to validate the feasibility and effectiveness of a new masking approach called concealer. The main hypothesis is that the concealer approach could provide better results than a masker approach in terms of perceived pleasantness. Mixtures of two noise sources (ventilation) and five masking sounds (water sounds) were generated using both approaches at various levels. The evaluation of the perceived pleasantness of these mixtures showed that the masker approach remains more effective than the concealer approach, regardless of the noise source, water sound, or level used.
Accurately estimating nonlinear audio effects without access to paired input-output signals remains a challenging problem. This work studies unsupervised probabilistic approaches for solving this task. We introduce a method, novel for this application, based on diffusion generative models for blind system identification, enabling the estimation of unknown nonlinear effects using black- and gray-box models. This study compares this method with a previously proposed adversarial approach, analyzing the performance of both methods under different parameterizations of the effect operator and varying lengths of available effected recordings. Through experiments on guitar distortion effects, we show that the diffusion-based approach provides more stable results and is less sensitive to data availability, while the adversarial approach is superior at estimating more pronounced distortion effects. Our findings contribute to the robust unsupervised blind estimation of audio effects, demonstrating the potential of diffusion models for system identification in music technology.




The rapid advancement of large language models (LLMs) and artificial intelligence-generated content (AIGC) has accelerated AI-native applications, such as AI-based storybooks that automate engaging story production for children. However, challenges remain in improving story attractiveness, enriching storytelling expressiveness, and developing open-source evaluation benchmarks and frameworks. Therefore, we propose and open-source MM-StoryAgent, which creates immersive narrated video storybooks with refined plots, role-consistent images, and multi-channel audio. MM-StoryAgent designs a multi-agent framework that employs LLMs and diverse expert tools (generative models and APIs) across several modalities to produce expressive storytelling videos. The framework enhances story attractiveness through a multi-stage writing pipeline. In addition, it improves the immersive storytelling experience by integrating sound effects with visual, music, and narrative assets. MM-StoryAgent offers a flexible, open-source platform for further development, where generative modules can be substituted. Both objective and subjective evaluations of textual story quality and alignment between modalities validate the effectiveness of our proposed MM-StoryAgent system. The demo and source code are available.




This paper introduces HarmonySet, a comprehensive dataset designed to advance video-music understanding. HarmonySet consists of 48,328 diverse video-music pairs, annotated with detailed information on rhythmic synchronization, emotional alignment, thematic coherence, and cultural relevance. We propose a multi-step human-machine collaborative framework for efficient annotation, combining human insights with machine-generated descriptions to identify key transitions and assess alignment across multiple dimensions. Additionally, we introduce a novel evaluation framework with tasks and metrics to assess the multi-dimensional alignment of video and music, including rhythm, emotion, theme, and cultural context. Our extensive experiments demonstrate that HarmonySet, along with the proposed evaluation framework, significantly improves the ability of multimodal models to capture and analyze the intricate relationships between video and music.




While neural vocoders have made significant progress in high-fidelity speech synthesis, their application to polyphonic music has remained underexplored. In this work, we propose DisCoder, a neural vocoder that leverages a generative adversarial encoder-decoder architecture informed by a neural audio codec to reconstruct high-fidelity 44.1 kHz audio from mel spectrograms. Our approach first transforms the mel spectrogram into a lower-dimensional representation aligned with the Descript Audio Codec (DAC) latent space before reconstructing it into an audio signal using a fine-tuned DAC decoder. DisCoder achieves state-of-the-art performance in music synthesis on several objective metrics and in a MUSHRA listening study. Our approach also shows competitive performance in speech synthesis, highlighting its potential as a universal vocoder.
Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The generated samples are showcased on our project page at https://liuzh-19.github.io/SongGen/ , and the code will be available at https://github.com/LiuZH-19/SongGen .
This study examines pitch contours as a unifying semantic construct prevalent across various audio domains including music, speech, bioacoustics, and everyday sounds. Analyzing pitch contours offers insights into the universal role of pitch in the perceptual processing of audio signals and contributes to a deeper understanding of auditory mechanisms in both humans and animals. Conventional pitch-tracking methods, while optimized for music and speech, face challenges in handling much broader frequency ranges and more rapid pitch variations found in other audio domains. This study introduces a vision-based approach to pitch contour analysis that eliminates the need for explicit pitch-tracking. The approach uses a convolutional neural network, pre-trained for object detection in natural images and fine-tuned with a dataset of synthetically generated pitch contours, to extract key contour parameters from the time-frequency representation of short audio segments. A diverse set of eight downstream tasks from four audio domains were selected to provide a challenging evaluation scenario for cross-domain pitch contour analysis. The results show that the proposed method consistently surpasses traditional techniques based on pitch-tracking on a wide range of tasks. This suggests that the vision-based approach establishes a foundation for comparative studies of pitch contour characteristics across diverse audio domains.
Conditional motion generation has been extensively studied in computer vision, yet two critical challenges remain. First, while masked autoregressive methods have recently outperformed diffusion-based approaches, existing masking models lack a mechanism to prioritize dynamic frames and body parts based on given conditions. Second, existing methods for different conditioning modalities often fail to integrate multiple modalities effectively, limiting control and coherence in generated motion. To address these challenges, we propose Motion Anything, a multimodal motion generation framework that introduces an Attention-based Mask Modeling approach, enabling fine-grained spatial and temporal control over key frames and actions. Our model adaptively encodes multimodal conditions, including text and music, improving controllability. Additionally, we introduce Text-Motion-Dance (TMD), a new motion dataset consisting of 2,153 pairs of text, music, and dance, making it twice the size of AIST++, thereby filling a critical gap in the community. Extensive experiments demonstrate that Motion Anything surpasses state-of-the-art methods across multiple benchmarks, achieving a 15% improvement in FID on HumanML3D and showing consistent performance gains on AIST++ and TMD. See our project website https://steve-zeyu-zhang.github.io/MotionAnything
Contrastive language-audio pre-training (CLAP) has addressed audio-language tasks such as audio-text retrieval by aligning audio and text in a common feature space. While CLAP addresses general audio-language tasks, its audio features do not generalize well in audio tasks. In contrast, self-supervised learning (SSL) models learn general-purpose audio features that perform well in diverse audio tasks. We pursue representation learning that can be widely used in audio applications and hypothesize that a method that learns both general audio features and CLAP features should achieve our goal, which we call a general-purpose audio-language representation. To implement our hypothesis, we propose M2D2, a second-generation masked modeling duo (M2D) that combines an SSL M2D and CLAP. M2D2 learns two types of features using two modalities (audio and text) in a two-stage training process. It also utilizes advanced LLM-based sentence embeddings in CLAP training for powerful semantic supervision. In the first stage, M2D2 learns generalizable audio features from M2D and CLAP, where CLAP aligns the features with the fine LLM-based semantic embeddings. In the second stage, it learns CLAP features using the audio features learned from the LLM-based embeddings. Through these pre-training stages, M2D2 should enhance generalizability and performance in its audio and CLAP features. Experiments validated that M2D2 achieves effective general-purpose audio-language representation, highlighted with SOTA fine-tuning mAP of 49.0 for AudioSet, SOTA performance in music tasks, and top-level performance in audio-language tasks.
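As a rough illustration of what "aligning audio and text in a common feature space" can look like, the following sketch computes a symmetric contrastive loss over a batch of paired embeddings. It is not taken from the M2D2 abstract: the function name `clap_style_loss`, the temperature value, and the use of a symmetric cross-entropy over cosine similarities are all assumptions made for this example.

```python
import numpy as np

def clap_style_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for paired (audio, text) embeddings.

    Hypothetical helper: row i of audio_emb is assumed to be paired
    with row i of text_emb. The temperature is an assumed value.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature  # shape: (batch, batch)

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The matching pair sits on the diagonal.
        return -np.mean(np.diag(log_prob))

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

With perfectly matched embeddings the loss is close to zero, and it grows when the pairing is shuffled, which is the behaviour that pulls matching audio and text together in the shared space.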
We present a novel approach for generating an artificial audio signal that interpolates between given source and target sounds. Our approach relies on the computation of Wasserstein barycenters of the source and target spectrograms, followed by phase reconstruction and inversion. In contrast with previous works, our new method considers the spectrograms globally and does not operate on a temporal frame-to-frame basis. Another contribution is to endow the transportation cost matrix with a specific structure that prohibits remote displacements of energy along the time axis, and for which optimal transport is made possible by leveraging the unbalanced transport framework. The proposed cost matrix makes sense from the audio perspective and also reduces the computational load. Results with synthetic musical notes and real environmental sounds illustrate the potential of our novel approach.
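To make the idea of a structured cost matrix concrete, here is a minimal sketch of one way to forbid remote displacements along the time axis. The abstract does not specify the cost, so everything here is an assumption: the function name `banded_tf_cost`, the squared Euclidean ground cost on (frequency, time) bin indices, and the use of infinite cost to mark forbidden moves.

```python
import numpy as np

def banded_tf_cost(n_freq, n_time, max_shift):
    """Cost matrix between all pairs of time-frequency bins.

    Hypothetical helper: bins are flattened in row-major order
    (index = f * n_time + t). Moves spanning more than `max_shift`
    time frames receive infinite cost, i.e. they are forbidden.
    """
    f, t = np.meshgrid(np.arange(n_freq), np.arange(n_time), indexing="ij")
    f, t = f.ravel(), t.ravel()

    df = f[:, None] - f[None, :]  # frequency displacement per bin pair
    dt = t[:, None] - t[None, :]  # time displacement per bin pair

    # Assumed ground cost: squared Euclidean distance in bin indices.
    cost = (df ** 2 + dt ** 2).astype(float)

    # Prohibit transport of energy across distant time frames.
    cost[np.abs(dt) > max_shift] = np.inf
    return cost
```

Only the entries within the allowed time band are finite, so a solver that skips infinite-cost pairs works on a much smaller problem, which is consistent with the reduced computational load mentioned above. Because some mass may then have no admissible destination, a balanced transport problem can become infeasible, which is one motivation for an unbalanced formulation.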