Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Emmanouil Benetos

QMUL

Acoustic identification of individual animals with hierarchical contrastive learning

Sep 13, 2024

Ines Nolasco, Ilyass Moummad, Dan Stowell, Emmanouil Benetos

Figure 1 for Acoustic identification of individual animals with hierarchical contrastive learning

Figure 2 for Acoustic identification of individual animals with hierarchical contrastive learning

Figure 3 for Acoustic identification of individual animals with hierarchical contrastive learning

Figure 4 for Acoustic identification of individual animals with hierarchical contrastive learning

Abstract:Acoustic identification of individual animals (AIID) is closely related to audio-based species classification but requires a finer level of detail to distinguish between individual animals within the same species. In this work, we frame AIID as a hierarchical multi-label classification task and propose the use of hierarchy-aware loss functions to learn robust representations of individual identities that maintain the hierarchical relationships among species and taxa. Our results demonstrate that hierarchical embeddings not only enhance identification accuracy at the individual level but also at higher taxonomic levels, effectively preserving the hierarchical structure in the learned representations. By comparing our approach with non-hierarchical models, we highlight the advantage of enforcing this structure in the embedding space. Additionally, we extend the evaluation to the classification of novel individual classes, demonstrating the potential of our method in open-set classification scenarios.

* Under review; Submitted to ICASSP 2025

Via

Access Paper or Ask Questions

Foundation Models for Music: A Survey

Aug 27, 2024

Yinghao Ma, Anders Øland, Anton Ragni, Bleiz MacSen Del Sette, Charalampos Saitis, Chris Donahue, Chenghua Lin, Christos Plachouras, Emmanouil Benetos, Elio Quinton(+33 more)

Figure 1 for Foundation Models for Music: A Survey

Figure 2 for Foundation Models for Music: A Survey

Figure 3 for Foundation Models for Music: A Survey

Figure 4 for Foundation Models for Music: A Survey

Abstract:In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning from representation learning, generative learning and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we discover many of the music representations are underexplored in FM development. Then, emphasis is placed on the lack of versatility of previous methods on diverse music applications, along with the potential of FMs in music understanding, generation and medical application. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies and controllability, we emphasise the important topics that should have been well explored, like instruction tuning and in-context learning, scaling law and emergent ability, as well as long-sequence modelling etc. A dedicated section presents insights into music agents, accompanied by a thorough analysis of datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that following research on FM for music should focus more on such issues as interpretability, transparency, human responsibility, and copyright issues. The paper offers insights into future challenges and trends on FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.

Via

Access Paper or Ask Questions

MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

Aug 02, 2024

Benno Weck, Ilaria Manco, Emmanouil Benetos, Elio Quinton, George Fazekas, Dmitry Bogdanov

Figure 1 for MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

Figure 2 for MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

Figure 3 for MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

Figure 4 for MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

Abstract:Multimodal models that jointly process audio and language hold great promise in audio understanding and are increasingly being adopted in the music domain. By allowing users to query via text and obtain information about a given audio input, these models have the potential to enable a variety of music understanding tasks via language-based interfaces. However, their evaluation poses considerable challenges, and it remains unclear how to effectively assess their ability to correctly interpret music-related inputs with current methods. Motivated by this, we introduce MuChoMusic, a benchmark for evaluating music understanding in multimodal language models focused on audio. MuChoMusic comprises 1,187 multiple-choice questions, all validated by human annotators, on 644 music tracks sourced from two publicly available music datasets, and covering a wide variety of genres. Questions in the benchmark are crafted to assess knowledge and reasoning abilities across several dimensions that cover fundamental musical concepts and their relation to cultural and functional contexts. Through the holistic analysis afforded by the benchmark, we evaluate five open-source models and identify several pitfalls, including an over-reliance on the language modality, pointing to a need for better multimodal integration. Data and code are open-sourced.

* Accepted at ISMIR 2024. Data: https://doi.org/10.5281/zenodo.12709974 Code: https://github.com/mulab-mir/muchomusic Supplementary material: https://mulab-mir.github.io/muchomusic

Via

Access Paper or Ask Questions

Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation

Jul 31, 2024

Ziya Zhou, Yuhang Wu, Zhiyue Wu, Xinyue Zhang, Ruibin Yuan, Yinghao Ma, Lu Wang, Emmanouil Benetos, Wei Xue, Yike Guo

Figure 1 for Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation

Figure 2 for Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation

Figure 3 for Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation

Figure 4 for Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation

Abstract:Symbolic Music, akin to language, can be encoded in discrete symbols. Recent research has extended the application of large language models (LLMs) such as GPT-4 and Llama2 to the symbolic music domain including understanding and generation. Yet scant research explores the details of how these LLMs perform on advanced music understanding and conditioned generation, especially from the multi-step reasoning perspective, which is a critical aspect in the conditioned, editable, and interactive human-computer co-creation process. This study conducts a thorough investigation of LLMs' capability and limitations in symbolic music processing. We identify that current LLMs exhibit poor performance in song-level multi-step music reasoning, and typically fail to leverage learned music knowledge when addressing complex musical tasks. An analysis of LLMs' responses highlights distinctly their pros and cons. Our findings suggest achieving advanced musical capability is not intrinsically obtained by LLMs, and future research should focus more on bridging the gap between music knowledge and reasoning, to improve the co-creation experience for musicians.

* Accepted by ISMIR2024

Via

Access Paper or Ask Questions

YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation

Jul 05, 2024

Sungkyun Chang, Emmanouil Benetos, Holger Kirchhoff, Simon Dixon

Figure 1 for YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation

Figure 2 for YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation

Figure 3 for YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation

Figure 4 for YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation

Abstract:Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument. This task is challenging for modeling as it requires simultaneously identifying multiple instruments and transcribing their pitch and precise timing, and the lack of fully annotated data adds to the training difficulties. This paper introduces YourMT3+, a suite of models for enhanced multi-instrument music transcription based on the recent language token decoding approach of MT3. We strengthen its encoder by adopting a hierarchical attention transformer in the time-frequency domain and integrating a mixture of experts (MoE). To address data limitations, we introduce a new multi-channel decoding method for training with incomplete annotations and propose intra- and cross-stem augmentation for dataset mixing. Our experiments demonstrate direct vocal transcription capabilities, eliminating the need for voice separation pre-processors. Benchmarks across ten public datasets show our models' competitiveness with, or superiority to, existing transcription models. Further testing on pop music recordings highlights the limitations of current models. Fully reproducible code and datasets are available at \url{https://github.com/mimbres/YourMT3}

* Preprint submitted to IEEE MLSP 2024

Via

Access Paper or Ask Questions

Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model

Jun 25, 2024

Jiawen Huang, Emmanouil Benetos

Abstract:Multilingual automatic lyrics transcription (ALT) is a challenging task due to the limited availability of labelled data and the challenges introduced by singing, compared to multilingual automatic speech recognition. Although some multilingual singing datasets have been released recently, English continues to dominate these collections. Multilingual ALT remains underexplored due to the scale of data and annotation quality. In this paper, we aim to create a multilingual ALT system with available datasets. Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario by expanding the target vocabulary set. We then evaluate the performance of the multilingual model in comparison to its monolingual counterparts. Additionally, we explore various conditioning methods to incorporate language information into the model. We apply analysis by language and combine it with the language classification performance. Our findings reveal that the multilingual model performs consistently better than the monolingual models trained on the language subsets. Furthermore, we demonstrate that incorporating language information significantly enhances performance.

* Accepted at EUSIPCO 2024

Via

Access Paper or Ask Questions

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

May 29, 2024

Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, Esther Cheng, Jie Liu, Qunshu Lin(+35 more)

Figure 1 for MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

Figure 2 for MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

Figure 3 for MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

Figure 4 for MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

Abstract:Large Language Models (LLMs) have made great strides in recent years to achieve unprecedented performance across different tasks. However, due to commercial interest, the most competitive models like GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosing the training details. Recently, many institutions have open-sourced several strong LLMs like LLaMA-3, comparable to existing closed-source LLMs. However, only the model's weights are provided with most details (e.g., intermediate checkpoints, pre-training corpus, and training code, etc.) being undisclosed. To improve the transparency of LLMs, the research community has formed to open-source truly open LLMs (e.g., Pythia, Amber, OLMo), where more details (e.g., pre-training corpus and training code) are being provided. These models have greatly advanced the scientific study of these large models including their strengths, weaknesses, biases and risks. However, we observe that the existing truly open LLMs on reasoning, knowledge, and coding tasks are still inferior to existing state-of-the-art LLMs with similar model sizes. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens. Our MAP-Neo is the first fully open-sourced bilingual LLM with comparable performance compared to existing state-of-the-art LLMs. Moreover, we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided. Finally, we hope our MAP-Neo will enhance and strengthen the open research community and inspire more innovations and creativities to facilitate the further improvements of LLMs.

* https://map-neo.github.io/

Via

Access Paper or Ask Questions

Explaining models relating objects and privacy

May 02, 2024

Alessio Xompero, Myriam Bontonou, Jean-Michel Arbona, Emmanouil Benetos, Andrea Cavallaro

Figure 1 for Explaining models relating objects and privacy

Figure 2 for Explaining models relating objects and privacy

Figure 3 for Explaining models relating objects and privacy

Figure 4 for Explaining models relating objects and privacy

Abstract:Accurately predicting whether an image is private before sharing it online is difficult due to the vast variety of content and the subjective nature of privacy itself. In this paper, we evaluate privacy models that use objects extracted from an image to determine why the image is predicted as private. To explain the decision of these models, we use feature-attribution to identify and quantify which objects (and which of their features) are more relevant to privacy classification with respect to a reference input (i.e., no objects localised in an image) predicted as public. We show that the presence of the person category and its cardinality is the main factor for the privacy decision. Therefore, these models mostly fail to identify private images depicting documents with sensitive data, vehicle ownership, and internet activity, or public images with people (e.g., an outdoor concert or people walking in a public space next to a famous landmark). As baselines for future benchmarks, we also devise two strategies that are based on the person presence and cardinality and achieve comparable classification performance of the privacy models.

* 7 pages, 3 figures, 1 table, supplementary material included as Appendix. Paper accepted at the 3rd XAI4CV Workshop at CVPR 2024. Code: https://github.com/graphnex/ig-privacy

Via

Access Paper or Ask Questions

ComposerX: Multi-Agent Symbolic Music Composition with LLMs

Apr 30, 2024

Qixin Deng, Qikai Yang, Ruibin Yuan, Yipeng Huang, Yi Wang, Xubo Liu, Zeyue Tian, Jiahao Pan, Ge Zhang, Hanfeng Lin(+9 more)

Figure 1 for ComposerX: Multi-Agent Symbolic Music Composition with LLMs

Figure 2 for ComposerX: Multi-Agent Symbolic Music Composition with LLMs

Figure 3 for ComposerX: Multi-Agent Symbolic Music Composition with LLMs

Figure 4 for ComposerX: Multi-Agent Symbolic Music Composition with LLMs

Abstract:Music composition represents the creative side of humanity, and itself is a complex task that requires abilities to understand and generate information with long dependency and harmony constraints. While demonstrating impressive capabilities in STEM subjects, current LLMs easily fail in this task, generating ill-written music even when equipped with modern techniques like In-Context-Learning and Chain-of-Thoughts. To further explore and enhance LLMs' potential in music composition by leveraging their reasoning ability and the large knowledge base in music history and theory, we propose ComposerX, an agent-based symbolic music generation framework. We find that applying a multi-agent approach significantly improves the music composition quality of GPT-4. The results demonstrate that ComposerX is capable of producing coherent polyphonic music compositions with captivating melodies, while adhering to user instructions.

Via

Access Paper or Ask Questions

MuPT: A Generative Symbolic Music Pretrained Transformer

Apr 10, 2024

Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, Lejun Min, Xueling Liu, Tianyu Zhang(+19 more)

Figure 1 for MuPT: A Generative Symbolic Music Pretrained Transformer

Figure 2 for MuPT: A Generative Symbolic Music Pretrained Transformer

Figure 3 for MuPT: A Generative Symbolic Music Pretrained Transformer

Figure 4 for MuPT: A Generative Symbolic Music Pretrained Transformer

Abstract:In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising direction for future research in music generation, offering extensive resources for community-led research through our open-source contributions.

Via

Access Paper or Ask Questions