Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xuefeng Mu

Bridging the Gap Between Semantic and User Preference Spaces for Multi-modal Music Representation Learning

May 29, 2025

Xiaofeng Pan, Jing Chen, Haitong Zhang, Menglin Xing, Jiayi Wei, Xuefeng Mu, Zhongqian Xie

Abstract:Recent works of music representation learning mainly focus on learning acoustic music representations with unlabeled audios or further attempt to acquire multi-modal music representations with scarce annotated audio-text pairs. They either ignore the language semantics or rely on labeled audio datasets that are difficult and expensive to create. Moreover, merely modeling semantic space usually fails to achieve satisfactory performance on music recommendation tasks since the user preference space is ignored. In this paper, we propose a novel Hierarchical Two-stage Contrastive Learning (HTCL) method that models similarity from the semantic perspective to the user perspective hierarchically to learn a comprehensive music representation bridging the gap between semantic and user preference spaces. We devise a scalable audio encoder and leverage a pre-trained BERT model as the text encoder to learn audio-text semantics via large-scale contrastive pre-training. Further, we explore a simple yet effective way to exploit interaction data from our online music platform to adapt the semantic space to user preference space via contrastive fine-tuning, which differs from previous works that follow the idea of collaborative filtering. As a result, we obtain a powerful audio encoder that not only distills language semantics from the text encoder but also models similarity in user preference space with the integrity of semantic space preserved. Experimental results on both music semantic and recommendation tasks confirm the effectiveness of our method.

* ICMR 2025

Via

Access Paper or Ask Questions

CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models

Oct 17, 2024

Shangda Wu, Yashan Wang, Ruibin Yuan, Zhancheng Guo, Xu Tan, Ge Zhang, Monan Zhou, Jing Chen, Xuefeng Mu, Yuejie Gao(+5 more)

Figure 1 for CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models

Figure 2 for CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models

Figure 3 for CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models

Figure 4 for CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models

Abstract:Challenges in managing linguistic diversity and integrating various musical modalities are faced by current music information retrieval systems. These limitations reduce their effectiveness in a global, multimodal music environment. To address these issues, we introduce CLaMP 2, a system compatible with 101 languages that supports both ABC notation (a text-based musical notation format) and MIDI (Musical Instrument Digital Interface) for music information retrieval. CLaMP 2, pre-trained on 1.5 million ABC-MIDI-text triplets, includes a multilingual text encoder and a multimodal music encoder aligned via contrastive learning. By leveraging large language models, we obtain refined and consistent multilingual descriptions at scale, significantly reducing textual noise and balancing language distribution. Our experiments show that CLaMP 2 achieves state-of-the-art results in both multilingual semantic search and music classification across modalities, thus establishing a new standard for inclusive and global music information retrieval.

* 17 pages, 10 figures, 4 tables

Via

Access Paper or Ask Questions