Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Topic:Song Identification

PolyLingua: Margin-based Inter-class Transformer for Robust Cross-domain Language Detection

Dec 10, 2025

Ali Lotfi Rezaabad, Bikram Khanal, Shashwat Chaurasia, Lu Zeng, Dezhi Hong, Hossein Bashashati, Thomas Butler, Megan Ganji

Abstract:Language identification is a crucial first step in multilingual systems such as chatbots and virtual assistants, enabling linguistically and culturally accurate user experiences. Errors at this stage can cascade into downstream failures, setting a high bar for accuracy. Yet, existing language identification tools struggle with key cases -- such as music requests where the song title and user language differ. Open-source tools like LangDetect, FastText are fast but less accurate, while large language models, though effective, are often too costly for low-latency or low-resource settings. We introduce PolyLingua, a lightweight Transformer-based model for in-domain language detection and fine-grained language classification. It employs a two-level contrastive learning framework combining instance-level separation and class-level alignment with adaptive margins, yielding compact and well-separated embeddings even for closely related languages. Evaluated on two challenging datasets -- Amazon Massive (multilingual digital assistant utterances) and a Song dataset (music requests with frequent code-switching) -- PolyLingua achieves 99.25% F1 and 98.15% F1, respectively, surpassing Sonnet 3.5 while using 10x fewer parameters, making it ideal for compute- and latency-constrained environments.

Via

Access Paper or Ask Questions

Training a Perceptual Model for Evaluating Auditory Similarity in Music Adversarial Attack

Sep 05, 2025

Yuxuan Liu, Rui Sang, Peihong Zhang, Zhixin Li, Shengchen Li

Figure 1 for Training a Perceptual Model for Evaluating Auditory Similarity in Music Adversarial Attack

Figure 2 for Training a Perceptual Model for Evaluating Auditory Similarity in Music Adversarial Attack

Figure 3 for Training a Perceptual Model for Evaluating Auditory Similarity in Music Adversarial Attack

Figure 4 for Training a Perceptual Model for Evaluating Auditory Similarity in Music Adversarial Attack

Abstract:Music Information Retrieval (MIR) systems are highly vulnerable to adversarial attacks that are often imperceptible to humans, primarily due to a misalignment between model feature spaces and human auditory perception. Existing defenses and perceptual metrics frequently fail to adequately capture these auditory nuances, a limitation supported by our initial listening tests showing low correlation between common metrics and human judgments. To bridge this gap, we introduce Perceptually-Aligned MERT Transformer (PAMT), a novel framework for learning robust, perceptually-aligned music representations. Our core innovation lies in the psychoacoustically-conditioned sequential contrastive transformer, a lightweight projection head built atop a frozen MERT encoder. PAMT achieves a Spearman correlation coefficient of 0.65 with subjective scores, outperforming existing perceptual metrics. Our approach also achieves an average of 9.15\% improvement in robust accuracy on challenging MIR tasks, including Cover Song Identification and Music Genre Classification, under diverse perceptual adversarial attacks. This work pioneers architecturally-integrated psychoacoustic conditioning, yielding representations significantly more aligned with human perception and robust against music adversarial attacks.

Via

Access Paper or Ask Questions

Contrastive and Transfer Learning for Effective Audio Fingerprinting through a Real-World Evaluation Protocol

Jul 08, 2025

Christos Nikou, Theodoros Giannakopoulos

Abstract:Recent advances in song identification leverage deep neural networks to learn compact audio fingerprints directly from raw waveforms. While these methods perform well under controlled conditions, their accuracy drops significantly in real-world scenarios where the audio is captured via mobile devices in noisy environments. In this paper, we introduce a novel evaluation protocol designed to better reflect such real-world conditions. We generate three recordings of the same audio, each with increasing levels of noise, captured using a mobile device's microphone. Our results reveal a substantial performance drop for two state-of-the-art CNN-based models under this protocol, compared to previously reported benchmarks. Additionally, we highlight the critical role of the augmentation pipeline during training with contrastive loss. By introduction low pass and high pass filters in the augmentation pipeline we significantly increase the performance of both systems in our proposed evaluation. Furthermore, we develop a transformer-based model with a tailored projection module and demonstrate that transferring knowledge from a semantically relevant domain yields a more robust solution. The transformer architecture outperforms CNN-based models across all noise levels, and query durations. In low noise conditions it achieves 47.99% for 1-sec queries, and 97% for 10-sec queries in finding the correct song, surpassing by 14%, and by 18.5% the second-best performing model, respectively, Under heavy noise levels, we achieve a detection rate 56.5% for 15-second query duration. All experiments are conducted on public large-scale dataset of over 100K songs, with queries matched against a database of 56 million vectors.

* IJMSTA - Vol. 7 - Issue 1 - January 2025
* International Journal of Music Science, Technology and Art, 15 pages, 7 figures

Via

Access Paper or Ask Questions

Multi-task Learning for Identification of Porcelain in Song and Yuan Dynasties

Mar 18, 2025

Ziyao Ling, Giovanni Delnevo, Paola Salomoni, Silvia Mirri

Figure 1 for Multi-task Learning for Identification of Porcelain in Song and Yuan Dynasties

Figure 2 for Multi-task Learning for Identification of Porcelain in Song and Yuan Dynasties

Figure 3 for Multi-task Learning for Identification of Porcelain in Song and Yuan Dynasties

Figure 4 for Multi-task Learning for Identification of Porcelain in Song and Yuan Dynasties

Abstract:Chinese porcelain holds immense historical and cultural value, making its accurate classification essential for archaeological research and cultural heritage preservation. Traditional classification methods rely heavily on expert analysis, which is time-consuming, subjective, and difficult to scale. This paper explores the application of DL and transfer learning techniques to automate the classification of porcelain artifacts across four key attributes: dynasty, glaze, ware, and type. We evaluate four Convolutional Neural Networks (CNNs) - ResNet50, MobileNetV2, VGG16, and InceptionV3 - comparing their performance with and without pre-trained weights. Our results demonstrate that transfer learning significantly enhances classification accuracy, particularly for complex tasks like type classification, where models trained from scratch exhibit lower performance. MobileNetV2 and ResNet50 consistently achieve high accuracy and robustness across all tasks, while VGG16 struggles with more diverse classifications. We further discuss the impact of dataset limitations and propose future directions, including domain-specific pre-training, integration of attention mechanisms, explainable AI methods, and generalization to other cultural artifacts.

Via

Access Paper or Ask Questions

A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana

Mar 19, 2025

Alba Márquez-Rodríguez, Miguel Ángel Mohedano-Munoz, Manuel J. Marín-Jiménez, Eduardo Santamaría-García, Giulia Bastianelli, Pedro Jordano, Irene Mendoza

Abstract:Passive Acoustic Monitoring with automatic recorders is essential for ecosystem conservation but generates vast unsupervised audio data, posing challenges for extracting meaningful information. Deep Learning techniques offer a promising solution. BirdNET, a widely used model for bird identification, has shown success in many study systems but is limited in some regions due to biases in its training data. A key challenge in bird species detection is that many recordings either lack target species or contain overlapping vocalizations. To overcome these problems, we developed a multi-stage pipeline for automatic bird vocalization identification in Do\~nana National Park (SW Spain), a region facing significant conservation threats. Our approach included a Bird Song Detector to isolate vocalizations and custom classifiers trained with BirdNET embeddings. We manually annotated 461 minutes of audio from three habitats across nine locations, yielding 3,749 annotations for 34 classes. Spectrograms facilitated the use of image processing techniques. Applying the Bird Song Detector before classification improved species identification, as all classification models performed better when analyzing only the segments where birds were detected. Specifically, the combination of the Bird Song Detector and fine-tuned BirdNET compared to the baseline without the Bird Song Detector. Our approach demonstrated the effectiveness of integrating a Bird Song Detector with fine-tuned classification models for bird identification at local soundscapes. These findings highlight the need to adapt general-purpose tools for specific ecological challenges, as demonstrated in Do\~nana. Automatically detecting bird species serves for tracking the health status of this threatened ecosystem, given the sensitivity of birds to environmental changes, and helps in the design of conservation measures for reducing biodiversity loss

* 20 pages, 13 images, for associated dataset see https://huggingface.co/datasets/GrunCrow/BIRDeep_AudioAnnotations , for associated code see https://github.com/GrunCrow/BIRDeep_BirdSongDetector_NeuralNetworks and https://github.com/GrunCrow/Bird-Song-Detector

Via

Access Paper or Ask Questions

Automatic Live Music Song Identification Using Multi-level Deep Sequence Similarity Learning

Jan 14, 2025

Aapo Hakala, Trevor Kincy, Tuomas Virtanen

Figure 1 for Automatic Live Music Song Identification Using Multi-level Deep Sequence Similarity Learning

Figure 2 for Automatic Live Music Song Identification Using Multi-level Deep Sequence Similarity Learning

Figure 3 for Automatic Live Music Song Identification Using Multi-level Deep Sequence Similarity Learning

Figure 4 for Automatic Live Music Song Identification Using Multi-level Deep Sequence Similarity Learning

Abstract:This paper studies the novel problem of automatic live music song identification, where the goal is, given a live recording of a song, to retrieve the corresponding studio version of the song from a music database. We propose a system based on similarity learning and a Siamese convolutional neural network-based model. The model uses cross-similarity matrices of multi-level deep sequences to measure musical similarity between different audio tracks. A manually collected custom live music dataset is used to test the performance of the system with live music. The results of the experiments show that the system is able to identify 87.4% of the given live music queries.

Via

Access Paper or Ask Questions

Leveraging User-Generated Metadata of Online Videos for Cover Song Identification

Dec 16, 2024

Simon Hachmeier, Robert Jäschke

Abstract:YouTube is a rich source of cover songs. Since the platform itself is organized in terms of videos rather than songs, the retrieval of covers is not trivial. The field of cover song identification addresses this problem and provides approaches that usually rely on audio content. However, including the user-generated video metadata available on YouTube promises improved identification results. In this paper, we propose a multi-modal approach for cover song identification on online video platforms. We combine the entity resolution models with audio-based approaches using a ranking model. Our findings implicate that leveraging user-generated metadata can stabilize cover song identification performance on YouTube.

* accepted for presentation at NLP for Music and Audio (NLP4MusA) 2024

Via

Access Paper or Ask Questions

On the Robustness of Cover Version Identification Models: A Study Using Cover Versions from YouTube

Jan 02, 2025

Simon Hachmeier, Robert Jäschke

Abstract:Recent advances in cover song identification have shown great success. However, models are usually tested on a fixed set of datasets which are relying on the online cover song database SecondHandSongs. It is unclear how well models perform on cover songs on online video platforms, which might exhibit alterations that are not expected. In this paper, we annotate a subset of songs from YouTube sampled by a multi-modal uncertainty sampling approach and evaluate state-of-the-art models. We find that existing models achieve significantly lower ranking performance on our dataset compared to a community dataset. We additionally measure the performance of different types of versions (e.g., instrumental versions) and find several types that are particularly hard to rank. Lastly, we provide a taxonomy of alterations in cover versions on the web.

* accepted for presentation at iConference 2025

Via

Access Paper or Ask Questions

Automatic Identification of Samples in Hip-Hop Music via Multi-Loss Training and an Artificial Dataset

Feb 10, 2025

Huw Cheston, Jan Van Balen, Simon Durand

Abstract:Sampling, the practice of reusing recorded music or sounds from another source in a new work, is common in popular music genres like hip-hop and rap. Numerous services have emerged that allow users to identify connections between samples and the songs that incorporate them, with the goal of enhancing music discovery. Designing a system that can perform the same task automatically is challenging, as samples are commonly altered with audio effects like pitch- and time-stretching and may only be seconds long. Progress on this task has been minimal and is further blocked by the limited availability of training data. Here, we show that a convolutional neural network trained on an artificial dataset can identify real-world samples in commercial hip-hop music. We extract vocal, harmonic, and percussive elements from several databases of non-commercial music recordings using audio source separation, and train the model to fingerprint a subset of these elements in transformed versions of the original audio. We optimize the model using a joint classification and metric learning loss and show that it achieves 13% greater precision on real-world instances of sampling than a fingerprinting system using acoustic landmarks, and that it can recognize samples that have been both pitch shifted and time stretched. We also show that, for half of the commercial music recordings we tested, our model is capable of locating the position of a sample to within five seconds.

* 17 pages, 6 figures

Via

Access Paper or Ask Questions

Innovations in Cover Song Detection: A Lyrics-Based Approach

Jun 06, 2024

Maximilian Balluff, Peter Mandl, Christian Wolff

Figure 1 for Innovations in Cover Song Detection: A Lyrics-Based Approach

Figure 2 for Innovations in Cover Song Detection: A Lyrics-Based Approach

Figure 3 for Innovations in Cover Song Detection: A Lyrics-Based Approach

Figure 4 for Innovations in Cover Song Detection: A Lyrics-Based Approach

Abstract:Cover songs are alternate versions of a song by a different artist. Long being a vital part of the music industry, cover songs significantly influence music culture and are commonly heard in public venues. The rise of online music platforms has further increased their prevalence, often as background music or video soundtracks. While current automatic identification methods serve adequately for original songs, they are less effective with cover songs, primarily because cover versions often significantly deviate from the original compositions. In this paper, we propose a novel method for cover song detection that utilizes the lyrics of a song. We introduce a new dataset for cover songs and their corresponding originals. The dataset contains 5078 cover songs and 2828 original songs. In contrast to other cover song datasets, it contains the annotated lyrics for the original song and the cover song. We evaluate our method on this dataset and compare it with multiple baseline approaches. Our results show that our method outperforms the baseline approaches.

* 6 pages, 3 figures

Via

Access Paper or Ask Questions

Topic:Song Identification

Papers and Code