Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Vipul Arora

Learning from Limited Labels: Transductive Graph Label Propagation for Indian Music Analysis

Jan 07, 2026

Parampreet Singh, Akshay Raina, Sayeedul Islam Sheikh, Vipul Arora

Abstract:Supervised machine learning frameworks rely on extensive labeled datasets for robust performance on real-world tasks. However, there is a lack of large annotated datasets in audio and music domains, as annotating such recordings is resource-intensive, laborious, and often require expert domain knowledge. In this work, we explore the use of label propagation (LP), a graph-based semi-supervised learning technique, for automatically labeling the unlabeled set in an unsupervised manner. By constructing a similarity graph over audio embeddings, we propagate limited label information from a small annotated subset to a larger unlabeled corpus in a transductive, semi-supervised setting. We apply this method to two tasks in Indian Art Music (IAM): Raga identification and Instrument classification. For both these tasks, we integrate multiple public datasets along with additional recordings we acquire from Prasar Bharati Archives to perform LP. Our experiments demonstrate that LP significantly reduces labeling overhead and produces higher-quality annotations compared to conventional baseline methods, including those based on pretrained inductive models. These results highlight the potential of graph-based semi-supervised learning to democratize data annotation and accelerate progress in music information retrieval.

* Journal of Acoustical Society of India, Vol. 52, No. 3, pp. 145-154, 2025
* Published at Journal of Acoustical Society of India, 2025

Via

Access Paper or Ask Questions

BEST-STD2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection

Dec 18, 2025

Anup Singh, Kris Demuynck, Vipul Arora

Figure 1 for BEST-STD2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection

Figure 2 for BEST-STD2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection

Figure 3 for BEST-STD2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection

Figure 4 for BEST-STD2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection

Abstract:Fast and accurate spoken content retrieval is vital for applications such as voice search. Query-by-Example Spoken Term Detection (STD) involves retrieving matching segments from an audio database given a spoken query. Token-based STD systems, which use discrete speech representations, enable efficient search but struggle with robustness to noise and reverberation, and with inefficient token utilization. We address these challenges by proposing a noise and reverberation-augmented training strategy to improve tokenizer robustness. In addition, we introduce optimal transport-based regularization to ensure balanced token usage and enhance token efficiency. To further speed up retrieval, we adopt a TF-IDF-based search mechanism. Empirical evaluations demonstrate that the proposed method outperforms STD baselines across various distortion levels while maintaining high search efficiency.

* Submitted to ICASSP 2026

Via

Access Paper or Ask Questions

Regression-based Melody Estimation with Uncertainty Quantification

May 08, 2025

Kavya Ranjan Saxena, Vipul Arora

Abstract:Existing machine learning models approach the task of melody estimation from polyphonic audio as a classification problem by discretizing the pitch values, which results in the loss of finer frequency variations present in the melody. To better capture these variations, we propose to approach this task as a regression problem. Apart from predicting only the pitch for a particular region in the audio, we also predict its uncertainty to enhance the trustworthiness of the model. To perform regression-based melody estimation, we propose three different methods that use histogram representation to model the pitch values. Such a representation requires the support range of the histogram to be continuous. The first two methods address the abrupt discontinuity between unvoiced and voiced frequency ranges by mapping them to a continuous range. The third method reformulates melody estimation as a fully Bayesian task, modeling voicing detection as a classification problem, and voiced pitch estimation as a regression problem. Additionally, we introduce a novel method to estimate the uncertainty from the histogram representation that correlates well with the deviation of the mean of the predicted distribution from the ground truth. Experimental results demonstrate that reformulating melody estimation as a regression problem significantly improves the performance over classification-based approaches. Comparing the proposed methods with a state-of-the-art regression model, it is observed that the Bayesian method performs the best at estimating both the melody and its associated uncertainty.

Via

Access Paper or Ask Questions

Recognizing Ornaments in Vocal Indian Art Music with Active Annotation

May 07, 2025

Sumit Kumar, Parampreet Singh, Vipul Arora

Abstract:Ornamentations, embellishments, or microtonal inflections are essential to melodic expression across many musical traditions, adding depth, nuance, and emotional impact to performances. Recognizing ornamentations in singing voices is key to MIR, with potential applications in music pedagogy, singer identification, genre classification, and controlled singing voice generation. However, the lack of annotated datasets and specialized modeling approaches remains a major obstacle for progress in this research area. In this work, we introduce R\=aga Ornamentation Detection (ROD), a novel dataset comprising Indian classical music recordings curated by expert musicians. The dataset is annotated using a custom Human-in-the-Loop tool for six vocal ornaments marked as event-based labels. Using this dataset, we develop an ornamentation detection model based on deep time-series analysis, preserving ornament boundaries during the chunking of long audio recordings. We conduct experiments using different train-test configurations within the ROD dataset and also evaluate our approach on a separate, manually annotated dataset of Indian classical concert recordings. Our experimental results support the superior performance of our proposed approach over the baseline CRNN.

Via

Access Paper or Ask Questions

Meta-learning-based percussion transcription and $t\bar{a}la$ identification from low-resource audio

Jan 08, 2025

Rahul Bapusaheb Kodag, Vipul Arora

$Figure 1 for Meta-learning-based percussion transcription and $t\bar{a}la$ identification from low-resource audio$

$Figure 2 for Meta-learning-based percussion transcription and $t\bar{a}la$ identification from low-resource audio$

$Figure 3 for Meta-learning-based percussion transcription and $t\bar{a}la$ identification from low-resource audio$

$Figure 4 for Meta-learning-based percussion transcription and $t\bar{a}la$ identification from low-resource audio$

Abstract:This study introduces a meta-learning-based approach for low-resource Tabla Stroke Transcription (TST) and $t\bar{a}la$ identification in Hindustani classical music. Using Model-Agnostic Meta-Learning (MAML), we address the challenge of limited annotated datasets, enabling rapid adaptation to new tasks with minimal data. The method is validated across various datasets, including tabla solo and concert recordings, demonstrating robustness in polyphonic audio scenarios. We propose two novel $t\bar{a}la$ identification techniques based on stroke sequences and rhythmic patterns. Additionally, the approach proves effective for Automatic Drum Transcription (ADT), showcasing its flexibility for Indian and Western percussion music. Experimental results show that the proposed method outperforms existing techniques in low-resource settings, significantly contributing to music transcription and studying musical traditions through computational tools.

* arXiv admin note: substantial text overlap with arXiv:2407.20935

Via

Access Paper or Ask Questions

Dementia Detection using Multi-modal Methods on Audio Data

Dec 31, 2024

Saugat Kannojia, Anirudh Praveen, Danish Vasdev, Saket Nandedkar, Divyansh Mittal, Sarthak Kalankar, Shaurya Johari, Vipul Arora

Figure 1 for Dementia Detection using Multi-modal Methods on Audio Data

Figure 2 for Dementia Detection using Multi-modal Methods on Audio Data

Figure 3 for Dementia Detection using Multi-modal Methods on Audio Data

Abstract:Dementia is a neurodegenerative disease that causes gradual cognitive impairment, which is very common in the world and undergoes a lot of research every year to prevent and cure it. It severely impacts the patient's ability to remember events and communicate clearly, where most variations of it have no known cure, but early detection can help alleviate symptoms before they become worse. One of the main symptoms of dementia is difficulty in expressing ideas through speech. This paper attempts to talk about a model developed to predict the onset of the disease using audio recordings from patients. An ASR-based model was developed that generates transcripts from the audio files using Whisper model and then applies RoBERTa regression model to generate an MMSE score for the patient. This score can be used to predict the extent to which the cognitive ability of a patient has been affected. We use the PROCESS_V1 dataset for this task, which is introduced through the PROCESS Grand Challenge 2025. The model achieved an RMSE score of 2.6911 which is around 10 percent lower than the described baseline.

* 4 pages

Via

Access Paper or Ask Questions

Novel Class Discovery for Open Set Raga Classification

Nov 27, 2024

Parampreet Singh, Adwik Gupta, Vipul Arora

Figure 1 for Novel Class Discovery for Open Set Raga Classification

Figure 2 for Novel Class Discovery for Open Set Raga Classification

Figure 3 for Novel Class Discovery for Open Set Raga Classification

Figure 4 for Novel Class Discovery for Open Set Raga Classification

Abstract:The task of Raga classification in Indian Art Music (IAM) is constrained by the limited availability of labeled datasets, resulting in many Ragas being unrepresented during the training of machine learning models. Traditional Raga classification methods rely on supervised learning, and assume that for a test audio to be classified by a Raga classification model, it must have been represented in the training data, which limits their effectiveness in real-world scenarios where novel, unseen Ragas may appear. To address this limitation, we propose a method based on Novel Class Discovery (NCD) to detect and classify previously unseen Ragas. Our approach utilizes a feature extractor trained in a supervised manner to generate embeddings, which are then employed within a contrastive learning framework for self-supervised training, enabling the identification of previously unseen Raga classes. The results demonstrate that the proposed method can accurately detect audio samples corresponding to these novel Ragas, offering a robust solution for utilizing the vast amount of unlabeled music data available online. This approach reduces the need for manual labeling while expanding the repertoire of recognized Ragas, and other music data in Music Information Retrieval (MIR).

* Under Review at ICASSP-25

Via

Access Paper or Ask Questions

BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection

Nov 21, 2024

Anup Singh, Kris Demuynck, Vipul Arora

Figure 1 for BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection

Figure 2 for BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection

Figure 3 for BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection

Figure 4 for BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection

Abstract:Spoken term detection (STD) is often hindered by reliance on frame-level features and the computationally intensive DTW-based template matching, limiting its practicality. To address these challenges, we propose a novel approach that encodes speech into discrete, speaker-agnostic semantic tokens. This facilitates fast retrieval using text-based search algorithms and effectively handles out-of-vocabulary terms. Our approach focuses on generating consistent token sequences across varying utterances of the same term. We also propose a bidirectional state space modeling within the Mamba encoder, trained in a self-supervised learning framework, to learn contextual frame-level features that are further encoded into discrete tokens. Our analysis shows that our speech tokens exhibit greater speaker invariance than those from existing tokenizers, making them more suitable for STD tasks. Empirical evaluation on LibriSpeech and TIMIT databases indicates that our method outperforms existing STD baselines while being more efficient.

* Submitted to ICASSP 2025

Via

Access Paper or Ask Questions

$T\bar{a}laGen:$ A System for Automatic $T\bar{a}la$ Identification and Generation

Jul 30, 2024

Rahul Bapusaheb Kodag, Himanshu Jindal, Vipul Arora

Abstract:In Hindustani classical music, the tabla plays an important role as a rhythmic backbone and accompaniment. In applications like computer-based music analysis, learning singing, and learning musical instruments, tabla stroke transcription, $t\bar{a}la$ identification, and generation are crucial. This paper proposes a comprehensive system aimed at addressing these challenges. For tabla stroke transcription, we propose a novel approach based on model-agnostic meta-learning (MAML) that facilitates the accurate identification of tabla strokes using minimal data. Leveraging these transcriptions, the system introduces two novel $t\bar{a}la$ identification methods based on the sequence analysis of tabla strokes. \par Furthermore, the paper proposes a framework for $t\bar{a}la$ generation to bridge traditional and modern learning methods. This framework utilizes finite state transducers (FST) and linear time-invariant (LTI) filters to generate $t\bar{a}las$ with real-time tempo control through user interaction, enhancing practice sessions and musical education. Experimental evaluations on tabla solo and concert datasets demonstrate the system's exceptional performance on real-world data and its ability to outperform existing methods. Additionally, the proposed $t\bar{a}la$ identification methods surpass state-of-the-art techniques. The contributions of this paper include a combined approach to tabla stroke transcription, innovative $t\bar{a}la$ identification techniques, and a robust framework for $t\bar{a}la$ generation that handles the rhythmic complexities of Hindustani music.

Via

Access Paper or Ask Questions

Explainable Deep Learning Analysis for Raga Identification in Indian Art Music

Jun 04, 2024

Parampreet Singh, Vipul Arora

Abstract:The task of Raga Identification is a very popular research problem in Music Information Retrieval. Few studies that have explored this task employed various approaches, such as signal processing, Machine Learning (ML) methods, and more recently Deep Learning (DL) based methods. However, a key question remains unanswered in all of these works: do these ML/DL methods learn and interpret Ragas in a manner similar to human experts? Besides, a significant roadblock in this research is the unavailability of ample supply of rich, labeled datasets, which drives these ML/DL based methods. In this paper, we introduce "Prasarbharti Indian Music" version-1 (PIM-v1), a novel dataset comprising of 191 hours of meticulously labeled Hindustani Classical Music (HCM) recordings, which is the largest labeled dataset for HCM recordings to the best of our knowledge. Our approach involves conducting ablation studies to find the benchmark classification model for Automatic Raga Identification (ARI) using PIM-v1 dataset. We achieve a chunk-wise f1-score of 0.89 for a subset of 12 Raga classes. Subsequently, we employ model explainability techniques to evaluate the classifier's predictions, aiming to ascertain whether they align with human understanding of Ragas or are driven by arbitrary patterns. We validate the correctness of model's predictions by comparing the explanations given by two ExAI models with human expert annotations. Following this, we analyze explanations for individual test examples to understand the role of regions highlighted by explanations in correct or incorrect predictions made by the model.

Via

Access Paper or Ask Questions