Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics and provide weak geometric constraints for cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models relative order within each mini-batch and better aligns with evaluation based on Spearman's rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to target audio-text similarity and regularizes the latent space before fusion. Experiments on MusicEval show that, by mitigating the point-wise training mismatch and modality drift, our decoupled framework yields substantial improvements in both MI and TA ranking metrics, establishing a robust paradigm for large-scale TTM evaluation.
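To make the mismatch between point-wise training and rank-based evaluation concrete, the sketch below computes SRCC and a listwise loss for one mini-batch. The abstract does not specify the form of the batch-aware ranking loss, so a ListNet-style top-one cross-entropy is used here purely as an illustrative stand-in; the function names and the toy scores are hypothetical.

```python
import math

def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def srcc(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors.
    Assumes neither input is constant."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def listnet_loss(pred, mos):
    """ListNet-style top-one cross-entropy over one mini-batch
    (an assumed stand-in, not the loss proposed in the abstract)."""
    p_true = softmax(mos)
    p_pred = softmax(pred)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))

# Hypothetical mini-batch: human MOS and two sets of model outputs.
mos = [1.8, 3.2, 4.5, 2.6]
well_ordered = [0.1, 0.6, 0.9, 0.4]   # same ordering as the MOS
mis_ordered = [0.9, 0.4, 0.1, 0.6]    # reversed ordering

print(srcc(well_ordered, mos), listnet_loss(well_ordered, mos))
print(srcc(mis_ordered, mos), listnet_loss(mis_ordered, mos))
```

Unlike a point-wise squared error, this loss depends on how scores compare across the batch, which is the property a listwise objective relies on to track SRCC more closely.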
Harmony is a compact symbolic layer where mathematical pitch relations, acoustic consonance, and musical convention meet. This report treats chord-symbol sequences not as a complete representation of music, but as an interpretable, controllable time series for genre-local harmonic modeling. Starting from a frozen pop-jazz Music Transformer checkpoint, I evaluate how far small adaptation interfaces can extend the model to eleven target genres: blues, bossa nova, Bach chorales, country, electronic, folk, funk, gospel, hip-hop, R&B/soul, and rock. The main evaluation compares LoRA, IA3, BitFit, prefix tuning, and full fine-tuning over 11 genres and 3 seeds, a complete 165-cell grid. All five methods improve over the frozen base on held-out chord prediction, with macro gains from +2.89 to +3.61 points; LoRA and IA3 score highest, but Wilcoxon tests with Holm and Benjamini-Hochberg correction do not support a decisive winner. A matched-data-size control sharpens this: when genres are sub-sampled to a common corpus size, IA3 stays on top but LoRA's full-data edge disappears and it falls to last, indicating the small gaps are partly data-driven. A control-token baseline is also strong, and wrong-genre adapters often beat the frozen base, suggesting much of the effect comes from lightweight conditioning over a reusable harmonic base rather than one particular adapter family. Additional diagnostics (rank sweeps, wrong-genre rotation, a base-checkpoint ablation, chord-only genre classification, generated-output statistics, real-song evaluation, and duplicate analysis) support a bounded conclusion: chord-symbol adaptation reliably improves genre-local harmonic prediction, but chord symbols alone do not carry complete genre identity. The report therefore avoids claims about perceived genre authenticity or full musical quality, which require controlled listener or musician evaluation.
As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI's utility has expanded across the entire music production workflow. Beyond simple track generation, these advancements have catalyzed the adoption of AI-driven methods for vocal synthesis, arrangement, and professional mastering. However, current detection research remains largely confined to a binary "AI-or-human" paradigm that fails to reflect the realities of contemporary music production. In real-world production, AI tools are increasingly used to refine or master human-produced tracks, and human engineers likewise post-process AI-generated material to ensure professional quality. Moreover, users often employ adversarial tactics to bypass AI detectors, such as applying human mastering to AI-generated tracks. This creates a grey area that simple binary classification cannot capture. In this paper, we define and investigate "AI Music Tracking": the challenge of identifying specific AI integration across the multifaceted spectrum of music production. To this end, we introduce HAIM, a dataset with diverse labels for stages of music production. It is designed to isolate stages of AI intervention, including hybrid production and agent-level tracking. Our evaluation of state-of-the-art detectors reveals systemic flaws. By releasing HAIM, we propose a new benchmark that shifts the field beyond binary classification toward a granular, structured evaluation of AI music.
This paper presents novel algorithms for multi-target direction-of-arrival (DoA) estimation in array signal processing. Although the maximum likelihood estimator (MLE) asymptotically attains the Cramér-Rao bound, its exponential complexity motivates practical alternatives, such as greedy or subspace-based methods. Greedy methods such as orthogonal matching pursuit (OMP) and orthogonal least squares (OLS) are sensitive to early selection errors, especially for angularly proximate targets, whereas subspace-based methods such as multiple signal classification (MUSIC) offer angular super-resolution but degrade under strong inter-target signal correlation. To overcome these limitations, we propose two greedy iterative MUSIC (G-iMUSIC) algorithms, namely OMP-iMUSIC and OLS-iMUSIC, derived from a unified framework that links subspace and greedy estimation. Unlike prior iMUSIC approaches, the proposed methods require only one initial eigenvalue decomposition (EVD) and avoid recomputing an eigendecomposition at each iteration. They also admit fast Fourier transform (FFT)-accelerated implementations for uniform linear arrays (ULAs), enabling low-complexity operation. Monte Carlo simulations demonstrate improved detection and precision over conventional OMP, OLS, and MUSIC, as well as reduced processing time compared to greedy baselines. Finally, we introduce diagnostic metrics that interpret performance across signal correlation and angular proximity regimes, supporting generalization beyond the specific orthogonal frequency-division multiplexing (OFDM) radar scenario considered.
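As a rough illustration of the greedy selection step that OMP-type DoA estimators build on, the sketch below correlates a single array snapshot with ULA steering vectors over an angle grid and keeps the best match. It is not the proposed G-iMUSIC method: it assumes half-wavelength element spacing, one noiseless snapshot, and one target, and the function names are hypothetical.

```python
import cmath
import math

def steering(theta_deg, n_antennas):
    """ULA steering vector for angle theta_deg, assuming half-wavelength spacing."""
    phase = math.pi * math.sin(math.radians(theta_deg))
    return [cmath.exp(1j * phase * n) for n in range(n_antennas)]

def greedy_select(snapshot, grid_deg):
    """First OMP iteration: pick the grid angle whose steering vector
    has the largest correlation magnitude with the snapshot."""
    def score(theta):
        a = steering(theta, len(snapshot))
        return abs(sum(ai.conjugate() * yi for ai, yi in zip(a, snapshot)))
    return max(grid_deg, key=score)

# Toy scenario: 8 antennas, one noiseless target at 20 degrees, 1-degree grid.
grid = list(range(-90, 91))
snapshot = steering(20, 8)
estimate = greedy_select(snapshot, grid)
print(estimate)
```

With two closely spaced targets, the correlation peak in this first step can land between them, which is the early selection error the abstract attributes to greedy methods.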
In this letter, we investigate the direction-of-arrival (DOA) estimation problem for wireless sensing with movable antenna (MA) systems in the presence of unknown antenna position errors (APE). To achieve robust wireless sensing, we transform the DOA estimation problem with APE into an optimization problem by exploiting the orthogonality between the steering vector and the noise subspace. We then propose an alternating optimization (AO)-based self-calibration algorithm, which consists of two stages and iteratively estimates the APE and the DOA. Specifically, in the first stage, the APE is fixed, so the problem reduces to the classical DOA estimation problem, which is solved using the multiple signal classification (MUSIC) algorithm. In the second stage, we fix the DOA to estimate the APE. By applying the Lagrange multiplier technique to this subproblem, we obtain a closed-form expression for the APE estimate. Simulation results demonstrate the superior DOA estimation performance of the proposed self-calibration algorithm for MA systems compared with existing approaches.
High-mobility uncrewed aerial vehicle (UAV) communications in low-altitude wireless networks (LAWN) demand reliable beamforming, while conventional feedback-based schemes suffer from excessive overhead and severe misalignment under rapid trajectory variations. To address this challenge, this paper proposes an SSB-based sensing-assisted predictive robust beamforming framework that replaces explicit channel state information (CSI) feedback with sensing-driven state estimation and uncertainty-aware optimization. Leveraging the periodic 'always-on' synchronization signal block (SSB), a hierarchical sensing algorithm tailored for hybrid digital-analog uniform planar arrays is developed, combining 2D range-velocity profiling and augmented beamspace multiple signal classification (MUSIC). By integrating a locally-focused analog receive beamformer, the proposed sensing design can ensure energy accumulates across different radio-frequency (RF) chains while resolving angular ambiguity. An extended Kalman filter (EKF) is further employed to track UAV states between sparse synchronization-signal (SS) bursts, and a covariance correction is introduced to characterize maneuver-induced prediction uncertainties. Based on the derived statistical distributions of range and angular parameters, the communication channel is modeled through predictive correlation matrices rather than instantaneous CSI, leading to a multi-user robust beamforming formulation that maximizes average network sum-rate under uncertainty. The resulting nonconvex problem is efficiently solved via successive convex approximation and alternating minimization. Simulation results demonstrate that the proposed framework significantly enhances spectral efficiency and link stability compared with feedback-based beamforming and non-robust beamforming design, particularly in high-mobility and large-SSB-interval scenarios.
Over the years, the Music Information Retrieval (MIR) research community has released various models pretrained on large amounts of music data. Transfer learning has proven the effectiveness of pretrained backend models for a broad spectrum of downstream tasks, including auto-tagging and genre classification. However, MIR papers generally do not explore the efficiency of pretrained models for Music Recommender Systems (MRS). In addition, the Recommender Systems community tends to favour traditional end-to-end neural network training. Our research addresses this gap and evaluates the performance of nine pretrained backend models (MusicFM, Music2Vec, MERT, EncodecMAE, Jukebox, MusiCNN, MULE, MuQ and MuQ-MuLan) in the context of MRS. We assess them using five recommendation approaches: K-Nearest Neighbours (KNN), Shallow Neural Network, Contrastive Multi-Modal projection, a Hybrid model, and BERT4Rec, in both hot-start and cold-start scenarios. Our findings suggest that pretrained audio representations exhibit a significant performance disparity between traditional MIR tasks and both hot-start and cold-start music recommendation, indicating that the valuable aspects of musical information captured by backend models may differ depending on the task. This study establishes a foundation for further exploration of pretrained audio representations to enhance music recommendation systems.
The relationship between brain lateralization and cognitive functions is well-documented. The left hemisphere primarily handles tasks such as language and arithmetic, while the right hemisphere is involved in creative activities like drawing and music perception. Eye-tracking technology has shown the potential to reveal cognitive states by measuring ocular metrics such as pupil diameter and fixation duration. However, the ability to distinguish lateralized brain activity using these ocular metrics remains underexplored. Here, we demonstrate that pupil diameter and fixation duration can effectively classify left and right brain hemisphere activities. We obtained high classification performance, with an F1 score of 0.894. The results suggest that ocular metrics are robust indicators of lateralized brain activity and can be applied in cognitive monitoring and neurorehabilitation. Future work will expand on this by integrating these methods into real-time applications (EyeBrain), potentially broadening their use across various cognitive and neurological domains.
Automatic music genre classification is a major task in music information retrieval; however, most current benchmarks and models have been developed primarily for Western music, leaving culturally specific traditions underrepresented. In this paper, we introduce the Yemeni Music Information Retrieval (YMIR) dataset, which contains 1,475 carefully selected audio clips covering five traditional Yemeni genres: Sanaani, Hadhrami, Lahji, Tihami, and Adeni. The dataset was labeled by five Yemeni music experts following a clear and structured protocol, resulting in strong inter-annotator agreement (Fleiss' kappa = 0.85). We also propose the Yemeni Music Classification Model (YMCM), a convolutional neural network (CNN)-based system designed to classify music genres from time-frequency features. Using a consistent preprocessing pipeline, we perform a systematic comparison across six feature representations and five architectures, for a total of 30 experiments. Specifically, we evaluate Mel-spectrograms, Chroma, FilterBank, and MFCCs with 13, 20, and 40 coefficients, and benchmark YMCM against standard models (AlexNet, VGG16, MobileNet, and a baseline CNN) under the same experimental conditions. YMCM performs best, reaching its highest accuracy of 98.8% with Mel-spectrogram features. The results also provide practical insights into the relationship between feature representation and model capacity. The findings establish YMIR as a useful benchmark and YMCM as a strong baseline for classifying Yemeni music genres.
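For readers unfamiliar with the agreement statistic reported above, here is a minimal sketch of Fleiss' kappa for a fixed number of raters per item. The rating table is invented toy data for five raters and five categories, not taken from YMIR, and the function name is hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Share of all assignments that fall into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)   # agreement expected by chance
    return (p_observed - p_expected) / (1 - p_expected)

# Toy table (hypothetical): 5 clips, 5 raters, 5 genre categories.
toy_counts = [
    [5, 0, 0, 0, 0],
    [0, 4, 1, 0, 0],
    [0, 0, 5, 0, 0],
    [0, 0, 0, 5, 0],
    [0, 0, 0, 1, 4],
]
print(round(fleiss_kappa(toy_counts), 3))
```

Kappa equals 1 when all raters agree on every item and falls toward 0 as agreement approaches what the category frequencies alone would produce.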
Modern audio systems universally employ mel-scale representations derived from 1940s Western psychoacoustic studies, potentially encoding cultural biases that create systematic performance disparities. We present a comprehensive evaluation of cross-cultural bias in audio front-ends, comparing mel-scale features with learnable alternatives (LEAF, SincNet) and psychoacoustic variants (ERB, Bark, CQT) across speech recognition (11 languages), music analysis (6 collections), and acoustic scene classification (10 European cities). Our controlled experiments isolate front-end contributions while keeping architectures and training protocols minimal and fixed. Results demonstrate that mel-scale features yield 31.2% WER for tonal languages compared to 18.7% for non-tonal languages (a gap of 12.5 percentage points), and show 15.7% F1 degradation between Western and non-Western music. Alternative representations significantly reduce these disparities: LEAF reduces the speech gap by 34% through adaptive frequency allocation, CQT achieves a 52% reduction in music performance gaps, and ERB-scale filtering cuts disparities by 31% with only 1% computational overhead. We also release FairAudioBench, enabling cross-cultural evaluation, and demonstrate that adaptive frequency decomposition offers practical paths toward equitable audio processing. These findings reveal how foundational signal processing choices propagate bias, providing crucial guidance for developing inclusive audio systems.
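The WER figures above can be read either as an absolute or as a relative gap, and the short calculation below separates the two. It uses only the numbers quoted in the abstract; treating the 34% LEAF reduction as relative to the tonal versus non-tonal WER gap is an assumption, since the abstract does not say which baseline it refers to.

```python
# WER values quoted in the abstract (percent).
wer_tonal = 31.2
wer_non_tonal = 18.7

# Absolute gap in percentage points.
abs_gap = wer_tonal - wer_non_tonal

# Relative gap: how much worse tonal WER is, as a fraction of non-tonal WER.
rel_gap = abs_gap / wer_non_tonal

# Assumption: the reported 34% reduction applies to the absolute gap.
leaf_reduction = 0.34
gap_after_leaf = abs_gap * (1 - leaf_reduction)

print(f"absolute gap: {abs_gap:.1f} points")
print(f"relative gap: {rel_gap:.1%}")
print(f"gap after a 34% reduction: {gap_after_leaf:.2f} points")
```

Under that assumption the remaining gap is still several percentage points, so the reduction narrows the disparity without closing it.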