Abstract: While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment using CMI-Pref as well as prior datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. The training data, benchmarks, and reward models are publicly available.
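The inference-time scaling the abstract mentions is, in its simplest form, best-of-k (top-k) filtering: sample several candidate generations and keep the ones the reward model scores highest. The sketch below assumes hypothetical `generate` and `reward` callables standing in for a CMI-conditioned generator and a CMI-RM-style scorer; it is an illustration of the generic technique, not the paper's implementation.

```python
def best_of_k(prompt, generate, reward, k=8, keep=1):
    """Sample k candidates for one prompt, score each with the reward
    model, and return the top `keep` candidates by reward."""
    candidates = [generate(prompt) for _ in range(k)]
    return sorted(candidates, key=reward, reverse=True)[:keep]


# Toy usage with stand-in functions: "generation" draws random numbers
# and the "reward" is the number itself, so the largest draws survive.
import random

random.seed(0)
top = best_of_k("a calm piano piece", lambda p: random.random(), lambda x: x,
                k=8, keep=2)
```

Raising `k` trades more generator compute for higher expected reward, which is what makes this a scaling knob at inference time.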



Abstract: Deep learning algorithms have been widely applied to automatic music generation. However, few objective approaches have been proposed to assess whether a melody was composed by a machine or a human. The Conference on Sound and Music Technology (2020) provides a great opportunity to address this problem. In this paper, a masked language model based on ALBERT, trained on AI-composed single-track MIDI, is presented for the composer classification task. In addition, pitch transposition and MIDI sequence truncation are applied for data augmentation. To prevent overfitting, a refined loss function is proposed and the number of parameters is reduced. This work provides a new approach to extracting features from tiny datasets, a situation that is common in music signal analysis and deserves more attention.
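The two augmentations named above can be sketched on a single-track melody represented as a list of MIDI pitch numbers (0-127). This representation and the function names are assumptions for illustration, not the paper's actual pipeline.

```python
def transpose(pitches, semitones):
    """Shift every pitch by a fixed interval, clamped to the MIDI range.
    Transposition preserves the melody's contour, so the class label
    (AI- vs. human-composed) is unchanged."""
    return [min(127, max(0, p + semitones)) for p in pitches]


def truncate(pitches, max_len):
    """Keep only the first max_len note events of the sequence."""
    return pitches[:max_len]


def augment(pitches, shifts=(-2, -1, 1, 2), max_len=512):
    """Produce several label-preserving variants of one melody by
    combining transposition with truncation."""
    return [truncate(transpose(pitches, s), max_len) for s in shifts]
```

Label-preserving transforms like these are a standard way to stretch a tiny dataset, since each melody yields several distinct training sequences.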