"music": models, code, and papers

A Novel Tree Model-based DNN to Achieve a High-Resolution DOA Estimation via Massive MIMO receive array

Nov 30, 2023
Yifan Li, Feng Shu, Jun Zou, Wei Gao, Yaoliang Song, Jiangzhou Wang

To satisfy the high-resolution requirements of direction-of-arrival (DOA) estimation, conventional grid-based deep neural network (DNN) methods must significantly increase the number of output classes, which leads to very high model complexity. To address this problem, a multi-level tree-based DNN model (TDNN) is proposed as an alternative, where each level uses small-scale multi-layer neural networks (MLNNs) as nodes to divide the target angular interval into multiple sub-intervals, and each output class is associated with an MLNN at the next level. The number of MLNNs therefore increases from the first level to the last, so increasing the depth of the tree dramatically raises the number of output classes and improves the estimation accuracy. More importantly, this network is extended to multi-emitter DOA estimation. Simulation results show that the proposed TDNN performs much better than the conventional DNN and root-MUSIC at extremely low signal-to-noise ratio (SNR), and can achieve the Cramer-Rao lower bound (CRLB). Additionally, in the multi-emitter scenario, the proposed Q-TDNN also achieves a substantial performance gain over DNN and root-MUSIC, and this gain grows as the number of emitters increases.
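
The tree structure described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: each node here is replaced by a lookup on the true angle, whereas the paper uses a small MLNN per node to predict the sub-interval. The function names, the [-90, 90] degree range, the branching factor of 4, and the depth of 3 are all assumptions chosen for illustration. The sketch shows only how a tree of depth L with K sub-intervals per node yields K^L leaf classes.

```python
def tree_path(angle_deg, low=-90.0, high=90.0, branches=4, depth=3):
    """Return the child index chosen at each level and the final sub-interval.

    Hypothetical helper: a real TDNN node would be a trained classifier;
    here the true angle decides the branch so the labeling is visible.
    """
    path = []
    for _ in range(depth):
        width = (high - low) / branches
        # Clamp so that angle_deg == high falls into the last sub-interval.
        idx = min(int((angle_deg - low) // width), branches - 1)
        path.append(idx)
        low, high = low + idx * width, low + (idx + 1) * width
    return path, (low, high)


def num_leaf_classes(branches, depth):
    """Number of distinct output classes at the last level of the tree."""
    return branches ** depth


path, interval = tree_path(10.0)
print(path, interval)          # [2, 0, 3] (8.4375, 11.25)
print(num_leaf_classes(4, 3))  # 64
```

With these assumed settings, three levels of four-way nodes resolve a 180-degree range into 64 intervals of about 2.8 degrees each, while every individual node only ever chooses among four classes.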

Mel-Band RoFormer for Music Source Separation

Oct 03, 2023
Ju-Chiang Wang, Wei-Tsung Lu, Minz Won

Recently, multi-band spectrogram-based approaches such as Band-Split RNN (BSRNN) have demonstrated promising results for music source separation. In our recent work, we introduce the BS-RoFormer model, which inherits the band-split scheme of BSRNN at the front-end and then uses a hierarchical Transformer with Rotary Position Embedding (RoPE) to model the inner-band and inter-band sequences for multi-band mask estimation. This model has achieved state-of-the-art performance, but the band-split scheme is defined empirically, without analytic support from the literature. In this paper, we propose Mel-RoFormer, which adopts the Mel-band scheme that maps the frequency bins into overlapped subbands according to the mel scale. In contrast, the band-split mapping in BSRNN and BS-RoFormer is non-overlapping and designed based on heuristics. Using the MUSDB18HQ dataset for experiments, we demonstrate that Mel-RoFormer outperforms BS-RoFormer in the separation tasks of vocals, drums, and other stems.
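
A rough sketch of what "overlapped subbands according to the mel scale" can look like is given below. It is an illustration under stated assumptions, not the Mel-RoFormer code: the HTK-style mel formula, the FFT size, the sample rate, the band count, and the rule that band i spans mel edges i to i+2 are all choices made here, and the function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    # HTK-style mel formula (an assumption; other mel variants exist).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_bins(n_fft=2048, sample_rate=44100, n_bands=8):
    """Map FFT bin indices to overlapping bands spaced evenly in mel."""
    n_bins = n_fft // 2 + 1
    nyquist = sample_rate / 2.0
    bin_hz = [k * nyquist / (n_bins - 1) for k in range(n_bins)]
    mel_max = hz_to_mel(nyquist)
    edges = [mel_to_hz(mel_max * i / (n_bands + 2 - 1)) for i in range(n_bands + 2)]
    edges[0], edges[-1] = 0.0, nyquist  # pin the ends against round-off
    bands = []
    for i in range(n_bands):
        lo, hi = edges[i], edges[i + 2]  # two mel steps wide, so neighbours overlap
        bands.append([k for k, f in enumerate(bin_hz) if lo <= f <= hi])
    return bands


bands = mel_band_bins()
for i, b in enumerate(bands):
    print(i, b[0], b[-1], len(b))  # band index, first bin, last bin, bin count
```

Because each band spans two mel steps, every bin between the first and last edge belongs to more than one band, and the bands widen toward high frequencies. A non-overlapping split, by contrast, would assign each bin to exactly one band.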

* submitted as an ISMIR 2023 late-breaking and demo paper

Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering

Dec 20, 2023
Zhangbin Li, Dan Guo, Jinxing Zhou, Jing Zhang, Meng Wang

This paper focuses on the Audio-Visual Question Answering (AVQA) task that aims to answer questions derived from untrimmed audible videos. To generate accurate answers, an AVQA model is expected to find the most informative audio-visual clues relevant to the given questions. In this paper, we propose to explicitly consider fine-grained visual objects in video frames (object-level clues) and explore the multi-modal relations (i.e., the object, audio, and question) in terms of feature interaction and model optimization. For feature interaction, we present an end-to-end object-oriented network that adopts a question-conditioned clue discovery module to concentrate audio/visual modalities on respective keywords of the question and designs a modality-conditioned clue collection module to highlight closely associated audio segments or visual objects. For model optimization, we propose an object-aware adaptive-positivity learning strategy that selects the highly semantic-matched multi-modal pair as positivity. Specifically, we design two object-aware contrastive loss functions to identify the highly relevant question-object pairs and audio-object pairs, respectively. These selected pairs are constrained to have larger similarity values than the mismatched pairs. The positivity-selecting process is adaptive as the positivity pairs selected in each video frame may be different. These two object-aware objectives help the model understand which objects are exactly relevant to the question and which are making sounds. Extensive experiments on the MUSIC-AVQA dataset demonstrate that the proposed method is effective in finding favorable audio-visual clues and also achieves new state-of-the-art question-answering performance.
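
The idea of selecting a positive pair per frame and pushing its similarity above the mismatched pairs can be sketched as a softmax-style contrastive loss. This is a generic illustration, not the paper's loss: choosing the single highest-scoring object as the positive, the temperature value, and the function name are all assumptions, and the abstract does not say how many pairs are selected or which loss form is used.

```python
import math


def adaptive_positive_loss(similarities, temperature=0.1):
    """Contrastive loss for one query (question or audio) in one frame.

    similarities: similarity score between the query and each object.
    The positive is chosen per call as the best-matching object, so it
    may differ from frame to frame (hypothetical selection rule).
    """
    pos = max(range(len(similarities)), key=lambda i: similarities[i])
    scaled = [s / temperature for s in similarities]
    # Numerically stable log-sum-exp over all objects in the frame.
    m = max(scaled)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scaled))
    # Negative log-probability of the selected positive.
    return pos, log_denominator - scaled[pos]


print(adaptive_positive_loss([0.9, 0.1, 0.2]))    # clear winner, small loss
print(adaptive_positive_loss([0.9, 0.8, 0.85]))   # ambiguous frame, larger loss
```

Minimizing this quantity raises the selected pair's similarity relative to the others in the same frame, which matches the constraint stated in the abstract in spirit only.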

* Accepted by AAAI-2024

JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models

Aug 09, 2023
Peike Li, Boyu Chen, Yao Yao, Yikai Wang, Allen Wang, Alex Wang

Music generation has attracted growing interest with the advancement of deep generative models. However, generating music conditioned on textual descriptions, known as text-to-music, remains challenging due to the complexity of musical structures and high sampling rate requirements. Despite the task's significance, prevailing generative models exhibit limitations in music quality, computational efficiency, and generalization. This paper introduces JEN-1, a universal high-fidelity model for text-to-music generation. JEN-1 is a diffusion model incorporating both autoregressive and non-autoregressive training. Through in-context learning, JEN-1 performs various generation tasks including text-guided music generation, music inpainting, and continuation. Evaluations demonstrate JEN-1's superior performance over state-of-the-art methods in text-music alignment and music quality while maintaining computational efficiency. Our demos are available at http://futureverse.com/research/jen/demos/jen1

Modeling Bends in Popular Music Guitar Tablatures

Aug 22, 2023
Alexandre D'Hooge, Louis Bigo, Ken Déguernel

Tablature notation is widely used in popular music to transcribe and share guitar musical content. As a complement to standard score notation, tablatures transcribe performance gesture information including finger positions and a variety of guitar-specific playing techniques such as slides, hammer-on/pull-off or bends. This paper focuses on bends, which make it possible to progressively shift the pitch of a note, thereby circumventing physical limitations of the discrete fretted fingerboard. In this paper, we propose a set of 25 high-level features, computed for each note of the tablature, to study how bend occurrences can be predicted from their past and future short-term context. Experiments are performed on a corpus of 932 lead guitar tablatures of popular music and show that a decision tree successfully predicts bend occurrences with an F1 score of 0.71 and a limited number of false positive predictions, demonstrating promising applications to assist the arrangement of non-guitar music into guitar tablatures.

* 24th International Society for Music Information Retrieval Conference, International Society for Music Information Retrieval, Nov 2023, Milan, Italy

A Survey of AI Music Generation Tools and Models

Aug 24, 2023
Yueyue Zhu, Jared Baca, Banafsheh Rekabdar, Reza Rawassizadeh

In this work, we provide a comprehensive survey of AI music generation tools, including both research projects and commercialized applications. To conduct our analysis, we classified music generation approaches into three categories: parameter-based, text-based, and visual-based classes. Our survey highlights the diverse possibilities and functional features of these tools, which cater to a wide range of users, from regular listeners to professional musicians. We observed that each tool has its own set of advantages and limitations. As a result, we have compiled a comprehensive list of these factors that should be considered during the tool selection process. Moreover, our survey offers critical insights into the underlying mechanisms and challenges of AI music generation.

Automatic Music Playlist Generation via Simulation-based Reinforcement Learning

Oct 13, 2023
Federico Tomasi, Joseph Cauteruccio, Surya Kanoria, Kamil Ciosek, Matteo Rinaldi, Zhenwen Dai

Personalization of playlists is a common feature in music streaming services, but conventional techniques, such as collaborative filtering, rely on explicit assumptions regarding content quality to learn how to make recommendations. Such assumptions often result in misalignment between offline model objectives and online user satisfaction metrics. In this paper, we present a reinforcement learning framework that addresses these limitations by directly optimizing for user satisfaction metrics through a simulated playlist-generation environment. Using this simulator, we develop and train a modified Deep Q-Network, the action head DQN (AH-DQN), in a manner that addresses the challenges imposed by the large state and action space of our RL formulation. The resulting policy is capable of making recommendations from large and dynamic sets of candidate items with the expectation of maximizing consumption metrics. We analyze and evaluate agents offline via simulations that use environment models trained on both public and proprietary streaming datasets. We show how these agents lead to better user-satisfaction metrics compared to baseline methods during online A/B tests. Finally, we demonstrate that performance assessments produced from our simulator are strongly correlated with observed online metric results.

* 10 pages. KDD 23

Exploring how a Generative AI interprets music

Jul 31, 2023
Gabriela Barenboim, Luigi Del Debbio, Johannes Hirn, Veronica Sanz

We use Google's MusicVAE, a Variational Auto-Encoder with a 512-dimensional latent space to represent a few bars of music, and organize the latent dimensions according to their relevance in describing music. We find that, on average, most latent neurons remain silent when fed real music tracks: we call these "noise" neurons. The remaining few dozen latent neurons that do fire are called "music neurons". We ask which neurons carry the musical information and what kind of musical information they encode, namely something that can be identified as pitch, rhythm or melody. We find that most of the information about pitch and rhythm is encoded in the first few music neurons: the neural network has thus constructed a couple of variables that non-linearly encode many human-defined variables used to describe pitch and rhythm. The concept of melody only seems to show up in independent neurons for longer sequences of music.

* 16 pages, 12 figures

miditok: A Python package for MIDI file tokenization

Oct 26, 2023
Nathan Fradet, Jean-Pierre Briot, Fabien Chhel, Amal El Fallah Seghrouchni, Nicolas Gutowski

Recent progress in natural language processing has been adapted to the symbolic music modality. Language models, such as Transformers, have been used with symbolic music for a variety of tasks, including music generation, modeling, and transcription, with state-of-the-art performance. These models are beginning to be used in production products. To encode and decode music for the backbone model, they need to rely on tokenizers, whose role is to serialize music into sequences of distinct elements called tokens. MidiTok is an open-source library for tokenizing symbolic music with great flexibility and extended features. It features the most popular music tokenizations under a unified API. It is designed to be easy to use and extensible for everyone.
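
To make "serialize music into sequences of distinct elements called tokens" concrete, here is a toy event-based tokenizer in plain Python. It does not use or reproduce the MidiTok API: the token names, the note tuple layout, and the `encode`/`decode` functions are hypothetical and only loosely follow the style of event-based music tokenizations.

```python
def encode(notes):
    """Serialize notes into a flat token list.

    notes: iterable of (start, pitch, velocity, duration) in integer ticks.
    """
    tokens, now = [], 0
    for start, pitch, velocity, duration in sorted(notes):
        if start > now:
            # Advance the clock only when the next note starts later.
            tokens.append(f"TimeShift_{start - now}")
            now = start
        tokens += [f"Pitch_{pitch}", f"Velocity_{velocity}", f"Duration_{duration}"]
    return tokens


def decode(tokens):
    """Rebuild (start, pitch, velocity, duration) tuples from tokens."""
    notes, now, current = [], 0, {}
    for tok in tokens:
        kind, value = tok.split("_")
        value = int(value)
        if kind == "TimeShift":
            now += value
        elif kind == "Pitch":
            current = {"start": now, "pitch": value}
        elif kind == "Velocity":
            current["velocity"] = value
        elif kind == "Duration":
            notes.append((current["start"], current["pitch"], current["velocity"], value))
    return notes


chord_then_note = [(0, 60, 100, 4), (0, 64, 100, 4), (4, 67, 90, 2)]
tokens = encode(chord_then_note)
print(tokens)
print(decode(tokens) == chord_then_note)  # True
```

A real tokenizer also has to handle tempo, multiple tracks, time quantization, and a fixed vocabulary, which this sketch leaves out.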

* Updated and comprehensive report. Original ISMIR 2021 document at https://archives.ismir.net/ismir2021/latebreaking/000005.pdf

Choir Transformer: Generating Polyphonic Music with Relative Attention on Transformer

Aug 01, 2023
Jiuyang Zhou, Hong Zhu, Xingping Wang

Polyphonic music generation is still a challenging direction because of the relationship between melody and harmony generation. Most previous studies used RNN-based models. However, RNN-based models struggle to capture the relationship between long-distance notes. In this paper, we propose a polyphonic music generation neural network named Choir Transformer (https://github.com/Zjy0401/choir-transformer), with relative positional attention to better model the structure of music. We also propose a music representation suitable for polyphonic music generation. The performance of Choir Transformer surpasses the previous state-of-the-art accuracy of 4.06%. We also measure the harmony metrics of polyphonic music. Experiments show that the harmony metrics are close to the music of Bach. In practical applications, the generated melody and rhythm can be adjusted according to the specified input, in different styles of music such as folk or pop.
