"music": models, code, and papers

Machine Learning-based Classification of Birds through Birdsong

Dec 09, 2022
Yueying Chang, Richard O. Sinnott

Audio sound recognition and classification is used for many tasks and applications, including human voice recognition, music recognition, and audio tagging. In this paper, we apply Mel Frequency Cepstral Coefficients (MFCC) in combination with a range of machine learning models to identify (Australian) birds from publicly available audio files of their birdsong. We present the approaches used for data processing and augmentation and compare the results of various state-of-the-art machine learning models. We achieve an overall accuracy of 91% for the top-5 birds from the 30 selected as the case study. Applying the models to more challenging and diverse audio files comprising 152 bird species, we achieve an accuracy of 58%.

A Hands-on Comparison of DNNs for Dialog Separation Using Transfer Learning from Music Source Separation

Jun 16, 2021
Martin Strauss, Jouni Paulus, Matteo Torcoli, Bernd Edler

This paper describes a hands-on comparison on using state-of-the-art music source separation deep neural networks (DNNs) before and after task-specific fine-tuning for separating speech content from non-speech content in broadcast audio (i.e., dialog separation). The music separation models are selected as they share the number of channels (2) and sampling rate (44.1 kHz or higher) with the considered broadcast content, and vocals separation in music is considered as a parallel for dialog separation in the target application domain. These similarities are assumed to enable transfer learning between the tasks. Three models pre-trained on music (Open-Unmix, Spleeter, and Conv-TasNet) are considered in the experiments, and fine-tuned with real broadcast data. The performance of the models is evaluated before and after fine-tuning with computational evaluation metrics (SI-SIRi, SI-SDRi, 2f-model), as well as with a listening test simulating an application where the non-speech signal is partially attenuated, e.g., for better speech intelligibility. The evaluations include two reference systems specifically developed for dialog separation. The results indicate that pre-trained music source separation models can be used for dialog separation to some degree, and that they benefit from the fine-tuning, reaching a performance close to task-specific solutions.

* accepted in INTERSPEECH 2021
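
The SI-SDRi metric named in the abstract above can be illustrated with a minimal sketch. It assumes the common definition of scale-invariant SDR (project the estimate onto the reference, then compare target and residual energy) and takes the "i" to mean the improvement over the unprocessed mixture. The function names and the toy signals are hypothetical, not taken from the paper.

```python
import math

def si_sdr(reference, estimate):
    """Scale-invariant signal-to-distortion ratio in dB (common definition)."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / ref_energy                      # optimal scaling of the reference
    target = [alpha * r for r in reference]       # part of the estimate explained by the reference
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    # Note: a perfect estimate gives residual_energy == 0 and is not handled here.
    return 10.0 * math.log10(target_energy / residual_energy)

def si_sdr_improvement(reference, estimate, mixture):
    """SI-SDRi: gain of the separated estimate over the unprocessed mixture."""
    return si_sdr(reference, estimate) - si_sdr(reference, mixture)

# Toy signals (hypothetical): "speech" plus an orthogonal "background".
speech = [1.0, -1.0, 1.0, -1.0]
background = [1.0, 1.0, -1.0, -1.0]
mixture = [s + b for s, b in zip(speech, background)]
estimate = [s + 0.1 * b for s, b in zip(speech, background)]  # background attenuated by 20 dB

improvement_db = si_sdr_improvement(speech, estimate, mixture)
```

Because the estimate is projected onto the reference first, rescaling the estimate by any non-zero constant leaves the score unchanged, which is the property the "scale-invariant" prefix refers to.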

Multiple Signal Classification Based Joint Communication and Sensing System

Nov 08, 2022
Xu Chen, Zhiyong Feng, Zhiqing Wei, Xin Yuan, Ping Zhang, J. Andrew Zhang, Heng Yang

Joint communication and sensing (JCS) has become a promising technology for mobile networks because of its higher spectrum and energy efficiency. Up to now, the prevalent fast Fourier transform (FFT)-based sensing method for mobile JCS networks is on-grid based, and the grid interval determines the resolution. Because the mobile network usually has limited consecutive OFDM symbols in a downlink (DL) time slot, the sensing accuracy is restricted by the limited resolution, especially for velocity estimation. In this paper, we propose a multiple signal classification (MUSIC)-based JCS system that can achieve higher sensing accuracy for the angle of arrival, range, and velocity estimation, compared with the traditional FFT-based JCS method. We further propose a JCS channel state information (CSI) enhancement method by leveraging the JCS sensing results. Finally, we derive a theoretical lower bound for sensing mean square error (MSE) by using perturbation analysis. Simulation results show that in terms of the sensing MSE performance, the proposed MUSIC-based JCS outperforms the FFT-based one by more than 20 dB. Moreover, the bit error rate (BER) of communication demodulation using the proposed JCS CSI enhancement method is significantly reduced compared with communication using the originally estimated CSI.

* 30 pages, 10 figures, major revision to IEEE Transactions on Wireless Communications
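
For readers unfamiliar with the MUSIC estimator named in the abstract above, the textbook subspace formulation is sketched below. This is the generic form, not the paper's specific JCS derivation. The symbols are assumptions introduced here: $\mathbf{x}$ is the received snapshot vector, $\mathbf{a}(\theta)$ the steering vector for a candidate parameter $\theta$ (angle, delay, or Doppler), and $\mathbf{E}_s$, $\mathbf{E}_n$ the signal and noise eigenvector matrices of the covariance $\mathbf{R}$.

```latex
\mathbf{R} = \mathbb{E}\!\left[\mathbf{x}\mathbf{x}^{H}\right]
           = \mathbf{E}_s \boldsymbol{\Lambda}_s \mathbf{E}_s^{H}
           + \mathbf{E}_n \boldsymbol{\Lambda}_n \mathbf{E}_n^{H},
\qquad
P_{\mathrm{MUSIC}}(\theta) =
  \frac{1}{\mathbf{a}^{H}(\theta)\,\mathbf{E}_n \mathbf{E}_n^{H}\,\mathbf{a}(\theta)}
```

Since the true steering vectors are orthogonal to the noise subspace, the denominator approaches zero at the true parameters and the pseudospectrum peaks there. The parameter $\theta$ can be evaluated on an arbitrarily fine search grid, so the resolution is not tied to the FFT bin spacing.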

MYRiAD: A Multi-Array Room Acoustic Database

Jan 30, 2023
Thomas Dietzen, Randall Ali, Maja Taseska, Toon van Waterschoot

In the development of acoustic signal processing algorithms, their evaluation in various acoustic environments is of utmost importance. In order to advance evaluation in realistic and reproducible scenarios, several high-quality acoustic databases have been developed over the years. In this paper, we present another complementary database of acoustic recordings, referred to as the Multi-arraY Room Acoustic Database (MYRiAD). The MYRiAD database is unique in its diversity of microphone configurations suiting a wide range of enhancement and reproduction applications (such as assistive hearing, teleconferencing, or sound zoning), the acoustics of the two recording spaces, and the variety of contained signals including 1214 room impulse responses (RIRs), reproduced speech, music, and stationary noise, as well as recordings of live cocktail parties held in both rooms. The microphone configurations comprise a dummy head (DH) with in-ear omnidirectional microphones, two behind-the-ear (BTE) pieces equipped with 2 omnidirectional microphones each, 5 external omnidirectional microphones (XMs), and two concentric circular microphone arrays (CMAs) consisting of 12 omnidirectional microphones in total. The two recording spaces, namely the SONORA Audio Laboratory (SAL) and the Alamire Interactive Laboratory (AIL), have reverberation times of 2.1s and 0.5s, respectively. Audio signals were reproduced using 10 movable loudspeakers in the SAL and a built-in array of 24 loudspeakers in the AIL. MATLAB and Python scripts are included for accessing the signals as well as microphone and loudspeaker coordinates. The database is publicly available at [1].

* submitted for publication

Scaling and compressing melodies using geometric similarity measures

Sep 19, 2022
Luis Evaristo Caraballo, José Miguel Díaz-Báñez, Fabio Rodríguez, Vanesa Sánchez-Canales, Inmaculada Ventura

Melodic similarity measurement is of key importance in music information retrieval. In this paper, we use geometric matching techniques to measure the similarity between two melodies. We represent music as sets of points or sets of horizontal line segments in the Euclidean plane and propose efficient algorithms for optimization problems inspired by two operations on melodies: linear scaling and audio compression. In the scaling problem, an incoming query melody is scaled forward until the similarity measure between the query and a reference melody is minimized. The compression problem asks for a subset of notes of a given melody such that the matching cost between the selected notes and the reference melody is minimized.
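
The scaling problem described above can be sketched as a brute-force search. The sketch assumes melodies are point sets of (onset time, pitch) pairs and uses a simple nearest-note L1 cost. Both the cost function and the grid of candidate factors are illustrative assumptions, not the efficient algorithms or similarity measures proposed in the paper.

```python
def matching_cost(query, reference):
    """Sum, over query notes, of the L1 distance to the nearest reference note."""
    return sum(
        min(abs(q_onset - r_onset) + abs(q_pitch - r_pitch)
            for r_onset, r_pitch in reference)
        for q_onset, q_pitch in query
    )

def scale_time(melody, factor):
    """Stretch a melody along the time axis; pitches are left unchanged."""
    return [(onset * factor, pitch) for onset, pitch in melody]

def best_scale(query, reference, factors):
    """Return the candidate factor whose scaled query best matches the reference."""
    return min(factors,
               key=lambda f: matching_cost(scale_time(query, f), reference))

# Hypothetical data: the query is the reference played twice as fast.
reference = [(0, 60), (2, 62), (4, 64), (6, 65)]
query = [(0, 60), (1, 62), (2, 64), (3, 65)]
factors = [1.0, 1.5, 2.0, 2.5, 3.0]

best = best_scale(query, reference, factors)
```

Here the factor 2.0 maps every query note exactly onto a reference note, so its cost is zero and it is selected. A grid search like this only tests the listed factors, which is the limitation that exact geometric algorithms avoid.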

ChoreoNet: Towards Music to Dance Synthesis with Choreographic Action Unit

Sep 16, 2020
Zijie Ye, Haozhe Wu, Jia Jia, Yaohua Bu, Wei Chen, Fanbo Meng, Yanfeng Wang

Dance and music are two highly correlated artistic forms. Synthesizing dance motions has attracted much attention recently. Most previous works conduct music-to-dance synthesis by directly mapping music to human skeleton keypoints. Human choreographers, in contrast, design dance motions from music in a two-stage manner: they first devise multiple choreographic action units (CAUs), each with a series of dance motions, and then arrange the CAU sequence according to the rhythm, melody, and emotion of the music. Inspired by this, we systematically study this two-stage choreography approach and construct a dataset that incorporates such choreography knowledge. Based on the constructed dataset, we design a two-stage music-to-dance synthesis framework, ChoreoNet, to imitate the human choreography procedure. Our framework first devises a CAU prediction model to learn the mapping between music and CAU sequences. We then devise a spatial-temporal inpainting model to convert the CAU sequence into continuous dance motions. Experimental results demonstrate that the proposed ChoreoNet outperforms baseline methods (0.622 in terms of CAU BLEU score and 1.59 in terms of user study score).

* 10 pages, 5 figures, Accepted by ACM MM 2020

Extract fundamental frequency based on CNN combined with PYIN

Aug 17, 2022
Ruowei Xing, Shengchen Li

This paper addresses the extraction of multiple fundamental frequencies (multiple F0) based on PYIN, an algorithm for extracting the fundamental frequency (F0) of monophonic music, and a trained convolutional neural network (CNN) model, in which a pitch salience function of the input signal is produced to estimate the multiple F0. The implementation of these two algorithms and their respective advantages and disadvantages are discussed. After analysing the differing performance of the two methods, PYIN is applied to supplement the F0 extracted by the trained CNN model, combining the advantages of both algorithms. For evaluation, four pieces played by two violins are used, and the performance of the models is evaluated according to the flatness of the extracted F0 curve. The results show that the combined model outperforms the original algorithms when extracting F0 from both monophonic and polyphonic music.

Music2Dance: DanceNet for Music-driven Dance Generation

Mar 10, 2020
Wenlin Zhuang, Congyi Wang, Siyu Xia, Jinxiang Chai, Yangang Wang

Synthesizing human motions from music, i.e., music-to-dance, is appealing and has attracted considerable research interest in recent years. It is challenging not only because dance requires realistic and complex human motions but, more importantly, because the synthesized motions should be consistent with the style, rhythm, and melody of the music. In this paper, we propose a novel autoregressive generative model, DanceNet, which takes the style, rhythm, and melody of music as control signals to generate 3D dance motions with high realism and diversity. To boost the performance of our proposed model, we capture several synchronized music-dance pairs performed by professional dancers and build a high-quality music-dance pair dataset. Experiments demonstrate that the proposed method achieves state-of-the-art results.

* Our results are shown at https://youtu.be/bTHSrfEHcG8

Incorporating Music Knowledge in Continual Dataset Augmentation for Music Generation

Jul 17, 2020
Alisa Liu, Alexander Fang, Gaëtan Hadjeres, Prem Seetharaman, Bryan Pardo

Deep learning has rapidly become the state-of-the-art approach for music generation. However, training a deep model typically requires a large training set, which is often not available for specific musical styles. In this paper, we present augmentative generation (Aug-Gen), a method of dataset augmentation for any music generation system trained on a resource-constrained domain. The key intuition of this method is that the training data for a generative system can be augmented by examples the system produces during the course of training, provided these examples are of sufficiently high quality and variety. We apply Aug-Gen to Transformer-based chorale generation in the style of J.S. Bach, and show that this allows for longer training and results in better generative output.

* 2 pages, 2 figures, Machine Learning for Media Discovery (ML4MD) Workshop at ICML 2020

A Hands-on Comparison of DNNs for Dialog Separation Using Transfer Learning from Music Source Separation

Jun 22, 2021
Martin Strauss, Jouni Paulus, Matteo Torcoli, Bernd Edler

* accepted in INTERSPEECH 2021

Via

Access Paper or Ask Questions