Kyogu Lee

Learning Semantic Information from Raw Audio Signal Using Both Contextual and Phonetic Representations

Feb 02, 2024

Jaeyeon Kim, Injune Hwang, Kyogu Lee

Abstract: We propose a framework to learn semantics from raw audio signals using two types of representations, encoding contextual and phonetic information respectively. Specifically, we introduce a speech-to-unit processing pipeline that captures two types of representations with different time resolutions. For the language model, we adopt a dual-channel architecture to incorporate both types of representation. We also present new training objectives, masked context reconstruction and masked context prediction, that push models to learn semantics effectively. Experiments on the sSIMI metric of the Zero Resource Speech Benchmark 2021 and the Fluent Speech Commands dataset show that our framework learns semantics better than models trained with only one type of representation.

* Accepted to ICASSP 2024

Music Auto-Tagging with Robust Music Representation Learned via Domain Adversarial Training

Jan 27, 2024

Haesun Joung, Kyogu Lee

Abstract: Music auto-tagging is crucial for enhancing music discovery and recommendation. Existing models in Music Information Retrieval (MIR) struggle with real-world noise such as environmental and speech sounds in multimedia content. This study proposes a method inspired by speech-related tasks to enhance music auto-tagging performance in noisy settings. The approach integrates Domain Adversarial Training (DAT) into the music domain, enabling robust music representations that withstand noise. Unlike previous research, this approach involves an additional pretraining phase for the domain classifier, to avoid performance degradation in the subsequent phase. Adding various synthesized noisy music data improves the model's generalization across different noise levels. The proposed architecture demonstrates enhanced performance in music auto-tagging by effectively utilizing unlabeled noisy music data. Additional experiments with supplementary unlabeled data further improve the model's performance, underscoring its robust generalization capabilities and broad applicability.

* 5 pages, 3 figures, accepted to ICASSP 2024

Inverse Nonlinearity Compensation of Hyperelastic Deformation in Dielectric Elastomer for Acoustic Actuation

Jan 08, 2024

Jin Woo Lee, Gwang Seok An, Jeong-Yun Sun, Kyogu Lee

Abstract: This paper analyzes the nonlinear deformation induced by dielectric actuation in pre-stressed ideal dielectric elastomers. It formulates a nonlinear ordinary differential equation governing this deformation based on the hyperelastic model under dielectric stress. Through numerical integration and neural network approximations, the relationship between voltage and stretch is established. Neural networks are employed to approximate solutions for voltage-to-stretch and stretch-to-voltage transformations obtained via an explicit Runge-Kutta method. The effectiveness of these approximations is demonstrated by using them to compensate for nonlinearity through waveshaping of the input signal. The comparative analysis highlights the superior accuracy of the approximated solutions over baseline methods, resulting in minimized harmonic distortions when utilizing dielectric elastomers as acoustic actuators. This study underscores the efficacy of the proposed approach in mitigating nonlinearities and enhancing the performance of dielectric elastomers in acoustic actuation applications.

DDD: A Perceptually Superior Low-Response-Time DNN-based Declipper

Jan 08, 2024

Jayeon Yi, Junghyun Koo, Kyogu Lee

Abstract: Clipping is a common nonlinear distortion that occurs whenever the input or output of an audio system exceeds the supported range. This phenomenon undermines not only the perception of speech quality but also downstream processes utilizing the disrupted signal. Therefore, a real-time-capable, robust, and low-response-time method for speech declipping (SD) is desired. In this work, we introduce DDD (Demucs-Discriminator-Declipper), a real-time-capable speech-declipping deep neural network (DNN) that requires less response time by design. We first observe that a previously untested real-time-capable DNN model, Demucs, exhibits a reasonable declipping performance. Then we utilize adversarial learning objectives to increase the perceptual quality of output speech without additional inference overhead. Subjective evaluations on harshly clipped speech show that DDD outperforms the baselines by a wide margin in terms of speech quality. We perform detailed waveform and spectral analyses to gain insight into the output behavior of DDD in comparison to the baselines. Finally, our streaming simulations also show that DDD is capable of sub-decisecond mean response times, outperforming the state-of-the-art DNN approach by a factor of six.

* To appear, ICASSP 2024. Demo samples at https://stet-stet.github.io/DDD, repo at https://github.com/stet-stet/DDD
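
To make the distortion described in the abstract above concrete, here is a minimal sketch of hard clipping in plain Python. It is not part of DDD or its repository; the function name `hard_clip`, the threshold, and the test signal are all assumptions chosen for illustration. It only shows the degradation that a declipper has to undo.

```python
import math

def hard_clip(signal, threshold):
    """Clamp every sample to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, x)) for x in signal]

# Illustrative input: a 5 Hz sine sampled at 100 Hz with peak amplitude 1.0.
sample_rate = 100
sine = [math.sin(2 * math.pi * 5 * n / sample_rate) for n in range(sample_rate)]

# Clip at half the peak amplitude; every sample beyond +/-0.5 is flattened.
clipped = hard_clip(sine, threshold=0.5)

# Count how many samples were altered by the clipper.
num_clipped = sum(1 for x, y in zip(sine, clipped) if x != y)
print(f"{num_clipped} of {len(sine)} samples were clipped")
```

The flattened peaks discard waveform information, which is why recovering the original signal is an estimation problem rather than a simple inversion.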

Combinatorial music generation model with song structure graph analysis

Dec 24, 2023

Seonghyeon Go, Kyogu Lee

Abstract: In this work, we propose a symbolic music generation model with a song structure graph analysis network. We construct a graph that uses information such as note sequence and instrument as node features, while the correlation between note sequences acts as the edge feature. We train a Graph Neural Network to obtain node representations in the graph, then use the node representations as input to a U-Net to generate CONLON pianoroll image latents. Our experimental results show that the proposed model can generate a comprehensive form of music. Our approach represents a promising and innovative method for symbolic music generation and holds potential applications in various fields of Music Information Retrieval, including music composition, music classification, and music inpainting systems.

* 5 pages (4 pages of paper and 1 page of references), 3 figures

String Sound Synthesizer on GPU-accelerated Finite Difference Scheme

Nov 30, 2023

Jin Woo Lee, Min Jun Choi, Kyogu Lee

Abstract: This paper introduces a nonlinear string sound synthesizer, based on a finite difference simulation of the dynamic behavior of strings under various excitations. The presented synthesizer features a versatile string simulation engine capable of stochastic parameterization, encompassing fundamental frequency modulation, stiffness, tension, frequency-dependent loss, and excitation control. This open-source physical model simulator not only benefits the audio signal processing community but also contributes to the burgeoning field of neural network-based audio synthesis by serving as a novel dataset construction tool. Implemented in PyTorch, this synthesizer offers flexibility, facilitating both CPU and GPU utilization, thereby enhancing its applicability as a simulator. GPU utilization expedites computation by parallelizing operations across spatial and batch dimensions, further enhancing its utility as a data generator.

* Submitted to ICASSP 2024
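
As a rough illustration of what a finite difference string simulation involves, the sketch below steps the ideal linear wave equation with the standard explicit scheme. It is deliberately much simpler than the synthesizer described above: it assumes a lossless, stiffness-free, linear string with fixed ends, and it is written in plain Python rather than PyTorch. The function name `simulate_ideal_string`, the triangular pluck, and the pickup position are hypothetical choices made for this example.

```python
def simulate_ideal_string(num_points, num_steps, courant):
    """Explicit finite-difference scheme for u_tt = c^2 * u_xx with fixed ends.

    `courant` is c * dt / dx; the scheme is only stable for courant <= 1.
    Returns the displacement over time at a pickup point on the string.
    """
    if not 0 < courant <= 1:
        raise ValueError("courant number must be in (0, 1] for stability")
    lam2 = courant ** 2

    # Initial shape: triangular pluck at the midpoint, zero at both ends.
    peak = num_points // 2
    prev = [
        l / peak if l <= peak else (num_points - 1 - l) / (num_points - 1 - peak)
        for l in range(num_points)
    ]
    # Zero initial velocity, approximated by repeating the initial shape.
    curr = list(prev)

    pickup = num_points // 4
    output = []
    for _ in range(num_steps):
        nxt = [0.0] * num_points  # end points stay at zero (fixed boundary)
        for l in range(1, num_points - 1):
            # Second-order central differences in both time and space.
            nxt[l] = (2 * curr[l] - prev[l]
                      + lam2 * (curr[l + 1] - 2 * curr[l] + curr[l - 1]))
        prev, curr = curr, nxt
        output.append(curr[pickup])
    return output

samples = simulate_ideal_string(num_points=21, num_steps=200, courant=0.9)
```

The inner loop over spatial points is independent per point within a time step, which is the kind of structure that can be vectorized across the spatial and batch dimensions on a GPU, as the abstract describes.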

Beat-Aligned Spectrogram-to-Sequence Generation of Rhythm-Game Charts

Nov 22, 2023

Jayeon Yi, Sungho Lee, Kyogu Lee

Abstract: At the heart of "rhythm games" - games where players must perform actions in sync with a piece of music - are "charts", the directives to be given to players. We newly formulate chart generation as a sequence generation task and train a Transformer using a large dataset. We also introduce tempo-informed preprocessing and training procedures, some of which are suggested to be integral to successful training. Our model is found to outperform the baselines on a large dataset, and is also found to benefit from pretraining and finetuning.

* ISMIR 2023 LBD. Demo videos and code at stet-stet.github.io/goct

Yet Another Generative Model For Room Impulse Response Estimation

Nov 05, 2023

Sungho Lee, Hyeong-Seok Choi, Kyogu Lee

Abstract: Recent neural room impulse response (RIR) estimators typically comprise an encoder for reference audio analysis and a generator for RIR synthesis. In particular, the performance of the generator directly influences the overall estimation quality. In this context, we explore an alternate generator architecture for improved performance. We first train an autoencoder with residual quantization to learn a discrete latent token space, where each token represents a small time-frequency patch of the RIR. Then, we cast the RIR estimation problem as a reference-conditioned autoregressive token generation task, employing transformer variants that operate across frequency, time, and quantization depth axes. This way, we address the standard blind estimation task and the additional acoustic matching problem, which aims to find an RIR that matches the source signal to the target signal's reverberation characteristics. Experimental results show that our system is preferable to other baselines across various evaluation metrics.

* WASPAA 2023

Exploiting Time-Frequency Conformers for Music Audio Enhancement

Aug 24, 2023

Yunkee Chae, Junghyun Koo, Sungho Lee, Kyogu Lee

Abstract: With the proliferation of video platforms on the internet, recording musical performances with mobile devices has become commonplace. However, these recordings often suffer from degradation such as noise and reverberation, which negatively impact the listening experience. Consequently, the need for music audio enhancement (referred to as music enhancement from this point onward), which transforms degraded audio recordings into pristine high-quality music, has surged. To address this issue, we propose a music enhancement system based on the Conformer architecture that has demonstrated outstanding performance in speech enhancement tasks. Our approach explores the attention mechanisms of the Conformer and examines their performance to discover the best approach for the music enhancement task. Our experimental results show that our proposed model achieves state-of-the-art performance on single-stem music enhancement. Furthermore, our system can perform general music enhancement with multi-track mixtures, which has not been examined in previous work.

* Accepted by ACM Multimedia 2023

Music De-limiter Networks via Sample-wise Gain Inversion

Aug 02, 2023

Chang-Bin Jeon, Kyogu Lee

Abstract: The loudness war, an ongoing phenomenon in the music industry characterized by the increasing final loudness of music while reducing its dynamic range, has been a controversial topic for decades. Music mastering engineers have used limiters to heavily compress and make music louder, which can induce ear fatigue and hearing loss in listeners. In this paper, we introduce music de-limiter networks that estimate uncompressed music from heavily compressed signals. Inspired by the principle of a limiter, which performs sample-wise gain reduction of a given signal, we propose the framework of sample-wise gain inversion (SGI). We also present the musdb-XL-train dataset, consisting of 300k segments created by applying a commercial limiter plug-in, for training real-world-friendly de-limiter networks. Our proposed de-limiter network achieves excellent performance with a scale-invariant source-to-distortion ratio (SI-SDR) of 23.8 dB in reconstructing musdb-HQ from musdb-XL data, a limiter-applied version of musdb-HQ. The training data, codes, and model weights are available in our repository (https://github.com/jeonchangbin49/De-limiter).

* Accepted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2023
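
The two ideas named in the abstract above, sample-wise gain and SI-SDR, can be sketched in a few lines of plain Python. This is not code from the De-limiter repository. The `si_sdr` function follows the commonly used definition (project the estimate onto the reference, then compare the energy of the projection with that of the residual); the paper's exact evaluation setup is not given here. The `toy_limiter_gain` function is a hypothetical stand-in that applies an instantaneous gain with no attack or release smoothing, unlike a commercial limiter, and the inversion step cheats by reusing the true gain, whereas a de-limiter would have to estimate it from the limited signal alone.

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB, using the common projection-based definition."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / ref_energy                      # optimal scaling of the reference
    target = [alpha * r for r in reference]       # part of estimate explained by reference
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    if noise_energy == 0:
        return math.inf                           # perfect reconstruction up to scale
    return 10 * math.log10(target_energy / noise_energy)

def toy_limiter_gain(signal, threshold):
    """Per-sample gain g[n] <= 1 that keeps |g[n] * x[n]| at or below threshold."""
    return [min(1.0, threshold / abs(x)) if x != 0 else 1.0 for x in signal]

# Illustrative signal values (assumed, not from any dataset).
original = [0.2, -0.9, 0.6, -0.3, 1.0, -0.1]
gain = toy_limiter_gain(original, threshold=0.5)

# Limiting as sample-wise gain reduction: y[n] = g[n] * x[n].
limited = [g * x for g, x in zip(gain, original)]

# Sample-wise inversion with the (here known) gain: x[n] = y[n] / g[n].
restored = [y / g for y, g in zip(limited, gain)]

print("SI-SDR of limited signal:", si_sdr(limited, original))
print("SI-SDR of restored signal:", si_sdr(restored, original))
```

Because SI-SDR rescales the reference before measuring the residual, multiplying the estimate by a constant leaves the score unchanged, which matters when comparing signals whose overall loudness differs.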
