Kyogu Lee

Variable Bitrate Residual Vector Quantization for Audio Coding

Oct 08, 2024

Yunkee Chae, Woosung Choi, Yuhta Takida, Junghyun Koo, Yukara Ikemiya, Zhi Zhong, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Kyogu Lee, Wei-Hsiang Liao(+1 more)

Abstract: Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of the rate-distortion tradeoff, particularly for simple input audio such as silence. To address this limitation, we propose variable bitrate RVQ (VRVQ) for audio codecs, which allows for more efficient coding by adapting the number of codebooks used per frame. Furthermore, we propose a gradient estimation method for the non-differentiable masking operation that transforms the importance map into the binary importance mask, improving model training via a straight-through estimator. We demonstrate that the proposed training framework achieves superior results compared to the baseline method and shows further improvement when applied to the current state-of-the-art codec.

* Submitted to ICASSP 2025
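
As a rough illustration of the variable-bitrate idea in the abstract above, the sketch below maps a per-frame importance value to a binary mask over codebooks and counts the resulting bits per frame. The names `importance_to_mask` and `frame_bits`, the ceiling-based scaling rule, and the example values are assumptions made for illustration, not the paper's formulation; the gradient estimation for the masking step is not shown.

```python
import math
from typing import List


def importance_to_mask(importance: List[float], num_codebooks: int) -> List[List[int]]:
    """Map per-frame importance values in [0, 1] to binary codebook masks.

    Hypothetical rule (an assumption, not taken from the paper): a frame with
    importance p keeps the first ceil(p * num_codebooks) codebooks, and at
    least one codebook is always kept.
    """
    masks = []
    for p in importance:
        p = min(max(p, 0.0), 1.0)  # clamp to [0, 1]
        keep = max(1, math.ceil(p * num_codebooks))
        masks.append([1 if k < keep else 0 for k in range(num_codebooks)])
    return masks


def frame_bits(mask: List[int], codebook_size: int) -> float:
    """Bits spent on one frame: one index of log2(codebook_size) bits per active codebook."""
    return sum(mask) * math.log2(codebook_size)


if __name__ == "__main__":
    # A near-silent frame, a moderately complex frame, and a complex frame.
    for mask in importance_to_mask([0.0, 0.5, 1.0], num_codebooks=4):
        print(mask, frame_bits(mask, codebook_size=1024))
```

Under this rule, frames with low importance spend fewer bits because fewer codebook indices need to be transmitted.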

Hear Your Face: Face-based voice conversion with F0 estimation

Aug 19, 2024

Jaejun Lee, Yoori Oh, Injune Hwang, Kyogu Lee

Abstract: This paper delves into the emerging field of face-based voice conversion, leveraging the unique relationship between an individual's facial features and their vocal characteristics. We present a novel face-based voice conversion framework that particularly utilizes the average fundamental frequency of the target speaker, derived solely from their facial images. Through extensive analysis, our framework demonstrates superior speech generation quality and the ability to align facial features with voice characteristics, including tracking of the target speaker's fundamental frequency.

* Interspeech 2024

GRAFX: An Open-Source Library for Audio Processing Graphs in PyTorch

Aug 06, 2024

Sungho Lee, Marco Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji

Abstract: We present GRAFX, an open-source library designed for handling audio processing graphs in PyTorch. Along with various library functionalities, we describe technical details on the efficient parallel computation of input graphs, signals, and processor parameters on the GPU. Then, we show an example use in a music mixing scenario, where the parameters of every differentiable processor in a large graph are optimized via gradient descent. The code is available at https://github.com/sh-lee97/grafx.

* Accepted to DAFx 2024 demo

Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings

Jul 29, 2024

Seungyeon Rhyu, Kichang Yang, Sungjun Cho, Jaehyeon Kim, Kyogu Lee, Moontae Lee

Abstract: Music generation introduces challenging complexities to large language models. Symbolic structures of music often include vertical harmonization as well as horizontal counterpoint, urging various adaptations and enhancements for large-scale Transformers. However, existing works share three major drawbacks: 1) their tokenization requires domain-specific annotations, such as bars and beats, that are typically missing in raw MIDI data; 2) the pure impact of enhancing token embedding methods is hardly examined without domain-specific annotations; and 3) existing works to overcome the aforementioned drawbacks, such as MuseNet, lack reproducibility. To tackle such limitations, we develop a MIDI-based music generation framework inspired by MuseNet, empirically studying two structural embeddings that do not rely on domain-specific annotations. We provide various metrics and insights that can guide suitable encoding to deploy. We also verify that multiple embedding configurations can selectively boost certain musical aspects. By providing open-source implementations via HuggingFace, our findings shed light on leveraging large language models toward practical and reproducible music generation.

* 9 pages, 6 figures, 4 tables

Wavespace: A Highly Explorable Wavetable Generator

Jul 29, 2024

Hazounne Lee, Kihong Kim, Sungho Lee, Kyogu Lee

Abstract: Wavetable synthesis generates quasi-periodic waveforms of musical tones by interpolating a list of waveforms called a wavetable. As generative models that utilize latent representations offer various methods in waveform generation for musical applications, studies in wavetable generation with invertible architecture have also arisen recently. While they are promising, it is still challenging to generate wavetables with detailed controls in disentangling factors within the latent representation. In response, we present Wavespace, a novel framework for wavetable generation that empowers users with enhanced parameter controls. Our model allows users to apply pre-defined conditions to the output wavetables. We employ a variational autoencoder and completely factorize its latent space into different waveform styles. We also condition the generator with auxiliary timbral and morphological descriptors. This way, users can create unique wavetables by independently manipulating each latent subspace and descriptor parameter. Our framework is efficient enough for practical use; we prototyped an oscillator plug-in as a proof of concept for real-time integration of Wavespace within digital audio workstations (DAWs).
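
The wavetable interpolation mentioned at the start of the abstract above can be sketched in a few lines. This is a generic illustration of wavetable lookup, not code from Wavespace: the function name `read_wavetable`, the use of linear interpolation in both directions, and the tiny four-sample frames are assumptions chosen to keep the example readable.

```python
from typing import List


def read_wavetable(wavetable: List[List[float]], position: float, phase: float) -> float:
    """Read one output sample from a wavetable.

    wavetable: list of single-cycle waveforms (frames) of equal length.
    position:  value in [0, 1] selecting where to read across the frames.
    phase:     value in [0, 1) selecting where to read within one cycle.
    Linear interpolation is used between adjacent frames and adjacent samples.
    """
    num_frames = len(wavetable)
    frame_pos = min(max(position, 0.0), 1.0) * (num_frames - 1)
    lo = int(frame_pos)
    hi = min(lo + 1, num_frames - 1)
    frame_frac = frame_pos - lo

    def sample(frame: List[float]) -> float:
        n = len(frame)
        idx = (phase % 1.0) * n
        i = int(idx) % n
        j = (i + 1) % n  # wrap around: the waveform is periodic
        t = idx - int(idx)
        return (1.0 - t) * frame[i] + t * frame[j]

    return (1.0 - frame_frac) * sample(wavetable[lo]) + frame_frac * sample(wavetable[hi])


if __name__ == "__main__":
    # Two hypothetical frames: a full-amplitude and a half-amplitude cycle.
    table = [[0.0, 1.0, 0.0, -1.0], [0.0, 0.5, 0.0, -0.5]]
    for pos in (0.0, 0.5, 1.0):
        print(pos, read_wavetable(table, pos, phase=0.25))
```

Sweeping `position` over time morphs the timbre between frames, while advancing `phase` at the desired frequency produces the pitch.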

Differentiable Modal Synthesis for Physical Modeling of Planar String Sound and Motion Simulation

Jul 07, 2024

Jin Woo Lee, Jaehyun Park, Min Jun Choi, Kyogu Lee

Abstract: While significant advancements have been made in music generation and differentiable sound synthesis within machine learning and computer audition, the simulation of instrument vibration guided by physical laws has been underexplored. To address this gap, we introduce a novel model for simulating the spatio-temporal motion of nonlinear strings, integrating modal synthesis and spectral modeling within a neural network framework. Our model leverages physical properties and fundamental frequencies as inputs, outputting string states across time and space that solve the partial differential equation characterizing the nonlinear string. Empirical evaluations demonstrate that the proposed architecture achieves superior accuracy in string motion simulation compared to existing baseline architectures. The code and demo are available online.

Guiding Frame-Level CTC Alignments Using Self-knowledge Distillation

Jun 12, 2024

Eungbeom Kim, Hantae Kim, Kyogu Lee

Abstract: The Transformer encoder with a connectionist temporal classification (CTC) framework is widely used for automatic speech recognition (ASR). However, knowledge distillation (KD) for ASR suffers from disagreement between teacher and student models in frame-level alignment, which ultimately hinders it from improving the student model's performance. To resolve this problem, this paper introduces a self-knowledge distillation (SKD) method that guides the frame-level alignment during training. In contrast to the conventional method using separate teacher and student models, this study introduces a simple and effective method that shares encoder layers and uses the sub-model as the student model. Overall, our approach is effective in improving both resource efficiency and performance. We also conducted an experimental analysis of the spike timings to illustrate that the proposed method improves performance by reducing the alignment disagreement.

* Accepted by Interspeech 2024

Distance Sampling-based Paraphraser Leveraging ChatGPT for Text Data Manipulation

May 01, 2024

Yoori Oh, Yoseob Han, Kyogu Lee

Abstract: There has been growing interest in audio-language retrieval research, where the objective is to establish the correlation between audio and text modalities. However, most audio-text paired datasets lack rich expression in the text data compared to the audio samples. One of the significant challenges facing audio-text datasets is the presence of similar or identical captions despite different audio samples. Therefore, under many-to-one mapping conditions, audio-text datasets lead to poor performance in retrieval tasks. In this paper, we propose a novel approach to tackle the data imbalance problem in audio-language retrieval tasks. To overcome the limitation, we introduce a method that employs a distance sampling-based paraphraser leveraging ChatGPT, utilizing a distance function to generate a controllable distribution of manipulated text data. For a set of sentences with the same context, the distance is used to calculate a degree of manipulation for any two sentences, and ChatGPT's few-shot prompting is performed using a text cluster with a similar distance defined by the Jaccard similarity. Therefore, ChatGPT, when applied to few-shot prompting with text clusters, can adjust the diversity of the manipulated text based on the distance. The proposed approach is shown to significantly enhance performance in audio-text retrieval, outperforming conventional text augmentation techniques.

* Accepted at SIGIR 2024 short paper track
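
The abstract above defines its distance through the Jaccard similarity. A minimal sketch of that quantity for two captions follows; the whitespace tokenization, lowercasing, and function names are assumptions for illustration, and how the paper tokenizes text or groups sentence pairs into clusters is not specified here.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over lowercase word sets.

    Tokenizing on whitespace is an assumption made for this sketch.
    """
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty captions are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def jaccard_distance(a: str, b: str) -> float:
    """Distance in [0, 1]: 0 for identical word sets, 1 for disjoint ones."""
    return 1.0 - jaccard_similarity(a, b)


if __name__ == "__main__":
    # Hypothetical captions, not taken from any dataset.
    original = "a dog barks"
    paraphrase = "a dog barks loudly"
    print(jaccard_distance(original, paraphrase))
```

A small distance indicates a light rewording, while a distance near 1 indicates that the paraphrase shares almost no words with the original.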

Multidimensional Interpolants

Apr 22, 2024

Dohoon Lee, Kyogu Lee

Abstract: In the domain of differential equation-based generative modeling, conventional approaches often rely on scalar values as interpolation coefficients during both training and inference phases. In this work, we introduce, for the first time, a multidimensional interpolant that extends these coefficients into multiple dimensions, leveraging the stochastic interpolant framework. Additionally, we propose a novel path optimization problem tailored to adaptively determine multidimensional inference trajectories, with a predetermined differential equation solver and a fixed number of function evaluations. Our solution involves simulation dynamics coupled with adversarial training to optimize the inference path. Notably, employing a multidimensional interpolant during training improves the model's inference performance, even in the absence of path optimization. When the adaptive, multidimensional path derived from our optimization process is employed, it yields further performance gains, even with fixed solver configurations. The introduction of multidimensional interpolants not only enhances the efficacy of models but also opens up a new domain for exploration in training and inference methodologies, emphasizing the potential of multidimensional paths as an untapped frontier.

* 9 pages
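
To make the contrast between scalar and multidimensional interpolation coefficients in the abstract above concrete, here is a small sketch. It assumes a simple linear interpolant between two points; the function name `interpolate` and the linear form are illustrative assumptions and may differ from the interpolants used in the paper.

```python
from typing import List, Sequence, Union


def interpolate(
    x0: Sequence[float],
    x1: Sequence[float],
    t: Union[float, Sequence[float]],
) -> List[float]:
    """Interpolate between x0 and x1 as (1 - t) * x0 + t * x1.

    If t is a single number, every dimension shares the same coefficient
    (the conventional scalar case). If t is a sequence, each dimension gets
    its own coefficient (a multidimensional coefficient, under the linear
    form assumed in this sketch).
    """
    coeffs = [float(t)] * len(x0) if isinstance(t, (int, float)) else list(t)
    if not (len(x0) == len(x1) == len(coeffs)):
        raise ValueError("x0, x1, and t must have matching lengths")
    return [(1.0 - c) * a + c * b for a, b, c in zip(x0, x1, coeffs)]


if __name__ == "__main__":
    x0, x1 = [0.0, 0.0], [2.0, 4.0]
    print(interpolate(x0, x1, 0.5))         # same coefficient in both dimensions
    print(interpolate(x0, x1, [0.0, 1.0]))  # a different coefficient per dimension
```

With a per-dimension coefficient, the path from `x0` to `x1` no longer has to be a straight line traversed at a uniform rate in every dimension.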

Removing Speaker Information from Speech Representation using Variable-Length Soft Pooling

Apr 01, 2024

Injune Hwang, Kyogu Lee

Abstract: Recently, there have been efforts to encode the linguistic information of speech using a self-supervised framework for speech synthesis. However, predicting representations from surrounding representations can inadvertently entangle speaker information in the speech representation. This paper aims to remove speaker information by exploiting the structured nature of speech, composed of discrete units like phonemes with clear boundaries. A neural network predicts these boundaries, enabling variable-length pooling for event-based representation extraction instead of fixed-rate methods. The boundary predictor outputs a boundary probability between 0 and 1, making the pooling soft. The model is trained to minimize the difference with the pooled representation of the data augmented by time-stretch and pitch-shift. To confirm that the learned representation includes content information but is independent of speaker information, the model was evaluated with Libri-light's phonetic ABX task and SUPERB's speaker identification task.
