Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Choonghyeon Lee

When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds

May 30, 2025

Minsu Kang, Seolhee Lee, Choonghyeon Lee, Namhyun Cho

Abstract:Human to non-human voice conversion (H2NH-VC) transforms human speech into animal or designed vocalizations. Unlike prior studies focused on dog-sounds and 16 or 22.05kHz audio transformation, this work addresses a broader range of non-speech sounds, including natural sounds (lion-roars, birdsongs) and designed voice (synthetic growls). To accomodate generation of diverse non-speech sounds and 44.1kHz high-quality audio transformation, we introduce a preprocessing pipeline and an improved CVAE-based H2NH-VC model, both optimized for human and non-human voices. Experimental results showed that the proposed method outperformed baselines in quality, naturalness, and similarity MOS, achieving effective voice conversion across diverse non-human timbres. Demo samples are available at https://nc-ai.github.io/speech/publications/nonhuman-vc/

* INTERSPEECH 2025 accepted

Via

Access Paper or Ask Questions

U-Singer: Multi-Singer Singing Voice Synthesizer that Controls Emotional Intensity

Mar 02, 2022

Sungjae Kim, Kihyun Na, Choonghyeon Lee, Jehyeon An, Injung Kim

Figure 1 for U-Singer: Multi-Singer Singing Voice Synthesizer that Controls Emotional Intensity

Figure 2 for U-Singer: Multi-Singer Singing Voice Synthesizer that Controls Emotional Intensity

Figure 3 for U-Singer: Multi-Singer Singing Voice Synthesizer that Controls Emotional Intensity

Figure 4 for U-Singer: Multi-Singer Singing Voice Synthesizer that Controls Emotional Intensity

Abstract:We propose U-Singer, the first multi-singer emotional singing voice synthesizer that expresses various levels of emotional intensity. During synthesizing singing voices according to the lyrics, pitch, and duration of the music score, U-Singer reflects singer characteristics and emotional intensity by adding variances in pitch, energy, and phoneme duration according to singer ID and emotional intensity. Representing all attributes by conditional residual embeddings in a single unified embedding space, U-Singer controls mutually correlated style attributes, minimizing interference. Additionally, we apply emotion embedding interpolation and extrapolation techniques that lead the model to learn a linear embedding space and allow the model to express emotional intensity levels not included in the training data. In experiments, U-Singer synthesized high-fidelity singing voices reflecting the singer ID and emotional intensity. The visualization of the unified embedding space exhibits that U-singer estimates the correct variations in pitch and energy highly correlated with the singer ID and emotional intensity level. The audio samples are presented at https://u-singer.github.io.

* 21 pages, 11 figures

Via

Access Paper or Ask Questions