
Yusuke Ijima

Lightweight Zero-shot Text-to-Speech with Mixture of Adapters

Jul 01, 2024

Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis

Feb 11, 2024

What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis

Jan 31, 2024

Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters

Jan 10, 2024

StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models

Nov 28, 2023

SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge?

Jun 14, 2023

Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model

Apr 24, 2023

SIMD-size aware weight regularization for fast neural vocoding on CPU

Nov 02, 2022

Model architectures to extrapolate emotional expressions in DNN-based text-to-speech

Feb 20, 2021

V2S attack: building DNN-based voice conversion from automatic speaker verification

Aug 05, 2019