Jonathan Shen

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Feb 16, 2018

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan(+3 more)

Abstract: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of $4.53$, comparable to a MOS of $4.58$ for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and $F_0$ features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.

* Accepted to ICASSP 2018
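
For readers unfamiliar with the mel scale mentioned in the abstract, the following is a minimal sketch of how frequencies in Hz can be mapped to mels and how band edges can be spaced evenly on that scale. It assumes the common HTK-style formula $m = 2595 \log_{10}(1 + f/700)$; the abstract does not say which mel variant, frequency range, or number of bands Tacotron 2 uses, so the function names and the values in the usage note are illustrative only.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min, f_max, n_bands):
    """Return n_bands + 2 edge frequencies (Hz), evenly spaced on the mel scale.

    Consecutive triples of edges would define the triangular filters of a
    mel filterbank; building the filters themselves is omitted here.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]
```

Calling `mel_band_edges(0.0, 8000.0, 4)` (hypothetical values) returns six edge frequencies whose spacing in Hz widens toward the top of the range, which is the property that makes the representation compact at high frequencies.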

In Teacher We Trust: Learning Compressed Models for Pedestrian Detection

Dec 01, 2016

Jonathan Shen, Noranart Vesdapunt, Vishnu N. Boddeti, Kris M. Kitani

Abstract: Deep convolutional neural networks continue to advance the state-of-the-art in many domains as they grow bigger and more complex. It has been observed that many of the parameters of a large network are redundant, allowing for the possibility of learning a smaller network that mimics the outputs of the large network through a process called Knowledge Distillation. We show, however, that standard Knowledge Distillation is not effective for learning small models for the task of pedestrian detection. To improve this process, we introduce a higher-dimensional hint layer to increase information flow. We also estimate the variance in the outputs of the large network and propose a loss function to incorporate this uncertainty. Finally, we attempt to boost the complexity of the small network without increasing its size by using as input hand-designed features that have been demonstrated to be effective for pedestrian detection. We succeed in training a model that contains $400\times$ fewer parameters than the large network while outperforming AlexNet on the Caltech Pedestrian Dataset.
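
As background for the abstract above, here is a minimal sketch of the standard Knowledge Distillation objective that the paper takes as its starting point: the student is trained to match the teacher's temperature-softened output distribution. This is the generic soft-target cross-entropy, not the hint-layer or variance-aware loss the paper proposes; the function names and the default temperature are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions.

    The teacher's probabilities act as soft targets. The loss is smallest
    when the student's softened distribution equals the teacher's.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
```

In practice this term is usually combined with an ordinary loss on the ground-truth labels; how the two are weighted is a design choice that the abstract does not specify.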
