
Kyle Kastner


NEUROSPIN, PARIETAL

Understanding Shared Speech-Text Representations

Apr 27, 2023
Gary Wang, Kyle Kastner, Ankur Bapna, Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang

Recently, a number of approaches to train speech models by incorporating text into end-to-end models have been developed, with Maestro advancing state-of-the-art automatic speech recognition (ASR) and Speech Translation (ST) performance. In this paper, we expand our understanding of the resulting shared speech-text representations with two types of analyses. First we examine the limits of speech-free domain adaptation, finding that a corpus-specific duration model for speech-text alignment is the most important component for learning a shared speech-text representation. Second, we inspect the similarities between activations of unimodal (speech or text) encoders as compared to the activations of a shared encoder. We find that the shared encoder learns a more compact and overlapping speech-text representation than the uni-modal encoders. We hypothesize that this partially explains the effectiveness of the Maestro shared speech-text representations.

* Accepted at ICASSP 2023, camera ready 
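
A rough illustration of the second analysis described above is to compute a similarity score between paired activation matrices, for example with linear centered kernel alignment (CKA). The sketch below (Python/NumPy) only shows the shape of such a comparison; the metric, dimensions, and data here are assumptions, not the paper's actual procedure.

import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (n_examples, dim)."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
# Hypothetical activations for paired speech and text inputs, e.g. taken from
# a unimodal encoder versus a shared encoder (sizes are made up).
speech_act = rng.normal(size=(512, 256))
text_act = 0.8 * speech_act + 0.2 * rng.normal(size=(512, 256))
print("speech/text activation similarity:", linear_cka(speech_act, text_act))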

R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS

Jun 30, 2022
Kyle Kastner, Aaron Courville

This paper introduces R-MelNet, a two-part autoregressive architecture with a frontend based on the first tier of MelNet and a backend WaveRNN-style audio decoder for neural text-to-speech synthesis. Taking as input a mixed sequence of characters and phonemes, with an optional audio priming sequence, this model produces low-resolution mel-spectral features which are interpolated and used by a WaveRNN decoder to produce an audio waveform. Coupled with half precision training, R-MelNet uses under 11 gigabytes of GPU memory on a single commodity GPU (NVIDIA 2080Ti). We detail a number of critical implementation details for stable half precision training, including an approximate, numerically stable mixture of logistics attention. Using a stochastic, multi-sample per step inference scheme, the resulting model generates highly varied audio, while enabling text and audio based controls to modify output waveforms. Qualitative and quantitative evaluations of an R-MelNet system trained on a single speaker TTS dataset demonstrate the effectiveness of our approach.
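
As a small aside, the interpolation step mentioned above (low-resolution mel features upsampled before waveform decoding) can be pictured in a few lines of PyTorch. The shapes and scale factors below are illustrative assumptions, not the settings used in R-MelNet.

import torch
import torch.nn.functional as F

# Hypothetical low-resolution mel output from the frontend: (batch, 1, frames, mel_bins).
low_res_mel = torch.randn(1, 1, 200, 40)

# Bilinear interpolation to a higher time/frequency resolution before
# conditioning a WaveRNN-style decoder on the result.
high_res_mel = F.interpolate(low_res_mel, scale_factor=(2, 2),
                             mode="bilinear", align_corners=False)
print(low_res_mel.shape, "->", high_res_mel.shape)  # (1, 1, 200, 40) -> (1, 1, 400, 80)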


MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling

Dec 17, 2021
Yusong Wu, Ethan Manilow, Yi Deng, Rigel Swavely, Kyle Kastner, Tim Cooijmans, Aaron Courville, Cheng-Zhi Anna Huang, Jesse Engel

Musical expression requires control of both what notes are played, and how they are performed. Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism. Black-box neural audio synthesis and concatenative samplers can produce realistic audio, but have few mechanisms for control. In this work, we introduce MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control. Starting from interpretable Differentiable Digital Signal Processing (DDSP) synthesis parameters, we infer musical notes and high-level properties of their expressive performance (such as timbre, vibrato, dynamics, and articulation). This creates a 3-level hierarchy (notes, performance, synthesis) that affords individuals the option to intervene at each level, or utilize trained priors (performance given notes, synthesis given performance) for creative assistance. Through quantitative experiments and listening tests, we demonstrate that this hierarchy can reconstruct high-fidelity audio, accurately predict performance attributes for a note sequence, independently manipulate the attributes of a given performance, and as a complete system, generate realistic audio from a novel note sequence. By utilizing an interpretable hierarchy, with multiple levels of granularity, MIDI-DDSP opens the door to assistive tools to empower individuals across a diverse range of musical experience.
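
The three-level hierarchy (notes, then performance, then synthesis) can be sketched structurally in a few lines of Python. Every function below is a toy placeholder standing in for a learned module; this is not the MIDI-DDSP implementation, only an illustration of where a user could intervene.

from dataclasses import dataclass

@dataclass
class Note:
    pitch: int       # MIDI pitch
    duration: float  # seconds

def performance_prior(notes):
    # Stand-in for a learned model p(performance | notes); returns fixed values.
    return [{"vibrato": 0.1, "dynamics": 0.7, "note": n} for n in notes]

def synthesis_prior(performance):
    # Stand-in for p(synthesis parameters | performance): map dynamics to amplitude.
    return [{"f0_midi": p["note"].pitch, "amplitude": p["dynamics"]} for p in performance]

def synthesize(params):
    # A DDSP synthesizer would render audio from these parameters.
    return [f"render(f0={p['f0_midi']}, amp={p['amplitude']:.2f})" for p in params]

notes = [Note(60, 0.5), Note(64, 0.5), Note(67, 1.0)]
performance = performance_prior(notes)
performance[1]["dynamics"] = 1.0  # user intervention: play the second note louder
print(synthesize(synthesis_prior(performance)))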


Planning in Dynamic Environments with Conditional Autoregressive Models

Nov 25, 2018
Johanna Hansen, Kyle Kastner, Aaron Courville, Gregory Dudek

We demonstrate the use of conditional autoregressive generative models (van den Oord et al., 2016a) over a discrete latent space (van den Oord et al., 2017b) for forward planning with MCTS. In order to test this method, we introduce a new environment featuring varying difficulty levels, along with moving goals and obstacles. The combination of high-quality frame generation and classical planning approaches nearly matches true environment performance for our task, demonstrating the usefulness of this method for model-based planning in dynamic environments.

* 6 pages, 1 figure, in Proceedings of the Prediction and Generative Modeling in Reinforcement Learning Workshop at the International Conference on Machine Learning (ICML) in 2018 
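
To make the planning loop concrete, here is a deliberately simplified Python sketch: a toy deterministic forward model and a rollout-based action search stand in for the learned conditional autoregressive model and the MCTS procedure used in the paper.

import random

ACTIONS = ["left", "right", "up", "down"]

def forward_model(state, action):
    # Placeholder for the learned next-frame/latent predictor.
    x, y = state
    dx, dy = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}[action]
    return (x + dx, y + dy)

def reward(state, goal):
    # Negative Manhattan distance to the (possibly moving) goal.
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))

def plan(state, goal, horizon=5, n_rollouts=64):
    best_action, best_return = None, float("-inf")
    for first_action in ACTIONS:
        total = 0.0
        for _ in range(n_rollouts):
            s = forward_model(state, first_action)
            ret = reward(s, goal)
            for _ in range(horizon - 1):
                s = forward_model(s, random.choice(ACTIONS))
                ret += reward(s, goal)
            total += ret
        if total > best_return:
            best_action, best_return = first_action, total
    return best_action

print(plan(state=(0, 0), goal=(3, 1)))  # should pick an action that moves toward the goal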

Representation Mixing for TTS Synthesis

Nov 24, 2018
Kyle Kastner, João Felipe Santos, Yoshua Bengio, Aaron Courville

Recent character and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character or phoneme input can create serious limitations for practical deployment, as direct control of pronunciation is crucial in certain cases. We demonstrate a simple method for combining multiple types of linguistic information in a single encoder, named representation mixing, enabling flexible choice between character, phoneme, or mixed representations during inference. Experiments and user studies on a public audiobook corpus show the efficacy of our approach.

* 5 pages, 3 figures 
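
One plausible way to realize the mixing described above is to give each input position an embedding drawn from either a character table or a phoneme table, plus an embedding marking which representation was used. The PyTorch sketch below follows that idea; vocabulary sizes and the exact combination rule are assumptions rather than the paper's specification.

import torch
import torch.nn as nn

class MixedEmbedding(nn.Module):
    def __init__(self, n_chars=64, n_phones=128, dim=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        self.phone_emb = nn.Embedding(n_phones, dim)
        self.type_emb = nn.Embedding(2, dim)  # 0 = character, 1 = phoneme

    def forward(self, tokens, token_type):
        # tokens: (batch, time) symbol ids; token_type: (batch, time) in {0, 1}.
        char_vec = self.char_emb(tokens.clamp(max=self.char_emb.num_embeddings - 1))
        phone_vec = self.phone_emb(tokens.clamp(max=self.phone_emb.num_embeddings - 1))
        is_phone = token_type.unsqueeze(-1).float()
        mixed = (1 - is_phone) * char_vec + is_phone * phone_vec
        return mixed + self.type_emb(token_type)

emb = MixedEmbedding()
tokens = torch.tensor([[5, 12, 99, 3]])    # ids into whichever table applies
token_type = torch.tensor([[0, 0, 1, 0]])  # the third symbol is a phoneme
print(emb(tokens, token_type).shape)       # torch.Size([1, 4, 256])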

Harmonic Recomposition using Conditional Autoregressive Modeling

Nov 18, 2018
Kyle Kastner, Rithesh Kumar, Tim Cooijmans, Aaron Courville

We demonstrate a conditional autoregressive pipeline for efficient music recomposition, based on methods presented in van den Oord et al. (2017). Recomposition (Casal & Casey, 2010) focuses on reworking existing musical pieces, adhering to structure at a high level while also re-imagining other aspects of the work. This can involve reuse of pre-existing themes or parts of the original piece, while also requiring the flexibility to generate new content at different levels of granularity. Applying the aforementioned modeling pipeline to recomposition, we show diverse and structured generation conditioned on chord sequence annotations.

* 3 pages, 2 figures. In Proceedings of The Joint Workshop on Machine Learning for Music, ICML 2018 
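
The conditioning idea (generation steered by chord sequence annotations) can be sketched with a small autoregressive model in PyTorch. The architecture and vocabulary sizes below are illustrative placeholders, not the pipeline from the paper.

import torch
import torch.nn as nn

class ChordConditionedLM(nn.Module):
    def __init__(self, n_notes=128, n_chords=24, dim=128):
        super().__init__()
        self.note_emb = nn.Embedding(n_notes, dim)
        self.chord_emb = nn.Embedding(n_chords, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_notes)

    def forward(self, notes, chords):
        # notes, chords: (batch, time) ids; the chord track is the conditioning signal.
        h = self.note_emb(notes) + self.chord_emb(chords)
        h, _ = self.rnn(h)
        return self.out(h)  # logits over the next note at each step

model = ChordConditionedLM()
notes = torch.randint(0, 128, (2, 16))
chords = torch.randint(0, 24, (2, 16))
print(model(notes, chords).shape)  # torch.Size([2, 16, 128])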

Blindfold Baselines for Embodied QA

Nov 12, 2018
Ankesh Anand, Eugene Belilovsky, Kyle Kastner, Hugo Larochelle, Aaron Courville

We explore blindfold (question-only) baselines for Embodied Question Answering. The EmbodiedQA task requires an agent to answer a question by intelligently navigating in a simulated environment, gathering necessary visual information only through first-person vision before finally answering. Consequently, a blindfold baseline which ignores the environment and visual information is a degenerate solution, yet we show through our experiments on the EQAv1 dataset that a simple question-only baseline achieves state-of-the-art results on the EmbodiedQA task in all cases except when the agent is spawned extremely close to the object.

* NIPS 2018 Visually-Grounded Interaction and Language (ViGIL) Workshop 
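
A question-only baseline of the kind described above is easy to picture: encode the question tokens with a small recurrent network and classify the answer, ignoring the environment entirely. The PyTorch sketch below uses made-up sizes and is not the exact baseline from the paper.

import torch
import torch.nn as nn

class QuestionOnlyBaseline(nn.Module):
    def __init__(self, vocab_size=1000, n_answers=50, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.classifier = nn.Linear(dim, n_answers)

    def forward(self, question_tokens):
        # question_tokens: (batch, time) word ids; no visual or navigation input is used.
        h, _ = self.rnn(self.emb(question_tokens))
        return self.classifier(h[:, -1])  # logits over answer classes

model = QuestionOnlyBaseline()
question = torch.randint(0, 1000, (4, 10))
print(model(question).shape)  # torch.Size([4, 50])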

Learning Distributed Representations from Reviews for Collaborative Filtering

Jun 18, 2018
Amjad Almahairi, Kyle Kastner, Kyunghyun Cho, Aaron Courville

Recent work has shown that collaborative filter-based recommender systems can be improved by incorporating side information, such as natural language reviews, as a way of regularizing the derived product representations. Motivated by the success of this approach, we introduce two different models of reviews and study their effect on collaborative filtering performance. While the previous state-of-the-art approach is based on a latent Dirichlet allocation (LDA) model of reviews, the models we explore are neural network based: a bag-of-words product-of-experts model and a recurrent neural network. We demonstrate that the increased flexibility offered by the product-of-experts model allowed it to achieve state-of-the-art performance on the Amazon review dataset, outperforming the LDA-based approach. However, interestingly, the greater modeling power offered by the recurrent neural network appears to undermine the model's ability to act as a regularizer of the product representations.

* Published at RecSys 2015 
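
The regularization idea above (a shared item representation feeding both a rating predictor and a review model) can be sketched as a two-term objective. The PyTorch snippet below uses a simple bag-of-words review term with made-up sizes and weighting; it illustrates the setup rather than either of the paper's models.

import torch
import torch.nn as nn
import torch.nn.functional as F

n_users, n_items, vocab, dim = 100, 200, 500, 32
user_emb = nn.Embedding(n_users, dim)
item_emb = nn.Embedding(n_items, dim)   # shared between both objectives
review_decoder = nn.Linear(dim, vocab)  # bag-of-words review model

users = torch.randint(0, n_users, (8,))
items = torch.randint(0, n_items, (8,))
ratings = torch.rand(8) * 5
review_bow = torch.randint(0, 2, (8, vocab)).float()  # which words appear in each review

pred_rating = (user_emb(users) * item_emb(items)).sum(-1)
rating_loss = F.mse_loss(pred_rating, ratings)
review_loss = F.binary_cross_entropy_with_logits(review_decoder(item_emb(items)), review_bow)
loss = rating_loss + 0.1 * review_loss  # the review term regularizes the item embeddings
print(float(loss))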

Learning to Discover Sparse Graphical Models

Aug 03, 2017
Eugene Belilovsky, Kyle Kastner, Gaël Varoquaux, Matthew Blaschko

We consider structure discovery of undirected graphical models from observational data. Inferring likely structures from few examples is a complex task often requiring the formulation of priors and sophisticated inference procedures. Popular methods rely on estimating a penalized maximum likelihood of the precision matrix. However, in these approaches structure recovery is an indirect consequence of the data-fit term, the penalty can be difficult to adapt for domain-specific knowledge, and the inference is computationally demanding. By contrast, it may be easier to generate training samples of data that arise from graphs with the desired structure properties. We propose here to leverage this latter source of information as training data to learn a function, parametrized by a neural network, that maps empirical covariance matrices to estimated graph structures. Learning this function brings two benefits: it implicitly models the desired structure or sparsity properties to form suitable priors, and it can be tailored to the specific problem of edge structure discovery, rather than maximizing data likelihood. Applying this framework, we find our learnable graph-discovery method trained on synthetic data generalizes well: identifying relevant edges in both synthetic and real data, completely unknown at training time. We find that on genetics, brain imaging, and simulation data we obtain performance generally superior to analytical methods.
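
The training recipe described above (generate graphs with the desired structure, sample data from them, and learn a mapping from empirical covariances to edges) can be sketched end to end in PyTorch. Both the graph generator and the network below are deliberately tiny stand-ins, not the paper's architecture.

import torch
import torch.nn as nn

def sample_task(p=10, n=50, edge_prob=0.15):
    # Random sparse precision matrix, made diagonally dominant so it is positive definite.
    mask = (torch.rand(p, p) < edge_prob).float().triu(1)
    offdiag = 0.4 * (mask + mask.T)
    prec = offdiag + torch.eye(p) * (offdiag.sum(1).max() + 1.0)
    cov = torch.linalg.inv(prec)
    x = torch.randn(n, p) @ torch.linalg.cholesky(cov).T  # Gaussian samples with this covariance
    emp_cov = (x.T @ x) / n
    edges = ((mask + mask.T) > 0).float()
    return emp_cov, edges

p = 10
net = nn.Sequential(nn.Flatten(), nn.Linear(p * p, 256), nn.ReLU(), nn.Linear(256, p * p))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(200):
    emp_cov, edges = sample_task(p)
    logits = net(emp_cov.unsqueeze(0)).view(p, p)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, edges)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final training loss:", float(loss))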


ReSeg: A Recurrent Neural Network-based Model for Semantic Segmentation

May 24, 2016
Francesco Visin, Marco Ciccone, Adriana Romero, Kyle Kastner, Kyunghyun Cho, Yoshua Bengio, Matteo Matteucci, Aaron Courville

We propose a structured prediction architecture, which exploits the local generic features extracted by Convolutional Neural Networks and the capacity of Recurrent Neural Networks (RNN) to retrieve distant dependencies. The proposed architecture, called ReSeg, is based on the recently introduced ReNet model for image classification. We modify and extend it to perform the more challenging task of semantic segmentation. Each ReNet layer is composed of four RNNs that sweep the image horizontally and vertically in both directions, encoding patches or activations, and providing relevant global information. Moreover, ReNet layers are stacked on top of pre-trained convolutional layers, benefiting from generic local features. Upsampling layers follow ReNet layers to recover the original image resolution in the final predictions. The proposed ReSeg architecture is efficient, flexible and suitable for a variety of semantic segmentation tasks. We evaluate ReSeg on several widely-used semantic segmentation datasets: Weizmann Horse, Oxford Flower, and CamVid, achieving state-of-the-art performance. Results show that ReSeg can act as a suitable architecture for semantic segmentation tasks, and may have further applications in other structured prediction problems. The source code and model hyperparameters are available on https://github.com/fvisin/reseg.

* In CVPR Deep Vision Workshop, 2016 
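
A ReNet-style layer of the kind described above can be sketched compactly in PyTorch: bidirectional RNNs sweep the feature map along rows and then along columns, so every position aggregates global context. Channel sizes are illustrative, and this is a simplified stand-in rather than the ReSeg code linked in the abstract.

import torch
import torch.nn as nn

class ReNetLayer(nn.Module):
    def __init__(self, in_ch=64, hidden=32):
        super().__init__()
        self.row_rnn = nn.GRU(in_ch, hidden, batch_first=True, bidirectional=True)
        self.col_rnn = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: (batch, channels, height, width), e.g. pretrained convolutional features.
        b, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)      # sweep each row left<->right
        rows, _ = self.row_rnn(rows)
        rows = rows.reshape(b, h, w, -1)
        cols = rows.permute(0, 2, 1, 3).reshape(b * w, h, -1)  # sweep each column up<->down
        cols, _ = self.col_rnn(cols)
        return cols.reshape(b, w, h, -1).permute(0, 3, 2, 1)   # back to (batch, channels, h, w)

layer = ReNetLayer()
feats = torch.randn(2, 64, 16, 16)
print(layer(feats).shape)  # torch.Size([2, 64, 16, 16])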