Douglas Eck

Learning to Groove with Inverse Sequence Transformations

May 14, 2019

Jon Gillick, Adam Roberts, Jesse Engel, Douglas Eck, David Bamman

Abstract: We explore models for translating abstract musical ideas (scores, rhythms) into expressive performances using Seq2Seq and recurrent Variational Information Bottleneck (VIB) models. Though Seq2Seq models usually require painstakingly aligned corpora, we show that it is possible to adapt an approach from the Generative Adversarial Network (GAN) literature (e.g. Pix2Pix (Isola et al., 2017) and Vid2Vid (Wang et al., 2018a)) to sequences, creating large volumes of paired data by performing simple transformations and training generative models to plausibly invert these transformations. Music, and drumming in particular, provides a strong test case for this approach because many common transformations (quantization, removing voices) have clear semantics, and models for learning to invert them have real-world applications. Focusing on the case of drum set players, we create and release a new dataset for this purpose, containing over 13 hours of recordings by professional drummers aligned with fine-grained timing and dynamics information. We also explore some of the creative potential of these models, including demonstrating improvements on state-of-the-art methods for Humanization (instantiating a performance from a musical score).

* ICML 2019, 11 pages, 6 figures
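
The paired-data idea in the abstract above can be illustrated with a small sketch: quantize an expressive performance to obtain a score-like input, and keep the original as the target a model would learn to recover. The `Hit` tuple, the grid size, and the fixed velocity are assumptions made for illustration, not details taken from the paper or its dataset.

```python
from typing import List, Tuple

# Hypothetical representation of a drum hit: (onset time in seconds, MIDI velocity).
Hit = Tuple[float, int]


def quantize(performance: List[Hit], step: float = 0.25, velocity: int = 80) -> List[Hit]:
    """Snap each onset to the nearest grid line and flatten the dynamics."""
    return [(round(t / step) * step, velocity) for t, _ in performance]


def make_pair(performance: List[Hit]) -> Tuple[List[Hit], List[Hit]]:
    """Return (transformed input, original target) for supervised training."""
    return quantize(performance), performance


# A made-up four-hit performance with slightly loose timing and varied dynamics.
performance = [(0.02, 96), (0.27, 54), (0.49, 110), (0.77, 61)]
score, target = make_pair(performance)
print(score)
```

Training a model to map `score` back to `target` is the inversion the abstract describes. Other transformations, such as removing voices, would slot into `make_pair` in the same way.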

A Learned Representation for Scalable Vector Graphics

Apr 04, 2019

Raphael Gontijo Lopes, David Ha, Douglas Eck, Jonathon Shlens

Abstract: Dramatic advances in generative models have resulted in near photographic quality for artificially rendered faces, animals and other objects in the natural world. In spite of such advances, a higher-level understanding of vision and imagery does not arise from exhaustively modeling an object, but instead from identifying higher-level attributes that best summarize the aspects of an object. In this work we attempt to model the drawing process of fonts by building sequential generative models of vector graphics. This model has the benefit of providing a scale-invariant representation for imagery whose latent representation may be systematically manipulated and exploited to perform style propagation. We demonstrate these results on a large dataset of fonts and highlight how such a model captures the statistical dependencies and richness of this dataset. We envision that our model can find use as a tool for graphic designers to facilitate font design.

Counterpoint by Convolution

Mar 18, 2019

Cheng-Zhi Anna Huang, Tim Cooijmans, Adam Roberts, Aaron Courville, Douglas Eck

Abstract: Machine learning models of music typically break up the task of composition into a chronological process, composing a piece of music in a single pass from beginning to end. In contrast, human composers write music in a nonlinear fashion, scribbling motifs here and there, often revisiting choices previously made. In order to better approximate this process, we train a convolutional neural network to complete partial musical scores, and explore the use of blocked Gibbs sampling as an analogue to rewriting. Neither the model nor the generative procedure is tied to a particular causal direction of composition. Our model is an instance of orderless NADE (Uria et al., 2014), which allows more direct ancestral sampling. However, we find that Gibbs sampling greatly improves sample quality, which we demonstrate to be due to some conditional distributions being poorly modeled. Moreover, we show that even the cheap approximate blocked Gibbs procedure from Yao et al. (2014) yields better samples than ancestral sampling, based on both log-likelihood and human evaluation.

* Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017
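
As a rough illustration of blocked Gibbs sampling as "rewriting", the sketch below repeatedly erases a random block of positions in a sequence and resamples each erased position given the remaining context. Everything here is an assumption for illustration: the flat list of MIDI pitches, the `toy_conditional` stand-in (the paper uses a trained convolutional network), and the choice to resample masked positions independently of one another.

```python
import random
from typing import Callable, List, Optional

Context = List[Optional[int]]


def blocked_gibbs(
    score: List[int],
    resample: Callable[[Context, int, random.Random], int],
    steps: int,
    block_size: int,
    rng: random.Random,
) -> List[int]:
    """Rewrite `score` by masking random blocks and resampling them."""
    score = list(score)  # work on a copy so the caller's list is untouched
    for _ in range(steps):
        block = set(rng.sample(range(len(score)), block_size))
        # Positions in the block are hidden from the model (None).
        masked: Context = [None if i in block else v for i, v in enumerate(score)]
        for i in sorted(block):
            # Each masked position is drawn given only the unmasked context.
            score[i] = resample(masked, i, rng)
    return score


def toy_conditional(context: Context, i: int, rng: random.Random) -> int:
    """Hypothetical stand-in for a learned model: stay near the mean visible pitch."""
    known = [v for v in context if v is not None]
    center = round(sum(known) / len(known)) if known else 60
    return center + rng.choice([-2, -1, 0, 1, 2])


original = [60, 62, 64, 65, 67, 69, 71, 72]
rewritten = blocked_gibbs(original, toy_conditional, steps=10, block_size=3,
                          rng=random.Random(0))
print(rewritten)
```

Because every step conditions only on unmasked positions, the procedure has no built-in left-to-right direction, which is the property the abstract emphasizes.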

Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

Oct 30, 2018

Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, Douglas Eck

Abstract: Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms to ~100 s), a process we call Wave2Midi2Wave. This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.

* Examples available at https://goo.gl/magenta/maestro-examples

Music Transformer

Oct 10, 2018

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, Douglas Eck

Abstract: Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to the reuse of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity is quadratic in the sequence length. We propose an algorithm that reduces the intermediate memory requirements to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long (thousands of steps) compositions with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-competition, and obtain state-of-the-art results on the latter.

* Rewrote many sections to clarify the work, and extended relative attention to the local case. Previous title is "An Improved Relative Self-Attention Mechanism for Transformer with Application to Music Generation"
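
The abstract does not spell out the memory-saving algorithm. One way to avoid building a separate relative embedding for every (query, key) pair is a pad, reshape, and slice step, often called "skewing", applied to the product of the queries with a single table of relative embeddings. The sketch below illustrates that idea on plain Python lists. The function name, the column convention (last column means distance 0), and the toy matrix are assumptions for illustration, not the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]


def skew(qe: Matrix) -> Matrix:
    """Re-index an L x L matrix from (query, relative distance) to (query, key).

    Input column r is assumed to hold the logit for relative distance
    r - (L - 1), so the last column is distance 0 (a position attending to itself).
    In the output, entry [i][j] with j <= i holds the logit for distance j - i.
    Entries with j > i are leftovers and would be removed by a causal mask.
    """
    L = len(qe)
    # 1. Pad one dummy column on the left of every row, then flatten row-major.
    flat = [x for row in qe for x in [0.0] + list(row)]
    # 2. Reinterpret the L x (L + 1) buffer as (L + 1) x L.
    reshaped = [flat[a * L:(a + 1) * L] for a in range(L + 1)]
    # 3. Drop the first row.
    return reshaped[1:]


qe = [[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0],
      [7.0, 8.0, 9.0]]
s = skew(qe)
print(s)
```

The point of the trick is that only the L x L matrix is ever stored. No L x L x D tensor of per-pair embeddings is materialized.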

Learning via social awareness: Improving a deep generative sketching model with facial feedback

Aug 27, 2018

Natasha Jaques, Jennifer McCleary, Jesse Engel, David Ha, Fred Bertsch, Rosalind Picard, Douglas Eck

Abstract: In the quest towards general artificial intelligence (AI), researchers have explored developing loss functions that act as intrinsic motivators in the absence of external rewards. This paper argues that such research has overlooked an important and useful intrinsic motivator: social interaction. We posit that making an AI agent aware of implicit social feedback from humans can allow for faster learning of more generalizable and useful representations, and could potentially impact AI safety. We collect social feedback in the form of facial expression reactions to samples from Sketch RNN, an LSTM-based variational autoencoder (VAE) designed to produce sketch drawings. We use a Latent Constraints GAN (LC-GAN) to learn from the facial feedback of a small group of viewers, by optimizing the model to produce sketches that it predicts will lead to more positive facial expressions. We show in multiple independent evaluations that the model trained with facial feedback produced sketches that are more highly rated, and induce significantly more positive facial expressions. Thus, we establish that implicit social feedback can improve the output of a deep learning model.

This Time with Feeling: Learning Expressive Musical Performance

Aug 10, 2018

Sageev Oore, Ian Simon, Sander Dieleman, Douglas Eck, Karen Simonyan

Abstract: Music generation has generally been focused on either creating scores or interpreting them. We discuss differences between these two problems and propose that, in fact, it may be valuable to work in the space of direct performance generation: jointly predicting the notes and also their expressive timing and dynamics. We consider the significance and qualities of the data set needed for this. Having identified both a problem domain and characteristics of an appropriate data set, we show an LSTM-based recurrent network model that subjectively performs quite well on this task. Critically, we provide generated examples. We also include feedback from professional composers and musicians about some of these examples.

* Includes links to audio samples

A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music

Jul 30, 2018

Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, Douglas Eck

Abstract: The Variational Autoencoder (VAE) has proven to be an effective model for producing semantically meaningful latent representations for natural data. However, it has thus far seen limited application to sequential data, and, as we demonstrate, existing recurrent VAE models have difficulty modeling sequences with long-term structure. To address this issue, we propose the use of a hierarchical decoder, which first outputs embeddings for subsequences of the input and then uses these embeddings to generate each subsequence independently. This structure encourages the model to utilize its latent code, thereby avoiding the "posterior collapse" problem which remains an issue for recurrent VAEs. We apply this architecture to modeling sequences of musical notes and find that it exhibits dramatically better sampling, interpolation, and reconstruction performance than a "flat" baseline model. An implementation of our "MusicVAE" is available online at http://g.co/magenta/musicvae-code.

* ICML 2018
* ICML Camera Ready Version

Onsets and Frames: Dual-Objective Piano Transcription

Jun 05, 2018

Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, Douglas Eck

Abstract: We advance the state of the art in polyphonic piano music transcription by using a deep convolutional and recurrent neural network which is trained to jointly predict onsets and frames. Our model predicts pitch onset events and then uses those predictions to condition framewise pitch predictions. During inference, we restrict the predictions from the framewise detector by not allowing a new note to start unless the onset detector also agrees that an onset for that pitch is present in the frame. We focus on improving onsets and offsets together instead of either in isolation as we believe this correlates better with human musical perception. Our approach results in a relative improvement of over 100% in note F1 score (with offsets) on the MAPS dataset. Furthermore, we extend the model to predict relative velocities of normalized audio which results in more natural-sounding transcriptions.

* Examples available at https://goo.gl/magenta/onsets-frames-examples
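
The inference-time restriction described in the abstract above can be sketched for a single pitch as follows. The inputs are assumed to be already-thresholded boolean predictions, one per frame. The function name and that representation are hypothetical and chosen only to show the gating rule.

```python
from typing import List


def gate_frames(frames: List[bool], onsets: List[bool]) -> List[bool]:
    """Allow a note to start only where the onset detector also fires.

    `frames[t]` is the framewise detector's decision for one pitch at frame t,
    and `onsets[t]` is the onset detector's decision for the same pitch.
    """
    gated = []
    active = False  # is a note for this pitch currently sounding?
    for frame_on, onset_on in zip(frames, onsets):
        # A frame stays active if it continues a note, or starts one with an onset.
        active = frame_on and (active or onset_on)
        gated.append(active)
    return gated


# The frame detector fires early (frames 0-1) and again at frame 3,
# but the onset detector only agrees at frame 4.
frames = [True, True, False, True, True, True]
onsets = [False, False, False, False, True, False]
print(gate_frames(frames, onsets))
```

Frame activations that are not preceded by an agreeing onset are dropped, so spurious notes suggested by the frame detector alone never begin.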

Learning a Latent Space of Multitrack Measures

Jun 01, 2018

Ian Simon, Adam Roberts, Colin Raffel, Jesse Engel, Curtis Hawthorne, Douglas Eck

Abstract: Discovering and exploring the underlying structure of multi-instrumental music using learning-based approaches remains an open problem. We extend the recent MusicVAE model to represent multitrack polyphonic measures as vectors in a latent space. Our approach enables several useful operations such as generating plausible measures from scratch, interpolating between measures in a musically meaningful way, and manipulating specific musical attributes. We also introduce chord conditioning, which allows all of these operations to be performed while keeping harmony fixed, and allows chords to be changed while maintaining musical "style". By generating a sequence of measures over a predefined chord progression, our model can produce music with convincing long-term structure. We demonstrate that our latent space model makes it possible to intuitively control and generate musical sequences with rich instrumentation (see https://goo.gl/s2N7dV for generated audio).
