Kevin Murphy

Google Brain

Towards Differentiable Resampling

Apr 24, 2020

Michael Zhu, Kevin Murphy, Rico Jonschkowski

Abstract:Resampling is a key component of sample-based recursive state estimation in particle filters. Recent work explores differentiable particle filters for end-to-end learning. However, resampling remains a challenge in these works, as it is inherently non-differentiable. We address this challenge by replacing traditional resampling with a learned neural network resampler. We present a novel network architecture, the particle transformer, and train it for particle resampling using a likelihood-based loss function over sets of particles. Incorporated into a differentiable particle filter, our model can be end-to-end optimized jointly with the other particle filter components via gradient descent. Our results show that our learned resampler outperforms traditional resampling techniques on synthetic data and in a simulated robot localization task.
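
For context, the traditional resampling step that the abstract calls non-differentiable can be sketched as below. This is a minimal NumPy illustration of standard multinomial resampling, not the paper's particle transformer; the function name and the toy particles are assumptions made for this example. The discrete index draw is the step through which no gradient flows back to the weights.

```python
import numpy as np

def multinomial_resample(particles, weights, rng):
    """Draw N particles with replacement, with probability proportional to weight."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()            # normalize to a probability vector
    n = len(particles)
    # Discrete sampling of indices: this is the non-differentiable operation.
    idx = rng.choice(n, size=n, p=weights)
    new_particles = particles[idx]
    new_weights = np.full(n, 1.0 / n)            # weights reset to uniform after resampling
    return new_particles, new_weights

# Hypothetical toy data: four 1-D particles with unequal weights.
rng = np.random.default_rng(0)
particles = np.array([[0.0], [1.0], [2.0], [3.0]])
weights = np.array([0.1, 0.2, 0.3, 0.4])
resampled, uniform = multinomial_resample(particles, weights, rng)
```

Because the output particles are copies selected by sampled indices, a small change in the weights either leaves the output unchanged or changes it abruptly, which is why a learned, differentiable replacement is of interest.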

Regularized Autoencoders via Relaxed Injective Probability Flow

Feb 20, 2020

Abhishek Kumar, Ben Poole, Kevin Murphy

Abstract:Invertible flow-based generative models are an effective method for learning to generate samples, while allowing for tractable likelihood computation and inference. However, the invertibility requirement restricts models to have the same latent dimensionality as the inputs. This imposes significant architectural, memory, and computational costs, making them more challenging to scale than other classes of generative models such as Variational Autoencoders (VAEs). We propose a generative model based on probability flows that does away with the bijectivity requirement on the model and only assumes injectivity. This also provides another perspective on regularized autoencoders (RAEs), with our final objectives resembling RAEs with specific regularizers that are derived by lower bounding the probability flow objective. We empirically demonstrate the promise of the proposed model, improving over VAEs and AEs in terms of sample quality.

* AISTATS 2020

The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction

Dec 17, 2019

Junwei Liang, Lu Jiang, Kevin Murphy, Ting Yu, Alexander Hauptmann

Abstract:This paper studies the problem of predicting the distribution over multiple possible future paths of people as they move through various visual scenes. We make two main contributions. The first contribution is a new dataset, created in a realistic 3D simulator, which is based on real world trajectory data, and then extrapolated by human annotators to achieve different latent goals. This provides the first benchmark for quantitative evaluation of the models to predict multi-future trajectories. The second contribution is a new model to generate multiple plausible future trajectories, which contains novel designs of using multi-scale location encodings and convolutional RNNs over graphs. We refer to our model as Multiverse. We show that our model achieves the best results on our dataset, as well as on the real-world VIRAT/ActEV dataset (which just contains one possible future). We will release our data, models and code.

* Code, models and dataset are available at: https://next.cs.cmu.edu/multiverse/index.html

Unsupervised Learning of Object Structure and Dynamics from Videos

Jun 19, 2019

Matthias Minderer, Chen Sun, Ruben Villegas, Forrester Cole, Kevin Murphy, Honglak Lee

Abstract:Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning. To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. Future frames are reconstructed from the keypoints and a reference frame. By modeling dynamics in the keypoint coordinate space, we achieve stable learning and avoid compounding of errors in pixel space. Our method improves upon unstructured representations both for pixel-level video prediction and for downstream tasks requiring object-level understanding of motion dynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset, the Human3.6M dataset, and datasets based on continuous control tasks from the DeepMind Control Suite. The spatially structured representation outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition and reward prediction.

Language as an Abstraction for Hierarchical Deep Reinforcement Learning

Jun 18, 2019

Yiding Jiang, Shixiang Gu, Kevin Murphy, Chelsea Finn

Abstract:Solving complex, temporally-extended tasks is a long-standing problem in reinforcement learning (RL). We hypothesize that one critical element of solving such problems is the notion of compositionality. With the ability to learn concepts and sub-skills that can be composed to solve longer tasks, i.e. hierarchical RL, we can acquire temporally-extended behaviors. However, acquiring effective yet general abstractions for hierarchical RL is remarkably challenging. In this paper, we propose to use language as the abstraction, as it provides unique compositional structure, enabling fast learning and combinatorial generalization, while retaining tremendous flexibility, making it suitable for a variety of problems. Our approach learns an instruction-following low-level policy and a high-level policy that can reuse abstractions across tasks, in essence, permitting agents to reason using structured language. To study compositional task learning, we introduce an open-source object interaction environment built using the MuJoCo physics engine and the CLEVR engine. We find that, using our approach, agents can learn to solve diverse, temporally-extended tasks such as object sorting and multi-object rearrangement, including from raw pixel observations. Our analysis finds that the compositional nature of language is critical for learning diverse sub-skills and systematically generalizing to new sub-skills in comparison to non-compositional abstractions that use the same supervision.

* 20 pages, 21 figures

Floors are Flat: Leveraging Semantics for Real-Time Surface Normal Prediction

Jun 16, 2019

Steven Hickson, Karthik Raveendran, Alireza Fathi, Kevin Murphy, Irfan Essa

Abstract:We propose 4 insights that help to significantly improve the performance of deep learning models that predict surface normals and semantic labels from a single RGB image. These insights are: (1) denoise the "ground truth" surface normals in the training set to ensure consistency with the semantic labels; (2) concurrently train on a mix of real and synthetic data, instead of pretraining on synthetic and finetuning on real; (3) jointly predict normals and semantics using a shared model, but only backpropagate errors on pixels that have valid training labels; (4) slim down the model and use grayscale instead of color inputs. Despite the simplicity of these steps, we demonstrate consistently improved results on several datasets, using a model that runs at 12 fps on a standard mobile phone.

Contrastive Bidirectional Transformer for Temporal Representation Learning

Jun 13, 2019

Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid

Abstract:This paper aims at learning representations for long sequences of continuous signals. Recently, the BERT model has demonstrated the effectiveness of stacked transformers for representing sequences of discrete signals (i.e. word tokens). Inspired by its success, we adopt the stacked transformer architecture, but generalize its training objective to maximize the mutual information between the masked signals, and the bidirectional context, via contrastive loss. This enables the model to handle continuous signals, such as visual features. We further consider the case when there are multiple sequences that are semantically aligned at the sequence-level but not at the element-level (e.g. video and ASR), where we propose to use a Transformer to estimate the mutual information between the two sequences, which is again maximized via contrastive loss. We demonstrate the effectiveness of the learned representations on modeling long video sequences for action anticipation and video captioning. The results show that our method, referred to as Contrastive Bidirectional Transformer (CBT), outperforms various baselines significantly. Furthermore, we improve over the state of the art.
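
To make "maximized via contrastive loss" concrete, here is a minimal NumPy sketch of a generic InfoNCE-style contrastive loss, a commonly used objective of this kind. It is an assumption that this resembles the paper's objective; the function name, the cosine-similarity scoring, and the temperature value are illustrative choices, not details taken from the abstract.

```python
import numpy as np

def info_nce_loss(queries, keys, temperature=0.1):
    """InfoNCE-style loss: queries[i] should match keys[i]; other keys act as negatives."""
    # L2-normalize so the dot product is a cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                       # (batch, batch) similarity scores
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # cross-entropy with the matching key as target

# Hypothetical continuous features, e.g. stand-ins for visual embeddings.
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))
aligned = video + 0.01 * rng.normal(size=(8, 16))   # keys that correspond to the queries
unrelated = rng.normal(size=(8, 16))                # keys with no correspondence
loss_aligned = info_nce_loss(video, aligned)
loss_unrelated = info_nce_loss(video, unrelated)
```

The loss is low when each query scores highest against its own key, so it works directly on continuous vectors and needs no discrete vocabulary.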

A view of Estimation of Distribution Algorithms through the lens of Expectation-Maximization

Jun 05, 2019

David H. Brookes, Akosua Busia, Clara Fannjiang, Kevin Murphy, Jennifer Listgarten

Abstract:We show that under mild conditions, Estimation of Distribution Algorithms (EDAs) can be written as variational Expectation-Maximization (EM) that uses a mixture of weighted particles as the approximate posterior. In the infinite particle limit, EDAs can be viewed as exact EM. Because EM sits on a rigorous statistical foundation and has been thoroughly analyzed, this connection provides a coherent framework with which to reason about EDAs. Importantly, the connection also suggests avenues for possible improvements to EDAs owing to our ability to leverage general statistical tools and generalizations of EM. For example, we make use of results about known EM convergence properties to propose an adaptive, hybrid EDA-gradient descent algorithm; this hybrid demonstrates better performance than either component of the hybrid on several canonical, non-convex test functions. We also demonstrate empirically that although one might hypothesize that reducing the variational gap could prove useful, it actually degrades performance of EDAs. Finally, we show that the connection between EM and EDAs provides us with a new perspective on why EDAs are performing approximate natural gradient descent.
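
As a rough illustration of the EDA loop the abstract analyzes, the sketch below runs a simple one-dimensional Gaussian EDA. It is an assumed textbook-style variant, not the authors' algorithm: the elite-fraction weighting, the function name, and the quadratic test objective are all hypothetical. The comments mark which steps loosely correspond to the weighted-particle view described above.

```python
import numpy as np

def gaussian_eda(objective, mean, std, rng, pop_size=200, elite_frac=0.2, iters=30):
    """Minimize `objective` with a 1-D Gaussian search distribution."""
    n_elite = int(pop_size * elite_frac)
    for _ in range(iters):
        # Sample particles from the current search distribution.
        samples = rng.normal(mean, std, size=pop_size)
        scores = objective(samples)
        # Weight the particles: uniform weight on the best n_elite, zero elsewhere.
        weights = np.zeros(pop_size)
        weights[np.argsort(scores)[:n_elite]] = 1.0 / n_elite
        # Refit the Gaussian by weighted maximum likelihood.
        mean = np.sum(weights * samples)
        std = np.sqrt(np.sum(weights * (samples - mean) ** 2)) + 1e-8
    return mean, std

rng = np.random.default_rng(0)
best_mean, best_std = gaussian_eda(lambda x: (x - 3.0) ** 2, mean=0.0, std=5.0, rng=rng)
```

In this reading, the weighted particles play the role of the approximate posterior and the weighted refit plays the role of the M-step.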

Relational Action Forecasting

Apr 08, 2019

Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, Cordelia Schmid

Abstract:This paper focuses on multi-person action forecasting in videos. More precisely, given a history of H previous frames, the goal is to detect actors and to predict their future actions for the next T frames. Our approach jointly models temporal and spatial interactions among different actors by constructing a recurrent graph, using actor proposals obtained with Faster R-CNN as nodes. Our method learns to select a subset of discriminative relations without requiring explicit supervision, thus enabling us to tackle challenging visual data. We refer to our model as Discriminative Relational Recurrent Network (DRRN). Evaluation of action prediction on AVA demonstrates the effectiveness of our proposed method compared to simpler baselines. Furthermore, we significantly improve performance on the task of early action classification on J-HMDB, from the previous SOTA of 48% to 60%.

* CVPR 2019 (oral)

VideoBERT: A Joint Model for Video and Language Representation Learning

Apr 03, 2019

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid

Abstract:Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches learn low-level representations, we propose a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We use this model in a number of tasks, including action classification and video captioning. We show that it can be applied directly to open-vocabulary classification, and confirm that large amounts of training data and cross-modal information are critical to performance. Furthermore, we outperform the state-of-the-art on video captioning, and quantitative results verify that the model learns high-level semantic features.
