Kevin Murphy

Google Brain

Unsupervised Discovery of Parts, Structure, and Dynamics

Mar 12, 2019

Zhenjia Xu, Zhijian Liu, Chen Sun, Kevin Murphy, William T. Freeman, Joshua B. Tenenbaum, Jiajun Wu

Abstract: Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future. In this paper, we propose a novel formulation that simultaneously learns a hierarchical, disentangled object representation and a dynamics model for object parts from unlabeled videos. Our Parts, Structure, and Dynamics (PSD) model learns to, first, recognize the object parts via a layered image representation; second, predict hierarchy via a structural descriptor that composes low-level concepts into a hierarchical structure; and third, model the system dynamics by predicting the future. Experiments on multiple real and synthetic datasets demonstrate that our PSD model works well on all three tasks: segmenting object parts, building their hierarchical structure, and capturing their motion distributions.

* ICLR 2019. The first two authors contributed equally to this work

Stochastic Prediction of Multi-Agent Interactions from Partial Observations

Feb 25, 2019

Chen Sun, Per Karlsson, Jiajun Wu, Joshua B Tenenbaum, Kevin Murphy

Abstract: We present a method that learns to integrate temporal information, from a learned dynamics model, with ambiguous visual information, from a learned vision model, in the context of interacting agents. Our method is based on a graph-structured variational recurrent neural network (Graph-VRNN), which is trained end-to-end to infer the current state of the (partially observed) world, as well as to forecast future states. We show that our method outperforms various baselines on two sports datasets, one based on real basketball trajectories, and one generated by a soccer game engine.

* ICLR 2019 camera ready

NAS-Bench-101: Towards Reproducible Neural Architecture Search

Feb 25, 2019

Chris Ying, Aaron Klein, Esteban Real, Eric Christiansen, Kevin Murphy, Frank Hutter

Abstract: Recent advances in neural architecture search (NAS) demand tremendous computational resources. This makes it difficult to reproduce experiments and imposes a barrier to entry for researchers without access to large-scale computation. We aim to ameliorate these problems by introducing NAS-Bench-101, the first public architecture dataset for NAS research. To build NAS-Bench-101, we carefully constructed a compact, yet expressive, search space, exploiting graph isomorphisms to identify 423k unique convolutional architectures. We trained and evaluated all of these architectures multiple times on CIFAR-10 and compiled the results into a large dataset. Altogether, NAS-Bench-101 contains the metrics of over 5 million models, the largest dataset of its kind thus far. This allows researchers to evaluate the quality of a diverse range of models in milliseconds by querying the pre-computed dataset. We demonstrate its utility by analyzing the dataset as a whole and by benchmarking a range of architecture optimization algorithms.
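
The idea of collapsing isomorphic graphs to one entry and then answering queries from a precomputed table can be sketched roughly as follows. This is a toy illustration, not the NAS-Bench-101 API: `canonical_key`, `record`, `query`, the operation names, and the accuracy value are all hypothetical, and the brute-force canonicalization only makes sense for very small graphs.

```python
from itertools import permutations

def canonical_key(adjacency, ops):
    """Return a key shared by every isomorphic relabeling of a small DAG.

    adjacency[i][j] == 1 means there is an edge from node i to node j.
    ops[i] is the operation label of node i. Node 0 (input) and the last
    node (output) keep their positions; interior nodes are permuted.
    """
    n = len(ops)
    best = None
    for perm in permutations(range(1, n - 1)):
        order = (0, *perm, n - 1)
        new_adj = tuple(tuple(adjacency[a][b] for b in order) for a in order)
        new_ops = tuple(ops[a] for a in order)
        candidate = (new_adj, new_ops)
        # Keep the lexicographically smallest relabeling as the canonical form.
        if best is None or candidate < best:
            best = candidate
    return best

# Hypothetical precomputed table: canonical key -> metrics measured offline.
results = {}

def record(adjacency, ops, metrics):
    results[canonical_key(adjacency, ops)] = metrics

def query(adjacency, ops):
    # A dictionary lookup replaces training the architecture from scratch.
    return results[canonical_key(adjacency, ops)]

# Chain input -> conv3x3 -> maxpool -> output, written with two node orderings.
adj_a = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
ops_a = ["input", "conv3x3", "maxpool", "output"]
adj_b = [[0, 0, 1, 0], [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 0]]
ops_b = ["input", "maxpool", "conv3x3", "output"]

record(adj_a, ops_a, {"test_accuracy": 0.9})  # made-up placeholder value
print(query(adj_b, ops_b))  # same architecture, so the stored metrics are found
```

Swapping the two interior operations while keeping the edges of `adj_a` gives a different architecture, and in this sketch it maps to a different key.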

Composing Text and Image for Image Retrieval - An Empirical Odyssey

Dec 18, 2018

Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays

Abstract: In this paper, we study the task of image retrieval, where the input query is specified in the form of an image plus some text that describes desired modifications to the input image. For example, we may present an image of the Eiffel tower, and ask the system to find images which are visually similar but are modified in small ways, such as being taken at nighttime instead of during the day. To tackle this task, we learn a similarity metric between a target image and a source image plus source text: an embedding and composition function such that the target image feature is close to the feature composed from the source image and text. We propose a new way to combine image and text using such a function, designed for the retrieval task. We show this outperforms existing approaches on 3 different datasets, namely Fashion-200k, MIT-States and a new synthetic dataset we create based on CLEVR. We also show that our approach can be used to classify input queries, in addition to image retrieval.

Modeling Uncertainty with Hedged Instance Embedding

Oct 19, 2018

Seong Joon Oh, Kevin Murphy, Jiyan Pan, Joseph Roth, Florian Schroff, Andrew Gallagher

Abstract: Instance embeddings are an efficient and versatile image representation that facilitates applications like recognition, verification, retrieval, and clustering. Many metric learning methods represent the input as a single point in the embedding space. Often the distance between points is used as a proxy for match confidence. However, this can fail to represent uncertainty arising when the input is ambiguous, e.g., due to occlusion or blurriness. This work addresses this issue and explicitly models the uncertainty by hedging the location of each input in the embedding space. We introduce the hedged instance embedding (HIB) in which embeddings are modeled as random variables and the model is trained under the variational information bottleneck principle. Empirical results on our new N-digit MNIST dataset show that our method leads to the desired behavior of hedging its bets across the embedding space upon encountering ambiguous inputs. This results in improved performance for image matching and classification tasks, more structure in the learned embedding space, and an ability to compute a per-exemplar uncertainty measure that is correlated with downstream performance.

* 15 pages, 10 figures
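
As a rough illustration of what treating embeddings as random variables can look like, the sketch below estimates a match probability between two inputs by Monte Carlo sampling. Several details are assumptions made for the example and are not taken from the abstract: the isotropic Gaussian form, the sigmoid-of-distance match score, and the constants `a` and `b`. The function names are hypothetical.

```python
import math
import random

def sample_embedding(mean, sigma, rng):
    """Draw one point from an isotropic Gaussian N(mean, sigma^2 I)."""
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

def match_probability(mean1, sigma1, mean2, sigma2,
                      a=1.0, b=2.0, n_samples=2000, seed=0):
    """Average a soft match score over sampled embedding pairs.

    The score sigmoid(-a * distance + b) is an assumed form; a and b are
    placeholder constants chosen only for this example.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z1 = sample_embedding(mean1, sigma1, rng)
        z2 = sample_embedding(mean2, sigma2, rng)
        dist = math.dist(z1, z2)
        total += 1.0 / (1.0 + math.exp(a * dist - b))  # sigmoid(-a*dist + b)
    return total / n_samples

# Two clean inputs with the same mean and a small spread.
p_sharp = match_probability([0.0, 0.0], 0.05, [0.0, 0.0], 0.05)
# Two ambiguous inputs with the same mean but a large spread.
p_blur = match_probability([0.0, 0.0], 2.0, [0.0, 0.0], 2.0)
# Two clean inputs whose means are far apart.
p_far = match_probability([0.0, 0.0], 0.05, [5.0, 5.0], 0.05)

print(p_sharp, p_blur, p_far)
```

With a point embedding, the two ambiguous inputs would look like a confident match because their means coincide. Here the large spread lowers the estimated match probability, which is the kind of hedging the abstract describes.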

Actor-Centric Relation Network

Jul 28, 2018

Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, Cordelia Schmid

Abstract: Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level and model temporal context with 3D ConvNets. Here, we go one step further and model spatio-temporal relations to capture the interactions between human actors, relevant objects and scene elements essential to differentiate similar human actions. Our approach is weakly supervised and mines the relevant elements automatically with an actor-centric relational network (ACRN). ACRN computes and accumulates pair-wise relation information from actor and global scene features, and generates relation features for action classification. It is implemented as a neural network and can be trained jointly with an existing action detection system. We show that ACRN outperforms alternative approaches which capture relation information, and that the proposed framework improves upon the state-of-the-art performance on JHMDB and AVA. A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.

* ECCV 2018 camera ready

Tracking Emerges by Colorizing Videos

Jul 27, 2018

Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, Kevin Murphy

Abstract: We use large amounts of unlabeled video to learn models for visual tracking without manual human supervision. We leverage the natural temporal coherency of color to create a model that learns to colorize gray-scale videos by copying colors from a reference frame. Quantitative and qualitative experiments suggest that this task causes the model to automatically learn to track visual regions. Although the model is trained without any ground-truth labels, our method learns to track well enough to outperform the latest methods based on optical flow. Moreover, our results suggest that failures to track are correlated with failures to colorize, indicating that advancing video colorization may further improve self-supervised visual tracking.

* ECCV 2018. Blog post: https://ai.googleblog.com/2018/06/self-supervised-tracking-via-video.html
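
One way to picture "copying colors from a reference frame" is a soft pointer: each target pixel compares its feature vector with every reference pixel and takes a similarity-weighted average of the reference colors. The sketch below is a toy version under that assumption. The dot-product similarity, the softmax, the `temperature` parameter, and the hand-written feature vectors are illustrative choices, not details from the abstract.

```python
import math

def copy_colors(target_feats, ref_feats, ref_colors, temperature=1.0):
    """Predict a color for each target pixel by pointing into the reference frame.

    Returns (colors, pointers): colors[i] is the weighted average of the
    reference colors, pointers[i] is the softmax weight vector over
    reference pixels for target pixel i.
    """
    colors, pointers = [], []
    n_channels = len(ref_colors[0])
    for f in target_feats:
        # Dot-product similarity between this target pixel and each reference pixel.
        logits = [sum(x * y for x, y in zip(f, g)) / temperature for g in ref_feats]
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]  # shift by max for stability
        norm = sum(exps)
        weights = [e / norm for e in exps]
        colors.append([sum(w * c[k] for w, c in zip(weights, ref_colors))
                       for k in range(n_channels)])
        pointers.append(weights)
    return colors, pointers

# Two reference pixels (red and blue) with hypothetical learned features.
ref_feats = [[4.0, 0.0], [0.0, 4.0]]
ref_colors = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
# In the target frame the two regions have swapped positions.
target_feats = [[0.0, 4.0], [4.0, 0.0]]

colors, pointers = copy_colors(target_feats, ref_feats, ref_colors)
# The strongest pointer for each target pixel says where it came from.
tracked = [max(range(len(w)), key=w.__getitem__) for w in pointers]
print(tracked)
```

The same pointer weights that copy colors during training could copy any other per-pixel label at test time, such as a segmentation mask, which is how a colorization model can end up tracking.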

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

Jul 27, 2018

Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, Kevin Murphy

Abstract: Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification. Three main challenges exist: spatial (image) feature representation, temporal information representation, and model/computation complexity. It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning. However, as for model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfit. We seek a balance between speed and accuracy by building an effective and efficient video classification system through systematic exploration of critical network design choices. In particular, we show that it is possible to replace many of the 3D convolutions by low-cost 2D convolutions. Rather surprisingly, the best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level semantic features is more useful. Our conclusion generalizes to datasets with very different properties. When combined with several other cost-effective designs including separable spatial/temporal convolution and feature gating, the result is an effective video classification system that produces very competitive results on several action classification benchmarks (Kinetics, Something-something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24).

* ECCV 2018 camera ready
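
To make the cost argument for separable spatial/temporal convolution concrete, the sketch below counts weights for a full 3D convolution and for a factorized version. It assumes the factorization is a 1 x k x k spatial convolution followed by a kt x 1 x 1 temporal convolution, with the intermediate channel count equal to the output channel count and biases ignored. Those choices, the function names, and the channel sizes are illustrative and do not come from the abstract.

```python
def conv3d_params(c_in, c_out, kt, k):
    """Weights in a full kt x k x k 3D convolution (biases ignored)."""
    return c_in * c_out * kt * k * k

def separable_params(c_in, c_out, kt, k):
    """Weights in a 1 x k x k spatial conv followed by a kt x 1 x 1 temporal conv.

    Assumes the spatial conv already maps c_in to c_out channels.
    """
    spatial = c_in * c_out * k * k
    temporal = c_out * c_out * kt
    return spatial + temporal

full = conv3d_params(64, 64, kt=3, k=3)
separable = separable_params(64, 64, kt=3, k=3)
print(full)              # 110592
print(separable)         # 49152
print(full / separable)  # 2.25
```

Under these assumptions a 3 x 3 x 3 layer with equal input and output channels needs 27 weights per channel pair, while the factorized layer needs 9 + 3 = 12, a 2.25x reduction.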

Progressive Neural Architecture Search

Jul 26, 2018

Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, Kevin Murphy

Abstract: We propose a new method for learning the structure of convolutional neural networks (CNNs) that is more efficient than recent state-of-the-art methods based on reinforcement learning and evolutionary algorithms. Our approach uses a sequential model-based optimization (SMBO) strategy, in which we search for structures in order of increasing complexity, while simultaneously learning a surrogate model to guide the search through structure space. Direct comparison under the same search space shows that our method is up to 5 times more efficient than the RL method of Zoph et al. (2018) in terms of number of models evaluated, and 8 times faster in terms of total compute. The structures we discover in this way achieve state-of-the-art classification accuracies on CIFAR-10 and ImageNet.

* To appear in ECCV 2018 as oral. The code and checkpoint for PNASNet-5 trained on ImageNet (both Mobile and Large) can now be downloaded from https://github.com/tensorflow/models/tree/master/research/slim#Pretrained. Also see https://github.com/chenxi116/PNASNet.TF for refactored and simplified TensorFlow code; see https://github.com/chenxi116/PNASNet.pytorch for exact conversion to PyTorch

XGAN: Unsupervised Image-to-Image Translation for Many-to-Many Mappings

Jul 10, 2018

Amélie Royer, Konstantinos Bousmalis, Stephan Gouws, Fred Bertsch, Inbar Mosseri, Forrester Cole, Kevin Murphy

Abstract: Style transfer usually refers to the task of applying color and texture information from a specific style image to a given content image while preserving the structure of the latter. Here we tackle the more generic problem of semantic style transfer: given two unpaired collections of images, we aim to learn a mapping between the corpus-level style of each collection, while preserving semantic content shared across the two domains. We introduce XGAN ("Cross-GAN"), a dual adversarial autoencoder, which captures a shared representation of the common domain semantic content in an unsupervised way, while jointly learning the domain-to-domain image translations in both directions. We exploit ideas from the domain adaptation literature and define a semantic consistency loss which encourages the model to preserve semantics in the learned embedding space. We report promising qualitative results for the task of face-to-cartoon translation. CartoonSet, the cartoon dataset we collected for this purpose, is publicly available at google.github.io/cartoonset/ as a new benchmark for semantic style transfer.

* Domain Adaptation for Visual Understanding at ICML'18
