Alexey Dosovitskiy

Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs

Aug 07, 2017
Maxim Tatarchenko, Alexey Dosovitskiy, Thomas Brox

We present a deep convolutional decoder architecture that can generate volumetric 3D outputs in a compute- and memory-efficient manner by using an octree representation. The network learns to predict both the structure of the octree, and the occupancy values of individual cells. This makes it a particularly valuable technique for generating 3D shapes. In contrast to standard decoders acting on regular voxel grids, the architecture does not have cubic complexity. This allows representing much higher resolution outputs with a limited memory budget. We demonstrate this in several application domains, including 3D convolutional autoencoders, generation of objects and whole scenes from high-level representations, and shape from a single image.
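
To make the memory argument concrete, here is a minimal sketch (not the authors' code) that counts the cells an octree needs to represent a solid sphere and compares that with a dense voxel grid of the same resolution. The helper names `classify`, `count_leaves` and `octree_vs_dense`, the sphere test shape, and the radius of 0.3 times the resolution are all assumptions chosen for illustration. Only cells that the surface passes through are subdivided; cells that are fully inside or fully outside stay as single leaves.

```python
def classify(origin, size, center, radius):
    """Classify an axis-aligned cube against a solid sphere."""
    near_sq = far_sq = 0.0
    for lo, c in zip(origin, center):
        hi = lo + size
        # Squared distance from the sphere centre to the nearest / farthest cube point.
        near_sq += max(lo - c, 0.0, c - hi) ** 2
        far_sq += max(abs(lo - c), abs(hi - c)) ** 2
    if near_sq >= radius ** 2:
        return "empty"    # cube lies entirely outside the sphere
    if far_sq <= radius ** 2:
        return "filled"   # cube lies entirely inside the sphere
    return "mixed"        # the surface crosses this cube


def count_leaves(origin, size, center, radius):
    """Count octree leaves; only 'mixed' cells larger than one voxel are split."""
    if size == 1 or classify(origin, size, center, radius) != "mixed":
        return 1
    half = size // 2
    x0, y0, z0 = origin
    return sum(
        count_leaves((x0 + dx, y0 + dy, z0 + dz), half, center, radius)
        for dx in (0, half) for dy in (0, half) for dz in (0, half)
    )


def octree_vs_dense(resolution):
    """Return (octree leaf count, dense voxel count); resolution must be a power of two."""
    center = (resolution / 2,) * 3
    radius = 0.3 * resolution
    return count_leaves((0, 0, 0), resolution, center, radius), resolution ** 3


for res in (32, 64, 128):
    leaves, dense = octree_vs_dense(res)
    print(f"resolution {res}: {leaves} octree cells vs {dense} dense voxels")
```

In this toy setting the leaf count follows the surface area of the shape rather than its volume, which is the intuition behind avoiding cubic complexity.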

Learning to Generate Chairs, Tables and Cars with Convolutional Networks

Aug 02, 2017
Alexey Dosovitskiy, Jost Tobias Springenberg, Maxim Tatarchenko, Thomas Brox

We train generative 'up-convolutional' neural networks which are able to generate images of objects given object style, viewpoint, and color. We train the networks on rendered 3D models of chairs, tables, and cars. Our experiments show that the networks do not merely learn all images by heart, but rather find a meaningful representation of 3D models allowing them to assess the similarity of different models, interpolate between given views to generate the missing ones, extrapolate views, and invent new objects not present in the training set by recombining training instances, or even two different object classes. Moreover, we show that such generative networks can be used to find correspondences between different objects from the dataset, outperforming existing approaches on this task.

* v4: final PAMI version. New architecture figure

Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space

Apr 12, 2017
Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, Jason Yosinski

Generating high-resolution, photo-realistic images has been a long-standing goal in machine learning. Recently, Nguyen et al. (2016) showed one interesting way to synthesize novel images by performing gradient ascent in the latent space of a generator network to maximize the activations of one or multiple neurons in a separate classifier network. In this paper we extend this method by introducing an additional prior on the latent code, improving both sample quality and sample diversity, leading to a state-of-the-art generative model that produces high quality images at higher resolutions (227x227) than previous generative models, and does so for all 1000 ImageNet categories. In addition, we provide a unified probabilistic interpretation of related activation maximization methods and call the general class of models "Plug and Play Generative Networks". PPGNs are composed of 1) a generator network G that is capable of drawing a wide range of image types and 2) a replaceable "condition" network C that tells the generator what to draw. We demonstrate the generation of images conditioned on a class (when C is an ImageNet or MIT Places classification network) and also conditioned on a caption (when C is an image captioning network). Our method also improves the state of the art of Multifaceted Feature Visualization, which generates the set of synthetic inputs that activate a neuron in order to better understand how deep neural networks operate. Finally, we show that our model performs reasonably well at the task of image inpainting. While image models are used in this paper, the approach is modality-agnostic and can be applied to many types of data.

* CVPR camera-ready

DeMoN: Depth and Motion Network for Learning Monocular Stereo

Apr 11, 2017
Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, Thomas Brox

In this paper we formulate structure from motion as a learning problem. We train a convolutional network end-to-end to compute depth and camera motion from successive, unconstrained image pairs. The architecture is composed of multiple stacked encoder-decoder networks, the core part being an iterative network that is able to improve its own predictions. The network estimates not only depth and motion, but additionally surface normals, optical flow between the images and confidence of the matching. A crucial component of the approach is a training loss based on spatial relative differences. Compared to traditional two-frame structure from motion methods, results are more accurate and more robust. In contrast to the popular depth-from-single-image networks, DeMoN learns the concept of matching and, thus, better generalizes to structures not seen during training.
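
The abstract does not spell out the loss, so the following is only a sketch of what a loss on spatial relative differences could look like, under stated assumptions: a discrete gradient normalized by the magnitudes of the two values it compares, evaluated at a few pixel spacings. The function names, the spacings `(1, 2)` and the requirement that all values are strictly positive (as for depth) are choices made here for illustration and are not taken from the paper.

```python
import math


def relative_gradient(f, h=1):
    """Normalized discrete gradient of a 2D map f (list of lists) at spacing h.

    Each difference is divided by the sum of the absolute values involved,
    so multiplying f by a positive constant leaves the result unchanged.
    Assumes all values are strictly positive to avoid division by zero.
    """
    rows, cols = len(f), len(f[0])
    out = []
    for i in range(rows - h):
        for j in range(cols - h):
            a = f[i][j]
            gx = (f[i][j + h] - a) / (abs(f[i][j + h]) + abs(a))
            gy = (f[i + h][j] - a) / (abs(f[i + h][j]) + abs(a))
            out.append((gx, gy))
    return out


def relative_difference_loss(pred, target, spacings=(1, 2)):
    """Sum of L2 distances between normalized gradients of pred and target."""
    total = 0.0
    for h in spacings:
        pairs = zip(relative_gradient(pred, h), relative_gradient(target, h))
        for (px, py), (tx, ty) in pairs:
            total += math.hypot(px - tx, py - ty)
    return total
```

A loss of this shape compares local structure rather than absolute values, so a prediction that differs from the target only by a global scale factor is not penalized.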

* Camera ready version for CVPR 2017. Supplementary material included. Project page: http://lmb.informatik.uni-freiburg.de/people/ummenhof/depthmotionnet/

Learning to Act by Predicting the Future

Feb 14, 2017
Alexey Dosovitskiy, Vladlen Koltun

We present an approach to sensorimotor control in immersive environments. Our approach utilizes a high-dimensional sensory stream and a lower-dimensional measurement stream. The cotemporal structure of these streams provides a rich supervisory signal, which enables training a sensorimotor control model by interacting with the environment. The model is trained using supervised learning techniques, but without extraneous supervision. It learns to act based on raw sensory input from a complex three-dimensional environment. The presented formulation enables learning without a fixed goal at training time, and pursuing dynamically changing goals at test time. We conduct extensive experiments in three-dimensional simulations based on the classical first-person game Doom. The results demonstrate that the presented approach outperforms sophisticated prior formulations, particularly on challenging tasks. The results also show that trained models successfully generalize across environments and goals. A model trained using the presented approach won the Full Deathmatch track of the Visual Doom AI Competition, which was held in previously unseen environments.

* Published as a conference paper at ICLR 2017

FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks

Dec 06, 2016
Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, Thomas Brox

The FlowNet demonstrated that optical flow estimation can be cast as a learning problem. However, the state of the art with regard to the quality of the flow has still been defined by traditional methods. Particularly on small displacements and real-world data, FlowNet cannot compete with variational methods. In this paper, we advance the concept of end-to-end learning of optical flow and make it work really well. The large improvements in quality and speed are caused by three major contributions: first, we focus on the training data and show that the schedule of presenting data during training is very important. Second, we develop a stacked architecture that includes warping of the second image with intermediate optical flow. Third, we elaborate on small displacements by introducing a sub-network specializing on small motions. FlowNet 2.0 is only marginally slower than the original FlowNet but decreases the estimation error by more than 50%. It performs on par with state-of-the-art methods, while running at interactive frame rates. Moreover, we present faster variants that allow optical flow computation at up to 140fps with accuracy matching the original FlowNet.
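
The warping step mentioned above can be illustrated with a small sketch. It assumes backward warping with bilinear interpolation and border clamping, a common choice that the abstract does not confirm; the function names and the list-of-lists image layout are hypothetical. Each output pixel `(x, y)` is filled with the value of the second image at `(x + u, y + v)`, where `(u, v)` is the flow at that pixel, so a correct flow makes the warped second image resemble the first.

```python
def bilinear_sample(img, x, y):
    """Sample a 2D image (list of lists) at real-valued (x, y), clamped to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp_second_image(img2, flow):
    """Backward-warp img2 towards the first image.

    flow[y][x] = (u, v) is the displacement of pixel (x, y) from image 1 to image 2.
    """
    h, w = len(img2), len(img2[0])
    return [
        [bilinear_sample(img2, x + flow[y][x][0], y + flow[y][x][1]) for x in range(w)]
        for y in range(h)
    ]
```

A later network in a stack can then look at the first image and the warped second image and only needs to estimate the remaining, smaller displacement.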

* Including supplementary material. For the video see: http://lmb.informatik.uni-freiburg.de/Publications/2016/IMKDB16/

Synthesizing the preferred inputs for neurons in neural networks via deep generator networks

Nov 23, 2016
Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, Jeff Clune

Deep neural networks (DNNs) have demonstrated state-of-the-art results on many pattern recognition tasks, especially vision classification problems. Understanding the inner workings of such computational brains is both fascinating basic science that is interesting in its own right - similar to why we study the human brain - and will enable researchers to further improve DNNs. One path to understanding how a neural network functions internally is to study what each of its neurons has learned to detect. One such method is called activation maximization (AM), which synthesizes an input (e.g. an image) that highly activates a neuron. Here we dramatically improve the qualitative state of the art of activation maximization by harnessing a powerful, learned prior: a deep generator network (DGN). The algorithm (1) generates qualitatively state-of-the-art synthetic images that look almost real, (2) reveals the features learned by each neuron in an interpretable way, (3) generalizes well to new datasets and somewhat well to different network architectures without requiring the prior to be relearned, and (4) can be considered as a high-quality generative method (in this case, by generating novel, creative, interesting, recognizable images).
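
As a toy illustration of activation maximization through a generator, the sketch below runs gradient ascent on a latent code `z` so that a "neuron" evaluated on the generated output fires more strongly. Everything here is an assumption made for illustration: the two-dimensional `toy_generator`, the `toy_neuron`, the step size, and the use of finite differences in place of backpropagation through real networks.

```python
import math


def numerical_grad(f, z, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at point z."""
    grad = []
    for i in range(len(z)):
        zp, zm = list(z), list(z)
        zp[i] += eps
        zm[i] -= eps
        grad.append((f(zp) - f(zm)) / (2 * eps))
    return grad


def maximize_activation(generator, neuron, z0, steps=200, lr=0.1):
    """Gradient ascent in latent space on neuron(generator(z))."""
    z = list(z0)
    for _ in range(steps):
        g = numerical_grad(lambda code: neuron(generator(code)), z)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z


def toy_generator(z):
    """Hypothetical stand-in for a learned generator: latent code -> 2-pixel 'image'."""
    return [math.tanh(z[0] + z[1]), math.tanh(z[0] - z[1])]


def toy_neuron(x):
    """Hypothetical stand-in for a neuron that prefers a bright first pixel and a dark second."""
    return x[0] - x[1]


z_start = [0.0, 0.0]
z_opt = maximize_activation(toy_generator, toy_neuron, z_start)
print("activation before:", toy_neuron(toy_generator(z_start)))
print("activation after: ", toy_neuron(toy_generator(z_opt)))
```

Because the search happens over `z` rather than over raw pixels, every candidate input is something the generator can produce, which is the role of the learned prior described above.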

* 29 pages, 35 figures, NIPS camera-ready

Artistic style transfer for videos

Oct 19, 2016
Manuel Ruder, Alexey Dosovitskiy, Thomas Brox

In the past, manually re-drawing an image in a certain artistic style required a professional artist and a long time. Doing this for a video sequence single-handed was beyond imagination. Nowadays computers provide new possibilities. We present an approach that transfers the style from one image (for example, a painting) to a whole video sequence. We make use of recent advances in style transfer in still images and propose new initializations and loss functions applicable to videos. This allows us to generate consistent and stable stylized video sequences, even in cases with large motion and strong occlusion. We show that the proposed method clearly outperforms simpler baselines both qualitatively and quantitatively.

* German Conference on Pattern Recognition (GCPR), LNCS 9796, pp. 26-36 (2016)
* final version appeared in GCPR-2016; minor changes to improve the clarity

Multi-view 3D Models from Single Images with a Convolutional Network

Aug 02, 2016
Maxim Tatarchenko, Alexey Dosovitskiy, Thomas Brox

We present a convolutional network capable of inferring a 3D representation of a previously unseen object given a single image of this object. Concretely, the network can predict an RGB image and a depth map of the object as seen from an arbitrary view. Several of these depth maps fused together give a full point cloud of the object. The point cloud can in turn be transformed into a surface mesh. The network is trained on renderings of synthetic 3D models of cars and chairs. It successfully deals with objects on cluttered background and generates reasonable predictions for real images of cars.
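
The fusion of depth maps into a point cloud can be sketched as follows. This assumes a pinhole camera with focal length `f` and principal point `(cx, cy)`, known camera-to-world poses `(R, t)` for each view, and non-positive depth marking background; none of these details come from the abstract, and the function names are hypothetical.

```python
def unproject(depth, f, cx, cy):
    """Turn a depth map (list of lists) into 3D points in camera coordinates."""
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d > 0:  # assumption: non-positive depth means "no surface here"
                points.append(((u - cx) * d / f, (v - cy) * d / f, d))
    return points


def to_world(points, R, t):
    """Apply a rigid camera-to-world transform: p_world = R @ p_cam + t."""
    return [
        tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))
        for p in points
    ]


def fuse(views):
    """Concatenate the unprojected points of all views in world coordinates.

    Each view is a tuple (depth, f, cx, cy, R, t).
    """
    cloud = []
    for depth, f, cx, cy, R, t in views:
        cloud.extend(to_world(unproject(depth, f, cx, cy), R, t))
    return cloud
```

With depth maps predicted from several viewpoints around the object, `fuse` yields a single point set that a surface reconstruction method could then turn into a mesh.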

Inverting Visual Representations with Convolutional Networks

Apr 26, 2016
Alexey Dosovitskiy, Thomas Brox

Feature representations, both hand-designed and learned ones, are often hard to analyze and interpret, even when they are extracted from visual data. We propose a new approach to study image representations by inverting them with an up-convolutional neural network. We apply the method to shallow representations (HOG, SIFT, LBP), as well as to deep networks. For shallow representations our approach provides significantly better reconstructions than existing methods, revealing that there is surprisingly rich information contained in these features. Inverting a deep network trained on ImageNet provides several insights into the properties of the feature representation learned by the network. Most strikingly, the colors and the rough contours of an image can be reconstructed from activations in higher network layers and even from the predicted class probabilities.

* Version 4 - final version to appear in CVPR-2016. Visually better results obtained with feature similarity and adversarial training are in a different paper - arXiv:1602.02644
