Alexey Dosovitskiy

Moving Off-the-Grid: Scene-Grounded Video Representations

Nov 08, 2024

Sjoerd van Steenkiste, Daniel Zoran, Yi Yang, Yulia Rubanova, Rishabh Kabra, Carl Doersch, Dilara Gokay, Joseph Heyward, Etienne Pot, Klaus Greff(+6 more)

Abstract: Current vision models typically maintain a fixed correspondence between their representation structure and image space. Each layer comprises a set of tokens arranged "on-the-grid," which biases patches or tokens to encode information at a specific spatio(-temporal) location. In this work we present Moving Off-the-Grid (MooG), a self-supervised video representation model that offers an alternative approach, allowing tokens to move "off-the-grid" to better enable them to represent scene elements consistently, even as they move across the image plane through time. By using a combination of cross-attention and positional embeddings we disentangle the representation structure and image structure. We find that a simple self-supervised objective--next frame prediction--trained on video data, results in a set of latent tokens which bind to specific scene structures and track them as they move. We demonstrate the usefulness of MooG's learned representation both qualitatively and quantitatively by training readouts on top of the learned representation on a variety of downstream tasks. We show that MooG can provide a strong foundation for different vision tasks when compared to "on-the-grid" baselines.

* Accepted to NeurIPS 2024 (spotlight). Project page: https://moog-paper.github.io/

ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization

Jun 06, 2024

Luca Eyring, Shyamgopal Karthik, Karsten Roth, Alexey Dosovitskiy, Zeynep Akata

Abstract: Text-to-Image (T2I) models have made significant advancements in recent years, but they still struggle to accurately capture intricate details specified in complex compositional prompts. While fine-tuning T2I models with reward objectives has shown promise, it suffers from "reward hacking" and may not generalize well to unseen prompt distributions. In this work, we propose Reward-based Noise Optimization (ReNO), a novel approach that enhances T2I models at inference by optimizing the initial noise based on the signal from one or multiple human preference reward models. Remarkably, solving this optimization problem with gradient ascent for 50 iterations yields impressive results on four different one-step models across two competitive benchmarks, T2I-CompBench and GenEval. Within a computational budget of 20-50 seconds, ReNO-enhanced one-step models consistently surpass the performance of all current open-source Text-to-Image models. Extensive user studies demonstrate that our model is preferred nearly twice as often compared to the popular SDXL model and is on par with the proprietary Stable Diffusion 3 with 8B parameters. Moreover, given the same computational resources, a ReNO-optimized one-step model outperforms widely-used open-source models such as SDXL and PixArt-$\alpha$, highlighting the efficiency and effectiveness of ReNO in enhancing T2I model performance at inference time. Code is available at https://github.com/ExplainableML/ReNO.

* Preprint
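
The noise-optimization idea described in the ReNO abstract (keep the generator frozen and run gradient ascent on the initial noise for 50 iterations) can be illustrated with a toy sketch. Everything below is a stand-in chosen for illustration and is not the authors' implementation: `toy_generator` is a fixed linear map in place of a one-step T2I model, `toy_reward` is a negative squared distance in place of human preference reward models, and the gradient is written out by hand instead of coming from automatic differentiation. The learning rate is likewise an assumption.

```python
import random

# Hypothetical stand-ins: a frozen linear "generator" and a distance-based "reward".
WEIGHTS = [[0.5, -0.2], [0.1, 0.8]]  # toy generator parameters (never updated)
TARGET = [1.0, -1.0]                 # the output that the toy reward prefers


def toy_generator(noise):
    """Map a noise vector to an 'image' (here just a 2-vector)."""
    return [sum(w * n for w, n in zip(row, noise)) for row in WEIGHTS]


def toy_reward(image):
    """Higher is better: negative squared distance to TARGET."""
    return -sum((x - t) ** 2 for x, t in zip(image, TARGET))


def reward_gradient(noise):
    """Analytic gradient of toy_reward(toy_generator(noise)) w.r.t. the noise."""
    image = toy_generator(noise)
    residual = [x - t for x, t in zip(image, TARGET)]
    # d(reward)/d(noise_j) = -2 * sum_i residual_i * WEIGHTS[i][j]
    return [
        -2.0 * sum(residual[i] * WEIGHTS[i][j] for i in range(len(WEIGHTS)))
        for j in range(len(noise))
    ]


def optimize_noise(noise, steps=50, lr=0.5):
    """Gradient ascent on the initial noise; only the noise changes."""
    for _ in range(steps):
        grad = reward_gradient(noise)
        noise = [n + lr * g for n, g in zip(noise, grad)]
    return noise


rng = random.Random(0)
initial_noise = [rng.gauss(0.0, 1.0) for _ in range(2)]
optimized_noise = optimize_noise(initial_noise)
reward_before = toy_reward(toy_generator(initial_noise))
reward_after = toy_reward(toy_generator(optimized_noise))
print(f"reward before: {reward_before:.4f}, after: {reward_after:.6f}")
```

In this toy setting the reward rises monotonically because the objective is a concave quadratic and the step size is small enough; real reward models offer no such guarantee.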

Simple Open-Vocabulary Object Detection with Vision Transformers

May 12, 2022

Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen(+4 more)

Abstract: Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.

Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations

Nov 29, 2021

Mehdi S. M. Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lucic, Daniel Duckworth, Alexey Dosovitskiy(+3 more)

Abstract: A classical problem in computer vision is to infer a 3D scene representation from few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene. In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesises novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration, and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing a novel-view reconstruction error. We show that this method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for the paper. Further, we demonstrate that SRT scales to support interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery.

* Project website: https://srt-paper.github.io/

Conditional Object-Centric Learning from Video

Nov 24, 2021

Thomas Kipf, Gamaleldin F. Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy, Klaus Greff

Abstract: Object-centric representations are a promising path toward more systematic generalization by providing flexible abstractions upon which compositional world models can be built. Recent work on simple 2D and 3D datasets has shown that models with object-centric inductive biases can learn to segment and represent meaningful objects from the statistical structure of the data alone without the need for any supervision. However, such fully-unsupervised methods still fail to scale to diverse realistic data, despite the use of increasingly complex inductive biases such as priors for the size of objects or the 3D geometry of the scene. In this paper, we instead take a weakly-supervised approach and focus on how 1) using the temporal dynamics of video data in the form of optical flow and 2) conditioning the model on simple object location cues can be used to enable segmenting and tracking objects in significantly more realistic synthetic data. We introduce a sequential extension to Slot Attention which we train to predict optical flow for realistic looking synthetic scenes and show that conditioning the initial state of this model on a small set of hints, such as center of mass of objects in the first frame, is sufficient to significantly improve instance segmentation. These benefits generalize beyond the training distribution to novel objects, novel backgrounds, and to longer video sequences. We also find that such initial-state-conditioning can be used during inference as a flexible interface to query the model for specific objects or parts of objects, which could pave the way for a range of weakly-supervised approaches and allow more effective interaction with trained models.

* Project page at https://slot-attention-video.github.io/

Do Vision Transformers See Like Convolutional Neural Networks?

Aug 19, 2021

Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy

Abstract: Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking differences between the two architectures, such as ViT having more uniform representations across all layers. We explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information, and ViT residual connections, which strongly propagate features from lower to higher layers. We study the ramifications for spatial localization, demonstrating ViTs successfully preserve input spatial information, with noticeable effects from different classification methods. Finally, we study the effect of (pretraining) dataset scale on intermediate features and transfer learning, and conclude with a discussion on connections to new architectures such as the MLP-Mixer.

MLP-Mixer: An all-MLP Architecture for Vision

May 17, 2021

Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit(+2 more)

Abstract: Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.

* Fixed parameter counts in Table 1
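
The two layer types named in the MLP-Mixer abstract (an MLP applied across patches and an MLP applied independently to each patch) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the released model: the helper names, the tiny sizes, and the random weights are hypothetical, biases and layer normalization are left out, and the GELU activation and residual additions are simplifications chosen for the sketch.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def mlp(x, w1, w2):
    """Two-layer MLP applied to every row of x independently (no biases)."""
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def mixer_block(x, token_w1, token_w2, channel_w1, channel_w2):
    """x has shape (patches, channels); the output has the same shape."""
    # Token mixing: transpose so each row is one channel, then mix across patches.
    x = add(x, transpose(mlp(transpose(x), token_w1, token_w2)))
    # Channel mixing: each patch's feature vector is transformed on its own.
    return add(x, mlp(x, channel_w1, channel_w2))


rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


patches, channels, hidden = 4, 3, 8  # illustrative sizes only
x = random_matrix(patches, channels)
out = mixer_block(
    x,
    random_matrix(patches, hidden), random_matrix(hidden, patches),
    random_matrix(channels, hidden), random_matrix(hidden, channels),
)
```

The transpose is the only difference between the two mixing steps: the same row-wise `mlp` mixes spatial information when rows are channels, and per-location features when rows are patches.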

Differentiable Patch Selection for Image Recognition

Apr 07, 2021

Jean-Baptiste Cordonnier, Aravindh Mahendran, Alexey Dosovitskiy, Dirk Weissenborn, Jakob Uszkoreit, Thomas Unterthiner

Abstract: Neural Networks require large amounts of memory and compute to process high resolution images, even when only a small part of the image is actually informative for the task at hand. We propose a method based on a differentiable Top-K operator to select the most relevant parts of the input to efficiently process high resolution images. Our method may be interfaced with any downstream neural network, is able to aggregate information from different patches in a flexible way, and allows the whole model to be trained end-to-end using backpropagation. We show results for traffic sign recognition, inter-patch relationship reasoning, and fine-grained recognition without using object/part bounding box annotations during training.

* Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021. Code available at https://github.com/google-research/google-research/tree/master/ptopk_patch_selection/

Learning Object-Centric Video Models by Contrasting Sets

Nov 20, 2020

Sindy Löwe, Klaus Greff, Rico Jonschkowski, Alexey Dosovitskiy, Thomas Kipf

Abstract: Contrastive, self-supervised learning of object representations recently emerged as an attractive alternative to reconstruction-based training. Prior approaches focus on contrasting individual object representations (slots) against one another. However, a fundamental problem with this approach is that the overall contrastive loss is the same for (i) representing a different object in each slot, as it is for (ii) (re-)representing the same object in all slots. Thus, this objective does not inherently push towards the emergence of object-centric representations in the slots. We address this problem by introducing a global, set-based contrastive loss: instead of contrasting individual slot representations against one another, we aggregate the representations and contrast the joined sets against one another. Additionally, we introduce attention-based encoders to this contrastive setup which simplifies training and provides interpretable object masks. Our results on two synthetic video datasets suggest that this approach compares favorably against previous contrastive methods in terms of reconstruction, future prediction and object separation performance.

* NeurIPS 2020 Workshop on Object Representations for Learning and Reasoning

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Oct 22, 2020

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly(+2 more)

Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

* Fine-tuning code and pre-trained models are available at https://github.com/google-research/vision_transformer
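
The phrase "sequences of image patches" in the ViT abstract can be made concrete with a small sketch of the patch-splitting step alone. The function name `image_to_patches` and the dummy image are hypothetical, and the steps that would follow in a real model (a learned linear projection of each patch, position embeddings, the Transformer itself) are deliberately left out; for the actual implementation, see the repository linked above.

```python
def image_to_patches(image, patch_size=16):
    """Split an H x W x C nested-list image into a sequence of flattened patches.

    Patches are emitted in row-major order (left to right, then top to bottom),
    and each patch is flattened to patch_size * patch_size * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# A dummy 32 x 48 RGB image whose pixels record their own (row, column) position.
image = [[[r, c, 0] for c in range(48)] for r in range(32)]
patches = image_to_patches(image)
print(len(patches), len(patches[0]))  # sequence length and per-patch vector size
```

With 16x16 patches, the 32 x 48 dummy image becomes a sequence of 6 vectors of 768 values each, which is the kind of token sequence a Transformer consumes in place of words.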
