Alexei A. Efros

Test-Time Training with Masked Autoencoders

Sep 15, 2022

Yossi Gandelsman, Yu Sun, Xinlei Chen, Alexei A. Efros

Abstract: Test-time training adapts to a new test distribution on the fly by optimizing a model for each test input using self-supervision. In this paper, we use masked autoencoders for this one-sample learning problem. Empirically, our simple method improves generalization on many visual benchmarks for distribution shifts. Theoretically, we characterize this improvement in terms of the bias-variance trade-off.

* Project page: https://yossigandelsman.github.io/ttt_mae/index.html

Studying Bias in GANs through the Lens of Race

Sep 15, 2022

Vongani H. Maluleke, Neerja Thakkar, Tim Brooks, Ethan Weber, Trevor Darrell, Alexei A. Efros, Angjoo Kanazawa, Devin Guillory

Abstract: In this work, we study how the performance and evaluation of generative image models are impacted by the racial composition of their training datasets. By examining and controlling the racial distributions in various training datasets, we are able to observe the impacts of different training distributions on generated image quality and the racial distributions of the generated images. Our results show that the racial composition of generated images successfully preserves that of the training data. However, we observe that truncation, a technique used to generate higher quality images during inference, exacerbates racial imbalances in the data. Lastly, when examining the relationship between image quality and race, we find that the highest perceived visual quality images of a given race come from a distribution where that race is well-represented, and that annotators consistently prefer generated images of white people over those of Black people.

* ECCV 2022. Project Page: https://neerja.me/bias-gans/
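
The truncation mentioned in the abstract can be illustrated with a minimal sketch, assuming it refers to the StyleGAN-style truncation trick, in which each latent code is pulled toward the average latent by a factor `psi`. The abstract does not say which variant the authors evaluate, and the function names and toy 2-D latents below are hypothetical. The sketch only shows how truncation shrinks the spread of latent codes; it does not model a generator or the paper's measurements.

```python
def truncate(latents, psi):
    """Pull each latent toward the mean latent: w' = mean + psi * (w - mean)."""
    dim = len(latents[0])
    mean = [sum(w[i] for w in latents) / len(latents) for i in range(dim)]
    return [[mean[i] + psi * (w[i] - mean[i]) for i in range(dim)] for w in latents]


def spread(latents):
    """Average squared distance of the latents from their mean (a crude diversity measure)."""
    dim = len(latents[0])
    mean = [sum(w[i] for w in latents) / len(latents) for i in range(dim)]
    return sum(sum((w[i] - mean[i]) ** 2 for i in range(dim)) for w in latents) / len(latents)


# Toy latents (an assumption): a large cluster and a small, distant minority cluster.
latents = [[0.0, 0.0]] * 8 + [[4.0, 4.0]] * 2
truncated = truncate(latents, psi=0.5)
```

With `psi=0.5`, the spread of the truncated latents is one quarter of the original spread, and the minority cluster moves toward the majority-dominated mean.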

Visual Prompting via Image Inpainting

Sep 01, 2022

Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Globerson, Alexei A. Efros

Abstract: How does one adapt a pre-trained visual model to novel downstream tasks without task-specific finetuning or any model modification? Inspired by prompting in NLP, this paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce the output image, consistent with the given examples. We show that posing this problem as simple image inpainting (literally just filling in a hole in a concatenated visual prompt image) turns out to be surprisingly effective, provided that the inpainting algorithm has been trained on the right data. We train masked auto-encoders on a new dataset that we curated: 88k unlabeled figures sourced from academic papers on arXiv. We apply visual prompting to these pretrained models and demonstrate results on various downstream image-to-image tasks, including foreground segmentation, single object detection, colorization, edge detection, etc.

* Project page: https://yossigandelsman.github.io/visual_prompt
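
The "concatenated visual prompt image" described in the abstract can be pictured with a small sketch. It assumes a 2x2 grid layout (example input and example output on the top row, query and an empty cell on the bottom row), which the abstract itself does not specify. The function names and toy images are hypothetical, all images are assumed to have the same size, and no inpainting model is included: the sketch only builds the canvas and the mask marking the hole to be filled.

```python
def hstack(left, right):
    """Concatenate two images (lists of pixel rows) side by side."""
    return [row_l + row_r for row_l, row_r in zip(left, right)]


def build_visual_prompt(example_in, example_out, query, fill=0):
    """Return (canvas, mask); mask is 1 where an inpainting model should fill in."""
    h, w = len(query), len(query[0])
    hole = [[fill] * w for _ in range(h)]
    canvas = hstack(example_in, example_out) + hstack(query, hole)
    mask = [[0] * (2 * w) for _ in range(h)] + [[0] * w + [1] * w for _ in range(h)]
    return canvas, mask


# Toy 2x2 grayscale images; the illustrative "task" is inverting pixel values.
example_in = [[0, 255], [255, 0]]
example_out = [[255, 0], [0, 255]]
query = [[0, 0], [255, 255]]
canvas, mask = build_visual_prompt(example_in, example_out, query)
```

An inpainting model would then receive `canvas` and `mask` and be asked to complete the bottom-right cell consistently with the example pair on the top row.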

Generating Long Videos of Dynamic Scenes

Jun 09, 2022

Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei A. Efros, Tero Karras

Abstract: We present a video generation model that accurately reproduces object motion, changes in camera viewpoint, and new content that arises over time. Existing video generation methods often fail to produce new content as a function of time while maintaining consistencies expected in real environments, such as plausible dynamics and object persistence. A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency, such as a single latent code that dictates content for the entire video. On the other extreme, without long-term consistency, generated videos may morph unrealistically between different scenes. To address these limitations, we prioritize the time axis by redesigning the temporal latent representation and learning long-term consistency from data by training on longer videos. To this end, we leverage a two-phase training strategy, where we separately train using longer videos at a low resolution and shorter videos at a high resolution. To evaluate the capabilities of our model, we introduce two new benchmark datasets with explicit focus on long-term temporal dynamics.

BlobGAN: Spatially Disentangled Scene Representations

May 05, 2022

Dave Epstein, Taesung Park, Richard Zhang, Eli Shechtman, Alexei A. Efros

Abstract: We propose an unsupervised, mid-level representation for a generative model of scenes. The representation is mid-level in that it is neither per-pixel nor per-image; rather, scenes are modeled as a collection of spatial, depth-ordered "blobs" of features. Blobs are differentiably placed onto a feature grid that is decoded into an image by a generative adversarial network. Due to the spatial uniformity of blobs and the locality inherent to convolution, our network learns to associate different blobs with different entities in a scene and to arrange these blobs to capture scene layout. We demonstrate this emergent behavior by showing that, despite training without any supervision, our method enables applications such as easy manipulation of objects within a scene (e.g., moving, removing, and restyling furniture), creation of feasible scenes given constraints (e.g., plausible rooms with drawers at a particular location), and parsing of real-world images into constituent parts. On a challenging multi-category dataset of indoor scenes, BlobGAN outperforms StyleGAN2 in image quality as measured by FID. See our project page for video results and interactive demo: http://www.dave.ml/blobgan

* Project webpage available at http://www.dave.ml/blobgan

Share With Thy Neighbors: Single-View Reconstruction by Cross-Instance Consistency

Apr 21, 2022

Tom Monnier, Matthew Fisher, Alexei A. Efros, Mathieu Aubry

Abstract: Approaches to single-view reconstruction typically rely on viewpoint annotations, silhouettes, the absence of background, multiple views of the same instance, a template shape, or symmetry. We avoid all of these supervisions and hypotheses by leveraging explicitly the consistency between images of different object instances. As a result, our method can learn from large collections of unlabelled images depicting the same object category. Our main contributions are two approaches to leverage cross-instance consistency: (i) progressive conditioning, a training strategy to gradually specialize the model from category to instances in a curriculum learning fashion; (ii) swap reconstruction, a loss enforcing consistency between instances having similar shape or texture. Critical to the success of our method are also: our structured autoencoding architecture decomposing an image into explicit shape, texture, pose, and background; an adapted formulation of differential rendering; and a new optimization scheme alternating between 3D and pose learning. We compare our approach, UNICORN, both on the diverse synthetic ShapeNet dataset (the classical benchmark for methods requiring multiple views as supervision) and on standard real-image benchmarks (Pascal3D+ Car, CUB-200) for which most methods require known templates and silhouette annotations. We also showcase applicability to more challenging real-world collections (CompCars, LSUN), where silhouettes are not available and images are not cropped around the object.

* Project webpage with code and videos: http://imagine.enpc.fr/~monniert/UNICORN/

Dataset Distillation by Matching Training Trajectories

Mar 22, 2022

George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, Jun-Yan Zhu

Abstract: Dataset distillation is the task of synthesizing a small dataset such that a model trained on the synthetic set will match the test accuracy of the model trained on the full dataset. In this paper, we propose a new formulation that optimizes our distilled data to guide networks to a similar state as those trained on real data across many training steps. Given a network, we train it for several iterations on our distilled data and optimize the distilled data with respect to the distance between the synthetically trained parameters and the parameters trained on real data. To efficiently obtain the initial and target network parameters for large-scale datasets, we pre-compute and store training trajectories of expert networks trained on the real dataset. Our method handily outperforms existing methods and also allows us to distill higher-resolution visual data.

* CVPR 2022. Website: https://georgecazenavette.github.io/mtt-distillation/ Code: https://github.com/GeorgeCazenavette/mtt-distillation
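
The distance described in the abstract, between parameters trained on distilled data and parameters taken from a stored expert trajectory, might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the normalization by how far the expert moved, the plain gradient-descent update, and the toy quadratic "distilled data" gradient are all assumptions, and every name is hypothetical. The outer step that updates the distilled data by differentiating through the training steps is omitted.

```python
def trajectory_matching_loss(student, expert_start, expert_target):
    """Squared distance from student to expert target, normalized by how far the expert moved."""
    num = sum((s - t) ** 2 for s, t in zip(student, expert_target))
    den = sum((a - t) ** 2 for a, t in zip(expert_start, expert_target))
    return num / den


def sgd_steps(params, grad_fn, lr, steps):
    """Run plain gradient descent for a few steps and return the new parameters."""
    for _ in range(steps):
        grads = grad_fn(params)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params


# Two checkpoints from a pre-computed expert trajectory (toy values).
expert_start = [0.0, 0.0]
expert_target = [1.0, 2.0]

# Toy stand-in for training on distilled data: a quadratic loss centered at `distilled`.
distilled = [1.0, 2.0]
student = sgd_steps(expert_start, lambda p: [pi - di for pi, di in zip(p, distilled)], lr=0.5, steps=3)
loss = trajectory_matching_loss(student, expert_start, expert_target)
```

A loss of 1.0 means the student has not moved from the expert's starting checkpoint, and 0.0 means it has reached the expert's target checkpoint exactly.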

Learning Pixel Trajectories with Multiscale Contrastive Random Walks

Jan 20, 2022

Zhangxing Bian, Allan Jabri, Alexei A. Efros, Andrew Owens

Abstract: A range of video modeling tasks, from optical flow to multiple object tracking, share the same fundamental challenge: establishing space-time correspondence. Yet, approaches that dominate each space differ. We take a step towards bridging this gap by extending the recent contrastive random walk formulation to much denser, pixel-level space-time graphs. The main contribution is introducing hierarchy into the search problem by computing the transition matrix between two frames in a coarse-to-fine manner, forming a multiscale contrastive random walk when extended in time. This establishes a unified technique for self-supervised learning of optical flow, keypoint tracking, and video object segmentation. Experiments demonstrate that, for each of these tasks, the unified model achieves performance competitive with strong self-supervised approaches specific to that task. Project site: https://jasonbian97.github.io/flowwalk
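
To make the transition matrix between two frames concrete, here is a minimal single-scale sketch of a contrastive random walk, assuming dot-product affinities, a softmax temperature of 0.1, and a forward-then-backward (palindrome) walk whose loss rewards walkers that return to their starting node. These choices and all names are assumptions for illustration; the coarse-to-fine, multiscale computation that the abstract names as the main contribution is not implemented here.

```python
import math


def transition_matrix(feats_a, feats_b, temperature=0.1):
    """Row-stochastic matrix: softmax over dot-product affinities from nodes in A to nodes in B."""
    rows = []
    for fa in feats_a:
        logits = [sum(x * y for x, y in zip(fa, fb)) / temperature for fb in feats_b]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def cycle_loss(frames, temperature=0.1):
    """Walk forward through the frames and back, then penalize walkers that do not return home."""
    path = frames + frames[-2::-1]  # e.g. [f0, f1, f2, f1, f0]
    walk = transition_matrix(path[0], path[1], temperature)
    for src, dst in zip(path[1:], path[2:]):
        walk = matmul(walk, transition_matrix(src, dst, temperature))
    return -sum(math.log(walk[i][i]) for i in range(len(walk))) / len(walk)


# Toy frames with two nodes each: distinctive features versus ambiguous ones.
distinct = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
ambiguous = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
good, bad = cycle_loss(distinct), cycle_loss(ambiguous)
```

Distinctive features let each walker find its way back, so `good` is close to zero, while ambiguous features spread the walk evenly over both nodes and `bad` equals log 2.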

Hallucinating Pose-Compatible Scenes

Dec 13, 2021

Tim Brooks, Alexei A. Efros

Abstract: What does human pose tell us about a scene? We propose a task to answer this question: given human pose as input, hallucinate a compatible scene. Subtle cues captured by human pose -- action semantics, environment affordances, object interactions -- provide surprising insight into which scenes are compatible. We present a large-scale generative adversarial network for pose-conditioned scene generation. We significantly scale the size and complexity of training data, curating a massive meta-dataset containing over 19 million frames of humans in everyday environments. We double the capacity of our model with respect to StyleGAN2 to handle such complex data, and design a pose conditioning mechanism that drives our model to learn the nuanced relationship between pose and scene. We leverage our trained model for various applications: hallucinating pose-compatible scene(s) with or without humans, visualizing incompatible scenes and poses, placing a person from one generated image into another scene, and animating pose. Our model produces diverse samples and outperforms pose-conditioned StyleGAN2 and Pix2Pix baselines in terms of accurate human placement (percent of correct keypoints) and image quality (Fréchet inception distance).

Learning Co-segmentation by Segment Swapping for Retrieval and Discovery

Oct 29, 2021

Xi Shen, Alexei A. Efros, Armand Joulin, Mathieu Aubry

Abstract: The goal of this work is to efficiently identify visually similar patterns from a pair of images, e.g. identifying an artwork detail copied between an engraving and an oil painting, or matching a night-time photograph with its daytime counterpart. Lack of training data is a key challenge for this co-segmentation task. We present a simple yet surprisingly effective approach to overcome this difficulty: we generate synthetic training pairs by selecting object segments in an image and copy-pasting them into another image. We then learn to predict the repeated object masks. We find that it is crucial to predict the correspondences as an auxiliary task and to use Poisson blending and style transfer on the training pairs to generalize on real data. We analyse results with two deep architectures relevant to our joint image analysis task: a transformer-based architecture and Sparse Nc-Net, a recent network designed to predict coarse correspondences using 4D convolutions. We show our approach provides clear improvements for artwork detail retrieval on the Brueghel dataset and achieves competitive performance on two place recognition benchmarks, Tokyo247 and Pitts30K. We then demonstrate the potential of our approach by performing object discovery on the Internet object discovery dataset and the Brueghel dataset. Our code and data are available at http://imagine.enpc.fr/~shenx/SegSwap/.
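
The copy-paste step for generating synthetic training pairs can be sketched as below, assuming images are plain 2-D lists of pixel values and the segment is given as a binary mask. The function name, offsets, and toy images are hypothetical, and the Poisson blending and style transfer that the abstract calls crucial for generalizing to real data are left out.

```python
def paste_segment(source, segment_mask, target, top, left):
    """Copy masked pixels of `source` into a copy of `target` at offset (top, left).

    Returns the composite image and the mask of pasted pixels in target coordinates.
    """
    composite = [row[:] for row in target]
    pasted = [[0] * len(target[0]) for _ in target]
    for r, mask_row in enumerate(segment_mask):
        for c, inside in enumerate(mask_row):
            rr, cc = r + top, c + left
            # Skip pixels outside the segment or falling outside the target image.
            if inside and 0 <= rr < len(target) and 0 <= cc < len(target[0]):
                composite[rr][cc] = source[r][c]
                pasted[rr][cc] = 1
    return composite, pasted


# Toy data: a 2x2 "object" (value 9) in the source, pasted into an empty 4x4 target.
source = [[9, 9, 1], [9, 9, 1], [1, 1, 1]]
segment_mask = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
target = [[0] * 4 for _ in range(4)]
composite, target_mask = paste_segment(source, segment_mask, target, top=1, left=2)
```

The pair `(source, composite)` together with `(segment_mask, target_mask)` then forms one synthetic example in which the repeated object masks are known by construction.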
