Alexei A. Efros

Test-Time Training on Video Streams

Jul 12, 2023

Renhao Wang, Yu Sun, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang

Abstract: Prior work has established test-time training (TTT) as a general framework to further improve a trained model at test time. Before making a prediction on each test instance, the model is trained on the same instance using a self-supervised task, such as image reconstruction with masked autoencoders. We extend TTT to the streaming setting, where multiple test instances - video frames in our case - arrive in temporal order. Our extension is online TTT: The current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before. Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets. The relative improvement is 45% and 66% for instance and panoptic segmentation. Surprisingly, online TTT also outperforms its offline variant that accesses more information, training on all frames from the entire test video regardless of temporal order. This differs from previous findings using synthetic videos. We conceptualize locality as the advantage of online over offline TTT. We analyze the role of locality with ablations and a theory based on bias-variance trade-off.

* Project website with videos, dataset and code: https://video-ttt.github.io/
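
The online loop described in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is a single linear map `W`, the self-supervised task is a masked reconstruction loss on flat vectors, and the names `online_ttt`, `masked_reconstruction_grad`, `window`, `steps`, and `lr` are all hypothetical. The only parts taken from the abstract are the structure of the loop: weights carry over from the previous frame, and each update uses the current frame plus a small window of frames immediately before it.

```python
from collections import deque

import numpy as np


def masked_reconstruction_grad(W, frames, mask):
    """Gradient w.r.t. W of the mean over frames of ||(x * mask) @ W - x||^2."""
    X = np.stack(frames)          # (n, d) window of frames
    Xm = X * mask                 # masked inputs (mask broadcasts over frames)
    residual = Xm @ W - X         # reconstruction error per frame
    return 2.0 * Xm.T @ residual / len(frames)


def online_ttt(stream, W0, window=4, steps=5, lr=0.01, seed=0):
    """Toy online test-time training over a stream of 1-D 'frames'."""
    rng = np.random.default_rng(seed)
    W = W0.copy()                 # start from the (assumed) pretrained weights
    recent = deque(maxlen=window) # current frame plus the few frames before it
    errors = []
    for frame in stream:
        recent.append(frame)
        for _ in range(steps):
            # Fresh random mask for each self-supervised gradient step.
            mask = (rng.random(frame.shape) > 0.5).astype(frame.dtype)
            W -= lr * masked_reconstruction_grad(W, list(recent), mask)
        # W is not reset here: the next frame starts from the adapted weights.
        # A real system would now run its main-task head; this toy only
        # records the reconstruction error of the current frame.
        errors.append(float(np.mean((frame @ W - frame) ** 2)))
    return W, errors
```

Setting `window` to the full video length and shuffling the frames would correspond loosely to the offline variant mentioned in the abstract; keeping it small is what makes the updates local in time.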

Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives

Jul 11, 2023

Tom Monnier, Jake Austin, Angjoo Kanazawa, Alexei A. Efros, Mathieu Aubry

Abstract: Given a set of calibrated images of a scene, we present an approach that produces a simple, compact, and actionable 3D world representation by means of 3D primitives. While many approaches focus on recovering high-fidelity 3D scenes, we focus on parsing a scene into mid-level 3D representations made of a small set of textured primitives. Such representations are interpretable, easy to manipulate and suited for physics-based simulations. Moreover, unlike existing primitive decomposition methods that rely on 3D input data, our approach operates directly on images through differentiable rendering. Specifically, we model primitives as textured superquadric meshes and optimize their parameters from scratch with an image rendering loss. We highlight the importance of modeling transparency for each primitive, which is critical for optimization and also enables handling varying numbers of primitives. We show that the resulting textured primitives faithfully reconstruct the input images and accurately model the visible 3D points, while providing amodal shape completions of unseen object regions. We compare our approach to the state of the art on diverse scenes from DTU, and demonstrate its robustness on real-life captures from BlendedMVS and Nerfstudio. We also showcase how our results can be used to effortlessly edit a scene or perform physical simulations. Code and video results are available at https://www.tmonnier.com/DBW.

* Project webpage with code and videos: https://www.tmonnier.com/DBW

Rosetta Neurons: Mining the Common Units in a Model Zoo

Jun 16, 2023

Amil Dravid, Yossi Gandelsman, Alexei A. Efros, Assaf Shocher

Abstract: Do different neural networks, trained for various vision tasks, share some common representations? In this paper, we demonstrate the existence of common features we call "Rosetta Neurons" across a range of models with different architectures, different tasks (generative and discriminative), and different types of supervision (class-supervised, text-supervised, self-supervised). We present an algorithm for mining a dictionary of Rosetta Neurons across several popular vision models: Class Supervised-ResNet50, DINO-ResNet50, DINO-ViT, MAE, CLIP-ResNet50, BigGAN, StyleGAN-2, StyleGAN-XL. Our findings suggest that certain visual concepts and structures are inherently embedded in the natural world and can be learned by different models regardless of the specific task or architecture, and without the use of semantic labels. We can visualize shared concepts directly due to generative models included in our analysis. The Rosetta Neurons facilitate model-to-model translation enabling various inversion-based manipulations, including cross-class alignments, shifting, zooming, and more, without the need for specialized training.

* Project page: https://yossigandelsman.github.io/rosetta_neurons/
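
As a rough illustration of what "mining common units" across two models could look like, the sketch below pairs units whose activations are most correlated over a shared set of inputs and keeps only mutual best matches. The matching rule (Pearson correlation plus mutual nearest neighbors) and the function name `match_units` are assumptions made for this example; the abstract does not specify the algorithm, so this should not be read as the paper's procedure.

```python
import numpy as np


def match_units(acts_a, acts_b):
    """Pair units of two models by activation correlation.

    acts_a: (n_inputs, units_a) activations of model A on a shared input set.
    acts_b: (n_inputs, units_b) activations of model B on the same inputs.
    Returns a list of (unit_a, unit_b, correlation) for mutual best matches.
    """
    # Standardize each unit so the dot product below is a Pearson correlation.
    za = (acts_a - acts_a.mean(axis=0)) / (acts_a.std(axis=0) + 1e-8)
    zb = (acts_b - acts_b.mean(axis=0)) / (acts_b.std(axis=0) + 1e-8)
    corr = za.T @ zb / len(acts_a)        # (units_a, units_b)

    best_b_for_a = corr.argmax(axis=1)    # best partner in B for each A unit
    best_a_for_b = corr.argmax(axis=0)    # best partner in A for each B unit

    # Keep a pair only if each unit is the other's best match.
    return [
        (i, int(j), float(corr[i, j]))
        for i, j in enumerate(best_b_for_a)
        if best_a_for_b[j] == i
    ]
```

Requiring the match to be mutual discards units that merely happen to be the least-bad partner of something, which keeps the resulting dictionary small and conservative.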

Evaluating Data Attribution for Text-to-Image Models

Jun 15, 2023

Sheng-Yu Wang, Alexei A. Efros, Jun-Yan Zhu, Richard Zhang

Abstract: While large text-to-image models are able to synthesize "novel" images, these images are necessarily a reflection of the training data. The problem of data attribution in such models -- which of the images in the training set are most responsible for the appearance of a given generated image -- is a difficult yet important one. As an initial step toward this problem, we evaluate attribution through "customization" methods, which tune an existing large-scale model toward a given exemplar object or style. Our key insight is that this allows us to efficiently create synthetic images that are computationally influenced by the exemplar by construction. With our new dataset of such exemplar-influenced images, we are able to evaluate various data attribution algorithms and different possible feature spaces. Furthermore, by training on our dataset, we can tune standard models, such as DINO, CLIP, and ViT, toward the attribution problem. Even though the procedure is tuned towards small exemplar sets, we show generalization to larger sets. Finally, by taking into account the inherent uncertainty of the problem, we can assign soft attribution scores over a set of training images.

* Project page: https://peterwang512.github.io/GenDataAttribution
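
One simple way to picture "soft attribution scores over a set of training images" is a softmax over feature similarities. The sketch below assumes precomputed feature vectors, cosine similarity, and a `temperature` parameter; these choices and the name `soft_attribution` are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np


def soft_attribution(query_feat, train_feats, temperature=0.1):
    """Turn feature similarities into a probability distribution.

    query_feat:  (d,) feature of the generated image.
    train_feats: (n, d) features of candidate training images.
    Returns an (n,) array of non-negative scores that sum to 1.
    """
    q = query_feat / np.linalg.norm(query_feat)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    logits = (t @ q) / temperature   # cosine similarity, sharpened
    logits = logits - logits.max()   # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()
```

A lower `temperature` concentrates the score on the closest training images, while a higher one spreads it out, which is one way to express uncertainty about which images were influential.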

Diffusion Self-Guidance for Controllable Image Generation

Jun 11, 2023

Dave Epstein, Allan Jabri, Ben Poole, Alexei A. Efros, Aleksander Holynski

Abstract: Large-scale generative models are capable of producing high-quality images from detailed text descriptions. However, many aspects of an image are difficult or impossible to convey through text. We introduce self-guidance, a method that provides greater control over generated images by guiding the internal representations of diffusion models. We demonstrate that properties such as the shape, location, and appearance of objects can be extracted from these representations and used to steer sampling. Self-guidance works similarly to classifier guidance, but uses signals present in the pretrained model itself, requiring no additional models or training. We show how a simple set of properties can be composed to perform challenging image manipulations, such as modifying the position or size of objects, merging the appearance of objects in one image with the layout of another, composing objects from many images into one, and more. We also show that self-guidance can be used to edit real images. For results and an interactive demo, see our project page at https://dave.ml/selfguidance/

* Project page at https://dave.ml/selfguidance/

Generalizing Dataset Distillation via Deep Generative Prior

May 03, 2023

George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, Jun-Yan Zhu

Abstract: Dataset Distillation aims to distill an entire dataset's knowledge into a few synthetic images. The idea is to synthesize a small number of synthetic data points that, when given to a learning algorithm as training data, result in a model approximating one trained on the original data. Despite recent progress in the field, existing dataset distillation methods fail to generalize to new architectures and scale to high-resolution datasets. To overcome the above issues, we propose to use the learned prior from pre-trained deep generative models to synthesize the distilled data. To achieve this, we present a new optimization algorithm that distills a large number of images into a few intermediate feature vectors in the generative model's latent space. Our method augments existing techniques, significantly improving cross-architecture generalization in all settings.

* CVPR 2023; project page at https://georgecazenavette.github.io/glad; code at https://github.com/GeorgeCazenavette/glad

Putting People in Their Place: Affordance-Aware Human Insertion into Scenes

Apr 27, 2023

Sumith Kulal, Tim Brooks, Alex Aiken, Jiajun Wu, Jimei Yang, Jingwan Lu, Alexei A. Efros, Krishna Kumar Singh

Abstract: We study the problem of inferring scene affordances by presenting a method for realistically inserting people into scenes. Given a scene image with a marked region and an image of a person, we insert the person into the scene while respecting the scene affordances. Our model can infer the set of realistic poses given the scene context, re-pose the reference person, and harmonize the composition. We set up the task in a self-supervised fashion by learning to re-pose humans in video clips. We train a large-scale diffusion model on a dataset of 2.4M video clips that produces diverse plausible poses while respecting the scene context. Given the learned human-scene composition, our model can also hallucinate realistic people and scenes when prompted without conditioning and also enables interactive editing. A quantitative evaluation shows that our method synthesizes more realistic human appearance and more natural human-scene interactions than prior work.

* CVPR 2023. Project page with code: https://sumith1896.github.io/affordance-insertion/

Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions

Mar 22, 2023

Ayaan Haque, Matthew Tancik, Alexei A. Efros, Aleksander Holynski, Angjoo Kanazawa

Abstract: We propose a method for editing NeRF scenes with text-instructions. Given a NeRF of a scene and the collection of images used to reconstruct it, our method uses an image-conditioned diffusion model (InstructPix2Pix) to iteratively edit the input images while optimizing the underlying scene, resulting in an optimized 3D scene that respects the edit instruction. We demonstrate that our proposed method is able to edit large-scale, real-world scenes, and is able to accomplish more realistic, targeted edits than prior work.

* Project website: https://instruct-nerf2nerf.github.io

Internet Explorer: Targeted Representation Learning on the Open Web

Feb 27, 2023

Alexander C. Li, Ellis Brown, Alexei A. Efros, Deepak Pathak

Abstract: Modern vision models typically rely on fine-tuning general-purpose models pre-trained on large, static datasets. These general-purpose models only capture the knowledge within their pre-training datasets, which are tiny, out-of-date snapshots of the Internet -- where billions of images are uploaded each day. We suggest an alternate approach: rather than hoping our static datasets transfer to our desired tasks after large-scale pre-training, we propose dynamically utilizing the Internet to quickly train a small-scale model that does extremely well on the task at hand. Our approach, called Internet Explorer, explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desired target dataset. It cycles between searching for images on the Internet with text queries, self-supervised training on downloaded images, determining which images were useful, and prioritizing what to search for next. We evaluate Internet Explorer across several datasets and show that it outperforms or matches CLIP oracle performance by using just a single GPU desktop to actively query the Internet for 30--40 hours. Results, visualizations, and videos at https://internet-explorer-ssl.github.io/

* Website at https://internet-explorer-ssl.github.io/
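
The four-step cycle in the abstract above (search, train, score usefulness, reprioritize) can be sketched as a small loop. Everything concrete here is an assumption made for illustration: `search_fn` stands in for a real image search and returns feature vectors, usefulness is measured as cosine similarity to the mean target feature, the optional `train_fn` is a placeholder for self-supervised training, and the first pass simply tries every query once before sampling by estimated usefulness.

```python
import numpy as np


def explore(concepts, search_fn, target_feats, rounds=20, temp=0.1,
            seed=0, train_fn=None):
    """Toy search/train/score/prioritize loop over text queries."""
    rng = np.random.default_rng(seed)
    target = target_feats.mean(axis=0)
    target = target / np.linalg.norm(target)
    est = np.zeros(len(concepts))           # estimated usefulness per query

    for r in range(rounds):
        if r < len(concepts):
            k = r                           # initial sweep: try each query once
        else:
            # Prioritize: sample queries in proportion to estimated usefulness.
            logits = est / temp
            p = np.exp(logits - logits.max())
            p = p / p.sum()
            k = int(rng.choice(len(concepts), p=p))

        images = search_fn(concepts[k], rng)  # (n, d) hypothetical features
        if train_fn is not None:
            train_fn(images)                  # placeholder for SSL training

        # Score: how close are the downloaded images to the target dataset?
        mean_feat = images.mean(axis=0)
        reward = float(mean_feat @ target / (np.linalg.norm(mean_feat) + 1e-8))
        est[k] += 0.5 * (reward - est[k])     # running estimate of usefulness

    return {c: float(e) for c, e in zip(concepts, est)}
```

Sampling by a softmax over running estimates keeps some exploration of weaker queries while spending most of the budget on those that have returned relevant images so far.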

InstructPix2Pix: Learning to Follow Image Editing Instructions

Nov 17, 2022

Tim Brooks, Aleksander Holynski, Alexei A. Efros

Abstract: We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models -- a language model (GPT-3) and a text-to-image model (Stable Diffusion) -- to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.

* Project page: https://www.timothybrooks.com/instruct-pix2pix
