Antonio Torralba

Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models

Jan 03, 2024

Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fidler, Karsten Kreis

Abstract:Text-guided diffusion models have revolutionized image and video generation and have also been successfully used for optimization-based 3D object synthesis. Here, we instead focus on the underexplored text-to-4D setting and synthesize dynamic, animated 3D objects using score distillation methods with an additional temporal dimension. Compared to previous work, we pursue a novel compositional generation-based approach, and combine text-to-image, text-to-video, and 3D-aware multiview diffusion models to provide feedback during 4D object optimization, thereby simultaneously enforcing temporal consistency, high-quality visual appearance and realistic geometry. Our method, called Align Your Gaussians (AYG), leverages dynamic 3D Gaussian Splatting with deformation fields as 4D representation. Crucial to AYG is a novel method to regularize the distribution of the moving 3D Gaussians and thereby stabilize the optimization and induce motion. We also propose a motion amplification mechanism as well as a new autoregressive synthesis scheme to generate and combine multiple 4D sequences for longer generation. These techniques allow us to synthesize vivid dynamic scenes, outperform previous work qualitatively and quantitatively and achieve state-of-the-art text-to-4D performance. Due to the Gaussian 4D representation, different 4D animations can be seamlessly combined, as we demonstrate. AYG opens up promising avenues for animation, simulation and digital content creation as well as synthetic data generation.

* Project page: https://research.nvidia.com/labs/toronto-ai/AlignYourGaussians/

A Vision Check-up for Language Models

Jan 03, 2024

Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, Antonio Torralba

Abstract:What does learning to model relationships between strings teach large language models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.

Customizing Motion in Text-to-Video Diffusion Models

Dec 07, 2023

Joanna Materzynska, Josef Sivic, Eli Shechtman, Antonio Torralba, Richard Zhang, Bryan Russell

Abstract:We introduce an approach for augmenting text-to-video generation models with customized motions, extending their capabilities beyond the motions depicted in the original training data. By leveraging a few video samples demonstrating specific movements as input, our method learns and generalizes the input motion patterns for diverse, text-specified scenarios. Our contributions are threefold. First, to achieve our results, we finetune an existing text-to-video model to learn a novel mapping between the depicted motion in the input examples to a new unique token. To avoid overfitting to the new custom motion, we introduce an approach for regularization over videos. Second, by leveraging the motion priors in a pretrained model, our method can produce novel videos featuring multiple people doing the custom motion, and can invoke the motion in combination with other motions. Furthermore, our approach extends to the multimodal customization of motion and appearance of individualized subjects, enabling the generation of videos featuring unique characters and distinct motions. Third, to validate our method, we introduce an approach for quantitatively evaluating the learned custom motion and perform a systematic ablation study. We show that our method significantly outperforms prior appearance-based customization approaches when extended to the motion customization task.

* Project page: https://joaanna.github.io/customizing_motion/

Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models

Nov 27, 2023

Rohit Gandikota, Joanna Materzynska, Tingrui Zhou, Antonio Torralba, David Bau

Abstract:We present a method to create interpretable concept sliders that enable precise control over attributes in image generations from diffusion models. Our approach identifies a low-rank parameter direction corresponding to one concept while minimizing interference with other attributes. A slider is created using a small set of prompts or sample images; thus slider directions can be created for either textual or visual concepts. Concept Sliders are plug-and-play: they can be composed efficiently and continuously modulated, enabling precise control over image generation. In quantitative experiments comparing to previous editing techniques, our sliders exhibit stronger targeted edits with lower interference. We showcase sliders for weather, age, styles, and expressions, as well as slider compositions. We show how sliders can transfer latents from StyleGAN for intuitive editing of visual concepts for which textual description is difficult. We also find that our method can help address persistent quality issues in Stable Diffusion XL including repair of object deformations and fixing distorted hands. Our code, data, and trained sliders are available at https://sliders.baulab.info/

ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning

Sep 28, 2023

Qiao Gu, Alihusein Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa (+6 more)

Abstract:For robots to perform a wide variety of tasks, they require a 3D representation of the world that is semantically rich, yet compact and efficient for task-driven perception and planning. Recent approaches have attempted to leverage features from large vision-language models to encode semantics in 3D representations. However, these approaches tend to produce maps with per-point feature vectors, which do not scale well in larger environments, nor do they contain semantic spatial relationships between entities in the environment, which are useful for downstream planning. In this work, we propose ConceptGraphs, an open-vocabulary graph-structured representation for 3D scenes. ConceptGraphs is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association. The resulting representations generalize to novel semantic classes, without the need to collect large 3D datasets or finetune models. We demonstrate the utility of this representation through a number of downstream planning tasks that are specified through abstract (language) prompts and require complex reasoning over spatial and semantic concepts. (Project page: https://concept-graphs.github.io/ Explainer video: https://youtu.be/mRhNkQwRYnc )

A Function Interpretation Benchmark for Evaluating Interpretability Methods

Sep 07, 2023

Sarah Schwettmann, Tamar Rott Shaham, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba

Abstract:Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions are procedurally constructed across textual and numeric domains, and involve a range of real-world complexities, including noise, composition, approximation, and bias. We evaluate new and existing methods that use language models (LMs) to produce code-based and language descriptions of function behavior. We find that an off-the-shelf LM augmented with only black-box access to functions can sometimes infer their structure, acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, LM-based descriptions tend to capture global function behavior and miss local corruptions. These results show that FIND will be useful for characterizing the performance of more sophisticated interpretability methods before they are applied to real-world models.

* 25 pages, 7 figures

Follow Anything: Open-set detection, tracking, and following in real-time

Aug 10, 2023

Alaa Maalouf, Ninad Jadhav, Krishna Murthy Jatavallabhula, Makram Chahine, Daniel M. Vogt, Robert J. Wood, Antonio Torralba, Daniela Rus

Abstract:Tracking and following objects of interest is critical to several robotics use cases, ranging from industrial automation to logistics and warehousing, to healthcare and security. In this paper, we present a robotic system to detect, track, and follow any object in real-time. Our approach, dubbed "follow anything" (FAn), is an open-vocabulary and multimodal model: it is not restricted to concepts seen at training time and can be applied to novel classes at inference time using text, images, or click queries. Leveraging rich visual descriptors from large-scale pre-trained models (foundation models), FAn can detect and segment objects by matching multimodal queries (text, images, clicks) against an input image sequence. These detected and segmented objects are tracked across image frames, all while accounting for occlusion and object re-emergence. We demonstrate FAn on a real-world robotic system (a micro aerial vehicle) and report its ability to seamlessly follow the objects of interest in a real-time control loop. FAn can be deployed on a laptop with a lightweight (6-8 GB) graphics card, achieving a throughput of 6-20 frames per second. To enable rapid adoption, deployment, and extensibility, we open-source all our code on our project webpage at https://github.com/alaamaalouf/FollowAnything . We also encourage the reader to watch our 5-minute explainer video at https://www.youtube.com/watch?v=6Mgt3EPytrw .

* Project webpage: https://github.com/alaamaalouf/FollowAnything Explainer video: https://www.youtube.com/watch?v=6Mgt3EPytrw

Multimodal Neurons in Pretrained Text-Only Transformers

Aug 03, 2023

Sarah Schwettmann, Neil Chowdhury, Antonio Torralba

Abstract:Language models demonstrate remarkable capacity to generalize representations learned in one modality to downstream tasks in other modalities. Can we trace this ability to individual neurons? We study the case where a frozen text transformer is augmented with vision using a self-supervised visual encoder and a single linear projection learned on an image-to-text task. Outputs of the projection layer are not immediately decodable into language describing image content; instead, we find that translation between modalities occurs deeper within the transformer. We introduce a procedure for identifying "multimodal neurons" that convert visual representations into corresponding text, and decoding the concepts they inject into the model's residual stream. In a series of experiments, we show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.

DreamTeacher: Pretraining Image Backbones with Deep Generative Models

Jul 14, 2023

Daiqing Li, Huan Ling, Amlan Kar, David Acuna, Seung Wook Kim, Karsten Kreis, Antonio Torralba, Sanja Fidler

Abstract:In this work, we introduce a self-supervised feature representation learning framework DreamTeacher that utilizes generative networks for pre-training downstream image backbones. We propose to distill knowledge from a trained generative model into standard image backbones that have been well engineered for specific perception tasks. We investigate two types of knowledge distillation: 1) distilling learned generative features onto target image backbones as an alternative to pretraining these backbones on large labeled datasets such as ImageNet, and 2) distilling labels obtained from generative networks with task heads onto logits of target backbones. We perform extensive analyses on multiple generative models, dense prediction benchmarks, and several pre-training regimes. We empirically find that our DreamTeacher significantly outperforms existing self-supervised representation learning approaches across the board. Unsupervised ImageNet pre-training with DreamTeacher leads to significant improvements over ImageNet classification pre-training on downstream datasets, showcasing generative models, and diffusion generative models specifically, as a promising approach to representation learning on large, diverse datasets without requiring manual annotation.

* Project page: https://research.nvidia.com/labs/toronto-ai/DreamTeacher/

Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models

Jun 08, 2023

Nan Liu, Yilun Du, Shuang Li, Joshua B. Tenenbaum, Antonio Torralba

Abstract:Text-to-image generative models have enabled high-resolution image synthesis across different domains, but require users to specify the content they wish to generate. In this paper, we consider the inverse problem -- given a collection of different images, can we discover the generative concepts that represent each image? We present an unsupervised approach to discover generative concepts from a collection of images, disentangling different art styles in paintings, objects, and lighting from kitchen scenes, and discovering image classes given ImageNet images. We show how such generative concepts can accurately represent the content of images, be recombined and composed to generate new artistic and hybrid images, and be further used as a representation for downstream classification tasks.

* Project Webpage: https://energy-based-model.github.io/unsupervised-concept-discovery/
