Varun Jampani

Alchemist: Parametric Control of Material Properties with Diffusion Models

Dec 05, 2023

Prafull Sharma, Varun Jampani, Yuanzhen Li, Xuhui Jia, Dmitry Lagun, Fredo Durand, William T. Freeman, Mark Matthews

Abstract: We propose a method to control material attributes of objects like roughness, metallic, albedo, and transparency in real images. Our method capitalizes on the generative prior of text-to-image models known for photorealism, employing a scalar value and instructions to alter low-level material properties. Addressing the lack of datasets with controlled material attributes, we generated an object-centric synthetic dataset with physically-based materials. Fine-tuning a modified pre-trained text-to-image model on this synthetic dataset enables us to edit material properties in real-world images while preserving all other attributes. We show the potential application of our model to material-edited NeRFs.

UniGS: Unified Representation for Image Generation and Segmentation

Dec 04, 2023

Lu Qi, Lehan Yang, Weidong Guo, Yu Xu, Bo Du, Varun Jampani, Ming-Hsuan Yang

Abstract: This paper introduces a novel unified representation of diffusion models for image generation and segmentation. Specifically, we use a colormap to represent entity-level masks, addressing the challenge of varying entity numbers while aligning the representation closely with the image RGB domain. Two novel modules, the location-aware color palette and the progressive dichotomy module, are proposed to support our mask representation. On the one hand, a location-aware palette guarantees the colors' consistency to entities' locations. On the other hand, the progressive dichotomy module can efficiently decode the synthesized colormap to high-quality entity-level masks in a depth-first binary search without knowing the cluster numbers. To tackle the issue of lacking large-scale segmentation training data, we employ an inpainting pipeline and then improve the flexibility of diffusion models across various tasks, including inpainting, image synthesis, referring segmentation, and entity segmentation. Comprehensive experiments validate the efficiency of our approach, demonstrating segmentation mask quality comparable to the state of the art and adaptability to multiple tasks. The code will be released at https://github.com/qqlu/Entity.

One-Shot Open Affordance Learning with Foundation Models

Nov 29, 2023

Gen Li, Deqing Sun, Laura Sevilla-Lara, Varun Jampani

Abstract: We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category, but is expected to identify novel objects and affordances. While vision-language models excel at recognizing novel objects and scenes, they often struggle to understand finer levels of granularity such as affordances. To handle this issue, we conduct a comprehensive analysis of existing foundation models, to explore their inherent understanding of affordances and assess the potential for data-limited affordance learning. We then propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings. Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data, and exhibits reasonable generalization capability on unseen objects and affordances.

Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence

Nov 28, 2023

Junyi Zhang, Charles Herrmann, Junhwa Hur, Eric Chen, Varun Jampani, Deqing Sun, Ming-Hsuan Yang

Abstract: While pre-trained large-scale vision models have shown significant promise for semantic correspondence, their features often struggle to grasp the geometry and orientation of instances. This paper identifies the importance of being geometry-aware for semantic correspondence and reveals a limitation of the features of current foundation models under simple post-processing. We show that incorporating this information can markedly enhance semantic correspondence performance with simple but effective solutions in both zero-shot and supervised settings. We also construct a new challenging benchmark for semantic correspondence built from an existing animal pose estimation dataset, for both pre-training and validating models. Our method achieves a PCK@0.10 score of 64.2 (zero-shot) and 85.6 (supervised) on the challenging SPair-71k dataset, outperforming the state-of-the-art by 4.3p and 11.0p absolute gains, respectively. Our code and datasets will be publicly available.
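
For readers unfamiliar with the metric quoted in this abstract: PCK@0.10 (percentage of correct keypoints) counts a predicted keypoint as correct when it lands within 0.10 times a reference length of its ground-truth match. The following is a minimal sketch, assuming the reference length is the longer side of the object bounding box; conventions differ between benchmarks, and this is not the authors' evaluation code. The function name `pck` and the sample coordinates are hypothetical.

```python
import numpy as np


def pck(pred, gt, bbox_hw, alpha=0.10):
    """Fraction of keypoints whose error is at most alpha * max(bbox height, width).

    pred, gt: arrays of shape (N, 2) holding predicted and ground-truth (x, y) points.
    bbox_hw:  (height, width) of the object bounding box (assumed reference length).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    threshold = alpha * max(bbox_hw)
    # Euclidean distance between each prediction and its ground-truth keypoint.
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= threshold))


# Hypothetical example: three keypoints on an object with a 100 x 80 bounding box,
# so the threshold is 10 pixels. The errors are 2, 20, and 5 pixels.
pred = [[10, 10], [50, 50], [90, 90]]
gt = [[12, 10], [50, 70], [90, 95]]
score = pck(pred, gt, bbox_hw=(100, 80))
print(f"PCK@0.10 = {score:.3f}")
```

Two of the three sample keypoints fall inside the 10-pixel threshold, so the score is 2/3. Reported benchmark numbers are usually this fraction expressed as a percentage and averaged over many image pairs.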

* Project page: https://telling-left-from-right.github.io/

Exploring Attribute Variations in Style-based GANs using Diffusion Models

Nov 27, 2023

Rishubh Parihar, Prasanna Balaji, Raghav Magazine, Sarthak Vora, Tejan Karmali, Varun Jampani, R. Venkatesh Babu

Abstract: Existing attribute editing methods treat semantic attributes as binary, resulting in a single edit per attribute. However, attributes such as eyeglasses, smiles, or hairstyles exhibit a vast range of diversity. In this work, we formulate the task of diverse attribute editing by modeling the multidimensional nature of attribute edits. This enables users to generate multiple plausible edits per attribute. We capitalize on disentangled latent spaces of pretrained GANs and train a Denoising Diffusion Probabilistic Model (DDPM) to learn the latent distribution for diverse edits. Specifically, we train DDPM over a dataset of edit latent directions obtained by embedding image pairs with a single attribute change. This leads to latent subspaces that enable diverse attribute editing. Applying diffusion in the highly compressed latent space allows us to model rich distributions of edits within limited computational resources. Through extensive qualitative and quantitative experiments conducted across a range of datasets, we demonstrate the effectiveness of our approach for diverse attribute editing. We also showcase the results of our method applied to 3D editing of various face attributes.

* NeurIPS Workshop on Diffusion Models 2023

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Nov 25, 2023

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts (+2 more)

Abstract: We present Stable Video Diffusion, a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at https://github.com/Stability-AI/generative-models.

ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs

Nov 22, 2023

Viraj Shah, Nataniel Ruiz, Forrester Cole, Erika Lu, Svetlana Lazebnik, Yuanzhen Li, Varun Jampani

Abstract: Methods for finetuning generative models for concept-driven personalization generally achieve strong results for subject-driven or style-driven generation. Recently, low-rank adaptations (LoRA) have been proposed as a parameter-efficient way of achieving concept-driven personalization. While recent work explores the combination of separate LoRAs to achieve joint generation of learned styles and subjects, existing techniques do not reliably address the problem; they often compromise either subject fidelity or style fidelity. We propose ZipLoRA, a method to cheaply and effectively merge independently trained style and subject LoRAs in order to achieve generation of any user-provided subject in any user-provided style. Experiments on a wide range of subject and style combinations show that ZipLoRA can generate compelling results with meaningful improvements over baselines in subject and style fidelity while preserving the ability to recontextualize. Project page: https://ziplora.github.io
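
As background for the merging problem this abstract describes: a LoRA stores a low-rank update `B @ A` that is added to a frozen weight matrix, so the simplest way to combine a subject LoRA and a style LoRA is a weighted sum of their updates. The sketch below shows only that direct-sum baseline, not ZipLoRA's own merging procedure, which the abstract does not spell out. All names (`merge_loras`, `w_subject`, `w_style`) and the matrix sizes are hypothetical.

```python
import numpy as np


def merge_loras(W0, subject, style, w_subject=1.0, w_style=1.0):
    """Naive weighted-sum merge of two LoRA updates into a base weight matrix.

    W0:      frozen base weight, shape (d_out, d_in).
    subject: tuple (B, A) with B of shape (d_out, r) and A of shape (r, d_in).
    style:   tuple (B, A) with the same layout (its rank may differ).
    """
    B_c, A_c = subject
    B_s, A_s = style
    # Each LoRA contributes a low-rank delta; the merge simply scales and adds them.
    return W0 + w_subject * (B_c @ A_c) + w_style * (B_s @ A_s)


# Hypothetical toy sizes: an 8 x 6 layer with rank-2 adapters.
rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2
W0 = rng.normal(size=(d_out, d_in))
subject = (rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in)))
style = (rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in)))

merged = merge_loras(W0, subject, style, w_subject=0.7, w_style=0.7)
print(merged.shape)
```

With fixed scalar weights the two updates can interfere with each other, which matches the trade-off between subject fidelity and style fidelity that the abstract attributes to existing combination techniques.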

OmniControl: Control Any Joint at Any Time for Human Motion Generation

Oct 12, 2023

Yiming Xie, Varun Jampani, Lei Zhong, Deqing Sun, Huaizu Jiang

Abstract: We present a novel approach named OmniControl for incorporating flexible spatial control signals into a text-conditioned human motion generation model based on the diffusion process. Unlike previous methods that can only control the pelvis trajectory, OmniControl can incorporate flexible spatial control signals over different joints at different times with only one model. Specifically, we propose analytic spatial guidance that ensures the generated motion can tightly conform to the input control signals. At the same time, realism guidance is introduced to refine all the joints to generate more coherent motion. Both the spatial and realism guidance are essential and they are highly complementary for balancing control accuracy and motion realism. By combining them, OmniControl generates motions that are realistic, coherent, and consistent with the spatial constraints. Experiments on HumanML3D and KIT-ML datasets show that OmniControl not only achieves significant improvement over state-of-the-art methods on pelvis control but also shows promising results when incorporating the constraints over other joints.

* Project page: https://neu-vi.github.io/omnicontrol/

HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models

Jul 13, 2023

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, Kfir Aberman

Abstract: Personalization has emerged as a prominent aspect within the field of generative AI, enabling the synthesis of individuals in diverse contexts and styles, while retaining high fidelity to their identities. However, the process of personalization presents inherent challenges in terms of time and memory requirements. Fine-tuning each personalized model needs considerable GPU time investment, and storing a personalized model per subject can be demanding in terms of storage capacity. To overcome these challenges, we propose HyperDreamBooth, a hypernetwork capable of efficiently generating a small set of personalized weights from a single image of a person. By composing these weights into the diffusion model, coupled with fast finetuning, HyperDreamBooth can generate a person's face in various contexts and styles, with high subject details while also preserving the model's crucial knowledge of diverse styles and semantic modifications. Our method achieves personalization on faces in roughly 20 seconds, 25x faster than DreamBooth and 125x faster than Textual Inversion, using as few as one reference image, with the same quality and style diversity as DreamBooth. Our method also yields a model that is 10000x smaller than a normal DreamBooth model. Project page: https://hyperdreambooth.github.io

NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations

Jun 15, 2023

Varun Jampani, Kevis-Kokitsi Maninis, Andreas Engelhardt, Arjun Karpur, Karen Truong, Kyle Sargent, Stefan Popov, André Araujo, Ricardo Martin-Brualla, Kaushal Patel (+6 more)

Abstract: Recent advances in neural reconstruction enable high-quality 3D object reconstruction from casually captured image collections. Current techniques mostly analyze their progress on relatively simple image collections where Structure-from-Motion (SfM) techniques can provide ground-truth (GT) camera poses. We note that SfM techniques tend to fail on in-the-wild image collections such as image search results with varying backgrounds and illuminations. To enable systematic research progress on 3D reconstruction from casual image captures, we propose NAVI: a new dataset of category-agnostic image collections of objects with high-quality 3D scans along with per-image 2D-3D alignments providing near-perfect GT camera parameters. These 2D-3D alignments allow us to extract accurate derivative annotations such as dense pixel correspondences, depth and segmentation maps. We demonstrate the use of NAVI image collections on different problem settings and show that NAVI enables more thorough evaluations that were not possible with existing datasets. We believe NAVI is beneficial for systematic research progress on 3D reconstruction and correspondence estimation. Project page: https://navidataset.github.io
