Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Gal Chechik

Bar Ilan University and Google research

Point-Cloud Completion with Pretrained Text-to-image Diffusion Models

Jun 18, 2023

Yoni Kasten, Ohad Rahamim, Gal Chechik

Abstract:Point-cloud data collected in real-world applications are often incomplete. Data is typically missing due to objects being observed from partial viewpoints, which only capture a specific perspective or angle. Additionally, data can be incomplete due to occlusion and low-resolution sampling. Existing completion approaches rely on datasets of predefined objects to guide the completion of noisy and incomplete, point clouds. However, these approaches perform poorly when tested on Out-Of-Distribution (OOD) objects, that are poorly represented in the training dataset. Here we leverage recent advances in text-guided image generation, which lead to major breakthroughs in text-guided shape generation. We describe an approach called SDS-Complete that uses a pre-trained text-to-image diffusion model and leverages the text semantics of a given incomplete point cloud of an object, to obtain a complete surface representation. SDS-Complete can complete a variety of objects using test-time optimization without expensive collection of 3D information. We evaluate SDS Complete on incomplete scanned objects, captured by real-world depth sensors and LiDAR scanners. We find that it effectively reconstructs objects that are absent from common datasets, reducing Chamfer loss by 50% on average compared with current methods. Project page: https://sds-complete.github.io/

Via

Access Paper or Ask Questions

Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment

Jun 15, 2023

Royi Rassin, Eran Hirsch, Daniel Glickman, Shauli Ravfogel, Yoav Goldberg, Gal Chechik

Abstract:Text-conditioned image generation models often generate incorrect associations between entities and their visual attributes. This reflects an impaired mapping between linguistic binding of entities and modifiers in the prompt and visual binding of the corresponding elements in the generated image. As one notable example, a query like ``a pink sunflower and a yellow flamingo'' may incorrectly produce an image of a yellow sunflower and a pink flamingo. To remedy this issue, we propose SynGen, an approach which first syntactically analyses the prompt to identify entities and their modifiers, and then uses a novel loss function that encourages the cross-attention maps to agree with the linguistic binding reflected by the syntax. Specifically, we encourage large overlap between attention maps of entities and their modifiers, and small overlap with other entities and modifier words. The loss is optimized during inference, without retraining or fine-tuning the model. Human evaluation on three datasets, including one new and challenging set, demonstrate significant improvements of SynGen compared with current state of the art methods. This work highlights how making use of sentence structure during inference can efficiently and substantially improve the faithfulness of text-to-image generation.

* We make our code publicly available https://github.com/RoyiRa/Syntax-Guided-Generation

Via

Access Paper or Ask Questions

Norm-guided latent space exploration for text-to-image generation

Jun 14, 2023

Dvir Samuel, Rami Ben-Ari, Nir Darshan, Haggai Maron, Gal Chechik

Figure 1 for Norm-guided latent space exploration for text-to-image generation

Figure 2 for Norm-guided latent space exploration for text-to-image generation

Figure 3 for Norm-guided latent space exploration for text-to-image generation

Figure 4 for Norm-guided latent space exploration for text-to-image generation

Abstract:Text-to-image diffusion models show great potential in synthesizing a large variety of concepts in new compositions and scenarios. However, their latent seed space is still not well understood and has been shown to have an impact in generating new and rare concepts. Specifically, simple operations like interpolation and centroid finding work poorly with the standard Euclidean and spherical metrics in the latent space. This paper makes the observation that current training procedures make diffusion models biased toward inputs with a narrow range of norm values. This has strong implications for methods that rely on seed manipulation for image generation that can be further applied to few-shot and long-tail learning tasks. To address this issue, we propose a novel method for interpolating between two seeds and demonstrate that it defines a new non-Euclidean metric that takes into account a norm-based prior on seeds. We describe a simple yet efficient algorithm for approximating this metric and use it to further define centroids in the latent seed space. We show that our new interpolation and centroid evaluation techniques significantly enhance the generation of rare concept images. This further leads to state-of-the-art performance on few-shot and long-tail benchmarks, improving prior approach in terms of generation speed, image quality, and semantic content.

Via

Access Paper or Ask Questions

DisCLIP: Open-Vocabulary Referring Expression Generation

May 30, 2023

Lior Bracha, Eitan Shaar, Aviv Shamsian, Ethan Fetaya, Gal Chechik

Figure 1 for DisCLIP: Open-Vocabulary Referring Expression Generation

Figure 2 for DisCLIP: Open-Vocabulary Referring Expression Generation

Figure 3 for DisCLIP: Open-Vocabulary Referring Expression Generation

Figure 4 for DisCLIP: Open-Vocabulary Referring Expression Generation

Abstract:Referring Expressions Generation (REG) aims to produce textual descriptions that unambiguously identifies specific objects within a visual scene. Traditionally, this has been achieved through supervised learning methods, which perform well on specific data distributions but often struggle to generalize to new images and concepts. To address this issue, we present a novel approach for REG, named DisCLIP, short for discriminative CLIP. We build on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a contextual description of a target concept in an image while avoiding other distracting concepts. Notably, this optimization happens at inference time and does not require additional training or tuning of learned parameters. We measure the quality of the generated text by evaluating the capability of a receiver model to accurately identify the described object within the scene. To achieve this, we use a frozen zero-shot comprehension module as a critique of our generated referring expressions. We evaluate DisCLIP on multiple referring expression benchmarks through human evaluation and show that it significantly outperforms previous methods on out-of-domain datasets. Our results highlight the potential of using pre-trained visual-semantic models for generating high-quality contextual descriptions.

Via

Access Paper or Ask Questions

CALM: Conditional Adversarial Latent Models for Directable Virtual Characters

May 02, 2023

Chen Tessler, Yoni Kasten, Yunrong Guo, Shie Mannor, Gal Chechik, Xue Bin Peng

Figure 1 for CALM: Conditional Adversarial Latent Models for Directable Virtual Characters

Figure 2 for CALM: Conditional Adversarial Latent Models for Directable Virtual Characters

Figure 3 for CALM: Conditional Adversarial Latent Models for Directable Virtual Characters

Figure 4 for CALM: Conditional Adversarial Latent Models for Directable Virtual Characters

Abstract:In this work, we present Conditional Adversarial Latent Models (CALM), an approach for generating diverse and directable behaviors for user-controlled interactive virtual characters. Using imitation learning, CALM learns a representation of movement that captures the complexity and diversity of human motion, and enables direct control over character movements. The approach jointly learns a control policy and a motion encoder that reconstructs key characteristics of a given motion without merely replicating it. The results show that CALM learns a semantic motion representation, enabling control over the generated motions and style-conditioning for higher-level task training. Once trained, the character can be controlled using intuitive interfaces, akin to those found in video games.

* Accepted to SIGGRAPH 2023

Via

Access Paper or Ask Questions

Key-Locked Rank One Editing for Text-to-Image Personalization

May 02, 2023

Yoad Tewel, Rinon Gal, Gal Chechik, Yuval Atzmon

Figure 1 for Key-Locked Rank One Editing for Text-to-Image Personalization

Figure 2 for Key-Locked Rank One Editing for Text-to-Image Personalization

Figure 3 for Key-Locked Rank One Editing for Text-to-Image Personalization

Figure 4 for Key-Locked Rank One Editing for Text-to-Image Personalization

Abstract:Text-to-image models (T2I) offer a new level of flexibility by allowing users to guide the creative process through natural language. However, personalizing these models to align with user-provided visual concepts remains a challenging problem. The task of T2I personalization poses multiple hard challenges, such as maintaining high visual fidelity while allowing creative control, combining multiple personalized concepts in a single image, and keeping a small model size. We present Perfusion, a T2I personalization method that addresses these challenges using dynamic rank-1 updates to the underlying T2I model. Perfusion avoids overfitting by introducing a new mechanism that "locks" new concepts' cross-attention Keys to their superordinate category. Additionally, we develop a gated rank-1 approach that enables us to control the influence of a learned concept during inference time and to combine multiple concepts. This allows runtime-efficient balancing of visual-fidelity and textual-alignment with a single 100KB trained model, which is five orders of magnitude smaller than the current state of the art. Moreover, it can span different operating points across the Pareto front without additional training. Finally, we show that Perfusion outperforms strong baselines in both qualitative and quantitative terms. Importantly, key-locking leads to novel results compared to traditional approaches, allowing to portray personalized object interactions in unprecedented ways, even in one-shot settings.

* Accepted to SIGGRAPH 2023. Project page is in https://research.nvidia.com/labs/par/Perfusion/

Via

Access Paper or Ask Questions

It is all about where you start: Text-to-image generation with seed selection

Apr 27, 2023

Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, Gal Chechik

Figure 1 for It is all about where you start: Text-to-image generation with seed selection

Figure 2 for It is all about where you start: Text-to-image generation with seed selection

Figure 3 for It is all about where you start: Text-to-image generation with seed selection

Figure 4 for It is all about where you start: Text-to-image generation with seed selection

Abstract:Text-to-image diffusion models can synthesize a large variety of concepts in new compositions and scenarios. However, they still struggle with generating uncommon concepts, rare unusual combinations, or structured concepts like hand palms. Their limitation is partly due to the long-tail nature of their training data: web-crawled data sets are strongly unbalanced, causing models to under-represent concepts from the tail of the distribution. Here we characterize the effect of unbalanced training data on text-to-image models and offer a remedy. We show that rare concepts can be correctly generated by carefully selecting suitable generation seeds in the noise space, a technique that we call SeedSelect. SeedSelect is efficient and does not require retraining the diffusion model. We evaluate the benefit of SeedSelect on a series of problems. First, in few-shot semantic data augmentation, where we generate semantically correct images for few-shot and long-tail benchmarks. We show classification improvement on all classes, both from the head and tail of the training data of diffusion models. We further evaluate SeedSelect on correcting images of hands, a well-known pitfall of current diffusion models, and show that it improves hand generation substantially.

Via

Access Paper or Ask Questions

Graph Positional Encoding via Random Feature Propagation

Mar 08, 2023

Moshe Eliasof, Fabrizio Frasca, Beatrice Bevilacqua, Eran Treister, Gal Chechik, Haggai Maron

Abstract:Two main families of node feature augmentation schemes have been explored for enhancing GNNs: random features and spectral positional encoding. Surprisingly, however, there is still no clear understanding of the relation between these two augmentation schemes. Here we propose a novel family of positional encoding schemes which draws a link between the above two approaches and improves over both. The new approach, named Random Feature Propagation (RFP), is inspired by the power iteration method and its generalizations. It concatenates several intermediate steps of an iterative algorithm for computing the dominant eigenvectors of a propagation matrix, starting from random node features. Notably, these propagation steps are based on graph-dependent propagation operators that can be either predefined or learned. We explore the theoretical and empirical benefits of RFP. First, we provide theoretical justifications for using random features, for incorporating early propagation steps, and for using multiple random initializations. Then, we empirically demonstrate that RFP significantly outperforms both spectral PE and random features in multiple node classification and graph classification benchmarks.

Via

Access Paper or Ask Questions

Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models

Mar 05, 2023

Rinon Gal, Moab Arar, Yuval Atzmon, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or

Abstract:Text-to-image personalization aims to teach a pre-trained diffusion model to reason about novel, user provided concepts, embedding them into new scenes guided by natural language prompts. However, current personalization approaches struggle with lengthy training times, high storage requirements or loss of identity. To overcome these limitations, we propose an encoder-based domain-tuning approach. Our key insight is that by underfitting on a large set of concepts from a given domain, we can improve generalization and create a model that is more amenable to quickly adding novel concepts from the same domain. Specifically, we employ two components: First, an encoder that takes as an input a single image of a target concept from a given domain, e.g. a specific face, and learns to map it into a word-embedding representing the concept. Second, a set of regularized weight-offsets for the text-to-image model that learn how to effectively ingest additional concepts. Together, these components are used to guide the learning of unseen concepts, allowing us to personalize a model using only a single image and as few as 5 training steps - accelerating personalization from dozens of minutes to seconds, while preserving quality.

* Project page at https://tuning-encoder.github.io/

Via

Access Paper or Ask Questions

Guided Deep Kernel Learning

Feb 19, 2023

Idan Achituve, Gal Chechik, Ethan Fetaya

Abstract:Combining Gaussian processes with the expressive power of deep neural networks is commonly done nowadays through deep kernel learning (DKL). Unfortunately, due to the kernel optimization process, this often results in losing their Bayesian benefits. In this study, we present a novel approach for learning deep kernels by utilizing infinite-width neural networks. We propose to use the Neural Network Gaussian Process (NNGP) model as a guide to the DKL model in the optimization process. Our approach harnesses the reliable uncertainty estimation of the NNGPs to adapt the DKL target confidence when it encounters novel data points. As a result, we get the best of both worlds, we leverage the Bayesian behavior of the NNGP, namely its robustness to overfitting, and accurate uncertainty estimation, while maintaining the generalization abilities, scalability, and flexibility of deep kernels. Empirically, we show on multiple benchmark datasets of varying sizes and dimensionality, that our method is robust to overfitting, has good predictive performance, and provides reliable uncertainty estimations.

Via

Access Paper or Ask Questions