Gal Chechik

Data Augmentations in Deep Weight Spaces

Nov 15, 2023
Aviv Shamsian, David W. Zhang, Aviv Navon, Yan Zhang, Miltiadis Kofinas, Idan Achituve, Riccardo Valperga, Gertjan J. Burghouts, Efstratios Gavves, Cees G. M. Snoek, Ethan Fetaya, Gal Chechik, Haggai Maron

Learning in weight spaces, where neural networks process the weights of other deep neural networks, has emerged as a promising research direction with applications in various fields, from analyzing and editing neural fields and implicit neural representations, to network pruning and quantization. Recent works designed architectures for effective learning in that space, which take into account its unique permutation-equivariant structure. Unfortunately, so far these architectures suffer from severe overfitting and were shown to benefit from large datasets. This poses a significant challenge because generating data for this learning setup is laborious and time-consuming, since each data sample is a full set of network weights that has to be trained. In this paper, we address this difficulty by investigating data augmentations for weight spaces, a set of techniques that enable generating new data examples on the fly without having to train additional input weight space elements. We first review several recently proposed data augmentation schemes and divide them into categories. We then introduce a novel augmentation scheme based on the Mixup method. We evaluate the performance of these techniques on existing benchmarks as well as new benchmarks we generate, which can be valuable for future studies.

* Accepted to NeurIPS 2023 Workshop on Symmetry and Geometry in Neural Representations 
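To make the Mixup-based idea concrete, the sketch below applies vanilla Mixup to a pair of weight-space inputs (flattened weight vectors of two trained networks) and their task labels. This is only an illustration of the general principle; the paper's actual scheme may differ, for example by first aligning the networks' neuron permutations, and the function name is illustrative.

```python
import torch

def weight_space_mixup(w1, w2, y1, y2, alpha=0.2):
    """Naive Mixup over two weight-space data points.

    w1, w2: flattened weight vectors of two trained networks (same architecture).
    y1, y2: their labels for the weight-space task (e.g. one-hot class vectors).
    Plain Mixup applied to weights; a sketch, not the paper's exact augmentation.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    w_mix = lam * w1 + (1.0 - lam) * w2
    y_mix = lam * y1 + (1.0 - lam) * y2
    return w_mix, y_mix

# Example: two input networks represented by their flattened weights.
w1, w2 = torch.randn(10_000), torch.randn(10_000)
y1, y2 = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])
w_aug, y_aug = weight_space_mixup(w1, w2, y1, y2)
```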

Equivariant Deep Weight Space Alignment

Oct 20, 2023
Aviv Navon, Aviv Shamsian, Ethan Fetaya, Gal Chechik, Nadav Dym, Haggai Maron

Permutation symmetries of deep networks make simple operations like model averaging and similarity estimation challenging. In many cases, aligning the weights of the networks, i.e., finding optimal permutations between their weights, is necessary. More generally, weight alignment is essential for a wide range of applications, from model merging, through exploring the optimization landscape of deep neural networks, to defining meaningful distance functions between neural networks. Unfortunately, weight alignment is an NP-hard problem. Prior research has mainly focused on solving relaxed versions of the alignment problem, leading to either time-consuming methods or sub-optimal solutions. To accelerate the alignment process and improve its quality, we propose a novel framework aimed at learning to solve the weight alignment problem, which we name Deep-Align. To that end, we first demonstrate that weight alignment adheres to two fundamental symmetries and then, propose a deep architecture that respects these symmetries. Notably, our framework does not require any labeled data. We provide a theoretical analysis of our approach and evaluate Deep-Align on several types of network architectures and learning setups. Our experimental results indicate that a feed-forward pass with Deep-Align produces better or equivalent alignments compared to those produced by current optimization algorithms. Additionally, our alignments can be used as an initialization for other methods to gain even better solutions with a significant speedup in convergence.
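The learned Deep-Align model itself is not reproduced here, but to make the alignment problem concrete, the sketch below shows a classical per-layer baseline: matching the hidden units of two one-hidden-layer MLPs with the Hungarian algorithm on a weight-similarity score. The similarity choice and function name are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_hidden_units(Wa1, Wa2, Wb1, Wb2):
    """Permute network B's hidden units to best match network A.

    Wa1, Wb1: (hidden, in)  first-layer weights of networks A and B.
    Wa2, Wb2: (out, hidden) second-layer weights.
    A classical Hungarian-matching baseline for a single hidden layer,
    not the learned Deep-Align architecture.
    """
    # Similarity between hidden units, based on incoming and outgoing weights.
    sim = Wa1 @ Wb1.T + Wa2.T @ Wb2
    _, col = linear_sum_assignment(-sim)      # maximize total similarity
    P = np.eye(Wa1.shape[0])[col]             # permutation matrix
    return P @ Wb1, Wb2 @ P.T                 # B's weights after alignment
```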

Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models

Jul 13, 2023
Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, Amit H. Bermano

Text-to-image (T2I) personalization allows users to guide the creative image generation process by combining their own visual concepts in natural language prompts. Recently, encoder-based techniques have emerged as a new effective approach for T2I personalization, reducing the need for multiple images and long training times. However, most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts. In this work, we propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts. We introduce a novel contrastive-based regularization technique to maintain high fidelity to the target concept characteristics while keeping the predicted embeddings close to editable regions of the latent space, by pushing the predicted tokens toward their nearest existing CLIP tokens. Our experimental results demonstrate the effectiveness of our approach and show how the learned tokens are more semantic than tokens predicted by unregularized models. This leads to a better representation that achieves state-of-the-art performance while being more flexible than previous methods.

* Project page at https://datencoder.github.io 
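The regularization described above, which keeps predicted embeddings close to existing CLIP tokens, could be written along the lines of the sketch below, where the nearest tokens in the frozen CLIP embedding table act as anchors. The paper's exact contrastive formulation and weighting are not reproduced here, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def nearest_token_regularizer(pred_embed, clip_token_embeds, k=1):
    """Pull a predicted concept embedding toward its nearest CLIP token embedding(s).

    pred_embed:        (d,)   embedding predicted by the personalization encoder.
    clip_token_embeds: (V, d) frozen CLIP text-token embedding table.
    An illustrative nearest-neighbour pull, not the paper's exact contrastive loss.
    """
    sims = F.cosine_similarity(pred_embed.unsqueeze(0), clip_token_embeds, dim=-1)  # (V,)
    anchors = clip_token_embeds[sims.topk(k).indices]                               # (k, d)
    return ((pred_embed.unsqueeze(0) - anchors) ** 2).mean()

# total_loss = reconstruction_loss + reg_weight * nearest_token_regularizer(e, vocab)
```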

Point-Cloud Completion with Pretrained Text-to-image Diffusion Models

Jun 18, 2023
Yoni Kasten, Ohad Rahamim, Gal Chechik

Point-cloud data collected in real-world applications are often incomplete. Data is typically missing due to objects being observed from partial viewpoints, which only capture a specific perspective or angle. Additionally, data can be incomplete due to occlusion and low-resolution sampling. Existing completion approaches rely on datasets of predefined objects to guide the completion of noisy and incomplete point clouds. However, these approaches perform poorly when tested on Out-Of-Distribution (OOD) objects that are poorly represented in the training dataset. Here we leverage recent advances in text-guided image generation, which lead to major breakthroughs in text-guided shape generation. We describe an approach called SDS-Complete that uses a pre-trained text-to-image diffusion model and leverages the text semantics of a given incomplete point cloud of an object to obtain a complete surface representation. SDS-Complete can complete a variety of objects using test-time optimization without expensive collection of 3D information. We evaluate SDS-Complete on incomplete scanned objects, captured by real-world depth sensors and LiDAR scanners. We find that it effectively reconstructs objects that are absent from common datasets, reducing Chamfer loss by 50% on average compared with current methods. Project page: https://sds-complete.github.io/
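SDS-Complete's test-time optimization builds on the generic Score Distillation Sampling (SDS) gradient used by text-to-3D methods; a minimal sketch of that gradient is shown below. The `denoiser` call and the rendering pipeline that would produce `rendered_latents` are stand-ins, not the paper's actual components.

```python
import torch

def sds_gradient(rendered_latents, text_embed, denoiser, alphas_cumprod):
    """One generic Score Distillation Sampling step.

    rendered_latents: (B, C, H, W) differentiable renders/encodings of the current surface.
    denoiser(z_t, t, text_embed): stand-in for a pretrained diffusion model's noise predictor.
    """
    b = rendered_latents.shape[0]
    t = torch.randint(50, 950, (b,), device=rendered_latents.device)
    noise = torch.randn_like(rendered_latents)
    a_t = alphas_cumprod[t].view(b, 1, 1, 1)
    z_t = a_t.sqrt() * rendered_latents + (1 - a_t).sqrt() * noise
    with torch.no_grad():
        eps_pred = denoiser(z_t, t, text_embed)
    w = 1 - a_t                                   # a common weighting choice
    return w * (eps_pred - noise)                 # gradient w.r.t. rendered_latents
```

In a test-time loop, this gradient is backpropagated through the renderer into the surface parameters, pushing the completed shape toward geometry the diffusion model considers consistent with the object's text description.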

Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment

Jun 15, 2023
Royi Rassin, Eran Hirsch, Daniel Glickman, Shauli Ravfogel, Yoav Goldberg, Gal Chechik

Text-conditioned image generation models often generate incorrect associations between entities and their visual attributes. This reflects an impaired mapping between the linguistic binding of entities and modifiers in the prompt and the visual binding of the corresponding elements in the generated image. As one notable example, a query like "a pink sunflower and a yellow flamingo" may incorrectly produce an image of a yellow sunflower and a pink flamingo. To remedy this issue, we propose SynGen, an approach which first syntactically analyses the prompt to identify entities and their modifiers, and then uses a novel loss function that encourages the cross-attention maps to agree with the linguistic binding reflected by the syntax. Specifically, we encourage large overlap between attention maps of entities and their modifiers, and small overlap with other entities and modifier words. The loss is optimized during inference, without retraining or fine-tuning the model. Human evaluation on three datasets, including one new and challenging set, demonstrates significant improvements of SynGen compared with current state-of-the-art methods. This work highlights how making use of sentence structure during inference can efficiently and substantially improve the faithfulness of text-to-image generation.

* We make our code publicly available https://github.com/RoyiRa/Syntax-Guided-Generation 
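The loss sketched below illustrates the objective described above: for each entity-modifier pair, reduce the distance between their cross-attention maps, and for unrelated word pairs, increase it. The symmetric-KL distance is an illustrative choice (assumptions: the `attn` maps are non-negative and the pairings come from the syntactic parse); see the released code for the exact formulation.

```python
def attention_binding_loss(attn, bound_pairs, unbound_pairs):
    """attn: dict token_index -> (H, W) non-negative cross-attention map (torch tensors).
    bound_pairs:   (entity_idx, modifier_idx) pairs that should overlap.
    unbound_pairs: word pairs whose maps should not overlap.
    A sketch of the overlap objective; the paper's exact distance may differ.
    """
    def as_dist(m):
        m = m.flatten() + 1e-8
        return m / m.sum()

    def sym_kl(p, q):
        return (p * (p / q).log()).sum() + (q * (q / p).log()).sum()

    loss = 0.0
    for e, m in bound_pairs:        # pull bound words together
        loss = loss + sym_kl(as_dist(attn[e]), as_dist(attn[m]))
    for a, b in unbound_pairs:      # push unrelated words apart
        loss = loss - sym_kl(as_dist(attn[a]), as_dist(attn[b]))
    return loss

# Minimized over the latent at selected denoising steps; the diffusion model stays frozen.
```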

Norm-guided latent space exploration for text-to-image generation

Jun 14, 2023
Dvir Samuel, Rami Ben-Ari, Nir Darshan, Haggai Maron, Gal Chechik

Text-to-image diffusion models show great potential in synthesizing a large variety of concepts in new compositions and scenarios. However, their latent seed space is still not well understood and has been shown to have an impact on generating new and rare concepts. Specifically, simple operations like interpolation and centroid finding work poorly with the standard Euclidean and spherical metrics in the latent space. This paper makes the observation that current training procedures make diffusion models biased toward inputs with a narrow range of norm values. This has strong implications for methods that rely on seed manipulation for image generation, which can be further applied to few-shot and long-tail learning tasks. To address this issue, we propose a novel method for interpolating between two seeds and demonstrate that it defines a new non-Euclidean metric that takes into account a norm-based prior on seeds. We describe a simple yet efficient algorithm for approximating this metric and use it to further define centroids in the latent seed space. We show that our new interpolation and centroid evaluation techniques significantly enhance the generation of rare concept images. This further leads to state-of-the-art performance on few-shot and long-tail benchmarks, improving upon prior approaches in terms of generation speed, image quality, and semantic content.
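To make the norm issue concrete, here is one simple way to interpolate between two Gaussian seeds while staying inside the norm range diffusion models are trained on: slerp the direction and interpolate the norm separately. This is an illustrative scheme under that assumption, not the paper's metric or algorithm.

```python
import torch

def norm_aware_interpolation(z0, z1, t):
    """Interpolate between seeds z0 and z1 at fraction t in [0, 1].

    The direction is interpolated on the unit sphere (slerp) and the norm
    linearly, so the result keeps a typical Gaussian-seed norm instead of the
    shrunken norm a plain linear interpolation would produce.
    """
    n0, n1 = z0.norm(), z1.norm()
    u0, u1 = z0 / n0, z1 / n1
    omega = torch.arccos((u0 * u1).sum().clamp(-1 + 1e-7, 1 - 1e-7))
    direction = (torch.sin((1 - t) * omega) * u0 + torch.sin(t * omega) * u1) / torch.sin(omega)
    return ((1 - t) * n0 + t * n1) * direction

z0, z1 = torch.randn(4 * 64 * 64), torch.randn(4 * 64 * 64)
z_mid = norm_aware_interpolation(z0, z1, 0.5)   # usable as a diffusion seed after reshaping
```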

DisCLIP: Open-Vocabulary Referring Expression Generation

May 30, 2023
Lior Bracha, Eitan Shaar, Aviv Shamsian, Ethan Fetaya, Gal Chechik

Referring Expressions Generation (REG) aims to produce textual descriptions that unambiguously identify specific objects within a visual scene. Traditionally, this has been achieved through supervised learning methods, which perform well on specific data distributions but often struggle to generalize to new images and concepts. To address this issue, we present a novel approach for REG, named DisCLIP, short for discriminative CLIP. We build on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a contextual description of a target concept in an image while avoiding other distracting concepts. Notably, this optimization happens at inference time and does not require additional training or tuning of learned parameters. We measure the quality of the generated text by evaluating the capability of a receiver model to accurately identify the described object within the scene. To achieve this, we use a frozen zero-shot comprehension module as a critic of our generated referring expressions. We evaluate DisCLIP on multiple referring expression benchmarks through human evaluation and show that it significantly outperforms previous methods on out-of-domain datasets. Our results highlight the potential of using pre-trained visual-semantic models for generating high-quality contextual descriptions.
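The sketch below shows a discriminative CLIP score of the kind described above: a candidate description is scored by its similarity to the target region minus its similarity to the hardest distractor. DisCLIP itself guides the LLM token by token during decoding rather than re-ranking whole sentences; the package usage and crop handling here are assumptions.

```python
import torch
import clip  # OpenAI CLIP package (assumed installed)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def discriminative_score(description, target_crop, distractor_crops):
    """Score a candidate referring expression: high similarity to the target
    region, low similarity to distractor regions (PIL image crops).
    A sentence-level re-ranking sketch, not DisCLIP's token-level guidance."""
    images = torch.stack([preprocess(im) for im in [target_crop] + distractor_crops]).to(device)
    text = clip.tokenize([description]).to(device)
    with torch.no_grad():
        img = model.encode_image(images).float()
        txt = model.encode_text(text).float()
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(-1)              # similarity of each crop to the text
    return sims[0] - sims[1:].max()               # target minus hardest distractor
```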

Key-Locked Rank One Editing for Text-to-Image Personalization

May 02, 2023
Yoad Tewel, Rinon Gal, Gal Chechik, Yuval Atzmon

Text-to-image models (T2I) offer a new level of flexibility by allowing users to guide the creative process through natural language. However, personalizing these models to align with user-provided visual concepts remains a challenging problem. The task of T2I personalization poses multiple hard challenges, such as maintaining high visual fidelity while allowing creative control, combining multiple personalized concepts in a single image, and keeping a small model size. We present Perfusion, a T2I personalization method that addresses these challenges using dynamic rank-1 updates to the underlying T2I model. Perfusion avoids overfitting by introducing a new mechanism that "locks" new concepts' cross-attention Keys to their superordinate category. Additionally, we develop a gated rank-1 approach that enables us to control the influence of a learned concept during inference time and to combine multiple concepts. This allows runtime-efficient balancing of visual fidelity and textual alignment with a single 100KB trained model, which is five orders of magnitude smaller than the current state of the art. Moreover, it can span different operating points across the Pareto front without additional training. Finally, we show that Perfusion outperforms strong baselines in both qualitative and quantitative terms. Importantly, key-locking leads to novel results compared to traditional approaches, making it possible to portray personalized object interactions in unprecedented ways, even in one-shot settings.

* Accepted to SIGGRAPH 2023. Project page is in https://research.nvidia.com/labs/par/Perfusion/ 
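Below is a hedged sketch of the gated rank-1 idea on a single cross-attention projection: the concept's input direction is mapped to a chosen target output, scaled by a gate. For the Key projection, "key-locking" would correspond to choosing the target as the frozen output of the superordinate category word. The class name, gating rule, and details are assumptions, not the paper's exact update.

```python
import torch
import torch.nn as nn

class GatedRankOneLinear(nn.Module):
    """A frozen linear projection (e.g. a cross-attention K or V projection)
    plus a gated rank-1 update for one personalized concept. Illustrative only."""

    def __init__(self, base_linear: nn.Linear, concept_in, target_out):
        super().__init__()
        self.base = base_linear.requires_grad_(False)
        self.concept_in = nn.Parameter(concept_in / concept_in.norm(), requires_grad=False)
        self.target_out = nn.Parameter(target_out, requires_grad=False)
        self.gate = nn.Parameter(torch.tensor(1.0))   # strength of the concept's influence

    def forward(self, x):                              # x: (..., d_in)
        # How strongly the input aligns with the concept direction.
        align = (x * self.concept_in).sum(-1, keepdim=True)
        # Rank-1 correction: steer the concept direction's output toward target_out.
        delta = align * (self.target_out - self.base(self.concept_in))
        return self.base(x) + self.gate * delta
```

At gate = 1, an input lying exactly on the concept direction is mapped to `target_out`; inputs orthogonal to it pass through the frozen projection unchanged, which is what makes the edit cheap to store and to combine with other concepts.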

CALM: Conditional Adversarial Latent Models for Directable Virtual Characters

May 02, 2023
Chen Tessler, Yoni Kasten, Yunrong Guo, Shie Mannor, Gal Chechik, Xue Bin Peng

In this work, we present Conditional Adversarial Latent Models (CALM), an approach for generating diverse and directable behaviors for user-controlled interactive virtual characters. Using imitation learning, CALM learns a representation of movement that captures the complexity and diversity of human motion, and enables direct control over character movements. The approach jointly learns a control policy and a motion encoder that reconstructs key characteristics of a given motion without merely replicating it. The results show that CALM learns a semantic motion representation, enabling control over the generated motions and style-conditioning for higher-level task training. Once trained, the character can be controlled using intuitive interfaces, akin to those found in video games.

* Accepted to SIGGRAPH 2023 
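The sketch below only shows the interface the abstract describes: a motion encoder producing a latent for a clip of motion, and a low-level policy conditioned on the character state and that latent. Network sizes, tensor shapes, and the unit-norm latent are assumptions; the adversarial imitation objective is not shown.

```python
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Map a short motion clip (16 frames of joint features) to a latent z."""
    def __init__(self, frame_dim, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(frame_dim * 16, 256),
                                 nn.ReLU(), nn.Linear(256, latent_dim))

    def forward(self, motion):                   # motion: (B, 16, frame_dim)
        z = self.net(motion)
        return z / z.norm(dim=-1, keepdim=True)  # unit-norm latent space (assumed)

class LatentConditionedPolicy(nn.Module):
    """Low-level controller: action from character state and motion latent z."""
    def __init__(self, state_dim, latent_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + latent_dim, 512),
                                 nn.ReLU(), nn.Linear(512, action_dim))

    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))
```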

It is all about where you start: Text-to-image generation with seed selection

Apr 27, 2023
Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, Gal Chechik

Text-to-image diffusion models can synthesize a large variety of concepts in new compositions and scenarios. However, they still struggle with generating uncommon concepts, rare or unusual combinations, or structured concepts like hand palms. Their limitation is partly due to the long-tail nature of their training data: web-crawled datasets are strongly unbalanced, causing models to under-represent concepts from the tail of the distribution. Here we characterize the effect of unbalanced training data on text-to-image models and offer a remedy. We show that rare concepts can be correctly generated by carefully selecting suitable generation seeds in the noise space, a technique that we call SeedSelect. SeedSelect is efficient and does not require retraining the diffusion model. We evaluate the benefit of SeedSelect on a series of problems. First, in few-shot semantic data augmentation, we generate semantically correct images for few-shot and long-tail benchmarks, showing classification improvement on all classes, both from the head and the tail of the training data of diffusion models. We further evaluate SeedSelect on correcting images of hands, a well-known pitfall of current diffusion models, and show that it improves hand generation substantially.
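As a rough illustration of selecting a seed without retraining the model, the sketch below optimizes the initial noise against a semantic score of the generated image. The `generate` and `semantic_score` callables are stand-ins (the actual pipeline, score, and any norm constraints on the seed are not reproduced).

```python
import torch

def seed_select(generate, semantic_score, prompt, latent_shape, steps=50, lr=0.05):
    """Optimize the initial diffusion noise so the generated image matches the
    (rare) concept named in `prompt`.

    generate(seed, prompt)        -> differentiable image   (stand-in for the diffusion pipeline)
    semantic_score(image, prompt) -> scalar, higher = better (e.g. a CLIP-based score)
    The diffusion model itself stays frozen; only the seed is optimized.
    """
    seed = torch.randn(latent_shape, requires_grad=True)
    opt = torch.optim.Adam([seed], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        image = generate(seed, prompt)
        loss = -semantic_score(image, prompt)
        loss.backward()
        opt.step()
    return seed.detach()
```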
