Ariel Shamir

Semantify: Simplifying the Control of 3D Morphable Models using CLIP

Aug 14, 2023
Omer Gralnik, Guy Gafni, Ariel Shamir

We present Semantify: a self-supervised method that utilizes the semantic power of the CLIP language-vision foundation model to simplify the control of 3D morphable models. Given a parametric model, training data is created by randomly sampling the model's parameters, creating various shapes, and rendering them. The similarity between the output images and a set of word descriptors is calculated in CLIP's latent space. Our key idea is first to choose a small set of semantically meaningful and disentangled descriptors that characterize the 3DMM, and then to learn a non-linear mapping from scores across this set to the parametric coefficients of the given 3DMM. The non-linear mapping is defined by training a neural network without a human in the loop. We present results on numerous 3DMMs: body shape models, face shape and expression models, as well as animal shapes. We demonstrate how our method defines a simple slider interface for intuitive modeling, and show how the mapping can be used to instantly fit a 3D parametric body shape to in-the-wild images.
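
The pipeline above lends itself to a compact sketch: sample 3DMM coefficients, render them, score the renders against a fixed set of word descriptors in CLIP space, and fit a small network from descriptor scores back to coefficients. The snippet below is a minimal illustration of that flow; the descriptor list, the coefficient count, and the render_3dmm renderer are assumptions standing in for a concrete morphable model, not the released code.

```python
# Minimal sketch of a Semantify-style training loop, assuming a renderer
# `render_3dmm(coeffs) -> PIL.Image` for some 3D morphable model is available.
# Descriptors, coefficient count, and hyperparameters are placeholders.
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

descriptors = ["muscular", "thin", "tall", "short"]        # assumed descriptor set
text_tokens = clip.tokenize(descriptors).to(device)
NUM_COEFFS = 10                                            # e.g. body-shape coefficients

def descriptor_scores(image):
    """CLIP similarity between one rendered image and each word descriptor."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0).to(device))
        txt = model.encode_text(text_tokens)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img @ txt.T).squeeze(0).float()            # (len(descriptors),)

# Non-linear mapping from descriptor scores to 3DMM coefficients.
mapper = torch.nn.Sequential(
    torch.nn.Linear(len(descriptors), 256), torch.nn.ReLU(),
    torch.nn.Linear(256, NUM_COEFFS)).to(device)
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-4)

for step in range(10000):
    coeffs = torch.randn(NUM_COEFFS, device=device)        # random sample of the parameters
    image = render_3dmm(coeffs)                            # assumed external renderer
    loss = torch.nn.functional.mse_loss(mapper(descriptor_scores(image)), coeffs)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

At inference, moving a slider changes one descriptor score and the mapper instantly produces the corresponding coefficients, which is what enables the interactive interface described above.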


Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models

Jul 13, 2023
Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, Amit H. Bermano

Text-to-image (T2I) personalization allows users to guide the creative image generation process by combining their own visual concepts in natural language prompts. Recently, encoder-based techniques have emerged as a new effective approach for T2I personalization, reducing the need for multiple images and long training times. However, most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts. In this work, we propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts. We introduce a novel contrastive-based regularization technique to maintain high fidelity to the target concept characteristics while keeping the predicted embeddings close to editable regions of the latent space, by pushing the predicted tokens toward their nearest existing CLIP tokens. Our experimental results demonstrate the effectiveness of our approach and show how the learned tokens are more semantic than tokens predicted by unregularized models. This leads to a better representation that achieves state-of-the-art performance while being more flexible than previous methods.
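
The contrastive regularizer described above can be pictured with a short stand-alone sketch: the embedding predicted for a new concept is pulled toward its nearest neighbors in a frozen CLIP token-embedding table, which keeps it inside a well-populated, editable region of the latent space. This is an illustrative re-creation under assumed shapes, not the paper's implementation.

```python
# Hedged sketch of a nearest-token regularizer: treat the k closest entries of a
# frozen CLIP token table as positives for the predicted concept embedding.
# Vocabulary size, width, k, and temperature are assumptions for illustration.
import torch
import torch.nn.functional as F

def nearest_token_regularizer(pred_embed, token_table, k=5, temperature=0.07):
    pred = F.normalize(pred_embed, dim=-1)                 # (d,)
    table = F.normalize(token_table, dim=-1)               # (V, d)
    sims = table @ pred                                    # cosine similarity to every token
    nearest = sims.topk(k).indices                         # indices of the k closest tokens
    log_probs = F.log_softmax(sims / temperature, dim=0)   # softmax over the whole vocabulary
    return -log_probs[nearest].mean()                      # pull toward the nearest tokens

vocab = torch.randn(49408, 768)                            # frozen CLIP token embeddings (assumed)
pred = torch.randn(768, requires_grad=True)                # encoder's predicted concept embedding
reg_loss = nearest_token_regularizer(pred, vocab)
reg_loss.backward()
```

In training, a term like this would be added to the usual reconstruction objective, trading a small amount of concept fidelity for embeddings that remain semantic and editable.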

* Project page at https://datencoder.github.io 

HoughLaneNet: Lane Detection with Deep Hough Transform and Dynamic Convolution

Jul 07, 2023
Jia-Qi Zhang, Hao-Bin Duan, Jun-Long Chen, Ariel Shamir, Miao Wang

The task of lane detection has garnered considerable attention in the field of autonomous driving due to its complexity. Lanes can be difficult to detect, as they are often narrow, fragmented, and obscured by heavy traffic. However, lanes have a geometrical structure that resembles a straight line, and exploiting this characteristic improves detection results. We therefore propose a hierarchical Deep Hough Transform (DHT) approach that combines all lane features in an image into the Hough parameter space. Additionally, we refine the point selection method and incorporate a Dynamic Convolution Module to effectively differentiate between lanes in the original image. Our network architecture comprises a backbone network, either a ResNet or a Pyramid Vision Transformer, a Feature Pyramid Network as the neck to extract multi-scale features, and a hierarchical DHT-based feature aggregation head to accurately segment each lane. By utilizing the lane features in the Hough parameter space, the network learns dynamic convolution kernel parameters corresponding to each lane, allowing the Dynamic Convolution Module to effectively differentiate between lane features. The lane features are then fed into the feature decoder, which predicts the final position of each lane. Our proposed network demonstrates improved performance in detecting heavily occluded or worn lanes, as evidenced by extensive experimental results showing that our method outperforms or is on par with state-of-the-art techniques.
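
To make the Hough-space aggregation concrete, here is a small illustrative re-implementation of the underlying idea: every spatial feature votes for the (theta, rho) parameters of the lines passing through it, and a per-lane kernel selected in that space can then be applied back to the image features as a dynamic convolution. Bin counts and shapes are assumptions; this is not the HoughLaneNet code.

```python
# Toy Deep-Hough-style feature accumulation and dynamic convolution.
import math
import torch

def deep_hough_accumulate(feat, num_theta=180, num_rho=128):
    """feat: (C, H, W) feature map -> (C, num_theta, num_rho) Hough-space features."""
    C, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    thetas = torch.linspace(0, math.pi, num_theta)
    rho_max = math.hypot(H, W)
    acc = feat.new_zeros(C, num_theta, num_rho)
    for t, theta in enumerate(thetas):
        rho = xs * math.cos(theta) + ys * math.sin(theta)            # line offset per pixel
        bins = ((rho + rho_max) / (2 * rho_max) * (num_rho - 1)).long().clamp(0, num_rho - 1)
        acc[:, t].index_add_(1, bins.flatten(), feat.reshape(C, -1))  # each pixel votes
    return acc

def dynamic_conv(image_feat, lane_kernel):
    """image_feat: (C, H, W); lane_kernel: (C,) predicted per lane -> (H, W) lane response."""
    return torch.einsum("chw,c->hw", image_feat, lane_kernel)

feat = torch.rand(64, 32, 32)
hough_feat = deep_hough_accumulate(feat)          # peaks correspond to line-like structures
```

In the full network the per-lane kernels are predicted from peaks in this parameter space rather than hand-picked, which is what lets the dynamic convolution separate individual lanes.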


Concept Decomposition for Visual Exploration and Inspiration

May 31, 2023
Yael Vinker, Andrey Voynov, Daniel Cohen-Or, Ariel Shamir

A creative idea is often born from transforming, combining, and modifying ideas from existing visual examples capturing various concepts. However, one cannot simply copy the concept as a whole, and inspiration is achieved by examining certain aspects of the concept. Hence, it is often necessary to separate a concept into different aspects to provide new perspectives. In this paper, we propose a method to decompose a visual concept, represented as a set of images, into different visual aspects encoded in a hierarchical tree structure. We utilize large vision-language models and their rich latent space for concept decomposition and generation. Each node in the tree represents a sub-concept using a learned vector embedding injected into the latent space of a pretrained text-to-image model. We use a set of regularizations to guide the optimization of the embedding vectors encoded in the nodes to follow the hierarchical structure of the tree. Our method allows users to explore and discover new concepts derived from the original one. The tree provides the possibility of endless visual sampling at each node, allowing the user to explore the hidden sub-concepts of the object of interest. The learned aspects in each node can be combined within and across trees to create new visual ideas, and can be used in natural language sentences to apply such aspects to new designs.
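
The tree described above can be pictured with a small data-structure sketch: each node carries a learned embedding that acts as a pseudo-token in the text encoder of a text-to-image model, children are initialized near their parent, and aspects from different nodes can be combined inside a prompt. Names, dimensions, and the initialization scheme below are illustrative assumptions, not the paper's code.

```python
# Illustrative concept-tree structure with learnable per-node embeddings.
from dataclasses import dataclass, field
from typing import List, Optional
import torch

@dataclass
class ConceptNode:
    name: str                               # pseudo-token, e.g. "<v1>", used inside prompts
    embedding: torch.nn.Parameter           # learned vector in the text-encoder space
    children: List["ConceptNode"] = field(default_factory=list)
    parent: Optional["ConceptNode"] = None

def make_child(parent: ConceptNode, name: str, dim: int = 768) -> ConceptNode:
    # Children start near their parent so the hierarchy is respected before the
    # regularized optimization refines them (an assumption of this sketch).
    emb = torch.nn.Parameter(parent.embedding.detach().clone() + 0.01 * torch.randn(dim))
    child = ConceptNode(name, emb, parent=parent)
    parent.children.append(child)
    return child

root = ConceptNode("<v0>", torch.nn.Parameter(torch.randn(768)))
left, right = make_child(root, "<v1>"), make_child(root, "<v2>")
prompt = f"a photo in the style of {left.name} and {right.name}"   # combining aspects
```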

* https://inspirationtree.github.io/inspirationtree/ 

PersonalTailor: Personalizing 2D Pattern Design from 3D Garment Point Clouds

Mar 17, 2023
Anran Qi, Sauradip Nag, Xiatian Zhu, Ariel Shamir

Garment pattern design aims to convert a 3D garment into the corresponding 2D panels and their sewing structure. Existing methods rely either on template fitting with heuristics and prior assumptions, or on model learning with complicated shape parameterization. Importantly, neither approach allows for personalization of the output garment, for which demand is increasing. To fill this gap, we introduce PersonalTailor: a personalized 2D pattern design method, where the user can input specific constraints or demands (in language or sketch) for personal 2D panel fabrication from 3D point clouds. PersonalTailor first learns multi-modal panel embeddings based on unsupervised cross-modal association and attentive fusion. It then predicts binary panel masks individually using a transformer encoder-decoder framework. Extensive experiments show that PersonalTailor excels on both personalized and standard pattern fabrication tasks.
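
One way to picture the mask-prediction stage summarized above is a transformer decoder in which learnable panel queries, fused with the user's text or sketch embedding, attend over per-point features and each emit a per-point binary mask. The sketch below is a rough approximation under assumed shapes and module choices, not the PersonalTailor implementation.

```python
# Assumed-shape sketch of panel-mask prediction with a transformer decoder.
import torch
import torch.nn as nn

class PanelMaskHead(nn.Module):
    def __init__(self, d=256, num_panels=8):
        super().__init__()
        self.panel_queries = nn.Parameter(torch.randn(num_panels, d))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=4)
        self.mask_proj = nn.Linear(d, d)

    def forward(self, point_feats, user_embed):
        # point_feats: (B, N, d) per-point garment features; user_embed: (B, d) text/sketch embedding
        B = point_feats.size(0)
        queries = self.panel_queries.unsqueeze(0).expand(B, -1, -1) + user_embed.unsqueeze(1)
        panel_embeds = self.decoder(queries, point_feats)             # (B, num_panels, d)
        logits = torch.einsum("bpd,bnd->bpn", self.mask_proj(panel_embeds), point_feats)
        return logits.sigmoid()                                       # per-panel binary masks

head = PanelMaskHead()
masks = head(torch.rand(2, 2048, 256), torch.rand(2, 256))            # (2, 8, 2048)
```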

* Technical Report 

Word-As-Image for Semantic Typography

Mar 06, 2023
Shir Iluz, Yael Vinker, Amir Hertz, Daniel Berio, Daniel Cohen-Or, Ariel Shamir

A word-as-image is a semantic typography technique where a word illustration presents a visualization of the meaning of the word, while also preserving its readability. We present a method to create word-as-image illustrations automatically. This task is highly challenging as it requires semantic understanding of the word and a creative idea of where and how to depict these semantics in a visually pleasing and legible manner. We rely on the remarkable ability of recent large pretrained language-vision models to distill textual concepts visually. We target simple, concise, black-and-white designs that convey the semantics clearly. We deliberately do not change the color or texture of the letters and do not use embellishments. Our method optimizes the outline of each letter to convey the desired concept, guided by a pretrained Stable Diffusion model. We incorporate additional loss terms to ensure the legibility of the text and the preservation of the style of the font. We show high-quality and engaging results on numerous examples and compare our method to alternative techniques.
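
Schematically, the optimization described above can be written as a short loop over a letter's outline control points, driven by a diffusion-guidance term plus a term that keeps the glyph close to the original font. In this sketch the differentiable rasterizer and the score-distillation function are assumed externals, and the simple L2 term is only a stand-in for the paper's font-preservation and legibility losses.

```python
# Schematic outline-optimization loop; `rasterize` (e.g. a diffvg-style
# differentiable rasterizer) and `diffusion_guidance` (score distillation from a
# pretrained Stable Diffusion model) are assumed to be provided elsewhere.
import torch

initial_points = torch.rand(32, 2)               # placeholder; in practice taken from the font outline
control_points = torch.nn.Parameter(initial_points.clone())
optimizer = torch.optim.Adam([control_points], lr=1e-2)

for step in range(500):
    image = rasterize(control_points)                               # differentiable render of the glyph
    loss_concept = diffusion_guidance(image, prompt="a bunny")      # drives the letter toward the concept
    loss_shape = (control_points - initial_points).pow(2).mean()    # stay near the original font shape
    loss = loss_concept + 0.5 * loss_shape
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```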


ARO-Net: Learning Neural Fields from Anchored Radial Observations

Dec 19, 2022
Yizhi Wang, Zeyu Huang, Ariel Shamir, Hui Huang, Hao Zhang, Ruizhen Hu

We introduce anchored radial observations (ARO), a novel shape encoding for learning a neural field representation of shapes that is category-agnostic and generalizable amid significant shape variations. The main idea behind our work is to reason about shapes through partial observations from a set of viewpoints, called anchors. We develop a general and unified shape representation by employing a fixed set of anchors, placed via Fibonacci sampling, and designing a coordinate-based deep neural network to predict the occupancy value of a query point in space. Unlike prior neural implicit models, which use a global shape feature, our shape encoder operates on contextual, query-specific features. To predict point occupancy, locally observed shape information from the perspective of the anchors surrounding the input query point is encoded and aggregated through an attention module before implicit decoding is performed. We demonstrate the quality and generality of our network, coined ARO-Net, on surface reconstruction from sparse point clouds, with tests on novel and unseen object categories, "one-shape" training, and comparisons to state-of-the-art neural and classical methods for reconstruction and tessellation.
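
The anchor construction mentioned above is easy to sketch: Fibonacci (golden-angle) sampling places a fixed set of anchors evenly on a sphere, and for any query point one can form simple per-anchor observations such as direction and distance. The distance/direction tokens below are a simplified stand-in for the local shape features the network actually encodes and aggregates with attention.

```python
# Fixed anchor set via Fibonacci sampling, plus toy per-anchor observations.
import math
import torch

def fibonacci_sphere_anchors(n=48, radius=1.0):
    """Distribute n anchor points nearly evenly on a sphere using the golden-angle spiral."""
    i = torch.arange(n, dtype=torch.float32)
    phi = math.pi * (3.0 - math.sqrt(5.0)) * i           # golden-angle increments
    z = 1.0 - 2.0 * (i + 0.5) / n
    r = torch.sqrt(1.0 - z * z)
    return radius * torch.stack([r * torch.cos(phi), r * torch.sin(phi), z], dim=-1)

anchors = fibonacci_sphere_anchors(48)                    # (48, 3), shared by all shapes
query = torch.tensor([0.1, -0.2, 0.3])                    # a point whose occupancy is queried
offsets = query - anchors                                 # (48, 3)
distances = offsets.norm(dim=-1, keepdim=True)
directions = offsets / distances
anchor_tokens = torch.cat([directions, distances], dim=-1)  # (48, 4) tokens fed to an attention module
```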


PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data

Dec 08, 2022
Roei Herzig, Ofir Abramovich, Elad Ben-Avraham, Assaf Arbelle, Leonid Karlinsky, Ariel Shamir, Trevor Darrell, Amir Globerson

Action recognition models have achieved impressive results by incorporating scene-level annotations, such as objects, their relations, 3D structure, and more. However, scene-structure annotations for videos require significant effort to gather and annotate, making these methods expensive to train. In contrast, synthetic datasets generated by graphics engines provide powerful alternatives for generating scene-level annotations across multiple tasks. In this work, we propose an approach to leverage synthetic scene data for improving video understanding. We present a multi-task prompt learning approach for video transformers, where a shared video transformer backbone is enhanced by a small set of specialized parameters for each task. Specifically, we add a set of "task prompts", each corresponding to a different task, and let each prompt predict task-related annotations. This design allows the model to capture information shared among synthetic scene tasks, as well as information shared between synthetic scene tasks and a real video downstream task, throughout the entire network. We refer to this approach as "Promptonomy", since the prompts model a task-related structure. We propose the PromptonomyViT model (PViT), a video transformer that incorporates various types of scene-level information from synthetic data using the "Promptonomy" approach. PViT shows strong performance improvements on multiple video understanding tasks and datasets.
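
A minimal sketch of the task-prompt idea follows: a few learnable tokens per auxiliary task are appended to the video tokens of a shared transformer, and each task head reads out only its own prompts. Dimensions, depth, and output sizes are placeholders rather than the PViT configuration.

```python
# Toy video transformer with learnable per-task prompt tokens.
import torch
import torch.nn as nn

class PromptedVideoTransformer(nn.Module):
    def __init__(self, d=768, num_tasks=3, prompts_per_task=4, depth=6, classes_per_task=10):
        super().__init__()
        self.task_prompts = nn.Parameter(torch.randn(num_tasks, prompts_per_task, d) * 0.02)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=12, batch_first=True), depth)
        self.task_heads = nn.ModuleList([nn.Linear(d, classes_per_task) for _ in range(num_tasks)])

    def forward(self, video_tokens):                       # (B, N, d) patch/tube tokens
        B = video_tokens.size(0)
        prompts = self.task_prompts.flatten(0, 1).unsqueeze(0).expand(B, -1, -1)
        x = self.encoder(torch.cat([prompts, video_tokens], dim=1))   # prompts and video tokens interact
        p = self.task_prompts.shape[1]
        # Each head pools only its own task's prompt tokens.
        return [head(x[:, t * p:(t + 1) * p].mean(dim=1)) for t, head in enumerate(self.task_heads)]

model = PromptedVideoTransformer()
predictions = model(torch.rand(2, 196, 768))               # one prediction per task
```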

* Tech report 

Prediction of Scene Plausibility

Dec 06, 2022
Or Nachmias, Ohad Fried, Ariel Shamir

Understanding the 3D world from 2D images involves more than detection and segmentation of the objects within the scene. It also includes the interpretation of the structure and arrangement of the scene elements. Such understanding is often rooted in recognizing the physical world and its limitations, and in prior knowledge as to how similar typical scenes are arranged. In this research we pose a new challenge for neural network (or other) scene understanding algorithms - can they distinguish between plausible and implausible scenes? Plausibility can be defined both in terms of physical properties and in terms of functional and typical arrangements. Hence, we define plausibility as the probability of encountering a given scene in the real physical world. We build a dataset of synthetic images containing both plausible and implausible scenes, and test the success of various vision models in the task of recognizing and understanding plausibility.


CLIPascene: Scene Sketching with Different Types and Levels of Abstraction

Nov 30, 2022
Yael Vinker, Yuval Alaluf, Daniel Cohen-Or, Ariel Shamir

In this paper, we present a method for converting a given scene image into a sketch using different types and multiple levels of abstraction. We distinguish between two types of abstraction. The first considers the fidelity of the sketch, varying its representation from a more precise portrayal of the input to a looser depiction. The second is defined by the visual simplicity of the sketch, moving from a detailed depiction to a sparse sketch. Using an explicit disentanglement into two abstraction axes -- and multiple levels for each one -- provides users with additional control over selecting the desired sketch based on their personal goals and preferences. To form a sketch at a given level of fidelity and simplification, we train two MLP networks. The first network learns the desired placement of strokes, while the second learns to gradually remove strokes from the sketch without harming its recognizability and semantics. Our approach is able to generate sketches of complex scenes, including those with challenging backgrounds (e.g., natural and urban settings) and subjects (e.g., animals and people), while depicting gradual abstractions of the input scene in terms of fidelity and simplicity.
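
The two-network setup described above can be illustrated with a small sketch: one MLP refines stroke control points (the fidelity axis), while a second assigns each stroke a keep-probability used to sparsify the sketch (the simplicity axis). Shapes and the training signal are assumptions; rendering and the CLIP-based losses are omitted.

```python
# Illustrative two-MLP setup: stroke placement and stroke removal.
import torch
import torch.nn as nn

num_strokes, pts_per_stroke = 64, 4

placement_mlp = nn.Sequential(               # refines where strokes lie
    nn.Linear(pts_per_stroke * 2, 128), nn.ReLU(), nn.Linear(128, pts_per_stroke * 2))
simplicity_mlp = nn.Sequential(              # scores how essential each stroke is
    nn.Linear(pts_per_stroke * 2, 128), nn.ReLU(), nn.Linear(128, 1))

strokes = torch.rand(num_strokes, pts_per_stroke * 2)        # initial Bezier control points
refined = strokes + placement_mlp(strokes)                   # displaced control points (fidelity)
keep_prob = torch.sigmoid(simplicity_mlp(refined))           # (num_strokes, 1) removal weights (simplicity)
# Rendering the refined strokes weighted by keep_prob with a differentiable
# rasterizer and comparing to the input image in CLIP space would provide the
# training signal; that part is not shown here.
```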

* Project page available at https://clipascene.github.io/CLIPascene/ 