"photo": models, code, and papers

Neural Sign Reenactor: Deep Photorealistic Sign Language Retargeting

Sep 03, 2022
Christina O. Tze, Panagiotis P. Filntisis, Athanasia-Lida Dimou, Anastasios Roussos, Petros Maragos

In this paper, we introduce a neural rendering pipeline for transferring the facial expressions, head pose and body movements of one person in a source video to another person in a target video. We apply our method to the challenging case of Sign Language videos: given a source video of a sign language user, we can faithfully transfer the performed manual (e.g. handshape, palm orientation, movement, location) and non-manual (e.g. eye gaze, facial expressions, head movements) signs to a target video in a photo-realistic manner. To effectively capture these cues, which are crucial for sign language communication, we build on a combination of the most robust and reliable recently introduced deep learning methods for body, hand and face tracking. Using a 3D-aware representation, the estimated motions of the body parts are combined and retargeted to the target signer, and are then given as conditional input to our Video Rendering Network, which generates temporally consistent and photo-realistic videos. We conduct detailed qualitative and quantitative evaluations and comparisons, which demonstrate the effectiveness of our approach and its advantages over existing approaches. Our method yields promising results of unprecedented realism and can be used for Sign Language Anonymization. In addition, it is readily applicable to the reenactment of other full-body activities (dancing, acting, exercising, etc.), as well as to the synthesis module of Sign Language Production systems.
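
As a rough illustration of the tracking stage this pipeline builds on, the sketch below collects per-frame body, hand and face keypoints from a source video using MediaPipe Holistic as a stand-in tracker; the authors' actual tracking combination, the 3D-aware retargeting and the Video Rendering Network are not reproduced, and the function names are illustrative.

```python
# Sketch: collect per-frame body/hand/face keypoints from a source video.
# MediaPipe Holistic is only a stand-in for the tracking stack described in
# the paper; retargeting and neural rendering are not shown.
import cv2
import mediapipe as mp

def to_xyz(landmarks):
    """Flatten a MediaPipe landmark list to [(x, y, z), ...], or None."""
    if landmarks is None:
        return None
    return [(lm.x, lm.y, lm.z) for lm in landmarks.landmark]

def track_source_video(path):
    frames = []
    cap = cv2.VideoCapture(path)
    with mp.solutions.holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, bgr = cap.read()
            if not ok:
                break
            res = holistic.process(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
            frames.append({
                "body": to_xyz(res.pose_landmarks),              # manual cues
                "left_hand": to_xyz(res.left_hand_landmarks),
                "right_hand": to_xyz(res.right_hand_landmarks),
                "face": to_xyz(res.face_landmarks),              # non-manual cues
            })
    cap.release()
    return frames  # per-frame conditioning for retargeting and rendering
```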

Towards Local Underexposed Photo Enhancement

Aug 17, 2022
Yizhan Huang, Xiaogang Xu

Inspired by the ability of deep generative models to generate highly realistic images, much recent work has made progress in enhancing underexposed images globally. However, local image enhancement has not been explored, although it is requisite in real-world scenarios, e.g., fixing local underexposure. In this work, we define a new task setting for underexposed image enhancement in which users control which region to enlighten with an input mask. As indicated by the mask, an image can be divided into three areas: Masked Area A, Transition Area B, and Unmasked Area C. Area A should be enlightened to the desired lighting, and there should be a smooth transition (Area B) from the enlightened area (Area A) to the unchanged region (Area C). To accomplish this task, we propose two methods: concatenating the mask as additional channels (MConcat) and Mask-based Normalization (MNorm). While MConcat simply appends the mask channels to the input images, MNorm dynamically enhances spatially varying pixels, guaranteeing that the enhanced images are consistent with the requirement indicated by the input mask. Moreover, MConcat serves as a plug-and-play module that can be incorporated into existing networks, which globally enhance images, to achieve local enhancement. The overall network can be trained with three kinds of loss functions in Area A, Area B, and Area C, which are unified for various model structures. We perform extensive experiments on public datasets with various parametric approaches for low-light enhancement, including Convolutional-Neural-Network-based and Transformer-based models, demonstrating the effectiveness of our methods.
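
To make the MConcat variant concrete, here is a minimal PyTorch sketch (not the authors' code) in which the user mask is appended to the RGB input as a fourth channel, so the enhancer can condition on the region to be enlightened; the tiny residual network and all names are illustrative placeholders.

```python
# Sketch of MConcat-style conditioning: the region mask is concatenated to the
# RGB input as an extra channel. The small residual network is a placeholder,
# not the paper's architecture.
import torch
import torch.nn as nn

class MaskConcatEnhancer(nn.Module):
    def __init__(self, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1, width, 3, padding=1),   # RGB + mask channel
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, 3, 3, padding=1),
        )

    def forward(self, image, mask):
        # image: (B, 3, H, W) in [0, 1]; mask: (B, 1, H, W) with 1 in Area A,
        # a soft ramp over Area B, and 0 in Area C.
        x = torch.cat([image, mask], dim=1)          # the "MConcat" step
        return (image + self.net(x)).clamp(0.0, 1.0)

image = torch.rand(1, 3, 256, 256)
mask = torch.zeros(1, 1, 256, 256)
mask[..., 64:192, 64:192] = 1.0                      # user-selected Area A
enhanced = MaskConcatEnhancer()(image, mask)
```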

Explicitly Controllable 3D-Aware Portrait Generation

Sep 12, 2022
Junshu Tang, Bo Zhang, Binxin Yang, Ting Zhang, Dong Chen, Lizhuang Ma, Fang Wen

In contrast to the traditional avatar creation pipeline, which is a costly process, contemporary generative approaches directly learn the data distribution from photographs, and the state of the art can now yield highly photo-realistic images. While plenty of works attempt to extend unconditional generative models and achieve some level of controllability, it is still challenging to ensure multi-view consistency, especially under large poses. In this work, we propose a 3D portrait generation network that produces 3D-consistent portraits while being controllable via semantic parameters for pose, identity, expression and lighting. The generative network uses a neural scene representation to model portraits in 3D, whose generation is guided by a parametric face model that supports explicit control. While the latent disentanglement can be further enhanced by contrasting images with partially different attributes, there still exists noticeable inconsistency in non-face areas, e.g., hair and background, when animating expressions. We solve this by proposing a volume blending strategy in which we form a composite output by blending the dynamic and static radiance fields, with the two parts segmented from the jointly learned semantic field. Our method outperforms prior art in extensive experiments, producing realistic portraits with vivid expressions under natural lighting when viewed from free viewpoints. The proposed method also generalizes to real images as well as out-of-domain cartoon faces, showing great promise for real applications. Additional video results and code will be available on the project webpage.

* Project webpage: https://junshutang.github.io/control/index.html 
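
The volume blending strategy can be illustrated with a short sketch: colors and densities predicted by a dynamic (expression-driven) field and a static (hair/background) field are composited per sample using a soft weight derived from the jointly learned semantic field. The shapes and the blending rule below are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch of blending a dynamic radiance field with a static one using a
# per-sample semantic weight. Shapes and the blend rule are illustrative.
import torch

def blend_fields(rgb_dyn, sigma_dyn, rgb_sta, sigma_sta, semantic_logit):
    # rgb_*:   (rays, samples, 3) color predictions of each field
    # sigma_*: (rays, samples) volume densities
    # semantic_logit: (rays, samples) logit of "belongs to the dynamic part"
    w = torch.sigmoid(semantic_logit)                 # soft face / non-face mask
    rgb = w.unsqueeze(-1) * rgb_dyn + (1.0 - w).unsqueeze(-1) * rgb_sta
    sigma = w * sigma_dyn + (1.0 - w) * sigma_sta
    return rgb, sigma   # fed to the usual volume-rendering integral

rgb, sigma = blend_fields(
    torch.rand(1024, 64, 3), torch.rand(1024, 64),
    torch.rand(1024, 64, 3), torch.rand(1024, 64),
    torch.randn(1024, 64),
)
```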

What does a platypus look like? Generating customized prompts for zero-shot image classification

Sep 07, 2022
Sarah Pratt, Rosanne Liu, Ali Farhadi

Open vocabulary models are a promising new paradigm for image classification. Unlike traditional classification models, open vocabulary models classify among any arbitrary set of categories specified with natural language at inference time. This natural language, called "prompts", typically consists of a set of hand-written templates (e.g., "a photo of a {}") that are completed with each of the category names. This work introduces a simple method to generate higher-accuracy prompts without using explicit knowledge of the image domain and with far fewer hand-constructed sentences. To achieve this, we combine open vocabulary models with large language models (LLMs) to create Customized Prompts via Language models (CuPL, pronounced "couple"). In particular, we leverage the knowledge contained in LLMs to generate many descriptive sentences customized for each object category. We find that this straightforward and general approach improves accuracy on a range of zero-shot image classification benchmarks, including a gain of over one percentage point on ImageNet. Finally, this method requires no additional training and remains completely zero-shot. Code is available at https://github.com/sarahpratt/CuPL.
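
A minimal sketch of the CuPL-style recipe, assuming the OpenAI CLIP package and hand-typed stand-ins for the LLM-generated sentences: each class's descriptions are embedded, normalized and averaged into a single classifier weight, and images are scored by cosine similarity. See the linked repository for the authors' actual prompts and pipeline.

```python
# Sketch of CuPL-style zero-shot classification: LLM-generated descriptions of
# each class are embedded with CLIP and averaged into one weight per class.
# The example sentences below are illustrative stand-ins for LLM output.
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

llm_prompts = {   # normally produced by querying an LLM once per class
    "platypus": ["A platypus is a furry, duck-billed mammal with a flat tail.",
                 "A photo of a platypus swimming in a river."],
    "beaver":   ["A beaver is a large rodent with a broad, flat tail.",
                 "A photo of a beaver gnawing on a tree trunk."],
}

with torch.no_grad():
    weights = []
    for sentences in llm_prompts.values():
        emb = model.encode_text(clip.tokenize(sentences))
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean = emb.mean(dim=0)
        weights.append(mean / mean.norm())
    classifier = torch.stack(weights)                 # (num_classes, dim)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0)
    feat = model.encode_image(image)
    feat = feat / feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * feat @ classifier.T).softmax(dim=-1)

print(dict(zip(llm_prompts, probs.squeeze(0).tolist())))
```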

LogicRank: Logic Induced Reranking for Generative Text-to-Image Systems

Aug 29, 2022
Björn Deiseroth, Patrick Schramowski, Hikaru Shindo, Devendra Singh Dhami, Kristian Kersting

Text-to-image models have recently achieved remarkable success, producing seemingly accurate samples of photo-realistic quality. However, just as state-of-the-art language models still struggle to evaluate precise statements consistently, so do language-model-based image generation processes. In this work we showcase problems that state-of-the-art text-to-image models like DALL-E have with generating accurate samples from statements related to the DrawBench benchmark. Furthermore, we show that CLIP is not able to rerank those generated samples consistently. To this end, we propose LogicRank, a neuro-symbolic reasoning framework that yields a more accurate ranking system for such precision-demanding settings. LogicRank integrates smoothly into the generation process of text-to-image models and can moreover be used to further fine-tune towards a more logically precise model.
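
The reranking hook itself is simple plumbing: sample several candidates for a prompt, score each with some checker, and keep the best. The sketch below shows only that generic hook with a pluggable scoring callable; LogicRank's neuro-symbolic scoring is not reproduced, and the placeholder scorer is purely illustrative.

```python
# Generic sketch of a rerank hook after text-to-image sampling: a pluggable
# scorer (CLIP similarity, a neuro-symbolic checker, ...) orders candidates.
# The placeholder scorer below is not LogicRank's reasoning framework.
import random
from typing import Callable, List, Tuple

def rerank(candidates: List[object],
           score_fn: Callable[[object], float],
           top_k: int = 1) -> List[Tuple[float, object]]:
    scored = sorted(((score_fn(c), c) for c in candidates),
                    key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Usage with a trivial placeholder scorer that prefers brighter "images".
fake_images = [[random.random() for _ in range(16)] for _ in range(8)]
best_two = rerank(fake_images, score_fn=lambda img: sum(img) / len(img), top_k=2)
```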

Scale-free Photo-realistic Adversarial Pattern Attack

Aug 12, 2022
Xiangbo Gao, Weicheng Xie, Minmin Liu, Cheng Luo, Qinliang Lin, Linlin Shen, Keerthy Kusumam, Siyang Song

Traditional pixel-wise image attack algorithms suffer from poor robustness to defense algorithms, i.e., the attack strength degrades dramatically when defense algorithms are applied. Although Generative Adversarial Networks (GANs) can partially address this problem by synthesizing a more semantically meaningful texture pattern, their main limitation is that existing generators can only generate images at a specific scale. In this paper, we propose a scale-free generation-based attack algorithm that synthesizes semantically meaningful adversarial patterns globally for images of arbitrary scale. Our generative attack approach consistently outperforms state-of-the-art methods across a wide range of attack settings, i.e., the proposed approach largely degrades the performance of various image classification, object detection, and instance segmentation algorithms under different advanced defense methods.
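
As a rough illustration of the arbitrary-scale aspect only, the sketch below resizes a generator-produced pattern to the target image resolution and composites it under a small perturbation budget; the paper's actual scale-free generator, attack objective and semantic texture synthesis are not shown, and the budget and blending choices are assumptions.

```python
# Sketch: apply one adversarial pattern to images of arbitrary resolution by
# resizing it before compositing. The generator, attack loss and perturbation
# budget are illustrative, not the paper's method.
import torch
import torch.nn.functional as F

def apply_pattern(image: torch.Tensor, pattern: torch.Tensor,
                  strength: float = 8.0 / 255.0) -> torch.Tensor:
    # image: (B, 3, H, W) in [0, 1]; pattern: (1, 3, h, w) from some generator.
    resized = F.interpolate(pattern, size=image.shape[-2:],
                            mode="bilinear", align_corners=False)
    adversarial = image + strength * resized.tanh()   # bounded additive pattern
    return adversarial.clamp(0.0, 1.0)

adv = apply_pattern(torch.rand(2, 3, 480, 640), torch.randn(1, 3, 128, 128))
```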

Understanding Aesthetics with Language: A Photo Critique Dataset for Aesthetic Assessment

Jun 17, 2022
Daniel Vera Nieto, Luigi Celona, Clara Fernandez-Labrador

Computational inference of aesthetics is an ill-defined task due to its subjective nature. Many datasets have been proposed to tackle the problem by providing pairs of images and aesthetic scores based on human ratings. However, humans are better at expressing their opinion, taste, and emotions by means of language rather than summarizing them in a single number. In fact, photo critiques provide much richer information, as they reveal how and why users rate the aesthetics of visual stimuli. In this regard, we propose the Reddit Photo Critique Dataset (RPCD), which contains tuples of images and photo critiques. RPCD consists of 74K images and 220K comments and is collected from a Reddit community used by hobbyists and professional photographers to improve their photography skills through constructive community feedback. The proposed dataset differs from previous aesthetics datasets mainly in three aspects: (i) the large scale of the dataset and the extensiveness of the comments criticizing different aspects of the image, (ii) it contains mostly UltraHD images, and (iii) it can easily be extended with new data, as it is collected through an automatic pipeline. To the best of our knowledge, this work is the first attempt to estimate the aesthetic quality of visual stimuli from critiques. To this end, we exploit the sentiment polarity of the criticism as an indicator of aesthetic judgment. We demonstrate that sentiment polarity correlates positively with the aesthetic judgments available for two aesthetic assessment benchmarks. Finally, we experiment with several models, using the sentiment scores as a target for ranking images. The dataset and baselines are available at https://github.com/mediatechnologycenter/aestheval.
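
The evaluation idea — scoring each photo by the sentiment polarity of its critiques and checking rank correlation against numeric aesthetic ratings — can be sketched in a few lines. VADER and the toy data below are stand-ins for the sentiment models and benchmarks used in the paper; the released baselines live in the linked repository.

```python
# Sketch: score each photo by the mean sentiment polarity of its critiques and
# measure rank correlation with human aesthetic ratings. VADER and the toy
# data are stand-ins for the models and benchmarks used in the paper.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from scipy.stats import spearmanr

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

critiques_per_photo = [   # toy examples; RPCD pairs each image with comments
    ["Beautiful light and composition.", "Love the colors here."],
    ["The horizon is tilted and the subject is blurry.", "Too noisy for me."],
    ["Decent shot, but the crop feels a bit tight."],
]
human_scores = [8.2, 3.1, 5.5]   # hypothetical aesthetic ratings

polarity = [
    sum(sia.polarity_scores(c)["compound"] for c in comments) / len(comments)
    for comments in critiques_per_photo
]
rho, p = spearmanr(polarity, human_scores)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
```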

Learning 6D Pose Estimation from Synthetic RGBD Images for Robotic Applications

Aug 30, 2022
Hongpeng Cao, Lukas Dirnberger, Daniele Bernardini, Cristina Piazza, Marco Caccamo

In this work, we propose a data generation pipeline that leverages the 3D suite Blender to produce synthetic RGBD image datasets with 6D poses for robotic picking. The proposed pipeline can efficiently generate large amounts of photo-realistic RGBD images for the object of interest. In addition, a collection of domain randomization techniques is introduced to bridge the gap between real and synthetic data. Furthermore, we develop a real-time two-stage 6D pose estimation approach by integrating the object detector YOLO-V4-tiny with the 6D pose estimation algorithm PVN3D for time-sensitive robotics applications. With the proposed data generation pipeline, our pose estimation approach can be trained from scratch using only synthetic data, without any pre-trained models. The resulting network shows competitive performance compared to state-of-the-art methods when evaluated on the LineMod dataset. We also demonstrate the proposed approach in a robotic experiment, grasping a household object from a cluttered background under different lighting conditions.
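
The two-stage inference described here — a 2D detector proposing object regions, followed by a pose estimator on the cropped RGBD data — can be outlined structurally as below. The detector and pose-estimator callables are hypothetical wrappers (real YOLO-V4-tiny and PVN3D interfaces depend on the chosen implementations); the structure of the pipeline, not the API, is what the sketch shows.

```python
# Structural sketch of the two-stage pipeline: a 2D detector proposes regions,
# then a pose estimator runs on the cropped RGB-D data. The detector and
# pose-estimator callables are hypothetical wrappers, not real APIs.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple
import numpy as np

@dataclass
class Detection:
    label: str
    box: Tuple[int, int, int, int]   # x1, y1, x2, y2 in pixels

def estimate_poses(rgb: np.ndarray, depth: np.ndarray,
                   detector: Callable[[np.ndarray], List[Detection]],
                   pose_net: Callable[[np.ndarray, np.ndarray, Detection], np.ndarray]
                   ) -> Dict[str, np.ndarray]:
    """Return a 4x4 object-to-camera transform per detected object."""
    poses = {}
    for det in detector(rgb):                         # stage 1: YOLO-style detection
        x1, y1, x2, y2 = det.box
        crop_rgb, crop_depth = rgb[y1:y2, x1:x2], depth[y1:y2, x1:x2]
        poses[det.label] = pose_net(crop_rgb, crop_depth, det)  # stage 2: PVN3D-style
    return poses

# Usage with dummy stand-ins for the two stages.
dummy_detector = lambda rgb: [Detection("mug", (10, 10, 90, 90))]
dummy_pose_net = lambda rgb, depth, det: np.eye(4)
print(estimate_poses(np.zeros((120, 160, 3)), np.zeros((120, 160)),
                     dummy_detector, dummy_pose_net))
```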

4D LUT: Learnable Context-Aware 4D Lookup Table for Image Enhancement

Sep 05, 2022
Chengxu Liu, Huan Yang, Jianlong Fu, Xueming Qian

Image enhancement aims to improve the aesthetic visual quality of photos by retouching color and tone, and is an essential technology for professional digital photography. In recent years, deep learning-based image enhancement algorithms have achieved promising performance and attracted increasing popularity. However, typical efforts attempt to construct a uniform enhancer for all pixels' color transformation, which ignores the differences between pixels of different content (e.g., sky, ocean, etc.) that are significant in photographs, causing unsatisfactory results. In this paper, we propose a novel learnable context-aware 4-dimensional lookup table (4D LUT), which achieves content-dependent enhancement of different content in each image by adaptively learning the photo context. In particular, we first introduce a lightweight context encoder and a parameter encoder to learn a context map for pixel-level categories and a group of image-adaptive coefficients, respectively. Then, the context-aware 4D LUT is generated by integrating multiple basis 4D LUTs via the coefficients. Finally, the enhanced image is obtained by feeding the source image and context map into the fused context-aware 4D LUT via quadrilinear interpolation. Compared with the traditional 3D LUT, i.e., an RGB-to-RGB mapping, which is commonly used in camera imaging pipelines and editing tools, the 4D LUT, i.e., an RGBC (RGB + Context) to RGB mapping, enables finer control of color transformations for pixels with different content in each image, even when they have the same RGB values. Experimental results demonstrate that our method outperforms other state-of-the-art methods on widely used benchmarks.
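
To make the RGBC lookup concrete, here is a simplified sketch: image-adaptive coefficients fuse several basis 4D LUTs, and each pixel is then looked up by its RGB value plus its context value. For brevity the sketch uses nearest-neighbor bin indexing instead of the quadrilinear interpolation described in the paper, and all shapes are illustrative.

```python
# Simplified sketch of a context-aware 4D LUT: coefficients fuse basis LUTs,
# then each pixel is indexed by (R, G, B, Context). Nearest-neighbor indexing
# replaces the paper's quadrilinear interpolation for brevity.
import torch

def fuse_and_apply(image, context, basis_luts, coeffs):
    # image: (B, 3, H, W) in [0, 1]; context: (B, 1, H, W) in [0, 1]
    # basis_luts: (K, N, N, N, N, 3); coeffs: (B, K) from a parameter encoder
    B = image.shape[0]
    N = basis_luts.shape[1]
    fused = torch.einsum("bk,krgsco->brgsco", coeffs, basis_luts)  # (B,N,N,N,N,3)
    rgbc = torch.cat([image, context], dim=1)            # (B, 4, H, W)
    idx = (rgbc.clamp(0, 1) * (N - 1)).round().long()    # bin index per channel
    out = torch.empty_like(image)
    for b in range(B):
        r, g, bl, c = idx[b, 0], idx[b, 1], idx[b, 2], idx[b, 3]
        out[b] = fused[b, r, g, bl, c].permute(2, 0, 1)   # (3, H, W)
    return out

basis = torch.rand(4, 17, 17, 17, 17, 3)                  # K=4 basis LUTs, 17 bins
coeffs = torch.softmax(torch.randn(1, 4), dim=-1)         # image-adaptive weights
enhanced = fuse_and_apply(torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64),
                          basis, coeffs)
```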

Significance of Skeleton-based Features in Virtual Try-On

Sep 01, 2022
Debapriya Roy, Sanchayan Santra, Diganta Mukherjee, Bhabatosh Chanda

The idea of Virtual Try-On (VTON) benefits e-retailing by giving users the convenience of trying on clothing from the comfort of their home. In general, most existing VTON methods produce inconsistent results when a person posing with folded arms, i.e., bent or crossed, wants to try on an outfit. The problem becomes severe in the case of long-sleeved outfits, since with crossed-arm postures different clothing parts may overlap. The existing approaches, especially the warping-based methods employing the Thin Plate Spline (TPS) transform, cannot handle such cases. To this end, we attempt a solution in which the clothing of the source person is segmented into semantically meaningful parts and each part is warped independently to the shape of the person. To address the bending issue, we employ hand-crafted geometric features consistent with human body geometry for warping the source outfit. In addition, we propose two learning-based modules: a synthesizer network and a mask prediction network. Together, these attempt to produce a photo-realistic, pose-robust VTON solution without requiring any paired training data. Comparison with benchmark methods clearly establishes the effectiveness of the approach.
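
The part-wise warping idea — segment the source garment into semantic parts and warp each part independently toward the target pose — can be outlined with the sketch below. A per-part affine fit stands in for the paper's geometry-guided warping, and the part masks and keypoints are assumed to come from upstream segmentation and skeleton modules.

```python
# Sketch of part-wise garment warping: each semantic clothing part is warped
# with its own transform and the results are composited. Affine fits stand in
# for the paper's geometry-guided warping; masks/keypoints come from upstream.
import cv2
import numpy as np

def warp_parts(clothing, part_masks, src_kpts, dst_kpts, out_size):
    # clothing:   (H, W, 3) source clothing image
    # part_masks: dict part name -> (H, W) binary mask (torso, left sleeve, ...)
    # src_kpts / dst_kpts: dict part name -> (N, 2) float32 keypoints (N >= 3)
    h, w = out_size
    canvas = np.zeros((h, w, 3), dtype=clothing.dtype)
    for name, mask in part_masks.items():
        M, _ = cv2.estimateAffine2D(src_kpts[name], dst_kpts[name])
        part = clothing * mask[..., None]
        warped = cv2.warpAffine(part, M, (w, h))
        warped_mask = cv2.warpAffine(mask.astype(np.float32), M, (w, h))
        canvas = np.where(warped_mask[..., None] > 0.5, warped, canvas)
    return canvas   # later refined by the synthesizer and mask-prediction nets
```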