"photo": models, code, and papers

FCA: Learning a 3D Full-coverage Vehicle Camouflage for Multi-view Physical Adversarial Attack

Sep 15, 2021
DonghuaWang, Tingsong Jiang, Jialiang Sun, Weien Zhou, Xiaoya Zhang, Zhiqiang Gong, Wen Yao, Xiaoqian Chen

Physical adversarial attacks on object detection have attracted increasing attention. However, most previous works focus on hiding objects from the detector by generating an individual adversarial patch, which only covers the planar part of the vehicle's surface and fails to attack the detector in physical scenarios with multi-view, long-distance, and partially occluded objects. To bridge the gap between digital attacks and physical attacks, we exploit the full 3D vehicle surface and propose a robust Full-coverage Camouflage Attack (FCA) to fool detectors. Specifically, we first render the non-planar camouflage texture over the full vehicle surface. To mimic real-world environmental conditions, we then introduce a transformation function to transfer the rendered camouflaged vehicle into a photo-realistic scene. Finally, we design an efficient loss function to optimize the camouflage texture. Experiments show that the full-coverage camouflage attack not only outperforms state-of-the-art methods under various test cases but also generalizes to different environments, vehicles, and object detectors.

* 9 pages, 5 figures 
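To make the optimization concrete, the sketch below shows the kind of objective such a texture attack typically minimizes: a detection-suppression term plus a total-variation smoothness prior over the UV texture. The detector scores are stand-in tensors here and the loss weighting is illustrative; this is not the authors' implementation.

```python
import torch

def total_variation(texture):
    """Smoothness prior over the UV texture so the camouflage stays printable."""
    dh = (texture[..., 1:, :] - texture[..., :-1, :]).abs().mean()
    dw = (texture[..., :, 1:] - texture[..., :, :-1]).abs().mean()
    return dh + dw

def camouflage_loss(det_scores, texture, tv_weight=1e-4):
    """Illustrative attack objective: suppress the detector's most confident
    score on the rendered, scene-composited vehicle while keeping the texture
    smooth. `det_scores` stands in for the detector output (not shown here)."""
    attack = det_scores.max()
    return attack + tv_weight * total_variation(texture)

# Toy usage with dummy tensors in place of renderer / detector outputs.
texture = torch.rand(1, 3, 512, 512, requires_grad=True)   # full-surface UV texture
det_scores = torch.sigmoid(torch.randn(10))                # placeholder per-box confidences
loss = camouflage_loss(det_scores, texture)
loss.backward()
```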

Unpaired Photo-to-manga Translation Based on The Methodology of Manga Drawing

Apr 22, 2020
Hao Su, Jianwei Niu, Xuefeng Liu, Qingfeng Li, Jiahe Cui, Ji Wan

Manga is a world-popular comic form that originated in Japan and typically employs black-and-white stroke lines and geometric exaggeration to depict human appearances, poses, and actions. In this paper, we propose MangaGAN, the first method based on Generative Adversarial Networks (GANs) for unpaired photo-to-manga translation. Inspired by how experienced manga artists draw manga, MangaGAN generates the geometric features of a manga face with a dedicated GAN model and delicately translates each facial region into the manga domain with a tailored multi-GAN architecture. To train MangaGAN, we construct a new dataset collected from a popular manga work, containing manga facial features, landmarks, bodies, and so on. Moreover, to produce high-quality manga faces, we further propose a structural smoothing loss that smooths stroke lines and avoids noisy pixels, and a similarity-preserving module that improves the similarity between the photo and manga domains. Extensive experiments show that MangaGAN produces high-quality manga faces that preserve both facial similarity and a popular manga style, and that it outperforms other related state-of-the-art methods.

* 17 pages 
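The abstract does not spell out the structural smoothing loss, so the snippet below is only an illustrative neighbourhood-agreement penalty of the same flavour: it pulls each pixel of the generated manga toward the average of its four neighbours, which discourages isolated noisy pixels along stroke lines. The kernel, image size, and grayscale assumption are all placeholders, not MangaGAN's definition.

```python
import torch
import torch.nn.functional as F

def neighborhood_smoothing_loss(manga):
    """Illustrative smoothing penalty: encourage each pixel of the generated
    manga image to agree with the average of its 4-neighbourhood, suppressing
    isolated noisy pixels along stroke lines."""
    kernel = manga.new_tensor([[0.00, 0.25, 0.00],
                               [0.25, 0.00, 0.25],
                               [0.00, 0.25, 0.00]]).view(1, 1, 3, 3)
    kernel = kernel.repeat(manga.size(1), 1, 1, 1)      # one kernel per channel
    padded = F.pad(manga, (1, 1, 1, 1), mode='replicate')
    neighbor_avg = F.conv2d(padded, kernel, groups=manga.size(1))
    return (manga - neighbor_avg).abs().mean()

fake_manga = torch.rand(2, 1, 256, 256)                 # toy grayscale manga faces
print(neighborhood_smoothing_loss(fake_manga))
```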

Towards Unsupervised Sketch-based Image Retrieval

May 18, 2021
Conghui Hu, Yongxin Yang, Yunpeng Li, Timothy M. Hospedales, Yi-Zhe Song

Current supervised sketch-based image retrieval (SBIR) methods achieve excellent performance. However, the cost of data collection and labeling imposes an intractable barrier to practical deployment in real applications. In this paper, we present the first attempt at unsupervised SBIR to remove the labeling cost (category annotations and sketch-photo pairings) that is conventionally needed for training. Existing single-domain unsupervised representation learning methods perform poorly in this application due to the unique cross-domain (sketch and photo) nature of the problem. We therefore introduce a novel framework that simultaneously performs unsupervised representation learning and sketch-photo domain alignment. Technically, this is underpinned by exploiting joint distribution optimal transport (JDOT) to align data from different domains during representation learning, which we extend with trainable cluster prototypes and feature memory banks to further improve scalability and efficacy. Extensive experiments show that our framework achieves excellent performance in the new unsupervised setting and performs comparably to or better than the state-of-the-art in the zero-shot setting.

* 10 pages, 6 figures 
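As a rough illustration of the optimal-transport ingredient, the sketch below computes an entropic transport plan between a batch of sketch features and a batch of photo features with Sinkhorn iterations and uses the resulting transport cost as an alignment loss. It omits the joint-distribution (label/prediction) term of JDOT, the cluster prototypes, and the memory banks; all dimensions and the regularization strength are illustrative.

```python
import torch

def sinkhorn_plan(cost, reg=0.05, n_iters=50):
    """Entropic optimal transport plan between two uniform empirical measures,
    computed with Sinkhorn iterations. Illustrative, not the paper's code."""
    n, m = cost.shape
    a = torch.full((n,), 1.0 / n)            # uniform weights on sketch samples
    b = torch.full((m,), 1.0 / m)            # uniform weights on photo samples
    K = torch.exp(-cost / reg)
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v + 1e-9)
        v = b / (K.t() @ u + 1e-9)
    return u.unsqueeze(1) * K * v.unsqueeze(0)

def ot_alignment_loss(sketch_feats, photo_feats):
    """Transport cost between unlabeled sketch and photo feature batches."""
    cost = torch.cdist(sketch_feats, photo_feats) ** 2
    plan = sinkhorn_plan(cost.detach())       # plan fixed; features receive the gradient
    return (plan * cost).sum()

sketch_feats = torch.randn(32, 128, requires_grad=True)
photo_feats = torch.randn(32, 128, requires_grad=True)
ot_alignment_loss(sketch_feats, photo_feats).backward()
```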

Pose with Style: Detail-Preserving Pose-Guided Image Synthesis with Conditional StyleGAN

Sep 13, 2021
Badour AlBahar, Jingwan Lu, Jimei Yang, Zhixin Shu, Eli Shechtman, Jia-Bin Huang

We present an algorithm for re-rendering a person from a single image under arbitrary poses. Existing methods often have difficulties in hallucinating occluded contents photo-realistically while preserving the identity and fine details in the source image. We first learn to inpaint the correspondence field between the body surface texture and the source image with a human body symmetry prior. The inpainted correspondence field allows us to transfer/warp local features extracted from the source to the target view even under large pose changes. Directly mapping the warped local features to an RGB image using a simple CNN decoder often leads to visible artifacts. Thus, we extend the StyleGAN generator so that it takes pose as input (for controlling poses) and introduces a spatially varying modulation for the latent space using the warped local features (for controlling appearances). We show that our method compares favorably against the state-of-the-art algorithms in both quantitative evaluation and visual comparison.

* SIGGRAPH Asia 2021. Project page: https://pose-with-style.github.io/ 
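A toy version of the spatially varying modulation described above: per-pixel scale and shift maps are predicted from the warped source features and applied to generator activations at the matching resolution. The channel counts and 3x3 prediction heads are assumptions for illustration, not the Pose with Style architecture.

```python
import torch
import torch.nn as nn

class SpatialModulation(nn.Module):
    """Toy spatially varying modulation: predict a per-pixel scale and shift
    from warped source features and apply them to generator activations."""
    def __init__(self, feat_ch, gen_ch):
        super().__init__()
        self.to_scale = nn.Conv2d(feat_ch, gen_ch, kernel_size=3, padding=1)
        self.to_shift = nn.Conv2d(feat_ch, gen_ch, kernel_size=3, padding=1)

    def forward(self, gen_act, warped_feat):
        scale = self.to_scale(warped_feat)
        shift = self.to_shift(warped_feat)
        return gen_act * (1.0 + scale) + shift

mod = SpatialModulation(feat_ch=64, gen_ch=256)
gen_act = torch.randn(1, 256, 32, 32)    # generator activations at some resolution
warped = torch.randn(1, 64, 32, 32)      # warped local features from the source image
out = mod(gen_act, warped)               # appearance-controlled activations
```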

Uncertainty-aware GAN with Adaptive Loss for Robust MRI Image Enhancement

Oct 07, 2021
Uddeshya Upadhyay, Viswanath P. Sudarshan, Suyash P. Awate

Image-to-image translation is an ill-posed problem, as a unique one-to-one mapping may not exist between the source and target images. Learning-based methods proposed in this context often evaluate performance on test data that is similar to the training data, which may be impractical. This demands robust methods that can quantify uncertainty in the prediction for making informed decisions, especially for critical areas such as medical imaging. Recent works that employ conditional generative adversarial networks (GANs) have shown improved performance in learning photo-realistic image-to-image mappings between the source and the target images. However, these methods do not focus on (i) robustness of the models to out-of-distribution (OOD)-noisy data and (ii) uncertainty quantification. This paper proposes a GAN-based framework that (i) models an adaptive loss function for robustness to OOD-noisy data, which automatically tunes the spatially varying norm for penalizing the residuals, and (ii) estimates the per-voxel uncertainty in the predictions. We demonstrate our method on two key applications in medical imaging: (i) undersampled magnetic resonance imaging (MRI) reconstruction and (ii) MRI modality propagation. Our experiments with two different real-world datasets show that the proposed method (i) is robust to OOD-noisy test data and provides improved accuracy and (ii) quantifies voxel-level uncertainty in the predictions.

* Accepted at IEEE ICCV-2021 workshop on Computer Vision for Automated Medical Diagnosis 
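The exact adaptive loss is not given in the abstract; the sketch below is an illustrative heteroscedastic loss in the spirit described: a generalized-Gaussian negative log-likelihood in which the network would predict a per-voxel scale (acting as uncertainty) and a per-voxel shape parameter (the spatially varying norm on the residual). The tensor shapes and clamping constants are placeholders.

```python
import torch

def adaptive_heteroscedastic_loss(pred, target, scale, shape, eps=1e-6):
    """Illustrative per-voxel adaptive loss: negative log-likelihood of a
    generalized Gaussian with per-voxel scale (uncertainty) and shape
    (spatially varying norm exponent). A sketch, not the paper's exact loss."""
    scale = scale.clamp(min=eps)
    shape = shape.clamp(min=eps)
    residual = (pred - target).abs()
    nll = (residual / scale) ** shape \
          - torch.log(shape) + torch.log(2 * scale) + torch.lgamma(1.0 / shape)
    return nll.mean()

pred = torch.randn(1, 1, 64, 64, requires_grad=True)
target = torch.randn(1, 1, 64, 64)
scale = torch.rand(1, 1, 64, 64) + 0.5    # per-voxel uncertainty (would be predicted)
shape = torch.rand(1, 1, 64, 64) + 1.0    # per-voxel norm exponent (would be predicted)
adaptive_heteroscedastic_loss(pred, target, scale, shape).backward()
```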

Sample-specific repetitive learning for photo aesthetic assessment and highlight region extraction

Sep 18, 2019
Ying Dai

Aesthetic assessment is subjective, and the distribution of aesthetic levels is imbalanced. To enable automatic assessment of photo aesthetics, we focus on retraining a CNN-based aesthetic assessment model by repetitively dropping samples in the middle aesthetic levels from the training data set, in order to overcome the effect of imbalanced aesthetic data on classification. We further present a method for extracting the aesthetic highlight region of a photo using the two repetitively trained models. The correlation between the extracted regions and the aesthetic levels is then analyzed to illustrate which aesthetic features influence the aesthetic quality of a photo. Moreover, the test data set comes from a different data source, 500px. Experimental results show that the proposed method is effective.

* 15 pages, 9 figures, 3 tables 
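The core data-handling step, dropping ambiguous middle-level samples before retraining, can be illustrated with the toy filter below. The score thresholds, the (path, score) representation, and the binary low/high relabeling are placeholder assumptions rather than details from the paper.

```python
# Illustrative sample-filtering step before retraining: keep only photos whose
# aesthetic scores fall in the clearly-low or clearly-high bands, dropping the
# ambiguous middle levels. Threshold values are arbitrary placeholders.
def split_for_retraining(samples, low_cut=3.0, high_cut=7.0):
    """`samples` is a list of (image_path, aesthetic_score) pairs."""
    kept = [(path, 0 if score <= low_cut else 1)     # relabel as low/high classes
            for path, score in samples
            if score <= low_cut or score >= high_cut]
    dropped = [(path, score) for path, score in samples if low_cut < score < high_cut]
    return kept, dropped

samples = [("a.jpg", 2.1), ("b.jpg", 5.4), ("c.jpg", 8.3), ("d.jpg", 6.9)]
kept, dropped = split_for_retraining(samples)
print(kept)     # [('a.jpg', 0), ('c.jpg', 1)]
print(dropped)  # [('b.jpg', 5.4), ('d.jpg', 6.9)]
```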

FA-GAN: Feature-Aware GAN for Text to Image Synthesis

Sep 02, 2021
Eunyeong Jeon, Kunhee Kim, Daijin Kim

Text-to-image synthesis aims to generate a photo-realistic image from a given natural language description. Previous works have made significant progress with Generative Adversarial Networks (GANs). Nonetheless, it is still hard to generate intact objects or clear textures (Fig. 1). To address this issue, we propose the Feature-Aware Generative Adversarial Network (FA-GAN), which synthesizes high-quality images by integrating two techniques: a self-supervised discriminator and a feature-aware loss. First, we design a self-supervised discriminator with an auxiliary decoder so that the discriminator can extract better representations. Second, we introduce a feature-aware loss to provide the generator with more direct supervision by employing the feature representation from the self-supervised discriminator. Experiments on the MS-COCO dataset show that our proposed method significantly advances the state-of-the-art FID score from 28.92 to 24.58.

* ICIP 2021 
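A minimal sketch of the feature-aware idea: the generator is supervised by matching the discriminator's intermediate features on generated and real images. The tiny discriminator, its layer sizes, and the L1 distance are illustrative choices, not the FA-GAN architecture (which additionally uses a self-supervised auxiliary decoder, omitted here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDiscriminator(nn.Module):
    """Toy discriminator exposing an intermediate feature map, used both for
    the adversarial decision and for a feature-aware (feature-matching) loss."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.head = nn.Conv2d(128, 1, 3, padding=1)

    def forward(self, x):
        feat = self.features(x)
        return self.head(feat), feat

disc = TinyDiscriminator()
real = torch.randn(4, 3, 64, 64)
fake = torch.randn(4, 3, 64, 64, requires_grad=True)    # would come from the generator

_, feat_real = disc(real)
_, feat_fake = disc(fake)
feature_aware_loss = F.l1_loss(feat_fake, feat_real.detach())  # generator-side supervision
feature_aware_loss.backward()
```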

PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering

Sep 17, 2021
Yurui Ren, Ge Li, Yuanqi Chen, Thomas H. Li, Shan Liu

Generating portrait images by controlling the motions of existing faces is an important task for social media industries. For easy use and intuitive control, semantically meaningful and fully disentangled parameters should be used as modifications. However, many existing techniques do not provide such fine-grained controls or rely on indirect editing methods, i.e., mimicking the motions of other individuals. In this paper, a Portrait Image Neural Renderer (PIRenderer) is proposed to control face motions with the parameters of three-dimensional morphable face models (3DMMs). The proposed model can generate photo-realistic portrait images with accurate movements according to intuitive modifications. Experiments on both direct and indirect editing tasks demonstrate the superiority of this model. Meanwhile, we further extend this model to tackle the audio-driven facial reenactment task by extracting sequential motions from audio inputs. We show that our model can generate coherent videos with convincing movements from only a single reference image and a driving audio stream. Our source code is available at https://github.com/RenYurui/PIRender.
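To give a feel for the control interface, the toy module below maps a vector of 3DMM motion coefficients (expression plus head pose; the 70-dimensional size and MLP widths are placeholders) to a latent descriptor that a neural renderer could be conditioned on. It is a hypothetical sketch of the coefficient-to-latent step, not PIRenderer's mapping network.

```python
import torch
import torch.nn as nn

class MotionMappingNet(nn.Module):
    """Toy mapping from 3DMM motion coefficients to a latent descriptor that
    downstream warping / rendering modules could be conditioned on."""
    def __init__(self, coeff_dim=70, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(coeff_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, latent_dim),
        )

    def forward(self, coeffs):
        return self.net(coeffs)

mapper = MotionMappingNet()
coeffs = torch.randn(1, 70)      # user-edited or audio-predicted motion parameters
latent = mapper(coeffs)          # conditions the renderer downstream
```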

Multi-Level Visual Similarity Based Personalized Tourist Attraction Recommendation Using Geo-Tagged Photos

Sep 17, 2021
Ling Chen, Dandan Lyu, Shanshan Yu, Gencai Chen

Geo-tagged-photo-based tourist attraction recommendation can discover users' travel preferences from the photos they take, so as to recommend suitable tourist attractions to them. However, existing visual-content-based methods cannot fully exploit the user and tourist attraction information of photos to extract visual features, and do not differentiate the significance of different photos. In this paper, we propose multi-level visual similarity based personalized tourist attraction recommendation using geo-tagged photos (MEAL). MEAL utilizes the visual contents of photos and interaction behavior data to obtain the final embeddings of users and tourist attractions, which are then used to predict visit probabilities. Specifically, by crossing the user and tourist attraction information of photos, we define four visual similarity levels and introduce a corresponding quintuplet loss to embed the visual contents of photos. In addition, to capture the significance of different photos, we exploit the self-attention mechanism to obtain the visual representations of users and tourist attractions. We conducted experiments on a dataset crawled from Flickr, and the experimental results demonstrate the advantage of this method.

* 17 pages, 4 figures 
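The abstract defines four similarity levels but not the exact quintuplet loss, so the sketch below is only one plausible margin-based reading: an anchor photo plus four photos at decreasing similarity levels, with each level required to sit closer to the anchor than the next by a margin. The embedding size, margin, and Euclidean distance are illustrative assumptions, not MEAL's formulation.

```python
import torch
import torch.nn.functional as F

def quintuplet_loss(anchor, p1, p2, p3, p4, margin=0.2):
    """Illustrative quintuplet ranking loss: p1..p4 are photo embeddings at
    decreasing visual-similarity levels w.r.t. the anchor (obtained by crossing
    user and attraction identity)."""
    d = lambda x: F.pairwise_distance(anchor, x)
    loss = (F.relu(d(p1) - d(p2) + margin)
            + F.relu(d(p2) - d(p3) + margin)
            + F.relu(d(p3) - d(p4) + margin))
    return loss.mean()

emb = lambda: F.normalize(torch.randn(8, 128), dim=1)   # toy photo embeddings
print(quintuplet_loss(emb(), emb(), emb(), emb(), emb()))
```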

Relightable Neural Video Portrait

Jul 30, 2021
Youjia Wang, Taotao Zhou, Minzhang Li, Teng Xu, Minye Wu, Lan Xu, Jingyi Yu

Photo-realistic facial video portrait reenactment benefits virtual production and numerous VR/AR experiences. The task remains challenging because the portrait should maintain high realism and consistency with the target environment. In this paper, we present a relightable neural video portrait, a simultaneous relighting and reenactment scheme that transfers the head pose and facial expressions from a source actor to a portrait video of a target actor with arbitrary new backgrounds and lighting conditions. Our approach combines 4D reflectance field learning, model-based facial performance capture, and target-aware neural rendering. Specifically, we adopt a rendering-to-video translation network to first synthesize high-quality OLAT image sets and alpha mattes from hybrid facial performance capture results. We then design a semantic-aware facial normalization scheme to enable reliable explicit control, as well as a multi-frame multi-task learning strategy to encode content, segmentation, and temporal information simultaneously for high-quality reflectance field inference. After training, our approach further enables photo-realistic and controllable video portrait editing of the target performer. Reliable face pose and expression editing is obtained by applying the same hybrid facial capture and normalization scheme to the source video input, while our explicit alpha and OLAT outputs enable high-quality relighting and background editing. By achieving simultaneous relighting and reenactment, we improve realism in a variety of virtual production and video rewrite applications.
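Once per-frame OLAT images and an alpha matte are available, relighting under a target environment reduces, in the simplest linear model, to a weighted sum of the OLAT images followed by alpha compositing onto the new background. The toy function below shows that step; the number of lights, the resolutions, and the uniform weight sampling are illustrative placeholders, not the paper's pipeline.

```python
import torch

def relight_and_composite(olat_images, light_weights, alpha, background):
    """Toy relighting: `olat_images` holds one image per light direction
    (L, 3, H, W), `light_weights` (L,) samples the target environment lighting,
    `alpha` (1, H, W) is the predicted matte."""
    relit = torch.einsum('l,lchw->chw', light_weights, olat_images)
    return alpha * relit + (1.0 - alpha) * background

L, H, W = 16, 128, 128
olat = torch.rand(L, 3, H, W)                       # predicted OLAT image set
weights = torch.rand(L); weights /= weights.sum()   # environment-map samples
alpha = torch.rand(1, H, W)                         # predicted alpha matte
bg = torch.rand(3, H, W)                            # new background
out = relight_and_composite(olat, weights, alpha, bg)
```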