"photo": models, code, and papers

Beyond PRNU: Learning Robust Device-Specific Fingerprint for Source Camera Identification

Nov 03, 2021
Manisha, Chang-Tsun Li, Xufeng Lin, Karunakar A. Kotegar

Source camera identification tools assist image forensic investigators in associating an image in question with a suspect camera. Various techniques have been developed based on the analysis of the subtle traces left in images during acquisition. The Photo Response Non-Uniformity (PRNU) noise pattern caused by sensor imperfections has been proven to be an effective way to identify the source camera. The existing literature suggests that the PRNU is the only fingerprint that is device-specific and capable of identifying the exact source device. However, the PRNU is susceptible to camera settings, image content, image processing operations, and counter-forensic attacks. A forensic investigator unaware of counter-forensic attacks or incidental image manipulations is at risk of being misled. The spatial synchronization requirement during the matching of two PRNUs also represents a major limitation of the PRNU. In recent years, deep learning based approaches have been successful in identifying source camera models. However, the identification of individual cameras of the same model through these data-driven approaches remains unsatisfactory. In this paper, we bring to light the existence of a new robust data-driven device-specific fingerprint in digital images which is capable of identifying individual cameras of the same model. The new device fingerprint is found to be location-independent, stochastic, and globally available, which resolves the spatial synchronization issue. Unlike the PRNU, which resides in the high-frequency band, the new device fingerprint is extracted from the low- and mid-frequency bands, which resolves the fragility issue that the PRNU is unable to contend with. Our experiments on various datasets demonstrate that the new fingerprint is highly resilient to image manipulations such as rotation, gamma correction, and aggressive JPEG compression.

* 11 pages 
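The paper's extraction pipeline is not reproduced in this listing; the short Python sketch below only illustrates the frequency-band contrast the abstract draws: a PRNU-style fingerprint comes from a high-frequency denoising residual, while the proposed fingerprint is drawn from the low and mid frequencies. The Gaussian denoiser, band cut-offs, and function names are assumptions for illustration, not the authors' method.

# Minimal sketch, assuming a grayscale image as a 2-D float array.
import numpy as np
from scipy import fft
from scipy.ndimage import gaussian_filter

def prnu_style_residual(gray, sigma=1.0):
    """High-frequency residual: image minus a smoothed (denoised) version."""
    return gray - gaussian_filter(gray, sigma)

def low_mid_band(gray, low=0.02, mid=0.35):
    """Keep only low/mid spatial frequencies via a radial FFT mask (assumed cut-offs)."""
    h, w = gray.shape
    F = fft.fftshift(fft.fft2(gray))
    fy = (np.arange(h) - h // 2) / h              # normalised vertical frequency
    fx = (np.arange(w) - w // 2) / w              # normalised horizontal frequency
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    mask = (radius >= low) & (radius <= mid)      # drop DC/low tail and high band
    return np.real(fft.ifft2(fft.ifftshift(F * mask)))

img = np.random.rand(256, 256)   # stand-in for a real photo
hf = prnu_style_residual(img)    # where the PRNU is conventionally sought
lm = low_mid_band(img)           # the band the new fingerprint is said to occupy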

An implementation of ROS Autonomous Navigation on Parallax Eddie platform

Aug 28, 2021
Hafiq Anas, Wee Hong Ong

This paper presents an implementation of autonomous navigation functionality based on the Robot Operating System (ROS) on a wheeled differential-drive mobile platform called the Eddie robot. ROS is a framework that contains many reusable software stacks as well as visualization and debugging tools, providing an ideal environment for any robotic project development. The main contribution of this paper is the description of the customized hardware and software setup that enables the Eddie robot to work with the ROS autonomous navigation system known as the Navigation Stack, and the implementation of one application use case of autonomous navigation. In this paper, photo taking is chosen to demonstrate a use case of the mobile robot.

* 12 pages, 23 figures, 9 tables, 24 equations 
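A minimal sketch of a photo-taking use case on top of a standard Navigation Stack setup is given below. It assumes the usual move_base action server and a generic camera topic; the actual node names, topics, and goal handling used on the Eddie platform are not specified in this abstract.

#!/usr/bin/env python
# Hedged sketch: send one navigation goal, then save a photo on arrival.
import actionlib
import cv2
import rospy
from cv_bridge import CvBridge
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal
from sensor_msgs.msg import Image

def navigate_and_shoot(x, y, photo_path="photo.jpg"):
    client = actionlib.SimpleActionClient("move_base", MoveBaseAction)
    client.wait_for_server()

    goal = MoveBaseGoal()
    goal.target_pose.header.frame_id = "map"
    goal.target_pose.header.stamp = rospy.Time.now()
    goal.target_pose.pose.position.x = x
    goal.target_pose.pose.position.y = y
    goal.target_pose.pose.orientation.w = 1.0      # keep heading unchanged

    client.send_goal(goal)
    client.wait_for_result()                       # block until the goal finishes

    # Grab one frame from the camera once the robot has arrived (assumed topic).
    msg = rospy.wait_for_message("/camera/image_raw", Image)
    frame = CvBridge().imgmsg_to_cv2(msg, desired_encoding="bgr8")
    cv2.imwrite(photo_path, frame)

if __name__ == "__main__":
    rospy.init_node("photo_waypoint_demo")
    navigate_and_shoot(1.0, 0.5)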

CRD-CGAN: Category-Consistent and Relativistic Constraints for Diverse Text-to-Image Generation

Jul 28, 2021
Tao Hu, Chengjiang Long, Chunxia Xiao

Generating photo-realistic images from a text description is a challenging problem in computer vision. Previous works have shown promising performance in generating synthetic images conditioned on text with Generative Adversarial Networks (GANs). In this paper, we focus on category-consistent and relativistic diverse constraints to optimize the diversity of synthetic images. Based on those constraints, a category-consistent and relativistic diverse conditional GAN (CRD-CGAN) is proposed to synthesize $K$ photo-realistic images simultaneously. We use an attention loss and a diversity loss to improve the sensitivity of the GAN to word attention and noise. We then employ a relativistic conditional loss to estimate the probability that a synthetic image is relatively real or fake, which improves the performance of the basic conditional loss. Finally, we introduce a category-consistent loss to alleviate over-category issues among the $K$ synthetic images. We evaluate our approach on the Birds-200-2011, Oxford-102 flower and MSCOCO 2014 datasets, and extensive experiments demonstrate the superiority of the proposed method over state-of-the-art methods in terms of the photorealism and diversity of the generated synthetic images.
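For readers unfamiliar with the relativistic formulation, the sketch below shows a generic relativistic-average adversarial loss of the kind the abstract refers to, in PyTorch. The exact CRD-CGAN conditioning, weighting, and the additional category-consistent and diversity terms are not reproduced; function and tensor names are assumptions.

import torch
import torch.nn.functional as F

def relativistic_d_loss(real_logits, fake_logits):
    """Discriminator: real samples should look 'more real than the average fake'."""
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits - fake_logits.mean(), torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits - real_logits.mean(), torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def relativistic_g_loss(real_logits, fake_logits):
    """Generator: the symmetric objective, pushing fakes above the real mean."""
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits - real_logits.mean(), torch.ones_like(fake_logits))
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits - fake_logits.mean(), torch.zeros_like(real_logits))
    return loss_fake + loss_real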

Identity-Guided Face Generation with Multi-modal Contour Conditions

Oct 10, 2021
Qingyan Bai, Weihao Xia, Fei Yin, Yujiu Yang

Recent face generation methods have tried to synthesize faces based on a given contour condition, such as a low-resolution image or a sketch. However, the problem of identity ambiguity remains unsolved, which usually occurs when the contour is too vague to provide reliable identity information (e.g., when its resolution is extremely low). In this work, we propose a framework that takes the contour and an extra image specifying the identity as inputs, where the contour can be of various modalities, including a low-resolution image, a sketch, and a semantic label map. This task especially fits situations such as tracking known criminals or making intelligent creations for entertainment. Concretely, we propose a novel dual-encoder architecture, in which an identity encoder extracts the identity-related feature, accompanied by a main encoder that obtains the rough contour information and fuses all the information together. The encoder output is iteratively fed into a pre-trained StyleGAN generator until a satisfactory result is obtained. To the best of our knowledge, this is the first work to achieve identity-guided face generation conditioned on multi-modal contour images. Moreover, our method can produce photo-realistic results at 1024$\times$1024 resolution. Code will be available at https://git.io/Jo4yh.

* 5 pages, 4 figures, submitted to ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 
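A minimal PyTorch sketch of the dual-encoder idea follows: an identity branch and a contour branch produce features that are fused into a latent code for a frozen, pre-trained generator. Module sizes, the fusion rule, and the refinement loop are assumptions, not the authors' exact design.

import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, latent_dim=512):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, latent_dim))
        self.identity_enc = branch()   # extracts identity-related features
        self.contour_enc = branch()    # extracts rough contour information
        self.fuse = nn.Linear(2 * latent_dim, latent_dim)

    def forward(self, identity_img, contour_img):
        ident = self.identity_enc(identity_img)
        cont = self.contour_enc(contour_img)
        return self.fuse(torch.cat([ident, cont], dim=1))

# Usage with a frozen generator G mapping latents to images (placeholder):
# w = DualEncoder()(identity_img, contour_img)
# for _ in range(n_iters):        # iterative refinement loop from the abstract
#     img = G(w)                  # the update rule for w is not specified here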

Unsupervised Facial Geometry Learning for Sketch to Photo Synthesis

Oct 12, 2018
Hadi Kazemi, Fariborz Taherkhani, Nasser M. Nasrabadi

Face sketch-photo synthesis is a critical application in law enforcement and the digital entertainment industry, where the goal is to learn the mapping between a face sketch image and its corresponding photo-realistic image. However, the limited amount of paired sketch-photo training data usually prevents current frameworks from learning a robust mapping between the geometry of sketches and their matching photo-realistic images. Consequently, in this work, we present an approach for learning to synthesize a photo-realistic image from a face sketch in an unsupervised fashion. In contrast to current unsupervised image-to-image translation techniques, our framework leverages a novel perceptual discriminator to learn the geometry of the human face. Learning facial prior information empowers the network to remove geometrical artifacts in the face sketch. We demonstrate that simultaneously optimizing the face photo generator network with the proposed perceptual discriminator in combination with a texture-wise discriminator results in a significant improvement in the quality and recognition rate of the synthesized photos. We evaluate the proposed network by conducting extensive experiments on multiple baseline sketch-photo datasets.

* Published as a conference paper in BIOSIG 2018 
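The sketch below illustrates, at a high level, how a generator could be trained against the two discriminators the abstract names (one judging facial geometry, one judging texture). The loss form and weights are assumptions; the paper's actual perceptual discriminator design is not reproduced.

import torch
import torch.nn.functional as F

def generator_loss(perceptual_d, texture_d, fake_photo, lambda_tex=1.0):
    """Adversarial loss on the synthesized photo from both discriminators."""
    geo_logits = perceptual_d(fake_photo)   # judges facial-geometry plausibility
    tex_logits = texture_d(fake_photo)      # judges local texture realism
    loss_geo = F.binary_cross_entropy_with_logits(
        geo_logits, torch.ones_like(geo_logits))
    loss_tex = F.binary_cross_entropy_with_logits(
        tex_logits, torch.ones_like(tex_logits))
    return loss_geo + lambda_tex * loss_tex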

Exploring to establish an appropriate model for image aesthetic assessment via CNN-based RSRL: An empirical study

Jun 28, 2021
Ying Dai

To establish an appropriate model for photo aesthetic assessment, this paper introduces a D-measure, which reflects the disentanglement degree of the final-layer FC nodes of a CNN. By combining the F-measure with the D-measure to obtain an FD measure, an algorithm is proposed for determining the optimal model among the multiple photo score prediction models generated by CNN-based repetitively self-revised learning (RSRL). Furthermore, the first fixation perspective (FFP) and the assessment interest region (AIR) of the models are defined and calculated. The experimental results show that the FD measure is effective for establishing the appropriate model among multiple score prediction models with different CNN structures. Moreover, the FD-determined optimal models with comparatively high FD always have an FFP and AIR that are close to human aesthetic perception when enjoying photos.
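The abstract does not state how the F- and D-measures are combined, so the fusion below is purely an assumed, harmonic-mean-style illustration of using an FD score to rank RSRL-generated candidate models.

def fd_measure(f_score: float, d_score: float, eps: float = 1e-8) -> float:
    """Assumed fusion of classification quality (F) and disentanglement degree (D)."""
    return 2.0 * f_score * d_score / (f_score + d_score + eps)

# Pick the candidate model with the highest FD value (hypothetical attributes):
# best = max(candidates, key=lambda m: fd_measure(m.f, m.d))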

Learning Efficient Multi-Agent Cooperative Visual Exploration

Oct 12, 2021
Chao Yu, Xinyi Yang, Jiaxuan Gao, Huazhong Yang, Yu Wang, Yi Wu

We consider the task of visual indoor exploration with multiple agents, where the agents need to cooperatively explore the entire indoor region using as few steps as possible. Classical planning-based methods often suffer from particularly expensive computation at each inference step and a limited expressiveness of the cooperation strategy. By contrast, reinforcement learning (RL) has become a trending paradigm for tackling this challenge due to its ability to model arbitrarily complex strategies with minimal inference overhead. We extend the state-of-the-art single-agent RL solution, Active Neural SLAM (ANS), to the multi-agent setting by introducing a novel RL-based global-goal planner, the Spatial Coordination Planner (SCP), which leverages spatial information from each individual agent in an end-to-end manner and effectively guides the agents to navigate towards different spatial goals with high exploration efficiency. SCP consists of a transformer-based relation encoder to capture intra-agent interactions and a spatial action decoder to produce accurate goals. In addition, we implement a few multi-agent enhancements to process local information from each agent for an aligned spatial representation and more precise planning. Our final solution, Multi-Agent Active Neural SLAM (MAANS), combines all these techniques and substantially outperforms 4 different planning-based methods and various RL baselines in the photo-realistic physical testbed Habitat.

* First three authors share equal contribution 
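The PyTorch sketch below shows a transformer-based relation encoder over per-agent feature tokens in the spirit of the SCP module described above; the dimensions, depth, and the toy goal head standing in for the spatial action decoder are assumptions.

import torch
import torch.nn as nn

class RelationEncoder(nn.Module):
    def __init__(self, feat_dim=128, n_heads=4, n_layers=2, map_cells=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Toy "spatial action decoder": scores a discretised grid of global goals.
        self.goal_head = nn.Linear(feat_dim, map_cells)

    def forward(self, agent_feats):
        # agent_feats: (batch, n_agents, feat_dim)
        relations = self.encoder(agent_feats)   # attention across agent tokens
        return self.goal_head(relations)        # per-agent goal logits

x = torch.randn(1, 3, 128)            # features for 3 agents
goal_logits = RelationEncoder()(x)    # (1, 3, 64) grid-cell scores per agent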

Neural Free-Viewpoint Performance Rendering under Complex Human-object Interactions

Aug 03, 2021
Guoxing Sun, Xin Chen, Yizhang Chen, Anqi Pang, Pei Lin, Yuheng Jiang, Lan Xu, Jingya Wang, Jingyi Yu

4D reconstruction of human-object interaction is critical for immersive VR/AR experiences and human activity understanding. Recent advances still fail to recover fine geometry and texture from sparse RGB inputs, especially under challenging human-object interaction scenarios. In this paper, we propose a neural human performance capture and rendering system to generate both high-quality geometry and photo-realistic texture of both humans and objects under challenging interaction scenarios in arbitrary novel views, from only sparse RGB streams. To deal with the complex occlusions caused by human-object interactions, we adopt a layer-wise scene decoupling strategy and perform volumetric reconstruction and neural rendering of the human and the object. Specifically, for geometry reconstruction, we propose an interaction-aware human-object capture scheme that jointly considers human reconstruction and object reconstruction along with their correlations. Occlusion-aware human reconstruction and robust human-aware object tracking are proposed for consistent 4D human-object dynamic reconstruction. For neural texture rendering, we propose a layer-wise human-object rendering scheme, which combines direction-aware neural blending weight learning and spatio-temporal texture completion to provide high-resolution and photo-realistic texture results in occluded scenarios. Extensive experiments demonstrate the effectiveness of our approach in achieving high-quality geometry and texture reconstruction in free viewpoints for challenging human-object interactions.

* Accepted by ACM MM 2021 
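To make the direction-aware blending idea concrete, the sketch below blends colours from several source views with weights that fall off as the angle between source and target viewing directions grows. The learned blending network of the paper is replaced here by a simple cosine/softmax rule purely for illustration; the temperature and shapes are assumptions.

import torch
import torch.nn.functional as F

def blend_views(colors, src_dirs, tgt_dir, temperature=10.0):
    """colors: (V, 3); src_dirs: (V, 3) unit view directions; tgt_dir: (3,) unit vector."""
    cos = (src_dirs * tgt_dir).sum(dim=-1)            # alignment of each source view
    weights = F.softmax(temperature * cos, dim=0)     # favour well-aligned views
    return (weights.unsqueeze(-1) * colors).sum(dim=0)

colors = torch.rand(4, 3)                             # 4 candidate source-view colours
src_dirs = F.normalize(torch.randn(4, 3), dim=-1)
tgt_dir = F.normalize(torch.randn(3), dim=0)
pixel = blend_views(colors, src_dirs, tgt_dir)        # blended target-view colour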

Image Comes Dancing with Collaborative Parsing-Flow Video Synthesis

Oct 28, 2021
Bowen Wu, Zhenyu Xie, Xiaodan Liang, Yubei Xiao, Haoye Dong, Liang Lin

Transferring human motion from a source to a target person holds great potential for computer vision and graphics applications. A crucial step is to manipulate sequential future motion while retaining the appearance characteristics. Previous work has either relied on crafted 3D human models or trained a separate model specifically for each target person, which is not scalable in practice. This work studies a more general setting, in which we aim to learn a single model, named the Collaborative Parsing-Flow Network (CPF-Net), that parsimoniously transfers motion from a source video to any target person given only one image of that person. The paucity of information regarding the target person makes it particularly challenging to faithfully preserve the appearance in varying designated poses. To address this issue, CPF-Net integrates structured human parsing and appearance flow to guide the realistic foreground synthesis, which is merged into the background by a spatio-temporal fusion module. In particular, CPF-Net decouples the problem into stages of human parsing sequence generation, foreground sequence generation, and final video generation. The human parsing generation stage captures both the pose and the body structure of the target. The appearance flow is beneficial for keeping details in the synthesized frames. The integration of human parsing and appearance flow effectively guides the generation of video frames with realistic appearance. Finally, the dedicated fusion network ensures temporal coherence. We further collect a large set of human dancing videos to push forward this research field. Both quantitative and qualitative results show that our method substantially improves over previous approaches and is able to generate appealing and photo-realistic target videos given any input person image. All source code and the dataset will be released at https://github.com/xiezhy6/CPF-Net.

* TIP 2021 
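A pipeline skeleton mirroring the three stages named in the abstract (parsing generation, foreground synthesis, spatio-temporal fusion) is sketched below. The stage modules are placeholders and their interfaces are assumptions, not the released CPF-Net components.

def transfer_motion(source_poses, target_image, parser_net, fg_net, fusion_net):
    """Generate one frame per source pose, reusing the previous frame for coherence."""
    frames = []
    prev_frame = None
    for pose in source_poses:
        parsing = parser_net(pose, target_image)                   # stage 1: parsing map
        foreground = fg_net(parsing, target_image)                 # stage 2: appearance flow
        frame = fusion_net(foreground, target_image, prev_frame)   # stage 3: fuse into background
        frames.append(frame)
        prev_frame = frame                                         # temporal coherence
    return frames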

Old Photo Restoration via Deep Latent Space Translation

Sep 14, 2020
Ziyu Wan, Bo Zhang, Dongdong Chen, Pan Zhang, Dong Chen, Jing Liao, Fang Wen

We propose to restore old photos that suffer from severe degradation through a deep learning approach. Unlike conventional restoration tasks that can be solved through supervised learning, the degradation in real photos is complex, and the domain gap between synthetic images and real old photos makes the network fail to generalize. Therefore, we propose a novel triplet domain translation network that leverages real photos along with massive synthetic image pairs. Specifically, we train two variational autoencoders (VAEs) to respectively transform old photos and clean photos into two latent spaces, and the translation between these two latent spaces is learned with synthetic paired data. This translation generalizes well to real photos because the domain gap is closed in the compact latent space. Besides, to address multiple degradations mixed in one old photo, we design a global branch with a partial nonlocal block targeting the structured defects, such as scratches and dust spots, and a local branch targeting the unstructured defects, such as noise and blurriness. The two branches are fused in the latent space, leading to improved capability to restore old photos from multiple defects. Furthermore, we apply another face refinement network to recover fine details of faces in the old photos, ultimately generating photos with enhanced perceptual quality. With comprehensive experiments, the proposed pipeline demonstrates superior performance over state-of-the-art methods as well as existing commercial tools in terms of visual quality for old photo restoration.

* 15 pages. arXiv admin note: substantial text overlap with arXiv:2004.09484
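The triplet-domain-translation idea can be sketched as follows: two VAEs give latent spaces for degraded and clean photos, and a mapping network trained on synthetic pairs translates between them at inference time. The encoder/decoder objects and the MLP mapping below are placeholders standing in for the paper's actual networks and branches.

import torch
import torch.nn as nn

class LatentTranslator(nn.Module):
    def __init__(self, vae_old, vae_clean, latent_dim=256):
        super().__init__()
        self.vae_old, self.vae_clean = vae_old, vae_clean
        self.mapping = nn.Sequential(              # learned on synthetic paired data
            nn.Linear(latent_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim))

    @torch.no_grad()
    def restore(self, old_photo):
        z_old = self.vae_old.encode(old_photo)     # latent of the degraded domain
        z_clean = self.mapping(z_old)              # translate in the compact latent space
        return self.vae_clean.decode(z_clean)      # decode to a clean-domain photo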