We propose replacing scene text in videos using deep style transfer and learned photometric transformations.Building on recent progress on still image text replacement,we present extensions that alter text while preserving the appearance and motion characteristics of the original video.Compared to the problem of still image text replacement,our method addresses additional challenges introduced by video, namely effects induced by changing lighting, motion blur, diverse variations in camera-object pose over time,and preservation of temporal consistency. We parse the problem into three steps. First, the text in all frames is normalized to a frontal pose using a spatio-temporal trans-former network. Second, the text is replaced in a single reference frame using a state-of-art still-image text replacement method. Finally, the new text is transferred from the reference to remaining frames using a novel learned image transformation network that captures lighting and blur effects in a temporally consistent manner. Results on synthetic and challenging real videos show realistic text trans-fer, competitive quantitative and qualitative performance,and superior inference speed relative to alternatives. We introduce new synthetic and real-world datasets with paired text objects. To the best of our knowledge this is the first attempt at deep video text replacement.
We present an algorithm for re-rendering a person from a single image under arbitrary poses. Existing methods often have difficulties in hallucinating occluded contents photo-realistically while preserving the identity and fine details in the source image. We first learn to inpaint the correspondence field between the body surface texture and the source image with a human body symmetry prior. The inpainted correspondence field allows us to transfer/warp local features extracted from the source to the target view even under large pose changes. Directly mapping the warped local features to an RGB image using a simple CNN decoder often leads to visible artifacts. Thus, we extend the StyleGAN generator so that it takes pose as input (for controlling poses) and introduces a spatially varying modulation for the latent space using the warped local features (for controlling appearances). We show that our method compares favorably against the state-of-the-art algorithms in both quantitative evaluation and visual comparison.
Generative adversarial networks (GANs) have demonstrated impressive image generation quality and semantic editing capability of real images, e.g., changing object classes, modifying attributes, or transferring styles. However, applying these GAN-based editing to a video independently for each frame inevitably results in temporal flickering artifacts. We present a simple yet effective method to facilitate temporally coherent video editing. Our core idea is to minimize the temporal photometric inconsistency by optimizing both the latent code and the pre-trained generator. We evaluate the quality of our editing on different domains and GAN inversion techniques and show favorable results against the baselines.
We propose an approach to domain adaptation for semantic segmentation that is both practical and highly accurate. In contrast to previous work, we abandon the use of computationally involved adversarial objectives, network ensembles and style transfer. Instead, we employ standard data augmentation techniques $-$ photometric noise, flipping and scaling $-$ and ensure consistency of the semantic predictions across these image transformations. We develop this principle in a lightweight self-supervised framework trained on co-evolving pseudo labels without the need for cumbersome extra training rounds. Simple in training from a practitioner's standpoint, our approach is remarkably effective. We achieve significant improvements of the state-of-the-art segmentation accuracy after adaptation, consistent both across different choices of the backbone architecture and adaptation scenarios.
Deep neural networks have largely failed to effectively utilize synthetic data when applied to real images due to the covariate shift problem. In this paper, we show that by applying a straightforward modification to an existing photorealistic style transfer algorithm, we achieve state-of-the-art synthetic-to-real domain adaptation results. We conduct extensive experimental validations on four synthetic-to-real tasks for semantic segmentation and object detection, and show that our approach exceeds the performance of any current state-of-the-art GAN-based image translation approach as measured by segmentation and object detection metrics. Furthermore we offer a distance based analysis of our method which shows a dramatic reduction in Frechet Inception distance between the source and target domains, offering a quantitative metric that demonstrates the effectiveness of our algorithm in bridging the synthetic-to-real gap.
Ever since Machine Learning as a Service (MLaaS) emerges as a viable business that utilizes deep learning models to generate lucrative revenue, Intellectual Property Right (IPR) has become a major concern because these deep learning models can easily be replicated, shared, and re-distributed by any unauthorized third parties. To the best of our knowledge, one of the prominent deep learning models - Generative Adversarial Networks (GANs) which has been widely used to create photorealistic image are totally unprotected despite the existence of pioneering IPR protection methodology for Convolutional Neural Networks (CNNs). This paper therefore presents a complete protection framework in both black-box and white-box settings to enforce IPR protection on GANs. Empirically, we show that the proposed method does not compromise the original GANs performance (i.e. image generation, image super-resolution, style transfer), and at the same time, it is able to withstand both removal and ambiguity attacks against embedded watermarks.
Machine learning is driven by data, yet while their availability is constantly increasing, training data require laborious, time consuming and error-prone labelling or ground truth acquisition, which in some cases is very difficult or even impossible. Recent works have resorted to synthetic data generation, but the inferior performance of models trained on synthetic data when applied to the real world, introduced the challenge of unsupervised domain adaptation. In this work we investigate an unsupervised domain adaptation technique that descends from another perspective, in order to avoid the complexity of adversarial training and cycle consistencies. We exploit the recent advances in photorealistic style transfer and take a fully data driven approach. While this concept is already implicitly formulated within the intricate objectives of domain adaptation GANs, we take an explicit approach and apply it directly as data pre-processing. The resulting technique is scalable, efficient and easy to implement, offers competitive performance to the complex state-of-the-art alternatives and can open up new pathways for domain adaptation.
Humans comprehend a natural scene at a single glance; painters and other visual artists, through their abstract representations, stressed this capacity to the limit. The performance of computer vision solutions matched that of humans in many problems of visual recognition. In this paper we address the problem of recognizing the genre (subject) in digitized paintings using Convolutional Neural Networks (CNN) as part of the more general dealing with abstract and/or artistic representation of scenes. Initially we establish the state of the art performance by training a CNN from scratch. In the next level of evaluation, we identify aspects that hinder the CNNs' recognition, such as artistic abstraction. Further, we test various domain adaptation methods that could enhance the subject recognition capabilities of the CNNs. The evaluation is performed on a database of 80,000 annotated digitized paintings, which is tentatively extended with artistic photographs, either original or stylized, in order to emulate artistic representations. Surprisingly, the most efficient domain adaptation is not the neural style transfer. Finally, the paper provides an experiment-based assessment of the abstraction level that CNNs are able to achieve.
Edge preserving filters preserve the edges and its information while blurring an image. In other words they are used to smooth an image, while reducing the edge blurring effects across the edge like halos, phantom etc. They are nonlinear in nature. Examples are bilateral filter, anisotropic diffusion filter, guided filter, trilateral filter etc. Hence these family of filters are very useful in reducing the noise in an image making it very demanding in computer vision and computational photography applications like denoising, video abstraction, demosaicing, optical-flow estimation, stereo matching, tone mapping, style transfer, relighting etc. This paper provides a concrete introduction to edge preserving filters starting from the heat diffusion equation in olden to recent eras, an overview of its numerous applications, as well as mathematical analysis, various efficient and optimized ways of implementation and their interrelationships, keeping focus on preserving the boundaries, spikes and canyons in presence of noise. Furthermore it provides a realistic notion for efficient implementation with a research scope for hardware realization for further acceleration.