Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10-100x as compared to synchronized stochastic gradient descent.
Human beings often assess the aesthetic quality of an image coupled with the identification of the image's semantic content. This paper addresses the correlation issue between automatic aesthetic quality assessment and semantic recognition. We cast the assessment problem as the main task among a multi-task deep model, and argue that semantic recognition task offers the key to address this problem. Based on convolutional neural networks, we employ a single and simple multi-task framework to efficiently utilize the supervision of aesthetic and semantic labels. A correlation item between these two tasks is further introduced to the framework by incorporating the inter-task relationship learning. This item not only provides some useful insight about the correlation but also improves assessment accuracy of the aesthetic task. Particularly, an effective strategy is developed to keep a balance between the two tasks, which facilitates to optimize the parameters of the framework. Extensive experiments on the challenging AVA dataset and Photo.net dataset validate the importance of semantic recognition in aesthetic quality assessment, and demonstrate that multi-task deep models can discover an effective aesthetic representation to achieve state-of-the-art results.
This paper introduces a visual sentiment concept classification method based on deep convolutional neural networks (CNNs). The visual sentiment concepts are adjective noun pairs (ANPs) automatically discovered from the tags of web photos, and can be utilized as effective statistical cues for detecting emotions depicted in the images. Nearly one million Flickr images tagged with these ANPs are downloaded to train the classifiers of the concepts. We adopt the popular model of deep convolutional neural networks which recently shows great performance improvement on classifying large-scale web-based image dataset such as ImageNet. Our deep CNNs model is trained based on Caffe, a newly developed deep learning framework. To deal with the biased training data which only contains images with strong sentiment and to prevent overfitting, we initialize the model with the model weights trained from ImageNet. Performance evaluation shows the newly trained deep CNNs model SentiBank 2.0 (or called DeepSentiBank) is significantly improved in both annotation accuracy and retrieval performance, compared to its predecessors which mainly use binary SVM classification models.
Novel texture synthesis for existing 3D mesh models is an important step towards photo realistic asset generation for existing simulators. But existing methods inherently work in the 2D image space which is the projection of the 3D space from a given camera perspective. These methods take camera angle, 3D model information, lighting information and generate photorealistic 2D image. To generate a photorealistic image from another perspective or lighting, we need to make a computationally expensive forward pass each time we change the parameters. Also, it is hard to generate such images for a simulator that can satisfy the temporal constraints the sequences of images should be similar but only need to change the viewpoint of lighting as desired. The solution can not be directly integrated with existing tools like Blender and Unreal Engine. Manual solution is expensive and time consuming. We thus present a new system called a graph generative adversarial network (GGAN) that can generate textures which can be directly integrated into a given 3D mesh models with tools like Blender and Unreal Engine and can be simulated from any perspective and lighting condition easily.
In this paper, we propose an iterative self-training framework for sim-to-real 6D object pose estimation to facilitate cost-effective robotic grasping. Given a bin-picking scenario, we establish a photo-realistic simulator to synthesize abundant virtual data, and use this to train an initial pose estimation network. This network then takes the role of a teacher model, which generates pose predictions for unlabeled real data. With these predictions, we further design a comprehensive adaptive selection scheme to distinguish reliable results, and leverage them as pseudo labels to update a student model for pose estimation on real data. To continuously improve the quality of pseudo labels, we iterate the above steps by taking the trained student model as a new teacher and re-label real data using the refined teacher model. We evaluate our method on a public benchmark and our newly-released dataset, achieving an ADD(-S) improvement of 11.49% and 22.62% respectively. Our method is also able to improve robotic bin-picking success by 19.54%, demonstrating the potential of iterative sim-to-real solutions for robotic applications.
Modern AI tools, such as generative adversarial networks, have transformed our ability to create and modify visual data with photorealistic results. However, one of the deleterious side-effects of these advances is the emergence of nefarious uses in manipulating information in visual data, such as through the use of deep fakes. We propose a novel architecture for preserving the provenance of semantic information in images to make them less susceptible to deep fake attacks. Our architecture includes semantic signing and verification steps. We apply this architecture to verifying two types of semantic information: individual identities (faces) and whether the photo was taken indoors or outdoors. Verification accounts for a collection of common image transformation, such as translation, scaling, cropping, and small rotations, and rejects adversarial transformations, such as adversarially perturbed or, in the case of face verification, swapped faces. Experiments demonstrate that in the case of provenance of faces in an image, our approach is robust to black-box adversarial transformations (which are rejected) as well as benign transformations (which are accepted), with few false negatives and false positives. Background verification, on the other hand, is susceptible to black-box adversarial examples, but becomes significantly more robust after adversarial training.
In this paper we propose a new problem scenario in image processing, wide-range image blending, which aims to smoothly merge two different input photos into a panorama by generating novel image content for the intermediate region between them. Although such problem is closely related to the topics of image inpainting, image outpainting, and image blending, none of the approaches from these topics is able to easily address it. We introduce an effective deep-learning model to realize wide-range image blending, where a novel Bidirectional Content Transfer module is proposed to perform the conditional prediction for the feature representation of the intermediate region via recurrent neural networks. In addition to ensuring the spatial and semantic consistency during the blending, we also adopt the contextual attention mechanism as well as the adversarial learning scheme in our proposed method for improving the visual quality of the resultant panorama. We experimentally demonstrate that our proposed method is not only able to produce visually appealing results for wide-range image blending, but also able to provide superior performance with respect to several baselines built upon the state-of-the-art image inpainting and outpainting approaches.
Face authentication on mobile end has been widely applied in various scenarios. Despite the increasing reliability of cutting-edge face authentication/verification systems to variations like blinking eye and subtle facial expression, anti-spoofing against high-resolution rendering replay of paper photos or digital videos retains as an open problem. In this paper, we propose a simple yet effective face anti-spoofing system, termed Aurora Guard (AG). Our system firstly extracts the normal cues via light reflection analysis, and then adopts an end-to-end trainable multi-task Convolutional Neural Network (CNN) to accurately recover subjects' intrinsic depth and material map to assist liveness classification, along with the light CAPTCHA checking mechanism in the regression branch to further improve the system reliability. Experiments on public Replay-Attack and CASIA datasets demonstrate the merits of our proposed method over the state-of-the-arts. We also conduct extensive experiments on a large-scale dataset containing 12,000 live and diverse spoofing samples, which further validates the generalization ability of our method in the wild.
Aerial cinematography is significantly expanding the capabilities of film-makers. Recent progress in autonomous unmanned aerial vehicles (UAVs) has further increased the potential impact of aerial cameras, with systems that can safely track actors in unstructured cluttered environments. Professional productions, however, require the use of multiple cameras simultaneously to record different viewpoints of the same scene, which are edited into the final footage either in real time or in post-production. Such extreme motion coordination is particularly hard for unscripted action scenes, which are a common use case of aerial cameras. In this work we develop a real-time multi-UAV coordination system that is capable of recording dynamic targets while maximizing shot diversity and avoiding collisions and mutual visibility between cameras. We validate our approach in multiple cluttered environments of a photo-realistic simulator, and deploy the system using two UAVs in real-world experiments. We show that our coordination scheme has low computational cost and takes only 1.17 ms on average to plan for a team of 3 UAVs over a 10 s time horizon. Supplementary video: https://youtu.be/m2R3anv2ADE
Image compositing is a task of combining regions from different images to compose a new image. A common use case is background replacement of portrait images. To obtain high quality composites, professionals typically manually perform multiple editing steps such as segmentation, matting and foreground color decontamination, which is very time consuming even with sophisticated photo editing tools. In this paper, we propose a new method which can automatically generate high-quality image compositing without any user input. Our method can be trained end-to-end to optimize exploitation of contextual and color information of both foreground and background images, where the compositing quality is considered in the optimization. Specifically, inspired by Laplacian pyramid blending, a dense-connected multi-stream fusion network is proposed to effectively fuse the information from the foreground and background images at different scales. In addition, we introduce a self-taught strategy to progressively train from easy to complex cases to mitigate the lack of training data. Experiments show that the proposed method can automatically generate high-quality composites and outperforms existing methods both qualitatively and quantitatively.