Iteration of training and evaluating a machine learning model is an important process to improve its performance. However, while teachable interfaces enable blind users to train and test an object recognizer with photos taken in their distinctive environment, accessibility of training iteration and evaluation steps has received little attention. Iteration assumes visual inspection of the training photos, which is inaccessible for blind users. We explore this challenge through MyCam, a mobile app that incorporates automatically estimated descriptors for non-visual access to the photos in the users' training sets. We explore how blind participants (N=12) interact with MyCam and the descriptors through an evaluation study in their homes. We demonstrate that the real-time photo-level descriptors enabled blind users to reduce photos with cropped objects, and that participants could add more variations by iterating through and accessing the quality of their training sets. Also, Participants found the app simple to use indicating that they could effectively train it and that the descriptors were useful. However, subjective responses were not reflected in the performance of their models, partially due to little variation in training and cluttered backgrounds.
Instagram has become a great venue for amateur and professional photographers alike to showcase their work. It has, in other words, democratized photography. Generally, photographers take thousands of photos in a session, from which they pick a few to showcase their work on Instagram. Photographers trying to build a reputation on Instagram have to strike a balance between maximizing their followers' engagement with their photos, while also maintaining their artistic style. We used transfer learning to adapt Xception, which is a model for object recognition trained on the ImageNet dataset, to the task of engagement prediction and utilized Gram matrices generated from VGG19, another object recognition model trained on ImageNet, for the task of style similarity measurement on photos posted on Instagram. Our models can be trained on individual Instagram accounts to create personalized engagement prediction and style similarity models. Once trained on their accounts, users can have new photos sorted based on predicted engagement and style similarity to their previous work, thus enabling them to upload photos that not only have the potential to maximize engagement from their followers but also maintain their style of photography. We trained and validated our models on several Instagram accounts, showing it to be adept at both tasks, also outperforming several baseline models and human annotators.
Image generation task has received increasing attention because of its wide application in security and entertainment. Sketch-based face generation brings more fun and better quality of image generation due to supervised interaction. However, When a sketch poorly aligned with the true face is given as input, existing supervised image-to-image translation methods often cannot generate acceptable photo-realistic face images. To address this problem, in this paper we propose Cali-Sketch, a poorly-drawn-sketch to photo-realistic-image generation method. Cali-Sketch explicitly models stroke calibration and image generation using two constituent networks: a Stroke Calibration Network (SCN), which calibrates strokes of facial features and enriches facial details while preserving the original intent features; and an Image Synthesis Network (ISN), which translates the calibrated and enriched sketches to photo-realistic face images. In this way, we manage to decouple a difficult cross-domain translation problem into two easier steps. Extensive experiments verify that the face photos generated by Cali-Sketch are both photo-realistic and faithful to the input sketches, compared with state-of-the-art methods
Modelling human free-hand sketches has become topical recently, driven by practical applications such as fine-grained sketch based image retrieval (FG-SBIR). Sketches are clearly related to photo edge-maps, but a human free-hand sketch of a photo is not simply a clean rendering of that photo's edge map. Instead there is a fundamental process of abstraction and iconic rendering, where overall geometry is warped and salient details are selectively included. In this paper we study this sketching process and attempt to invert it. We model this inversion by translating iconic free-hand sketches to contours that resemble more geometrically realistic projections of object boundaries, and separately factorise out the salient added details. This factorised re-representation makes it easier to match a free-hand sketch to a photo instance of an object. Specifically, we propose a novel unsupervised image style transfer model based on enforcing a cyclic embedding consistency constraint. A deep FG-SBIR model is then formulated to accommodate complementary discriminative detail from each factorised sketch for better matching with the corresponding photo. Our method is evaluated both qualitatively and quantitatively to demonstrate its superiority over a number of state-of-the-art alternatives for style transfer and FG-SBIR.
We live in a society where the large majority of the population has a camera-equipped smartphone. In addition, hard drives and cloud storage are getting cheaper and cheaper, leading to a tremendous growth in stored personal photos. Unlike photo collections captured by a digital camera, which typically are pre-processed by the user who organizes them into event-related folders, smartphone pictures are automatically stored in the cloud. As a consequence, photo collections captured by a smartphone are highly unstructured and because smartphones are ubiquitous, they present a larger variability compared to pictures captured by a digital camera. To solve the need of organizing large smartphone photo collections automatically, we propose here a new methodology for hierarchical photo organization into topics and topic-related categories. Our approach successfully estimates latent topics in the pictures by applying probabilistic Latent Semantic Analysis, and automatically assigns a name to each topic by relying on a lexical database. Topic-related categories are then estimated by using a set of topic-specific Convolutional Neuronal Networks. To validate our approach, we ensemble and make public a large dataset of more than 8,000 smartphone pictures from 10 persons. Experimental results demonstrate better user satisfaction with respect to state of the art solutions in terms of organization.
We address the problem of end-to-end visual storytelling. Given a photo album, our model first selects the most representative (summary) photos, and then composes a natural language story for the album. For this task, we make use of the Visual Storytelling dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) to: encode the album photos, select representative (summary) photos, and compose the story. Automatic and human evaluations show our model achieves better performance on selection, generation, and retrieval than baselines.
Renovating the memories in old photos is an intriguing research topic in computer vision fields. These legacy images often suffer from severe and commingled degradations such as cracks, noise, and color-fading, while lack of large-scale paired old photo datasets makes this restoration task very challenging. In this work, we present a novel reference-based end-to-end learning framework that can jointly repair and colorize the degraded legacy pictures. Specifically, the proposed framework consists of three modules: a restoration sub-network for degradation restoration, a similarity sub-network for color histogram matching and transfer, and a colorization subnet that learns to predict the chroma elements of the images conditioned on chromatic reference signals. The whole system takes advantage of the color histogram priors in a given reference image, which vastly reduces the dependency on large-scale training data. Apart from the proposed method, we also create, to our knowledge, the first public and real-world old photo dataset with paired ground truth for evaluating old photo restoration models, wherein each old photo is paired with a manually restored pristine image by PhotoShop experts. Our extensive experiments conducted on both synthetic and real-world datasets demonstrate that our method significantly outperforms state-of-the-arts both quantitatively and qualitatively.
Deep learning is expected to offer new opportunities and a new paradigm for the field of architecture. One such opportunity is teaching neural networks to visually understand architectural elements from the built environment. However, the availability of large training datasets is one of the biggest limitations of neural networks. Also, the vast majority of training data for visual recognition tasks is annotated by humans. In order to resolve this bottleneck, we present a concept of a hybrid system using both building information modeling (BIM) and hyperrealistic (photorealistic) rendering to synthesize datasets for training a neural network for building object recognition in photos. For generating our training dataset BIMrAI, we used an existing BIM model and a corresponding photo-realistically rendered model of the same building. We created methods for using renderings to train a deep learning model, trained a generative adversarial network (GAN) model using these methods, and tested the output model on real-world photos. For the specific case study presented in this paper, our results show that a neural network trained with synthetic data; i.e., photorealistic renderings and BIM-based semantic labels, can be used to identify building objects from photos without using photos in the training data. Future work can enhance the presented methods using available BIM models and renderings for more generalized mapping and description of photographed built environments.
This paper presents a novel approach for landmark recognition in images that we've successfully deployed at Mail ru. This method enables us to recognize famous places, buildings, monuments, and other landmarks in user photos. The main challenge lies in the fact that it's very complicated to give a precise definition of what is and what is not a landmark. Some buildings, statues and natural objects are landmarks; others are not. There's also no database with a fairly large number of landmarks to train a recognition model. A key feature of using landmark recognition in a production environment is that the number of photos containing landmarks is extremely small. This is why the model should have a very low false positive rate as well as high recognition accuracy. We propose a metric learning-based approach that successfully deals with existing challenges and efficiently handles a large number of landmarks. Our method uses a deep neural network and requires a single pass inference that makes it fast to use in production. We also describe an algorithm for cleaning landmarks database which is essential for training a metric learning model. We provide an in-depth description of basic components of our method like neural network architecture, the learning strategy, and the features of our metric learning approach. We show the results of proposed solutions in tests that emulate the distribution of photos with and without landmarks from a user collection. We compare our method with others during these tests. The described system has been deployed as a part of a photo recognition solution at Cloud Mail ru, which is the photo sharing and storage service at Mail ru Group.
3D photography is a new medium that allows viewers to more fully experience a captured moment. In this work, we refer to a 3D photo as one that displays parallax induced by moving the viewpoint (as opposed to a stereo pair with a fixed viewpoint). 3D photos are static in time, like traditional photos, but are displayed with interactive parallax on mobile or desktop screens, as well as on Virtual Reality devices, where viewing it also includes stereo. We present an end-to-end system for creating and viewing 3D photos, and the algorithmic and design choices therein. Our 3D photos are captured in a single shot and processed directly on a mobile device. The method starts by estimating depth from the 2D input image using a new monocular depth estimation network that is optimized for mobile devices. It performs competitively to the state-of-the-art, but has lower latency and peak memory consumption and uses an order of magnitude fewer parameters. The resulting depth is lifted to a layered depth image, and new geometry is synthesized in parallax regions. We synthesize color texture and structures in the parallax regions as well, using an inpainting network, also optimized for mobile devices, on the LDI directly. Finally, we convert the result into a mesh-based representation that can be efficiently transmitted and rendered even on low-end devices and over poor network connections. Altogether, the processing takes just a few seconds on a mobile device, and the result can be instantly viewed and shared. We perform extensive quantitative evaluation to validate our system and compare its new components against the current state-of-the-art.