Due to the prevalence of social media websites, one challenge facing computer vision researchers is to devise methods to process and search for persons of interest among the billions of shared photos on these websites. Facebook revealed in a 2013 white paper that its users have uploaded more than 250 billion photos, and are uploading 350 million new photos each day. Due to this humongous amount of data, large-scale face search for mining web images is both important and challenging. Despite significant progress in face recognition, searching a large collection of unconstrained face images has not been adequately addressed. To address this challenge, we propose a face search system which combines a fast search procedure, coupled with a state-of-the-art commercial off the shelf (COTS) matcher, in a cascaded framework. Given a probe face, we first filter the large gallery of photos to find the top-k most similar faces using deep features generated from a convolutional neural network. The k candidates are re-ranked by combining similarities from deep features and the COTS matcher. We evaluate the proposed face search system on a gallery containing 80 million web-downloaded face images. Experimental results demonstrate that the deep features are competitive with state-of-the-art methods on unconstrained face recognition benchmarks (LFW and IJB-A). Further, the proposed face search system offers an excellent trade-off between accuracy and scalability on datasets consisting of millions of images. Additionally, in an experiment involving searching for face images of the Tsarnaev brothers, convicted of the Boston Marathon bombing, the proposed face search system could find the younger brother's (Dzhokhar Tsarnaev) photo at rank 1 in 1 second on a 5M gallery and at rank 8 in 7 seconds on an 80M gallery.
Exemplar-based face sketch synthesis methods usually meet the challenging problem that input photos are captured in different lighting conditions from training photos. The critical step causing the failure is the search of similar patch candidates for an input photo patch. Conventional illumination invariant patch distances are adopted rather than directly relying on pixel intensity difference, but they will fail when local contrast within a patch changes. In this paper, we propose a fast preprocessing method named Bidirectional Luminance Remapping (BLR), which interactively adjust the lighting of training and input photos. Our method can be directly integrated into state-of-the-art exemplar-based methods to improve their robustness with ignorable computational cost.
In this paper we consider the user modeling given the photos and videos from the gallery on a mobile device. We propose the novel user preference prediction engine based on scene understanding, object detection and face recognition. At first, all faces in a gallery are clustered and all private photos and videos with faces from large clusters are processed on the embedded system in offline mode. Other photos are sent to the remote server to be analyzed by very deep models. The visual features of each photo are aggregated into a single user descriptor using the neural attention block. The proposed pipeline is implemented for the Android mobile platform. Experimental results with a subset of Amazon Home and Kitchen, Places2 and Open Images datasets demonstrate the possibility to process images very efficiently without accuracy degradation.
CNNs have massively improved performance in object detection in photographs. However research into object detection in artwork remains limited. We show state-of-the-art performance on a challenging dataset, People-Art, which contains people from photos, cartoons and 41 different artwork movements. We achieve this high performance by fine-tuning a CNN for this task, thus also demonstrating that training CNNs on photos results in overfitting for photos: only the first three or four layers transfer from photos to artwork. Although the CNN's performance is the highest yet, it remains less than 60\% AP, suggesting further work is needed for the cross-depiction problem. The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-46604-0_57
The goal of cross-domain object matching (CDOM) is to find correspondence between two sets of objects in different domains in an unsupervised way. Photo album summarization is a typical application of CDOM, where photos are automatically aligned into a designed frame expressed in the Cartesian coordinate system. CDOM is usually formulated as finding a mapping from objects in one domain (photos) to objects in the other domain (frame) so that the pairwise dependency is maximized. A state-of-the-art CDOM method employs a kernel-based dependency measure, but it has a drawback that the kernel parameter needs to be determined manually. In this paper, we propose alternative CDOM methods that can naturally address the model selection problem. Through experiments on image matching, unpaired voice conversion, and photo album summarization tasks, the effectiveness of the proposed methods is demonstrated.
The overwhelming popularity of social media has resulted in bulk amounts of personal photos being uploaded to the internet every day. Since these photos are taken in unconstrained settings, recognizing the identities of people among the photos remains a challenge. Studies have indicated that utilizing evidence other than face appearance improves the performance of person recognition systems. In this work, we aim to take advantage of additional cues obtained from different body regions in a zooming in fashion for person recognition. Hence, we present Zoom-RNN, a novel method based on recurrent neural networks for combining evidence extracted from the whole body, upper body, and head regions. Our model is evaluated on a challenging dataset, namely People In Photo Albums (PIPA), and we demonstrate that employing our system improves the performance of conventional fusion methods by a noticeable margin.
In psychology, theory-driven researches are usually conducted with extensive laboratory experiments, yet rarely tested or disproved with big data. In this paper, we make use of 418K travel photos with traveler ratings to test the influential "broaden-and-build" theory, that suggests positive emotions broaden one's visual attention. The core hypothesis examined in this study is that positive emotion is associated with a wider attention, hence highly-rated sites would trigger wide-angle photographs. By analyzing travel photos, we find a strong correlation between a preference for wide-angle photos and the high rating of tourist sites on TripAdvisor. We are able to carry out this analysis through the use of deep learning algorithms to classify the photos into wide and narrow angles, and present this study as an exemplar of how big data and deep learning can be used to test laboratory findings in the wild.
Photo collections and its applications today attempt to reflect user interactions in various forms. Moreover, photo collections aim to capture the users' intention with minimum effort through applications capturing user intentions. Human interest regions in an image carry powerful information about the user's behavior and can be used in many photo applications. Research on human visual attention has been conducted in the form of gaze tracking and computational saliency models in the computer vision community, and has shown considerable progress. This paper presents an integration between implicit gaze estimation and computational saliency model to effectively estimate human attention regions in images on the fly. Furthermore, our method estimates human attention via implicit calibration and incremental model updating without any active participation from the user. We also present extensive analysis and possible applications for personal photo collections.
We address the task of converting a floorplan and a set of associated photos of a residence into a textured 3D mesh model, a task which we call Plan2Scene. Our system 1) lifts a floorplan image to a 3D mesh model; 2) synthesizes surface textures based on the input photos; and 3) infers textures for unobserved surfaces using a graph neural network architecture. To train and evaluate our system we create indoor surface texture datasets, and augment a dataset of floorplans and photos from prior work with rectified surface crops and additional annotations. Our approach handles the challenge of producing tileable textures for dominant surfaces such as floors, walls, and ceilings from a sparse set of unaligned photos that only partially cover the residence. Qualitative and quantitative evaluations show that our system produces realistic 3D interior models, outperforming baseline approaches on a suite of texture quality metrics and as measured by a holistic user study.
Recognising persons in everyday photos presents major challenges (occluded faces, different clothing, locations, etc.) for machine vision. We propose a convnet based person recognition system on which we provide an in-depth analysis of informativeness of different body cues, impact of training data, and the common failure modes of the system. In addition, we discuss the limitations of existing benchmarks and propose more challenging ones. Our method is simple and is built on open source and open data, yet it improves the state of the art results on a large dataset of social media photos (PIPA).