"photo": models, code, and papers

More Photos are All You Need: Semi-Supervised Learning for Fine-Grained Sketch Based Image Retrieval

Mar 25, 2021
Ayan Kumar Bhunia, Pinaki Nath Chowdhury, Aneeshan Sain, Yongxin Yang, Tao Xiang, Yi-Zhe Song

A fundamental challenge faced by existing Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) models is data scarcity -- model performance is largely bottlenecked by the lack of sketch-photo pairs. Whilst the number of photos can be easily scaled, each corresponding sketch still needs to be individually produced. In this paper, we aim to mitigate this upper bound on sketch data, and study whether unlabelled photos alone (of which there are many) can be cultivated for performance gains. In particular, we introduce a novel semi-supervised framework for cross-modal retrieval that can additionally leverage large-scale unlabelled photos to account for data scarcity. At the centre of our semi-supervision design is a sequential photo-to-sketch generation model that aims to generate paired sketches for unlabelled photos. Importantly, we further introduce a discriminator-guided mechanism to guard against unfaithful generation, together with a distillation-loss-based regularizer to provide tolerance against noisy training samples. Last but not least, we treat generation and retrieval as two conjugate problems and devise a joint learning procedure so that each module benefits from the other. Extensive experiments show that our semi-supervised model yields a significant performance boost over state-of-the-art supervised alternatives, as well as over existing methods that can exploit unlabelled photos for FG-SBIR.
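The abstract describes three coupled ingredients: pseudo sketches generated for unlabelled photos, a discriminator score that guards against unfaithful generations, and a distillation regularizer against noisy samples. The sketch below is a hypothetical PyTorch rendering of how these signals might combine in a single training step; all module names (generator, discriminator, student, teacher and their embed_* methods) are placeholders, not the authors' code.

```python
# Hypothetical sketch of the semi-supervised FG-SBIR training signal: pseudo sketches
# for unlabelled photos feed a triplet retrieval loss, down-weighted by a discriminator
# "faithfulness" score and regularised by distillation against a supervised teacher.
import torch
import torch.nn.functional as F

def semi_supervised_step(generator, discriminator, student, teacher,
                         unlabelled_photos, margin=0.2, distill_weight=0.1):
    # 1) Generate paired pseudo sketches for the unlabelled photos.
    pseudo_sketches = generator(unlabelled_photos)

    # 2) Discriminator scores gate unfaithful generations (near 0 = unfaithful, near 1 = faithful).
    weights = torch.sigmoid(discriminator(pseudo_sketches)).view(-1).detach()

    # 3) Cross-modal embeddings for retrieval.
    s_emb = F.normalize(student.embed_sketch(pseudo_sketches), dim=-1)   # anchors
    p_emb = F.normalize(student.embed_photo(unlabelled_photos), dim=-1)  # positives
    n_emb = p_emb.roll(shifts=1, dims=0)                                 # in-batch negatives

    # 4) Per-sample triplet loss, weighted by generation faithfulness.
    d_pos = (s_emb - p_emb).pow(2).sum(-1)
    d_neg = (s_emb - n_emb).pow(2).sum(-1)
    retrieval_loss = (weights * F.relu(d_pos - d_neg + margin)).mean()

    # 5) Distillation keeps the student close to a teacher trained on clean labelled pairs.
    with torch.no_grad():
        t_emb = F.normalize(teacher.embed_photo(unlabelled_photos), dim=-1)
    distill_loss = F.mse_loss(p_emb, t_emb)

    return retrieval_loss + distill_weight * distill_loss
```

Weighting the triplet term by the discriminator score is one simple way to realise the "guard against unfaithful generation" idea; the paper's exact formulation may differ.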

* IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2021. Code: https://github.com/AyanKumarBhunia/semisupervised-FGSBIR
  

Inferring Restaurant Styles by Mining Crowd Sourced Photos from User-Review Websites

Nov 19, 2016
Haofu Liao, Yucheng Li, Tianran Hu, Jiebo Luo

When looking for a restaurant online, user-uploaded photos often give people an immediate and tangible impression of a restaurant. Due to their informativeness, such user-contributed photos are leveraged by restaurant review websites to provide their users with an intuitive and effective search experience. In this paper, we present a novel approach to inferring restaurant types or styles (ambiance, dish styles, suitability for different occasions) from user-uploaded photos on user-review websites. To that end, we first collect a novel restaurant photo dataset associating user-contributed photos with restaurant styles from TripAdvisor. We then propose a deep multi-instance multi-label learning (MIML) framework to deal with the unique problem setting of the restaurant style classification task. We employ a two-step bootstrap strategy to train a multi-label convolutional neural network (CNN). The multi-label CNN is then used to compute confidence scores of restaurant styles for all the images associated with a restaurant. The computed confidence scores are further used to train a final binary classifier for each restaurant style tag. After training, the styles of a restaurant can be profiled by analyzing its photos with the trained multi-label CNN and SVM models. Experimental evaluation has demonstrated that our crowdsourcing-based approach can effectively infer restaurant styles when there is a sufficient number of user-uploaded photos for a given restaurant.
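As a rough illustration of the second stage described above, the snippet below aggregates per-photo style confidences into a per-restaurant feature and fits one binary SVM per style tag. The aggregation rule (max-pooling), the shapes, and the random stand-in data are assumptions for illustration, not the paper's exact pipeline.

```python
# Minimal sketch of the two-stage idea: a multi-label CNN scores each photo for every
# style tag, the per-photo scores are aggregated per restaurant, and one binary SVM per
# style is trained on the aggregated features. `cnn_scores` here is a placeholder.
import numpy as np
from sklearn.svm import LinearSVC

def aggregate_restaurant(cnn_scores):
    """cnn_scores: (n_photos, n_styles) confidence matrix for one restaurant."""
    # Max-pool over photos so one strongly indicative photo can trigger a style.
    return cnn_scores.max(axis=0)

def train_style_classifiers(per_restaurant_scores, style_labels):
    """per_restaurant_scores: (n_restaurants, n_styles); style_labels: binary, same shape."""
    classifiers = []
    for s in range(style_labels.shape[1]):
        clf = LinearSVC(C=1.0)
        clf.fit(per_restaurant_scores, style_labels[:, s])
        classifiers.append(clf)
    return classifiers

# Example with random stand-in data (10 restaurants, 5 photos each, 8 style tags).
rng = np.random.default_rng(0)
scores = np.stack([aggregate_restaurant(rng.random((5, 8))) for _ in range(10)])
labels = rng.integers(0, 2, size=(10, 8))
labels[0], labels[1] = 0, 1   # ensure both classes appear for every style tag
svms = train_style_classifiers(scores, labels)
```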

* 10 pages, Accepted by IEEE BigData 2016 
  

Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling

Feb 03, 2020
Yunjae Jung, Dahun Kim, Sanghyun Woo, Kyungsu Kim, Sungjin Kim, In So Kweon

Visual storytelling is the task of creating a short story based on photo streams. Unlike existing visual captioning, storytelling aims to contain not only factual descriptions, but also human-like narration and semantics. However, the VIST dataset consists only of a small, fixed number of photos per story. Therefore, the main challenge of visual storytelling is to fill in the visual gap between photos with a narrative and imaginative story. In this paper, we propose to explicitly learn to imagine a storyline that bridges the visual gap. During training, one or more photos are randomly omitted from the input stack, and we train the network to produce a full, plausible story even with the missing photo(s). Furthermore, we propose a hide-and-tell model for visual storytelling, which is designed to learn non-local relations across photo streams and to refine and improve conventional RNN-based models. In experiments, we show that our hide-and-tell scheme and network design are indeed effective at storytelling, and that our model outperforms previous state-of-the-art methods on automatic metrics. Finally, we qualitatively show the learned ability to interpolate a storyline over visual gaps.
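The "hide" step is the concrete mechanism here: photos are randomly dropped from the input stack during training. Below is a minimal sketch of what such an omission step might look like in PyTorch; the tensor shapes and the drop policy (one or two photos per story) are assumptions, not the paper's exact configuration.

```python
# Randomly drop one or more photos from a story's photo stack so the model must
# "imagine" the missing content during training.
import torch

def hide_photos(photo_stack, max_hidden=2):
    """photo_stack: (num_photos, C, H, W). Returns kept photos and a keep mask."""
    n = photo_stack.size(0)
    num_hidden = int(torch.randint(1, max_hidden + 1, (1,)))
    hidden_idx = torch.randperm(n)[:num_hidden]
    keep_mask = torch.ones(n, dtype=torch.bool)
    keep_mask[hidden_idx] = False
    return photo_stack[keep_mask], keep_mask

# Usage: a 5-photo story, with 1-2 photos hidden each training step.
stack = torch.randn(5, 3, 224, 224)
kept, mask = hide_photos(stack)
print(mask)  # e.g., tensor([ True, False,  True,  True,  True])
```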

* AAAI 2020 paper 
  

Attribute-controlled face photo synthesis from simple line drawing

Feb 09, 2017
Qi Guo, Ce Zhu, Zhiqiang Xia, Zhengtao Wang, Yipeng Liu

Face photo synthesis from a simple line drawing is a one-to-many task, as a simple line drawing merely contains the contour of a human face. Previous exemplar-based methods are over-dependent on their datasets and are hard to generalize to complicated natural scenes. Recently, several works have utilized deep neural networks to improve generalization, but they remain limited in the controllability they offer users. In this paper, we propose a deep generative model to synthesize face photos from simple line drawings, controlled by face attributes such as hair color and complexion. In order to maximize the controllability of face attributes, an attribute-disentangled variational auto-encoder (AD-VAE) is first introduced to learn latent representations disentangled with respect to specified attributes. We then conduct photo synthesis from simple line drawings based on the AD-VAE. Experiments show that our model can well disentangle the variations of the specified attributes from other variations of face photos and synthesize detailed, photorealistic face images with the desired attributes. Treating background and illumination as the style and the human face as the content, we can also synthesize face photos in the style of a given reference photo.
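One common way to realise attribute disentanglement of the kind described above is to feed the attribute vector to the decoder explicitly, so the encoder's latent code is encouraged to carry everything except those attributes. The loss sketch below follows that reading under stated assumptions; the encoder/decoder modules, the attribute encoding, and the L1 reconstruction term are illustrative choices, not the paper's exact AD-VAE formulation.

```python
# Hedged sketch: the encoder produces an attribute-free latent code, and the decoder is
# fed that code plus an explicit attribute vector (e.g., hair colour, complexion), so
# attributes can be set freely at synthesis time. Architectures are placeholders.
import torch
import torch.nn.functional as F

def ad_vae_loss(encoder, decoder, line_drawing, photo, attributes, beta=1.0):
    mu, logvar = encoder(line_drawing)                    # latent statistics
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation trick
    recon = decoder(torch.cat([z, attributes], dim=-1))   # attributes injected explicitly
    recon_loss = F.l1_loss(recon, photo)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl

# At test time, decoding the same z with a different `attributes` vector changes hair
# colour or complexion while keeping the identity implied by the line drawing.
```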

* 5 pages, 5 figures 
  

A Photo-Based Mobile Crowdsourcing Framework for Event Reporting

May 03, 2020
Aymen Hamrouni, Hakim Ghazzai, Mounir Frikha, Yehia Massoud

Photo-based Mobile Crowdsourcing (MCS) is an emerging field of interest and a trending topic in ubiquitous computing. It has recently drawn substantial attention from the smart city and urban computing communities. In fact, the built-in cameras of mobile devices are becoming the most common tools for visual logging in our daily lives. Photo-based MCS frameworks collect photos in a distributed way, in which a large number of contributors upload photos whenever and wherever it is convenient. This inevitably leads to evolving picture streams that may contain misleading and redundant information affecting the task result. To overcome these issues, we develop in this paper a solution for selecting highly relevant data from an evolving picture stream and ensuring correct submissions. The proposed photo-based MCS framework for event reporting incorporates (i) a deep learning model to eliminate false submissions and ensure photo credibility and (ii) an A-Tree-shaped data structure for clustering streamed pictures to reduce information redundancy and provide maximum event coverage. Simulation results indicate that the implemented framework can effectively reduce false submissions and select a subset with high utility coverage and a low redundancy ratio from the streaming data.
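The snippet below is not the paper's A-Tree structure; it is a simplified stand-in that captures the intent of the two stages above: a credibility score filters likely false submissions, and a similarity threshold against already-selected photos trims redundancy while keeping coverage. The feature vectors, thresholds, and random data are assumptions for illustration.

```python
# Simplified stand-in for the selection pipeline: credibility filter + redundancy check.
import numpy as np

def select_stream(features, credibility, cred_threshold=0.5, sim_threshold=0.9):
    """features: (n, d) photo embeddings in arrival order; credibility: (n,) scores in [0, 1]."""
    selected = []
    for i, f in enumerate(features):
        if credibility[i] < cred_threshold:       # stage (i): drop likely false submissions
            continue
        f = f / (np.linalg.norm(f) + 1e-8)
        redundant = any(float(f @ g) > sim_threshold for g in selected)
        if not redundant:                         # stage (ii): keep only novel content
            selected.append(f)
    return selected

# Example with random stand-in data: 100 streamed photos with 64-d features.
rng = np.random.default_rng(1)
subset = select_stream(rng.normal(size=(100, 64)), rng.random(100))
print(len(subset), "photos kept")
```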

* Published in 2019 IEEE 62nd International Midwest Symposium on Circuits and Systems (MWSCAS), Dallas, TX, USA, 2019, pp. 198-202 
  

3D Moments from Near-Duplicate Photos

May 12, 2022
Qianqian Wang, Zhengqi Li, David Salesin, Noah Snavely, Brian Curless, Janne Kontkanen

We introduce 3D Moments, a new computational photography effect. As input we take a pair of near-duplicate photos, i.e., photos of moving subjects from similar viewpoints, common in people's photo collections. As output, we produce a video that smoothly interpolates the scene motion from the first photo to the second, while also producing camera motion with parallax that gives a heightened sense of 3D. To achieve this effect, we represent the scene as a pair of feature-based layered depth images augmented with scene flow. This representation enables motion interpolation along with independent control of the camera viewpoint. Our system produces photorealistic space-time videos with motion parallax and scene dynamics, while plausibly recovering regions occluded in the original views. We conduct extensive experiments demonstrating superior performance over baselines on public datasets and in-the-wild photos. Project page: https://3d-moments.github.io/
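As a toy illustration of the two controls described above (scene motion and camera viewpoint interpolated independently), the snippet below moves 3D points linearly along their scene flow while translating a pinhole camera. This is only the geometric core under strong simplifying assumptions; the actual method uses feature-based layered depth images and learned rendering.

```python
# Toy geometric core: points move linearly along scene flow, camera translates independently.
import numpy as np

def interpolate_points(points0, scene_flow, t):
    """points0: (N, 3) 3D points from photo 1; scene_flow: (N, 3) displacement to photo 2."""
    return points0 + t * scene_flow

def project(points, K, cam_t):
    """Pinhole projection of world points with the camera translated by cam_t (identity rotation assumed)."""
    p = points - cam_t
    uv = (K @ p.T).T
    return uv[:, :2] / uv[:, 2:3]

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pts = np.random.rand(1000, 3) + np.array([0, 0, 2.0])     # points in front of the camera
flow = np.random.randn(1000, 3) * 0.05                     # small scene motion
for t in np.linspace(0, 1, 5):                             # time and camera move together here
    cam = np.array([0.1 * t, 0.0, 0.0])                    # lateral camera motion gives parallax
    uv = project(interpolate_points(pts, flow, t), K, cam)
```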

* CVPR 2022 
  

CheXphoto: 10,000+ Smartphone Photos and Synthetic Photographic Transformations of Chest X-rays for Benchmarking Deep Learning Robustness

Jul 13, 2020
Nick A. Phillips, Pranav Rajpurkar, Mark Sabini, Rayan Krishnan, Sharon Zhou, Anuj Pareek, Nguyet Minh Phu, Chris Wang, Andrew Y. Ng, Matthew P. Lungren

Clinical deployment of deep learning algorithms for chest x-ray interpretation requires a solution that can integrate into the vast spectrum of clinical workflows across the world. An appealing path to scaled deployment is to leverage the existing ubiquity of smartphones: in several parts of the world, clinicians and radiologists capture photos of chest x-rays to share with other experts or clinicians via smartphone messaging services like WhatsApp. However, applying chest x-ray algorithms to photos of chest x-rays requires reliable classification in the presence of smartphone photo artifacts, such as screen glare and poor viewing angles, not typically encountered in the digital x-rays used to train machine learning models. We introduce CheXphoto, a dataset of smartphone photos and synthetic photographic transformations of chest x-rays sampled from the CheXpert dataset. To generate CheXphoto, we (1) automatically and manually captured photos of digital x-rays under different settings, including various lighting conditions and locations, and (2) generated synthetic transformations of digital x-rays targeted to make them look like photos of digital x-rays and x-ray films. We release this dataset as a resource for testing and improving the robustness of deep learning algorithms for automated chest x-ray interpretation on smartphone photos of chest x-rays.
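A rough approximation of the "synthetic photographic transformations" idea can be built from generic torchvision transforms: perspective distortion for off-angle capture, blur, and brightness/contrast jitter standing in for glare. This is not the CheXphoto generation pipeline, just a hedged sketch; the random grayscale image below is a stand-in for a digital x-ray.

```python
# Perturb a clean digital x-ray so it resembles a smartphone photo of a screen or film.
import numpy as np
from PIL import Image
from torchvision import transforms

photo_like = transforms.Compose([
    transforms.RandomPerspective(distortion_scale=0.2, p=1.0),  # off-angle capture
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),   # focus / motion blur
    transforms.ColorJitter(brightness=0.3, contrast=0.3),       # glare-like exposure shifts
])

# Stand-in for a digital chest x-ray; in practice an image from CheXpert would be loaded here.
xray = Image.fromarray(np.random.randint(0, 255, (390, 320), dtype=np.uint8), mode="L")
sample = photo_like(xray)
```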

  

Adversarial Open Domain Adaption Framework (AODA): Sketch-to-Photo Synthesis

Aug 19, 2021
Amey Thakur, Mega Satish

This paper aims to demonstrate the efficiency of the Adversarial Open Domain Adaption (AODA) framework for sketch-to-photo synthesis. Unsupervised open-domain adaption for generating realistic photos from hand-drawn sketches is challenging because no sketches of some classes are available as training data. The absence of learning supervision and the huge domain gap between the freehand drawing and photo domains make the task hard. We present an approach that learns both sketch-to-photo and photo-to-sketch generation to synthesise the missing freehand drawings from pictures. Due to the domain gap between synthetic sketches and genuine ones, a generator trained on synthetic drawings may produce unsatisfactory results when dealing with drawings of the missing classes. To address this problem, we offer a simple but effective open-domain sampling and optimization method that tricks the generator into treating synthetic drawings as genuine. Our approach generalises the learnt sketch-to-photo and photo-to-sketch mappings from in-domain input to open-domain categories. On the Scribble and SketchyCOCO datasets, we compared our technique to the most recent competing methods. For many types of open-domain drawings, our model achieves impressive results in synthesising accurate colour and substance while retaining the structural layout.
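One plausible reading of the open-domain sampling step, heavily hedged: sketches synthesised by the photo-to-sketch branch for classes lacking real sketches are mixed into the batch and treated as if they were genuine inputs to the sketch-to-photo generator, so those classes still receive a training signal. The helper below is hypothetical; module names and the mixing probability are placeholders.

```python
# Mix synthetic sketches of open-domain photos into the "real" sketch batch so that
# classes without real sketches still contribute to sketch-to-photo training.
import torch

def build_sketch_batch(real_sketches, photos_open_domain, photo_to_sketch, mix_prob=0.5):
    with torch.no_grad():
        fake_sketches = photo_to_sketch(photos_open_domain)
    mask = torch.rand(fake_sketches.size(0)) < mix_prob
    # Treat the selected synthetic sketches as genuine training inputs.
    return torch.cat([real_sketches, fake_sketches[mask]], dim=0)
```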

* This was an undergraduate research effort, and in retrospect, it isn't comprehensive enough 
  