Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"photo": models, code, and papers

Visual-based Safe Landing for UAVs in Populated Areas: Real-time Validation in Virtual Environments

Mar 25, 2022
Hector Tovanche-Picon, Javier Gonzalez-Trejo, Angel Flores-Abad, Diego Mercado-Ravell

Safe autonomous landing for Unmanned Aerial Vehicles (UAVs) in populated areas is a crucial aspect for successful urban deployment, particularly in emergency landing situations. Nonetheless, validating autonomous landing in real scenarios is a challenging task involving a high risk of injuring people. In this work, we propose a framework for real-time safe and thorough evaluation of vision-based autonomous landing in populated scenarios, using photo-realistic virtual environments. We propose to use the Unreal graphics engine coupled with the AirSim plugin for drone's simulation, and evaluate autonomous landing strategies based on visual detection of Safe Landing Zones (SLZ) in populated scenarios. Then, we study two different criteria for selecting the "best" SLZ, and evaluate them during autonomous landing of a virtual drone in different scenarios and conditions, under different distributions of people in urban scenes, including moving people. We evaluate different metrics to quantify the performance of the landing strategies, establishing a baseline for comparison with future works in this challenging task, and analyze them through an important number of randomized iterations. The study suggests that the use of the autonomous landing algorithms considerably helps to prevent accidents involving humans, which may allow to unleash the full potential of drones in urban environments near to people.

Access Paper or Ask Questions

Development of Automatic Tree Counting Software from UAV Based Aerial Images With Machine Learning

Jan 07, 2022
Musa Ataş, Ayhan Talay

Unmanned aerial vehicles (UAV) are used successfully in many application areas such as military, security, monitoring, emergency aid, tourism, agriculture, and forestry. This study aims to automatically count trees in designated areas on the Siirt University campus from high-resolution images obtained by UAV. Images obtained at 30 meters height with 20% overlap were stitched offline at the ground station using Adobe Photoshop's photo merge tool. The resulting image was denoised and smoothed by applying the 3x3 median and mean filter, respectively. After generating the orthophoto map of the aerial images captured by the UAV in certain regions, the bounding boxes of different objects on these maps were labeled in the modalities of HSV (Hue Saturation Value), RGB (Red Green Blue) and Gray. Training, validation, and test datasets were generated and then have been evaluated for classification success rates related to tree detection using various machine learning algorithms. In the last step, a ground truth model was established by obtaining the actual tree numbers, and then the prediction performance was calculated by comparing the reference ground truth data with the proposed model. It is considered that significant success has been achieved for tree count with an average accuracy rate of 87% obtained using the MLP classifier in predetermined regions.

Access Paper or Ask Questions

Neural Point Light Fields

Dec 17, 2021
Julian Ost, Issam Laradji, Alejandro Newell, Yuval Bahat, Felix Heide

We introduce Neural Point Light Fields that represent scenes implicitly with a light field living on a sparse point cloud. Combining differentiable volume rendering with learned implicit density representations has made it possible to synthesize photo-realistic images for novel views of small scenes. As neural volumetric rendering methods require dense sampling of the underlying functional scene representation, at hundreds of samples along a ray cast through the volume, they are fundamentally limited to small scenes with the same objects projected to hundreds of training views. Promoting sparse point clouds to neural implicit light fields allows us to represent large scenes effectively with only a single implicit sampling operation per ray. These point light fields are as a function of the ray direction, and local point feature neighborhood, allowing us to interpolate the light field conditioned training images without dense object coverage and parallax. We assess the proposed method for novel view synthesis on large driving scenarios, where we synthesize realistic unseen views that existing implicit approaches fail to represent. We validate that Neural Point Light Fields make it possible to predict videos along unseen trajectories previously only feasible to generate by explicitly modeling the scene.

* 9 pages, replacement changed font of equations 
Access Paper or Ask Questions

Transfer learning using deep neural networks for Ear Presentation Attack Detection: New Database for PAD

Dec 09, 2021
Jalil Nourmohammadi Khiarak

Ear recognition system has been widely studied whereas there are just a few ear presentation attack detection methods for ear recognition systems, consequently, there is no publicly available ear presentation attack detection (PAD) database. In this paper, we propose a PAD method using a pre-trained deep neural network and release a new dataset called Warsaw University of Technology Ear Dataset for Presentation Attack Detection (WUT-Ear V1.0). There is no ear database that is captured using mobile devices. Hence, we have captured more than 8500 genuine ear images from 134 subjects and more than 8500 fake ear images using. We made replay-attack and photo print attacks with 3 different mobile devices. Our approach achieves 99.83% and 0.08% for the half total error rate (HTER) and attack presentation classification error rate (APCER), respectively, on the replay-attack database. The captured data is analyzed and visualized statistically to find out its importance and make it a benchmark for further research. The experiments have been found out a secure PAD method for ear recognition system, publicly available ear image, and ear PAD dataset. The codes and evaluation results are publicly available at

Access Paper or Ask Questions

Layout-to-Image Translation with Double Pooling Generative Adversarial Networks

Aug 29, 2021
Hao Tang, Nicu Sebe

In this paper, we address the task of layout-to-image translation, which aims to translate an input semantic layout to a realistic image. One open challenge widely observed in existing methods is the lack of effective semantic constraints during the image translation process, leading to models that cannot preserve the semantic information and ignore the semantic dependencies within the same object. To address this issue, we propose a novel Double Pooing GAN (DPGAN) for generating photo-realistic and semantically-consistent results from the input layout. We also propose a novel Double Pooling Module (DPM), which consists of the Square-shape Pooling Module (SPM) and the Rectangle-shape Pooling Module (RPM). Specifically, SPM aims to capture short-range semantic dependencies of the input layout with different spatial scales, while RPM aims to capture long-range semantic dependencies from both horizontal and vertical directions. We then effectively fuse both outputs of SPM and RPM to further enlarge the receptive field of our generator. Extensive experiments on five popular datasets show that the proposed DPGAN achieves better results than state-of-the-art methods. Finally, both SPM and SPM are general and can be seamlessly integrated into any GAN-based architectures to strengthen the feature representation. The code is available at

* Accepted to TIP 
Access Paper or Ask Questions

A Survey on Deep Learning Based Point-Of-Interest (POI) Recommendations

Nov 20, 2020
Md. Ashraful Islam, Mir Mahathir Mohammad, Sarkar Snigdha Sarathi Das, Mohammed Eunus Ali

Location-based Social Networks (LBSNs) enable users to socialize with friends and acquaintances by sharing their check-ins, opinions, photos, and reviews. Huge volume of data generated from LBSNs opens up a new avenue of research that gives birth to a new sub-field of recommendation systems, known as Point-of-Interest (POI) recommendation. A POI recommendation technique essentially exploits users' historical check-ins and other multi-modal information such as POI attributes and friendship network, to recommend the next set of POIs suitable for a user. A plethora of earlier works focused on traditional machine learning techniques by using hand-crafted features from the dataset. With the recent surge of deep learning research, we have witnessed a large variety of POI recommendation works utilizing different deep learning paradigms. These techniques largely vary in problem formulations, proposed techniques, used datasets, and features, etc. To the best of our knowledge, this work is the first comprehensive survey of all major deep learning-based POI recommendation works. Our work categorizes and critically analyzes the recent POI recommendation works based on different deep learning paradigms and other relevant features. This review can be considered a cookbook for researchers or practitioners working in the area of POI recommendation.

* 21 pages, 5 figures 
Access Paper or Ask Questions

Batteries, camera, action! Learning a semantic control space for expressive robot cinematography

Nov 19, 2020
Rogerio Bonatti, Arthur Bucker, Sebastian Scherer, Mustafa Mukadam, Jessica Hodgins

Aerial vehicles are revolutionizing the way film-makers can capture shots of actors by composing novel aerial and dynamic viewpoints. However, despite great advancements in autonomous flight technology, generating expressive camera behaviors is still a challenge and requires non-technical users to edit a large number of unintuitive control parameters. In this work we develop a data-driven framework that enables editing of these complex camera positioning parameters in a semantic space (e.g. calm, enjoyable, establishing). First, we generate a database of video clips with a diverse range of shots in a photo-realistic simulator, and use hundreds of participants in a crowd-sourcing framework to obtain scores for a set of semantic descriptors for each clip. Next, we analyze correlations between descriptors and build a semantic control space based on cinematography guidelines and human perception studies. Finally, we learn a generative model that can map a set of desired semantic video descriptors into low-level camera trajectory parameters. We evaluate our system by demonstrating that our model successfully generates shots that are rated by participants as having the expected degrees of expression for each descriptor. We also show that our models generalize to different scenes in both simulation and real-world experiments. Supplementary video:

Access Paper or Ask Questions

Kimera-Multi: a System for Distributed Multi-Robot Metric-Semantic Simultaneous Localization and Mapping

Nov 08, 2020
Yun Chang, Yulun Tian, Jonathan P. How, Luca Carlone

We present the first fully distributed multi-robot system for dense metric-semantic Simultaneous Localization and Mapping (SLAM). Our system, dubbed Kimera-Multi, is implemented by a team of robots equipped with visual-inertial sensors, and builds a 3D mesh model of the environment in real-time, where each face of the mesh is annotated with a semantic label (e.g., building, road, objects). In Kimera-Multi, each robot builds a local trajectory estimate and a local mesh using Kimera. Then, when two robots are within communication range, they initiate a distributed place recognition and robust pose graph optimization protocol with a novel incremental maximum clique outlier rejection; the protocol allows the robots to improve their local trajectory estimates by leveraging inter-robot loop closures. Finally, each robot uses its improved trajectory estimate to correct the local mesh using mesh deformation techniques. We demonstrate Kimera-Multi in photo-realistic simulations and real data. Kimera-Multi (i) is able to build accurate 3D metric-semantic meshes, (ii) is robust to incorrect loop closures while requiring less computation than state-of-the-art distributed SLAM back-ends, and (iii) is efficient, both in terms of computation at each robot as well as communication bandwidth.

* 9 pages 
Access Paper or Ask Questions

MichiGAN: Multi-Input-Conditioned Hair Image Generation for Portrait Editing

Oct 30, 2020
Zhentao Tan, Menglei Chai, Dongdong Chen, Jing Liao, Qi Chu, Lu Yuan, Sergey Tulyakov, Nenghai Yu

Despite the recent success of face image generation with GANs, conditional hair editing remains challenging due to the under-explored complexity of its geometry and appearance. In this paper, we present MichiGAN (Multi-Input-Conditioned Hair Image GAN), a novel conditional image generation method for interactive portrait hair manipulation. To provide user control over every major hair visual factor, we explicitly disentangle hair into four orthogonal attributes, including shape, structure, appearance, and background. For each of them, we design a corresponding condition module to represent, process, and convert user inputs, and modulate the image generation pipeline in ways that respect the natures of different visual attributes. All these condition modules are integrated with the backbone generator to form the final end-to-end network, which allows fully-conditioned hair generation from multiple user inputs. Upon it, we also build an interactive portrait hair editing system that enables straightforward manipulation of hair by projecting intuitive and high-level user inputs such as painted masks, guiding strokes, or reference photos to well-defined condition representations. Through extensive experiments and evaluations, we demonstrate the superiority of our method regarding both result quality and user controllability. The code is available at

* Siggraph 2020, code is available at 
Access Paper or Ask Questions