Stefano Rosa

Self-improving object detection via disagreement reconciliation

Feb 21, 2023
Gianluca Scarpellini, Stefano Rosa, Pietro Morerio, Lorenzo Natale, Alessio Del Bue


Object detectors often experience a drop in performance when new environmental conditions are insufficiently represented in the training data. This paper studies how to automatically fine-tune a pre-existing object detector while exploring and acquiring images in a new environment without relying on human intervention, i.e., in a self-supervised fashion. In our setting, an agent initially explores the environment using a pre-trained off-the-shelf detector to locate objects and associate pseudo-labels. By assuming that pseudo-labels for the same object must be consistent across different views, we devise a novel mechanism for producing refined predictions from the consensus among observations. Our approach improves the off-the-shelf object detector by 2.66% in terms of mAP and outperforms the current state of the art without relying on ground-truth annotations.
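As a rough illustration of the consensus idea, the sketch below keeps a pseudo-label only when most views of the same physical object agree on its class (the `reconcile_views` helper and the agreement threshold are assumptions for illustration, not the paper's exact reconciliation mechanism):

```python
# Minimal sketch of cross-view pseudo-label reconciliation (assumed rule, not the
# authors' exact method): majority vote on the class, keep the label only when
# enough views agree, and average the detector scores of the agreeing views.
from collections import Counter

def reconcile_views(view_detections, min_agreement=0.6):
    """view_detections: list of (class_name, score) tuples, one per view in which
    the object was detected. Returns a refined (label, score) or None."""
    if not view_detections:
        return None
    votes = Counter(cls for cls, _ in view_detections)
    label, count = votes.most_common(1)[0]
    if count / len(view_detections) < min_agreement:   # views disagree: drop it
        return None
    scores = [s for cls, s in view_detections if cls == label]
    return label, sum(scores) / len(scores)

# Example: three views say "chair", one says "sofa" -> keep "chair"
print(reconcile_views([("chair", 0.9), ("chair", 0.8), ("sofa", 0.55), ("chair", 0.7)]))
```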

* This article is a conference paper related to arXiv:2302.03566 and is currently under review 

Look around and learn: self-improving object detection by exploration

Feb 10, 2023
Gianluca Scarpellini, Stefano Rosa, Pietro Morerio, Lorenzo Natale, Alessio Del Bue


Object detectors often experience a drop in performance when new environmental conditions are insufficiently represented in the training data. This paper studies how to automatically fine-tune a pre-existing object detector while exploring and acquiring images in a new environment without relying on human intervention, i.e., in an entirely self-supervised fashion. In our setting, an agent initially learns to explore the environment using a pre-trained off-the-shelf detector to locate objects and associate pseudo-labels. By assuming that pseudo-labels for the same object must be consistent across different views, we learn an exploration policy that mines hard samples, and we devise a novel mechanism for producing refined predictions from the consensus among observations. Our approach outperforms the current state of the art and closes the performance gap against a fully supervised setting without relying on ground-truth annotations. We also compare various exploration policies for the agent to gather more informative observations. Code and dataset will be made available upon paper acceptance.
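One plausible way to score how "hard" an object is for the exploration policy is the disagreement among its per-view predictions. The toy snippet below uses the entropy of the class votes as such a score (an illustrative assumption, not the paper's exact mining reward):

```python
# Sketch of a disagreement score for hard-sample mining (assumed formulation):
# the entropy of the per-view class votes for one object. High entropy means the
# views disagree, i.e. a "hard" sample worth revisiting during exploration.
import math
from collections import Counter

def disagreement_score(view_labels):
    counts = Counter(view_labels)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    return -sum(p * math.log(p) for p in probs)

print(disagreement_score(["chair", "chair", "chair"]))  # ~0.0: consistent, easy
print(disagreement_score(["chair", "sofa", "table"]))   # high: hard sample
```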


Learning Semantic Segmentation of Large-Scale Point Clouds with Random Sampling

Jul 06, 2021
Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan Guo, Zhihua Wang, Niki Trigoni, Andrew Markham


We study the problem of efficient semantic segmentation of large-scale 3D point clouds. Because they rely on expensive sampling techniques or computationally heavy pre-/post-processing steps, most existing approaches can only be trained and operated on small-scale point clouds. In this paper, we introduce RandLA-Net, an efficient and lightweight neural architecture to directly infer per-point semantics for large-scale point clouds. The key to our approach is to use random point sampling instead of more complex point selection approaches. Although remarkably computation- and memory-efficient, random sampling can discard key features by chance. To overcome this, we introduce a novel local feature aggregation module to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details. Comparative experiments show that our RandLA-Net can process 1 million points in a single pass up to 200x faster than existing approaches. Moreover, extensive experiments on five large-scale point cloud datasets, including Semantic3D, SemanticKITTI, Toronto3D, NPM3D and S3DIS, demonstrate the state-of-the-art semantic segmentation performance of our RandLA-Net.
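The random sampling step at the heart of the method is trivially cheap: no distance computations or learned selection, just a uniform draw. A minimal NumPy sketch of such a downsampling stage is given below (point count, feature width and the 4x ratio are placeholders):

```python
# Minimal sketch of random point downsampling: O(1) per point, in contrast to
# farthest-point or inverse-density sampling used by other architectures.
import numpy as np

def random_sample(points, features, ratio=4):
    """Keep 1/ratio of the points, chosen uniformly at random."""
    n = points.shape[0]
    idx = np.random.choice(n, n // ratio, replace=False)
    return points[idx], features[idx]

pts = np.random.rand(1_000_000, 3).astype(np.float32)
feat = np.random.rand(1_000_000, 8).astype(np.float32)
sub_pts, sub_feat = random_sample(pts, feat)
print(sub_pts.shape)  # (250000, 3)
```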

* IEEE TPAMI 2021. arXiv admin note: substantial text overlap with arXiv:1911.11236 

SelectFusion: A Generic Framework to Selectively Learn Multisensory Fusion

Dec 30, 2019
Changhao Chen, Stefano Rosa, Chris Xiaoxuan Lu, Niki Trigoni, Andrew Markham


Autonomous vehicles and mobile robotic systems are typically equipped with multiple sensors to provide redundancy. By integrating the observations from different sensors, these mobile agents are able to perceive the environment and estimate system states, e.g. locations and orientations. Although deep learning approaches for multimodal odometry estimation and localization have gained traction, they rarely focus on the issue of robust sensor fusion - a necessary consideration for dealing with noisy or incomplete sensor observations in the real world. Moreover, current deep odometry models also suffer from a lack of interpretability. To this end, we propose SelectFusion, an end-to-end selective sensor fusion module which can be applied to pairs of sensor modalities such as monocular images and inertial measurements, or depth images and LiDAR point clouds. During prediction, the network is able to assess the reliability of the latent features from different sensor modalities and estimate both the trajectory at scale and the global pose. In particular, we propose two fusion modules based on different attention strategies: deterministic soft fusion and stochastic hard fusion, and we offer a comprehensive study of the new strategies compared to trivial direct fusion. We evaluate all fusion strategies both in ideal conditions and on progressively degraded datasets that present occlusions, noisy and missing data, and time misalignment between sensors, and we investigate the effectiveness of the different fusion strategies in attending to the most reliable features, which in itself provides insights into the operation of the various models.
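A minimal PyTorch sketch of the two attention styles, deterministic soft gating versus stochastic hard masking with the Gumbel-softmax trick, is shown below (layer sizes and the exact architecture are assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftFusion(nn.Module):
    """Deterministic soft fusion (sketch): re-weight each channel of the
    concatenated features with a sigmoid mask conditioned on both modalities."""
    def __init__(self, dim_a, dim_b):
        super().__init__()
        self.gate = nn.Linear(dim_a + dim_b, dim_a + dim_b)

    def forward(self, feat_a, feat_b):
        both = torch.cat([feat_a, feat_b], dim=-1)
        mask = torch.sigmoid(self.gate(both))      # continuous weights in (0, 1)
        return both * mask

class HardFusion(nn.Module):
    """Stochastic hard fusion (sketch): sample a near-binary keep/drop mask per
    channel with the Gumbel-softmax trick so the choice stays differentiable."""
    def __init__(self, dim_a, dim_b):
        super().__init__()
        self.logits = nn.Linear(dim_a + dim_b, 2 * (dim_a + dim_b))

    def forward(self, feat_a, feat_b):
        both = torch.cat([feat_a, feat_b], dim=-1)
        logits = self.logits(both).view(*both.shape, 2)   # per-channel keep/drop
        mask = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 0]
        return both * mask

visual, inertial = torch.randn(8, 256), torch.randn(8, 64)
print(SoftFusion(256, 64)(visual, inertial).shape)   # torch.Size([8, 320])
print(HardFusion(256, 64)(visual, inertial).shape)   # torch.Size([8, 320])
```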

* An extended journal version of arXiv:1903.01534 (CVPR 2019) 

RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds

Nov 25, 2019
Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan Guo, Zhihua Wang, Niki Trigoni, Andrew Markham


We study the problem of efficient semantic segmentation for large-scale 3D point clouds. Because they rely on expensive sampling techniques or computationally heavy pre-/post-processing steps, most existing approaches can only be trained and operated on small-scale point clouds. In this paper, we introduce RandLA-Net, an efficient and lightweight neural architecture to directly infer per-point semantics for large-scale point clouds. The key to our approach is to use random point sampling instead of more complex point selection approaches. Although remarkably computation- and memory-efficient, random sampling can discard key features by chance. To overcome this, we introduce a novel local feature aggregation module to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details. Extensive experiments show that our RandLA-Net can process 1 million points in a single pass up to 200x faster than existing approaches. Moreover, our RandLA-Net clearly surpasses state-of-the-art approaches for semantic segmentation on two large-scale benchmarks, Semantic3D and SemanticKITTI.
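The local feature aggregation module relies on describing each point's neighbourhood with relative positions and distances before learned pooling. The NumPy sketch below illustrates one such relative position encoding (a simplification that omits the MLPs and attentive pooling, and uses brute-force nearest neighbours):

```python
# Sketch of a relative position encoding for point neighbourhoods (simplified):
# for each point, gather its k nearest neighbours and stack the centre point,
# the neighbours, their offsets and their distances into one descriptor.
import numpy as np

def relative_position_encoding(points, k=16):
    diff = points[:, None, :] - points[None, :, :]            # (N, N, 3)
    dist = np.linalg.norm(diff, axis=-1)                      # (N, N)
    knn = np.argsort(dist, axis=-1)[:, :k]                    # (N, k) neighbour ids
    neighbours = points[knn]                                   # (N, k, 3)
    offsets = points[:, None, :] - neighbours                 # (N, k, 3)
    dists = np.linalg.norm(offsets, axis=-1, keepdims=True)   # (N, k, 1)
    centre = np.repeat(points[:, None, :], k, axis=1)         # (N, k, 3)
    return np.concatenate([centre, neighbours, offsets, dists], axis=-1)  # (N, k, 10)

enc = relative_position_encoding(np.random.rand(256, 3).astype(np.float32))
print(enc.shape)  # (256, 16, 10)
```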

* Code and data are available at: https://github.com/QingyongHu/RandLA-Net 

DeepTIO: A Deep Thermal-Inertial Odometry with Visual Hallucination

Sep 16, 2019
Muhamad Risqi U. Saputra, Pedro P. B. de Gusmao, Chris Xiaoxuan Lu, Yasin Almalioglu, Stefano Rosa, Changhao Chen, Johan Wahlström, Wei Wang, Andrew Markham, Niki Trigoni


Visual odometry shows excellent performance in a wide range of environments. However, in visually denied scenarios (e.g. heavy smoke or darkness), pose estimates degrade or even fail. Thermal imaging cameras are commonly used for perception and inspection when the environment has low visibility. However, their use in odometry estimation is hampered by the lack of robust visual features. In part, this is because the sensor measures the ambient temperature profile rather than scene appearance and geometry. To overcome these issues, we propose a Deep Neural Network model for thermal-inertial odometry (DeepTIO) that incorporates a visual hallucination network to provide the thermal network with complementary information. The hallucination network is taught to predict fake visual features from thermal images using the robust Huber loss. We also employ selective fusion to attentively fuse the features from three different modalities, i.e. thermal, hallucination, and inertial features. Extensive experiments are performed on our large-scale hand-held dataset collected in benign and smoke-filled environments, showing the efficacy of the proposed model.
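The hallucination branch can be viewed as a feature regression problem: predict the features a visual encoder would have produced, given only thermal input. The sketch below shows such a hallucinator trained with the Huber loss (feature dimensions, the encoders and the training loop are placeholders, not the paper's architecture):

```python
# Sketch of training a thermal-to-visual feature hallucinator with the Huber loss.
# The thermal and visual encoders are stand-ins; only the regression step is shown.
import torch
import torch.nn as nn

hallucinator = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 256))
huber = nn.HuberLoss(delta=1.0)                    # robust to outlier feature values
optim = torch.optim.Adam(hallucinator.parameters(), lr=1e-4)

thermal_feat = torch.randn(16, 512)   # features from a thermal encoder (placeholder)
visual_feat = torch.randn(16, 256)    # target features from a visual encoder (training only)

optim.zero_grad()
fake_visual = hallucinator(thermal_feat)
loss = huber(fake_visual, visual_feat)
loss.backward()
optim.step()
print(float(loss))
```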

* Submitted to IEEE RAL + ICRA 2020 

Selective Sensor Fusion for Neural Visual-Inertial Odometry

Mar 04, 2019
Changhao Chen, Stefano Rosa, Yishu Miao, Chris Xiaoxuan Lu, Wei Wu, Andrew Markham, Niki Trigoni


Deep learning approaches for Visual-Inertial Odometry (VIO) have proven successful, but they rarely focus on incorporating robust fusion strategies for dealing with imperfect input sensory data. We propose a novel end-to-end selective sensor fusion framework for monocular VIO, which fuses monocular images and inertial measurements to estimate the trajectory while improving robustness to real-life issues such as missing or corrupted data and bad sensor synchronization. In particular, we propose two fusion modalities based on different masking strategies: deterministic soft fusion and stochastic hard fusion, and we compare them with previously proposed direct fusion baselines. During testing, the network is able to selectively process the features of the available sensor modalities and produce a trajectory at scale. We present a thorough investigation of performance on three public VIO datasets covering autonomous driving, Micro Aerial Vehicle (MAV) flight, and hand-held scenarios. The results demonstrate the effectiveness of the fusion strategies, which offer better performance than direct fusion, particularly in the presence of corrupted data. In addition, we study the interpretability of the fusion networks by visualising the masking layers in different scenarios and with varying data corruption, revealing interesting correlations between the fusion networks and imperfect sensory input data.
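To probe robustness, the inputs can be progressively degraded before being fed to the fusion network. The snippet below sketches one possible corruption protocol with dropped readings, additive noise and temporal misalignment (an illustrative assumption, not the paper's exact evaluation setup):

```python
# Sketch of a sensor-degradation protocol for robustness evaluation (assumed
# parameters): random dropout, Gaussian noise, and a small temporal shift that
# mimics bad synchronisation between sensors.
import numpy as np

def degrade(measurements, drop_prob=0.1, noise_std=0.05, max_shift=2):
    out = measurements.copy()
    out += np.random.normal(0.0, noise_std, out.shape)        # additive sensor noise
    mask = np.random.rand(len(out)) < drop_prob
    out[mask] = 0.0                                            # missing readings
    shift = np.random.randint(-max_shift, max_shift + 1)
    return np.roll(out, shift, axis=0)                         # time misalignment

imu = np.random.randn(100, 6)   # 100 IMU readings (3-axis accel + 3-axis gyro)
print(degrade(imu).shape)       # (100, 6)
```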

* Accepted by CVPR 2019 

Learning with Training Wheels: Speeding up Training with a Simple Controller for Deep Reinforcement Learning

Dec 12, 2018
Linhai Xie, Sen Wang, Stefano Rosa, Andrew Markham, Niki Trigoni


Deep Reinforcement Learning (DRL) has been applied successfully to many robotic applications. However, the large number of trials needed for training is a key issue. Most existing techniques developed to improve training efficiency (e.g. imitation learning) target general tasks rather than being tailored to robot applications, which have a specific context to exploit. We propose a novel framework, Assisted Reinforcement Learning, where a classical controller (e.g. a PID controller) is used as an alternative, switchable policy to speed up DRL training for local planning and navigation problems. The core idea is that the simple control law allows the robot to rapidly learn sensible primitives, like driving in a straight line, instead of relying on random exploration. As the actor network becomes more advanced, it can then take over to perform more complex actions, like obstacle avoidance. Eventually, the simple controller can be discarded entirely. We show that not only does this technique train faster, it is also less sensitive to the structure of the DRL network and consistently outperforms a standard Deep Deterministic Policy Gradient network. We demonstrate the results in both simulation and real-world experiments.
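The switching idea can be illustrated in a few lines: both the DRL actor and a simple proportional controller propose an action, and a selection rule decides which one to execute. The rule below, picking whichever action the critic currently values higher, is one plausible realisation rather than the paper's exact criterion:

```python
# Sketch of the switchable-policy idea: early in training the critic favours the
# classical controller; as the actor improves, its Q-values grow and it takes over.
import random

def pid_controller(error, kp=1.0):
    """Trivial proportional controller standing in for the classical policy."""
    return kp * error

def select_action(actor_action, pid_action, q_actor, q_pid):
    """Execute whichever policy the critic currently values higher (assumed rule)."""
    return actor_action if q_actor >= q_pid else pid_action

# Toy usage: the untrained actor proposes a near-random action, the critic still
# prefers the PID controller, so the controller's action is executed.
error = 0.4
actor_action = random.uniform(-1.0, 1.0)
action = select_action(actor_action, pid_controller(error), q_actor=-0.3, q_pid=0.5)
print(action)
```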

* Published in ICRA2018. The code is now available at https://github.com/xie9187/AsDDPG 

3D-PhysNet: Learning the Intuitive Physics of Non-Rigid Object Deformations

Oct 24, 2018
Zhihua Wang, Stefano Rosa, Bo Yang, Sen Wang, Niki Trigoni, Andrew Markham


The ability to interact with and understand the environment is a fundamental prerequisite for a wide range of applications, from robotics to augmented reality. In particular, predicting how deformable objects will react to applied forces in real time is a significant challenge. This is further confounded by the fact that shape information about encountered objects in the real world is often impaired by occlusions, noise and missing regions, e.g. a robot manipulating an object will only be able to observe a partial view of the entire solid. In this work we present a framework, 3D-PhysNet, which is able to predict how a three-dimensional solid will deform under an applied force using intuitive physics modelling. In particular, we propose a new method to encode the physical properties of the material and the applied force, enabling generalisation over materials. The key is to combine deep variational autoencoders with adversarial training, conditioned on the applied force and the material properties. We further propose a cascaded architecture that takes a single 2.5D depth view of the object and predicts its deformation. Training data is provided by a physics simulator. The network is fast enough to be used in real-time applications from partial views. Experimental results show the viability and the generalisation properties of the proposed architecture.
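A toy conditional VAE conveys the conditioning idea: both the encoder and the decoder receive the applied force and material parameters alongside the shape, so the same latent code can decode to different deformations under different conditions. The sketch below omits the adversarial branch and the cascaded depth-view stage, and all dimensions are assumptions:

```python
# Toy conditional VAE for deformation prediction (dimensions and layers are
# assumptions; the adversarial training used in the paper is omitted).
import torch
import torch.nn as nn

class ConditionalDeformVAE(nn.Module):
    def __init__(self, voxel_dim=4096, cond_dim=4, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(voxel_dim + cond_dim, 256), nn.ReLU())
        self.mu, self.logvar = nn.Linear(256, latent_dim), nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim + cond_dim, 256), nn.ReLU(),
                                 nn.Linear(256, voxel_dim), nn.Sigmoid())

    def forward(self, voxels, cond):
        h = self.enc(torch.cat([voxels, cond], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        return self.dec(torch.cat([z, cond], dim=-1)), mu, logvar

model = ConditionalDeformVAE()
partial_view = torch.rand(2, 4096)   # flattened 16^3 voxel grid of the observed shape
cond = torch.rand(2, 4)              # applied force + material parameters (placeholder)
deformed, mu, logvar = model(partial_view, cond)
print(deformed.shape)                # torch.Size([2, 4096])
```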

* in IJCAI 2018 