Accurate smartphone localization (< 1-meter error) for indoor navigation using only RSSI received from a set of BLE beacons remains a challenging problem, due to the inherent noise of RSSI measurements. To overcome the large variance in RSSI measurements, we propose a data-driven approach that uses a deep recurrent network, DeepBLE, to localize the smartphone using RSSI measured from multiple beacons in an environment. In particular, we focus on the ability of our approach to generalize across many smartphone brands (e.g., Apple, Samsung) and models (e.g., iPhone 8, S10). Towards this end, we collect a large-scale dataset of 15 hours of smartphone data, which consists of over 50,000 BLE beacon RSSI measurements collected from 47 beacons in a single building using 15 different popular smartphone models, along with precise 2D location annotations. Our experiments show that there is a very high variability of RSSI measurements across smartphone models (especially across brand), making it very difficult to apply supervised learning using only a subset of smartphone models. To address this challenge, we propose a novel statistic similarity loss (SSL) which enables our model to generalize to unseen phones using a semi-supervised learning approach. For known phones, the iPhone XR achieves the best mean distance error of 0.84 meters. For unknown phones, the Huawei Mate20 Pro shows the greatest improvement, cutting error by over 38\% from 2.62 meters to 1.63 meters error using our semi-supervised adaptation method.
Many smartphone applications use inertial measurement units (IMUs) to sense movement, but the use of these sensors for pedestrian localization can be challenging due to their noise characteristics. Recent data-driven inertial odometry approaches have demonstrated the increasing feasibility of inertial navigation. However, they still rely upon conventional smartphone orientation estimates that they assume to be accurate, while in fact these orientation estimates can be a significant source of error. To address the problem of inaccurate orientation estimates, we present a two-stage, data-driven pipeline using a commodity smartphone that first estimates device orientations and then estimates device position. The orientation module relies on a recurrent neural network and Extended Kalman Filter to obtain orientation estimates that are used to then rotate raw IMU measurements into the appropriate reference frame. The position module then passes those measurements through another recurrent network architecture to perform localization. Our proposed method outperforms state-of-the-art methods in both orientation and position error on a large dataset we constructed that contains 20 hours of pedestrian motion across 3 buildings and 15 subjects.
3D multi-object tracking is an important component in robotic perception systems such as self-driving vehicles. Recent work follows a tracking-by-detection pipeline, which aims to match past tracklets with detections in the current frame. To avoid matching with false positive detections, prior work filters out detections with low confidence scores via a threshold. However, finding a proper threshold is non-trivial, which requires extensive manual search via ablation study. Also, this threshold is sensitive to many factors such as target object category so we need to re-search the threshold if these factors change. To ease this process, we propose to automatically select high-quality detections and remove the efforts needed for manual threshold search. Also, prior work often uses a single threshold per data sequence, which is sub-optimal in particular frames or for certain objects. Instead, we dynamically search threshold per frame or per object to further boost performance. Through experiments on KITTI and nuScenes, our method can filter out $45.7\%$ false positives while maintaining the recall, achieving new S.O.T.A. performance and removing the need for manually threshold tuning.
DETR is a recently proposed Transformer-based method which views object detection as a set prediction problem and achieves state-of-the-art performance but demands extra-long training time to converge. In this paper, we investigate the causes of the optimization difficulty in the training of DETR. Our examinations reveal several factors contributing to the slow convergence of DETR, primarily the issues with the Hungarian loss and the Transformer cross attention mechanism. To overcome these issues we propose two solutions, namely, TSP-FCOS (Transformer-based Set Prediction with FCOS) and TSP-RCNN (Transformer-based Set Prediction with RCNN). Experimental results show that the proposed methods not only converge much faster than the original DETR, but also significantly outperform DETR and other baselines in terms of detection accuracy.
The ability to both recognize and discover terrain characteristics is an important function required for many autonomous ground robots such as social robots, assistive robots, autonomous vehicles, and ground exploration robots. Recognizing and discovering terrain characteristics is challenging because similar terrains may have very different appearances (e.g., carpet comes in many colors), while terrains with very similar appearance may have very different physical properties (e.g. mulch versus dirt). In order to address the inherent ambiguity in vision-based terrain recognition and discovery, we propose a multi-modal self-supervised learning technique that switches between audio features extracted from a mic attached to the underside of a mobile platform and image features extracted by a camera on the platform to cluster terrain types. The terrain cluster labels are then used to train an image-based convolutional neural network to predict changes in terrain types. Through experiments, we demonstrate that the proposed self-supervised terrain type discovery method achieves over 80% accuracy, which greatly outperforms several baselines and suggests strong potential for assistive applications.
3D multi-object tracking (MOT) and trajectory forecasting are two critical components in modern 3D perception systems. We hypothesize that it is beneficial to unify both tasks under one framework to learn a shared feature representation of agent interaction. To evaluate this hypothesis, we propose a unified solution for 3D MOT and trajectory forecasting which also incorporates two additional novel computational units. First, we employ a feature interaction technique by introducing Graph Neural Networks (GNNs) to capture the way in which multiple agents interact with one another. The GNN is able to model complex hierarchical interactions, improve the discriminative feature learning for MOT association, and provide socially-aware context for trajectory forecasting. Second, we use a diversity sampling function to improve the quality and diversity of our forecasted trajectories. The learned sampling function is trained to efficiently extract a variety of outcomes from a generative trajectory distribution and helps avoid the problem of generating many duplicate trajectory samples. We show that our method achieves state-of-the-art performance on the KITTI dataset. Our project website is at http://www.xinshuoweng.com/projects/GNNTrkForecast.
We consider the few-shot classification task with an unbalanced dataset, in which some classes have sufficient training samples while other classes only have limited training samples. Recent works have proposed to solve this task by augmenting the training data of the few-shot classes using generative models with the few-shot training samples as the seeds. However, due to the limited number of the few-shot seeds, the generated samples usually have small diversity, making it difficult to train a discriminative classifier for the few-shot classes. To enrich the diversity of the generated samples, we propose to leverage the intra-class knowledge from the neighbor many-shot classes with the intuition that neighbor classes share similar statistical information. Such intra-class information is obtained with a two-step mechanism. First, a regressor trained only on the many-shot classes is used to evaluate the few-shot class means from only a few samples. Second, superclasses are clustered, and the statistical mean and feature variance of each superclass are used as transferable knowledge inherited by the children few-shot classes. Such knowledge is then used by a generator to augment the sparse training data to help the downstream classification tasks. Extensive experiments show that our method achieves state-of-the-art across different datasets and $n$-shot settings.
3D Multi-object tracking (MOT) is crucial to autonomous systems. Recent work often uses a tracking-by-detection pipeline, where the feature of each object is extracted independently to compute an affinity matrix. Then, the affinity matrix is passed to the Hungarian algorithm for data association. A key process of this pipeline is to learn discriminative features for different objects in order to reduce confusion during data association. To that end, we propose two innovative techniques: (1) instead of obtaining the features for each object independently, we propose a novel feature interaction mechanism by introducing Graph Neural Networks; (2) instead of obtaining the features from either 2D or 3D space as in prior work, we propose a novel joint feature extractor to learn appearance and motion features from 2D and 3D space. Through experiments on the KITTI dataset, our proposed method achieves state-of-the-art 3D MOT performance. Our project website is at http://www.xinshuoweng.com/projects/GNN3DMOT.
3D multi-object tracking (MOT) is essential to applications such as autonomous driving. Recent work focuses on developing accurate systems giving less attention to computational cost and system complexity. In contrast, this work proposes a simple real-time 3D MOT system with strong performance. Our system first obtains 3D detections from a LiDAR point cloud. Then, a straightforward combination of a 3D Kalman filter and the Hungarian algorithm is used for state estimation and data association. Additionally, 3D MOT datasets such as KITTI evaluate MOT methods in 2D space and standardized 3D MOT evaluation tools are missing for a fair comparison of 3D MOT methods. We propose a new 3D MOT evaluation tool along with three new metrics to comprehensively evaluate 3D MOT methods. We show that, our proposed method achieves strong 3D MOT performance on KITTI and runs at a rate of $207.4$ FPS on the KITTI dataset, achieving the fastest speed among modern 3D MOT systems. Our code is publicly available at http://www.xinshuoweng.com/projects/AB3DMOT.
Non-line-of-sight (NLOS) imaging techniques use light that diffusely reflects off of visible surfaces (e.g., walls) to see around corners. One approach involves using pulsed lasers and ultrafast sensors to measure the travel time of multiply scattered light. Unlike existing NLOS techniques that generally require densely raster scanning points across the entirety of a relay wall, we explore a more efficient form of NLOS scanning that reduces both acquisition times and computational requirements. We propose a circular and confocal non-line-of-sight (C2NLOS) scan that involves illuminating and imaging a common point, and scanning this point in a circular path along a wall. We observe that (1) these C2NLOS measurements consist of a superposition of sinusoids, which we refer to as a transient sinogram, (2) there exists computationally efficient reconstruction procedures that transform these sinusoidal measurements into 3D positions of hidden scatterers or NLOS images of hidden objects, and (3) despite operating on an order of magnitude fewer measurements than previous approaches, these C2NLOS scans provide sufficient information about the hidden scene to solve these different NLOS imaging tasks. We show results from both simulated and real C2NLOS scans.