In this work, we propose a novel method for inertial-aided navigation using deep neural networks (DNNs). To date, most DNN-based inertial navigation methods focus on the task of inertial odometry, taking gyroscope and accelerometer readings as input and regressing integrated IMU poses (i.e., position and orientation). While this design has been successfully applied in a number of applications, it carries no theoretical performance guarantee unless patterned motion is involved, which inevitably leads to significantly reduced accuracy and robustness in certain use cases. To solve this problem, we design a framework that computes observable IMU integration terms using DNNs, followed by numerical pose integration and sensor fusion, to achieve the performance gain. Specifically, we perform a detailed analysis of the motion terms in the IMU kinematic equations, propose a dedicated network design, loss functions, and training strategies for IMU data processing, and conduct extensive experiments. The results show that our method is generally applicable and outperforms both traditional and DNN-based methods by wide margins.
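For context, the integration terms in question can be written in the standard preintegration form (a textbook sketch; the specific observable terms the network regresses are the paper's design choice). With \(\mathbf{R}_k\), \(\mathbf{v}_k\), \(\mathbf{p}_k\) denoting orientation, velocity, and position at time \(t_k\) and \(\mathbf{g}\) the gravity vector,
\[
\mathbf{R}_{k+1} = \mathbf{R}_k\,\Delta\mathbf{R}_k,\qquad
\mathbf{v}_{k+1} = \mathbf{v}_k + \mathbf{g}\,\Delta t + \mathbf{R}_k\,\Delta\mathbf{v}_k,\qquad
\mathbf{p}_{k+1} = \mathbf{p}_k + \mathbf{v}_k\,\Delta t + \tfrac{1}{2}\mathbf{g}\,\Delta t^2 + \mathbf{R}_k\,\Delta\mathbf{p}_k,
\]
where the relative terms \(\Delta\mathbf{R}_k\), \(\Delta\mathbf{v}_k\), \(\Delta\mathbf{p}_k\) are accumulated purely from the gyroscope and accelerometer readings between \(t_k\) and \(t_{k+1}\); these are the kind of integration terms a learned module can produce before numerical pose integration and sensor fusion.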
Facial image retrieval plays a significant role in forensic investigations, where an untrained witness tries to identify a suspect from a massive pool of images. However, because human facial appearance is difficult to describe verbally and directly, people naturally tend to describe a face by referring to well-known existing images and comparing specific facial areas against them, and providing a complete comparison in each round is also challenging. Therefore, we propose an end-to-end framework that retrieves facial images using relevance feedback progressively provided by the witness, exploiting history information across multiple rounds in an interactive and iterative approach to retrieving the mental image. Requiring no extra annotations, our model can be applied at the cost of only a little response effort. We experiment on \texttt{CelebA}, evaluate performance by ranking percentile, and achieve 99\% under the best setting. Since this topic remains little explored to the best of our knowledge, we hope our work can serve as a stepping stone for further research.
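The interaction loop can be pictured with a minimal Rocchio-style sketch in embedding space; all names below are hypothetical, and the paper's end-to-end model learns this update rather than hand-coding it:
\begin{verbatim}
import numpy as np

def retrieve_with_feedback(gallery_emb, query_emb, get_feedback,
                           rounds=5, alpha=0.5):
    """Iteratively refine the query embedding from witness feedback.

    gallery_emb: (N, D) embeddings of the candidate facial images.
    query_emb:   (D,) initial estimate of the witness's mental image.
    get_feedback: callback; given the shown indices, returns
                  (liked_indices, disliked_indices).
    """
    q = query_emb.copy()
    for _ in range(rounds):
        scores = gallery_emb @ q
        shown = np.argsort(-scores)[:8]       # show top-8 candidates
        liked, disliked = get_feedback(shown)
        if len(liked) > 0:                    # positive feedback update
            q = q + alpha * gallery_emb[liked].mean(axis=0)
        if len(disliked) > 0:                 # negative feedback update
            q = q - alpha * gallery_emb[disliked].mean(axis=0)
    return np.argsort(-(gallery_emb @ q))     # final gallery ranking
\end{verbatim}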
This paper proposes a knowledge distillation method for foreground object search (FoS). Given a background and a rectangle specifying the foreground location and scale, FoS retrieves compatible foregrounds of a certain category for later image composition. Foregrounds within the same category can be grouped into a small number of patterns, and instances within each pattern are interchangeably compatible with any query input. We refer to these instances as interchangeable foregrounds. We first present a pipeline to build a pattern-level FoS dataset containing labels of interchangeable foregrounds, and we establish a benchmark dataset for further training and testing following this pipeline. As for the proposed method, we first train a foreground encoder to learn representations of interchangeable foregrounds. We then train a query encoder to learn query-foreground compatibility following a knowledge distillation framework, which transfers knowledge from interchangeable foregrounds to supervise the representation learning of compatibility. The query feature representation is projected into the same latent space as interchangeable foregrounds, enabling very efficient and interpretable instance-level search. Furthermore, pattern-level search is feasible for retrieving more controllable, reasonable, and diverse foregrounds. The proposed method outperforms the previous state-of-the-art by 10.42\% in absolute difference and 24.06\% in relative improvement, evaluated by mean average precision (mAP). Extensive experimental results also demonstrate its efficacy from various aspects. The benchmark dataset and code will be released shortly.
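The distillation step can be sketched as follows, assuming a frozen, pre-trained foreground encoder whose embeddings serve as the teacher signal; the function names and the cosine-distance loss are illustrative choices, not the paper's exact objective:
\begin{verbatim}
import torch
import torch.nn.functional as F

def distillation_loss(query_encoder, background, bbox, teacher_emb):
    # Project the (background, rectangle) query into the latent space of
    # the frozen foreground encoder and pull it toward the embedding of a
    # known compatible interchangeable foreground (the teacher signal).
    q = F.normalize(query_encoder(background, bbox), dim=1)  # (B, D)
    t = F.normalize(teacher_emb, dim=1)                      # (B, D), frozen
    return (1.0 - (q * t).sum(dim=1)).mean()                 # cosine distance
\end{verbatim}
Because queries and foregrounds share one latent space after training, instance-level search reduces to a nearest-neighbor lookup over foreground embeddings.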
Researchers have attempted to utilize deep neural networks (DNNs) to learn novel local features from images, inspired by their recent successes on a variety of vision tasks. However, existing DNN-based algorithms have not achieved remarkable progress, which could be partly attributed to insufficient exploitation of the interactive characteristics between the local feature detector and descriptor. To alleviate these difficulties, we emphasize two desired properties, i.e., repeatability and reliability, to simultaneously summarize the inherent and interactive characteristics of the local feature detector and descriptor. Guided by these properties, a self-supervised framework, namely self-evolving keypoint detection and description (SEKD), is proposed to learn an advanced local feature model from unlabeled natural images. Additionally, to guarantee these properties, novel training strategies are dedicatedly designed to minimize the gap between the learned features and the desired properties. We benchmark the proposed method on homography estimation, relative pose estimation, and structure-from-motion tasks. Extensive experimental results demonstrate that the proposed method outperforms popular hand-crafted and DNN-based methods by remarkable margins. Ablation studies also verify the effectiveness of each critical training strategy. We will release our code along with the trained model publicly.
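As a concrete illustration of repeatability (not SEKD's exact objective), a common self-supervised loss asks the keypoint score maps of two views of the same scene to agree after warping one view into the other by a known homography:
\begin{verbatim}
import torch.nn.functional as F

def repeatability_loss(score_a, score_b, grid_b_to_a):
    # score_a, score_b: (B, 1, H, W) keypoint score maps of two views of
    # the same scene; grid_b_to_a: (B, H, W, 2) sampling grid encoding the
    # known homography from view B to view A.
    warped_b = F.grid_sample(score_b, grid_b_to_a, align_corners=False)
    return F.mse_loss(score_a, warped_b)
\end{verbatim}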
In recent years, the commercial mobile robot market has been growing continuously thanks to the maturity of both hardware and software technology. For building commercial robots, skid-steering mechanical designs are increasingly popular due to their manufacturing simplicity and unique hardware properties, but they pose challenges for software and algorithm design, especially for localization (i.e., determining the robot's rotation and position). While general localization algorithms have been studied extensively in the research community, fundamental problems remain to be resolved for localizing skid-steering robots. To tackle this problem, we propose a probabilistic sliding-window estimator dedicated to skid-steering robots, using measurements from a monocular camera, wheel encoders, and optionally an inertial measurement unit (IMU). Specifically, we explicitly model the kinematics of skid-steering robots by both track instantaneous centers of rotation (ICRs) and correction factors, which together compensate for the complexity of track-to-terrain interaction, imperfections in the mechanical design, terrain conditions and smoothness, and so on. These time- and location-varying kinematic parameters are estimated online along with the other localization states in a tightly-coupled manner. More importantly, we conduct an in-depth observability analysis for different sensor and design configurations, which provides theoretical tools for making correct choices when building real commercial robots. We validate the proposed method in both simulation tests and real-world experiments, which demonstrate that it outperforms competing methods by wide margins.
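For reference, one widely used ICR-based kinematic model for skid-steering platforms (after Martínez et al.; this sketch omits the correction factors estimated in the paper, and sign conventions vary with the choice of body frame) maps the left and right track velocities \(v_l, v_r\) to the body-frame velocity \((v_x, v_y)\) and yaw rate \(\omega_z\):
\[
\omega_z = \frac{v_r - v_l}{y_{\mathrm{ICR}_l} - y_{\mathrm{ICR}_r}},\qquad
v_x = \frac{v_r\,y_{\mathrm{ICR}_l} - v_l\,y_{\mathrm{ICR}_r}}{y_{\mathrm{ICR}_l} - y_{\mathrm{ICR}_r}},\qquad
v_y = -\,\omega_z\,x_{\mathrm{ICR}},
\]
where \((x_{\mathrm{ICR}}, y_{\mathrm{ICR}_l})\) and \((x_{\mathrm{ICR}}, y_{\mathrm{ICR}_r})\) are the left and right track ICR positions in a body frame with \(x\) forward and \(y\) left. For an ideal differential drive, \(y_{\mathrm{ICR}_l} = -y_{\mathrm{ICR}_r}\) equals half the track width and \(x_{\mathrm{ICR}} = 0\); track slip pushes these parameters away from their nominal values, which is why they are estimated online.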
Conventional uncertainty quantification methods usually lack the capability to deal with high-dimensional problems due to the curse of dimensionality. This paper presents a semi-supervised learning framework for dimension reduction and reliability analysis. An autoencoder is first adopted to map the high-dimensional space into a low-dimensional latent space that contains a distinguishable failure surface. A deep feedforward neural network (DFN) is then utilized to learn the mapping relationship and reconstruct the latent space, while Gaussian process (GP) modeling is used to build a surrogate model of the transformed limit state function. During the training of the DFN, the discrepancy between the actual and reconstructed latent spaces is minimized through semi-supervised learning to ensure accuracy; both labeled and unlabeled samples are utilized in defining the DFN's loss function. An evolutionary algorithm is adopted to train the DFN, and the Monte Carlo simulation method is then used for uncertainty quantification and reliability analysis based on the proposed framework. The effectiveness of the framework is demonstrated through a mathematical example.
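The final stage of such a framework can be sketched as below, assuming a trained latent-space mapping \texttt{encode} and standard-normal inputs; this is a simplified illustration of a GP surrogate plus Monte Carlo estimation, not the paper's full semi-supervised pipeline:
\begin{verbatim}
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def failure_probability(encode, limit_state, x_train, n_mc=100_000, seed=0):
    # Fit a GP surrogate of the limit state function in the learned latent
    # space, then estimate P(failure) = P(g <= 0) by plain Monte Carlo.
    g_train = np.array([limit_state(x) for x in x_train])  # labeled responses
    gp = GaussianProcessRegressor(normalize_y=True).fit(
        encode(x_train), g_train)
    rng = np.random.default_rng(seed)
    x_mc = rng.standard_normal((n_mc, x_train.shape[1]))   # input-space samples
    return float((gp.predict(encode(x_mc)) <= 0.0).mean())
\end{verbatim}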
The inherent heavy computation of deep neural networks prevents their widespread application. A widely used method for accelerating model inference is quantization: replacing the input operands of a network with fixed-point values, so that the majority of the computation cost lies in integer multiply-accumulate operations. A high-bit accumulator, however, leads to partially wasted computation, while a low-bit one typically suffers from numerical overflow. To address this problem, we propose an overflow-aware quantization method that designs a trainable adaptive fixed-point representation, optimizing the number of bits for each input tensor while prohibiting numerical overflow during the computation. With the proposed method, we are able to fully utilize the available computing power to minimize the quantization loss and obtain optimized inference performance. To verify the effectiveness of our method, we conduct image classification, object detection, and semantic segmentation experiments on the ImageNet, Pascal VOC, and COCO datasets, respectively. Experimental results demonstrate that the proposed method achieves performance comparable with state-of-the-art quantization methods while accelerating the inference process by about 2 times.
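To make the overflow constraint concrete, a hypothetical static per-tensor rule is shown below (the proposed method instead learns the adaptive representation during training): when a dot product accumulates \texttt{k\_accum} products into a 16-bit accumulator, the input bit-width must be small enough that the worst-case sum still fits:
\begin{verbatim}
import numpy as np

def max_input_bits(k_accum, acc_bits=16):
    # Largest symmetric bit-width b such that accumulating k_accum products
    # of two b-bit signed integers cannot overflow an acc_bits accumulator.
    # Worst-case product magnitude is (2**(b-1))**2.
    for b in range(8, 1, -1):
        if k_accum * (2 ** (b - 1)) ** 2 <= 2 ** (acc_bits - 1) - 1:
            return b
    return 2

def quantize(x, bits):
    # Symmetric linear quantization of a float tensor to bits-bit integers.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-12
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int64), scale
\end{verbatim}
For example, with \texttt{k\_accum = 64} and a 16-bit accumulator, this rule permits at most 5-bit inputs, illustrating the trade-off between per-tensor precision and overflow safety.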
Monocular 3D object detection is an essential yet challenging component in autonomous driving, especially for occluded samples that are only partially visible. Most detectors consider each 3D object as an independent training target, which inevitably results in a lack of useful information for occluded samples. To this end, we propose a novel method that improves monocular 3D object detection by considering the relationship between paired samples, allowing us to encode spatial constraints for partially-occluded objects from their adjacent neighbors. Specifically, the proposed detector computes uncertainty-aware predictions of object locations and of the 3D distances between adjacent object pairs, which are subsequently optimized jointly by nonlinear least squares. Finally, the one-stage uncertainty-aware prediction structure and the post-optimization module are carefully integrated to ensure run-time efficiency. Experiments demonstrate that our method yields the best performance on the KITTI 3D detection benchmark, outperforming state-of-the-art competitors by wide margins, especially on the hard samples.
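A minimal sketch of such an uncertainty-weighted post-optimization for one object pair is given below; the residual design and weighting are illustrative assumptions, not the paper's exact cost function:
\begin{verbatim}
import numpy as np
from scipy.optimize import least_squares

def refine_pair(p1, p2, d12, s1, s2, sd):
    # Jointly refine two predicted object centers p1, p2 (3-vectors) using
    # a predicted pairwise distance d12, weighting each residual by its
    # predicted standard deviation (s1, s2, sd).
    def residuals(x):
        q1, q2 = x[:3], x[3:]
        return np.concatenate([
            (q1 - p1) / s1,                          # stay near unary estimate 1
            (q2 - p2) / s2,                          # stay near unary estimate 2
            [(np.linalg.norm(q1 - q2) - d12) / sd],  # satisfy pairwise distance
        ])
    sol = least_squares(residuals, x0=np.concatenate([p1, p2]))
    return sol.x[:3], sol.x[3:]
\end{verbatim}
Intuitively, a confidently localized visible object (small \texttt{s1}) anchors its occluded neighbor through the pairwise distance term.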
Online hand gesture recognition (HGR) techniques are essential in augmented reality (AR) applications for enabling natural human-to-computer interaction and communication. In recent years, the consumer market for low-cost AR devices has been growing rapidly, while the technology maturity in this domain is still limited. These devices typically have low prices, limited memory, and resource-constrained computational units, which makes online HGR a challenging problem. To tackle this problem, we propose a lightweight and computationally efficient HGR framework, namely LE-HGR, to enable real-time gesture recognition on embedded devices with low computing power. We also show that the proposed method achieves high accuracy and robustness, reaching high-end performance in a variety of complicated interaction environments. To achieve our goal, we first propose a cascaded multi-task convolutional neural network (CNN) to simultaneously predict hand detection probabilities and regress hand keypoint locations online. We show that, with the proposed cascaded architecture design, false-positive estimates can be largely eliminated. Additionally, an associated mapping approach is introduced to track the hand trace via the predicted locations, which addresses the interference of multi-handedness. Subsequently, we propose a trace sequence neural network (TraceSeqNN) to recognize hand gestures by exploiting the motion features of the tracked trace. Finally, we provide a variety of experimental results showing that the proposed framework achieves state-of-the-art accuracy with significantly reduced computational cost, the key properties for enabling real-time applications on low-cost commercial devices such as mobile devices and AR/VR headsets.
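The overall structure of such a pipeline can be sketched as follows; every function name here is a hypothetical placeholder for the corresponding stage (the cascaded CNN, the trace association, and the TraceSeqNN-style classifier), not the released implementation:
\begin{verbatim}
import numpy as np

def recognize_gestures(frames, detect_hands, match_traces,
                       classify_trace, seq_len=32):
    # Per-frame hand detection and keypoint regression, trace association
    # across frames, and sequence classification once a trace is long
    # enough to carry motion features.
    traces = {}                                   # trace_id -> keypoint lists
    results = []
    for frame in frames:
        detections = detect_hands(frame)          # cascaded CNN stage
        assignments = match_traces(traces, detections)
        for trace_id, keypoints in assignments.items():
            traces.setdefault(trace_id, []).append(keypoints)
            if len(traces[trace_id]) >= seq_len:
                # motion features: frame-to-frame keypoint displacements
                motion = np.diff(
                    np.stack(traces[trace_id][-seq_len:]), axis=0)
                results.append(classify_trace(motion))
    return results
\end{verbatim}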