In convolutional neural networks (CNNs), dropout does not work well because dropped information is not entirely obscured in convolutional layers, where features are spatially correlated. Besides randomly discarding regions or channels, many approaches try to overcome this defect by dropping influential units. In this paper, we propose a non-random dropout method named FocusedDropout, which aims to make the network focus more on the target. In FocusedDropout, we use a simple but effective way to search for the target-related features, retain these features, and discard the others, which is contrary to existing methods. We find that this method improves network performance by making the network more target-focused. In addition, increasing the weight decay while using FocusedDropout avoids overfitting and increases accuracy. Experimental results show that, even at a slight cost (only 10\% of batches employ FocusedDropout), the method yields a clear performance boost over the baselines on multiple classification datasets, including CIFAR10, CIFAR100, and Tiny ImageNet, and generalizes well across different CNN models.
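The retain-and-discard idea can be illustrated with a minimal sketch. The abstract does not say how target-related features are found, so the rule used here is an assumption: the channel with the largest mean activation serves as a reference, and only spatial positions where that channel exceeds a fraction of its maximum are kept. The function name `focused_dropout` and the `ratio` parameter are hypothetical.

```python
def focused_dropout(features, ratio=0.5):
    """Keep positions judged target-related and zero out the rest.

    features: list of C channels, each an H x W list of floats.
    ratio:    hypothetical threshold, as a fraction of the reference maximum.
    """
    # Assumption: the channel with the largest mean activation is the reference.
    means = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    reference = features[means.index(max(means))]

    # Positions whose reference activation exceeds the threshold are retained.
    threshold = ratio * max(max(row) for row in reference)
    mask = [[1.0 if v > threshold else 0.0 for v in row] for row in reference]

    # The same spatial mask is applied to every channel.
    return [
        [[v * m for v, m in zip(row, mask_row)] for row, mask_row in zip(ch, mask)]
        for ch in features
    ]


features = [
    [[0.0, 1.0], [2.0, 4.0]],  # strongest channel, used as the reference
    [[1.0, 1.0], [1.0, 1.0]],
]
focused = focused_dropout(features, ratio=0.5)
```

In training, such a mask would be applied only to a fraction of batches (the abstract reports 10\%), with ordinary forward passes otherwise.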
In this paper, we propose Selective Output Smoothing Regularization, a novel regularization method for training Convolutional Neural Networks (CNNs). Inspired by the diverse effects that different samples have on training, Selective Output Smoothing Regularization improves performance by encouraging the model to produce equal logits on incorrect classes for samples that it classifies correctly and over-confidently. This plug-and-play regularization method can be conveniently incorporated into almost any CNN-based project without extra hassle. Extensive experiments show that Selective Output Smoothing Regularization consistently achieves significant improvements on image classification benchmarks such as CIFAR-100, Tiny ImageNet, ImageNet, and CUB-200-2011. In particular, our method obtains 77.30$\%$ accuracy on ImageNet with ResNet-50, a gain of 1.1$\%$ over the baseline (76.2$\%$). We also empirically demonstrate that our method yields further improvements when combined with other widely used regularization techniques. On Pascal detection, using the SOSR-trained ImageNet classifier as the pretrained model leads to better detection performance. Moreover, we demonstrate the effectiveness of our method on the small-sample-size and imbalanced-dataset problems.
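The selection rule can be sketched for a single sample as follows. The abstract does not give the exact loss, so the penalty here (variance of the logits over the incorrect classes) and the `confidence_threshold` value are assumptions chosen only to illustrate "equal logits on incorrect classes" for correct, over-confident predictions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def selective_smoothing_penalty(logits, label, confidence_threshold=0.99):
    """Penalty that is non-zero only for correct, over-confident samples.

    Assumption: the penalty is the variance of the logits on the incorrect
    classes, which is zero exactly when those logits are all equal.
    """
    probs = softmax(logits)
    predicted = probs.index(max(probs))
    if predicted != label or probs[label] <= confidence_threshold:
        return 0.0  # misclassified or not over-confident: sample is left alone

    others = [z for i, z in enumerate(logits) if i != label]
    mean = sum(others) / len(others)
    return sum((z - mean) ** 2 for z in others) / len(others)


penalty = selective_smoothing_penalty([10.0, 0.0, 2.0], label=0)
```

In a training loop this term would be added, with some weight, to the usual cross-entropy loss.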
Modern LiDAR-SLAM (L-SLAM) systems have shown excellent results in large-scale, real-world scenarios. However, they commonly have a high latency due to the expensive data association and nonlinear optimization. This paper demonstrates that actively selecting a subset of features significantly improves both the accuracy and efficiency of an L-SLAM system. We formulate the feature selection as a combinatorial optimization problem under a cardinality constraint to preserve the information matrix's spectral attributes. The stochastic-greedy algorithm is applied to approximate the optimal results in real-time. To avoid ill-conditioned estimation, we also propose a general strategy to evaluate the environment's degeneracy and modify the feature number online. The proposed feature selector is integrated into a multi-LiDAR SLAM system. We validate this enhanced system with extensive experiments covering various scenarios on two sensor setups and computation platforms. We show that our approach exhibits low localization error and speedup compared to the state-of-the-art L-SLAM systems. To benefit the community, we have released the source code: https://ram-lab.com/file/site/m-loam.
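The selection step can be sketched with the generic stochastic-greedy scheme for cardinality-constrained set-function maximization. Several details are assumptions rather than the released system's implementation: each feature is represented by a single-row Jacobian, the spectral objective is taken to be the log-determinant of the accumulated information matrix, and a small `prior` keeps that matrix invertible. All function names are hypothetical.

```python
import math
import random


def log_det(matrix):
    """Log-determinant of a symmetric positive-definite matrix (Gaussian elimination)."""
    a = [row[:] for row in matrix]
    n = len(a)
    total = 0.0
    for i in range(n):
        pivot = a[i][i]
        total += math.log(pivot)
        for r in range(i + 1, n):
            factor = a[r][i] / pivot
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return total


def add_feature(info, jac):
    """Return info + jac^T jac for a single-row Jacobian `jac`."""
    d = len(jac)
    return [[info[r][c] + jac[r] * jac[c] for c in range(d)] for r in range(d)]


def stochastic_greedy_select(jacobians, k, epsilon=0.1, prior=1e-3, rng=None):
    """Pick up to k features that approximately maximize log det of the information matrix."""
    rng = rng or random.Random(0)
    n, d = len(jacobians), len(jacobians[0])
    info = [[prior if r == c else 0.0 for c in range(d)] for r in range(d)]
    remaining = list(range(n))
    selected = []
    # Each round inspects only a random subset of about (n / k) * ln(1 / epsilon) candidates.
    sample_size = max(1, math.ceil(n / k * math.log(1.0 / epsilon)))
    while len(selected) < k and remaining:
        candidates = rng.sample(remaining, min(sample_size, len(remaining)))
        # The current matrix is fixed within a round, so maximizing the new
        # log-determinant is the same as maximizing the marginal gain.
        best = max(candidates, key=lambda i: log_det(add_feature(info, jacobians[i])))
        info = add_feature(info, jacobians[best])
        selected.append(best)
        remaining.remove(best)
    return selected, info


jacobians = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
chosen, information = stochastic_greedy_select(jacobians, k=2, epsilon=1e-6)
```

With a very small `epsilon` every remaining feature is inspected in each round, so the procedure reduces to plain greedy selection; larger values trade optimality for speed.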
Path planning is a fundamental capability for autonomous navigation of robotic wheelchairs. With the impressive development of deep-learning technologies, imitation learning-based path planning approaches have achieved effective results in recent years. However, these approaches have two disadvantages: 1) they may need extensive time and labor to record expert demonstrations as training data; and 2) they can only receive high-level commands, such as turning left/right. Such commands may be insufficient for the navigation of mobile robots (e.g., robotic wheelchairs), which usually require exact goal poses. We contribute a solution to this problem by proposing S2P2, a self-supervised goal-directed path planning approach. Specifically, we develop a pipeline that automatically generates planned path labels given RGB-D images and goal poses as input. Then, we present a best-fit regression plane loss to train our data-driven path planning model on the generated labels. S2P2 does not need pre-built maps, but it can be integrated into existing map-based navigation systems through our framework. Experimental results show that S2P2 outperforms traditional path planning algorithms and increases the robustness of existing map-based navigation systems. Our project page is available at https://sites.google.com/view/s2p2.
Object detection in 3D with stereo cameras is an important problem in computer vision, and is particularly crucial for low-cost autonomous mobile robots without LiDARs. Currently, most of the best-performing frameworks for stereo 3D object detection are based on dense depth reconstruction from disparity estimation, which makes them extremely computationally expensive. To enable real-world deployment of vision-based detection with binocular images, we take a step back to gain insights from 2D image-based detection frameworks and enhance them with stereo features. We incorporate knowledge and the inference structure from real-time one-stage 2D/3D object detectors and introduce a light-weight stereo matching module. Our proposed framework, YOLOStereo3D, is trained on a single GPU and runs at more than ten fps. It demonstrates performance comparable to state-of-the-art stereo 3D detection frameworks without using LiDAR data. The code will be published at https://github.com/Owen-Liuyuxuan/visualDet3D.
Supervised learning with deep convolutional neural networks (DCNNs) has seen huge adoption in stereo matching. However, the acquisition of large-scale datasets with well-labeled ground truth is cumbersome and labor-intensive, making supervised learning-based approaches often hard to implement in practice. To overcome this drawback, we propose a robust and effective self-supervised stereo matching approach, consisting of a pyramid voting module (PVM) and a novel DCNN architecture, referred to as OptStereo. Specifically, our OptStereo first builds multi-scale cost volumes, and then adopts a recurrent unit to iteratively update disparity estimations at high resolution; while our PVM can generate reliable semi-dense disparity images, which can be employed to supervise OptStereo training. Furthermore, we publish the HKUST-Drive dataset, a large-scale synthetic stereo dataset, collected under different illumination and weather conditions for research purposes. Extensive experimental results demonstrate the effectiveness and efficiency of our self-supervised stereo matching approach on the KITTI Stereo benchmarks and our HKUST-Drive dataset. PVStereo, our best-performing implementation, greatly outperforms all other state-of-the-art self-supervised stereo matching approaches. Our project page is available at sites.google.com/view/pvstereo.
In this paper, we propose a new data augmentation strategy named Thumbnail, which aims to strengthen the network's capture of global features. We generate an image by shrinking the original image to a certain size, producing what we call the thumbnail, and pasting it at a random position in the original image. The generated image not only retains most of the original image information but also carries the global information in the thumbnail. Furthermore, we find that the thumbnail idea integrates well with Mixed Sample Data Augmentation: we paste the thumbnail into another image and mix the ground-truth labels with a certain weight, which performs well on various computer vision tasks. Extensive experiments show that Thumbnail works better than state-of-the-art augmentation strategies across classification, fine-grained image classification, and object detection. On ImageNet classification, the ResNet50 architecture with our method achieves 79.21% accuracy, an improvement of more than 2.89% over the baseline.
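The shrink-and-paste step can be sketched on a plain 2D grid of pixel values. This is a minimal illustration under stated assumptions: the thumbnail is square, it is produced by nearest-neighbor subsampling (the abstract does not specify the resizing method), and the function names are hypothetical.

```python
import random


def make_thumbnail(image, size):
    """Shrink an H x W image to size x size by nearest-neighbor subsampling."""
    h, w = len(image), len(image[0])
    return [[image[r * h // size][c * w // size] for c in range(size)] for r in range(size)]


def paste_thumbnail(image, size, rng=None):
    """Paste a thumbnail of `image` at a random position inside a copy of `image`."""
    rng = rng or random.Random(0)
    h, w = len(image), len(image[0])
    thumb = make_thumbnail(image, size)
    # Choose the top-left corner so the thumbnail fits entirely inside the image.
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    out = [row[:] for row in image]  # leave the input image untouched
    for r in range(size):
        out[top + r][left:left + size] = thumb[r]
    return out, (top, left)


image = [[r * 4 + c for c in range(4)] for r in range(4)]
augmented, (top, left) = paste_thumbnail(image, size=2)
```

For the mixed-sample variant described in the abstract, the thumbnail of one image would be pasted into a different image and the two labels combined with a mixing weight; how that weight is chosen is not stated in the abstract.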
Joint detection of drivable areas and road anomalies is very important for mobile robots. Recently, many semantic segmentation approaches based on convolutional neural networks (CNNs) have been proposed for pixel-wise drivable area and road anomaly detection. In addition, some benchmark datasets, such as KITTI and Cityscapes, have been widely used. However, the existing benchmarks are mostly designed for self-driving cars, and a benchmark for ground mobile robots, such as robotic wheelchairs, is still lacking. Therefore, in this paper, we first build a drivable area and road anomaly detection benchmark for ground mobile robots, evaluating the existing state-of-the-art single-modal and data-fusion semantic segmentation CNNs using six modalities of visual features. Furthermore, we propose a novel module, referred to as the dynamic fusion module (DFM), which can be easily deployed in existing data-fusion networks to fuse different types of visual features effectively and efficiently. The experimental results show that the transformed disparity image is the most informative visual feature and that the proposed DFM-RTFNet outperforms the state-of-the-art approaches. Additionally, our DFM-RTFNet achieves competitive performance on the KITTI road benchmark. Our benchmark is publicly available at https://sites.google.com/view/gmrb.
Estimating the 3D position and orientation of objects in the environment with a single RGB camera is a critical and challenging task for low-cost urban autonomous driving and mobile robots. Most existing algorithms are based on the geometric constraints in 2D-3D correspondence, an approach that stems from generic 6D object pose estimation. We first identify how the ground plane provides additional clues for depth reasoning in 3D detection in driving scenes. Based on this observation, we then improve the processing of 3D anchors and introduce a novel neural network module to fully utilize such application-specific priors in the framework of deep learning. Finally, we introduce an efficient neural network embedded with the proposed module for 3D object detection. We further verify the power of the proposed module with a neural network designed for monocular depth prediction. The two proposed networks achieve state-of-the-art performance on the KITTI 3D object detection and depth prediction benchmarks, respectively. The code will be published at https://www.github.com/Owen-Liuyuxuan/visualDet3D
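A short worked example shows the kind of depth clue a ground plane can provide. It uses the standard pinhole relation for a point on a flat ground plane, assuming the optical axis is parallel to the ground and the camera height is known. This is a textbook derivation offered for illustration, not the module proposed in the abstract, and the numeric values below are made up.

```python
def ground_depth(v, focal_y, c_y, camera_height):
    """Depth (in meters) of a ground-plane point imaged at pixel row `v`.

    With the image y-axis pointing down, a ground point at depth Z lies
    `camera_height` below the camera, so v = focal_y * camera_height / Z + c_y.
    Solving for Z gives the expression returned here.
    """
    if v <= c_y:
        raise ValueError("pixel row is at or above the horizon; no ground intersection")
    return focal_y * camera_height / (v - c_y)


# Hypothetical intrinsics and mounting height, for illustration only.
depth = ground_depth(v=250.0, focal_y=700.0, c_y=180.0, camera_height=1.65)
```

Rows closer to the horizon (`v` approaching `c_y`) map to rapidly increasing depth, which is why the bottom edge of an object standing on the road is informative about its distance.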