3D object detection has become an emerging task in autonomous driving scenarios. Previous works process 3D point clouds using either projection-based or voxel-based models. However, both approaches contain some drawbacks. The voxel-based methods lack semantic information, while the projection-based methods suffer from numerous spatial information loss when projected to different views. In this paper, we propose the Stereo RGB and Deeper LIDAR (SRDL) framework which can utilize semantic and spatial information simultaneously such that the performance of network for 3D object detection can be improved naturally. Specifically, the network generates candidate boxes from stereo pairs and combines different region-wise features using a deep fusion scheme. The stereo strategy offers more information for prediction compared with prior works. Then, several local and global feature extractors are stacked in the segmentation module to capture richer deep semantic geometric features from point clouds. After aligning the interior points with fused features, the proposed network refines the prediction in a more accurate manner and encodes the whole box in a novel compact method. The decent experimental results on the challenging KITTI detection benchmark demonstrate the effectiveness of utilizing both stereo images and point clouds for 3D object detection.
Accurate 3D object detection from point clouds has become a crucial component in autonomous driving. However, the volumetric representations and the projection methods in previous works fail to establish the relationships between the local point sets. In this paper, we propose Sparse Voxel-Graph Attention Network (SVGA-Net), a novel end-to-end trainable network which mainly contains voxel-graph module and sparse-to-dense regression module to achieve comparable 3D detection tasks from raw LIDAR data. Specifically, SVGA-Net constructs the local complete graph within each divided 3D spherical voxel and global KNN graph through all voxels. The local and global graphs serve as the attention mechanism to enhance the extracted features. In addition, the novel sparse-to-dense regression module enhances the 3D box estimation accuracy through feature maps aggregation at different levels. Experiments on KITTI detection benchmark demonstrate the efficiency of extending the graph representation to 3D object detection and the proposed SVGA-Net can achieve decent detection accuracy.
The channel redundancy in feature maps of convolutional neural networks (CNNs) results in the large consumption of memories and computational resources. In this work, we design a novel Slim Convolution (SlimConv) module to boost the performance of CNNs by reducing channel redundancies. Our SlimConv consists of three main steps: Reconstruct, Transform and Fuse, through which the features are splitted and reorganized in a more efficient way, such that the learned weights can be compressed effectively. In particular, the core of our model is a weight flipping operation which can largely improve the feature diversities, contributing to the performance crucially. Our SlimConv is a plug-and-play architectural unit which can be used to replace convolutional layers in CNNs directly. We validate the effectiveness of SlimConv by conducting comprehensive experiments on ImageNet, MS COCO2014, Pascal VOC2012 segmentation, and Pascal VOC2007 detection datasets. The experiments show that SlimConv-equipped models can achieve better performances consistently, less consumption of memory and computation resources than non-equipped conterparts. For example, the ResNet-101 fitted with SlimConv achieves 77.84% top-1 classification accuracy with 4.87 GFLOPs and 27.96M parameters on ImageNet, which shows almost 0.5% better performance with about 3 GFLOPs and 38% parameters reduced.
We present a new deep point cloud rendering pipeline through multi-plane projections. The input to the network is the raw point cloud of a scene and the output are image or image sequences from a novel view or along a novel camera trajectory. Unlike previous approaches that directly project features from 3D points onto 2D image domain, we propose to project these features into a layered volume of camera frustum. In this way, the visibility of 3D points can be automatically learnt by the network, such that ghosting effects due to false visibility check as well as occlusions caused by noise interferences are both avoided successfully. Next, the 3D feature volume is fed into a 3D CNN to produce multiple layers of images w.r.t. the space division in the depth directions. The layered images are then blended based on learned weights to produce the final rendering results. Experiments show that our network produces more stable renderings compared to previous methods, especially near the object boundaries. Moreover, our pipeline is robust to noisy and relatively sparse point cloud for a variety of challenging scenes.
The unprecedented performance achieved by deep convolutional neural networks for image classification is linked primarily to their ability of capturing rich structural features at various layers within networks. Here we design a series of experiments, inspired by children's learning of the arithmetic addition of two integers, to showcase that such deep networks can go beyond the structural features to learn deeper knowledge. In our experiments, a set of images is constructed, each image containing an arithmetic addition $n+m$ in its central area, and several classification networks are then trained over a subset of images, using the sum as the label. Tests on the excluded images show that, as the image set gets larger, the networks have well learnt the law of arithmetic additions so as to build up their autonomous reasoning ability strongly. For instance, networks trained over a small percentage of images can classify a big majority of the remaining images correctly, and many arithmetic additions involving some integers that have never been seen during the training can also be solved correctly by the trained networks.
Different rain models and novel network structures have been proposed to remove rain streaks from single rainy images. In this work, we bring attention to the intrinsic priors and multi-scale features of the rainy images, and develop several intrinsic loss functions to train a CNN deraining network. We first study the sparse priors of rainy images, which have been verified to preserve unbroken edges in image decomposition. However, its mathematical formulation usually leads to an intractable solution, we propose quasi-sparsity priors to decrease complexity, so that our network can be trained under the supervision of sparse properties of rainy images. Quasi-sparsity supervises network training in different gradient domain which is still ill-posed to decompose a rainy image into rain layer and background layer. We develop another $L_1$ loss based on the intrinsic low-value property of rain layer to restore image contents together with the commonly-used $L_1$ similarity loss. Multi-scale features are further explored via a multi-scale auxiliary decoding structure to show which kinds of features contribute the most to the deraining task, and the corresponding multi-scale auxiliary loss improves the deraining performance further. In our network, more efficient group convolution and feature sharing are utilized to obtain an one order of magnitude improvement in network running speed. The proposed deraining method performs favorably against state-of-the-art deraining approaches.
In recent years, deep learning based methods have made significant progress in rain-removing. However, the existing methods usually do not have good generalization ability, which leads to the fact that almost all of existing methods have a satisfied performance on removing a specific type of rain streaks, but may have a relatively poor performance on other types of rain streaks. In this paper, aiming at removing multiple types of rain streaks from single images, we propose a novel deraining framework (GRASPP-GAN), which has better generalization capacity. Specifically, a modified ResNet-18 which extracts the deep features of rainy images and a revised ASPP structure which adapts to the various shapes and sizes of rain streaks are composed together to form the backbone of our deraining network. Taking the more prominent characteristics of rain streaks in the gradient domain into consideration, a gradient loss is introduced to help to supervise our deraining training process, for which, a Sobel convolution layer is built to extract the gradient information flexibly. To further boost the performance, an adversarial learning scheme is employed for the first time to train the proposed network. Extensive experiments on both real-world and synthetic datasets demonstrate that our method outperforms the state-of-the-art deraining methods quantitatively and qualitatively. In addition, without any modifications, our proposed framework also achieves good visual performance on dehazing.
Rain removal in images/videos is still an important task in computer vision field and attracting attentions of more and more people. Traditional methods always utilize some incomplete priors or filters (e.g. guided filter) to remove rain effect. Deep learning gives more probabilities to better solve this task. However, they remove rain either by evaluating background from rainy image directly or learning a rain residual first then subtracting the residual to obtain a clear background. No other models are used in deep learning based de-raining methods to remove rain and obtain other information about rainy scenes. In this paper, we utilize an extensively-used image degradation model which is derived from atmospheric scattering principles to model the formation of rainy images and try to learn the transmission, atmospheric light in rainy scenes and remove rain further. To reach this goal, we propose a robust evaluation method of global atmospheric light in a rainy scene. Instead of using the estimated atmospheric light directly to learn a network to calculate transmission, we utilize it as ground truth and design a simple but novel triangle-shaped network structure to learn atmospheric light for every rainy image, then fine-tune the network to obtain a better estimation of atmospheric light during the training of transmission network. Furthermore, more efficient ShuffleNet Units are utilized in transmission network to learn transmission map and the de-raining image is then obtained by the image degradation model. By subjective and objective comparisons, our method outperforms the selected state-of-the-art works.
Removing rain effects from an image automatically has many applications such as autonomous driving, drone piloting and photo editing and still draws the attention of many people. Traditional methods use heuristics to handcraft various priors to remove or separate the rain effects from an image. Recently end-to-end deep learning based deraining methods have been proposed to offer more flexibility and effectiveness. However, they tend not to obtain good visual effect when encountered images with heavy rain. Heavy rain brings not only rain streaks but also haze-like effect which is caused by the accumulation of tiny raindrops. Different from previous deraining methods, in this paper we model rainy images with a new rain model to remove not only rain streaks but also haze-like effect. Guided by our model, we design a two-branch network to learn its parameters. Then, an SPP structure is jointly trained to refine the results of our model to control the degree of removing the haze-like effect flexibly. Besides, a subnetwork which can localize the rainy pixels is proposed to guide the training of our network. Extensive experiments on several datasets show that our method outperforms the state-of-the-art in both objectives assessments and visual quality.
In this paper, we propose a quality enhancement network for Versatile Video Coding (VVC) compressed videos by jointly exploiting spatial details and temporal structure (SDTS). The network consists of a temporal structure prediction subnet and a spatial detail enhancement subnet. The former subnet is used to estimate and compensate the temporal motion across frames, and the spatial detail subnet is used to reduce the compression artifacts and enhance the reconstruction quality of the VVC compressed video. Experimental results demonstrate the effectiveness of our SDTS-based approach. It offers over 7.82$\%$ BD-rate saving on the common test video sequences and achieves the state-of-the-art performance.