Network quantization aims to lower the bitwidth of weights and activations and hence reduce the model size and accelerate the inference of deep networks. Even though existing quantization methods have achieved promising performance on image classification, applying aggressively low bitwidth quantization on object detection while preserving the performance is still a challenge. In this paper, we demonstrate that the poor performance of the quantized network on object detection comes from the inaccurate batch statistics of batch normalization. To solve this, we propose an accurate quantized object detection (AQD) method. Specifically, we propose to employ multi-level batch normalization (multi-level BN) to estimate the batch statistics of each detection head separately. We further propose a learned interval quantization method to improve how the quantizer itself is configured. To evaluate the performance of the proposed methods, we apply AQD to two one-stage detectors (i.e., RetinaNet and FCOS). Experimental results on COCO show that our methods achieve near-lossless performance compared with the full-precision model by using extremely low bitwidth regimes such as 3-bit. In particular, we even outperform the full-precision counterpart by a large margin with a 4-bit detector, which is of great practical value.
As the application area of convolutional neural networks (CNN) is growing in embedded devices, it becomes popular to use a hardware CNN accelerator, called neural processing unit (NPU), to achieve higher performance per watt than CPUs or GPUs. Recently, automated neural architecture search (NAS) emerges as the default technique to find a state-of-the-art CNN architecture with higher accuracy than manually-designed architectures for image classification. In this paper, we present a fast NPU-aware NAS methodology, called S3NAS, to find a CNN architecture with higher accuracy than the existing ones under a given latency constraint. It consists of three steps: supernet design, Single-Path NAS for fast architecture exploration, and scaling. To widen the search space of the supernet structure that consists of stages, we allow stages to have a different number of blocks and blocks to have parallel layers of different kernel sizes. For a fast neural architecture search, we apply a modified Single-Path NAS technique to the proposed supernet structure. In this step, we assume a shorter latency constraint than the required to reduce the search space and the search time. The last step is to scale up the network maximally within the latency constraint. For accurate latency estimation, an analytical latency estimator is devised, based on a cycle-level NPU simulator that runs an entire CNN considering the memory access overhead accurately. With the proposed methodology, we are able to find a network in 3 hours using TPUv3, which shows 82.72% top-1 accuracy on ImageNet with 11.66 ms latency. Code are released at https://github.com/cap-lab/S3NAS
We propose a self-supervised learning framework for visual odometry (VO) that incorporates correlation of consecutive frames and takes advantage of adversarial learning. Previous methods tackle self-supervised VO as a local structure from motion (SfM) problem that recovers depth from single image and relative poses from image pairs by minimizing photometric loss between warped and captured images. As single-view depth estimation is an ill-posed problem, and photometric loss is incapable of discriminating distortion artifacts of warped images, the estimated depth is vague and pose is inaccurate. In contrast to previous methods, our framework learns a compact representation of frame-to-frame correlation, which is updated by incorporating sequential information. The updated representation is used for depth estimation. Besides, we tackle VO as a self-supervised image generation task and take advantage of Generative Adversarial Networks (GAN). The generator learns to estimate depth and pose to generate a warped target image. The discriminator evaluates the quality of generated image with high-level structural perception that overcomes the problem of pixel-wise loss in previous methods. Experiments on KITTI and Cityscapes datasets show that our method obtains more accurate depth with details preserved and predicted pose outperforms state-of-the-art self-supervised methods significantly.
An estimated 15 million babies are born too early every year. Approximately 1 million children die each year due to complications of preterm birth (PTB). Many survivors face a lifetime of disability, including learning disabilities and visual and hearing problems. Although manual analysis of ultrasound images (US) is still prevalent, it is prone to errors due to its subjective component and complex variations in the shape and position of organs across patients. In this work, we introduce a conceptually simple convolutional neural network (CNN) trained for segmenting prenatal ultrasound images and classifying task for the purpose of preterm birth detection. Our method efficiently segments different types of cervixes in transvaginal ultrasound images while simultaneously predicting a preterm birth based on extracted image features without human oversight. We employed three popular network models: U-Net, Fully Convolutional Network, and Deeplabv3 for the cervix segmentation task. Based on the conducted results and model efficiency, we decided to extend U-Net by adding a parallel branch for classification task. The proposed model is trained and evaluated on a dataset consisting of 354 2D transvaginal ultrasound images and achieved a segmentation accuracy with a mean Jaccard coefficient index of 0.923 $\pm$ 0.081 and a classification sensitivity of 0.677 $\pm$ 0.042 with a 3.49\% false positive rate. Our method obtained better results in the prediction of preterm birth based on transvaginal ultrasound images compared to state-of-the-art methods.
Learning visuomotor control policies in robotic systems is a fundamental problem when aiming for long-term behavioral autonomy. Recent supervised-learning-based vision and motion perception systems, however, are often separately built with limited capabilities, while being restricted to few behavioral skills such as passive visual odometry (VO) or mobile robot visual localization. Here we propose an approach to unify those successful robot perception systems for active target-driven navigation tasks via reinforcement learning (RL). Our method temporally incorporates compact motion and visual perception data - directly obtained using self-supervision from a single image sequence - to enable complex goal-oriented navigation skills. We demonstrate our approach on two real-world driving dataset, KITTI and Oxford RobotCar, using the new interactive CityLearn framework. The results show that our method can accurately generalize to extreme environmental changes such as day to night cycles with up to an 80% success rate, compared to 30% for a vision-only navigation systems.
This paper presents a new supervised classification algorithm for remotely sensed hyperspectral image (HSI) which integrates spectral and spatial information in a unified Bayesian framework. First, we formulate the HSI classification problem from a Bayesian perspective. Then, we adopt a convolutional neural network (CNN) to learn the posterior class distributions using a patch-wise training strategy to better use the spatial information. Next, spatial information is further considered by placing a spatial smoothness prior on the labels. Finally, we iteratively update the CNN parameters using stochastic gradient decent (SGD) and update the class labels of all pixel vectors using an alpha-expansion min-cut-based algorithm. Compared with other state-of-the-art methods, the proposed classification method achieves better performance on one synthetic dataset and two benchmark HSI datasets in a number of experimental settings.
In this paper, we present an efficient and distinctive local descriptor, namely block intensity and gradient difference (BIGD). In an image patch, we randomly sample multi-scale block pairs and utilize the intensity and gradient differences of pairwise blocks to construct the local BIGD descriptor. The random sampling strategy and the multi-scale framework help BIGD descriptors capture the distinctive patterns of patches at different orientations and spatial granularity levels. We use vectors of locally aggregated descriptors (VLAD) or improved Fisher vector (IFV) to encode local BIGD descriptors into a full image descriptor, which is then fed into a linear support vector machine (SVM) classifier for texture classification. We compare the proposed descriptor with typical and state-of-the-art ones by evaluating their classification performance on five public texture data sets including Brodatz, CUReT, KTH-TIPS, and KTH-TIPS-2a and -2b. Experimental results show that the proposed BIGD descriptor with stronger discriminative power yields 0.12% ~ 6.43% higher classification accuracy than the state-of-the-art texture descriptor, dense microblock difference (DMD).
For recognizing speakers in video streams, significant research studies have been made to obtain a rich machine learning model by extracting high-level speaker's features such as facial expression, emotion, and gender. However, generating such a model is not feasible by using only single modality feature extractors that exploit either audio signals or image frames, extracted from video streams. In this paper, we address this problem from a different perspective and propose an unprecedented multimodality data fusion framework called DeepMSRF, Deep Multimodal Speaker Recognition with Feature selection. We execute DeepMSRF by feeding features of the two modalities, namely speakers' audios and face images. DeepMSRF uses a two-stream VGGNET to train on both modalities to reach a comprehensive model capable of accurately recognizing the speaker's identity. We apply DeepMSRF on a subset of VoxCeleb2 dataset with its metadata merged with VGGFace2 dataset. The goal of DeepMSRF is to identify the gender of the speaker first, and further to recognize his or her name for any given video stream. The experimental results illustrate that DeepMSRF outperforms single modality speaker recognition methods with at least 3 percent accuracy.
The paper presents real time speckle de-noising based on activity computation algorithm and wavelet transform. Speckles arise in an image when laser light is reflected from an illuminated surface. The process involves detection of speckles in an image by obtaining a number of frames of the same object under different illumination or angle and comparing the frames for the granular computation and de-noising the same on presence of greater activity index. The project can be implemented in FPGA (Field Programmable Gate Array) technology. The results can be shown that the used activity computation algorithm and wavelet transform has better accuracy in the process of speckle detection and de-noising.
We propose a knowledge-enhanced approach, ERNIE-ViL, to learn joint representations of vision and language. ERNIE-ViL tries to construct the detailed semantic connections (objects, attributes of objects and relationships between objects in visual scenes) across vision and language, which are essential to vision-language cross-modal tasks. Incorporating knowledge from scene graphs, ERNIE-ViL constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction and Relationship Prediction in the pre-training phase. More specifically, these prediction tasks are implemented by predicting nodes of different types in the scene graph parsed from the sentence. Thus, ERNIE-ViL can model the joint representation characterizing the alignments of the detailed semantics across vision and language. Pre-trained on two large image-text alignment datasets (Conceptual Captions and SBU), ERNIE-ViL learns better and more robust joint representations. It achieves state-of-the-art performance on 5 vision-language downstream tasks after fine-tuning ERNIE-ViL. Furthermore, it ranked the 1st place on the VCR leader-board with an absolute improvement of 3.7\%.