Fan Zhu

Visual-Tactile Sensing for Real-Time Liquid Volume Estimation in Grasping

Feb 23, 2022
Fan Zhu, Ruixing Jia, Lei Yang, Youcan Yan, Zheng Wang, Jia Pan, Wenping Wang

We propose a deep visuo-tactile model for real-time estimation of the liquid volume inside a deformable container in a proprioceptive way. We fuse two sensory modalities, i.e., the raw visual inputs from an RGB camera and the tactile cues from our specific tactile sensor, without any extra sensor calibration. The robotic system is controlled and adjusted in real time based on the estimation model. The main contributions and novelties of our work are as follows: 1) We explore a proprioceptive approach to liquid volume estimation by developing an end-to-end predictive model with multi-modal convolutional networks, which achieves high precision, with an error of around 2 ml, in experimental validation. 2) We propose a multi-task learning architecture that jointly considers the losses from both classification and regression tasks, and comparatively evaluate each variant on the collected data and on an actual robotic platform. 3) We utilize the proprioceptive robotic system to accurately serve and control a requested volume of liquid that continuously flows into a deformable container in real time. 4) We adaptively adjust the grasping plan according to the real-time liquid volume prediction to achieve more stable grasping and manipulation.
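
As a rough illustration of the model described above, here is a minimal PyTorch sketch of a multi-modal, multi-task network, assuming 3-channel image-like inputs for both modalities; the layer sizes, module names, and loss weighting are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of the multi-modal, multi-task idea (module names and
# sizes are assumptions, not the authors' code): two CNN branches encode the
# RGB image and the tactile map, their features are fused, and two heads
# share the trunk -- a coarse volume-class head and a fine regression head
# whose losses are summed.
import torch
import torch.nn as nn

class VisuoTactileNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rgb, self.tactile = branch(), branch()
        self.fuse = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
        self.cls_head = nn.Linear(64, n_classes)   # coarse volume bin
        self.reg_head = nn.Linear(64, 1)           # volume in ml

    def forward(self, rgb, tac):
        z = self.fuse(torch.cat([self.rgb(rgb), self.tactile(tac)], dim=1))
        return self.cls_head(z), self.reg_head(z)

net = VisuoTactileNet()
rgb = torch.randn(4, 3, 64, 64)   # toy RGB batch
tac = torch.randn(4, 3, 64, 64)   # toy tactile-map batch (shape assumed)
logits, volume = net(rgb, tac)
# Multi-task loss: cross-entropy on the class head plus L1 on the regressor.
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (4,))) \
     + nn.L1Loss()(volume.squeeze(1), torch.rand(4) * 100)
loss.backward()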

G-VAE, a Geometric Convolutional VAE for Protein Structure Generation

Jun 22, 2021
Hao Huang, Boulbaba Ben Amor, Xichan Lin, Fan Zhu, Yi Fang

Analyzing the structure of proteins is a key part of understanding their functions and thus their role in biology at the molecular level. In addition, designing new proteins in a methodical way is a major engineering challenge. In this work, we introduce a joint geometric and neural-network approach for comparing, deforming and generating 3D protein structures. Viewing protein structures as 3D open curves, we adopt the Square Root Velocity Function (SRVF) representation and leverage its suitable geometric properties along with Deep Residual Networks (ResNets) for joint registration and comparison. Our ResNets handle large protein deformations better while being more computationally efficient. On top of this mathematical framework, we further design a Geometric Variational Auto-Encoder (G-VAE) which, once trained, maps original, previously unseen structures into a low-dimensional (latent) hypersphere. Motivated by the spherical structure of the pre-shape space, we naturally adopt the von Mises-Fisher (vMF) distribution to model our hidden variables. We test the effectiveness of our models by generating novel protein structures and predicting completions of corrupted protein structures. Experimental results show that our method is able to generate plausible structures that differ from the structures in the training data.
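
The SRVF representation the paper builds on has a compact closed form, q(t) = c'(t) / sqrt(||c'(t)||), and scale-normalized SRVFs lie on a unit hypersphere, which is what makes the vMF latent distribution natural. A minimal NumPy sketch, with the discretization details as assumptions:

# Sketch of the Square Root Velocity Function used to represent a protein
# backbone as an open 3D curve (discretization details are assumptions).
import numpy as np

def srvf(curve, eps=1e-8):
    """curve: (T, 3) array of backbone coordinates -> (T-1, 3) SRVF."""
    vel = np.diff(curve, axis=0)                       # discrete c'(t)
    speed = np.linalg.norm(vel, axis=1, keepdims=True)
    q = vel / np.sqrt(speed + eps)                     # q = c' / sqrt(||c'||)
    return q / (np.linalg.norm(q) + eps)               # scale-normalize

c = np.cumsum(np.random.randn(100, 3), axis=0)         # toy open curve
q = srvf(c)
print(np.linalg.norm(q))                               # ~1.0: a point on the hypersphere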

* 14 pages 

Residual Networks as Flows of Velocity Fields for Diffeomorphic Time Series Alignment

Jun 22, 2021
Hao Huang, Boulbaba Ben Amor, Xichan Lin, Fan Zhu, Yi Fang

Non-linear (large) time warping is a challenging source of nuisance in time-series analysis. In this paper, we propose a novel diffeomorphic temporal transformer network for both pairwise and joint time-series alignment. Our ResNet-TW (Deep Residual Network for Time Warping) tackles the alignment problem by composing a flow of incremental diffeomorphic mappings. Governed by the flow equation, our Residual Network (ResNet) builds smooth, fluid and regular flows of velocity fields and consequently generates smooth and invertible transformations (i.e., diffeomorphic warping functions). Inspired by the elegant Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework, the final transformation is built from the flow of time-dependent vector fields, which are none other than the building blocks of our Residual Network. The latter is naturally viewed as an Eulerian discretization scheme of the flow equation (an ODE). Once trained, our ResNet-TW aligns unseen data with a single inexpensive forward pass. As we show in experiments on both univariate (84 datasets from the UCR archive) and multivariate time series (MSR Action-3D, Florence-3D and MSR Daily Activity), ResNet-TW achieves competitive performance in joint alignment and classification.
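
The Eulerian view in the abstract has a direct translation: each residual block contributes one Euler step of the flow ODE dphi/dt = v_t(phi). Below is a toy NumPy sketch of composing velocity fields into a warping function; the monotonicity fix-up and the velocity fields themselves are assumptions, not the paper's trained network.

# One Euler step per "residual block": phi <- phi + (1/K) * v_k(phi).
import numpy as np

def euler_flow(t, velocity_fields):
    """t: (T,) time grid in [0, 1]; velocity_fields: list of callables v_k."""
    phi = t.copy()
    K = len(velocity_fields)
    for v in velocity_fields:
        phi = phi + (1.0 / K) * v(phi)         # one residual/Euler step
        phi = np.clip(np.sort(phi), 0.0, 1.0)  # crude fix to keep the warp monotone
    return phi

t = np.linspace(0, 1, 50)
vs = [lambda x, a=a: 0.1 * np.sin(np.pi * x + a) for a in np.linspace(0, 1, 8)]
phi = euler_flow(t, vs)
warped = np.interp(phi, t, np.sin(4 * np.pi * t))  # apply the warp to a toy signal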

* 19 pages 

A LiDAR Assisted Control Module with High Precision in Parking Scenarios for Autonomous Driving Vehicle

May 02, 2021
Xin Xu, Yu Dong, Fan Zhu

Autonomous driving has been quite promising in recent years. The public has seen robotaxis deployed by Waymo, Baidu, Cruise, and others. While autonomous driving vehicles certainly have a bright future, we have to admit that products such as robotaxis still have a long way to go. On the other hand, in less complex scenarios autonomous driving may have the potential to reliably outperform humans. For example, humans are good at interactive tasks (while autonomous driving systems usually are not), but we are often incapable of tasks with strict precision demands. In this paper, we introduce a real-world, industrial scenario of which human drivers are not capable. The task requires the ego vehicle to keep a stationary lateral distance (i.e., 3σ ≤ 5 centimeters) with respect to a reference. To address this challenge, we redesigned the control module of the Baidu Apollo open-source autonomous driving system. A precise (3σ ≤ 2 centimeters) Error Feedback System was first built to partly replace the localization module. We then investigated the control module thoroughly and added a real-time calibration algorithm to gain extra precision. We also built a simulation to fine-tune the control parameters. After all this work, the results are encouraging: an end-to-end lateral precision of 3σ ≤ 5 centimeters has been achieved. Further, we show that the results not only outperform the original Apollo modules but also beat specially trained and highly experienced human test drivers.
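
As a purely illustrative sketch of this kind of pipeline (not Apollo's actual controller), the loop below feeds a LiDAR-derived lateral error into a proportional-derivative law and slowly absorbs persistent error into a steering bias, standing in for the paper's real-time calibration; the gains, bias rule, and toy plant are all assumptions.

# Hypothetical lateral-control loop: LiDAR error feedback plus an online
# bias term playing the role of real-time calibration.
class LateralController:
    def __init__(self, kp=0.8, kd=0.2, calib_rate=0.01):
        self.kp, self.kd = kp, kd
        self.prev_err = 0.0
        self.steer_bias = 0.0          # online calibration term
        self.calib_rate = calib_rate

    def step(self, lidar_lateral_err, dt=0.02):
        d_err = (lidar_lateral_err - self.prev_err) / dt
        self.prev_err = lidar_lateral_err
        # Slowly absorb persistent error into the bias (calibration).
        self.steer_bias += self.calib_rate * lidar_lateral_err * dt
        return self.kp * lidar_lateral_err + self.kd * d_err + self.steer_bias

ctrl = LateralController()
err = 0.05                              # 5 cm off the reference line
for _ in range(200):
    steer = ctrl.step(err)
    err -= 0.5 * steer * 0.02           # toy plant: steering reduces the error
print(f"residual lateral error: {err * 100:.2f} cm")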

Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification

Apr 30, 2021
Yichao Yan, Jie Qin, Jiaxin Chen, Li Liu, Fan Zhu, Ying Tai, Ling Shao

Video-based person re-identification (re-ID) is an important research topic in computer vision. The key to tackling this challenging task is to exploit both spatial and temporal clues in video sequences. In this work, we propose a novel graph-based framework, namely Multi-Granular Hypergraph (MGH), to pursue better representational capability by modeling spatiotemporal dependencies at multiple granularities. Specifically, hypergraphs with different spatial granularities are constructed using various levels of part-based features across the video sequence. In each hypergraph, different temporal granularities are captured by hyperedges that connect a set of graph nodes (i.e., part-based features) across different temporal ranges. Two critical issues (misalignment and occlusion) are explicitly addressed by the proposed hypergraph propagation and feature aggregation schemes. Finally, we further enhance the overall video representation by learning more diversified graph-level representations of multiple granularities based on mutual information minimization. Extensive experiments on three widely adopted benchmarks clearly demonstrate the effectiveness of the proposed framework. Notably, MGH achieves 90.0% top-1 accuracy on MARS, outperforming the state of the art. Code is available at https://github.com/daodaofr/hypergraph_reid.
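
To make the hyperedge construction concrete, here is a small NumPy sketch, with tensor shapes and the averaging-style propagation as assumptions: each hyperedge ties one part-based feature to itself across a temporal window, and different window lengths give different temporal granularities.

# One node per (frame, part); a hyperedge groups one part over a window.
import numpy as np

T, P, D = 8, 4, 16                     # frames, parts per frame, feature dim
feats = np.random.randn(T, P, D)       # toy part-based node features

def hyperedges(T, P, t_range):
    """Each hyperedge = node indices of one part over t_range consecutive frames."""
    edges = []
    for p in range(P):
        for t0 in range(0, T - t_range + 1, t_range):
            edges.append([(t0 + dt) * P + p for dt in range(t_range)])
    return edges

short_edges = hyperedges(T, P, t_range=2)   # fine temporal granularity
long_edges = hyperedges(T, P, t_range=8)    # coarse temporal granularity

# Toy propagation stand-in: average node features within each hyperedge.
nodes = feats.reshape(T * P, D)
for e in short_edges + long_edges:
    nodes[e] = nodes[e].mean(axis=0)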

* CVPR 2020 

Learning Multi-Attention Context Graph for Group-Based Re-Identification

Apr 29, 2021
Yichao Yan, Jie Qin, Bingbing Ni, Jiaxin Chen, Li Liu, Fan Zhu, Wei-Shi Zheng, Xiaokang Yang, Ling Shao

Learning to re-identify or retrieve a group of people across non-overlapping camera systems has important applications in video surveillance. However, most existing methods focus on (single) person re-identification (re-id), ignoring the fact that people often walk in groups in real scenarios. In this work, we take a step further and consider employing context information for identifying groups of people, i.e., group re-id. We propose a novel unified framework based on graph neural networks to simultaneously address the group-based re-id tasks, i.e., group re-id and group-aware person re-id. Specifically, we construct a context graph with group members as its nodes to exploit dependencies among different people. A multi-level attention mechanism is developed to formulate both intra-group and inter-group context, with an additional self-attention module that produces robust graph-level representations by attentively aggregating node-level features. The proposed model can be directly generalized to tackle group-aware person re-id using node-level representations. Meanwhile, to facilitate the deployment of deep learning models on these tasks, we build a new group re-id dataset that contains more than 3.8K images with 1.5K annotated groups, an order of magnitude larger than existing group re-id datasets. Extensive experiments on the novel dataset as well as three existing datasets clearly demonstrate the effectiveness of the proposed framework for both group-based re-id tasks. The code is available at https://github.com/daodaofr/group_reid.
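
The attentive graph-level readout mentioned above can be sketched in a few lines of PyTorch; the dimensions and the scoring layer are assumptions, not the paper's module.

# Attention-weighted pooling of member features into one group embedding.
import torch
import torch.nn as nn

class AttentiveReadout(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.score = nn.Linear(d, 1)   # learned per-node attention score

    def forward(self, node_feats):     # (N, d): one row per group member
        w = torch.softmax(self.score(node_feats), dim=0)
        return (w * node_feats).sum(dim=0)   # (d,) graph-level embedding

readout = AttentiveReadout()
group_emb = readout(torch.randn(5, 64))      # 5 members -> 1 group embedding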

Anchor-Free Person Search

Mar 22, 2021
Yichao Yan, Jingpeng Li, Jie Qin, Song Bai, Shengcai Liao, Li Liu, Fan Zhu, Ling Shao

Person search aims to simultaneously localize and identify a query person from realistic, uncropped images, which can be regarded as the unified task of pedestrian detection and person re-identification (re-id). Most existing works employ two-stage detectors like Faster-RCNN, yielding encouraging accuracy but with high computational overhead. In this work, we present the Feature-Aligned Person Search Network (AlignPS), the first anchor-free framework to efficiently tackle this challenging task. AlignPS explicitly addresses the major challenges, which we summarize as misalignment issues at different levels (i.e., scale, region, and task), when accommodating an anchor-free detector for this task. More specifically, we propose an aligned feature aggregation module to generate more discriminative and robust feature embeddings by following a "re-id first" principle. This simple design directly improves the baseline anchor-free model on CUHK-SYSU by more than 20% in mAP. Moreover, AlignPS outperforms state-of-the-art two-stage methods at a higher speed. Code is available at https://github.com/daodaofr/AlignPS
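
Below is a minimal PyTorch sketch of an anchor-free person-search head in this spirit, with all layer sizes assumed: every feature-map location directly predicts a detection score, box offsets, and a unit-norm re-id embedding, with no anchor boxes involved.

# Per-location prediction heads on a shared feature map, anchor-free style.
import torch
import torch.nn as nn

class AnchorFreeSearchHead(nn.Module):
    def __init__(self, c=64, emb=32):
        super().__init__()
        self.cls = nn.Conv2d(c, 1, 1)     # person/background score per location
        self.box = nn.Conv2d(c, 4, 1)     # distances to the four box sides
        self.reid = nn.Conv2d(c, emb, 1)  # re-id embedding per location

    def forward(self, fmap):
        e = self.reid(fmap)
        e = e / e.norm(dim=1, keepdim=True).clamp(min=1e-6)  # unit-norm embeddings
        return self.cls(fmap).sigmoid(), self.box(fmap).relu(), e

head = AnchorFreeSearchHead()
scores, boxes, embs = head(torch.randn(1, 64, 32, 32))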

* Accepted by CVPR 2021 

P2-Net: Joint Description and Detection of Local Features for Pixel and Point Matching

Mar 01, 2021
Bing Wang, Changhao Chen, Zhaopeng Cui, Jie Qin, Chris Xiaoxuan Lu, Zhengdi Yu, Peijun Zhao, Zhen Dong, Fan Zhu, Niki Trigoni, Andrew Markham

Accurately describing and detecting 2D and 3D keypoints is crucial to establishing correspondences across images and point clouds. Although a plethora of learning-based 2D and 3D local feature descriptors and detectors have been proposed, the derivation of a shared descriptor and joint keypoint detector that directly matches pixels and points remains under-explored by the community. This work takes the initiative to establish fine-grained correspondences between 2D images and 3D point clouds. In order to directly match pixels and points, a dual fully convolutional framework is presented that maps 2D and 3D inputs into a shared latent representation space to simultaneously describe and detect keypoints. Furthermore, an ultra-wide reception mechanism and a novel loss function are designed to mitigate the intrinsic information variations between pixel and point local regions. Extensive experimental results demonstrate that our framework shows competitive performance in fine-grained matching between images and point clouds and achieves state-of-the-art results for the task of indoor visual localization. Our source code will be available at [no-name-for-blind-review].
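
The shared-latent-space idea can be sketched as two small encoders trained into one descriptor space, after which pixel-point matching reduces to nearest-neighbor search; the encoder architectures and input parameterizations below are assumptions, not the paper's networks.

# Two branches, one descriptor space: matching = nearest neighbors.
import torch
import torch.nn as nn

d = 32
pixel_enc = nn.Sequential(nn.Linear(27, 64), nn.ReLU(), nn.Linear(64, d))  # 3x3 RGB patch, flattened
point_enc = nn.Sequential(nn.Linear(48, 64), nn.ReLU(), nn.Linear(64, d))  # 16 xyz neighbors, flattened

pix = nn.functional.normalize(pixel_enc(torch.randn(100, 27)), dim=1)
pts = nn.functional.normalize(point_enc(torch.randn(80, 48)), dim=1)

sim = pix @ pts.t()                    # cosine similarity in the shared space
matches = sim.argmax(dim=1)            # best 3D point for each pixel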

Group Whitening: Balancing Learning Efficiency and Representational Capacity

Sep 28, 2020
Lei Huang, Li Liu, Fan Zhu, Ling Shao

Batch normalization (BN) is an important technique commonly incorporated into deep learning models to perform standardization within mini-batches. The merits of BN in improving a model's learning efficiency can be further amplified by applying whitening, while its drawbacks in estimating population statistics for inference can be avoided through group normalization (GN). This paper proposes group whitening (GW), which exploits the advantages of the whitening operation while avoiding the disadvantages of normalization within mini-batches. Specifically, GW divides the neurons of a sample into groups for standardization, like GN, and then further decorrelates the groups. In addition, we quantitatively analyze the constraint imposed by normalization and show how the batch size (group number) affects the performance of batch (group) normalized networks, from the perspective of a model's representational capacity. This analysis provides theoretical guidance for applying GW in practice. Finally, we apply the proposed GW to ResNet and ResNeXt architectures and conduct experiments on the ImageNet and COCO benchmarks. Results show that GW consistently improves the performance of different architectures, with absolute gains of $1.02\%$ $\sim$ $1.49\%$ in top-1 accuracy on ImageNet and $1.82\%$ $\sim$ $3.21\%$ in bounding box AP on COCO.
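
The GW recipe as stated, divide a sample's neurons into groups, standardize within each group, then decorrelate the groups, can be sketched directly in NumPy; ZCA whitening of the group-by-group covariance is one concrete choice of decorrelation here, and the implementation details are assumptions, not the paper's code.

# Per-sample group standardization followed by ZCA decorrelation of groups.
import numpy as np

def group_whiten(x, n_groups=4, eps=1e-5):
    """x: (N, C) activations; operates per sample, like GN."""
    N, C = x.shape
    g = x.reshape(N, n_groups, C // n_groups)               # (N, G, C/G)
    g = (g - g.mean(axis=2, keepdims=True)) / (g.std(axis=2, keepdims=True) + eps)
    out = np.empty_like(g)
    for i in range(N):
        cov = g[i] @ g[i].T / (C // n_groups)               # (G, G) group covariance
        vals, vecs = np.linalg.eigh(cov)
        zca = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
        out[i] = zca @ g[i]                                 # decorrelate the groups
    return out.reshape(N, C)

y = group_whiten(np.random.randn(8, 64))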

* V1: rejected by NeurIPS 2020. The main issue is that Assumption 1 is not rigorous, and thus neither is the analysis based on it. Code available at: https://github.com/huangleiBuaa/GroupWhitening 

Normalization Techniques in Training DNNs: Methodology, Analysis and Application

Sep 27, 2020
Lei Huang, Jie Qin, Yi Zhou, Fan Zhu, Li Liu, Ling Shao

Normalization techniques are essential for accelerating the training and improving the generalization of deep neural networks (DNNs), and have successfully been used in various applications. This paper reviews and comments on the past, present and future of normalization methods in the context of DNN training. We provide a unified picture of the main motivation behind different approaches from the perspective of optimization, and present a taxonomy for understanding the similarities and differences between them. Specifically, we decompose the pipeline of the most representative normalizing-activation methods into three components: normalization area partitioning, the normalization operation, and normalization representation recovery. In doing so, we provide insight for designing new normalization techniques. Finally, we discuss the current progress in understanding normalization methods and provide a comprehensive review of the applications of normalization for particular tasks, where it can effectively solve key issues.
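
The three-component decomposition can be made concrete with a generic normalization function in which the choice of pooling axes plays the role of area partitioning; this toy NumPy sketch is an illustration of the taxonomy, not code from the paper.

# (1) area partitioning = which axes statistics are pooled over;
# (2) operation = standardization; (3) recovery = learnable scale and shift.
import numpy as np

def normalize(x, axes, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean(axis=axes, keepdims=True)       # (1) partition via `axes`
    var = x.var(axis=axes, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)       # (2) standardizing operation
    return gamma * x_hat + beta                 # (3) representation recovery

x = np.random.randn(8, 16, 4, 4)                # (N, C, H, W)
bn = normalize(x, axes=(0, 2, 3))               # batch norm: pool over N, H, W
ln = normalize(x, axes=(1, 2, 3))               # layer norm: pool over C, H, W
in_ = normalize(x, axes=(2, 3))                 # instance norm: pool over H, W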

* 20 pages 