Pichao Wang

TransReID: Transformer-based Object Re-Identification

Feb 08, 2021

Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, Wei Jiang

Abstract: In this paper, we explore the Vision Transformer (ViT), a pure transformer-based model, for the object re-identification (ReID) task. With several adaptations, a strong baseline, ViT-BoT, is constructed with ViT as the backbone; it achieves results comparable to convolutional neural network (CNN)-based frameworks on several ReID benchmarks. Furthermore, two modules are designed in consideration of the specialties of ReID data: (1) It is natural and simple for a Transformer to encode non-visual information, such as camera or viewpoint, into vector embedding representations. By plugging in these embeddings, ViT can eliminate the bias caused by diverse cameras or viewpoints. (2) We design a Jigsaw branch, parallel with the Global branch, to facilitate the training of the model in a two-branch learning framework. In the Jigsaw branch, a jigsaw patch module is designed to learn robust feature representations and help the training of the transformer by shuffling the patches. With these novel modules, we propose a pure-transformer framework dubbed TransReID, which is, to the best of our knowledge, the first work to use a pure Transformer for ReID research. Experimental results of TransReID are promising, achieving state-of-the-art performance on both person and vehicle ReID benchmarks.
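
As a rough illustration of the patch-shuffling idea mentioned in the abstract above, the following sketch shuffles a sequence of patch tokens and splits them into groups, so that each group mixes patches from different image regions. The function name `jigsaw_groups`, the group count, and the seeded random permutation are assumptions made for illustration; the abstract does not say how the paper's jigsaw patch module shuffles or groups patches.

```python
import random


def jigsaw_groups(patches, num_groups, seed=None):
    """Shuffle patch tokens and split them into equal-sized groups.

    patches: list of patch tokens (any objects, e.g. embedding vectors).
    num_groups: number of groups; must divide len(patches) evenly.
    NOTE: hypothetical helper, not the paper's implementation.
    """
    if num_groups <= 0 or len(patches) % num_groups != 0:
        raise ValueError("num_groups must be positive and divide len(patches)")
    rng = random.Random(seed)
    shuffled = list(patches)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    size = len(shuffled) // num_groups
    return [shuffled[i * size:(i + 1) * size] for i in range(num_groups)]


if __name__ == "__main__":
    tokens = list(range(16))  # stand-ins for 16 patch embeddings
    for group in jigsaw_groups(tokens, num_groups=4, seed=0):
        print(group)
```

Each group could then be encoded separately, which is one plausible way for a branch to see patches out of their original spatial order.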

Zen-NAS: A Zero-Shot NAS for High-Performance Deep Image Recognition

Feb 01, 2021

Ming Lin, Pichao Wang, Zhenhong Sun, Hesen Chen, Xiuyu Sun, Qi Qian, Hao Li, Rong Jin

Abstract: A key component in Neural Architecture Search (NAS) is an accuracy predictor, which assesses the accuracy of a queried architecture. To build a high-quality accuracy predictor, conventional NAS algorithms rely on training a large number of architectures or a big supernet. This step often consumes hundreds to thousands of GPU days, dominating the total search cost. To address this issue, we propose to replace the accuracy predictor with a novel model-complexity index named Zen-score. Instead of predicting model accuracy, Zen-score directly assesses the model complexity of a network without training its parameters. This is inspired by recent advances in deep learning theory which show that the model complexity of a network positively correlates with its accuracy on the target dataset. The computation of Zen-score takes only a few forward inferences through a randomly initialized network using random Gaussian input. It is applicable to any Vanilla Convolutional Neural Network (VCN-network) or compatible variant, covering a majority of networks popular in real-world applications. Combining Zen-score with an Evolutionary Algorithm, we obtain a novel Zero-Shot NAS algorithm named Zen-NAS. We conduct extensive experiments on CIFAR10/CIFAR100 and ImageNet. In summary, Zen-NAS is able to design high-performance architectures in less than half a GPU day (12 GPU hours). The resultant networks, named ZenNets, achieve up to 83.0% top-1 accuracy on ImageNet. Compared to EfficientNets-B3/B5 of the same or better accuracy, ZenNets are up to 5.6 times faster on NVIDIA V100, 11 times faster on NVIDIA T4, and 2.6 times faster on Google Pixel2, and use 50% fewer FLOPs. Our source code and pre-trained models are released at https://github.com/idstcv/ZenNAS.
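
The combination of a training-free score with an evolutionary algorithm, as described in the abstract above, can be sketched roughly as follows. Everything here is an assumption made for illustration: `proxy_score` is a toy stand-in and is not the Zen-score formula (which the abstract does not give), an architecture is reduced to a list of layer widths, and `cost` is a toy budget in place of FLOPs or latency.

```python
import random


def proxy_score(arch):
    """Hypothetical stand-in for a training-free score such as Zen-score.

    NOT the paper's formula: in this toy, wider layers simply score higher.
    """
    return sum(w ** 0.5 for w in arch)


def cost(arch):
    """Toy cost model standing in for a FLOPs or latency budget."""
    return sum(arch)


def mutate(arch, rng, choices=(16, 32, 64, 128)):
    """Return a copy of arch with one layer width resampled."""
    child = list(arch)
    child[rng.randrange(len(child))] = rng.choice(choices)
    return child


def evolve(init, budget, steps=200, pop_size=16, seed=0):
    """Evolve architectures, keeping the top scorers that fit the budget."""
    rng = random.Random(seed)
    population = [list(init)]  # init is assumed to satisfy the budget
    for _ in range(steps):
        child = mutate(rng.choice(population), rng)
        if cost(child) <= budget:
            population.append(child)
        # Drop the lowest-scoring members once the population is too large.
        population.sort(key=proxy_score, reverse=True)
        del population[pop_size:]
    return population[0]


if __name__ == "__main__":
    best = evolve([16, 16, 16, 16], budget=256)
    print(best, proxy_score(best), cost(best))
```

No network is trained inside the loop; each candidate is only scored, which is why a search of this kind can be cheap compared with predictor-based NAS.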

Trear: Transformer-based RGB-D Egocentric Action Recognition

Jan 05, 2021

Xiangyu Li, Yonghong Hou, Pichao Wang, Zhimin Gao, Mingliang Xu, Wanqing Li

Abstract: In this paper, we propose a Transformer-based RGB-D egocentric action recognition framework, called Trear. It consists of two modules: an inter-frame attention encoder and a mutual-attentional fusion block. Instead of using optical flow or recurrent units, we adopt a self-attention mechanism to model the temporal structure of the data from different modalities. Input frames are cropped randomly to mitigate the effect of data redundancy. Features from each modality interact through the proposed fusion block and are combined through a simple yet effective fusion operation to produce a joint RGB-D representation. Experiments on two large egocentric RGB-D datasets, THU-READ and FPHA, and one small dataset, WCVS, show that the proposed method outperforms the state-of-the-art results by a large margin.

* Accepted by IEEE Transactions

Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry

Dec 08, 2020

Xiangyu Li, Yonghong Hou, Pichao Wang, Zhimin Gao, Mingliang Xu, Wanqing Li

Abstract: Existing unsupervised visual odometry (VO) methods either match pairwise images or integrate temporal information using recurrent neural networks over a long sequence of images. They are inaccurate, time-consuming to train, or prone to error accumulation. In this paper, we propose a method consisting of two camera pose estimators that deal with the information from pairwise images and from a short sequence of images, respectively. For image sequences, a Transformer-like structure is adopted to build a geometry model over a local temporal window, referred to as the Transformer-based Auxiliary Pose Estimator (TAPE). Meanwhile, a Flow-to-Flow Pose Estimator (F2FPE) is proposed to exploit the relationship between pairwise images. The two estimators are constrained through a simple yet effective consistency loss in training. Empirical evaluation shows that the proposed method outperforms the state-of-the-art unsupervised learning-based methods by a large margin and performs comparably to supervised and traditional ones on the KITTI and Malaga datasets.

SAR-NAS: Skeleton-based Action Recognition via Neural Architecture Searching

Oct 29, 2020

Haoyuan Zhang, Yonghong Hou, Pichao Wang, Zihui Guo, Wanqing Li

Abstract: This paper presents a study of the automatic design of neural network architectures for skeleton-based action recognition. Specifically, we encode a skeleton-based action instance into a tensor and carefully define a set of operations to build two types of network cells: normal cells and reduction cells. The recently developed DARTS (Differentiable Architecture Search) is adopted to search for an effective network architecture built upon the two types of cells. All operations are 2D-based in order to reduce the overall computation and search space. Experiments on the challenging NTU RGB+D and Kinetics datasets have verified that most of the networks developed to date for skeleton-based action recognition are likely not compact and efficient. The proposed method provides an approach to search for a compact network that is able to achieve comparable or even better performance than the state-of-the-art methods.
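
For readers unfamiliar with DARTS, its core idea is a continuous relaxation: each edge in a cell computes a softmax-weighted sum of all candidate operations, and after the search the highest-weighted operation is kept. The sketch below shows that mechanism on toy data. The three candidate operations and the names `mixed_op` and `discretize` are assumptions for illustration and are unrelated to the operation set defined in the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Toy candidate operations on a list of floats (assumed for illustration).
CANDIDATES = {
    "identity": lambda x: list(x),
    "zero": lambda x: [0.0 for _ in x],
    "double": lambda x: [2.0 * v for v in x],
}


def mixed_op(x, alphas):
    """Softmax-weighted sum of every candidate operation applied to x.

    alphas: dict mapping each candidate name to its architecture parameter.
    """
    weights = softmax([alphas[name] for name in CANDIDATES])
    outputs = [op(x) for op in CANDIDATES.values()]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(x))
    ]


def discretize(alphas):
    """After the search, keep the candidate with the largest parameter."""
    return max(alphas, key=alphas.get)


if __name__ == "__main__":
    alphas = {"identity": 0.1, "zero": -1.0, "double": 2.0}
    print(mixed_op([1.0, 2.0], alphas))
    print(discretize(alphas))
```

In a real search the architecture parameters would be learned by gradient descent alongside the network weights; here they are fixed by hand.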

Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition

Aug 21, 2020

Zitong Yu, Benjia Zhou, Jun Wan, Pichao Wang, Haoyu Chen, Xin Liu, Stan Z. Li, Guoying Zhao

Abstract: Gesture recognition has attracted considerable attention owing to its great potential in applications. Although great progress has been made recently in multi-modal learning methods, existing methods still lack effective integration to fully explore synergies among spatio-temporal modalities for gesture recognition. The problems are partially due to the fact that existing manually designed network architectures have low efficiency in the joint learning of multiple modalities. In this paper, we propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition. The proposed method includes two key components: 1) enhanced temporal representation via the proposed 3D Central Difference Convolution (3D-CDC) family, which is able to capture rich temporal context by aggregating temporal difference information; and 2) optimized backbones for multi-sampling-rate branches and lateral connections among varied modalities. The resultant multi-modal multi-rate network provides a new perspective for understanding the relationship between RGB and depth modalities and their temporal dynamics. Comprehensive experiments are performed on three benchmark datasets (IsoGD, NvGesture, and EgoGesture), demonstrating state-of-the-art performance in both single- and multi-modality settings. The code is available at https://github.com/ZitongYu/3DCDC-NAS

* Submitted to IEEE Transactions on Image Processing

RobustTAD: Robust Time Series Anomaly Detection via Decomposition and Convolutional Neural Networks

Feb 21, 2020

Jingkun Gao, Xiaomin Song, Qingsong Wen, Pichao Wang, Liang Sun, Huan Xu

Abstract: The monitoring and management of numerous and diverse time series data at Alibaba Group calls for an effective and scalable time series anomaly detection service. In this paper, we propose RobustTAD, a Robust Time series Anomaly Detection framework that integrates robust seasonal-trend decomposition and a convolutional neural network for time series data. The seasonal-trend decomposition can effectively handle complicated patterns in time series while significantly simplifying the architecture of the neural network, which is an encoder-decoder architecture with skip connections. This architecture can effectively capture multi-scale information from time series, which is very useful in anomaly detection. Due to the limited labeled data in time series anomaly detection, we systematically investigate data augmentation methods in both the time and frequency domains. We also introduce label-based weight and value-based weight in the loss function by utilizing the unbalanced nature of the time series anomaly detection problem. Compared with the widely used forecasting-based anomaly detection algorithms, decomposition-based algorithms, traditional statistical algorithms, and recent neural-network-based algorithms, RobustTAD performs significantly better on public benchmark datasets. It is deployed as a public online service and widely adopted in different business scenarios at Alibaba Group.

* 9 pages, 5 figures, and 2 tables

RGB-D-based Human Motion Recognition with Deep Learning: A Survey

Apr 24, 2018

Pichao Wang, Wanqing Li, Philip Ogunbona, Jun Wan, Sergio Escalera

Abstract: Human motion recognition is one of the most important branches of human-centered research activities. In recent years, motion recognition based on RGB-D data has attracted much attention. Along with the development of artificial intelligence, deep learning techniques have gained remarkable success in computer vision. In particular, convolutional neural networks (CNN) have achieved great success in image-based tasks, and recurrent neural networks (RNN) are renowned for sequence-based problems. Specifically, deep learning methods based on the CNN and RNN architectures have been adopted for motion recognition using RGB-D data. In this paper, a detailed overview of recent advances in RGB-D-based motion recognition is presented. The reviewed methods are broadly categorized into four groups, depending on the modality adopted for recognition: RGB-based, depth-based, skeleton-based and RGB+D-based. As a survey focused on the application of deep learning to RGB-D-based motion recognition, we explicitly discuss the advantages and limitations of existing techniques. In particular, we highlight the methods of encoding the spatial-temporal-structural information inherent in video sequences, and discuss potential directions for future research.

Depth Pooling Based Large-scale 3D Action Recognition with Convolutional Neural Networks

Apr 17, 2018

Pichao Wang, Wanqing Li, Zhimin Gao, Chang Tang, Philip Ogunbona

Abstract: This paper proposes three simple, compact yet effective representations of depth sequences, referred to respectively as Dynamic Depth Images (DDI), Dynamic Depth Normal Images (DDNI) and Dynamic Depth Motion Normal Images (DDMNI), for both isolated and continuous action recognition. These dynamic images are constructed from a segmented sequence of depth maps using hierarchical bidirectional rank pooling to effectively capture the spatial-temporal information. Specifically, DDI exploits the dynamics of postures over time, while DDNI and DDMNI exploit the 3D structural information captured by depth maps. Based on the proposed representations, a ConvNet-based method is developed for action recognition. The image-based representations enable us to fine-tune existing Convolutional Neural Network (ConvNet) models trained on image data without training a large number of parameters from scratch. The proposed method achieved state-of-the-art results on three large datasets, namely the Large-scale Continuous Gesture Recognition Dataset (mean Jaccard index 0.4109), the Large-scale Isolated Gesture Recognition Dataset (59.21%), and the NTU RGB+D Dataset (87.08% cross-subject and 84.22% cross-view), even though only the depth modality was used.

* arXiv admin note: text overlap with arXiv:1701.01814, arXiv:1608.06338

Cooperative Training of Deep Aggregation Networks for RGB-D Action Recognition

Dec 05, 2017

Pichao Wang, Wanqing Li, Jun Wan, Philip Ogunbona, Xinwang Liu

Abstract: A novel deep neural network training paradigm that exploits the conjoint information in multiple heterogeneous sources is proposed. Specifically, in an RGB-D-based action recognition task, it cooperatively trains a single convolutional neural network (named c-ConvNet) on both RGB visual features and depth features, and deeply aggregates the two kinds of features for action recognition. Unlike the conventional ConvNet, which learns deep separable features for homogeneous modality-based classification with only one softmax loss function, the c-ConvNet enhances the discriminative power of the deeply learned features and weakens the undesired modality discrepancy by jointly optimizing a ranking loss and a softmax loss for both homogeneous and heterogeneous modalities. The ranking loss consists of intra-modality and cross-modality triplet losses, and it reduces both the intra-modality and cross-modality feature variations. Furthermore, the correlations between RGB and depth data are embedded in the c-ConvNet; they can be retrieved by either modality and contribute to recognition even when only one of the modalities is available. The proposed method was extensively evaluated on two large RGB-D action recognition datasets, ChaLearn LAP IsoGD and NTU RGB+D, and one small dataset, SYSU 3D HOI, and achieved state-of-the-art results.
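
The abstract above states that the ranking loss combines intra-modality and cross-modality triplet losses. A minimal sketch of that combination for a single RGB anchor is given below. The Euclidean distance, the margin of 1.0, the equal weighting of the two terms, and the names `triplet_loss` and `ranking_loss` are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: the positive should be closer than the negative by margin."""
    gap = distance(anchor, positive) - distance(anchor, negative)
    return max(0.0, gap + margin)


def ranking_loss(anchor_rgb, rgb_pair, depth_pair, margin=1.0):
    """Sum an intra-modality and a cross-modality triplet term.

    anchor_rgb: RGB feature of the anchor sample.
    rgb_pair:   (positive, negative) RGB features    -> intra-modality term.
    depth_pair: (positive, negative) depth features  -> cross-modality term.
    NOTE: equal weighting of the two terms is an assumption.
    """
    intra = triplet_loss(anchor_rgb, rgb_pair[0], rgb_pair[1], margin)
    cross = triplet_loss(anchor_rgb, depth_pair[0], depth_pair[1], margin)
    return intra + cross


if __name__ == "__main__":
    loss = ranking_loss(
        [0.0, 0.0],
        rgb_pair=([0.0, 1.0], [3.0, 4.0]),
        depth_pair=([3.0, 0.0], [0.0, 2.0]),
    )
    print(loss)
```

In this toy input the intra-modality triplet already satisfies the margin and contributes nothing, while the cross-modality triplet is violated and accounts for the whole loss.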
