Mang Ye

Wuhan University

The Multi-Modal Video Reasoning and Analyzing Competition

Aug 18, 2021
Haoran Peng, He Huang, Li Xu, Tianjiao Li, Jun Liu, Hossein Rahmani, Qiuhong Ke, Zhicheng Guo, Cong Wu, Rongchang Li, Mang Ye, Jiahao Wang, Jiaxu Zhang, Yuanzhong Liu, Tao He, Fuwei Zhang, Xianbin Liu, Tao Lin

In this paper, we introduce the Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC) workshop, held in conjunction with ICCV 2021. The competition comprises four tracks, namely video question answering, skeleton-based action recognition, fisheye video-based action recognition, and person re-identification, built on two datasets: SUTD-TrafficQA and UAV-Human. We summarize the top-performing methods submitted by the participants and present the results they achieved in each track.

* Accepted to ICCV 2021 Workshops 

TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval

May 05, 2021
Yongbiao Chen, Sheng Zhang, Fangxin Liu, Zhigang Chang, Mang Ye, Zhengwei Qi

Deep Hamming hashing has gained growing popularity in approximate nearest neighbour search for large-scale image retrieval. Until now, deep hashing for image retrieval has been dominated by convolutional neural network architectures, e.g., ResNet (He et al., 2016). In this paper, inspired by recent advances in vision transformers, we present TransHash, a pure transformer-based framework for deep hashing. Concretely, our framework is composed of two major modules: (1) based on the Vision Transformer (ViT), we design a Siamese vision transformer backbone for image feature extraction, on top of which a dual-stream feature learning scheme learns discriminative global and local features; (2) we adopt a Bayesian learning scheme with a dynamically constructed similarity matrix to learn compact binary hash codes. The entire framework is jointly trained in an end-to-end manner. To the best of our knowledge, this is the first work to tackle deep hashing without convolutional neural networks (CNNs). We perform comprehensive experiments on three widely studied datasets: CIFAR-10, NUS-WIDE and ImageNet. The experiments demonstrate our superiority over existing state-of-the-art deep hashing methods, with gains of 8.2%, 2.6% and 12.7% in average mAP across different hash bit lengths on the three datasets, respectively.
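
The abstract names the ingredients of the Bayesian scheme but not its form. As a minimal sketch, assuming a standard DPSH-style pairwise sigmoid likelihood (a common choice for this kind of loss, not confirmed as the paper's exact formulation), the similarity-matrix supervision and quantization could look like:

```python
import torch
import torch.nn.functional as F

def pairwise_hashing_loss(u: torch.Tensor, labels: torch.Tensor,
                          quant_weight: float = 0.1) -> torch.Tensor:
    """u: (N, bits) continuous codes; labels: (N,) integer class ids."""
    # Similarity matrix rebuilt per batch from labels -- one way to read
    # "dynamically constructed": s_ij = 1 for same-class pairs, else 0.
    s = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    theta = 0.5 * u @ u.t()  # scaled code inner products
    # Negative log-likelihood of pairwise similarities under a sigmoid
    # likelihood; softplus(theta) = log(1 + exp(theta)) keeps it stable.
    nll = (F.softplus(theta) - s * theta).mean()
    # Quantization term pulls continuous codes toward binary {-1, +1}.
    quant = (u - u.sign()).pow(2).mean()
    return nll + quant_weight * quant
```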

Multi-Scale Cascading Network with Compact Feature Learning for RGB-Infrared Person Re-Identification

Dec 12, 2020
Can Zhang, Hong Liu, Wei Guo, Mang Ye

RGB-Infrared person re-identification (RGB-IR Re-ID) aims to match persons across heterogeneous images captured by visible and thermal cameras, which is of great significance for surveillance systems under poor lighting conditions. Facing the combined challenge of conventional single-modality variations and additional inter-modality discrepancies, most existing RGB-IR Re-ID methods impose constraints at the image level, the feature level, or a hybrid of both. Although hybrid constraints perform better, they are usually implemented with heavy network architectures. In fact, previous efforts serve mainly as pioneering works in the new cross-modal Re-ID area and leave large room for improvement, which can be attributed to: (1) the lack of abundant person image pairs from different modalities for training, and (2) the scarcity of salient modality-invariant features, especially in coarse representations, for effective matching. To address these issues, a novel Multi-Scale Part-Aware Cascading framework (MSPAC) is formulated by aggregating multi-scale fine-grained features from part to global in a cascading manner, yielding a unified representation with rich and enhanced semantic features. Furthermore, a marginal exponential centre (MeCen) loss is introduced to jointly eliminate mixed variations across intra- and inter-modal examples, so that cross-modality correlations can be efficiently explored on salient features for distinctive modality-invariant feature learning. Extensive experiments demonstrate that the proposed method outperforms all state-of-the-art methods by a large margin.
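
The abstract does not spell out the MeCen loss. One plausible reading, offered only as an illustrative sketch (the function name, the hinge margin, and the learnable class centers are assumptions, not the paper's exact formulation), is a centre-style loss with exponential emphasis on hard examples:

```python
import torch

def mecen_like_loss(feats: torch.Tensor, labels: torch.Tensor,
                    centers: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """feats: (N, D) embeddings; labels: (N,) ids; centers: (C, D) learnable."""
    d = (feats - centers[labels]).pow(2).sum(dim=1)  # squared distance to own centre
    hard = torch.clamp(d - margin, min=0.0)          # hinge: only margin violations count
    return torch.expm1(hard).mean()                  # exponential penalty on hard examples
```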

* 8 pages, 5 figures, ICPR 2020 conference 

Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification

Jul 18, 2020
Mang Ye, Jianbing Shen, David J. Crandall, Ling Shao, Jiebo Luo

Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem. Due to the large intra-class variations and the cross-modality discrepancy, together with a large amount of sample noise, it is difficult to learn discriminative part features. Existing VI-ReID methods instead tend to learn global representations, which have limited discriminability and weak robustness to noisy images. In this paper, we propose a novel dynamic dual-attentive aggregation (DDAG) learning method that mines both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID. We propose an intra-modality weighted-part attention module to extract discriminative part-aggregated features by imposing domain knowledge on the part relationship mining. To enhance robustness against noisy samples, we introduce cross-modality graph structured attention to reinforce the representation with contextual relations across the two modalities. We also develop a parameter-free dynamic dual aggregation learning strategy to adaptively integrate the two components in a progressive joint training manner. Extensive experiments demonstrate that DDAG outperforms state-of-the-art methods under various settings.
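
The "parameter-free dynamic dual aggregation" is not specified in the abstract. As a hedged illustration of the idea only (the schedule below is an assumption, not the paper's actual strategy), a ramp with no learnable parameters could progressively blend the graph-level objective into the part-level one:

```python
def dynamic_dual_weight(epoch: int, total_epochs: int) -> float:
    """Parameter-free ramp: early training is driven by the easier
    part-level objective; the graph-level term is blended in gradually."""
    return min(1.0, epoch / (0.5 * total_epochs))

def joint_loss(part_loss: float, graph_loss: float,
               epoch: int, total_epochs: int) -> float:
    # Progressive joint training: both terms, adaptively weighted by epoch.
    return part_loss + dynamic_dual_weight(epoch, total_epochs) * graph_loss
```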

* Accepted by ECCV20 

Deep Learning for Person Re-identification: A Survey and Outlook

Jan 13, 2020
Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, Steven C. H. Hoi

Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras. With the advancement of deep neural networks and the increasing demand for intelligent video surveillance, it has attracted significantly increased interest in the computer vision community. By dissecting the components involved in developing a person Re-ID system, we categorize existing research into the closed-world and open-world settings. The widely studied closed-world setting is usually applied under various research-oriented assumptions and has achieved inspiring success with deep learning techniques on a number of datasets. We first conduct a comprehensive overview, with in-depth analysis, of closed-world person Re-ID from three perspectives: deep feature representation learning, deep metric learning, and ranking optimization. As performance saturates under the closed-world setting, the research focus for person Re-ID has recently shifted to the open-world setting, which faces more challenging issues and is closer to practical applications in specific scenarios. We summarize open-world Re-ID in terms of five different aspects. By analyzing the advantages of existing methods, we design a powerful AGW baseline, achieving state-of-the-art or at least comparable performance on both single- and cross-modality Re-ID tasks. Meanwhile, we introduce a new evaluation metric (mINP) for person Re-ID, indicating the cost of finding all the correct matches, which provides an additional criterion for evaluating Re-ID systems in real applications. Finally, some important yet under-investigated open issues are discussed.
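
The mINP metric can be computed directly from each query's ranking list: if a query has |G| true matches and its hardest (last) correct match appears at rank R_hard, the negative penalty is NP = (R_hard - |G|) / R_hard, so INP = 1 - NP = |G| / R_hard, and mINP averages INP over queries. A small self-contained sketch:

```python
import numpy as np

def mean_inp(ranked_match_flags):
    """ranked_match_flags: list of 1-D 0/1 arrays, one per query,
    marking correct gallery matches in ranked order."""
    inps = []
    for flags in ranked_match_flags:
        positions = np.flatnonzero(flags) + 1  # 1-based ranks of correct matches
        if positions.size == 0:
            continue                           # query with no true match: skip
        r_hard = positions[-1]                 # rank of the hardest correct match
        inps.append(positions.size / r_hard)   # INP = |G| / R_hard
    return float(np.mean(inps))

# Example: matches at ranks 1 and 4 in a 5-item gallery -> INP = 2/4 = 0.5
print(mean_inp([np.array([1, 0, 0, 1, 0])]))
```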

* 20 pages, 8 figures 

Unsupervised Embedding Learning via Invariant and Spreading Instance Feature

Apr 06, 2019
Mang Ye, Xu Zhang, Pong C. Yuen, Shih-Fu Chang

This paper studies the unsupervised embedding learning problem, which requires an effective similarity measurement between samples in a low-dimensional embedding space. Motivated by the positive-concentrated and negative-separated properties observed in category-wise supervised learning, we propose to utilize instance-wise supervision to approximate these properties, aiming to learn data-augmentation-invariant and instance-spread-out features. To achieve this goal, we propose a novel instance-based softmax embedding method, which directly optimizes the 'real' instance features on top of the softmax function. It achieves significantly faster learning speed and higher accuracy than all existing methods. The proposed method performs well for both seen and unseen testing categories with cosine similarity. It also achieves competitive performance, even without a pre-trained network, on samples from fine-grained categories.
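
A minimal sketch of the instance-wise softmax idea described above (the temperature value and the two-view batching are assumptions; the paper's full training recipe differs): each augmented view is classified as its own instance over the real features of the batch, so other instances act as negatives.

```python
import torch
import torch.nn.functional as F

def instance_softmax_loss(f_aug: torch.Tensor, f_orig: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """f_aug, f_orig: (N, D) L2-normalized embeddings of two views of the
    same N instances; row i of f_aug should match row i of f_orig."""
    logits = f_aug @ f_orig.t() / temperature  # cosine similarities
    targets = torch.arange(f_aug.size(0), device=f_aug.device)
    # Classifying each view as its own instance enforces augmentation
    # invariance; pushing down other rows spreads instances apart.
    return F.cross_entropy(logits, targets)
```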

* CVPR 2019 

Dynamic Label Graph Matching for Unsupervised Video Re-Identification

Sep 27, 2017
Mang Ye, Andy J Ma, Liang Zheng, Jiawei Li, P C Yuen

Label estimation is an important component of an unsupervised person re-identification (re-ID) system. This paper focuses on cross-camera label estimation, which can subsequently be used in feature learning to learn robust re-ID models. Specifically, we propose to construct a graph for the samples in each camera, and then introduce a graph matching scheme for cross-camera label association. Since labels directly output by existing graph matching methods may be noisy and inaccurate due to significant cross-camera variations, this paper proposes a dynamic graph matching (DGM) method. DGM iteratively updates the image graph and the label estimation process by learning a better feature space with the intermediate estimated labels. DGM is advantageous in two aspects: 1) the accuracy of the estimated labels improves significantly over the iterations; 2) DGM is robust to noisy initial training data. Extensive experiments conducted on three benchmarks, including the large-scale MARS dataset, show that DGM yields performance competitive with fully supervised baselines and outperforms competing unsupervised learning methods.
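
As a hedged sketch of the label-association step only (the paper's graph matching and its dynamic updates are more elaborate than this), cross-camera association can be posed as an assignment problem over feature distances:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def estimate_cross_camera_labels(feats_a: np.ndarray,
                                 feats_b: np.ndarray) -> np.ndarray:
    """feats_a, feats_b: (N, D) L2-normalized per-camera features.
    Returns col, where col[i] is the camera-B sample associated with
    camera-A sample i."""
    cost = 1.0 - feats_a @ feats_b.T      # cosine distance matrix
    _, col = linear_sum_assignment(cost)  # one-to-one cross-camera association
    return col
```

In DGM this association alternates with re-ID training: the matched pairs supervise feature learning, and the improved features are re-matched, which is why label accuracy improves over the iterations.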

* Accepted by ICCV 2017. Revised our IDE results on the MARS dataset under the standard evaluation protocol 