Jian Zhao

School of Electronic Science and Engineering, Nanjing University

UniParser: Multi-Human Parsing with Unified Correlation Representation Learning

Oct 13, 2023
Jiaming Chu, Lei Jin, Junliang Xing, Jian Zhao

Multi-human parsing is an image segmentation task necessitating both instance-level and fine-grained category-level information. However, prior research has typically processed these two types of information through separate branches and distinct output formats, leading to inefficient and redundant frameworks. This paper introduces UniParser, which integrates instance-level and category-level representations in three key aspects: 1) we propose a unified correlation representation learning approach, allowing our network to learn instance and category features within the cosine space; 2) we unify the output form of each module as pixel-level segmentation results while supervising instance and category features using a homogeneous label accompanied by an auxiliary loss; and 3) we design a joint optimization procedure to fuse instance and category representations. By virtue of unifying instance-level and category-level outputs, UniParser circumvents manually designed post-processing techniques and surpasses state-of-the-art methods, achieving 49.3% AP on MHPv2.0 and 60.4% AP on CIHP. We will release our source code, pretrained models, and online demos to facilitate future studies.
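
As a rough illustration of the unified output idea, the sketch below computes cosine-space correlations between pixel features and a set of instance or category queries, so that both branches emit the same kind of pixel-level mask logits. All names, shapes, and the scale factor are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch: cosine-space correlation between pixel features and
# instance/category queries, producing pixel-level segmentation logits.
import torch
import torch.nn.functional as F

def cosine_correlation(pixel_feats, queries, scale=10.0):
    """pixel_feats: (B, C, H, W); queries: (B, N, C) instance or category embeddings."""
    B, C, H, W = pixel_feats.shape
    pixels = F.normalize(pixel_feats.flatten(2), dim=1)   # (B, C, H*W), unit length
    queries = F.normalize(queries, dim=-1)                # (B, N, C), unit length
    corr = torch.bmm(queries, pixels)                     # cosine similarities, (B, N, H*W)
    return (scale * corr).view(B, -1, H, W)               # one mask-logit map per query

# Instance and category branches can share this output form, so their
# predictions are directly comparable pixel-level maps.
inst_logits = cosine_correlation(torch.randn(2, 64, 32, 32), torch.randn(2, 8, 64))
cat_logits = cosine_correlation(torch.randn(2, 64, 32, 32), torch.randn(2, 20, 64))
```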

Adversarial Attacks on Video Object Segmentation with Hard Region Discovery

Sep 25, 2023
Ping Li, Yu Zhang, Li Yuan, Jian Zhao, Xianghua Xu, Xiaoqin Zhang

Video object segmentation (VOS) has been applied to various computer vision tasks, such as video editing, autonomous driving, and human-robot interaction. However, methods based on deep neural networks are vulnerable to adversarial examples: inputs corrupted by almost human-imperceptible perturbations with which the adversary (i.e., attacker) fools the segmentation model into making incorrect pixel-level predictions. This raises security issues in highly demanding tasks, because small perturbations to the input video carry potential attack risks. Though adversarial examples have been extensively studied for classification, they are rarely studied in video object segmentation. Existing related methods in computer vision either require prior knowledge of categories or cannot be directly applied due to designs specialized for certain tasks, and they fail to consider pixel-wise region attacks. Hence, this work develops an object-agnostic adversary that attacks VOS by perturbing the first frame via hard region discovery. In particular, gradients from the segmentation model are exploited to discover easily confused regions, where it is difficult to separate pixel-wise objects from the background in a frame. This yields a hardness map that helps generate perturbations with stronger adversarial power for attacking the first frame. Empirical studies on three benchmarks indicate that our attacker significantly degrades the performance of several state-of-the-art video object segmentation models.
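
A minimal sketch of gradient-guided first-frame attacking, assuming a generic differentiable segmentation model and loss; the hardness weighting below only illustrates the idea of concentrating perturbations on easily confused regions and is not the authors' exact procedure.

```python
# Illustrative gradient-guided attack on the first frame of a video.
import torch

def attack_first_frame(model, frame, target_mask, loss_fn, epsilon=8 / 255, steps=10):
    adv = frame.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(model(adv), target_mask)
        grad, = torch.autograd.grad(loss, adv)
        # Regions with large gradient magnitude are "hard" (easily confused);
        # weight the perturbation toward them.
        hardness = grad.abs().mean(dim=1, keepdim=True)
        hardness = hardness / (hardness.max() + 1e-8)
        adv = adv + (epsilon / steps) * hardness * grad.sign()
        adv = torch.clamp(adv, 0, 1).detach().requires_grad_(True)
    return adv.detach()
```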

3D Implicit Transporter for Temporally Consistent Keypoint Discovery

Sep 10, 2023
Chengliang Zhong, Yuhang Zheng, Yupeng Zheng, Hao Zhao, Li Yi, Xiaodong Mu, Ling Wang, Pengfei Li, Guyue Zhou, Chao Yang, Xinliang Zhang, Jian Zhao

Keypoint-based representations have proven advantageous in various visual and robotic tasks. However, existing 2D and 3D methods for detecting keypoints mainly rely on geometric consistency to achieve spatial alignment, neglecting temporal consistency. To address this issue, the Transporter method was introduced for 2D data, which reconstructs the target frame from the source frame to incorporate both spatial and temporal information. However, directly applying the Transporter to 3D point clouds is infeasible due to their structural differences from 2D images. Thus, we propose the first 3D version of the Transporter, which leverages a hybrid 3D representation, cross-attention, and implicit reconstruction. We apply this new learning system to 3D articulated objects and nonrigid animals (humans and rodents) and show that the learned keypoints are spatio-temporally consistent. Additionally, we propose a closed-loop control strategy that utilizes the learned keypoints for 3D object manipulation and demonstrate its superior performance. Code is available at https://github.com/zhongcl-thu/3D-Implicit-Transporter.
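
The cross-attention ingredient could be sketched as below, with target-frame point features querying source-frame features; the module layout and dimensions are illustrative assumptions rather than the released architecture.

```python
# Hedged sketch of cross-attention between source- and target-frame point features.
import torch
import torch.nn as nn

class PointCrossAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_feats, source_feats):
        # target_feats, source_feats: (B, N_points, dim).
        # Target points query the source frame, pulling in the information
        # needed to reconstruct the target from the source.
        fused, _ = self.attn(target_feats, source_feats, source_feats)
        return self.norm(target_feats + fused)

fused = PointCrossAttention()(torch.randn(2, 512, 128), torch.randn(2, 512, 128))
```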

* ICCV 2023 oral paper

Uncovering the Unseen: Discover Hidden Intentions by Micro-Behavior Graph Reasoning

Aug 29, 2023
Zhuo Zhou, Wenxuan Liu, Danni Xu, Zheng Wang, Jian Zhao

This paper introduces a new and challenging Hidden Intention Discovery (HID) task. Unlike existing intention recognition tasks, which rely on obvious visual representations to identify common intentions behind normal behavior, HID focuses on discovering hidden intentions when humans try to conceal their intentions behind abnormal behavior. HID presents a unique challenge in that hidden intentions lack the obvious visual cues that would distinguish them from normal intentions. Fortunately, from a sociological and psychological perspective, we find that the difference between hidden and normal intentions can be reasoned from multiple micro-behaviors, such as gaze, attention, and facial expressions. Therefore, we first discover the relationship between micro-behaviors and hidden intentions and use a graph structure to reason about hidden intentions. To facilitate research on HID, we also construct a seminal dataset containing hidden intention annotations for a typical theft scenario. Extensive experiments show that the proposed network improves performance on the HID task by 9.9% over the state-of-the-art method SBP.
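
A hypothetical sketch of graph reasoning over micro-behavior nodes (e.g., gaze, attention, expression); the learnable adjacency, node count, and classifier head are assumptions for illustration, not the paper's graph design.

```python
# Toy graph-reasoning module over micro-behavior node features.
import torch
import torch.nn as nn

class MicroBehaviorGraph(nn.Module):
    def __init__(self, in_dim=256, out_dim=128, num_nodes=5):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(num_nodes) + 0.1)  # learnable relations
        self.proj = nn.Linear(in_dim, out_dim)
        self.cls = nn.Linear(out_dim, 2)                      # hidden vs. normal intention

    def forward(self, node_feats):                 # node_feats: (B, num_nodes, in_dim)
        adj = torch.softmax(self.adj, dim=-1)      # normalize relation weights
        mixed = adj @ self.proj(node_feats)        # propagate between micro-behaviors
        return self.cls(mixed.mean(dim=1))         # graph-level intention logits

logits = MicroBehaviorGraph()(torch.randn(4, 5, 256))
```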

Unified Single-Stage Transformer Network for Efficient RGB-T Tracking

Aug 26, 2023
Jianqiang Xia, DianXi Shi, Ke Song, Linna Song, XiaoLei Wang, Songchang Jin, Li Zhou, Yu Cheng, Lei Jin, Zheng Zhu, Jianan Li, Gang Wang, Junliang Xing, Jian Zhao

Most existing RGB-T tracking networks extract modality features separately, which lacks interaction and mutual guidance between modalities. This limits the network's ability to adapt to the diverse dual-modality appearances of targets and the dynamic relationships between the modalities. Additionally, the three-stage fusion tracking paradigm followed by these networks significantly restricts the tracking speed. To overcome these problems, we propose a unified single-stage Transformer RGB-T tracking network, namely USTrack, which unifies the above three stages into a single ViT (Vision Transformer) backbone with a dual embedding layer through a self-attention mechanism. With this structure, the network can extract fusion features of the template and search region under the mutual interaction of modalities. Simultaneously, relation modeling is performed between these features, efficiently obtaining search-region fusion features with better target-background discriminability for prediction. Furthermore, we introduce a novel feature selection mechanism based on modality reliability to mitigate the influence of invalid modalities on prediction, further improving the tracking performance. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance while maintaining the fastest inference speed of 84.2 FPS. In particular, MPR/MSR on the short-term and long-term subsets of the VTUAV dataset increase by 11.1%/11.7% and 11.3%/9.7%, respectively.
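
One way to picture the single-stage design is the sketch below: RGB and thermal patches pass through separate (dual) embedding layers, and all resulting tokens are processed jointly by one Transformer encoder, so feature extraction, fusion, and relation modeling share the same self-attention stack. Shapes, layer counts, and names are assumptions, not USTrack's released configuration.

```python
# Illustrative dual-embedding, single-encoder layout for RGB-T tracking.
import torch
import torch.nn as nn

dim, patch = 256, 16
embed_rgb = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # RGB patch embedding
embed_tir = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # thermal patch embedding
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=4)

def joint_forward(template_rgb, template_tir, search_rgb, search_tir):
    # Concatenate all modality/region tokens and let self-attention handle
    # extraction, fusion, and template-search relation modeling at once.
    tokens = torch.cat([
        embed_rgb(template_rgb).flatten(2).transpose(1, 2),
        embed_tir(template_tir).flatten(2).transpose(1, 2),
        embed_rgb(search_rgb).flatten(2).transpose(1, 2),
        embed_tir(search_tir).flatten(2).transpose(1, 2),
    ], dim=1)
    return encoder(tokens)   # fused tokens for a downstream prediction head
```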

Color Prompting for Data-Free Continual Unsupervised Domain Adaptive Person Re-Identification

Aug 21, 2023
Jianyang Gu, Hao Luo, Kai Wang, Wei Jiang, Yang You, Jian Zhao

Unsupervised domain adaptive person re-identification (Re-ID) methods alleviate the burden of data annotation by generating pseudo supervision messages. However, real-world Re-ID systems, with continuously accumulating data streams, simultaneously demand more robust adaptation and anti-forgetting capabilities. Methods based on image rehearsal address the forgetting issue with limited extra storage but carry the risk of privacy leakage. In this work, we propose a Color Prompting (CoP) method for data-free continual unsupervised domain adaptive person Re-ID. Specifically, we employ a lightweight prompter network to fit the color distribution of the current task together with Re-ID training. For incoming new tasks, the learned color distribution then serves as color style transfer guidance to transfer the images into past styles. CoP achieves accurate color style recovery for past tasks with adequate data diversity, leading to superior anti-forgetting effects compared with image rehearsal methods. Moreover, CoP demonstrates strong generalization performance for fast adaptation to new domains, given only a small amount of unlabeled images. Extensive experiments demonstrate that after the continual training pipeline, the proposed CoP achieves 6.7% and 8.1% average rank-1 improvements over the replay method on seen and unseen domains, respectively. The source code for this work is publicly available at https://github.com/vimar-gu/ColorPromptReID.
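
For intuition, the sketch below transfers images into a past color style by matching per-channel statistics; the actual CoP prompter is a learned network, so this statistics-matching version is only an assumed stand-in for the idea of replaying past color styles without storing images.

```python
# Minimal color-style transfer by matching stored per-channel statistics.
import torch

def transfer_color_style(images, past_mean, past_std, eps=1e-6):
    """images: (B, 3, H, W); past_mean/past_std: (3,) color statistics of an old task."""
    cur_mean = images.mean(dim=(2, 3), keepdim=True)   # per-image channel means
    cur_std = images.std(dim=(2, 3), keepdim=True)     # per-image channel stds
    normed = (images - cur_mean) / (cur_std + eps)
    # Re-color the current images with the remembered statistics of a past task.
    return normed * past_std.view(1, 3, 1, 1) + past_mean.view(1, 3, 1, 1)
```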

RoPDA: Robust Prompt-based Data Augmentation for Low-Resource Named Entity Recognition

Jul 17, 2023
Sihan Song, Furao Shen, Jian Zhao

Data augmentation has been widely used in low-resource NER tasks to tackle the problem of data sparsity. However, previous data augmentation methods suffer from disrupted syntactic structures, token-label mismatch, and a requirement for external knowledge or manual effort. To address these issues, we propose Robust Prompt-based Data Augmentation (RoPDA) for low-resource NER. Based on pre-trained language models (PLMs) with continuous prompts, RoPDA performs entity augmentation and context augmentation through five fundamental augmentation operations to generate label-flipping and label-preserving examples. To optimize the utilization of the augmented samples, we present two techniques: Self-Consistency Filtering and mixup. The former effectively eliminates low-quality samples, while the latter prevents performance degradation arising from the direct use of label-flipping samples. Extensive experiments on three benchmarks from different domains demonstrate that RoPDA significantly improves upon strong baselines and also outperforms state-of-the-art semi-supervised learning methods when unlabeled data is included.
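
A hedged sketch of the mixup step, interpolating embeddings and soft label distributions so that label-flipping samples are not used directly; the mixing-coefficient sampling and tensor shapes are generic assumptions, not the paper's exact recipe.

```python
# Generic mixup over token embeddings and soft tag distributions.
import torch

def mixup(emb_a, labels_a, emb_b, labels_b, alpha=0.4):
    """emb: (B, L, D) token embeddings; labels: (B, L, C) one-hot or soft tag distributions."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed_emb = lam * emb_a + (1 - lam) * emb_b
    mixed_labels = lam * labels_a + (1 - lam) * labels_b
    return mixed_emb, mixed_labels
```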

Evidential Detection and Tracking Collaboration: New Problem, Benchmark and Algorithm for Robust Anti-UAV System

Jul 04, 2023
Xue-Feng Zhu, Tianyang Xu, Jian Zhao, Jia-Wei Liu, Kai Wang, Gang Wang, Jianan Li, Qiang Wang, Lei Jin, Zheng Zhu, Junliang Xing, Xiao-Jun Wu

Unmanned Aerial Vehicles (UAVs) have been widely used in many areas, including transportation, surveillance, and the military. However, their potential for safety and privacy violations is an increasing concern and greatly limits their broader applications, underscoring the critical importance of UAV perception and defense (anti-UAV). Previous works have simplified the anti-UAV task as a tracking problem in which prior information about the UAV is always provided; such a scheme fails in real-world anti-UAV settings (i.e., complex scenes, UAVs that disappear and reappear unpredictably, and real-time UAV surveillance). In this paper, we first formulate a new and practical anti-UAV problem featuring UAV perception in complex scenes without prior UAV information. To benchmark this challenging task, we propose the largest UAV dataset, dubbed AntiUAV600, and a new evaluation metric. AntiUAV600 comprises 600 video sequences of challenging scenes with random, fast, and small-scale UAVs, with over 723K thermal infrared frames densely annotated with bounding boxes. Finally, we develop a novel anti-UAV approach via an evidential collaboration of global UAV detection and local UAV tracking, which effectively tackles the proposed problem and can serve as a strong baseline for future research. Extensive experiments show that our method outperforms SOTA approaches and validate the ability of AntiUAV600 to enhance UAV perception performance due to its large scale and complexity. Our dataset, pretrained models, and source code will be released publicly.
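
A simplified sketch of how global detection and local tracking might collaborate: fall back to the detector whenever tracking confidence drops (e.g., the UAV leaves and re-enters the view). The detector/tracker interfaces and the confidence threshold are hypothetical, and the evidential fusion described above is not modeled here.

```python
# Toy detection-tracking collaboration loop with re-detection on low confidence.
def track_with_redetection(frames, detector, tracker, conf_thresh=0.5):
    box = None
    results = []
    for frame in frames:
        if box is not None:
            box, conf = tracker.update(frame)            # local tracking
            if conf < conf_thresh:
                box = None                               # target likely lost
        if box is None:
            detections = detector(frame)                 # global re-detection
            if detections:
                box = max(detections, key=lambda d: d["score"])["box"]
                tracker.init(frame, box)
        results.append(box)                              # None when the UAV is absent
    return results
```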

GIMM: InfoMin-Max for Automated Graph Contrastive Learning

May 27, 2023
Xin Xiong, Furao Shen, Xiangyu Wang, Jian Zhao

Graph contrastive learning (GCL) shows great potential in unsupervised graph representation learning. Data augmentation plays a vital role in GCL, and its optimal choice heavily depends on the downstream task. Many GCL methods with automated data augmentation face the risk of insufficient information, as they fail to preserve the essential information necessary for the downstream task. To solve this problem, we propose InfoMin-Max for automated Graph contrastive learning (GIMM), which prevents GCL from encoding redundant information and losing essential information. GIMM consists of two major modules: (1) an automated graph view generator, which acquires an approximation of InfoMin's optimal views through adversarial training without requiring task-relevant information; and (2) a view comparison module, which learns an excellent encoder by applying InfoMax to view representations. To the best of our knowledge, GIMM is the first method that combines the InfoMin and InfoMax principles in GCL. Besides, GIMM introduces randomness into augmentation, thus stabilizing the model against perturbations. Extensive experiments on unsupervised and semi-supervised learning for node and graph classification demonstrate the superiority of GIMM over state-of-the-art GCL methods with automated and manual data augmentation.
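
For the InfoMax side, a standard InfoNCE-style contrastive loss between node embeddings from two augmented views might look like the sketch below; the temperature and shapes are illustrative, and the InfoMin view generator is not shown.

```python
# InfoNCE-style contrastive loss between two graph views.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.2):
    """z1, z2: (N, D) node embeddings from two augmented views of the same graph."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                       # similarity of every node pair
    targets = torch.arange(z1.size(0), device=z1.device)     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```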
