Yizhou Wang

Learning Semantic-Agnostic and Spatial-Aware Representation for Generalizable Visual-Audio Navigation

Apr 21, 2023

Hongcheng Wang, Yuxuan Wang, Fangwei Zhong, Mingdong Wu, Jianwei Zhang, Yizhou Wang, Hao Dong

Abstract: Visual-audio navigation (VAN) is attracting more and more attention from the robotic community due to its broad applications, e.g., household robots and rescue robots. In this task, an embodied agent must search for and navigate to the sound source with egocentric visual and audio observations. However, the existing methods are limited in two aspects: 1) poor generalization to unheard sound categories; 2) sample inefficiency in training. Focusing on these two problems, we propose a brain-inspired plug-and-play method to learn a semantic-agnostic and spatial-aware representation for generalizable visual-audio navigation. We meticulously design two auxiliary tasks, one for each of these characteristics, to accelerate learning of the representation. With these two auxiliary tasks, the agent learns a spatially-correlated representation of visual and audio inputs that can be applied to environments with novel sounds and maps. Experimental results on realistic 3D scenes (Replica and Matterport3D) demonstrate that our method achieves better generalization performance when zero-shot transferred to scenes with unseen maps and unheard sound categories.

RSPT: Reconstruct Surroundings and Predict Trajectories for Generalizable Active Object Tracking

Apr 07, 2023

Fangwei Zhong, Xiao Bi, Yudi Zhang, Wei Zhang, Yizhou Wang

Abstract: Active Object Tracking (AOT) aims to maintain a specific relation between the tracker and object(s) by autonomously controlling the motion system of a tracker given observations. AOT has wide-ranging applications, such as in mobile robots and autonomous driving. However, building a generalizable active tracker that works robustly across different scenarios remains a challenge, especially in unstructured environments with cluttered obstacles and diverse layouts. We argue that constructing a state representation capable of modeling the geometric structure of the surroundings and the dynamics of the target is crucial for achieving this goal. To address this challenge, we present RSPT, a framework that forms a structure-aware motion representation by Reconstructing the Surroundings and Predicting the target Trajectory. Additionally, we enhance the generalization of the policy network by training it with an asymmetric dueling mechanism. We evaluate RSPT on various simulated scenarios and show that it outperforms existing methods in unseen environments, particularly those with complex obstacles and layouts. We also demonstrate the successful transfer of RSPT to real-world settings. Project Website: https://sites.google.com/view/aot-rspt.

* AAAI 2023 (Oral)

3D Human Mesh Estimation from Virtual Markers

Mar 27, 2023

Xiaoxuan Ma, Jiajun Su, Chunyu Wang, Wentao Zhu, Yizhou Wang

Abstract: Inspired by the success of volumetric 3D pose estimation, some recent human mesh estimators propose to estimate 3D skeletons as intermediate representations, from which the dense 3D meshes are regressed by exploiting the mesh topology. However, body shape information is lost in extracting skeletons, leading to mediocre performance. Advanced motion capture systems solve the problem by placing dense physical markers on the body surface, which allows realistic meshes to be extracted from their non-rigid motions. However, they cannot be applied to wild images without markers. In this work, we present an intermediate representation, named virtual markers, which learns 64 landmark keypoints on the body surface based on large-scale mocap data in a generative style, mimicking the effects of physical markers. The virtual markers can be accurately detected from wild images and can reconstruct the intact meshes with realistic shapes by simple interpolation. Our approach outperforms the state-of-the-art methods on three datasets. In particular, it surpasses the existing methods by a notable margin on the SURREAL dataset, which has diverse body shapes. Code is available at https://github.com/ShirleyMaxx/VirtualMarker.

* CVPR 2023

UniHCP: A Unified Model for Human-Centric Perceptions

Mar 19, 2023

Yuanzheng Ci, Yizhou Wang, Meilin Chen, Shixiang Tang, Lei Bai, Feng Zhu, Rui Zhao, Fengwei Yu, Donglian Qi, Wanli Ouyang

Abstract: Human-centric perceptions (e.g., pose estimation, human parsing, pedestrian detection, person re-identification, etc.) play a key role in industrial applications of visual models. While specific human-centric tasks have their own relevant semantic aspect to focus on, they also share the same underlying semantic structure of the human body. However, few works have attempted to exploit such homogeneity and design a general-purpose model for human-centric tasks. In this work, we revisit a broad range of human-centric tasks and unify them in a minimalist manner. We propose UniHCP, a Unified Model for Human-Centric Perceptions, which unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture. With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines on several in-domain and downstream tasks by direct evaluation. When adapted to a specific task, UniHCP achieves new SOTAs on a wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing, 86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID, and 85.8 JI on CrowdHuman for pedestrian detection, performing better than specialized models tailored for each task.

* Accepted for publication at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023 (CVPR 2023)

HumanBench: Towards General Human-centric Perception with Projector Assisted Pretraining

Mar 10, 2023

Shixiang Tang, Cheng Chen, Qingsong Xie, Meilin Chen, Yizhou Wang, Yuanzheng Ci, Lei Bai, Feng Zhu, Haiyang Yang, Li Yi(+2 more)

Abstract: Human-centric perceptions include a variety of vision tasks, which have widespread industrial applications, including surveillance, autonomous driving, and the metaverse. It is desirable to have a general pretrained model for versatile human-centric downstream tasks. This paper forges ahead along this path from the aspects of both benchmark and pretraining methods. Specifically, we propose HumanBench, a benchmark based on existing datasets, to comprehensively evaluate on common ground the generalization abilities of different pretraining methods on 19 datasets from 6 diverse downstream tasks, including person ReID, pose estimation, human parsing, pedestrian attribute recognition, pedestrian detection, and crowd counting. To learn both coarse-grained and fine-grained knowledge in human bodies, we further propose a Projector AssisTed Hierarchical pretraining method (PATH) to learn diverse knowledge at different granularity levels. Comprehensive evaluations on HumanBench show that our PATH achieves new state-of-the-art results on 17 downstream datasets and on-par results on the other 2 datasets. The code will be publicly available at https://github.com/OpenGVLab/HumanBench.

* Accepted to CVPR2023

Proactive Multi-Camera Collaboration For 3D Human Pose Estimation

Mar 07, 2023

Hai Ci, Mickel Liu, Xuehai Pan, Fangwei Zhong, Yizhou Wang

Abstract: This paper presents a multi-agent reinforcement learning (MARL) scheme for proactive Multi-Camera Collaboration in 3D Human Pose Estimation in dynamic human crowds. Traditional fixed-viewpoint multi-camera solutions for human motion capture (MoCap) are limited in capture space and susceptible to dynamic occlusions. Active camera approaches proactively control camera poses to find optimal viewpoints for 3D reconstruction. However, current methods still face challenges with credit assignment and environment dynamics. To address these issues, our proposed method introduces a novel Collaborative Triangulation Contribution Reward (CTCR) that improves convergence and alleviates multi-agent credit assignment issues resulting from using 3D reconstruction accuracy as the shared reward. Additionally, we jointly train our model with multiple world dynamics learning tasks to better capture environment dynamics and encourage anticipatory behaviors for occlusion avoidance. We evaluate our proposed method in four photo-realistic UE4 environments to ensure validity and generalizability. Empirical results show that our method outperforms fixed and active baselines in various scenarios with different numbers of cameras and humans.

* ICLR 2023 poster

Saliency Guided Contrastive Learning on Scene Images

Feb 23, 2023

Meilin Chen, Yizhou Wang, Shixiang Tang, Feng Zhu, Haiyang Yang, Lei Bai, Rui Zhao, Donglian Qi, Wanli Ouyang

Abstract: Self-supervised learning holds promise in leveraging large amounts of unlabeled data. However, its success heavily relies on highly curated datasets, e.g., ImageNet, which still need human cleaning. Directly learning representations from less-curated scene images is essential for pushing self-supervised learning to a higher level. Unlike curated images, which contain simple and clear semantic information, scene images are more complex and mosaic-like because they often include complex scenes and multiple objects. Although doing so is feasible, recent works have largely overlooked discovering the most discriminative regions for contrastive learning of object representations in scene images. In this work, we leverage the saliency map derived from the model's output during learning to highlight these discriminative regions and guide the whole contrastive learning. Specifically, the saliency map first guides the method to crop its discriminative regions as positive pairs and then reweights the contrastive losses among different crops by their saliency scores. Our method significantly improves the performance of self-supervised learning on scene images by +1.1, +4.3, and +2.2 Top-1 accuracy in ImageNet linear evaluation and semi-supervised learning with 1% and 10% ImageNet labels, respectively. We hope our insights on saliency maps can motivate future research on more general-purpose unsupervised representation learning from scene data.

* 12 pages, 5 figures. arXiv admin note: text overlap with arXiv:2106.11952 by other authors
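
The reweighting step described in the abstract above can be illustrated with a small sketch. This is not the authors' code: the abstract does not say how saliency scores are normalized or which contrastive loss is used, so the sketch assumes a standard InfoNCE loss per crop and weights that are the saliency scores divided by their sum. The function names `info_nce` and `saliency_weighted_loss` are hypothetical.

```python
import math


def info_nce(anchor, positive, negatives, temperature=0.2):
    """InfoNCE loss for one anchor embedding (assumed loss, not from the paper)."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # Index 0 holds the positive pair; the rest are negatives.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]

    # Numerically stable log-sum-exp over all logits.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


def saliency_weighted_loss(per_crop_losses, saliency_scores):
    """Average per-crop losses with weights proportional to saliency (assumed normalization)."""
    total = sum(saliency_scores)
    weights = [score / total for score in saliency_scores]
    return sum(w * loss for w, loss in zip(weights, per_crop_losses))


if __name__ == "__main__":
    # Two crops: the first is more salient, so its loss counts more.
    losses = [
        info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]]),
        info_nce([0.0, 1.0], [0.5, 0.5], [[1.0, 0.0]]),
    ]
    print(saliency_weighted_loss(losses, saliency_scores=[3.0, 1.0]))
```

With equal saliency scores the function reduces to a plain mean, so the saliency term only changes the result when some crops are judged more discriminative than others.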

Towards Explainable Visual Anomaly Detection

Feb 13, 2023

Yizhou Wang, Dongliang Guo, Sheng Li, Yun Fu

Abstract: Anomaly detection and localization of visual data, including images and videos, are of great significance in both machine learning academia and applied real-world scenarios. Despite the rapid development of visual anomaly detection techniques in recent years, interpretations of these black-box models and reasonable explanations of why anomalies can be distinguished are scarce. This paper provides the first survey concentrated on explainable visual anomaly detection methods. We first introduce the basic background of image-level anomaly detection and video-level anomaly detection, followed by the current explainable approaches for visual anomaly detection. Then, as the main content of this survey, a comprehensive and exhaustive literature review of explainable anomaly detection methods for both images and videos is presented. Finally, we discuss several promising future directions and open problems to explore on the explainability of visual anomaly detection.

Making Reconstruction-based Method Great Again for Video Anomaly Detection

Jan 28, 2023

Yizhou Wang, Can Qin, Yue Bai, Yi Xu, Xu Ma, Yun Fu

Abstract: Anomaly detection in videos is a significant yet challenging problem. Previous approaches based on deep neural networks employ either reconstruction-based or prediction-based approaches. Nevertheless, existing reconstruction-based methods 1) rely on old-fashioned convolutional autoencoders and are poor at modeling temporal dependency; 2) are prone to overfit the training samples, leading to indistinguishable reconstruction errors of normal and abnormal frames during the inference phase. To address such issues, firstly, we draw inspiration from the transformer and propose the Spatio-Temporal Auto-Trans-Encoder, dubbed STATE, as a new autoencoder model for enhanced consecutive frame reconstruction. Our STATE is equipped with a specifically designed learnable convolutional attention module for efficient temporal learning and reasoning. Secondly, we put forward a novel reconstruction-based input perturbation technique during testing to further differentiate anomalous frames. With the same perturbation magnitude, the testing reconstruction error of the normal frames lowers more than that of the abnormal frames, which contributes to mitigating the overfitting problem of reconstruction. Owing to the high relevance of the frame abnormality and the objects in the frame, we conduct object-level reconstruction using both the raw frame and the corresponding optical flow patches. Finally, the anomaly score is designed based on the combination of the raw and motion reconstruction errors using perturbed inputs. Extensive experiments on benchmark video anomaly detection datasets demonstrate that our approach outperforms previous reconstruction-based methods by a notable margin, and achieves state-of-the-art anomaly detection performance consistently. The code is available at https://github.com/wyzjack/MRMGA4VAD.

* Accepted by ICDM 2022

GFPose: Learning 3D Human Pose Prior with Gradient Fields

Dec 16, 2022

Hai Ci, Mingdong Wu, Wentao Zhu, Xiaoxuan Ma, Hao Dong, Fangwei Zhong, Yizhou Wang

Abstract: Learning a 3D human pose prior is essential to human-centered AI. Here, we present GFPose, a versatile framework to model plausible 3D human poses for various applications. At the core of GFPose is a time-dependent score network, which estimates the gradient on each body joint and progressively denoises the perturbed 3D human pose to match a given task specification. During the denoising process, GFPose implicitly incorporates pose priors in gradients and unifies various discriminative and generative tasks in an elegant framework. Despite its simplicity, GFPose demonstrates great potential in several downstream tasks. Our experiments empirically show that 1) as a multi-hypothesis pose estimator, GFPose outperforms existing SOTAs by 20% on the Human3.6M dataset; 2) as a single-hypothesis pose estimator, GFPose achieves comparable results to deterministic SOTAs, even with a vanilla backbone; 3) GFPose is able to produce diverse and realistic samples in pose denoising, completion and generation tasks. Project page: https://sites.google.com/view/gfpose/
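
The idea of progressively denoising a pose by following a per-joint gradient can be sketched on a toy problem. This is an illustration under strong assumptions, not GFPose itself: the learned score network is replaced by the closed-form score of an isotropic Gaussian centred on a single hypothetical `rest_pose`, the update is deterministic, and the noise schedule `sigmas` and `step_scale` are made up for the example.

```python
def toy_score(pose, sigma, rest_pose):
    """Closed-form score of N(rest_pose, sigma^2 I): (rest_pose - pose) / sigma^2.

    Stands in for a learned, time-dependent score network (assumption).
    """
    return [
        [(r - p) / sigma ** 2 for p, r in zip(joint, rest_joint)]
        for joint, rest_joint in zip(pose, rest_pose)
    ]


def denoise(pose, rest_pose, sigmas, step_scale=0.5):
    """Move every joint along its score, one step per noise level (large to small)."""
    for sigma in sigmas:
        gradient = toy_score(pose, sigma, rest_pose)
        # Step size scales with sigma^2, so each step closes a fixed
        # fraction (step_scale) of the remaining gap in this toy setting.
        step = step_scale * sigma ** 2
        pose = [
            [p + step * g for p, g in zip(joint, joint_grad)]
            for joint, joint_grad in zip(pose, gradient)
        ]
    return pose


if __name__ == "__main__":
    rest_pose = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # two joints, xyz
    noisy_pose = [[1.0, 1.0, 1.0], [1.0, 2.0, 1.0]]  # perturbed input
    print(denoise(noisy_pose, rest_pose, sigmas=[1.0, 0.5, 0.25, 0.1]))
```

A real sampler of this kind would query a trained network conditioned on the noise level and on task inputs, and would typically inject noise at each step; the toy version keeps only the "follow the gradient while the noise level shrinks" structure.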
