Gang Hua

Wormpex AI Research

Sparse Pedestrian Character Learning for Trajectory Prediction

Nov 27, 2023

Yonghao Dong, Le Wang, Sanpin Zhou, Gang Hua, Changyin Sun

Abstract:Pedestrian trajectory prediction in a first-person view has recently attracted much attention due to its importance in autonomous driving. Recent work utilizes pedestrian character information, i.e., action and appearance, to improve the learned trajectory embedding and achieves state-of-the-art performance. However, it neglects the invalid and negative pedestrian character information, which is harmful to trajectory representation and thus leads to performance degradation. To address this issue, we present a two-stream sparse-character-based network (TSNet) for pedestrian trajectory prediction. Specifically, TSNet learns the negative-removed characters in the sparse character representation stream to improve the trajectory embedding obtained in the trajectory representation stream. Moreover, to model the negative-removed characters, we propose a novel sparse character graph, including the sparse category and sparse temporal character graphs, to learn the different effects of various characters in category and temporal dimensions, respectively. Extensive experiments on two first-person view datasets, PIE and JAAD, show that our method outperforms existing state-of-the-art methods. In addition, ablation studies demonstrate different effects of various characters and prove that TSNet outperforms approaches that do not eliminate negative characters.

Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception

Nov 23, 2023

Lei Fan, Mingfu Liang, Yunxuan Li, Gang Hua, Ying Wu

Abstract:Active recognition enables robots to intelligently explore novel observations, thereby acquiring more information while circumventing undesired viewing conditions. Recent approaches favor learning policies from simulated or collected data, wherein appropriate actions are more frequently selected when the recognition is accurate. However, most recognition modules are developed under the closed-world assumption, which makes them ill-equipped to handle unexpected inputs, such as the absence of the target object in the current observation. To address this issue, we propose treating active recognition as a sequential evidence-gathering process, providing step-by-step uncertainty quantification and reliable prediction under the evidence combination theory. Additionally, the reward function developed in this paper effectively characterizes the merit of actions when operating in open-world environments. To evaluate the performance, we collect a dataset from an indoor simulator, encompassing various recognition challenges such as distance, occlusion levels, and visibility. Through a series of experiments on recognition and robustness analysis, we demonstrate the necessity of introducing uncertainties to active recognition and the superior performance of the proposed method.
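
The abstract does not say which combination rule is used, so the following is only a minimal sketch of sequential evidence gathering under one common assumption: each observation yields an opinion with belief mass on single classes plus one ignorance mass on the whole class set, and two opinions are fused with Dempster's rule. The function name `combine_opinions` and its arguments are hypothetical and are not taken from the paper.

```python
import numpy as np


def combine_opinions(belief_a, ignorance_a, belief_b, ignorance_b):
    """Fuse two opinions with Dempster's rule (illustrative assumption).

    Each opinion is a vector of per-class belief masses plus a scalar
    ignorance mass assigned to the whole class set; together they sum to 1.
    """
    belief_a = np.asarray(belief_a, dtype=float)
    belief_b = np.asarray(belief_b, dtype=float)

    # Conflict: total mass where the two opinions back different classes.
    conflict = belief_a.sum() * belief_b.sum() - np.dot(belief_a, belief_b)
    if np.isclose(conflict, 1.0):
        raise ValueError("opinions are in total conflict and cannot be fused")

    scale = 1.0 / (1.0 - conflict)
    # A class keeps mass when both agree on it, or when one opinion
    # supports it and the other is ignorant.
    belief = scale * (
        belief_a * belief_b + belief_a * ignorance_b + belief_b * ignorance_a
    )
    # Ignorance survives only when both opinions are ignorant.
    ignorance = scale * ignorance_a * ignorance_b
    return belief, ignorance
```

Fusing with a fully ignorant opinion (all belief zero, ignorance one) leaves the other opinion unchanged, which is the behavior one would want from a view in which the target object is absent.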

HairCLIPv2: Unifying Hair Editing via Proxy Feature Blending

Oct 16, 2023

Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Weiming Zhang, Gang Hua, Nenghai Yu

Abstract:Hair editing has made tremendous progress in recent years. Early hair editing methods use well-drawn sketches or masks to specify the editing conditions. Even though they can enable very fine-grained local control, such interaction modes are inefficient for the editing conditions that can be easily specified by language descriptions or reference images. Thanks to the recent breakthrough of cross-modal models (e.g., CLIP), HairCLIP is the first work that enables hair editing based on text descriptions or reference images. However, such text-driven and reference-driven interaction modes make HairCLIP unable to support fine-grained controls specified by sketch or mask. In this paper, we propose HairCLIPv2, aiming to support all the aforementioned interactions with one unified framework. Simultaneously, it improves upon HairCLIP with better preservation of irrelevant attributes (e.g., identity, background) and support for unseen text descriptions. The key idea is to convert all the hair editing tasks into hair transfer tasks, with editing conditions converted into different proxies accordingly. The editing effects are added upon the input image by blending the corresponding proxy features within the hairstyle or hair color feature spaces. Besides the unprecedented user interaction mode support, quantitative and qualitative experiments demonstrate the superiority of HairCLIPv2 in terms of editing effects, irrelevant attribute preservation and visual naturalness. Our code is available at https://github.com/wty-ustc/HairCLIPv2.

* ICCV 2023, code is available at https://github.com/wty-ustc/HairCLIPv2

Flexible Visual Recognition by Evidential Modeling of Confusion and Ignorance

Sep 14, 2023

Lei Fan, Bo Liu, Haoxiang Li, Ying Wu, Gang Hua

Abstract:In real-world scenarios, typical visual recognition systems can fail for two major reasons, i.e., the misclassification between known classes and the excusable misbehavior on unknown-class images. To tackle these deficiencies, flexible visual recognition should dynamically predict multiple classes when it is uncertain between choices and reject making predictions when the input is entirely out of the training distribution. Two challenges emerge along with this novel task. First, prediction uncertainty should be separately quantified as confusion depicting inter-class uncertainties and ignorance identifying out-of-distribution samples. Second, both confusion and ignorance should be comparable between samples to enable effective decision-making. In this paper, we propose to model these two sources of uncertainty explicitly with the theory of Subjective Logic. Regarding recognition as an evidence-collecting process, confusion is then defined as conflicting evidence, while ignorance is the absence of evidence. By predicting Dirichlet concentration parameters for singletons, comprehensive subjective opinions, including confusion and ignorance, can be obtained via further evidence combinations. Through a series of experiments on synthetic data analysis, visual recognition, and open-set detection, we demonstrate the effectiveness of our methods in quantifying two sources of uncertainties and dealing with flexible recognition.
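
As a rough illustration of "ignorance is the absence of evidence", the sketch below uses the standard Subjective Logic mapping from per-class evidence to a Dirichlet distribution and then to an opinion. It covers only singleton beliefs and ignorance; the confusion term that the paper derives through further evidence combination is not reproduced here. The function name `opinion_from_evidence` is hypothetical.

```python
import numpy as np


def opinion_from_evidence(evidence):
    """Map non-negative per-class evidence to (belief, ignorance).

    Standard Subjective Logic mapping with a uniform prior:
    alpha_k = e_k + 1, S = sum(alpha), b_k = e_k / S, u = K / S.
    """
    evidence = np.asarray(evidence, dtype=float)
    if np.any(evidence < 0):
        raise ValueError("evidence must be non-negative")

    num_classes = evidence.size
    alpha = evidence + 1.0            # Dirichlet concentration parameters
    strength = alpha.sum()            # Dirichlet strength S
    belief = evidence / strength      # mass committed to each single class
    ignorance = num_classes / strength  # mass left uncommitted
    return belief, ignorance


# No evidence at all: every unit of mass stays in ignorance.
belief, ignorance = opinion_from_evidence([0.0, 0.0, 0.0])
print(belief, ignorance)
```

With evidence `[9, 0, 0]` the strength is 12, so the belief in the first class is 0.75 and the ignorance drops to 0.25. In both cases beliefs and ignorance sum to one.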

* Accepted by ICCV 2023

SOAR: Scene-debiasing Open-set Action Recognition

Sep 03, 2023

Yuanhao Zhai, Ziyi Liu, Zhenyu Wu, Yi Wu, Chunluan Zhou, David Doermann, Junsong Yuan, Gang Hua

Abstract:Deep learning models have a risk of utilizing spurious clues to make predictions, such as recognizing actions based on the background scene. This issue can severely degrade the open-set action recognition performance when the testing samples have different scene distributions from the training samples. To mitigate this problem, we propose a novel method, called Scene-debiasing Open-set Action Recognition (SOAR), which features an adversarial scene reconstruction module and an adaptive adversarial scene classification module. The former prevents the decoder from reconstructing the video background given video features, and thus helps reduce the background information in feature learning. The latter aims to confuse scene type classification given video features, with a specific emphasis on the action foreground, and helps to learn scene-invariant information. In addition, we design an experiment to quantify the scene bias. The results indicate that the current open-set action recognizers are biased toward the scene, and our proposed SOAR method better mitigates such bias. Furthermore, our extensive experiments demonstrate that our method outperforms state-of-the-art methods, and the ablation studies confirm the effectiveness of our proposed modules.

* Accepted to ICCV 2023, code: https://github.com/yhZhai/SOAR

Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting

Aug 22, 2023

Qidong Huang, Xiaoyi Dong, Dongdong Chen, Yinpeng Chen, Lu Yuan, Gang Hua, Weiming Zhang, Nenghai Yu

Abstract:In this paper, we investigate the adversarial robustness of vision transformers that are equipped with BERT pretraining (e.g., BEiT, MAE). A surprising observation is that MAE has significantly worse adversarial robustness than other BERT pretraining methods. This observation drives us to rethink the basic differences between these BERT pretraining methods and how these differences affect the robustness against adversarial perturbations. Our empirical analysis reveals that the adversarial robustness of BERT pretraining is highly related to the reconstruction target, i.e., predicting the raw pixels of masked image patches degrades the model's adversarial robustness more than predicting the semantic context, since it guides the model to concentrate more on medium-/high-frequency components of images. Based on our analysis, we provide a simple yet effective way to boost the adversarial robustness of MAE. The basic idea is to use dataset-extracted domain knowledge to occupy the medium-/high-frequency components of images, thus narrowing the optimization space of adversarial perturbations. Specifically, we group the distribution of pretraining data and optimize a set of cluster-specific visual prompts in the frequency domain. These prompts are incorporated with input images through prototype-based prompt selection at test time. Extensive evaluation shows that our method clearly boosts MAE's adversarial robustness while maintaining its clean performance on ImageNet-1k classification. Our code is available at: https://github.com/shikiw/RobustMAE.
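
To make the idea of a prompt that occupies medium-/high-frequency components concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the additive form, the circular band defined by `min_radius`, and the function name `apply_frequency_prompt` are all assumptions chosen for illustration, and prompt optimization and prototype-based selection are left out.

```python
import numpy as np


def apply_frequency_prompt(image, prompt, min_radius):
    """Add a prompt to the spectrum of a 2D image outside a low-frequency disc.

    image:      (H, W) real array, a single channel.
    prompt:     (H, W) array added in the centered frequency domain.
    min_radius: frequencies closer than this to the center are left untouched.
    """
    image = np.asarray(image, dtype=float)
    height, width = image.shape

    # Centered spectrum: the zero frequency sits at (H // 2, W // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Boolean mask selecting the medium-/high-frequency band.
    rows, cols = np.ogrid[:height, :width]
    radius = np.hypot(rows - height // 2, cols - width // 2)
    band = radius >= min_radius

    spectrum = spectrum + band * prompt

    # Back to pixel space; keep the real part because an arbitrary
    # prompt need not preserve the symmetry of a real image's spectrum.
    return np.fft.ifft2(np.fft.ifftshift(spectrum)).real
```

A zero prompt returns the input image up to floating-point error, and any prompt leaves the image mean unchanged as long as `min_radius` is positive, since the zero-frequency term lies outside the band.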

* Accepted at ICCV 2023

HQ-50K: A Large-scale, High-quality Dataset for Image Restoration

Jun 08, 2023

Qinhong Yang, Dongdong Chen, Zhentao Tan, Qiankun Liu, Qi Chu, Jianmin Bao, Lu Yuan, Gang Hua, Nenghai Yu

Abstract:This paper introduces a new large-scale image restoration dataset, called HQ-50K, which contains 50,000 high-quality images with rich texture details and semantic diversity. We analyze existing image restoration datasets from five different perspectives, including data scale, resolution, compression rates, texture details, and semantic coverage. However, we find that all of these datasets are deficient in some aspects. In contrast, HQ-50K considers all of these five aspects during the data curation process and meets all requirements. We also present a new Degradation-Aware Mixture of Expert (DAMoE) model, which enables a single model to handle multiple corruption types and unknown levels. Our extensive experiments demonstrate that HQ-50K consistently improves the performance on various image restoration tasks, such as super-resolution, denoising, dejpeg, and deraining. Furthermore, our proposed DAMoE, trained on our HQ-50K, outperforms existing state-of-the-art unified models designed for multiple restoration tasks and levels. The dataset and code are available at https://github.com/littleYaang/HQ-50K.

* Dataset and code will be available at https://github.com/littleYaang/HQ-50K

Designing a Better Asymmetric VQGAN for StableDiffusion

Jun 07, 2023

Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, Gang Hua

Abstract:StableDiffusion is a revolutionary text-to-image generator that is causing a stir in the world of image generation and editing. Unlike traditional methods that learn a diffusion model in pixel space, StableDiffusion learns a diffusion model in the latent space via a VQGAN, ensuring both efficiency and quality. It not only supports image generation tasks, but also enables image editing for real images, such as image inpainting and local editing. However, we have observed that the vanilla VQGAN used in StableDiffusion leads to significant information loss, causing distortion artifacts even in non-edited image regions. To this end, we propose a new asymmetric VQGAN with two simple designs. Firstly, in addition to the input from the encoder, the decoder contains a conditional branch that incorporates information from task-specific priors, such as the unmasked image region in inpainting. Secondly, the decoder is much heavier than the encoder, allowing for more detailed recovery while only slightly increasing the total inference cost. The training cost of our asymmetric VQGAN is cheap, and we only need to retrain a new asymmetric decoder while keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and local editing methods. Extensive experiments demonstrate that it can significantly improve the inpainting and editing performance, while maintaining the original text-to-image capability. The code is available at https://github.com/buxiangzhiren/Asymmetric_VQGAN.

* code is available at https://github.com/buxiangzhiren/Asymmetric_VQGAN

Regularizing Second-Order Influences for Continual Learning

Apr 20, 2023

Zhicheng Sun, Yadong Mu, Gang Hua

Abstract:Continual learning aims to learn on non-stationary data streams without catastrophically forgetting previous knowledge. Prevalent replay-based methods address this challenge by rehearsing on a small buffer holding the seen data, for which a delicate sample selection strategy is required. However, existing selection schemes typically seek only to maximize the utility of the ongoing selection, overlooking the interference between successive rounds of selection. Motivated by this, we dissect the interaction of sequential selection steps within a framework built on influence functions. We manage to identify a new class of second-order influences that will gradually amplify incidental bias in the replay buffer and compromise the selection process. To regularize the second-order effects, a novel selection objective is proposed, which also has clear connections to two widely adopted criteria. Furthermore, we present an efficient implementation for optimizing the proposed criterion. Experiments on multiple continual learning benchmarks demonstrate the advantage of our approach over state-of-the-art methods. Code is available at https://github.com/feifeiobama/InfluenceCL.

* CVPR 2023

MotionTrack: Learning Robust Short-term and Long-term Motions for Multi-Object Tracking

Mar 18, 2023

Zheng Qin, Sanping Zhou, Le Wang, Jinghai Duan, Gang Hua, Wei Tang

Abstract:The main challenge of Multi-Object Tracking (MOT) lies in maintaining a continuous trajectory for each target. Existing methods often learn reliable motion patterns to match the same target between adjacent frames and discriminative appearance features to re-identify the lost targets after a long period. However, the reliability of motion prediction and the discriminability of appearances can be easily hurt by dense crowds and extreme occlusions in the tracking process. In this paper, we propose a simple yet effective multi-object tracker, i.e., MotionTrack, which learns robust short-term and long-term motions in a unified framework to associate trajectories from a short to long range. For dense crowds, we design a novel Interaction Module to learn interaction-aware motions from short-term trajectories, which can estimate the complex movement of each target. For extreme occlusions, we build a novel Refind Module to learn reliable long-term motions from the target's history trajectory, which can link the interrupted trajectory with its corresponding detection. Our Interaction Module and Refind Module are embedded in the well-known tracking-by-detection paradigm, which can work in tandem to maintain superior performance. Extensive experimental results on MOT17 and MOT20 datasets demonstrate the superiority of our approach in challenging scenarios, and it achieves state-of-the-art performance on various MOT metrics.

* Accepted by CVPR 2023
