Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Nong Sang

Cross-video Identity Correlating for Person Re-identification Pre-training

Sep 27, 2024

Jialong Zuo, Ying Nie, Hanyu Zhou, Huaxin Zhang, Haoyu Wang, Tianyu Guo, Nong Sang, Changxin Gao

Figure 1 for Cross-video Identity Correlating for Person Re-identification Pre-training

Figure 2 for Cross-video Identity Correlating for Person Re-identification Pre-training

Figure 3 for Cross-video Identity Correlating for Person Re-identification Pre-training

Figure 4 for Cross-video Identity Correlating for Person Re-identification Pre-training

Abstract:Recent researches have proven that pre-training on large-scale person images extracted from internet videos is an effective way in learning better representations for person re-identification. However, these researches are mostly confined to pre-training at the instance-level or single-video tracklet-level. They ignore the identity-invariance in images of the same person across different videos, which is a key focus in person re-identification. To address this issue, we propose a Cross-video Identity-cOrrelating pre-traiNing (CION) framework. Defining a noise concept that comprehensively considers both intra-identity consistency and inter-identity discrimination, CION seeks the identity correlation from cross-video images by modeling it as a progressive multi-level denoising problem. Furthermore, an identity-guided self-distillation loss is proposed to implement better large-scale pre-training by mining the identity-invariance within person images. We conduct extensive experiments to verify the superiority of our CION in terms of efficiency and performance. CION achieves significantly leading performance with even fewer training samples. For example, compared with the previous state-of-the-art~\cite{ISR}, CION with the same ResNet50-IBN achieves higher mAP of 93.3\% and 74.3\% on Market1501 and MSMT17, while only utilizing 8\% training samples. Finally, with CION demonstrating superior model-agnostic ability, we contribute a model zoo named ReIDZoo to meet diverse research and application needs in this field. It contains a series of CION pre-trained models with spanning structures and parameters, totaling 32 models with 10 different structures, including GhostNet, ConvNext, RepViT, FastViT and so on. The code and models will be made publicly available at https://github.com/Zplusdragon/CION_ReIDZoo.

* NeurIPS 2024 Accepted Paper

Via

Access Paper or Ask Questions

Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM

Jun 18, 2024

Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Chuchu Han, Xiaonan Huang, Changxin Gao, Yuehuan Wang, Nong Sang

Figure 1 for Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM

Figure 2 for Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM

Figure 3 for Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM

Figure 4 for Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM

Abstract:Towards open-ended Video Anomaly Detection (VAD), existing methods often exhibit biased detection when faced with challenging or unseen events and lack interpretability. To address these drawbacks, we propose Holmes-VAD, a novel framework that leverages precise temporal supervision and rich multimodal instructions to enable accurate anomaly localization and comprehensive explanations. Firstly, towards unbiased and explainable VAD system, we construct the first large-scale multimodal VAD instruction-tuning benchmark, i.e., VAD-Instruct50k. This dataset is created using a carefully designed semi-automatic labeling paradigm. Efficient single-frame annotations are applied to the collected untrimmed videos, which are then synthesized into high-quality analyses of both abnormal and normal video clips using a robust off-the-shelf video captioner and a large language model (LLM). Building upon the VAD-Instruct50k dataset, we develop a customized solution for interpretable video anomaly detection. We train a lightweight temporal sampler to select frames with high anomaly response and fine-tune a multimodal large language model (LLM) to generate explanatory content. Extensive experimental results validate the generality and interpretability of the proposed Holmes-VAD, establishing it as a novel interpretable technique for real-world video anomaly analysis. To support the community, our benchmark and model will be publicly available at https://github.com/pipixin321/HolmesVAD.

* 19 pages, 9 figures

Via

Access Paper or Ask Questions

UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

Jun 03, 2024

Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, Xiaoqiang Zhou, Yingya Zhang, Luxin Yan, Nong Sang

Figure 1 for UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

Figure 2 for UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

Figure 3 for UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

Figure 4 for UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

Abstract:Recent diffusion-based human image animation techniques have demonstrated impressive success in synthesizing videos that faithfully follow a given reference identity and a sequence of desired movement poses. Despite this, there are still two limitations: i) an extra reference model is required to align the identity image with the main video branch, which significantly increases the optimization burden and model parameters; ii) the generated video is usually short in time (e.g., 24 frames), hampering practical applications. To address these shortcomings, we present a UniAnimate framework to enable efficient and long-term human video generation. First, to reduce the optimization difficulty and ensure temporal coherence, we map the reference image along with the posture guidance and noise video into a common feature space by incorporating a unified video diffusion model. Second, we propose a unified noise input that supports random noised input as well as first frame conditioned input, which enhances the ability to generate long-term video. Finally, to further efficiently handle long sequences, we explore an alternative temporal modeling architecture based on state space model to replace the original computation-consuming temporal Transformer. Extensive experimental results indicate that UniAnimate achieves superior synthesis results over existing state-of-the-art counterparts in both quantitative and qualitative evaluations. Notably, UniAnimate can even generate highly consistent one-minute videos by iteratively employing the first frame conditioning strategy. Code and models will be publicly available. Project page: https://unianimate.github.io/.

* Project page: https://unianimate.github.io/

Via

Access Paper or Ask Questions

Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos

Apr 26, 2024

Zhengze Xu, Mengting Chen, Zhao Wang, Linyu Xing, Zhonghua Zhai, Nong Sang, Jinsong Lan, Shuai Xiao, Changxin Gao

Figure 1 for Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos

Figure 2 for Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos

Figure 3 for Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos

Figure 4 for Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos

Abstract:Video try-on is a challenging task and has not been well tackled in previous works. The main obstacle lies in preserving the details of the clothing and modeling the coherent motions simultaneously. Faced with those difficulties, we address video try-on by proposing a diffusion-based framework named "Tunnel Try-on." The core idea is excavating a "focus tunnel" in the input video that gives close-up shots around the clothing regions. We zoom in on the region in the tunnel to better preserve the fine details of the clothing. To generate coherent motions, we first leverage the Kalman filter to construct smooth crops in the focus tunnel and inject the position embedding of the tunnel into attention layers to improve the continuity of the generated videos. In addition, we develop an environment encoder to extract the context information outside the tunnels as supplementary cues. Equipped with these techniques, Tunnel Try-on keeps the fine details of the clothing and synthesizes stable and smooth videos. Demonstrating significant advancements, Tunnel Try-on could be regarded as the first attempt toward the commercial-level application of virtual try-on in videos.

* Project Page: https://mengtingchen.github.io/tunnel-try-on-page/

Via

Access Paper or Ask Questions

REPAIR: Rank Correlation and Noisy Pair Half-replacing with Memory for Noisy Correspondence

Mar 13, 2024

Ruochen Zheng, Jiahao Hong, Changxin Gao, Nong Sang

Figure 1 for REPAIR: Rank Correlation and Noisy Pair Half-replacing with Memory for Noisy Correspondence

Figure 2 for REPAIR: Rank Correlation and Noisy Pair Half-replacing with Memory for Noisy Correspondence

Figure 3 for REPAIR: Rank Correlation and Noisy Pair Half-replacing with Memory for Noisy Correspondence

Figure 4 for REPAIR: Rank Correlation and Noisy Pair Half-replacing with Memory for Noisy Correspondence

Abstract:The presence of noise in acquired data invariably leads to performance degradation in cross-modal matching. Unfortunately, obtaining precise annotations in the multimodal field is expensive, which has prompted some methods to tackle the mismatched data pair issue in cross-modal matching contexts, termed as noisy correspondence. However, most of these existing noisy correspondence methods exhibit the following limitations: a) the problem of self-reinforcing error accumulation, and b) improper handling of noisy data pair. To tackle the two problems, we propose a generalized framework termed as Rank corrElation and noisy Pair hAlf-replacing wIth memoRy (REPAIR), which benefits from maintaining a memory bank for features of matched pairs. Specifically, we calculate the distances between the features in the memory bank and those of the target pair for each respective modality, and use the rank correlation of these two sets of distances to estimate the soft correspondence label of the target pair. Estimating soft correspondence based on memory bank features rather than using a similarity network can avoid the accumulation of errors due to incorrect network identifications. For pairs that are completely mismatched, REPAIR searches the memory bank for the most matching feature to replace one feature of one modality, instead of using the original pair directly or merely discarding the mismatched pair. We conduct experiments on three cross-modal datasets, i.e., Flickr30K, MSCOCO, and CC152K, proving the effectiveness and robustness of our REPAIR on synthetic and real-world noise.

Via

Access Paper or Ask Questions

GlanceVAD: Exploring Glance Supervision for Label-efficient Video Anomaly Detection

Mar 12, 2024

Huaxin Zhang, Xiang Wang, Xiaohao Xu, Xiaonan Huang, Chuchu Han, Yuehuan Wang, Changxin Gao, Shanjun Zhang, Nong Sang

Figure 1 for GlanceVAD: Exploring Glance Supervision for Label-efficient Video Anomaly Detection

Figure 2 for GlanceVAD: Exploring Glance Supervision for Label-efficient Video Anomaly Detection

Figure 3 for GlanceVAD: Exploring Glance Supervision for Label-efficient Video Anomaly Detection

Figure 4 for GlanceVAD: Exploring Glance Supervision for Label-efficient Video Anomaly Detection

Abstract:In recent years, video anomaly detection has been extensively investigated in both unsupervised and weakly supervised settings to alleviate costly temporal labeling. Despite significant progress, these methods still suffer from unsatisfactory results such as numerous false alarms, primarily due to the absence of precise temporal anomaly annotation. In this paper, we present a novel labeling paradigm, termed "glance annotation", to achieve a better balance between anomaly detection accuracy and annotation cost. Specifically, glance annotation is a random frame within each abnormal event, which can be easily accessed and is cost-effective. To assess its effectiveness, we manually annotate the glance annotations for two standard video anomaly detection datasets: UCF-Crime and XD-Violence. Additionally, we propose a customized GlanceVAD method, that leverages gaussian kernels as the basic unit to compose the temporal anomaly distribution, enabling the learning of diverse and robust anomaly representations from the glance annotations. Through comprehensive analysis and experiments, we verify that the proposed labeling paradigm can achieve an excellent trade-off between annotation cost and model performance. Extensive experimental results also demonstrate the effectiveness of our GlanceVAD approach, which significantly outperforms existing advanced unsupervised and weakly supervised methods. Code and annotations will be publicly available at https://github.com/pipixin321/GlanceVAD.

* 21 pages

Via

Access Paper or Ask Questions

Spatial Cascaded Clustering and Weighted Memory for Unsupervised Person Re-identification

Mar 01, 2024

Jiahao Hong, Jialong Zuo, Chuchu Han, Ruochen Zheng, Ming Tian, Changxin Gao, Nong Sang

Figure 1 for Spatial Cascaded Clustering and Weighted Memory for Unsupervised Person Re-identification

Figure 2 for Spatial Cascaded Clustering and Weighted Memory for Unsupervised Person Re-identification

Figure 3 for Spatial Cascaded Clustering and Weighted Memory for Unsupervised Person Re-identification

Figure 4 for Spatial Cascaded Clustering and Weighted Memory for Unsupervised Person Re-identification

Abstract:Recent unsupervised person re-identification (re-ID) methods achieve high performance by leveraging fine-grained local context. These methods are referred to as part-based methods. However, most part-based methods obtain local contexts through horizontal division, which suffer from misalignment due to various human poses. Additionally, the misalignment of semantic information in part features restricts the use of metric learning, thus affecting the effectiveness of part-based methods. The two issues mentioned above result in the under-utilization of part features in part-based methods. We introduce the Spatial Cascaded Clustering and Weighted Memory (SCWM) method to address these challenges. SCWM aims to parse and align more accurate local contexts for different human body parts while allowing the memory module to balance hard example mining and noise suppression. Specifically, we first analyze the foreground omissions and spatial confusions issues in the previous method. Then, we propose foreground and space corrections to enhance the completeness and reasonableness of the human parsing results. Next, we introduce a weighted memory and utilize two weighting strategies. These strategies address hard sample mining for global features and enhance noise resistance for part features, which enables better utilization of both global and part features. Extensive experiments on Market-1501 and MSMT17 validate the proposed method's effectiveness over many state-of-the-art methods.

Via

Access Paper or Ask Questions

SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation

Jan 15, 2024

Zhengze Xu, Dongyue Wu, Changqian Yu, Xiangxiang Chu, Nong Sang, Changxin Gao

Abstract:Recent real-time semantic segmentation methods usually adopt an additional semantic branch to pursue rich long-range context. However, the additional branch incurs undesirable computational overhead and slows inference speed. To eliminate this dilemma, we propose SCTNet, a single branch CNN with transformer semantic information for real-time segmentation. SCTNet enjoys the rich semantic representations of an inference-free semantic branch while retaining the high efficiency of lightweight single branch CNN. SCTNet utilizes a transformer as the training-only semantic branch considering its superb ability to extract long-range context. With the help of the proposed transformer-like CNN block CFBlock and the semantic information alignment module, SCTNet could capture the rich semantic information from the transformer branch in training. During the inference, only the single branch CNN needs to be deployed. We conduct extensive experiments on Cityscapes, ADE20K, and COCO-Stuff-10K, and the results show that our method achieves the new state-of-the-art performance. The code and model is available at https://github.com/xzz777/SCTNet

* Accepted by AAAI 2024; typos corrected; code and models have been released at https://github.com/xzz777/SCTNet

Via

Access Paper or Ask Questions

A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

Dec 25, 2023

Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang

Figure 1 for A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

Figure 2 for A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

Figure 3 for A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

Figure 4 for A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

Abstract:Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation. One of the key reasons is the limited scale of publicly available data (e.g., 10M video-text pairs in WebVid10M vs. 5B image-text pairs in LAION), considering the high cost of video captioning. Instead, it could be far easier to collect unlabeled clips from video platforms like YouTube. Motivated by this, we come up with a novel text-to-video generation framework, termed TF-T2V, which can directly learn with text-free videos. The rationale behind is to separate the process of text decoding from that of temporal modeling. To this end, we employ a content branch and a motion branch, which are jointly optimized with weights shared. Following such a pipeline, we study the effect of doubling the scale of training set (i.e., video-only WebVid10M) with some randomly collected text-free videos and are encouraged to observe the performance improvement (FID from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of our approach. We also find that our model could enjoy sustainable performance gain (FID from 8.19 to 7.64 and FVD from 441 to 366) after reintroducing some text labels for training. Finally, we validate the effectiveness and generalizability of our ideology on both native text-to-video generation and compositional video synthesis paradigms. Code and models will be publicly available at https://tf-t2v.github.io/.

* Project page: https://tf-t2v.github.io/

Via

Access Paper or Ask Questions

VideoLCM: Video Latent Consistency Model

Dec 14, 2023

Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, Nong Sang

Figure 1 for VideoLCM: Video Latent Consistency Model

Figure 2 for VideoLCM: Video Latent Consistency Model

Figure 3 for VideoLCM: Video Latent Consistency Model

Figure 4 for VideoLCM: Video Latent Consistency Model

Abstract:Consistency models have demonstrated powerful capability in efficient image generation and allowed synthesis within a few sampling steps, alleviating the high computational cost in diffusion models. However, the consistency model in the more challenging and resource-consuming video generation is still less explored. In this report, we present the VideoLCM framework to fill this gap, which leverages the concept of consistency models from image generation to efficiently synthesize videos with minimal steps while maintaining high quality. VideoLCM builds upon existing latent video diffusion models and incorporates consistency distillation techniques for training the latent consistency model. Experimental results reveal the effectiveness of our VideoLCM in terms of computational efficiency, fidelity and temporal consistency. Notably, VideoLCM achieves high-fidelity and smooth video synthesis with only four sampling steps, showcasing the potential for real-time synthesis. We hope that VideoLCM can serve as a simple yet effective baseline for subsequent research. The source code and models will be publicly available.

Via

Access Paper or Ask Questions