Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xinchao Wang

Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions

Jul 23, 2024

Yihao Ai, Yifei Qi, Bo Wang, Yu Cheng, Xinchao Wang, Robby T. Tan

Figure 1 for Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions

Figure 2 for Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions

Figure 3 for Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions

Figure 4 for Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions

Abstract:Existing 2D human pose estimation research predominantly concentrates on well-lit scenarios, with limited exploration of poor lighting conditions, which are a prevalent aspect of daily life. Recent studies on low-light pose estimation require the use of paired well-lit and low-light images with ground truths for training, which are impractical due to the inherent challenges associated with annotation on low-light images. To this end, we introduce a novel approach that eliminates the need for low-light ground truths. Our primary novelty lies in leveraging two complementary-teacher networks to generate more reliable pseudo labels, enabling our model achieves competitive performance on extremely low-light images without the need for training with low-light ground truths. Our framework consists of two stages. In the first stage, our model is trained on well-lit data with low-light augmentations. In the second stage, we propose a dual-teacher framework to utilize the unlabeled low-light data, where a center-based main teacher produces the pseudo labels for relatively visible cases, while a keypoints-based complementary teacher focuses on producing the pseudo labels for the missed persons of the main teacher. With the pseudo labels from both teachers, we propose a person-specific low-light augmentation to challenge a student model in training to outperform the teachers. Experimental results on real low-light dataset (ExLPose-OCN) show, our method achieves 6.8% (2.4 AP) improvement over the state-of-the-art (SOTA) method, despite no low-light ground-truth data is used in our approach, in contrast to the SOTA method. Our code will be available at:https://github.com/ayh015-dev/DA-LLPose.

* 18 pages, 3 figure. Accepted by ECCV24

Via

Access Paper or Ask Questions

KAN or MLP: A Fairer Comparison

Jul 23, 2024

Runpeng Yu, Weihao Yu, Xinchao Wang

Figure 1 for KAN or MLP: A Fairer Comparison

Figure 2 for KAN or MLP: A Fairer Comparison

Figure 3 for KAN or MLP: A Fairer Comparison

Figure 4 for KAN or MLP: A Fairer Comparison

Abstract:This paper does not introduce a novel method. Instead, it offers a fairer and more comprehensive comparison of KAN and MLP models across various tasks, including machine learning, computer vision, audio processing, natural language processing, and symbolic formula representation. Specifically, we control the number of parameters and FLOPs to compare the performance of KAN and MLP. Our main observation is that, except for symbolic formula representation tasks, MLP generally outperforms KAN. We also conduct ablation studies on KAN and find that its advantage in symbolic formula representation mainly stems from its B-spline activation function. When B-spline is applied to MLP, performance in symbolic formula representation significantly improves, surpassing or matching that of KAN. However, in other tasks where MLP already excels over KAN, B-spline does not substantially enhance MLP's performance. Furthermore, we find that KAN's forgetting issue is more severe than that of MLP in a standard class-incremental continual learning setting, which differs from the findings reported in the KAN paper. We hope these results provide insights for future research on KAN and other MLP alternatives. Project link: https://github.com/yu-rp/KANbeFair

* Technical Report

Via

Access Paper or Ask Questions

Encapsulating Knowledge in One Prompt

Jul 16, 2024

Qi Li, Runpeng Yu, Xinchao Wang

Figure 1 for Encapsulating Knowledge in One Prompt

Figure 2 for Encapsulating Knowledge in One Prompt

Figure 3 for Encapsulating Knowledge in One Prompt

Figure 4 for Encapsulating Knowledge in One Prompt

Abstract:This paradigm encapsulates knowledge from various models into a solitary prompt without altering the original models or requiring access to the training data, which enables us to achieve efficient and convenient knowledge transfer in more realistic scenarios. From a practicality standpoint, this paradigm not only for the first time proves the effectiveness of Visual Prompt in data inaccessible contexts, but also solves the problems of low model reusability and high storage resource consumption faced by traditional Data-Free Knowledge Transfer, which means that we can realize the parallel knowledge transfer of multiple models without modifying any source model. Extensive experiments across various datasets and models demonstrate the efficacy of the proposed KiOP knowledge transfer paradigm. Without access to real training data and with rigorous storage capacity constraints, it is also capable of yielding considerable outcomes when dealing with cross-model backbone setups and handling parallel knowledge transfer processing requests with multiple (more than 2) models.

Via

Access Paper or Ask Questions

LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis

Jul 15, 2024

Zhenxiong Tan, Xinyin Ma, Gongfan Fang, Xinchao Wang

Figure 1 for LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis

Figure 2 for LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis

Figure 3 for LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis

Figure 4 for LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis

Abstract:Latent diffusion models have shown promising results in audio generation, making notable advancements over traditional methods. However, their performance, while impressive with short audio clips, faces challenges when extended to longer audio sequences. These challenges are due to model's self-attention mechanism and training predominantly on 10-second clips, which complicates the extension to longer audio without adaptation. In response to these issues, we introduce a novel approach, LiteFocus that enhances the inference of existing audio latent diffusion models in long audio synthesis. Observed the attention pattern in self-attention, we employ a dual sparse form for attention calculation, designated as same-frequency focus and cross-frequency compensation, which curtails the attention computation under same-frequency constraints, while enhancing audio quality through cross-frequency refillment. LiteFocus demonstrates substantial reduction on inference time with diffusion-based TTA model by 1.99x in synthesizing 80-second audio clips while also obtaining improved audio quality.

* Interspeech 2024; Code: https://github.com/Yuanshi9815/LiteFocus

Via

Access Paper or Ask Questions

Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach

Jul 09, 2024

Taolin Zhang, Jiawang Bai, Zhihe Lu, Dongze Lian, Genping Wang, Xinchao Wang, Shu-Tao Xia

Figure 1 for Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach

Figure 2 for Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach

Figure 3 for Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach

Figure 4 for Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach

Abstract:Recent works on parameter-efficient transfer learning (PETL) show the potential to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters. However, since they usually insert new structures into the pre-trained model, entire intermediate features of that model are changed and thus need to be stored to be involved in back-propagation, resulting in memory-heavy training. We solve this problem from a novel disentangled perspective, i.e., dividing PETL into two aspects: task-specific learning and pre-trained knowledge utilization. Specifically, we synthesize the task-specific query with a learnable and lightweight module, which is independent of the pre-trained model. The synthesized query equipped with task-specific knowledge serves to extract the useful features for downstream tasks from the intermediate representations of the pre-trained model in a query-only manner. Built upon these features, a customized classification head is proposed to make the prediction for the input sample. lightweight architecture and avoids the use of heavy intermediate features for running gradient descent, it demonstrates limited memory usage in training. Extensive experiments manifest that our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.

* ECCV2024

Via

Access Paper or Ask Questions

Isomorphic Pruning for Vision Models

Jul 05, 2024

Gongfan Fang, Xinyin Ma, Michael Bi Mi, Xinchao Wang

Figure 1 for Isomorphic Pruning for Vision Models

Figure 2 for Isomorphic Pruning for Vision Models

Figure 3 for Isomorphic Pruning for Vision Models

Figure 4 for Isomorphic Pruning for Vision Models

Abstract:Structured pruning reduces the computational overhead of deep neural networks by removing redundant sub-structures. However, assessing the relative importance of different sub-structures remains a significant challenge, particularly in advanced vision models featuring novel mechanisms and architectures like self-attention, depth-wise convolutions, or residual connections. These heterogeneous substructures usually exhibit diverged parameter scales, weight distributions, and computational topology, introducing considerable difficulty to importance comparison. To overcome this, we present Isomorphic Pruning, a simple approach that demonstrates effectiveness across a range of network architectures such as Vision Transformers and CNNs, and delivers competitive performance across different model sizes. Isomorphic Pruning originates from an observation that, when evaluated under a pre-defined importance criterion, heterogeneous sub-structures demonstrate significant divergence in their importance distribution, as opposed to isomorphic structures that present similar importance patterns. This inspires us to perform isolated ranking and comparison on different types of sub-structures for more reliable pruning. Our empirical results on ImageNet-1K demonstrate that Isomorphic Pruning surpasses several pruning baselines dedicatedly designed for Transformers or CNNs. For instance, we improve the accuracy of DeiT-Tiny from 74.52% to 77.50% by pruning an off-the-shelf DeiT-Base model. And for ConvNext-Tiny, we enhanced performance from 82.06% to 82.18%, while reducing the number of parameters and memory usage. Code is available at \url{https://github.com/VainF/Isomorphic-Pruning}.

Via

Access Paper or Ask Questions

Video-Infinity: Distributed Long Video Generation

Jun 24, 2024

Zhenxiong Tan, Xingyi Yang, Songhua Liu, Xinchao Wang

Figure 1 for Video-Infinity: Distributed Long Video Generation

Figure 2 for Video-Infinity: Distributed Long Video Generation

Figure 3 for Video-Infinity: Distributed Long Video Generation

Figure 4 for Video-Infinity: Distributed Long Video Generation

Abstract:Diffusion models have recently achieved remarkable results for video generation. Despite the encouraging performances, the generated videos are typically constrained to a small number of frames, resulting in clips lasting merely a few seconds. The primary challenges in producing longer videos include the substantial memory requirements and the extended processing time required on a single GPU. A straightforward solution would be to split the workload across multiple GPUs, which, however, leads to two issues: (1) ensuring all GPUs communicate effectively to share timing and context information, and (2) modifying existing video diffusion models, which are usually trained on short sequences, to create longer videos without additional training. To tackle these, in this paper we introduce Video-Infinity, a distributed inference pipeline that enables parallel processing across multiple GPUs for long-form video generation. Specifically, we propose two coherent mechanisms: Clip parallelism and Dual-scope attention. Clip parallelism optimizes the gathering and sharing of context information across GPUs which minimizes communication overhead, while Dual-scope attention modulates the temporal self-attention to balance local and global contexts efficiently across the devices. Together, the two mechanisms join forces to distribute the workload and enable the fast generation of long videos. Under an 8 x Nvidia 6000 Ada GPU (48G) setup, our method generates videos up to 2,300 frames in approximately 5 minutes, enabling long video generation at a speed 100 times faster than the prior methods.

Via

Access Paper or Ask Questions

Neural Lineage

Jun 17, 2024

Runpeng Yu, Xinchao Wang

Abstract:Given a well-behaved neural network, is possible to identify its parent, based on which it was tuned? In this paper, we introduce a novel task known as neural lineage detection, aiming at discovering lineage relationships between parent and child models. Specifically, from a set of parent models, neural lineage detection predicts which parent model a child model has been fine-tuned from. We propose two approaches to address this task. (1) For practical convenience, we introduce a learning-free approach, which integrates an approximation of the finetuning process into the neural network representation similarity metrics, leading to a similarity-based lineage detection scheme. (2) For the pursuit of accuracy, we introduce a learning-based lineage detector comprising encoders and a transformer detector. Through experimentation, we have validated that our proposed learning-free and learning-based methods outperform the baseline in various learning settings and are adaptable to a variety of visual models. Moreover, they also exhibit the ability to trace cross-generational lineage, identifying not only parent models but also their ancestors.

Via

Access Paper or Ask Questions

Few-Shot Anomaly Detection via Category-Agnostic Registration Learning

Jun 13, 2024

Chaoqin Huang, Haoyan Guan, Aofan Jiang, Yanfeng Wang, Michael Spratling, Xinchao Wang, Ya Zhang

Abstract:Most existing anomaly detection methods require a dedicated model for each category. Such a paradigm, despite its promising results, is computationally expensive and inefficient, thereby failing to meet the requirements for real-world applications. Inspired by how humans detect anomalies, by comparing a query image to known normal ones, this paper proposes a novel few-shot anomaly detection (FSAD) framework. Using a training set of normal images from various categories, registration, aiming to align normal images of the same categories, is leveraged as the proxy task for self-supervised category-agnostic representation learning. At test time, an image and its corresponding support set, consisting of a few normal images from the same category, are supplied, and anomalies are identified by comparing the registered features of the test image to its corresponding support image features. Such a setup enables the model to generalize to novel test categories. It is, to our best knowledge, the first FSAD method that requires no model fine-tuning for novel categories: enabling a single model to be applied to all categories. Extensive experiments demonstrate the effectiveness of the proposed method. Particularly, it improves the current state-of-the-art for FSAD by 11.3% and 8.3% on the MVTec and MPDD benchmarks, respectively. The source code is available at https://github.com/Haoyan-Guan/CAReg.

Via

Access Paper or Ask Questions

AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Jun 11, 2024

Zigeng Chen, Xinyin Ma, Gongfan Fang, Zhenxiong Tan, Xinchao Wang

Figure 1 for AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Figure 2 for AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Figure 3 for AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Figure 4 for AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Abstract:Diffusion models have garnered significant interest from the community for their great generative ability across various applications. However, their typical multi-step sequential-denoising nature gives rise to high cumulative latency, thereby precluding the possibilities of parallel computation. To address this, we introduce AsyncDiff, a universal and plug-and-play acceleration scheme that enables model parallelism across multiple devices. Our approach divides the cumbersome noise prediction model into multiple components, assigning each to a different device. To break the dependency chain between these components, it transforms the conventional sequential denoising into an asynchronous process by exploiting the high similarity between hidden states in consecutive diffusion steps. Consequently, each component is facilitated to compute in parallel on separate devices. The proposed strategy significantly reduces inference latency while minimally impacting the generative quality. Specifically, for the Stable Diffusion v2.1, AsyncDiff achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with only a slight reduction of 0.38 in CLIP Score, on four NVIDIA A5000 GPUs. Our experiments also demonstrate that AsyncDiff can be readily applied to video diffusion models with encouraging performances. The code is available at https://github.com/czg1225/AsyncDiff.

* Work in progress. Project Page: https://czg1225.github.io/asyncdiff_page/

Via

Access Paper or Ask Questions