Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xinchao Wang

Few-Shot Anomaly Detection via Category-Agnostic Registration Learning

Jun 13, 2024

Chaoqin Huang, Haoyan Guan, Aofan Jiang, Yanfeng Wang, Michael Spratling, Xinchao Wang, Ya Zhang

Abstract:Most existing anomaly detection methods require a dedicated model for each category. Such a paradigm, despite its promising results, is computationally expensive and inefficient, thereby failing to meet the requirements for real-world applications. Inspired by how humans detect anomalies, by comparing a query image to known normal ones, this paper proposes a novel few-shot anomaly detection (FSAD) framework. Using a training set of normal images from various categories, registration, aiming to align normal images of the same categories, is leveraged as the proxy task for self-supervised category-agnostic representation learning. At test time, an image and its corresponding support set, consisting of a few normal images from the same category, are supplied, and anomalies are identified by comparing the registered features of the test image to its corresponding support image features. Such a setup enables the model to generalize to novel test categories. It is, to our best knowledge, the first FSAD method that requires no model fine-tuning for novel categories: enabling a single model to be applied to all categories. Extensive experiments demonstrate the effectiveness of the proposed method. Particularly, it improves the current state-of-the-art for FSAD by 11.3% and 8.3% on the MVTec and MPDD benchmarks, respectively. The source code is available at https://github.com/Haoyan-Guan/CAReg.

Via

Access Paper or Ask Questions

AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Jun 11, 2024

Zigeng Chen, Xinyin Ma, Gongfan Fang, Zhenxiong Tan, Xinchao Wang

Figure 1 for AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Figure 2 for AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Figure 3 for AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Figure 4 for AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Abstract:Diffusion models have garnered significant interest from the community for their great generative ability across various applications. However, their typical multi-step sequential-denoising nature gives rise to high cumulative latency, thereby precluding the possibilities of parallel computation. To address this, we introduce AsyncDiff, a universal and plug-and-play acceleration scheme that enables model parallelism across multiple devices. Our approach divides the cumbersome noise prediction model into multiple components, assigning each to a different device. To break the dependency chain between these components, it transforms the conventional sequential denoising into an asynchronous process by exploiting the high similarity between hidden states in consecutive diffusion steps. Consequently, each component is facilitated to compute in parallel on separate devices. The proposed strategy significantly reduces inference latency while minimally impacting the generative quality. Specifically, for the Stable Diffusion v2.1, AsyncDiff achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with only a slight reduction of 0.38 in CLIP Score, on four NVIDIA A5000 GPUs. Our experiments also demonstrate that AsyncDiff can be readily applied to video diffusion models with encouraging performances. The code is available at https://github.com/czg1225/AsyncDiff.

* Work in progress. Project Page: https://czg1225.github.io/asyncdiff_page/

Via

Access Paper or Ask Questions

MVGamba: Unify 3D Content Generation as State Space Sequence Modeling

Jun 10, 2024

Xuanyu Yi, Zike Wu, Qiuhong Shen, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Shuicheng Yan, Xinchao Wang, Hanwang Zhang

Figure 1 for MVGamba: Unify 3D Content Generation as State Space Sequence Modeling

Figure 2 for MVGamba: Unify 3D Content Generation as State Space Sequence Modeling

Figure 3 for MVGamba: Unify 3D Content Generation as State Space Sequence Modeling

Figure 4 for MVGamba: Unify 3D Content Generation as State Space Sequence Modeling

Abstract:Recent 3D large reconstruction models (LRMs) can generate high-quality 3D content in sub-seconds by integrating multi-view diffusion models with scalable multi-view reconstructors. Current works further leverage 3D Gaussian Splatting as 3D representation for improved visual quality and rendering efficiency. However, we observe that existing Gaussian reconstruction models often suffer from multi-view inconsistency and blurred textures. We attribute this to the compromise of multi-view information propagation in favor of adopting powerful yet computationally intensive architectures (\eg, Transformers). To address this issue, we introduce MVGamba, a general and lightweight Gaussian reconstruction model featuring a multi-view Gaussian reconstructor based on the RNN-like State Space Model (SSM). Our Gaussian reconstructor propagates causal context containing multi-view information for cross-view self-refinement while generating a long sequence of Gaussians for fine-detail modeling with linear complexity. With off-the-shelf multi-view diffusion models integrated, MVGamba unifies 3D generation tasks from a single image, sparse images, or text prompts. Extensive experiments demonstrate that MVGamba outperforms state-of-the-art baselines in all 3D content generation scenarios with approximately only $0.1\times$ of the model size.

Via

Access Paper or Ask Questions

Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

Jun 03, 2024

Xinyin Ma, Gongfan Fang, Michael Bi Mi, Xinchao Wang

Figure 1 for Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

Figure 2 for Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

Figure 3 for Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

Figure 4 for Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

Abstract:Diffusion Transformers have recently demonstrated unprecedented generative capabilities for various tasks. The encouraging results, however, come with the cost of slow inference, since each denoising step requires inference on a transformer model with a large scale of parameters. In this study, we make an interesting and somehow surprising observation: the computation of a large proportion of layers in the diffusion transformer, through introducing a caching mechanism, can be readily removed even without updating the model parameters. In the case of U-ViT-H/2, for example, we may remove up to 93.68% of the computation in the cache steps (46.84% for all steps), with less than 0.01 drop in FID. To achieve this, we introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers. Specifically, by leveraging the identical structure of layers in transformers and the sequential nature of diffusion, we explore redundant computations between timesteps by treating each layer as the fundamental unit for caching. To address the challenge of the exponential search space in deep models for identifying layers to cache and remove, we propose a novel differentiable optimization objective. An input-invariant yet timestep-variant router is then optimized, which can finally produce a static computation graph. Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed.

* Code is available at https://github.com/horseee/learning-to-cache

Via

Access Paper or Ask Questions

StyDeSty: Min-Max Stylization and Destylization for Single Domain Generalization

Jun 01, 2024

Songhua Liu, Xin Jin, Xingyi Yang, Jingwen Ye, Xinchao Wang

Figure 1 for StyDeSty: Min-Max Stylization and Destylization for Single Domain Generalization

Figure 2 for StyDeSty: Min-Max Stylization and Destylization for Single Domain Generalization

Figure 3 for StyDeSty: Min-Max Stylization and Destylization for Single Domain Generalization

Figure 4 for StyDeSty: Min-Max Stylization and Destylization for Single Domain Generalization

Abstract:Single domain generalization (single DG) aims at learning a robust model generalizable to unseen domains from only one training domain, making it a highly ambitious and challenging task. State-of-the-art approaches have mostly relied on data augmentations, such as adversarial perturbation and style enhancement, to synthesize new data and thus increase robustness. Nevertheless, they have largely overlooked the underlying coherence between the augmented domains, which in turn leads to inferior results in real-world scenarios. In this paper, we propose a simple yet effective scheme, termed as \emph{StyDeSty}, to explicitly account for the alignment of the source and pseudo domains in the process of data augmentation, enabling them to interact with each other in a self-consistent manner and further giving rise to a latent domain with strong generalization power. The heart of StyDeSty lies in the interaction between a \emph{stylization} module for generating novel stylized samples using the source domain, and a \emph{destylization} module for transferring stylized and source samples to a latent domain to learn content-invariant features. The stylization and destylization modules work adversarially and reinforce each other. During inference, the destylization module transforms the input sample with an arbitrary style shift to the latent domain, in which the downstream tasks are carried out. Specifically, the location of the destylization layer within the backbone network is determined by a dedicated neural architecture search (NAS) strategy. We evaluate StyDeSty on multiple benchmarks and demonstrate that it yields encouraging results, outperforming the state of the art by up to {13.44%} on classification accuracy. Codes are available here: https://github.com/Huage001/StyDeSty.

* Accepted at ICML 2024; Work in 2022 spring

Via

Access Paper or Ask Questions

GFlow: Recovering 4D World from Monocular Video

May 28, 2024

Shizun Wang, Xingyi Yang, Qiuhong Shen, Zhenxiang Jiang, Xinchao Wang

Figure 1 for GFlow: Recovering 4D World from Monocular Video

Figure 2 for GFlow: Recovering 4D World from Monocular Video

Figure 3 for GFlow: Recovering 4D World from Monocular Video

Figure 4 for GFlow: Recovering 4D World from Monocular Video

Abstract:Reconstructing 4D scenes from video inputs is a crucial yet challenging task. Conventional methods usually rely on the assumptions of multi-view video inputs, known camera parameters, or static scenes, all of which are typically absent under in-the-wild scenarios. In this paper, we relax all these constraints and tackle a highly ambitious but practical task, which we termed as AnyV4D: we assume only one monocular video is available without any camera parameters as input, and we aim to recover the dynamic 4D world alongside the camera poses. To this end, we introduce GFlow, a new framework that utilizes only 2D priors (depth and optical flow) to lift a video (3D) to a 4D explicit representation, entailing a flow of Gaussian splatting through space and time. GFlow first clusters the scene into still and moving parts, then applies a sequential optimization process that optimizes camera poses and the dynamics of 3D Gaussian points based on 2D priors and scene clustering, ensuring fidelity among neighboring points and smooth movement across frames. Since dynamic scenes always introduce new content, we also propose a new pixel-wise densification strategy for Gaussian points to integrate new visual content. Moreover, GFlow transcends the boundaries of mere 4D reconstruction; it also enables tracking of any points across frames without the need for prior training and segments moving objects from the scene in an unsupervised way. Additionally, the camera poses of each frame can be derived from GFlow, allowing for rendering novel views of a video scene through changing camera pose. By employing the explicit representation, we may readily conduct scene-level or object-level editing as desired, underscoring its versatility and power. Visit our project website at: https://littlepure2333.github.io/GFlow

* Project page: https://littlepure2333.github.io/GFlow

Via

Access Paper or Ask Questions

MambaOut: Do We Really Need Mamba for Vision?

May 14, 2024

Weihao Yu, Xinchao Wang

Figure 1 for MambaOut: Do We Really Need Mamba for Vision?

Figure 2 for MambaOut: Do We Really Need Mamba for Vision?

Figure 3 for MambaOut: Do We Really Need Mamba for Vision?

Figure 4 for MambaOut: Do We Really Need Mamba for Vision?

Abstract:Mamba, an architecture with RNN-like token mixer of state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and subsequently applied to vision tasks. Nevertheless, the performance of Mamba for vision is often underwhelming when compared with convolutional and attention-based models. In this paper, we delve into the essence of Mamba, and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. For vision tasks, as image classification does not align with either characteristic, we hypothesize that Mamba is not necessary for this task; Detection and segmentation tasks are also not autoregressive, yet they adhere to the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for these tasks. To empirically verify our hypotheses, we construct a series of models named MambaOut through stacking Mamba blocks while removing their core token mixer, SSM. Experimental results strongly support our hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models on ImageNet image classification, indicating that Mamba is indeed unnecessary for this task. As for detection and segmentation, MambaOut cannot match the performance of state-of-the-art visual Mamba models, demonstrating the potential of Mamba for long-sequence visual tasks. The code is available at https://github.com/yuweihao/MambaOut

* Code: https://github.com/yuweihao/MambaOut

Via

Access Paper or Ask Questions

DragGaussian: Enabling Drag-style Manipulation on 3D Gaussian Representation

May 09, 2024

Sitian Shen, Jing Xu, Yuheng Yuan, Xingyi Yang, Qiuhong Shen, Xinchao Wang

Figure 1 for DragGaussian: Enabling Drag-style Manipulation on 3D Gaussian Representation

Figure 2 for DragGaussian: Enabling Drag-style Manipulation on 3D Gaussian Representation

Figure 3 for DragGaussian: Enabling Drag-style Manipulation on 3D Gaussian Representation

Figure 4 for DragGaussian: Enabling Drag-style Manipulation on 3D Gaussian Representation

Abstract:User-friendly 3D object editing is a challenging task that has attracted significant attention recently. The limitations of direct 3D object editing without 2D prior knowledge have prompted increased attention towards utilizing 2D generative models for 3D editing. While existing methods like Instruct NeRF-to-NeRF offer a solution, they often lack user-friendliness, particularly due to semantic guided editing. In the realm of 3D representation, 3D Gaussian Splatting emerges as a promising approach for its efficiency and natural explicit property, facilitating precise editing tasks. Building upon these insights, we propose DragGaussian, a 3D object drag-editing framework based on 3D Gaussian Splatting, leveraging diffusion models for interactive image editing with open-vocabulary input. This framework enables users to perform drag-based editing on pre-trained 3D Gaussian object models, producing modified 2D images through multi-view consistent editing. Our contributions include the introduction of a new task, the development of DragGaussian for interactive point-based 3D editing, and comprehensive validation of its effectiveness through qualitative and quantitative experiments.

Via

Access Paper or Ask Questions

Distilled Datamodel with Reverse Gradient Matching

Apr 22, 2024

Jingwen Ye, Ruonan Yu, Songhua Liu, Xinchao Wang

Figure 1 for Distilled Datamodel with Reverse Gradient Matching

Figure 2 for Distilled Datamodel with Reverse Gradient Matching

Figure 3 for Distilled Datamodel with Reverse Gradient Matching

Figure 4 for Distilled Datamodel with Reverse Gradient Matching

Abstract:The proliferation of large-scale AI models trained on extensive datasets has revolutionized machine learning. With these models taking on increasingly central roles in various applications, the need to understand their behavior and enhance interpretability has become paramount. To investigate the impact of changes in training data on a pre-trained model, a common approach is leave-one-out retraining. This entails systematically altering the training dataset by removing specific samples to observe resulting changes within the model. However, retraining the model for each altered dataset presents a significant computational challenge, given the need to perform this operation for every dataset variation. In this paper, we introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages. During the offline training phase, we approximate the influence of training data on the target model through a distilled synset, formulated as a reversed gradient matching problem. For online evaluation, we expedite the leave-one-out process using the synset, which is then utilized to compute the attribution matrix based on the evaluation objective. Experimental evaluations, including training data attribution and assessments of data quality, demonstrate that our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.

* Accepted by CVPR2024

Via

Access Paper or Ask Questions

Ungeneralizable Examples

Apr 22, 2024

Jingwen Ye, Xinchao Wang

Abstract:The training of contemporary deep learning models heavily relies on publicly available data, posing a risk of unauthorized access to online data and raising concerns about data privacy. Current approaches to creating unlearnable data involve incorporating small, specially designed noises, but these methods strictly limit data usability, overlooking its potential usage in authorized scenarios. In this paper, we extend the concept of unlearnable data to conditional data learnability and introduce \textbf{U}n\textbf{G}eneralizable \textbf{E}xamples (UGEs). UGEs exhibit learnability for authorized users while maintaining unlearnability for potential hackers. The protector defines the authorized network and optimizes UGEs to match the gradients of the original data and its ungeneralizable version, ensuring learnability. To prevent unauthorized learning, UGEs are trained by maximizing a designated distance loss in a common feature space. Additionally, to further safeguard the authorized side from potential attacks, we introduce additional undistillation optimization. Experimental results on multiple datasets and various networks demonstrate that the proposed UGEs framework preserves data usability while reducing training performance on hacker networks, even under different types of attacks.

* Accepted by CVPR2024

Via

Access Paper or Ask Questions