Shu-Tao Xia

MambaIRv2: Attentive State Space Restoration

Nov 22, 2024

Hang Guo, Yong Guo, Yaohua Zha, Yulun Zhang, Wenbo Li, Tao Dai, Shu-Tao Xia, Yawei Li

Abstract: Mamba-based image restoration backbones have recently demonstrated significant potential in balancing global reception and computational efficiency. However, the inherent causal modeling limitation of Mamba, where each token depends solely on its predecessors in the scanned sequence, restricts the full utilization of pixels across the image and thus presents new challenges in image restoration. In this work, we propose MambaIRv2, which equips Mamba with a non-causal modeling ability similar to that of ViTs to obtain an attentive state-space restoration model. Specifically, the proposed attentive state-space equation allows the model to attend beyond the scanned sequence and facilitates image unfolding with just a single scan. Moreover, we introduce a semantic-guided neighboring mechanism to encourage interaction between distant but similar pixels. Extensive experiments show that MambaIRv2 outperforms SRFormer by 0.35 dB PSNR on lightweight SR while using 9.3% fewer parameters, and surpasses HAT on classic SR by up to 0.29 dB. Code is available at https://github.com/csguoh/MambaIR.
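
The causal limitation mentioned in the abstract can be illustrated with a toy scalar state-space recurrence. This is a minimal sketch and not the MambaIRv2 implementation: the function name `causal_scan` and the coefficients `a`, `b`, `c` are assumptions chosen only for illustration. It shows that, in a plain scan, changing a later token leaves every earlier output untouched, so earlier positions cannot use information from later pixels.

```python
def causal_scan(xs, a=0.9, b=1.0, c=1.0):
    """Toy scalar state-space scan (hypothetical coefficients).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x   # the state only accumulates tokens seen so far
        ys.append(c * h)
    return ys


original = causal_scan([1.0, 2.0, 3.0, 4.0])
perturbed = causal_scan([1.0, 2.0, 3.0, 100.0])  # only the last token differs

# Outputs before the changed token are identical: earlier tokens
# never "see" what comes after them in the scanned sequence.
print(original[:3] == perturbed[:3])
print(original[3] != perturbed[3])
```

A non-causal model, as targeted by the abstract, would let the earlier outputs change as well when a later token changes.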

* Technical report

IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models

Oct 29, 2024

Hang Guo, Yawei Li, Tao Dai, Shu-Tao Xia, Luca Benini

Abstract: Fine-tuning large-scale text-to-image diffusion models for various downstream tasks has yielded impressive results. However, the heavy computational burden of tuning large models prevents personal customization. Recent advances have attempted to employ parameter-efficient fine-tuning (PEFT) techniques to adapt the floating-point (FP) or quantized pre-trained weights. Nonetheless, the adaptation parameters in existing works are still restricted to FP arithmetic, hindering hardware-friendly acceleration. In this work, we propose IntLoRA to further push the efficiency limits by using integer-type (INT) low-rank parameters to adapt quantized diffusion models. By working in integer arithmetic, IntLoRA offers three key advantages: (i) for fine-tuning, the pre-trained weights are quantized, reducing memory usage; (ii) for storage, both pre-trained and low-rank weights are in INT, which consumes less disk space; (iii) for inference, IntLoRA weights can be naturally merged into quantized pre-trained weights through efficient integer multiplication or bit-shifting, eliminating additional post-training quantization. Extensive experiments demonstrate that IntLoRA can achieve performance on par with or even superior to vanilla LoRA, accompanied by significant efficiency improvements. Code is available at https://github.com/csguoh/IntLoRA.
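
The remark about merging through integer multiplication or bit-shifting rests on a simple arithmetic fact: multiplying an integer by a power of two equals a left shift. The sketch below shows only that fact on made-up values. It is not the IntLoRA algorithm, and the function names, weights, and shift amounts are all hypothetical.

```python
def merge_by_multiply(q_weights, factors):
    """Scale each quantized integer weight by an integer factor."""
    return [w * f for w, f in zip(q_weights, factors)]


def merge_by_shift(q_weights, log2_factors):
    """Same scaling when every factor is 2**s: a left shift by s bits."""
    return [w << s for w, s in zip(q_weights, log2_factors)]


q_weights = [3, -5, 7, 0]          # hypothetical quantized weights
shifts = [0, 1, 2, 3]              # hypothetical log2 of the adaptation factors
factors = [1 << s for s in shifts]  # 1, 2, 4, 8

# Both paths stay entirely in integer arithmetic and agree exactly.
print(merge_by_multiply(q_weights, factors))
print(merge_by_shift(q_weights, shifts))
```

Because the result is already an integer, no extra quantization step is needed after merging, which is the property the abstract highlights.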

* Technical Report

BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping

Oct 24, 2024

Taolin Zhang, Jinpeng Wang, Hang Guo, Tao Dai, Bin Chen, Shu-Tao Xia

Abstract: Adapting pretrained vision-language models such as CLIP to various downstream tasks has attracted great interest in recent research. Previous works have proposed a variety of test-time adaptation (TTA) methods to achieve strong generalization without any knowledge of the target domain. However, existing training-required TTA approaches like TPT necessitate entropy minimization that involves large computational overhead, while training-free methods like TDA overlook the potential for information mining from the test samples themselves. In this paper, we break down the design of existing popular training-required and training-free TTA methods and bridge the gap between them within our framework. Specifically, we maintain a lightweight key-value memory for feature retrieval from instance-agnostic historical samples and instance-aware boosting samples. The historical samples are filtered from the testing data stream and serve to extract useful information from the target distribution, while the boosting samples are drawn from regional bootstrapping and capture the knowledge of the test sample itself. We theoretically justify the rationality behind our method and empirically verify its effectiveness on both out-of-distribution and cross-domain datasets, showcasing its applicability in real-world situations.
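
The idea of a key-value memory used for feature retrieval can be sketched as follows. This is an illustrative toy under stated assumptions, not the BoostAdapter code: the class name `KeyValueMemory`, the cosine-similarity scoring, and the two-dimensional features are all hypothetical. Keys are stored features, values are their (pseudo-)labels, and a query accumulates per-class scores from similar keys.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class KeyValueMemory:
    """Toy cache: keys are features, values are class labels."""

    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.keys = []
        self.values = []

    def add(self, feature, label):
        self.keys.append(feature)
        self.values.append(label)

    def retrieve(self, query):
        """Sum non-negative similarities per class (assumed scoring rule)."""
        scores = [0.0] * self.num_classes
        for key, label in zip(self.keys, self.values):
            scores[label] += max(cosine(query, key), 0.0)
        return scores


memory = KeyValueMemory(num_classes=2)
memory.add([1.0, 0.0], 0)   # stands in for a historical sample from the stream
memory.add([0.0, 1.0], 1)   # another historical sample
memory.add([0.9, 0.1], 0)   # stands in for a boosting sample of the test image

scores = memory.retrieve([1.0, 0.0])
print(scores.index(max(scores)))  # class with the highest retrieved score
```

In this sketch both kinds of samples share one memory, so the retrieved scores mix information from the target distribution with information from the test sample itself.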

* NeurIPS 2024

BoostAdapter: Improving Test-Time Adaptation via Regional Bootstrapping

Oct 20, 2024

Taolin Zhang, Jinpeng Wang, Hang Guo, Tao Dai, Bin Chen, Shu-Tao Xia

* NeurIPS 2024

Denial-of-Service Poisoning Attacks against Large Language Models

Oct 14, 2024

Kuofeng Gao, Tianyu Pang, Chao Du, Yong Yang, Shu-Tao Xia, Min Lin

Abstract: Recent studies have shown that LLMs are vulnerable to denial-of-service (DoS) attacks, where adversarial inputs like spelling errors or non-semantic prompts trigger endless outputs without generating an [EOS] token. These attacks can potentially cause high latency and make LLM services inaccessible to other users or tasks. However, when there are speech-to-text interfaces (e.g., voice commands to a robot), executing such DoS attacks becomes challenging, as it is difficult to introduce spelling errors or non-semantic prompts through speech. A simple DoS attack in these scenarios would be to instruct the model to "Keep repeating Hello", but we observe that relying solely on natural instructions limits output length, which is bounded by the maximum length of the LLM's supervised finetuning (SFT) data. To overcome this limitation, we propose poisoning-based DoS (P-DoS) attacks for LLMs, demonstrating that injecting a single poisoned sample designed for DoS purposes can break the output length limit. For example, a poisoned sample can successfully attack GPT-4o and GPT-4o mini (via OpenAI's finetuning API) using less than $1, causing repeated outputs up to the maximum inference length (16K tokens, compared to 0.5K before poisoning). Additionally, we perform comprehensive ablation studies on open-source LLMs and extend our method to LLM agents, where attackers can control both the finetuning dataset and algorithm. Our findings underscore the urgent need for defenses against P-DoS attacks to secure LLMs. Our code is available at https://github.com/sail-sg/P-DoS.

Block-to-Scene Pre-training for Point Cloud Hybrid-Domain Masked Autoencoders

Oct 13, 2024

Yaohua Zha, Tao Dai, Yanzi Wang, Hang Guo, Taolin Zhang, Zhihao Ouyang, Chunlin Fan, Bin Chen, Ke Chen, Shu-Tao Xia

Abstract: Point clouds, as a primary representation of 3D data, can be categorized into scene-domain point clouds and object-domain point clouds based on the modeled content. Masked autoencoders (MAE) have become the mainstream paradigm in point cloud self-supervised learning. However, existing MAE-based methods are domain-specific, limiting the model's generalization. In this paper, we propose to pre-train a general Point cloud Hybrid-Domain Masked AutoEncoder (PointHDMAE) via a block-to-scene pre-training strategy. We first propose a hybrid-domain masked autoencoder consisting of an encoder and decoder belonging to the scene domain and object domain, respectively. The object-domain encoder specializes in handling object point clouds, and multiple shared object encoders assist the scene-domain encoder in analyzing the scene point clouds. Furthermore, we propose a block-to-scene strategy to pre-train our hybrid-domain model. Specifically, we first randomly select point blocks within a scene and apply a set of transformations to convert each point block's coordinates from the scene space to the object space. Then, we employ an object-level mask and reconstruction pipeline to recover the masked points of each block, enabling the object encoder to learn a universal object representation. Finally, we introduce a scene-level block position regression pipeline, which utilizes the blocks' features in the object space to regress these blocks' initial positions within the scene space, facilitating the learning of scene representations. Extensive experiments across different datasets and tasks demonstrate the generalization and superiority of our hybrid-domain model.
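
The abstract does not say which transformations map a block from scene space to object space. One common normalization for object point clouds, assumed here purely for illustration, is to center the block at its centroid and scale it into the unit sphere. The sketch below uses that assumption; the function name `block_to_object_space` is hypothetical, and the returned centroid merely stands in for the kind of scene-space position a regression head could be asked to recover.

```python
import math


def block_to_object_space(points):
    """Center a block at its centroid and scale it into the unit sphere.

    Returns (normalized_points, centroid, scale). This normalization is an
    assumed stand-in for the unspecified transformations in the abstract.
    """
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    centered = [tuple(p[i] - centroid[i] for i in range(3)) for p in points]
    radius = max(math.sqrt(sum(c * c for c in q)) for q in centered)
    scale = radius if radius > 0 else 1.0   # guard against a degenerate block
    normalized = [tuple(c / scale for c in q) for q in centered]
    return normalized, centroid, scale


# A small block located far from the scene origin (made-up coordinates).
block = [(10.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
normalized, centroid, scale = block_to_object_space(block)

print(normalized)   # coordinates now lie inside the unit sphere
print(centroid)     # original position of the block in scene space
```

After this step the block looks like an object-domain input, while its original position is kept separately.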

Towards Scalable Semantic Representation for Recommendation

Oct 12, 2024

Taolin Zhang, Junwei Pan, Jinpeng Wang, Yaohua Zha, Tao Dai, Bin Chen, Ruisheng Luo, Xiaoxiang Deng, Yuan Wang, Ming Yue(+2 more)

Abstract: With recent advances in large language models (LLMs), a growing body of research has developed Semantic IDs based on LLMs to enhance the performance of recommendation systems. However, the dimension of these embeddings needs to match that of the ID embedding in recommendation, which is usually much smaller than the original length. Such dimension compression results in inevitable losses in the discriminability and dimension robustness of the LLM embeddings, which motivates us to scale up the semantic representation. In this paper, we propose Mixture-of-Codes, which first constructs multiple independent codebooks for LLM representation in the indexing stage, and then utilizes the Semantic Representation along with a fusion module for the downstream recommendation stage. Extensive analysis and experiments demonstrate that our method achieves superior scalability in discriminability and dimension robustness, leading to the best scale-up performance in recommendations.
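
The indexing idea of several independent codebooks can be sketched with nearest-codeword lookup. This is a toy under stated assumptions rather than the Mixture-of-Codes implementation: the function names, the squared-Euclidean distance, and the tiny two-dimensional codebooks are all hypothetical. Each codebook assigns the embedding its own code index, and the list of indices serves as a multi-part semantic ID.

```python
def nearest_code(vector, codebook):
    """Index of the codeword closest to `vector` (squared Euclidean distance)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def mixture_of_codes(vector, codebooks):
    """Quantize the same embedding independently with every codebook."""
    return [nearest_code(vector, cb) for cb in codebooks]


# Two small, independent codebooks over a 2-D embedding space (made up).
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)],
]
item_embedding = (0.9, 0.8)   # stands in for an LLM item embedding

print(mixture_of_codes(item_embedding, codebooks))  # one index per codebook
```

In a downstream model, each index would typically select a learned embedding, and a fusion module would combine them; that part is not shown here.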

CALoR: Towards Comprehensive Model Inversion Defense

Oct 08, 2024

Hongyao Yu, Yixiang Qiu, Hao Fang, Bin Chen, Sijin Yu, Bin Wang, Shu-Tao Xia, Ke Xu

Abstract: Model Inversion Attacks (MIAs) aim at recovering privacy-sensitive training data from the knowledge encoded in released machine learning models. Recent advances in the MIA field have significantly enhanced attack performance under multiple scenarios, posing serious privacy risks to Deep Neural Networks (DNNs). However, the development of defense strategies has lagged behind the latest MIAs, and existing defenses fail to achieve a better trade-off between model utility and model robustness. In this paper, we provide an in-depth analysis from the perspective of the intrinsic vulnerabilities of MIAs, comprehensively uncovering the weaknesses inherent in the basic pipeline, which were only partially investigated in previous defenses. Building upon these new insights, we propose a robust defense mechanism integrating Confidence Adaptation and Low-Rank compression (CALoR). Our method includes a novel robustness-enhanced classification loss specially designed for model inversion defenses and reveals the extraordinary effectiveness of compressing the classification header. With CALoR, we can mislead the optimization objective, reduce the leaked information, and impede the backpropagation of MIAs, thus mitigating the risk of privacy leakage. Extensive experimental results demonstrate that our method achieves state-of-the-art (SOTA) defense performance against MIAs and exhibits superior generalization to existing defenses across various scenarios.
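
Low-rank compression of a classification layer can be sketched as replacing one dense weight matrix with two thin ones joined by a small bottleneck. The example below is a generic factorization written under that assumption, not the CALoR implementation: the function names, the dimensions, and the omission of bias terms are choices made only for illustration.

```python
def dense_params(in_dim, num_classes):
    """Weights in a single dense layer mapping in_dim -> num_classes."""
    return in_dim * num_classes


def low_rank_params(in_dim, num_classes, rank):
    """Weights in a factorized layer: in_dim -> rank -> num_classes."""
    return in_dim * rank + rank * num_classes


def low_rank_logits(x, down, up):
    """Forward pass through the bottleneck (biases omitted for brevity).

    `down` has `rank` rows of length in_dim; `up` has num_classes rows
    of length `rank`.
    """
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in down]
    return [sum(w * h for w, h in zip(row, hidden)) for row in up]


# Hypothetical sizes: 512-D features, 1000 classes, rank-16 bottleneck.
print(dense_params(512, 1000))
print(low_rank_params(512, 1000, 16))

# Every logit is a function of only `rank` hidden values.
print(low_rank_logits([1.0, 2.0], down=[[1.0, 1.0]], up=[[1.0], [2.0], [3.0]]))
```

The bottleneck limits how much information about the features can pass into the logits, which is one plausible reading of why such compression could reduce leaked information.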

* 26 pages

MIBench: A Comprehensive Benchmark for Model Inversion Attack and Defense

Oct 07, 2024

Yixiang Qiu, Hongyao Yu, Hao Fang, Wenbo Yu, Bin Chen, Xuan Wang, Shu-Tao Xia, Ke Xu

Abstract: Model Inversion (MI) attacks aim at leveraging the output information of target models to reconstruct privacy-sensitive training data, raising widespread concerns about the privacy threats to Deep Neural Networks (DNNs). Unfortunately, in tandem with the rapid evolution of MI attacks, the lack of a comprehensive, aligned, and reliable benchmark has emerged as a formidable challenge. This deficiency leads to inadequate comparisons between different attack methods and inconsistent experimental setups. In this paper, we introduce the first practical benchmark for model inversion attacks and defenses to address this critical gap, named MIBench. This benchmark serves as an extensible and reproducible modular toolbox and currently integrates a total of 16 state-of-the-art attack and defense methods. Moreover, we furnish a suite of assessment tools encompassing 9 commonly used evaluation protocols to facilitate standardized and fair evaluation and analysis. Capitalizing on this foundation, we conduct extensive experiments from multiple perspectives to holistically compare and analyze the performance of various methods across different scenarios, which overcomes the misalignment issues and discrepancies prevalent in previous works. Based on the collected attack methods and defense strategies, we analyze the impact of target resolution, defense robustness, model predictive power, model architectures, transferability, and loss function. We hope that MIBench can provide a unified, practical, and extensible toolbox and will be widely used by researchers in the field to rigorously test and compare their novel methods, ensuring equitable evaluations and thereby propelling further advancements.

* 23 pages

3D-GP-LMVIC: Learning-based Multi-View Image Coding with 3D Gaussian Geometric Priors

Sep 06, 2024

Yujun Huang, Bin Chen, Niu Lian, Baoyi An, Shu-Tao Xia

Abstract: Multi-view image compression is vital for 3D-related applications. To effectively model correlations between views, existing methods typically predict disparity between two views on a 2D plane, which works well for small disparities, such as in stereo images, but struggles with larger disparities caused by significant view changes. To address this, we propose a novel approach: learning-based multi-view image coding with 3D Gaussian geometric priors (3D-GP-LMVIC). Our method leverages 3D Gaussian Splatting to derive geometric priors of the 3D scene, enabling more accurate disparity estimation across views within the compression model. Additionally, we introduce a depth map compression model to reduce redundancy in geometric information between views. A multi-view sequence ordering method is also proposed to enhance correlations between adjacent views. Experimental results demonstrate that 3D-GP-LMVIC surpasses both traditional and learning-based methods in performance, while maintaining fast encoding and decoding speed.

* 19 pages, 8 figures, conference
