Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Joey Tianyi Zhou

Data-independent Module-aware Pruning for Hierarchical Vision Transformers

Apr 21, 2024

Yang He, Joey Tianyi Zhou

Figure 1 for Data-independent Module-aware Pruning for Hierarchical Vision Transformers

Figure 2 for Data-independent Module-aware Pruning for Hierarchical Vision Transformers

Figure 3 for Data-independent Module-aware Pruning for Hierarchical Vision Transformers

Figure 4 for Data-independent Module-aware Pruning for Hierarchical Vision Transformers

Abstract:Hierarchical vision transformers (ViTs) have two advantages over conventional ViTs. First, hierarchical ViTs achieve linear computational complexity with respect to image size by local self-attention. Second, hierarchical ViTs create hierarchical feature maps by merging image patches in deeper layers for dense prediction. However, existing pruning methods ignore the unique properties of hierarchical ViTs and use the magnitude value as the weight importance. This approach leads to two main drawbacks. First, the "local" attention weights are compared at a "global" level, which may cause some "locally" important weights to be pruned due to their relatively small magnitude "globally". The second issue with magnitude pruning is that it fails to consider the distinct weight distributions of the network, which are essential for extracting coarse to fine-grained features at various hierarchical levels. To solve the aforementioned issues, we have developed a Data-independent Module-Aware Pruning method (DIMAP) to compress hierarchical ViTs. To ensure that "local" attention weights at different hierarchical levels are compared fairly in terms of their contribution, we treat them as a module and examine their contribution by analyzing their information distortion. Furthermore, we introduce a novel weight metric that is solely based on weights and does not require input images, thereby eliminating the dependence on the patch merging process. Our method validates its usefulness and strengths on Swin Transformers of different sizes on ImageNet-1k classification. Notably, the top-5 accuracy drop is only 0.07% when we remove 52.5% FLOPs and 52.7% parameters of Swin-B. When we reduce 33.2% FLOPs and 33.2% parameters of Swin-S, we can even achieve a 0.8% higher relative top-5 accuracy than the original model. Code is available at: https://github.com/he-y/Data-independent-Module-Aware-Pruning

* Accepted by ICLR 2024

Via

Access Paper or Ask Questions

Shortcuts Arising from Contrast: Effective and Covert Clean-Label Attacks in Prompt-Based Learning

Mar 30, 2024

Xiaopeng Xie, Ming Yan, Xiwen Zhou, Chenlong Zhao, Suli Wang, Yong Zhang, Joey Tianyi Zhou

Abstract:Prompt-based learning paradigm has demonstrated remarkable efficacy in enhancing the adaptability of pretrained language models (PLMs), particularly in few-shot scenarios. However, this learning paradigm has been shown to be vulnerable to backdoor attacks. The current clean-label attack, employing a specific prompt as a trigger, can achieve success without the need for external triggers and ensure correct labeling of poisoned samples, which is more stealthy compared to the poisoned-label attack, but on the other hand, it faces significant issues with false activations and poses greater challenges, necessitating a higher rate of poisoning. Using conventional negative data augmentation methods, we discovered that it is challenging to trade off between effectiveness and stealthiness in a clean-label setting. In addressing this issue, we are inspired by the notion that a backdoor acts as a shortcut and posit that this shortcut stems from the contrast between the trigger and the data utilized for poisoning. In this study, we propose a method named Contrastive Shortcut Injection (CSI), by leveraging activation values, integrates trigger design and data selection strategies to craft stronger shortcut features. With extensive experiments on full-shot and few-shot text classification tasks, we empirically validate CSI's high effectiveness and high stealthiness at low poisoning rates. Notably, we found that the two approaches play leading roles in full-shot and few-shot settings, respectively.

* 10 pages, 6 figures, conference

Via

Access Paper or Ask Questions

Collaborative Knowledge Infusion for Low-resource Stance Detection

Mar 28, 2024

Ming Yan, Joey Tianyi Zhou, Ivor W. Tsang

Abstract:Stance detection is the view towards a specific target by a given context (\textit{e.g.} tweets, commercial reviews). Target-related knowledge is often needed to assist stance detection models in understanding the target well and making detection correctly. However, prevailing works for knowledge-infused stance detection predominantly incorporate target knowledge from a singular source that lacks knowledge verification in limited domain knowledge. The low-resource training data further increases the challenge for the data-driven large models in this task. To address those challenges, we propose a collaborative knowledge infusion approach for low-resource stance detection tasks, employing a combination of aligned knowledge enhancement and efficient parameter learning techniques. Specifically, our stance detection approach leverages target background knowledge collaboratively from different knowledge sources with the help of knowledge alignment. Additionally, we also introduce the parameter-efficient collaborative adaptor with a staged optimization algorithm, which collaboratively addresses the challenges associated with low-resource stance detection tasks from both network structure and learning perspectives. To assess the effectiveness of our method, we conduct extensive experiments on three public stance detection datasets, including low-resource and cross-target settings. The results demonstrate significant performance improvements compared to the existing stance detection approaches.

* 13 pages, 3 figures, Big Data Mining and Analysis

Via

Access Paper or Ask Questions

CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

Mar 15, 2024

Tingbing Yan, Wenzheng Zeng, Yang Xiao, Xingyu Tong, Bo Tan, Zhiwen Fang, Zhiguo Cao, Joey Tianyi Zhou

Figure 1 for CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

Figure 2 for CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

Figure 3 for CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

Figure 4 for CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

Abstract:Most existing one-shot skeleton-based action recognition focuses on raw low-level information (e.g., joint location), and may suffer from local information loss and low generalization ability. To alleviate these, we propose to leverage text description generated from large language models (LLM) that contain high-level human knowledge, to guide feature learning, in a global-local-global way. Particularly, during training, we design $2$ prompts to gain global and local text descriptions of each action from an LLM. We first utilize the global text description to guide the skeleton encoder focus on informative joints (i.e.,global-to-local). Then we build non-local interaction between local text and joint features, to form the final global representation (i.e., local-to-global). To mitigate the asymmetry issue between the training and inference phases, we further design a dual-branch architecture that allows the model to perform novel class inference without any text input, also making the additional inference cost neglectable compared with the base skeleton encoder. Extensive experiments on three different benchmarks show that CrossGLG consistently outperforms the existing SOTA methods with large margins, and the inference cost (model size) is only $2.8$\% than the previous SOTA. CrossGLG can also serve as a plug-and-play module that can substantially enhance the performance of different SOTA skeleton encoders with a neglectable cost during inference. The source code will be released soon.

Via

Access Paper or Ask Questions

Multisize Dataset Condensation

Mar 10, 2024

Yang He, Lingao Xiao, Joey Tianyi Zhou, Ivor Tsang

Figure 1 for Multisize Dataset Condensation

Figure 2 for Multisize Dataset Condensation

Figure 3 for Multisize Dataset Condensation

Figure 4 for Multisize Dataset Condensation

Abstract:While dataset condensation effectively enhances training efficiency, its application in on-device scenarios brings unique challenges. 1) Due to the fluctuating computational resources of these devices, there's a demand for a flexible dataset size that diverges from a predefined size. 2) The limited computational power on devices often prevents additional condensation operations. These two challenges connect to the "subset degradation problem" in traditional dataset condensation: a subset from a larger condensed dataset is often unrepresentative compared to directly condensing the whole dataset to that smaller size. In this paper, we propose Multisize Dataset Condensation (MDC) by compressing N condensation processes into a single condensation process to obtain datasets with multiple sizes. Specifically, we introduce an "adaptive subset loss" on top of the basic condensation loss to mitigate the "subset degradation problem". Our MDC method offers several benefits: 1) No additional condensation process is required; 2) reduced storage requirement by reusing condensed images. Experiments validate our findings on networks including ConvNet, ResNet and DenseNet, and datasets including SVHN, CIFAR-10, CIFAR-100 and ImageNet. For example, we achieved 6.40% average accuracy gains on condensing CIFAR-10 to ten images per class. Code is available at: https://github.com/he-y/Multisize-Dataset-Condensation.

* Accepted by ICLR 2024 Oral

Via

Access Paper or Ask Questions

Selective Learning: Towards Robust Calibration with Dynamic Regularization

Feb 13, 2024

Zongbo Han, Yifeng Yang, Changqing Zhang, Linjun Zhang, Joey Tianyi Zhou, Qinghua Hu, Huaxiu Yao

Figure 1 for Selective Learning: Towards Robust Calibration with Dynamic Regularization

Figure 2 for Selective Learning: Towards Robust Calibration with Dynamic Regularization

Figure 3 for Selective Learning: Towards Robust Calibration with Dynamic Regularization

Figure 4 for Selective Learning: Towards Robust Calibration with Dynamic Regularization

Abstract:Miscalibration in deep learning refers to there is a discrepancy between the predicted confidence and performance. This problem usually arises due to the overfitting problem, which is characterized by learning everything presented in the training set, resulting in overconfident predictions during testing. Existing methods typically address overfitting and mitigate the miscalibration by adding a maximum-entropy regularizer to the objective function. The objective can be understood as seeking a model that fits the ground-truth labels by increasing the confidence while also maximizing the entropy of predicted probabilities by decreasing the confidence. However, previous methods lack clear guidance on confidence adjustment, leading to conflicting objectives (increasing but also decreasing confidence). Therefore, we introduce a method called Dynamic Regularization (DReg), which aims to learn what should be learned during training thereby circumventing the confidence adjusting trade-off. At a high level, DReg aims to obtain a more reliable model capable of acknowledging what it knows and does not know. Specifically, DReg effectively fits the labels for in-distribution samples (samples that should be learned) while applying regularization dynamically to samples beyond model capabilities (e.g., outliers), thereby obtaining a robust calibrated model especially on the samples beyond model capabilities. Both theoretical and empirical analyses sufficiently demonstrate the superiority of DReg compared with previous methods.

Via

Access Paper or Ask Questions

Two Trades is not Baffled: Condensing Graph via Crafting Rational Gradient Matching

Feb 08, 2024

Tianle Zhang, Yuchen Zhang, Kun Wang, Kai Wang, Beining Yang, Kaipeng Zhang, Wenqi Shao, Ping Liu, Joey Tianyi Zhou, Yang You

Figure 1 for Two Trades is not Baffled: Condensing Graph via Crafting Rational Gradient Matching

Figure 2 for Two Trades is not Baffled: Condensing Graph via Crafting Rational Gradient Matching

Figure 3 for Two Trades is not Baffled: Condensing Graph via Crafting Rational Gradient Matching

Figure 4 for Two Trades is not Baffled: Condensing Graph via Crafting Rational Gradient Matching

Abstract:Training on large-scale graphs has achieved remarkable results in graph representation learning, but its cost and storage have raised growing concerns. As one of the most promising directions, graph condensation methods address these issues by employing gradient matching, aiming to condense the full graph into a more concise yet information-rich synthetic set. Though encouraging, these strategies primarily emphasize matching directions of the gradients, which leads to deviations in the training trajectories. Such deviations are further magnified by the differences between the condensation and evaluation phases, culminating in accumulated errors, which detrimentally affect the performance of the condensed graphs. In light of this, we propose a novel graph condensation method named \textbf{C}raf\textbf{T}ing \textbf{R}ationa\textbf{L} trajectory (\textbf{CTRL}), which offers an optimized starting point closer to the original dataset's feature distribution and a more refined strategy for gradient matching. Theoretically, CTRL can effectively neutralize the impact of accumulated errors on the performance of condensed graphs. We provide extensive experiments on various graph datasets and downstream tasks to support the effectiveness of CTRL. Code is released at https://github.com/NUS-HPC-AI-Lab/CTRL.

* An effective method for graph condensation

Via

Access Paper or Ask Questions

A Concise but Effective Network for Image Guided Depth Completion in Autonomous Driving

Jan 29, 2024

Moyun Liu, Youping Chen, Jingming Xie, Lei Yao, Yang Zhang, Joey Tianyi Zhou

Figure 1 for A Concise but Effective Network for Image Guided Depth Completion in Autonomous Driving

Figure 2 for A Concise but Effective Network for Image Guided Depth Completion in Autonomous Driving

Figure 3 for A Concise but Effective Network for Image Guided Depth Completion in Autonomous Driving

Figure 4 for A Concise but Effective Network for Image Guided Depth Completion in Autonomous Driving

Abstract:Depth completion is a crucial task in autonomous driving, aiming to convert a sparse depth map into a dense depth prediction. Due to its potentially rich semantic information, RGB image is commonly fused to enhance the completion effect. Image-guided depth completion involves three key challenges: 1) how to effectively fuse the two modalities; 2) how to better recover depth information; and 3) how to achieve real-time prediction for practical autonomous driving. To solve the above problems, we propose a concise but effective network, named CENet, to achieve high-performance depth completion with a simple and elegant structure. Firstly, we use a fast guidance module to fuse the two sensor features, utilizing abundant auxiliary features extracted from the color space. Unlike other commonly used complicated guidance modules, our approach is intuitive and low-cost. In addition, we find and analyze the optimization inconsistency problem for observed and unobserved positions, and a decoupled depth prediction head is proposed to alleviate the issue. The proposed decoupled head can better output the depth of valid and invalid positions with very few extra inference time. Based on the simple structure of dual-encoder and single-decoder, our CENet can achieve superior balance between accuracy and efficiency. In the KITTI depth completion benchmark, our CENet attains competitive performance and inference speed compared with the state-of-the-art methods. To validate the generalization of our method, we also evaluate on indoor NYUv2 dataset, and our CENet still achieve impressive results. The code of this work will be available at https://github.com/lmomoy/CENet.

Via

Access Paper or Ask Questions

FedLoGe: Joint Local and Generic Federated Learning under Long-tailed Data

Jan 17, 2024

Zikai Xiao, Zihan Chen, Liyinglan Liu, Yang Feng, Jian Wu, Wanlu Liu, Joey Tianyi Zhou, Howard Hao Yang, Zuozhu Liu

Figure 1 for FedLoGe: Joint Local and Generic Federated Learning under Long-tailed Data

Figure 2 for FedLoGe: Joint Local and Generic Federated Learning under Long-tailed Data

Figure 3 for FedLoGe: Joint Local and Generic Federated Learning under Long-tailed Data

Figure 4 for FedLoGe: Joint Local and Generic Federated Learning under Long-tailed Data

Abstract:Federated Long-Tailed Learning (Fed-LT), a paradigm wherein data collected from decentralized local clients manifests a globally prevalent long-tailed distribution, has garnered considerable attention in recent times. In the context of Fed-LT, existing works have predominantly centered on addressing the data imbalance issue to enhance the efficacy of the generic global model while neglecting the performance at the local level. In contrast, conventional Personalized Federated Learning (pFL) techniques are primarily devised to optimize personalized local models under the presumption of a balanced global data distribution. This paper introduces an approach termed Federated Local and Generic Model Training in Fed-LT (FedLoGe), which enhances both local and generic model performance through the integration of representation learning and classifier alignment within a neural collapse framework. Our investigation reveals the feasibility of employing a shared backbone as a foundational framework for capturing overarching global trends, while concurrently employing individualized classifiers to encapsulate distinct refinements stemming from each client's local features. Building upon this discovery, we establish the Static Sparse Equiangular Tight Frame Classifier (SSE-C), inspired by neural collapse principles that naturally prune extraneous noisy features and foster the acquisition of potent data representations. Furthermore, leveraging insights from imbalance neural collapse's classifier norm patterns, we develop Global and Local Adaptive Feature Realignment (GLA-FR) via an auxiliary global classifier and personalized Euclidean norm transfer to align global features with client preferences. Extensive experimental results on CIFAR-10/100-LT, ImageNet, and iNaturalist demonstrate the advantage of our method over state-of-the-art pFL and Fed-LT approaches.

* Accepted by ICLR 2024

Via

Access Paper or Ask Questions

Direct Distillation between Different Domains

Jan 12, 2024

Jialiang Tang, Shuo Chen, Gang Niu, Hongyuan Zhu, Joey Tianyi Zhou, Chen Gong, Masashi Sugiyama

Figure 1 for Direct Distillation between Different Domains

Figure 2 for Direct Distillation between Different Domains

Figure 3 for Direct Distillation between Different Domains

Figure 4 for Direct Distillation between Different Domains

Abstract:Knowledge Distillation (KD) aims to learn a compact student network using knowledge from a large pre-trained teacher network, where both networks are trained on data from the same distribution. However, in practical applications, the student network may be required to perform in a new scenario (i.e., the target domain), which usually exhibits significant differences from the known scenario of the teacher network (i.e., the source domain). The traditional domain adaptation techniques can be integrated with KD in a two-stage process to bridge the domain gap, but the ultimate reliability of two-stage approaches tends to be limited due to the high computational consumption and the additional errors accumulated from both stages. To solve this problem, we propose a new one-stage method dubbed ``Direct Distillation between Different Domains" (4Ds). We first design a learnable adapter based on the Fourier transform to separate the domain-invariant knowledge from the domain-specific knowledge. Then, we build a fusion-activation mechanism to transfer the valuable domain-invariant knowledge to the student network, while simultaneously encouraging the adapter within the teacher network to learn the domain-specific knowledge of the target data. As a result, the teacher network can effectively transfer categorical knowledge that aligns with the target domain of the student network. Intensive experiments on various benchmark datasets demonstrate that our proposed 4Ds method successfully produces reliable student networks and outperforms state-of-the-art approaches.

Via

Access Paper or Ask Questions