Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Mohan Kankanhalli

Prior-Free Continual Learning with Unlabeled Data in the Wild

Oct 16, 2023

Tao Zhuo, Zhiyong Cheng, Hehe Fan, Mohan Kankanhalli

Figure 1 for Prior-Free Continual Learning with Unlabeled Data in the Wild

Figure 2 for Prior-Free Continual Learning with Unlabeled Data in the Wild

Figure 3 for Prior-Free Continual Learning with Unlabeled Data in the Wild

Figure 4 for Prior-Free Continual Learning with Unlabeled Data in the Wild

Abstract:Continual Learning (CL) aims to incrementally update a trained model on new tasks without forgetting the acquired knowledge of old ones. Existing CL methods usually reduce forgetting with task priors, \ie using task identity or a subset of previously seen samples for model training. However, these methods would be infeasible when such priors are unknown in real-world applications. To address this fundamental but seldom-studied problem, we propose a Prior-Free Continual Learning (PFCL) method, which learns new tasks without knowing the task identity or any previous data. First, based on a fixed single-head architecture, we eliminate the need for task identity to select the task-specific output head. Second, we employ a regularization-based strategy for consistent predictions between the new and old models, avoiding revisiting previous samples. However, using this strategy alone often performs poorly in class-incremental scenarios, particularly for a long sequence of tasks. By analyzing the effectiveness and limitations of conventional regularization-based methods, we propose enhancing model consistency with an auxiliary unlabeled dataset additionally. Moreover, since some auxiliary data may degrade the performance, we further develop a reliable sample selection strategy to obtain consistent performance improvement. Extensive experiments on multiple image classification benchmark datasets show that our PFCL method significantly mitigates forgetting in all three learning scenarios. Furthermore, when compared to the most recent rehearsal-based methods that replay a limited number of previous samples, PFCL achieves competitive accuracy. Our code is available at: https://github.com/visiontao/pfcl

Via

Access Paper or Ask Questions

AutoLoRa: A Parameter-Free Automated Robust Fine-Tuning Framework

Oct 03, 2023

Xilie Xu, Jingfeng Zhang, Mohan Kankanhalli

Abstract:Robust Fine-Tuning (RFT) is a low-cost strategy to obtain adversarial robustness in downstream applications, without requiring a lot of computational resources and collecting significant amounts of data. This paper uncovers an issue with the existing RFT, where optimizing both adversarial and natural objectives through the feature extractor (FE) yields significantly divergent gradient directions. This divergence introduces instability in the optimization process, thereby hindering the attainment of adversarial robustness and rendering RFT highly sensitive to hyperparameters. To mitigate this issue, we propose a low-rank (LoRa) branch that disentangles RFT into two distinct components: optimizing natural objectives via the LoRa branch and adversarial objectives via the FE. Besides, we introduce heuristic strategies for automating the scheduling of the learning rate and the scalars of loss terms. Extensive empirical evaluations demonstrate that our proposed automated RFT disentangled via the LoRa branch (AutoLoRa) achieves new state-of-the-art results across a range of downstream tasks. AutoLoRa holds significant practical utility, as it automatically converts a pre-trained FE into an adversarially robust model for downstream tasks without the need for searching hyperparameters.

Via

Access Paper or Ask Questions

ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens

Sep 28, 2023

Yangyang Guo, Haoyu Zhang, Liqiang Nie, Yongkang Wong, Mohan Kankanhalli

Abstract:Learning a versatile language-image model is computationally prohibitive under a limited computing budget. This paper delves into the efficient language-image pre-training, an area that has received relatively little attention despite its importance in reducing computational cost and footprint. To that end, we propose a vision token pruning and merging method, ie ELIP, to remove less influential tokens based on the supervision of language outputs. Our method is designed with several strengths, such as being computation-efficient, memory-efficient, and trainable-parameter-free, and is distinguished from previous vision-only token pruning approaches by its alignment with task objectives. We implement this method in a progressively pruning manner using several sequential blocks. To evaluate its generalization performance, we apply ELIP to three commonly used language-image pre-training models and utilize public image-caption pairs with 4M images for pre-training. Our experiments demonstrate that with the removal of ~30$\%$ vision tokens across 12 ViT layers, ELIP maintains significantly comparable performance with baselines ($\sim$0.32 accuracy drop on average) over various downstream tasks including cross-modal retrieval, VQA, image captioning, etc. In addition, the spared GPU resources by our ELIP allow us to scale up with larger batch sizes, thereby accelerating model pre-training and even sometimes enhancing downstream model performance. Our code will be released at https://github.com/guoyang9/ELIP.

Via

Access Paper or Ask Questions

Distill to Delete: Unlearning in Graph Networks with Knowledge Distillation

Sep 28, 2023

Yash Sinha, Murari Mandal, Mohan Kankanhalli

Figure 1 for Distill to Delete: Unlearning in Graph Networks with Knowledge Distillation

Figure 2 for Distill to Delete: Unlearning in Graph Networks with Knowledge Distillation

Figure 3 for Distill to Delete: Unlearning in Graph Networks with Knowledge Distillation

Figure 4 for Distill to Delete: Unlearning in Graph Networks with Knowledge Distillation

Abstract:Graph unlearning has emerged as a pivotal method to delete information from a pre-trained graph neural network (GNN). One may delete nodes, a class of nodes, edges, or a class of edges. An unlearning method enables the GNN model to comply with data protection regulations (i.e., the right to be forgotten), adapt to evolving data distributions, and reduce the GPU-hours carbon footprint by avoiding repetitive retraining. Existing partitioning and aggregation-based methods have limitations due to their poor handling of local graph dependencies and additional overhead costs. More recently, GNNDelete offered a model-agnostic approach that alleviates some of these issues. Our work takes a novel approach to address these challenges in graph unlearning through knowledge distillation, as it distills to delete in GNN (D2DGN). It is a model-agnostic distillation framework where the complete graph knowledge is divided and marked for retention and deletion. It performs distillation with response-based soft targets and feature-based node embedding while minimizing KL divergence. The unlearned model effectively removes the influence of deleted graph elements while preserving knowledge about the retained graph elements. D2DGN surpasses the performance of existing methods when evaluated on various real-world graph datasets by up to $43.1\%$ (AUC) in edge and node unlearning tasks. Other notable advantages include better efficiency, better performance in removing target elements, preservation of performance for the retained elements, and zero overhead costs. Notably, our D2DGN surpasses the state-of-the-art GNNDelete in AUC by $2.4\%$, improves membership inference ratio by $+1.3$, requires $10.2\times10^6$ fewer FLOPs per forward pass and up to $\mathbf{3.2}\times$ faster.

Via

Access Paper or Ask Questions

Semantic-Guided Feature Distillation for Multimodal Recommendation

Aug 06, 2023

Fan Liu, Huilin Chen, Zhiyong Cheng, Liqiang Nie, Mohan Kankanhalli

Figure 1 for Semantic-Guided Feature Distillation for Multimodal Recommendation

Figure 2 for Semantic-Guided Feature Distillation for Multimodal Recommendation

Figure 3 for Semantic-Guided Feature Distillation for Multimodal Recommendation

Figure 4 for Semantic-Guided Feature Distillation for Multimodal Recommendation

Abstract:Multimodal recommendation exploits the rich multimodal information associated with users or items to enhance the representation learning for better performance. In these methods, end-to-end feature extractors (e.g., shallow/deep neural networks) are often adopted to tailor the generic multimodal features that are extracted from raw data by pre-trained models for recommendation. However, compact extractors, such as shallow neural networks, may find it challenging to extract effective information from complex and high-dimensional generic modality features. Conversely, DNN-based extractors may encounter the data sparsity problem in recommendation. To address this problem, we propose a novel model-agnostic approach called Semantic-guided Feature Distillation (SGFD), which employs a teacher-student framework to extract feature for multimodal recommendation. The teacher model first extracts rich modality features from the generic modality feature by considering both the semantic information of items and the complementary information of multiple modalities. SGFD then utilizes response-based and feature-based distillation loss to effectively transfer the knowledge encoded in the teacher model to the student model. To evaluate the effectiveness of our SGFD, we integrate SGFD into three backbone multimodal recommendation models. Extensive experiments on three public real-world datasets demonstrate that SGFD-enhanced models can achieve substantial improvement over their counterparts.

* In Proceedings of the 31st ACM International Conference on Multimedia (MM '23), 2023
* ACM Multimedia 2023 Accepted

Via

Access Paper or Ask Questions

DPMix: Mixture of Depth and Point Cloud Video Experts for 4D Action Segmentation

Jul 31, 2023

Yue Zhang, Hehe Fan, Yi Yang, Mohan Kankanhalli

Abstract:In this technical report, we present our findings from the research conducted on the Human-Object Interaction 4D (HOI4D) dataset for egocentric action segmentation task. As a relatively novel research area, point cloud video methods might not be good at temporal modeling, especially for long point cloud videos (\eg, 150 frames). In contrast, traditional video understanding methods have been well developed. Their effectiveness on temporal modeling has been widely verified on many large scale video datasets. Therefore, we convert point cloud videos into depth videos and employ traditional video modeling methods to improve 4D action segmentation. By ensembling depth and point cloud video methods, the accuracy is significantly improved. The proposed method, named Mixture of Depth and Point cloud video experts (DPMix), achieved the first place in the 4D Action Segmentation Track of the HOI4D Challenge 2023.

Via

Access Paper or Ask Questions

Towards Generalizable Deepfake Detection by Primary Region Regularization

Jul 28, 2023

Harry Cheng, Yangyang Guo, Tianyi Wang, Liqiang Nie, Mohan Kankanhalli

Figure 1 for Towards Generalizable Deepfake Detection by Primary Region Regularization

Figure 2 for Towards Generalizable Deepfake Detection by Primary Region Regularization

Figure 3 for Towards Generalizable Deepfake Detection by Primary Region Regularization

Figure 4 for Towards Generalizable Deepfake Detection by Primary Region Regularization

Abstract:The existing deepfake detection methods have reached a bottleneck in generalizing to unseen forgeries and manipulation approaches. Based on the observation that the deepfake detectors exhibit a preference for overfitting the specific primary regions in input, this paper enhances the generalization capability from a novel regularization perspective. This can be simply achieved by augmenting the images through primary region removal, thereby preventing the detector from over-relying on data bias. Our method consists of two stages, namely the static localization for primary region maps, as well as the dynamic exploitation of primary region masks. The proposed method can be seamlessly integrated into different backbones without affecting their inference efficiency. We conduct extensive experiments over three widely used deepfake datasets - DFDC, DF-1.0, and Celeb-DF with five backbones. Our method demonstrates an average performance improvement of 6% across different backbones and performs competitively with several state-of-the-art baselines.

* 12 pages. v2 corrected one minor citation error. Code and Dataset: https://github.com/xaCheng1996/PRLE

Via

Access Paper or Ask Questions

Sample Less, Learn More: Efficient Action Recognition via Frame Feature Restoration

Jul 27, 2023

Harry Cheng, Yangyang Guo, Liqiang Nie, Zhiyong Cheng, Mohan Kankanhalli

Abstract:Training an effective video action recognition model poses significant computational challenges, particularly under limited resource budgets. Current methods primarily aim to either reduce model size or utilize pre-trained models, limiting their adaptability to various backbone architectures. This paper investigates the issue of over-sampled frames, a prevalent problem in many approaches yet it has received relatively little attention. Despite the use of fewer frames being a potential solution, this approach often results in a substantial decline in performance. To address this issue, we propose a novel method to restore the intermediate features for two sparsely sampled and adjacent video frames. This feature restoration technique brings a negligible increase in computational requirements compared to resource-intensive image encoders, such as ViT. To evaluate the effectiveness of our method, we conduct extensive experiments on four public datasets, including Kinetics-400, ActivityNet, UCF-101, and HMDB-51. With the integration of our method, the efficiency of three commonly used baselines has been improved by over 50%, with a mere 0.5% reduction in recognition accuracy. In addition, our method also surprisingly helps improve the generalization ability of the models under zero-shot settings.

* 13 pages. Code and pretrained weight will be released at https://github.com/xaCheng1996/SLLM

Via

Access Paper or Ask Questions

Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering

Jul 25, 2023

Yi Cheng, Hehe Fan, Dongyun Lin, Ying Sun, Mohan Kankanhalli, Joo-Hwee Lim

Figure 1 for Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering

Figure 2 for Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering

Figure 3 for Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering

Figure 4 for Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering

Abstract:The main challenge in video question answering (VideoQA) is to capture and understand the complex spatial and temporal relations between objects based on given questions. Existing graph-based methods for VideoQA usually ignore keywords in questions and employ a simple graph to aggregate features without considering relative relations between objects, which may lead to inferior performance. In this paper, we propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA. First, to make question features aware of keywords, we employ an attention mechanism to assign high weights to keywords during question encoding. The keyword-aware question features are then used to guide video graph construction. Second, because relations are relative, we integrate the relative relation modeling to better capture the spatio-temporal dynamics among object nodes. Moreover, we disentangle the spatio-temporal reasoning into an object-level spatial graph and a frame-level temporal graph, which reduces the impact of spatial and temporal relation reasoning on each other. Extensive experiments on the TGIF-QA, MSVD-QA and MSRVTT-QA datasets demonstrate the superiority of our KRST over multiple state-of-the-art methods.

* under review

Via

Access Paper or Ask Questions

Mining Conditional Part Semantics with Occluded Extrapolation for Human-Object Interaction Detection

Jul 19, 2023

Guangzhi Wang, Yangyang Guo, Mohan Kankanhalli

Figure 1 for Mining Conditional Part Semantics with Occluded Extrapolation for Human-Object Interaction Detection

Figure 2 for Mining Conditional Part Semantics with Occluded Extrapolation for Human-Object Interaction Detection

Figure 3 for Mining Conditional Part Semantics with Occluded Extrapolation for Human-Object Interaction Detection

Figure 4 for Mining Conditional Part Semantics with Occluded Extrapolation for Human-Object Interaction Detection

Abstract:Human-Object Interaction Detection is a crucial aspect of human-centric scene understanding, with important applications in various domains. Despite recent progress in this field, recognizing subtle and detailed interactions remains challenging. Existing methods try to use human-related clues to alleviate the difficulty, but rely heavily on external annotations or knowledge, limiting their practical applicability in real-world scenarios. In this work, we propose a novel Part Semantic Network (PSN) to solve this problem. The core of PSN is a Conditional Part Attention (CPA) mechanism, where human features are taken as keys and values, and the object feature is used as query for the computation in a cross-attention mechanism. In this way, our model learns to automatically focus on the most informative human parts conditioned on the involved object, generating more semantically meaningful features for interaction recognition. Additionally, we propose an Occluded Part Extrapolation (OPE) strategy to facilitate interaction recognition under occluded scenarios, which teaches the model to extrapolate detailed features from partially occluded ones. Our method consistently outperforms prior approaches on the V-COCO and HICO-DET datasets, without external data or extra annotations. Additional ablation studies validate the effectiveness of each component of our proposed method.

* Preprint

Via

Access Paper or Ask Questions