Rongrong Ji

Xiamen University, Peng Cheng Laboratory

Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models

Sep 06, 2023

Qiong Wu, Wei Yu, Yiyi Zhou, Shubin Huang, Xiaoshuai Sun, Rongrong Ji

Abstract:With ever increasing parameters and computation, vision-language pre-trained (VLP) models exhibit prohibitive expenditure in downstream task adaption. Recent endeavors mainly focus on parameter efficient transfer learning (PETL) for VLP models by only updating a small number of parameters. However, excessive computational overhead still plagues the application of VLPs. In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for VLP models. In particular, PCETL not only needs to limit the number of trainable parameters in VLP models, but also to reduce the computational redundancy during inference, thus enabling a more efficient transfer. To approach this target, we propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL. Instead of directly optimizing the intrinsic architectures of VLP models, DAS first observes the significances of their modules to downstream tasks via a reinforcement learning (RL) based process, and then skips the redundant ones with lightweight networks, i.e., adapters, according to the obtained rewards. In this case, the VLP model can well maintain the scale of trainable parameters while speeding up its inference on downstream tasks. To validate DAS, we apply it to two representative VLP models, namely ViLT and METER, and conduct extensive experiments on a bunch of VL tasks. The experimental results not only show the great advantages of DAS in reducing computational complexity, e.g. -11.97% FLOPs of METER on VQA2.0, but also confirm its competitiveness against existing PETL methods in terms of parameter scale and performance. Our source code is given in our appendix.
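The skipping idea in this abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each module already has a scalar redundancy score (a stand-in for whatever the RL-based process produces), and the names `pick_layers_to_skip`, `forward`, and the `flops` bookkeeping are hypothetical.

```python
# Toy illustration of skipping redundant modules with lightweight adapters.
# Assumption: a higher score means "more redundant"; the abstract does not
# specify how rewards map to skipping decisions.

def pick_layers_to_skip(redundancy_scores, num_skip):
    """Return the indices of the num_skip layers with the highest scores."""
    ranked = sorted(
        range(len(redundancy_scores)),
        key=lambda i: redundancy_scores[i],
        reverse=True,
    )
    return set(ranked[:num_skip])


def forward(x, layers, adapters, skipped):
    """Run the stack, using the cheap adapter in place of each skipped layer.

    Each module is a dict with a callable "fn" and a nominal "flops" cost.
    Returns the output and the total nominal cost of the modules executed.
    """
    total_flops = 0
    for i, layer in enumerate(layers):
        module = adapters[i] if i in skipped else layer
        x = module["fn"](x)
        total_flops += module["flops"]
    return x, total_flops


if __name__ == "__main__":
    # Four "heavy" layers and four "light" adapters with made-up costs.
    layers = [{"fn": lambda v: v + 1, "flops": 100} for _ in range(4)]
    adapters = [{"fn": lambda v: v, "flops": 5} for _ in range(4)]
    skipped = pick_layers_to_skip([0.1, 0.9, 0.3, 0.8], num_skip=2)
    print(skipped, forward(0, layers, adapters, skipped))
```

In this toy setup the only trainable parts would be the adapters, which is how a method of this kind could keep the parameter budget small while cutting inference cost.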

Towards Unified Token Learning for Vision-Language Tracking

Aug 27, 2023

Yaozong Zheng, Bineng Zhong, Qihua Liang, Guorong Li, Rongrong Ji, Xianxian Li

Abstract:In this paper, we present a simple, flexible and effective vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task. Traditional paradigms address the VL tracking task indirectly with sophisticated prior designs, making them over-specialize on the features of specific architectures or mechanisms. In contrast, our proposed framework serializes the language description and bounding box into a sequence of discrete tokens. In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target in an auto-regressive manner. The design without other prior modules avoids learning multiple sub-tasks and hand-designed loss functions, significantly reducing the complexity of VL tracking modeling and allowing our tracker to use a simple cross-entropy loss as a unified optimization objective for the VL tracking task. Extensive experiments on the TNL2K, LaSOT, LaSOT_ext and OTB99-Lang benchmarks show that our approach achieves promising results compared to other state-of-the-art methods.
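One common way to turn a bounding box into discrete tokens is to quantize its normalized coordinates into a fixed number of bins. The sketch below shows that general technique only. The bin count of 1000, the corner format `(x1, y1, x2, y2)`, and the function names are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of coordinate quantization for box-to-token serialization.
# Assumptions: boxes are (x1, y1, x2, y2) in pixels; num_bins is hypothetical.

def box_to_tokens(box, image_size, num_bins=1000):
    """Map pixel coordinates to integer tokens in [0, num_bins - 1]."""
    width, height = image_size
    scales = (width, height, width, height)
    tokens = []
    for value, scale in zip(box, scales):
        # Normalize to [0, 1] and clamp coordinates that fall outside the image.
        normalized = min(max(value / scale, 0.0), 1.0)
        tokens.append(min(int(normalized * num_bins), num_bins - 1))
    return tokens


def tokens_to_box(tokens, image_size, num_bins=1000):
    """Map tokens back to pixel coordinates, using the center of each bin."""
    width, height = image_size
    scales = (width, height, width, height)
    return [(token + 0.5) / num_bins * scale for token, scale in zip(tokens, scales)]


if __name__ == "__main__":
    tokens = box_to_tokens((64, 32, 192, 96), (256, 128))
    print(tokens)
    print(tokens_to_box(tokens, (256, 128)))
```

Because each coordinate becomes a class index, a model can predict the four tokens one at a time and be trained with an ordinary cross-entropy loss, which matches the unified objective the abstract describes.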

DLIP: Distilling Language-Image Pre-training

Aug 24, 2023

Huafeng Kuang, Jie Wu, Xiawu Zheng, Ming Li, Xuefeng Xiao, Rui Wang, Min Zheng, Rongrong Ji

Abstract:Vision-Language Pre-training (VLP) shows remarkable progress with the assistance of extremely heavy parameters, which challenges deployment in real applications. Knowledge distillation is well recognized as the essential procedure in model compression. However, existing knowledge distillation techniques lack an in-depth investigation and analysis of VLP, and practical guidelines for VLP-oriented distillation are still not yet explored. In this paper, we present DLIP, a simple yet efficient Distilling Language-Image Pre-training framework, through which we investigate how to distill a light VLP model. Specifically, we dissect the model distillation from multiple dimensions, such as the architecture characteristics of different modules and the information transfer of different modalities. We conduct comprehensive experiments and provide insights on distilling a light but performant VLP model. Experimental results reveal that DLIP can achieve a state-of-the-art accuracy/efficiency trade-off across diverse cross-modal tasks, e.g., image-text retrieval, image captioning and visual question answering. For example, DLIP compresses BLIP by 1.9x, from 213M to 108M parameters, while achieving comparable or better performance. Furthermore, DLIP succeeds in retaining more than 95% of the performance with 22.4% parameters and 24.8% FLOPs compared to the teacher model and accelerates inference speed by 2.7x.

M3PS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization in E-commerce

Aug 22, 2023

Tao Chen, Ze Lin, Hui Li, Jiayi Ji, Yiyi Zhou, Guanbin Li, Rongrong Ji

Abstract:Given the long textual product information and the product image, Multi-Modal Product Summarization (MMPS) aims to attract customers' interest and increase their desire to purchase by highlighting product characteristics with a short textual summary. Existing MMPS methods have achieved promising performance. Nevertheless, several problems remain: 1) a lack of end-to-end product summarization, 2) a lack of multi-grained multi-modal modeling, and 3) a lack of multi-modal attribute modeling. To address these issues, we propose an end-to-end multi-grained multi-modal attribute-aware product summarization method (M3PS) for generating high-quality product summaries in e-commerce. M3PS jointly models product attributes and generates product summaries. Meanwhile, we design several multi-grained multi-modal tasks to better guide the multi-modal learning of M3PS. Furthermore, we model product attributes based on both text and image modalities so that multi-modal product characteristics can be manifested in the generated summaries. Extensive experiments on a real large-scale Chinese e-commerce dataset demonstrate that our model outperforms state-of-the-art product summarization methods w.r.t. several summarization metrics.

HODN: Disentangling Human-Object Feature for HOI Detection

Aug 20, 2023

Shuman Fang, Zhiwen Lin, Ke Yan, Jie Li, Xianming Lin, Rongrong Ji

Abstract:The task of Human-Object Interaction (HOI) detection is to detect humans and their interactions with surrounding objects, where transformer-based methods show dominant advances currently. However, these methods ignore the relationship among humans, objects, and interactions: 1) human features are more contributive than object ones to interaction prediction; 2) interactive information disturbs the detection of objects but helps human detection. In this paper, we propose a Human and Object Disentangling Network (HODN) to model the HOI relationships explicitly, where humans and objects are first detected by two disentangling decoders independently and then processed by an interaction decoder. Considering that human features are more contributive to interaction, we propose a Human-Guide Linking method to make sure the interaction decoder focuses on the human-centric regions with human features as the positional embeddings. To handle the opposite influences of interactions on humans and objects, we propose a Stop-Gradient Mechanism to stop interaction gradients from optimizing the object detection but to allow them to optimize the human detection. Our proposed method achieves competitive performance on both the V-COCO and the HICO-Det datasets. It can be combined with existing methods easily for state-of-the-art results.

* Accepted by TMM 2023

Continual Face Forgery Detection via Historical Distribution Preserving

Aug 11, 2023

Ke Sun, Shen Chen, Taiping Yao, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji

Abstract:Face forgery techniques have advanced rapidly and pose serious security threats. Existing face forgery detection methods try to learn generalizable features, but they still fall short of practical application. Additionally, finetuning these methods on historical training data is resource-intensive in terms of time and storage. In this paper, we focus on a novel and challenging problem: Continual Face Forgery Detection (CFFD), which aims to efficiently learn from new forgery attacks without forgetting previous ones. Specifically, we propose a Historical Distribution Preserving (HDP) framework that reserves and preserves the distributions of historical faces. To achieve this, we use universal adversarial perturbation (UAP) to simulate historical forgery distribution, and knowledge distillation to maintain the distribution variation of real faces across different models. We also construct a new benchmark for CFFD with three evaluation protocols. Our extensive experiments on the benchmarks show that our method outperforms the state-of-the-art competitors.

Pseudo-label Alignment for Semi-supervised Instance Segmentation

Aug 10, 2023

Jie Hu, Chen Chen, Liujuan Cao, Shengchuan Zhang, Annan Shu, Guannan Jiang, Rongrong Ji

Abstract:Pseudo-labeling is significant for semi-supervised instance segmentation, which generates instance masks and classes from unannotated images for subsequent training. However, in existing pipelines, pseudo-labels that contain valuable information may be directly filtered out due to mismatches in class and mask quality. To address this issue, we propose a novel framework, called pseudo-label aligning instance segmentation (PAIS), in this paper. In PAIS, we devise a dynamic aligning loss (DALoss) that adjusts the weights of semi-supervised loss terms with varying class and mask score pairs. Through extensive experiments conducted on the COCO and Cityscapes datasets, we demonstrate that PAIS is a promising framework for semi-supervised instance segmentation, particularly in cases where labeled data is severely limited. Notably, with just 1% labeled data, PAIS achieves 21.2 mAP (based on Mask-RCNN) and 19.9 mAP (based on K-Net) on the COCO dataset, outperforming the current state-of-the-art model, i.e., NoisyBoundary with 7.7 mAP, by a margin of over 12 points. Code is available at: https://github.com/hujiecpp/PAIS.

* ICCV 2023

Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation

Aug 06, 2023

Haowei Wang, Jiji Tang, Jiayi Ji, Xiaoshuai Sun, Rongsheng Zhang, Yiwei Ma, Minda Zhao, Lincheng Li, Zeng Zhao, Tangjie Lv (+1 more)

Abstract:In recent years, 3D representation learning has turned to 2D vision-language pre-trained models to overcome data scarcity challenges. However, existing methods simply transfer 2D alignment strategies, aligning 3D representations with single-view 2D images and coarse-grained parent category text. These approaches introduce information degradation and insufficient synergy issues, leading to performance loss. Information degradation arises from overlooking the fact that a 3D representation should be equivalent to a series of multi-view images and more fine-grained subcategory text. Insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space, rather than independently aligning with each modality. In this paper, we propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image. Specifically, a novel Structured Multimodal Organizer (SMO) is proposed to address the information degradation issue, which introduces contiguous multi-view images and hierarchical text to enrich the representation of vision and language modalities. A Joint Multi-modal Alignment (JMA) is designed to tackle the insufficient synergy problem, which models the joint modality by incorporating language knowledge into the visual modality. Extensive experiments on ModelNet40 and ScanObjectNN demonstrate the effectiveness of our proposed method, JM3D, which achieves state-of-the-art performance in zero-shot 3D classification. JM3D outperforms ULIP by approximately 4.3% on PointMLP and achieves an improvement of up to 6.5% accuracy on PointNet++ in top-1 accuracy for zero-shot 3D classification on ModelNet40. The source code and trained models for all our experiments are publicly available at https://github.com/Mr-Neko/JM3D.

* ACM MM 2023

Improving Human-Object Interaction Detection via Virtual Image Learning

Aug 04, 2023

Shuman Fang, Shuai Liu, Jie Li, Guannan Jiang, Xianming Lin, Rongrong Ji

Abstract:Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects, which plays a crucial role in high-level semantic understanding tasks. However, most works pursue designing better architectures to learn overall features more efficiently, while ignoring the long-tail nature of interaction-object pair categories. In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL). Firstly, a novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images. In this stage, virtual images are generated based on prompts with specific characterizations and selected by multi-filtering processes. Secondly, we use both virtual and real images to train the model with the teacher-student framework. Considering that the initial labels of some virtual images are inaccurate and inadequate, we devise an Adaptive Matching-and-Filtering (AMF) module to construct pseudo-labels. Our method is independent of the internal structure of HOI detectors, so it can be combined with off-the-shelf methods by training merely 10 additional epochs. With the assistance of our method, multiple methods obtain significant improvements, and new state-of-the-art results are achieved on two benchmarks.

* Accepted by ACM MM 2023

Towards General Visual-Linguistic Face Forgery Detection

Jul 31, 2023

Ke Sun, Shen Chen, Taiping Yao, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji

Abstract:Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model. We argue that such supervision lacks semantic information and interpretability. To address this issue, in this paper, we propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation. Since text annotations are not available in current deepfake datasets, VLFFD first generates mixed forgery images with corresponding fine-grained prompts via a Prompt Forgery Image Generator (PFIG). Then, the fine-grained mixed data and the coarse-grained original data are jointly trained with the Coarse-and-Fine Co-training framework (C2F), enabling the model to gain more generalization and interpretability. The experiments show that the proposed method improves existing detection models on several challenging benchmarks.
