Lianli Gao

DETA: Denoised Task Adaptation for Few-Shot Learning

Mar 11, 2023
Ji Zhang, Lianli Gao, Xu Luo, Hengtao Shen, Jingkuan Song

Test-time task adaptation in few-shot learning aims to adapt a pre-trained, task-agnostic model to capture task-specific knowledge of the test task, relying only on a few labeled support samples. Previous approaches generally focus on developing advanced algorithms to achieve this goal while neglecting the inherent problems of the given support samples. In fact, with only a handful of samples available, the adverse effect of either image noise (a.k.a. X-noise) or label noise (a.k.a. Y-noise) in the support samples can be severely amplified. To address this challenge, we propose DEnoised Task Adaptation (DETA), the first unified image- and label-denoising framework, orthogonal to existing task adaptation approaches. Without extra supervision, DETA filters out task-irrelevant, noisy representations by taking advantage of both the global visual information and the local region details of support samples. On the challenging Meta-Dataset, DETA consistently improves the performance of a broad spectrum of baseline methods applied to various pre-trained models. Notably, by tackling the overlooked image noise in Meta-Dataset, DETA establishes new state-of-the-art results. Code is released at https://github.com/nobody-1617/DETA.

* 10 pages, 5 figures 
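
The denoising idea in the abstract can be pictured with a small, self-contained sketch (an illustration under our own assumptions, not the released DETA code): each support feature is scored by its agreement with a leave-one-out class prototype, and low-agreement, likely noisy, features are down-weighted before the final prototypes are formed.

```python
# Minimal sketch of support-sample denoising (illustrative, not the DETA implementation).
import torch
import torch.nn.functional as F

def denoised_prototypes(feats, labels, num_classes, temperature=0.1):
    """feats: (N, D) support features, labels: (N,) class ids; assumes >= 2 shots per class."""
    feats = F.normalize(feats, dim=-1)
    protos = []
    for c in range(num_classes):
        f_c = feats[labels == c]                                   # (n_c, D)
        # leave-one-out prototype for each sample of class c
        loo = (f_c.sum(0, keepdim=True) - f_c) / (f_c.size(0) - 1)
        agreement = (f_c * F.normalize(loo, dim=-1)).sum(-1)       # cosine agreement per sample
        w = torch.softmax(agreement / temperature, dim=0)          # noisy samples get small weight
        protos.append((w.unsqueeze(-1) * f_c).sum(0))
    return torch.stack(protos)                                     # (C, D)

# toy usage: a 5-way 5-shot support set of random features
feats = torch.randn(25, 64)
labels = torch.arange(5).repeat_interleave(5)
print(denoised_prototypes(feats, labels, num_classes=5).shape)     # torch.Size([5, 64])
```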

A Closer Look at Few-shot Classification Again

Feb 02, 2023
Xu Luo, Hao Wu, Ji Zhang, Lianli Gao, Jing Xu, Jingkuan Song

Few-shot classification consists of a training phase, in which a model is learned on a relatively large dataset, and an adaptation phase, in which the learned model is adapted to previously unseen tasks with limited labeled samples. In this paper, we empirically show that the training algorithm and the adaptation algorithm can be completely disentangled, which allows algorithm analysis and design to be done individually for each phase. Our meta-analysis of each phase reveals several interesting insights that may help better understand key aspects of few-shot classification and its connections with other fields such as visual representation learning and transfer learning. We hope the insights and research challenges revealed in this paper can inspire future work in related directions.
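
As a minimal illustration of this decoupling (our own toy example, not the paper's code), any frozen feature extractor from the training phase can be combined with any adaptation algorithm; below, a plain logistic-regression classifier is fitted on the support features of one hypothetical task.

```python
# Sketch of train/adaptation decoupling: frozen features + an off-the-shelf adapter.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def extract_features(images):
    # stand-in for a frozen backbone produced by a separate training phase
    return images.reshape(len(images), -1)

# a toy 5-way 5-shot task: "images" are random arrays here
support_x = rng.normal(size=(25, 8, 8))
support_y = np.repeat(np.arange(5), 5)
query_x = rng.normal(size=(75, 8, 8))
query_y = np.repeat(np.arange(5), 15)

clf = LogisticRegression(max_iter=1000).fit(extract_features(support_x), support_y)
print("query accuracy:", clf.score(extract_features(query_x), query_y))
```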


Visual Commonsense-aware Representation Network for Video Captioning

Nov 17, 2022
Pengpeng Zeng, Haonan Zhang, Lianli Gao, Xiangpeng Li, Jin Qian, Heng Tao Shen

Generating consecutive descriptions for videos, i.e., video captioning, requires taking full advantage of visual representations throughout the generation process. Existing video captioning methods focus on exploring spatial-temporal representations and their relationships to produce inferences. However, such methods only exploit the superficial associations contained in the video itself, without considering the intrinsic visual commonsense knowledge present in the video dataset as a whole, which may limit their ability to reason over that knowledge and generate accurate descriptions. To address this problem, we propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN), for video captioning. Specifically, we construct a Video Dictionary, a plug-and-play component obtained by clustering all video features in the dataset into multiple cluster centers, without additional annotation. Each center implicitly represents a visual commonsense concept in the video domain; these concepts are used in our proposed Visual Concept Selection (VCS) to obtain a video-related concept feature. Next, a Conceptual Integration Generation (CIG) module is proposed to enhance caption generation. Extensive experiments on three public video captioning benchmarks, MSVD, MSR-VTT, and VATEX, demonstrate that our method reaches state-of-the-art performance, indicating its effectiveness. In addition, our approach is integrated into an existing video question answering method and improves its performance, further showing the generalization ability of our method. Source code has been released at https://github.com/zchoi/VCRN.
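
A rough sketch of the Video Dictionary and Visual Concept Selection steps described above (sizes, names, and the plain k-means/nearest-center choices are illustrative assumptions, not the released VCRN implementation):

```python
# Build a "video dictionary" by clustering dataset-level features, then pick
# the nearest centers as concept features for a given video (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
dataset_feats = rng.normal(size=(2000, 512))     # stand-in for pooled features of all training videos

# built once, without extra annotation; each center acts as an implicit commonsense concept
dictionary = KMeans(n_clusters=64, n_init=4, random_state=0).fit(dataset_feats).cluster_centers_

def select_concepts(video_feat, dictionary, k=8):
    """Concept selection: return the k dictionary centers closest to this video's feature."""
    dists = np.linalg.norm(dictionary - video_feat, axis=1)
    return dictionary[np.argsort(dists)[:k]]      # (k, 512) video-related concept features

concepts = select_concepts(rng.normal(size=512), dictionary)
print(concepts.shape)   # (8, 512), to be fused with the caption generator downstream
```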


Progressive Tree-Structured Prototype Network for End-to-End Image Captioning

Nov 17, 2022
Pengpeng Zeng, Jinkuan Zhu, Jingkuan Song, Lianli Gao

Studies of image captioning are shifting toward a fully end-to-end paradigm that leverages powerful visual pre-trained models and transformer-based generation architectures for more flexible model training and faster inference. State-of-the-art approaches simply extract isolated concepts or attributes to assist description generation. However, such approaches do not consider the hierarchical semantic structure of the textual domain, which leads to an unpredictable mapping between visual representations and concept words. To this end, we propose a novel Progressive Tree-Structured prototype Network (dubbed PTSN), which is the first attempt to narrow down the scope of predicted words to those with appropriate semantics by modeling hierarchical textual semantics. Specifically, we design a novel embedding method called tree-structured prototypes, producing a set of hierarchical representative embeddings that capture the hierarchical semantic structure of the textual space. To incorporate such tree-structured prototypes into visual cognition, we also propose a progressive aggregation module that exploits semantic relationships between the image and the prototypes. By applying PTSN to the end-to-end captioning framework, extensive experiments conducted on the MSCOCO dataset show that our method achieves new state-of-the-art performance with 144.2% (single model) and 146.5% (ensemble of 4 models) CIDEr scores on the `Karpathy' split, and 141.4% (c5) and 143.9% (c40) CIDEr scores on the official online test server. Trained models and source code have been released at: https://github.com/NovaMind-Z/PTSN.
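
The tree-structured prototypes can be pictured with a two-level toy example (a sketch under our own simplifying assumptions, e.g. hierarchical k-means over word embeddings and nearest-prototype matching; the actual PTSN design is more involved): coarse prototypes summarize broad semantic groups, each coarse prototype owns a set of fine prototypes, and an image feature is matched coarse-to-fine to progressively narrow the candidate word semantics.

```python
# Two-level prototype hierarchy over word embeddings, matched coarse-to-fine (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
word_embs = rng.normal(size=(5000, 256))        # stand-in for vocabulary word embeddings

coarse = KMeans(n_clusters=16, n_init=4, random_state=0).fit(word_embs)
fine = {
    c: KMeans(n_clusters=8, n_init=2, random_state=0)
        .fit(word_embs[coarse.labels_ == c]).cluster_centers_
    for c in range(16)
}

def progressive_match(img_feat):
    """Pick the closest coarse prototype, then the closest fine prototype under it."""
    c = int(np.argmin(np.linalg.norm(coarse.cluster_centers_ - img_feat, axis=1)))
    f = int(np.argmin(np.linalg.norm(fine[c] - img_feat, axis=1)))
    return coarse.cluster_centers_[c], fine[c][f]

coarse_proto, fine_proto = progressive_match(rng.normal(size=256))
print(coarse_proto.shape, fine_proto.shape)     # (256,) (256,)
```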


A Lower Bound of Hash Codes' Performance

Oct 12, 2022
Xiaosu Zhu, Jingkuan Song, Yu Lei, Lianli Gao, Heng Tao Shen

As a crucial approach for compact representation learning, hashing has achieved great success in both effectiveness and efficiency. Numerous heuristic Hamming-space metric learning objectives have been designed to obtain high-quality hash codes. Nevertheless, a theoretical analysis of the criteria for learning good hash codes remains largely unexplored. In this paper, we prove that inter-class distinctiveness and intra-class compactness among hash codes determine a lower bound on hash codes' performance. Promoting these two characteristics lifts the bound and improves hash learning. We then propose a surrogate model to fully exploit this objective by estimating the posterior of hash codes and controlling it, which results in a low-bias optimization. Extensive experiments reveal the effectiveness of the proposed method. Tested on a series of hashing models, it yields performance improvements across all of them, with up to a $26.5\%$ increase in mean Average Precision and up to a $20.5\%$ increase in accuracy. Our code is publicly available at \url{https://github.com/VL-Group/LBHash}.

* Accepted to NeurIPS 2022 
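
The two quantities in the bound can be made concrete with a short sketch (our own illustration of the stated criteria, not the repository code): intra-class compactness as the mean Hamming distance between codes of the same class, and inter-class distinctiveness as the mean Hamming distance between codes of different classes.

```python
# Measure intra-class compactness and inter-class distinctiveness of binary hash codes.
import numpy as np

def hamming(a, b):
    """Pairwise Hamming distances between two sets of binary codes."""
    return (a[:, None, :] != b[None, :, :]).sum(-1)

def compactness_and_distinctiveness(codes, labels):
    """codes: (N, L) in {0, 1}; labels: (N,). Smaller intra / larger inter is better."""
    d = hamming(codes, codes)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(codes), dtype=bool)
    intra = d[same & off_diag].mean()     # intra-class compactness (want small)
    inter = d[~same].mean()               # inter-class distinctiveness (want large)
    return intra, inter

rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(200, 64))
labels = rng.integers(0, 10, size=200)
print(compactness_and_distinctiveness(codes, labels))  # roughly (32, 32) for random codes
```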

Natural Color Fool: Towards Boosting Black-box Unrestricted Attacks

Oct 05, 2022
Shengming Yuan, Qilong Zhang, Lianli Gao, Yaya Cheng, Jingkuan Song

Unrestricted color attacks, which manipulate the semantically meaningful colors of an image, have shown their stealthiness and success in fooling both human eyes and deep neural networks. However, current works usually sacrifice the flexibility of the unrestricted setting to ensure the naturalness of adversarial examples. As a result, the black-box attack performance of these methods is limited. To boost the transferability of adversarial examples without damaging image quality, we propose a novel Natural Color Fool (NCF), which is guided by realistic color distributions sampled from a publicly available dataset and optimized by our neighborhood search and initialization reset. Through extensive experiments and visualizations, we convincingly demonstrate the effectiveness of the proposed method. Notably, on average, our NCF outperforms state-of-the-art approaches by 15.0%$\sim$32.9% at fooling normally trained models and by 10.0%$\sim$25.3% at evading defense methods. Our code is available at https://github.com/ylhz/Natural-Color-Fool.

* NeurIPS 2022 
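
A heavily simplified sketch of the search over recolorings (our own approximation using per-channel color statistics and a random-candidate search on a toy surrogate; the actual NCF pipeline uses realistic color distributions, a dedicated neighborhood search, and an initialization reset): candidate recolorings are sampled and the one that most confuses the surrogate model is kept.

```python
# Pick the recoloring that maximizes a toy surrogate's loss (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
surrogate = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))

def recolor(img, mean, std):
    """Match per-channel mean/std to a sampled target color distribution."""
    m = img.mean(dim=(2, 3), keepdim=True)
    s = img.std(dim=(2, 3), keepdim=True)
    return ((img - m) / (s + 1e-6) * std + mean).clamp(0, 1)

def natural_color_attack(img, label, n_candidates=20):
    best, best_loss = img, -1.0
    for _ in range(n_candidates):                       # crude stand-in for neighborhood search
        mean = torch.rand(1, 3, 1, 1)
        std = 0.1 + 0.3 * torch.rand(1, 3, 1, 1)
        cand = recolor(img, mean, std)
        loss = F.cross_entropy(surrogate(cand), label).item()
        if loss > best_loss:                            # keep the most confusing recoloring
            best, best_loss = cand, loss
    return best

img, label = torch.rand(1, 3, 32, 32), torch.tensor([3])
print(natural_color_attack(img, label).shape)           # torch.Size([1, 3, 32, 32])
```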

RepParser: End-to-End Multiple Human Parsing with Representative Parts

Aug 27, 2022
Xiaojia Chen, Xuanhan Wang, Lianli Gao, Jingkuan Song

Existing methods for multiple human parsing usually adopt a two-stage strategy (typically top-down or bottom-up), which suffers from either a strong dependence on prior detection or high computational redundancy during post-grouping. In this work, we present an end-to-end multiple human parsing framework using representative parts, termed RepParser. Different from mainstream methods, RepParser solves multiple human parsing in a new single-stage manner, without resorting to person detection or post-grouping. To this end, RepParser decouples the parsing pipeline into instance-aware kernel generation and part-aware human parsing, which are responsible for instance separation and instance-specific part segmentation, respectively. In particular, we empower the parsing pipeline with representative parts, which are characterized by instance-aware keypoints and can be used to dynamically parse each person instance. Specifically, representative parts are obtained by jointly localizing instance centers and estimating keypoints of body-part regions. After that, we dynamically predict instance-aware convolution kernels from the representative parts, thus encoding person-part context into each kernel, which is responsible for casting an image feature into an instance-specific representation. Furthermore, a multi-branch structure is adopted to divide each instance-specific representation into several part-aware representations for separate part segmentation. In this way, RepParser focuses on person instances with the guidance of representative parts and directly outputs parsing results for each person instance, eliminating the requirement for prior detection or post-grouping. Extensive experiments on two challenging benchmarks demonstrate that RepParser is a simple yet effective framework that achieves very competitive performance.
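
The instance-aware kernel idea can be sketched as a dynamic convolution (a toy under our own shape assumptions, not the RepParser architecture itself): an embedding pooled from an instance's representative parts predicts a per-instance convolution kernel, which is then applied to the shared feature map to produce that instance's part logits.

```python
# Toy dynamic-kernel head: one predicted 1x1 kernel per person instance (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceAwareParser(nn.Module):
    def __init__(self, feat_dim=64, num_parts=20):
        super().__init__()
        self.feat_dim, self.num_parts = feat_dim, num_parts
        # maps an instance embedding (e.g. pooled representative-part features)
        # to the weights of an instance-specific 1x1 convolution
        self.kernel_gen = nn.Linear(feat_dim, num_parts * feat_dim)

    def forward(self, feat_map, inst_embs):
        # feat_map: (1, C, H, W) shared image features; inst_embs: (N, C) instance embeddings
        outs = []
        for emb in inst_embs:
            w = self.kernel_gen(emb).view(self.num_parts, self.feat_dim, 1, 1)
            outs.append(F.conv2d(feat_map, w))           # (1, num_parts, H, W)
        return torch.cat(outs, dim=0)                    # (N, num_parts, H, W)

parser = InstanceAwareParser()
logits = parser(torch.randn(1, 64, 32, 32), torch.randn(3, 64))
print(logits.shape)    # torch.Size([3, 20, 32, 32]) -> per-instance part logits
```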


Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning

Aug 17, 2022
Tao He, Lianli Gao, Jingkuan Song, Yuan-Fang Li

Scene graph generation (SGG) is a fundamental task aimed at detecting visual relations between objects in an image. The prevailing SGG methods require all object classes to be given in the training set. Such a closed setting limits the practical application of SGG. In this paper, we introduce open-vocabulary scene graph generation (Ov-SGG), a novel, realistic and challenging setting in which a model is trained on a set of base object classes but is required to infer relations for unseen target object classes. To this end, we propose a two-step method that first pre-trains on large amounts of coarse-grained region-caption data and then leverages two prompt-based techniques to finetune the pre-trained model without updating its parameters. Moreover, our method can support inference over completely unseen object classes, which existing methods are incapable of handling. In extensive experiments on three benchmark datasets, Visual Genome, GQA, and Open-Image, our method significantly outperforms recent strong SGG methods in both the Ov-SGG setting and the conventional closed setting.
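
Prompt-based finetuning without touching the pre-trained weights can be pictured with a generic sketch (our own minimal stand-in with a toy frozen encoder and a hypothetical relation head, not the paper's model): a small set of learnable prompt vectors is prepended to the encoder input, and only those prompts plus a light head receive gradients.

```python
# Learnable prompts + frozen encoder: only the prompts and the head are optimized.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64
encoder = nn.TransformerEncoder(                       # stand-in for a frozen pre-trained model
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
for p in encoder.parameters():
    p.requires_grad_(False)                            # pre-trained weights are never updated

prompts = nn.Parameter(torch.randn(8, d_model) * 0.02)  # learnable prompt tokens
rel_head = nn.Linear(d_model, 51)                        # hypothetical: 50 relations + background

def predict_relation(pair_tokens):
    # pair_tokens: (B, T, d_model) features of a subject-object pair
    x = torch.cat([prompts.expand(pair_tokens.size(0), -1, -1), pair_tokens], dim=1)
    return rel_head(encoder(x)[:, 0])                  # read out from the first prompt position

opt = torch.optim.AdamW([prompts, *rel_head.parameters()], lr=1e-3)
logits = predict_relation(torch.randn(4, 6, d_model))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 51, (4,)))
loss.backward()
opt.step()
print(logits.shape)   # torch.Size([4, 51])
```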


Dual-branch Hybrid Learning Network for Unbiased Scene Graph Generation

Jul 16, 2022
Chaofan Zheng, Lianli Gao, Xinyu Lyu, Pengpeng Zeng, Abdulmotaleb El Saddik, Heng Tao Shen

Current studies of Scene Graph Generation (SGG) focus on solving the long-tail problem to generate unbiased scene graphs. However, most de-biasing methods overemphasize tail predicates and underestimate head ones throughout training, thereby wrecking the representation ability of head-predicate features. Furthermore, these impaired head-predicate features harm the learning of tail predicates. In fact, the inference of tail predicates heavily depends on the general patterns learned from head ones, e.g., "standing on" depends on "on". Thus, such de-biasing SGG methods achieve neither excellent performance on tail predicates nor satisfactory behavior on head ones. To address this issue, we propose a Dual-branch Hybrid Learning network (DHL) that takes care of both head and tail predicates for SGG, consisting of a Coarse-grained Learning Branch (CLB) and a Fine-grained Learning Branch (FLB). Specifically, the CLB is responsible for learning expert, robust features of head predicates, while the FLB is expected to predict informative tail predicates. Furthermore, DHL is equipped with a Branch Curriculum Schedule (BCS) that makes the two branches work well together. Experiments show that our approach achieves new state-of-the-art performance on the VG and GQA datasets and strikes a trade-off between the performance on tail predicates and head ones. Moreover, extensive experiments on two downstream tasks (i.e., Image Captioning and Sentence-to-Graph Retrieval) further verify the generalization and practicality of our method.
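
The dual-branch idea with a curriculum weight can be sketched as follows (a toy under our own assumptions about a linear blending rule; the actual CLB/FLB designs and BCS schedule are more involved): the two branches produce separate predicate logits, and a schedule gradually shifts the blend from the head-friendly branch toward the tail-friendly one over training.

```python
# Toy coarse/fine predicate heads blended by a curriculum weight (illustrative only).
import torch
import torch.nn as nn

class DualBranchHead(nn.Module):
    def __init__(self, feat_dim=256, num_predicates=51):
        super().__init__()
        self.coarse = nn.Linear(feat_dim, num_predicates)   # head-predicate friendly (CLB-like)
        self.fine = nn.Linear(feat_dim, num_predicates)     # tail-predicate friendly (FLB-like)

    def forward(self, x, epoch, total_epochs):
        alpha = min(1.0, epoch / total_epochs)              # crude stand-in for the BCS schedule
        return (1 - alpha) * self.coarse(x) + alpha * self.fine(x)

head = DualBranchHead()
feats = torch.randn(8, 256)
early = head(feats, epoch=1, total_epochs=30)
late = head(feats, epoch=25, total_epochs=30)
print(early.shape, late.shape)   # torch.Size([8, 51]) torch.Size([8, 51])
```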
