Yaowei Li

CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval

Sep 18, 2023
Yating Liu, Yaowei Li, Zimo Liu, Wenming Yang, Yaowei Wang, Qingmin Liao

Text-based Person Retrieval (TBPR) aims to retrieve images of a target person given a textual query. The primary challenge lies in bridging the substantial gap between the vision and language modalities, especially when large-scale training data is limited. In this paper, we introduce a CLIP-based Synergistic Knowledge Transfer (CSKT) approach for TBPR. Specifically, to explore CLIP's knowledge on the input side, we first propose a Bidirectional Prompts Transferring (BPT) module built from text-to-image and image-to-text bidirectional prompts with coupling projections. Secondly, Dual Adapters Transferring (DAT) is designed to transfer knowledge on the output side of the Multi-Head Self-Attention (MHSA) blocks in both the vision and language branches. This synergistic two-way collaborative mechanism promotes early-stage feature fusion and efficiently exploits the existing knowledge of CLIP. CSKT outperforms state-of-the-art approaches across three benchmark datasets while its trainable parameters account for merely 7.4% of the entire model, demonstrating its remarkable efficiency, effectiveness and generalization.
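
To give a concrete (and heavily simplified) flavor of the parameter-efficient transfer described above, the PyTorch sketch below pairs learnable coupled prompts with a bottleneck adapter. It is a minimal sketch rather than the authors' implementation: the module names, dimensions and coupling scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Lightweight residual adapter, meant to wrap the output of a frozen MHSA block."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # only the adapter is trained

class CoupledPrompts(nn.Module):
    """Learnable text prompts coupled to visual prompts through shared projections."""
    def __init__(self, n_prompts: int = 8, txt_dim: int = 512, vis_dim: int = 768):
        super().__init__()
        self.text_prompts = nn.Parameter(torch.randn(n_prompts, txt_dim) * 0.02)
        self.t2v = nn.Linear(txt_dim, vis_dim)  # text-to-image coupling projection
        self.v2t = nn.Linear(vis_dim, txt_dim)  # image-to-text coupling projection

    def forward(self):
        vis_prompts = self.t2v(self.text_prompts)  # prompts injected into the vision branch
        txt_prompts = self.v2t(vis_prompts)        # coupled back into the language branch
        return txt_prompts, vis_prompts

# Usage: prepend the prompts to the token sequences of a frozen CLIP text/vision
# encoder and wrap each MHSA output with a BottleneckAdapter; only the prompts,
# projections and adapters receive gradients.
txt_p, vis_p = CoupledPrompts()()
print(txt_p.shape, vis_p.shape)  # torch.Size([8, 512]) torch.Size([8, 768])
```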

G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory

Aug 18, 2023
Hongxiang Li, Meng Cao, Xuxin Cheng, Yaowei Li, Zhihong Zhu, Yuexian Zou

Recent video grounding works attempt to introduce vanilla contrastive learning into video grounding. However, we claim that this naive solution is suboptimal. Contrastive learning requires two key properties: (1) alignment of the features of similar samples, and (2) uniformity of the induced distribution of the normalized features on the hypersphere. Video grounding suffers from two troublesome issues: (1) some visual entities co-exist in both the ground-truth moment and other moments, i.e., semantic overlapping; (2) only a few moments in each video are annotated, i.e., the sparse annotation dilemma. As a result, vanilla contrastive learning cannot model the correlations between temporally distant moments and learns inconsistent video representations, making it unsuitable for video grounding. In this paper, we introduce Geodesic and Game Localization (G2L), a semantically aligned and uniform video grounding framework based on geodesic distance and game theory. We quantify the correlations among moments using the geodesic distance, which guides the model to learn correct cross-modal representations. Furthermore, from the novel perspective of game theory, we propose a semantic Shapley interaction based on geodesic-distance sampling to learn fine-grained semantic alignment between similar moments. Experiments on three benchmarks demonstrate the effectiveness of our method.

* ICCV2023 oral 
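
To make the geodesic idea concrete, here is a minimal PyTorch sketch of an InfoNCE-style loss in which each negative moment is down-weighted by a precomputed geodesic distance to the ground-truth moment. The weighting function, the distance construction (e.g. shortest paths on a k-NN graph of moment features) and all names are assumptions, and the Shapley-interaction component is not covered.

```python
import torch
import torch.nn.functional as F

def geodesic_weighted_infonce(video_feats, text_feats, geo_dist, tau=0.07):
    """InfoNCE-style loss where each negative moment is down-weighted according to
    its geodesic distance to the ground-truth moment, so semantically overlapping
    moments are not pushed away as hard as truly unrelated ones.

    video_feats, text_feats: (B, D) L2-normalized features of paired moments/queries.
    geo_dist: (B, B) precomputed geodesic distances between moments."""
    logits = text_feats @ video_feats.t() / tau          # (B, B) query-to-moment similarities
    # Map distances to weights: nearby moments get a small negative weight.
    neg_weight = 1.0 - torch.exp(-geo_dist)              # weight 0 on the diagonal (distance 0)
    weights = neg_weight + torch.eye(len(logits))        # positives keep full weight 1
    exp_logits = torch.exp(logits) * weights             # weighted softmax denominator
    log_prob = logits.diag() - torch.log(exp_logits.sum(dim=1))
    return -log_prob.mean()

# Toy usage with random features and distances.
B, D = 4, 16
v = F.normalize(torch.randn(B, D), dim=1)
t = F.normalize(torch.randn(B, D), dim=1)
d = torch.rand(B, B)
d.fill_diagonal_(0.0)
print(geodesic_weighted_infonce(v, t, d))
```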

Efficient Multimodal Fusion via Interactive Prompting

Apr 13, 2023
Yaowei Li, Ruijie Quan, Linchao Zhu, Yi Yang

Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era. Following this trend, the size of multimodal learning models constantly increases, leading to an urgent need to reduce the massive computational cost of finetuning these models for downstream tasks. In this paper, we propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers. Specifically, we first present a modular multimodal fusion framework that exhibits high flexibility and facilitates mutual interactions among different modalities. In addition, we disentangle vanilla prompts into three types in order to learn different optimization objectives for multimodal learning. It is also worth noting that we propose to add prompt vectors only to the deep layers of the unimodal transformers, thus significantly reducing the training memory usage. Experimental results show that our proposed method achieves performance comparable to several other multimodal finetuning methods with less than 3% trainable parameters and up to 66% saving of training memory usage.

* Camera-ready version for CVPR2023 
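
The toy PyTorch sketch below only illustrates the idea of inserting interacting prompt tokens into the deep layers of two unimodal towers; PMF's actual three prompt types and fusion design are not reproduced, and every dimension, layer count and projection here is an assumption.

```python
import torch
import torch.nn as nn

class DeepLayerPromptFusion(nn.Module):
    """Toy two-tower encoder: learnable prompt tokens are inserted only into the deep
    layers, and small projections pass prompt states from one tower to the other to
    approximate prompt-based cross-modal interaction. In practice the towers would
    be frozen, unimodally pre-trained transformers."""
    def __init__(self, dim=256, n_layers=6, fusion_start=4, n_prompts=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.vis_layers = nn.ModuleList(layer() for _ in range(n_layers))
        self.txt_layers = nn.ModuleList(layer() for _ in range(n_layers))
        self.fusion_start = fusion_start
        n_fused = n_layers - fusion_start
        self.vis_prompts = nn.Parameter(torch.randn(n_fused, n_prompts, dim) * 0.02)
        self.txt_prompts = nn.Parameter(torch.randn(n_fused, n_prompts, dim) * 0.02)
        self.v2t = nn.Linear(dim, dim)  # translates vision prompt states for the text tower
        self.t2v = nn.Linear(dim, dim)  # and vice versa

    def forward(self, vis, txt):
        n_p = self.vis_prompts.shape[1]
        for i, (vl, tl) in enumerate(zip(self.vis_layers, self.txt_layers)):
            if i >= self.fusion_start:          # prompts only in the deep layers
                j, b = i - self.fusion_start, vis.size(0)
                vp = self.vis_prompts[j].expand(b, -1, -1)
                tp = self.txt_prompts[j].expand(b, -1, -1)
                vis = vl(torch.cat([self.t2v(tp), vp, vis], dim=1))[:, 2 * n_p:]
                txt = tl(torch.cat([self.v2t(vp), tp, txt], dim=1))[:, 2 * n_p:]
            else:
                vis, txt = vl(vis), tl(txt)
        return vis, txt

model = DeepLayerPromptFusion()
v, t = model(torch.randn(2, 10, 256), torch.randn(2, 12, 256))
print(v.shape, t.shape)  # torch.Size([2, 10, 256]) torch.Size([2, 12, 256])
```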

Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation

Apr 05, 2023
Yaowei Li, Bang Yang, Xuxin Cheng, Zhihong Zhu, Hongxiang Li, Yuexian Zou

Automatic radiology report generation has attracted enormous research interest due to its practical value in reducing the workload of radiologists. However, simultaneously establishing global correspondences between the image (e.g., a chest X-ray) and its related report and local alignments between image patches and keywords remains challenging. To this end, we propose a Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments and introduce three novel modules: the Latent Space Unifier (LSU), the Cross-modal Representation Aligner (CRA) and the Text-to-Image Refiner (TIR). Specifically, LSU unifies multimodal data into discrete tokens, making it flexible to learn common knowledge among modalities with a shared network. The modality-agnostic CRA first learns discriminative features via a set of orthonormal basis vectors and a dual-gate mechanism, and then globally aligns visual and textual representations under a triplet contrastive loss. TIR boosts token-level local alignment by calibrating text-to-image attention with a learnable mask. Additionally, we design a two-stage training procedure to make UAR gradually grasp cross-modal alignments at different levels, which imitates radiologists' workflow: writing sentence by sentence first and then checking word by word. Extensive experiments and analyses on the IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our UAR against varied state-of-the-art methods.

* Try to solve the problem that Google Scholar does not display all authors 
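
As a rough illustration of the orthonormal-basis alignment in CRA (not the paper's code), the sketch below orthonormalizes a learnable basis with a QR decomposition, projects both modalities onto it and applies a simple triplet-style contrastive objective; the dual-gate mechanism, LSU and TIR are omitted, and all hyperparameters are invented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrthonormalBasisAligner(nn.Module):
    """Projects visual and textual features onto a shared set of orthonormal basis
    vectors, then aligns the two modalities with a triplet-style contrastive loss."""
    def __init__(self, dim=512, n_basis=64, margin=0.2):
        super().__init__()
        self.basis = nn.Parameter(torch.randn(n_basis, dim))
        self.margin = margin

    def project(self, x):
        # Orthonormalize the learnable basis on the fly, then express each feature
        # by its coefficients in that basis.
        q, _ = torch.linalg.qr(self.basis.t())   # (dim, n_basis), orthonormal columns
        return F.normalize(x @ q, dim=-1)         # (B, n_basis)

    def forward(self, img_feat, txt_feat):
        zi, zt = self.project(img_feat), self.project(txt_feat)
        sim = zi @ zt.t()
        pos = sim.diag()
        neg = (sim - 10.0 * torch.eye(len(sim))).max(dim=1).values  # hardest in-batch negative
        return F.relu(self.margin - pos + neg).mean()

aligner = OrthonormalBasisAligner()
print(aligner(torch.randn(8, 512), torch.randn(8, 512)))
```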

SSVMR: Saliency-based Self-training for Video-Music Retrieval

Feb 18, 2023
Xuxin Cheng, Zhihong Zhu, Hongxiang Li, Yaowei Li, Yuexian Zou

With the rise of short videos, the demand for selecting appropriate background music (BGM) for a video has increased significantly, and the video-music retrieval (VMR) task has gradually drawn much attention from the research community. As in other cross-modal learning tasks, existing VMR approaches usually attempt to measure the similarity between the video and the music in a feature space. However, they (1) neglect the inevitable label noise, and (2) fail to enhance the ability to capture critical video clips. In this paper, we propose a novel saliency-based self-training framework, termed SSVMR. Specifically, we first make full use of the information contained in the training dataset by applying a semi-supervised method to suppress the adverse impact of label noise, where a self-training approach is adopted. In addition, we propose to capture the saliency of the video by mixing two videos at the span level while preserving the locality of the two original videos. Inspired by back translation in NLP, we also conduct back retrieval to obtain more training data. Experimental results on the MVD dataset show that our SSVMR achieves state-of-the-art performance by a large margin, obtaining a relative improvement of 34.8% over the previous best model in terms of R@1.

* Accepted by ICASSP 2023 
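
A minimal sketch of what span-level video mixing could look like, assuming equal-length clip features; the exact mixing rule, the self-training loop and the back-retrieval step are not shown and would differ in the actual method.

```python
import torch

def span_mix(video_a, video_b, span_ratio=0.3):
    """Mixes two videos at the span level: a contiguous temporal span of video_b is
    pasted into video_a, preserving the locality of both clips. Returns the mixed
    video and the fraction of frames kept from video_a, which can be used to weight
    the retrieval targets of the two source videos.

    video_a, video_b: (T, D) frame/clip features of equal length."""
    T = video_a.size(0)
    span_len = max(1, int(T * span_ratio))
    start = torch.randint(0, T - span_len + 1, (1,)).item()
    mixed = video_a.clone()
    mixed[start:start + span_len] = video_b[start:start + span_len]
    lam = 1.0 - span_len / T
    return mixed, lam

# Toy usage: paste a 30% span of one clip into another.
a, b = torch.randn(16, 128), torch.randn(16, 128)
mixed, lam = span_mix(a, b)
print(mixed.shape, lam)  # torch.Size([16, 128]) 0.75
```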

Generating Templated Caption for Video Grounding

Jan 15, 2023
Hongxiang Li, Meng Cao, Xuxin Cheng, Zhihong Zhu, Yaowei Li, Yuexian Zou

Video grounding aims to locate a moment of interest matching a given query sentence in an untrimmed video. Previous works ignore the sparsity dilemma in video annotations, which fails to provide context between potential events and query sentences in the dataset. In this paper, we contend that providing easily available captions which describe general actions, i.e., templated captions as defined in this paper, significantly boosts performance. To this end, we propose a Templated Caption Network (TCNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions, and then obtain templated captions via Non-Templated Caption Suppression (NTCS). To make better use of templated captions, we propose Caption Guided Attention (CGA) to project the semantic relations between templated captions and query sentences into the temporal space and fuse them into visual representations. Considering the gap between templated captions and ground truth, we propose Asymmetric Dual Matching Supervised Contrastive Learning (ADMSCL), which constructs more negative pairs to maximize cross-modal mutual information. Without bells and whistles, extensive experiments on three public datasets (ActivityNet Captions, TACoS and ActivityNet-CG) demonstrate that our method significantly outperforms state-of-the-art methods.
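
To give a concrete flavor of caption-guided attention (a toy approximation, not TCNet's implementation), the sketch below projects query-caption relation scores onto each caption's temporal span and uses the result to re-weight clip features; the caption spans, scaling and fusion rule are all assumptions.

```python
import torch
import torch.nn.functional as F

def caption_guided_attention(vis_feats, query_feat, cap_feats, cap_spans):
    """Toy caption-guided attention: the semantic relation between the query and each
    templated caption is projected onto the caption's temporal span and used to
    re-weight the visual features.

    vis_feats: (T, D) clip features; query_feat: (D,) sentence feature;
    cap_feats: (N, D) templated-caption features;
    cap_spans: list of N (start, end) clip indices, one span per caption."""
    T = vis_feats.size(0)
    rel = F.softmax(cap_feats @ query_feat / cap_feats.size(1) ** 0.5, dim=0)  # (N,)
    prior = torch.zeros(T)
    for r, (s, e) in zip(rel, cap_spans):
        prior[s:e] += r                              # project relations into temporal space
    prior = prior / (prior.max() + 1e-6)
    return vis_feats * (1.0 + prior.unsqueeze(1))    # fuse guidance into visual features

# Toy usage: 3 templated captions covering different spans of a 20-clip video.
vis, q, caps = torch.randn(20, 256), torch.randn(256), torch.randn(3, 256)
out = caption_guided_attention(vis, q, caps, [(0, 5), (5, 12), (12, 20)])
print(out.shape)  # torch.Size([20, 256])
```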
