Alert button
Picture for Rong-Cheng Tu

Rong-Cheng Tu

Alert button

Global and Local Semantic Completion Learning for Vision-Language Pre-training

Jun 12, 2023
Rong-Cheng Tu, Yatai Ji, Jie Jiang, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Yujiu Yang, Wei Liu

Figure 1 for Global and Local Semantic Completion Learning for Vision-Language Pre-training
Figure 2 for Global and Local Semantic Completion Learning for Vision-Language Pre-training
Figure 3 for Global and Local Semantic Completion Learning for Vision-Language Pre-training
Figure 4 for Global and Local Semantic Completion Learning for Vision-Language Pre-training

Cross-modal alignment plays a crucial role in vision-language pre-training (VLP) models, enabling them to capture meaningful associations across different modalities. For this purpose, inspired by the success of masked language modeling (MLM) tasks in the NLP pre-training area, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to focus on reconstructing the masked tokens based on visible context for learning local-local alignment, i.e., associations between image patches and text tokens. However, most of them pay little attention to the global semantic features generated for the masked data, resulting in a limited cross-modal alignment ability of global representations to local features of the other modality. Therefore, in this paper, we propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously. Specifically, the GLSCL task complements the missing semantics of masked data and recovers global and local features by cross-modal interactions. Our GLSCL consists of masked global semantic completion (MGSC) and masked local token completion (MLTC). MGSC promotes learning more representative global features which have a great impact on the performance of downstream tasks, and MLTC can further enhance accurate comprehension on multimodal data. Moreover, we present a flexible vision encoder, enabling our model to simultaneously perform image-text and video-text multimodal tasks. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.

* arXiv admin note: substantial text overlap with arXiv:2211.13437 
Viaarxiv icon

Unsupervised Hashing with Semantic Concept Mining

Sep 23, 2022
Rong-Cheng Tu, Xian-Ling Mao, Kevin Qinghong Lin, Chengfei Cai, Weize Qin, Hongfa Wang, Wei Wei, Heyan Huang

Figure 1 for Unsupervised Hashing with Semantic Concept Mining
Figure 2 for Unsupervised Hashing with Semantic Concept Mining
Figure 3 for Unsupervised Hashing with Semantic Concept Mining
Figure 4 for Unsupervised Hashing with Semantic Concept Mining

Recently, to improve the unsupervised image retrieval performance, plenty of unsupervised hashing methods have been proposed by designing a semantic similarity matrix, which is based on the similarities between image features extracted by a pre-trained CNN model. However, most of these methods tend to ignore high-level abstract semantic concepts contained in images. Intuitively, concepts play an important role in calculating the similarity among images. In real-world scenarios, each image is associated with some concepts, and the similarity between two images will be larger if they share more identical concepts. Inspired by the above intuition, in this work, we propose a novel Unsupervised Hashing with Semantic Concept Mining, called UHSCM, which leverages a VLP model to construct a high-quality similarity matrix. Specifically, a set of randomly chosen concepts is first collected. Then, by employing a vision-language pretraining (VLP) model with the prompt engineering which has shown strong power in visual representation learning, the set of concepts is denoised according to the training images. Next, the proposed method UHSCM applies the VLP model with prompting again to mine the concept distribution of each image and construct a high-quality semantic similarity matrix based on the mined concept distributions. Finally, with the semantic similarity matrix as guiding information, a novel hashing loss with a modified contrastive loss based regularization item is proposed to optimize the hashing network. Extensive experiments on three benchmark datasets show that the proposed method outperforms the state-of-the-art baselines in the image retrieval task.

Viaarxiv icon

HunYuan_tvr for Text-Video Retrivial

Apr 14, 2022
Shaobo Min, Weijie Kong, Rong-Cheng Tu, Dihong Gong, Chengfei Cai, Wenzhe Zhao, Chenyang Liu, Sixiao Zheng, Hongfa Wang, Zhifeng Li, Wei Liu

Figure 1 for HunYuan_tvr for Text-Video Retrivial
Figure 2 for HunYuan_tvr for Text-Video Retrivial
Figure 3 for HunYuan_tvr for Text-Video Retrivial
Figure 4 for HunYuan_tvr for Text-Video Retrivial

Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while ignoring fine-grained cross-modal relationships, e.g., short clips and phrases or single frame and word. In this paper, we propose a novel method, named HunYuan\_tvr, to explore hierarchical cross-modal interactions by simultaneously exploring video-sentence, clip-phrase, and frame-word relationships. Considering intrinsic semantic relations between frames, HunYuan\_tvr first performs self-attention to explore frame-wise correlations and adaptively clusters correlated frames into clip-level representations. Then, the clip-wise correlation is explored to aggregate clip representations into a compact one to describe the video globally. In this way, we can construct hierarchical video representations for frame-clip-video granularities, and also explore word-wise correlations to form word-phrase-sentence embeddings for the text modality. Finally, hierarchical contrastive learning is designed to explore cross-modal relationships,~\emph{i.e.,} frame-word, clip-phrase, and video-sentence, which enables HunYuan\_tvr to achieve a comprehensive multi-modal understanding. Further boosted by adaptive label denosing and marginal sample enhancement, HunYuan\_tvr obtains new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 57.8%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet respectively.

Viaarxiv icon

Deep Cross-modal Proxy Hashing

Nov 06, 2020
Rong-Cheng Tu, Xian-Ling Mao, Rongxin Tu, Wei Wei, Heyan Huang

Figure 1 for Deep Cross-modal Proxy Hashing
Figure 2 for Deep Cross-modal Proxy Hashing
Figure 3 for Deep Cross-modal Proxy Hashing
Figure 4 for Deep Cross-modal Proxy Hashing

Due to their high retrieval efficiency and low storage cost for cross-modal search task, cross-modal hashing methods have attracted considerable attention. For supervised cross-modal hashing methods, how to make the learned hash codes preserve semantic structure information sufficiently is a key point to further enhance the retrieval performance. As far as we know, almost all supervised cross-modal hashing methods preserve semantic structure information depending on at-least-one similarity definition fully or partly, i.e., it defines two datapoints as similar ones if they share at least one common category otherwise they are dissimilar. Obviously, the at-least-one similarity misses abundant semantic structure information. To tackle this problem, in this paper, we propose a novel Deep Cross-modal Proxy Hashing, called DCPH. Specifically, DCPH first learns a proxy hashing network to generate a discriminative proxy hash code for each category. Then, by utilizing the learned proxy hash code as supervised information, a novel $Margin$-$SoftMax$-$like\ loss$ is proposed without defining the at-least-one similarity between datapoints. By minimizing the novel $Margin$-$SoftMax$-$like\ loss$, the learned hash codes will simultaneously preserve the cross-modal similarity and abundant semantic structure information well. Extensive experiments on two benchmark datasets show that the proposed method outperforms the state-of-the-art baselines in cross-modal retrieval task.

Viaarxiv icon

Object Detection based Deep Unsupervised Hashing

Nov 24, 2018
Rong-Cheng Tu, Xian-Ling Mao, Bo-Si Feng, Bing-Bing Bian, Yu-shu Ying

Figure 1 for Object Detection based Deep Unsupervised Hashing
Figure 2 for Object Detection based Deep Unsupervised Hashing
Figure 3 for Object Detection based Deep Unsupervised Hashing
Figure 4 for Object Detection based Deep Unsupervised Hashing

Recently, similarity-preserving hashing methods have been extensively studied for large-scale image retrieval. Compared with unsupervised hashing, supervised hashing methods for labeled data have usually better performance by utilizing semantic label information. Intuitively, for unlabeled data, it will improve the performance of unsupervised hashing methods if we can first mine some supervised semantic 'label information' from unlabeled data and then incorporate the 'label information' into the training process. Thus, in this paper, we propose a novel Object Detection based Deep Unsupervised Hashing method (ODDUH). Specifically, a pre-trained object detection model is utilized to mining supervised 'label information', which is used to guide the learning process to generate high-quality hash codes.Extensive experiments on two public datasets demonstrate that the proposed method outperforms the state-of-the-art unsupervised hashing methods in the image retrieval task.

Viaarxiv icon