
Hongtao Xie


Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval

Oct 12, 2023
Pandeng Li, Hongtao Xie, Jiannan Ge, Lei Zhang, Shaobo Min, Yongdong Zhang

Unsupervised video hashing usually optimizes binary codes by learning to reconstruct input videos. Such a reconstruction constraint spends much effort on frame-level temporal context changes without focusing on the video-level global semantics that are more useful for retrieval. Hence, we address this problem by decomposing video information into reconstruction-dependent and semantic-dependent information, which disentangles semantic extraction from the reconstruction constraint. Specifically, we first design a simple dual-stream structure, including a temporal layer and a hash layer. Then, with the help of semantic similarity knowledge obtained from self-supervision, the hash layer learns to capture information for semantic retrieval, while the temporal layer learns to capture information for reconstruction. In this way, the model naturally preserves the disentangled semantics in binary codes. Validated by comprehensive experiments, our method consistently outperforms state-of-the-art methods on three video benchmarks.
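
To make the dual-stream idea concrete, here is a minimal PyTorch sketch (not the authors' released code): a hash layer produces a video-level relaxed binary code, while a temporal layer keeps frame-level context, and reconstruction draws on both streams so the hash stream is free to keep global semantics. All module names, dimensions, and the GRU/decoder choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualStreamHasher(nn.Module):
    def __init__(self, feat_dim=2048, code_bits=64, hidden=512):
        super().__init__()
        # hash stream: pools frames into one video-level code
        self.hash_layer = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, code_bits))
        # temporal stream: keeps frame-level context for reconstruction
        self.temporal_layer = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden + code_bits, feat_dim)

    def forward(self, frames):                          # frames: (B, T, feat_dim)
        video_feat = frames.mean(dim=1)                 # global pooling over frames
        code = torch.tanh(self.hash_layer(video_feat))  # relaxed binary code
        temporal, _ = self.temporal_layer(frames)       # frame-level temporal context
        # reconstruction uses both streams, disentangling semantics from reconstruction
        recon = self.decoder(torch.cat(
            [temporal, code.unsqueeze(1).expand(-1, frames.size(1), -1)], dim=-1))
        return torch.sign(code).detach(), recon         # binary code for retrieval, recon for training
```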

* 17 pages, 8 figures, ECCV 2022 

Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition

Oct 10, 2023
Zixiao Wang, Hongtao Xie, Yuxin Wang, Jianjun Xu, Boqiang Zhang, Yongdong Zhang


In this paper, we explore the potential of the Contrastive Language-Image Pretraining (CLIP) model in scene text recognition (STR), and establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR) to leverage both the visual and linguistic knowledge in CLIP. Different from previous CLIP-based methods, which mainly consider feature generalization on the visual encoding side, we propose a symmetrical distillation strategy (SDS) that further captures the linguistic knowledge in the CLIP text encoder. By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow that covers not only visual but also linguistic information for distillation. Benefiting from the natural alignment in CLIP, such a guidance flow provides a progressive optimization objective from vision to language, which can supervise the STR feature forwarding process layer by layer. Besides, a new Linguistic Consistency Loss (LCL) is proposed to enhance the linguistic capability by considering second-order statistics during optimization. Overall, CLIP-OCR is the first to design a smooth transition between image and text for the STR task. Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks. Code will be available at https://github.com/wzx99/CLIPOCR.
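
As a rough illustration of the layer-by-layer guidance and the second-order consistency idea, here is a hedged PyTorch sketch. The teacher ordering (CLIP image encoder followed by the reversed text encoder), the MSE objective, and the Gram-matrix form of the consistency term are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def symmetrical_distillation_loss(student_feats, teacher_feats, projections):
    """student_feats: list of (B, N, C) STR stage features;
    teacher_feats: list of (B, D) frozen CLIP features ordered from visual to linguistic;
    projections: list of nn.Linear(C, D) mapping student dims to the teacher dim."""
    loss = 0.0
    for s, t, proj in zip(student_feats, teacher_feats, projections):
        s = proj(s.mean(dim=1))                  # pool tokens, map to teacher dimension
        loss = loss + F.mse_loss(s, t.detach())  # teacher is frozen
    return loss / len(student_feats)

def linguistic_consistency_loss(student_feat, teacher_feat):
    """Second-order statistics: match feature-correlation (Gram) matrices.
    Both inputs are assumed to be (B, N, D) in a shared embedding space."""
    def gram(x):
        x = F.normalize(x, dim=-1)
        return x.transpose(1, 2) @ x / x.size(1)
    return F.mse_loss(gram(student_feat), gram(teacher_feat).detach())
```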

* Accepted by ACM MM 2023 

Learning Complete Topology-Aware Correlations Between Relations for Inductive Link Prediction

Sep 20, 2023
Jie Wang, Hanzhu Chen, Qitan Lv, Zhihao Shi, Jiajun Chen, Huarui He, Hongtao Xie, Yongdong Zhang, Feng Wu


Inductive link prediction -- where entities during training and inference stages can be different -- has shown great potential for completing evolving knowledge graphs in an entity-independent manner. Many popular methods mainly focus on modeling graph-level features, while edge-level interactions -- especially the semantic correlations between relations -- have been less explored. However, we notice a desirable property of semantic correlations between relations: they are inherently edge-level and entity-independent. This implies the great potential of semantic correlations for the entity-independent inductive link prediction task. Inspired by this observation, we propose a novel subgraph-based method, namely TACO, to model Topology-Aware COrrelations between relations that are highly correlated to their topological structures within subgraphs. Specifically, we prove that the semantic correlation between any two relations can be categorized into seven topological patterns, and then propose a Relational Correlation Network (RCN) to learn the importance of each pattern. To further exploit the potential of RCN, we propose a Complete Common Neighbor induced subgraph that can effectively preserve complete topological patterns within the subgraph. Extensive experiments demonstrate that TACO effectively unifies graph-level information and edge-level interactions to jointly perform reasoning, leading to superior performance over existing state-of-the-art methods for the inductive link prediction task.
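
The "learn a weight per topological pattern" idea can be sketched as follows. The paper proves that relation-relation correlations fall into seven topological patterns; the pattern index below is treated as a given input, and the scoring function is a hypothetical stand-in rather than the paper's RCN architecture.

```python
import torch
import torch.nn as nn

class RelationalCorrelationNet(nn.Module):
    """Scores the correlation between two relations, scaled by a learned
    importance weight for the topological pattern that connects them."""
    def __init__(self, num_relations, dim=32, num_patterns=7):
        super().__init__()
        self.rel_emb = nn.Embedding(num_relations, dim)       # entity-independent relation embeddings
        self.pattern_weight = nn.Parameter(torch.ones(num_patterns))  # one importance per pattern

    def forward(self, rel_a, rel_b, pattern_id):
        # rel_a, rel_b, pattern_id: (B,) long tensors for a batch of relation pairs
        score = (self.rel_emb(rel_a) * self.rel_emb(rel_b)).sum(-1)
        return torch.sigmoid(self.pattern_weight[pattern_id] * score)
```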

* arXiv admin note: text overlap with arXiv:2103.03642 

TextPainter: Multimodal Text Image Generation with Visual-harmony and Text-comprehension for Poster Design

Aug 13, 2023
Yifan Gao, Jinpeng Lin, Min Zhou, Chuanbin Liu, Hongtao Xie, Tiezheng Ge, Yuning Jiang


Text design is one of the most critical procedures in poster design, as it relies heavily on human creativity and expertise to design text images that account for visual harmony and text semantics. This study introduces TextPainter, a novel multimodal approach that leverages contextual visual information and corresponding text semantics to generate text images. Specifically, TextPainter takes the global-local background image as a hint of style and guides text image generation toward visual harmony. Furthermore, we leverage a language model and introduce a text comprehension module to achieve both sentence-level and word-level style variations. Besides, we construct the PosterT80K dataset, consisting of about 80K posters annotated with sentence-level bounding boxes and text contents. We hope this dataset will pave the way for further research on multimodal text image generation. Extensive quantitative and qualitative experiments demonstrate that TextPainter can generate visually and semantically harmonious text images for posters.
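
A minimal sketch of the "global-local background as a style hint" idea: features of the whole poster and of the local text region are fused into a style vector that would condition the text-image generator. The encoder layers and dimensions are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class StyleHint(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.global_enc = nn.Sequential(nn.Conv2d(3, dim, 4, 4), nn.AdaptiveAvgPool2d(1))
        self.local_enc = nn.Sequential(nn.Conv2d(3, dim, 4, 4), nn.AdaptiveAvgPool2d(1))
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, poster, text_region):
        g = self.global_enc(poster).flatten(1)        # overall visual harmony of the poster
        l = self.local_enc(text_region).flatten(1)    # local background around the text box
        return self.fuse(torch.cat([g, l], dim=-1))   # style vector conditioning the generator
```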

* Accepted to ACM MM 2023. Dataset Link: https://tianchi.aliyun.com/dataset/160034 

Balanced Classification: A Unified Framework for Long-Tailed Object Detection

Aug 04, 2023
Tianhao Qi, Hongtao Xie, Pandeng Li, Jiannan Ge, Yongdong Zhang


Conventional detectors suffer from performance degradation when dealing with long-tailed data due to a classification bias towards the majority head categories. In this paper, we contend that the learning bias originates from two factors: 1) the unequal competition arising from the imbalanced distribution of foreground categories, and 2) the lack of sample diversity in tail categories. To tackle these issues, we introduce a unified framework called BAlanced CLassification (BACL), which enables adaptive rectification of inequalities caused by disparities in category distribution and dynamic intensification of sample diversities in a synchronized manner. Specifically, a novel foreground classification balance loss (FCBL) is developed to ameliorate the domination of head categories and shift attention to difficult-to-differentiate categories by introducing pairwise class-aware margins and auto-adjusted weight terms, respectively. This loss prevents the over-suppression of tail categories in the context of unequal competition. Moreover, we propose a dynamic feature hallucination module (FHM), which enhances the representation of tail categories in the feature space by synthesizing hallucinated samples to introduce additional data variances. In this divide-and-conquer approach, BACL sets a new state-of-the-art on the challenging LVIS benchmark with a decoupled training pipeline, surpassing vanilla Faster R-CNN with ResNet-50-FPN by 5.8% AP and 16.1% AP for overall and tail categories. Extensive experiments demonstrate that BACL consistently achieves performance improvements across various datasets with different backbones and architectures. Code and models are available at https://github.com/Tianhao-Qi/BACL.
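
The pairwise class-aware margin and auto-adjusted weighting can be sketched roughly as below. The concrete margin and weight formulas (based on training-set class frequencies) are illustrative assumptions, not the paper's FCBL definition.

```python
import torch
import torch.nn.functional as F

def balanced_fg_loss(logits, labels, class_freq, margin_scale=0.1):
    """logits: (B, C) foreground logits; labels: (B,); class_freq: (C,) per-class counts."""
    log_freq = class_freq.float().clamp(min=1).log()                     # (C,)
    # pairwise margin: how much more frequent a competing class is than the ground-truth class
    pair_margin = (log_freq.unsqueeze(0) - log_freq[labels].unsqueeze(1)).clamp(min=0)  # (B, C)
    one_hot = F.one_hot(labels, logits.size(1)).bool()
    # relax the unequal competition: suppress over-frequent competitors by their margin
    adjusted = torch.where(one_hot, logits, logits - margin_scale * pair_margin)
    # auto-adjusted per-class weights keep tail categories from being drowned out
    weights = class_freq.float().clamp(min=1).reciprocal()
    weights = weights / weights.sum() * logits.size(1)
    return F.cross_entropy(adjusted, labels, weight=weights)
```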

* Accepted by IEEE Transactions on Multimedia, to be published; Code: https://github.com/Tianhao-Qi/BACL 

MomentDiff: Generative Video Moment Retrieval from Random to Real

Jul 06, 2023
Pandeng Li, Chen-Wei Xie, Hongtao Xie, Liming Zhao, Lei Zhang, Yun Zheng, Deli Zhao, Yongdong Zhang


Video moment retrieval pursues an efficient and generalized solution to identify the specific temporal segments within an untrimmed video that correspond to a given language description. To achieve this goal, we provide a generative diffusion-based framework called MomentDiff, which simulates a typical human retrieval process from random browsing to gradual localization. Specifically, we first diffuse the real span to random noise, and learn to denoise the random noise back to the original span with the guidance of similarity between text and video. This allows the model to learn a mapping from arbitrary random locations to real moments, enabling it to locate segments from random initialization. Once trained, MomentDiff can sample random temporal segments as initial guesses and iteratively refine them to generate an accurate temporal boundary. Different from discriminative works (e.g., those based on learnable proposals or queries), MomentDiff with randomly initialized spans can resist the temporal location biases present in datasets. To evaluate the influence of these biases, we propose two anti-bias datasets with location distribution shifts, named Charades-STA-Len and Charades-STA-Mom. The experimental results demonstrate that our efficient framework consistently outperforms state-of-the-art methods on three public benchmarks, and exhibits better generalization and robustness on the proposed anti-bias datasets. The code, model, and anti-bias evaluation datasets are available at https://github.com/IMCCretrieval/MomentDiff.
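
A minimal sketch of the diffuse-then-denoise idea for moment spans: a ground-truth (center, width) span is noised with a standard DDPM-style forward process, and a conditional network learns to recover the clean span. The denoiser and the fused text-video conditioning feature are placeholders, not the released model.

```python
import torch
import torch.nn as nn

def diffuse_span(span, t, alphas_cumprod):
    """span: (B, 2) normalized (center, width); t: (B,) timesteps; q(x_t | x_0)."""
    noise = torch.randn_like(span)
    a = alphas_cumprod[t].unsqueeze(-1)                  # (B, 1)
    return a.sqrt() * span + (1 - a).sqrt() * noise, noise

class SpanDenoiser(nn.Module):
    def __init__(self, cond_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 + 1 + cond_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))

    def forward(self, noisy_span, t, video_text_feat):
        # condition on the fused text-video similarity feature and the timestep
        x = torch.cat([noisy_span, t.float().unsqueeze(-1), video_text_feat], dim=-1)
        return self.net(x)                               # predicted clean (center, width) span
```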

* 12 pages, 5 figures 

Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition

May 10, 2023
Boqiang Zhang, Hongtao Xie, Yuxin Wang, Jianjun Xu, Yongdong Zhang


Vision models have gained increasing attention due to their simplicity and efficiency in the Scene Text Recognition (STR) task. However, lacking the perception of linguistic knowledge, recent vision models suffer from two problems: (1) the pure vision-based query results in attention drift, which usually causes poor recognition and is summarized in this paper as the linguistic insensitive drift (LID) problem; (2) the visual feature is suboptimal for recognition in some vision-missing cases (e.g., occlusion). To address these issues, we propose a Linguistic Perception Vision model (LPV), which explores the linguistic capability of the vision model for accurate text recognition. To alleviate the LID problem, we introduce a Cascade Position Attention (CPA) mechanism that obtains high-quality and accurate attention maps through step-wise optimization and linguistic information mining. Furthermore, a Global Linguistic Reconstruction Module (GLRM) is proposed to improve the representation of visual features by perceiving linguistic information in the visual space, gradually converting visual features into semantically rich ones during the cascade process. Different from previous methods, our method obtains SOTA results while keeping low complexity (92.4% accuracy with only 8.11M parameters). Code is available at https://github.com/CyrilSterling/LPV.
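
A minimal sketch of the cascade idea: learnable position queries attend to the visual feature map over several stages, and each stage reuses the previous stage's output as its query, so the attention maps are refined step by step. The dimensions and the use of nn.MultiheadAttention are illustrative assumptions, not the paper's CPA implementation.

```python
import torch
import torch.nn as nn

class CascadePositionAttention(nn.Module):
    def __init__(self, dim=256, max_chars=25, stages=3):
        super().__init__()
        self.pos_query = nn.Parameter(torch.randn(max_chars, dim))   # one query per character slot
        self.stages = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            for _ in range(stages))

    def forward(self, visual_feat):                   # visual_feat: (B, H*W, dim)
        q = self.pos_query.unsqueeze(0).expand(visual_feat.size(0), -1, -1)
        for attn in self.stages:
            q, _ = attn(q, visual_feat, visual_feat)  # step-wise refinement of character queries
        return q                                      # (B, max_chars, dim)
```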

* Accepted to IJCAI 2023 

TPS++: Attention-Enhanced Thin-Plate Spline for Scene Text Recognition

May 09, 2023
Tianlun Zheng, Zhineng Chen, Jinfeng Bai, Hongtao Xie, Yu-Gang Jiang


Text irregularities pose significant challenges to scene text recognizers. Thin-Plate Spline (TPS)-based rectification is widely regarded as an effective means to deal with them. Currently, the calculation of TPS transformation parameters depends purely on the quality of regressed text borders. It ignores the text content and often leads to unsatisfactory rectified results for severely distorted text. In this work, we introduce TPS++, an attention-enhanced TPS transformation that incorporates the attention mechanism into text rectification for the first time. TPS++ formulates the parameter calculation as a joint process of foreground control point regression and content-based attention score estimation, computed by a dedicatedly designed gated-attention block. TPS++ builds a more flexible content-aware rectifier, generating a natural text correction that is easier for the subsequent recognizer to read. Moreover, TPS++ partially shares the feature backbone with the recognizer and implements rectification at the feature level rather than the image level, incurring only a small overhead in terms of parameters and inference time. Experiments on public benchmarks show that TPS++ consistently improves recognition and achieves state-of-the-art accuracy. Meanwhile, it generalizes well to different backbones and recognizers. Code is at https://github.com/simplify23/TPS_PP.
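
Below is a hedged sketch of fusing regressed control points with content-based corrections through a gate, in the spirit of the gated-attention block. The shapes and the gating formula are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedControlPointHead(nn.Module):
    def __init__(self, feat_dim=256, num_ctrl_points=20):
        super().__init__()
        self.point_head = nn.Linear(feat_dim, 2 * num_ctrl_points)   # border-based regression
        self.attn_head = nn.Linear(feat_dim, 2 * num_ctrl_points)    # content-based offsets
        self.gate = nn.Linear(feat_dim, 2 * num_ctrl_points)

    def forward(self, feat):                        # feat: (B, feat_dim) pooled backbone feature
        points = self.point_head(feat)              # geometry-only estimate of control points
        offsets = self.attn_head(feat)              # content-aware correction
        g = torch.sigmoid(self.gate(feat))          # how much content should override geometry
        return points + g * offsets                 # control points fed to the TPS transform
```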

* Accepted by IJCAI 2023 

ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting

Dec 12, 2022
Shancheng Fang, Zhendong Mao, Hongtao Xie, Yuxin Wang, Chenggang Yan, Yongdong Zhang


Scene text spotting is of great importance to the computer vision community due to its wide variety of applications. Recent methods attempt to introduce linguistic knowledge for challenging recognition rather than relying on pure visual classification. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting. Firstly, the autonomous design enforces explicit language modeling by decoupling the recognizer into a vision model and a language model and blocking gradient flow between the two. Secondly, a novel bidirectional cloze network (BCN) is proposed as the language model, based on bidirectional feature representation. Thirdly, we propose an iterative correction scheme for the language model which can effectively alleviate the impact of noisy input. Finally, to polish ABINet++ for long text recognition, we aggregate horizontal features by embedding Transformer units inside a U-Net, and design a position and content attention module that integrates character order and content to attend to character features precisely. ABINet++ achieves state-of-the-art performance on both scene text recognition and scene text spotting benchmarks, which consistently demonstrates the superiority of our method in various environments, especially on low-quality images. Besides, extensive experiments on both English and Chinese also prove that a text spotter incorporating our language modeling method can significantly improve its performance in both accuracy and speed compared with commonly used attention-based recognizers.
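
A minimal sketch of the autonomous plus iterative scheme: the language model only sees the recognizer's detached probability output (so gradients are blocked and it is trained as an explicit spelling model), and its correction is fed back for a fixed number of iterations. The two sub-models are placeholders for the paper's vision model and BCN.

```python
import torch.nn as nn

class IterativeSpotter(nn.Module):
    def __init__(self, vision_model: nn.Module, language_model: nn.Module, iters=3):
        super().__init__()
        self.vision_model = vision_model      # image -> (B, T, num_classes) character logits
        self.language_model = language_model  # probabilities -> refined character logits
        self.iters = iters

    def forward(self, image):
        logits = self.vision_model(image)
        for _ in range(self.iters):
            # block gradient flow so the language model is optimized as an explicit LM
            probs = logits.softmax(dim=-1).detach()
            logits = self.language_model(probs)   # iterative correction of noisy input
        return logits
```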

* Accepted by TPAMI. Code is available at https://github.com/FangShancheng/ABINet-PP. arXiv admin note: substantial text overlap with arXiv:2103.06495 (conference version) 