Yunlu Xu

Distilling Object Detectors With Global Knowledge

Oct 17, 2022
Sanli Tang, Zhongyu Zhang, Zhanzhan Cheng, Jing Lu, Yunlu Xu, Yi Niu, Fan He

Knowledge distillation learns a lightweight student model that mimics a cumbersome teacher model. Existing methods regard the knowledge as the features of each instance or their relations, which is instance-level knowledge drawn only from the teacher model, i.e., local knowledge. However, empirical studies show that local knowledge is quite noisy in object detection tasks, especially for blurred, occluded, or small instances. Thus, a more intrinsic approach is to measure the representations of instances w.r.t. a group of common basis vectors in the two feature spaces of the teacher and the student detectors, i.e., global knowledge. Distillation can then be carried out as space alignment. To this end, a novel prototype generation module (PGM) is proposed to find the common basis vectors, dubbed prototypes, in the two feature spaces. Then, a robust distilling module (RDM) is applied to construct the global knowledge based on the prototypes and to filter out noisy global and local knowledge by measuring the discrepancy of the representations in the two feature spaces. Experiments with Faster R-CNN and RetinaNet on the PASCAL and COCO datasets show that our method achieves the best performance for distilling object detectors with various backbones, even surpassing the performance of the teacher model. We also show that existing methods can easily be combined with global knowledge to obtain further improvement. Code is available: https://github.com/hikvision-research/DAVAR-Lab-ML.
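
As a rough illustration of the idea (not the paper's PGM/RDM code), the sketch below builds a handful of prototypes per feature space with plain k-means, represents each instance by its soft similarities to those prototypes, and aligns the two representations while down-weighting instances whose teacher and student views disagree; the dimensions, the clustering step, and the weighting scheme are all assumptions.

```python
# A minimal sketch of prototype-based "global knowledge" distillation,
# NOT the authors' PGM/RDM implementation. Prototype construction (k-means),
# feature dimensions and the weighting scheme are illustrative assumptions.
import torch
import torch.nn.functional as F


def build_prototypes(features: torch.Tensor, num_prototypes: int, iters: int = 10) -> torch.Tensor:
    """Crude k-means over instance features to obtain common basis vectors (prototypes)."""
    idx = torch.randperm(features.size(0))[:num_prototypes]
    protos = features[idx].clone()
    for _ in range(iters):
        assign = torch.cdist(features, protos).argmin(dim=1)  # nearest prototype per instance
        for k in range(num_prototypes):
            members = features[assign == k]
            if members.numel() > 0:
                protos[k] = members.mean(dim=0)
    return protos


def global_kd_loss(f_student, f_teacher, p_student, p_teacher, temperature=1.0):
    """Align instance representations w.r.t. prototypes in the two feature spaces."""
    # Representation of each instance = softmax over similarities to the prototypes.
    r_s = F.softmax(f_student @ p_student.t() / temperature, dim=1)
    r_t = F.softmax(f_teacher @ p_teacher.t() / temperature, dim=1)
    # Down-weight instances whose two representations disagree strongly (noisy knowledge).
    with torch.no_grad():
        disagreement = (r_s - r_t).abs().sum(dim=1)
        weights = torch.exp(-disagreement)
    per_instance = F.kl_div(r_s.log(), r_t, reduction="none").sum(dim=1)
    return (weights * per_instance).mean()


# Toy usage: 32 instance features of dim 256 from teacher/student, 8 prototypes each.
f_t, f_s = torch.randn(32, 256), torch.randn(32, 256)
p_t, p_s = build_prototypes(f_t, 8), build_prototypes(f_s, 8)
print(global_kd_loss(f_s, f_t, p_s, p_t).item())
```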

* Accepted by ECCV2022 

E2-AEN: End-to-End Incremental Learning with Adaptively Expandable Network

Jul 14, 2022
Guimei Cao, Zhanzhan Cheng, Yunlu Xu, Duo Li, Shiliang Pu, Yi Niu, Fei Wu

Expandable networks have demonstrated their advantages in dealing with the catastrophic forgetting problem in incremental learning. Considering that different tasks may need different structures, recent methods design dynamic structures adapted to different tasks via sophisticated techniques. Their routine is to search for expandable structures first and then train on the new tasks, which, however, breaks tasks into multiple training stages, leading to suboptimal performance or excessive computational cost. In this paper, we propose an end-to-end trainable adaptively expandable network named E2-AEN, which dynamically generates lightweight structures for new tasks without any accuracy drop on previous tasks. Specifically, the network contains a series of powerful feature adapters for augmenting the previously learned representations for new tasks and avoiding task interference. These adapters are controlled via an adaptive gate-based pruning strategy that decides whether the expanded structures can be pruned, making the network structure dynamically changeable according to the complexity of the new tasks. Moreover, we introduce a novel sparsity-activation regularization to encourage the model to learn discriminative features with limited parameters. E2-AEN reduces cost and can be built upon any feed-forward architecture in an end-to-end manner. Extensive experiments on both classification (i.e., CIFAR and VDD) and detection (i.e., COCO, VOC and the ICCV 2021 SSLAD challenge) benchmarks demonstrate the effectiveness of the proposed method, which achieves remarkable new results.
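
The sketch below is an illustrative, hedged reading of the gate-controlled adapters described above: a lightweight residual adapter whose contribution is scaled by a learnable gate, a sparsity penalty on that gate, and a pruning check once the gate collapses. Module sizes, the gate parameterisation, and the threshold are assumptions, not the E2-AEN implementation.

```python
# Illustrative sketch only: a gated feature adapter in the spirit of adaptively
# expandable networks. Module names, gate parameterisation and the sparsity
# penalty are assumptions, not the E2-AEN implementation.
import torch
import torch.nn as nn


class GatedAdapter(nn.Module):
    """Lightweight per-task adapter whose output is controlled by a learnable gate."""

    def __init__(self, channels: int, bottleneck: int = 16):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1),
        )
        self.gate_logit = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.gate_logit)
        # Frozen backbone features pass through untouched; the adapter adds a gated residual.
        return x + gate * self.adapter(x)

    def sparsity_loss(self) -> torch.Tensor:
        # Encourages gates (and thus extra parameters) to stay small unless the new task needs them.
        return torch.sigmoid(self.gate_logit)

    def prunable(self, threshold: float = 0.05) -> bool:
        # After training, adapters whose gate collapsed below the threshold can be removed
        # with negligible effect on the network's outputs.
        return torch.sigmoid(self.gate_logit).item() < threshold


adapter = GatedAdapter(channels=64)
features = torch.randn(2, 64, 32, 32)
out = adapter(features)
print(out.shape, adapter.prunable())
```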

TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents

Jul 14, 2022
Zhanzhan Cheng, Peng Zhang, Can Li, Qiao Liang, Yunlu Xu, Pengfei Li, Shiliang Pu, Yi Niu, Fei Wu

Recently, automatically extracting information from visually rich documents (e.g., tickets and resumes) has become a hot and vital research topic due to its widespread commercial value. Most existing methods divide this task into two subparts: a text reading part for obtaining plain text from the original document images and an information extraction part for extracting key contents. These methods mainly focus on improving the second part while neglecting that the two parts are highly correlated. This paper proposes a unified end-to-end information extraction framework for visually rich documents, where text reading and information extraction can reinforce each other via a well-designed multi-modal context block. Specifically, the text reading part provides multi-modal features such as visual, textual, and layout features. The multi-modal context block is developed to fuse the generated multi-modal features, and even prior knowledge from a pre-trained language model, for better semantic representation. The information extraction part is responsible for generating key contents from the fused context features. The framework can be trained in an end-to-end manner, achieving global optimization. What is more, we define and group visually rich documents into four categories across two dimensions, layout and text type. For each document category, we provide or recommend corresponding benchmarks, experimental settings, and strong baselines to remedy the lack of a uniform evaluation standard in this research area. Extensive experiments on four kinds of benchmarks (from fixed layout to variable layout, from fully structured text to semi-unstructured text) are reported, demonstrating the proposed method's effectiveness. Data, source code, and models are available.
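
A minimal sketch of what a multi-modal context block could look like, assuming per-region visual, textual, and layout inputs that are projected to a common width and fused with self-attention; the dimensions and fusion scheme are illustrative assumptions rather than TRIE++'s actual module.

```python
# A hedged sketch of a multi-modal context block that fuses visual, textual and
# layout features per detected text region; dimensions and the fusion scheme
# (projection + self-attention) are illustrative assumptions, not TRIE++'s code.
import torch
import torch.nn as nn


class MultiModalContextBlock(nn.Module):
    def __init__(self, d_visual=256, d_text=768, d_layout=4, d_model=256, n_heads=8):
        super().__init__()
        self.proj_visual = nn.Linear(d_visual, d_model)
        self.proj_text = nn.Linear(d_text, d_model)      # e.g. embeddings from a pre-trained LM
        self.proj_layout = nn.Linear(d_layout, d_model)  # normalised box coordinates
        self.context = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, visual, text, layout):
        # Sum the three projected modalities, then let regions attend to each other.
        fused = self.proj_visual(visual) + self.proj_text(text) + self.proj_layout(layout)
        return self.context(fused)


block = MultiModalContextBlock()
n_regions = 12
visual = torch.randn(1, n_regions, 256)  # RoI visual features from the text-reading branch
text = torch.randn(1, n_regions, 768)    # textual features of the recognised words
layout = torch.rand(1, n_regions, 4)     # (x1, y1, x2, y2) normalised to [0, 1]
context = block(visual, text, layout)
print(context.shape)  # (1, 12, 256): fused context features for the extraction head
```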

DavarOCR: A Toolbox for OCR and Multi-Modal Document Understanding

Jul 14, 2022
Liang Qiao, Hui Jiang, Ying Chen, Can Li, Pengfei Li, Zaisheng Li, Baorui Zou, Dashan Guo, Yingda Xu, Yunlu Xu, Zhanzhan Cheng, Yi Niu

This paper presents DavarOCR, an open-source toolbox for OCR and document understanding tasks. DavarOCR currently implements 19 advanced algorithms, covering 9 different task forms. DavarOCR provides detailed usage instructions and trained models for each algorithm. Compared with previous open-source OCR toolboxes, DavarOCR has relatively more complete support for the sub-tasks of cutting-edge document understanding technology. To promote the development and application of OCR technology in academia and industry, we pay particular attention to modules that different technical sub-domains can share. DavarOCR is publicly released at https://github.com/hikopensource/Davar-Lab-OCR.

* Short paper, Accepted by ACM MM2022 

Technical Report for ICCV 2021 Challenge SSLAD-Track3B: Transformers Are Better Continual Learners

Jan 13, 2022
Duo Li, Guimei Cao, Yunlu Xu, Zhanzhan Cheng, Yi Niu

In the SSLAD-Track 3B challenge on continual learning, we propose the method of COntinual Learning with Transformer (COLT). We find that transformers suffer less from catastrophic forgetting than convolutional neural networks. The main principle of our method is to equip the transformer-based feature extractor with old-knowledge distillation and head-expanding strategies to combat catastrophic forgetting. In this report, we first introduce the overall framework of continual learning for object detection. Then, we analyse the effect of the key elements on withstanding catastrophic forgetting in our solution. Our method achieves 70.78 mAP on the SSLAD-Track 3B challenge test set.
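
For illustration only, the sketch below shows the two ingredients named above in isolation: expanding a classification head with new outputs while keeping the old-class weights, and a temperature-scaled distillation loss on the old classes against the frozen previous model. The feature-extractor stub and hyper-parameters are assumptions, not the COLT code.

```python
# A minimal, assumption-laden sketch of old-knowledge distillation plus head
# expansion for class-incremental learning. It is not the COLT implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


def expand_head(head: nn.Linear, num_new_classes: int) -> nn.Linear:
    """Return a new linear head with extra outputs, keeping the old-class weights."""
    new_head = nn.Linear(head.in_features, head.out_features + num_new_classes)
    with torch.no_grad():
        new_head.weight[: head.out_features] = head.weight
        new_head.bias[: head.out_features] = head.bias
    return new_head


def old_knowledge_loss(new_logits, old_logits, T: float = 2.0):
    """KL distillation on the old classes only, so previous knowledge is retained."""
    n_old = old_logits.size(1)
    return F.kl_div(
        F.log_softmax(new_logits[:, :n_old] / T, dim=1),
        F.softmax(old_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)


# Toy usage with random features standing in for a transformer backbone's output.
features = torch.randn(4, 128)
old_head = nn.Linear(128, 10)                     # trained on 10 old classes
frozen_old_head = copy.deepcopy(old_head).eval()  # teacher for distillation
new_head = expand_head(old_head, num_new_classes=5)

loss = old_knowledge_loss(new_head(features), frozen_old_head(features).detach())
print(new_head.out_features, loss.item())
```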

* Rank 1st on ICCV2021 SSLAD-Track 3B 

A Strong Baseline for Semi-Supervised Incremental Few-Shot Learning

Nov 04, 2021
Linlan Zhao, Dashan Guo, Yunlu Xu, Liang Qiao, Zhanzhan Cheng, Shiliang Pu, Yi Niu, Xiangzhong Fang

Few-shot learning (FSL) aims to learn models that generalize to novel classes with limited training samples. Recent works advance FSL towards a scenario where unlabeled examples are also available and propose semi-supervised FSL methods. Another line of methods also cares about the performance of base classes in addition to the novel ones and thus establishes the incremental FSL scenario. In this paper, we generalize the above two settings under a more realistic yet complex setting, named Semi-Supervised Incremental Few-Shot Learning (S2I-FSL). To tackle the task, we propose a novel paradigm containing two parts: (1) a well-designed meta-training algorithm for mitigating the ambiguity between base and novel classes caused by unreliable pseudo labels and (2) a model adaptation mechanism to learn discriminative features for novel classes while preserving base knowledge using few labeled and all the unlabeled data. Extensive experiments on standard FSL, semi-supervised FSL, incremental FSL, and the newly constructed S2I-FSL benchmarks demonstrate the effectiveness of our proposed method.
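
The fragment below sketches one plausible piece of such a pipeline: filtering unreliable pseudo labels on unlabeled data with a cosine prototype classifier and a confidence threshold. Both the classifier and the threshold are assumptions and do not reproduce the paper's meta-training or adaptation procedure.

```python
# An illustrative sketch of pseudo-label filtering before adapting to novel
# classes; thresholds and the prototype classifier are assumptions, not the
# paper's exact method.
import torch
import torch.nn.functional as F


def prototype_logits(features, prototypes, temperature=10.0):
    """Cosine-similarity classifier against class prototypes."""
    f = F.normalize(features, dim=1)
    p = F.normalize(prototypes, dim=1)
    return temperature * f @ p.t()


def select_confident_pseudo_labels(unlabeled_feats, prototypes, threshold=0.8):
    """Keep only unlabeled samples the current model labels with high confidence."""
    probs = prototype_logits(unlabeled_feats, prototypes).softmax(dim=1)
    conf, pseudo = probs.max(dim=1)
    keep = conf > threshold
    return unlabeled_feats[keep], pseudo[keep]


# Toy usage: 5 base + 3 novel class prototypes, 50 unlabeled embeddings of dim 64.
prototypes = torch.randn(8, 64)
unlabeled = torch.randn(50, 64)
feats, labels = select_confident_pseudo_labels(unlabeled, prototypes)
print(feats.shape, labels.shape)
```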

* Accepted by BMVC2021 

Reciprocal Feature Learning via Explicit and Implicit Tasks in Scene Text Recognition

May 13, 2021
Hui Jiang, Yunlu Xu, Zhanzhan Cheng, Shiliang Pu, Yi Niu, Wenqi Ren, Fei Wu, Wenming Tan

Text recognition is a popular research topic due to its broad applications. In this work, we excavate an implicit task, character counting, within traditional text recognition, without additional annotation cost. The implicit task serves as an auxiliary branch complementing the sequential recognition. We design a two-branch reciprocal feature learning framework to adequately utilize the features from both tasks. By exploiting the complementary effect between the explicit and implicit tasks, the features are reliably enhanced. Extensive experiments on 7 benchmarks show the advantages of the proposed method in both text recognition and the newly built character counting task. In addition, it is convenient yet effective to equip the framework with variable networks and tasks. We offer abundant ablation studies and generalization experiments, with a deeper understanding of the tasks. Code is available.
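
As a hedged sketch of the two-branch idea, the snippet below attaches a character-counting regression head next to a per-timestep recognition head on shared sequence features; the counting target comes for free from the length of the ground-truth text. The architecture and loss are illustrative assumptions, not the paper's implementation.

```python
# A hedged sketch of a two-branch setup with an auxiliary character-counting
# head next to a sequence-recognition head over shared features.
import torch
import torch.nn as nn


class TwoBranchRecognizer(nn.Module):
    def __init__(self, feat_dim=256, num_classes=37):  # e.g. 36 characters + CTC blank
        super().__init__()
        self.recognition_head = nn.Linear(feat_dim, num_classes)  # per-timestep logits
        self.counting_head = nn.Sequential(                       # predicts character count
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(feat_dim, 1)
        )

    def forward(self, feats):                               # feats: (batch, time, feat_dim)
        logits = self.recognition_head(feats)                # (batch, time, classes)
        count = self.counting_head(feats.transpose(1, 2))    # pool over time -> (batch, 1)
        return logits, count.squeeze(1)


model = TwoBranchRecognizer()
feats = torch.randn(8, 25, 256)  # shared backbone features for 8 word images
logits, counts = model(feats)
# The counting target is free: it is simply the length of the ground-truth text.
gt_lengths = torch.randint(1, 20, (8,)).float()
count_loss = nn.functional.smooth_l1_loss(counts, gt_lengths)
print(logits.shape, counts.shape, count_loss.item())
```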

* Accepted by ICDAR 2021 

MANGO: A Mask Attention Guided One-Stage Scene Text Spotter

Dec 08, 2020
Liang Qiao, Ying Chen, Zhanzhan Cheng, Yunlu Xu, Yi Niu, Shiliang Pu, Fei Wu

Recently, end-to-end scene text spotting has become a popular research topic due to its advantages of global optimization and high maintainability in real applications. Most methods attempt to develop various region-of-interest (RoI) operations to concatenate the detection part and the sequence recognition part into a two-stage text spotting framework. However, in such a framework, the recognition part is highly sensitive to the detected results (e.g., the compactness of text contours). To address this problem, in this paper we propose a novel Mask AttentioN Guided One-stage text spotting framework named MANGO, in which character sequences can be directly recognized without RoI operations. Concretely, a position-aware mask attention module is developed to generate attention weights on each text instance and its characters. It allows different text instances in an image to be allocated to different feature map channels, which are further grouped into a batch of instance features. Finally, a lightweight sequence decoder is applied to generate the character sequences. It is worth noting that MANGO inherently adapts to arbitrary-shaped text spotting and can be trained end-to-end with only coarse position information (e.g., rectangular bounding boxes) and text annotations. Experimental results show that the proposed method achieves competitive and even new state-of-the-art performance on both regular and irregular text spotting benchmarks, i.e., ICDAR 2013, ICDAR 2015, Total-Text, and SCUT-CTW1500.
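
A rough sketch of the mask-attention idea, assuming a fixed number of instance slots: one spatial attention map is predicted per slot and used to pool the shared feature map into a batch of instance features for a downstream sequence decoder. The slot count, pooling, and shapes are assumptions, not MANGO's modules.

```python
# Illustrative sketch of attention-based pooling that allocates text instances
# to slots; it does not reproduce MANGO's position-aware mask attention module.
import torch
import torch.nn as nn


class MaskAttentionPooling(nn.Module):
    """Predict one spatial attention map per instance slot and pool features with it."""

    def __init__(self, in_channels=256, max_instances=16):
        super().__init__()
        self.attention = nn.Conv2d(in_channels, max_instances, kernel_size=1)

    def forward(self, feature_map):                        # (batch, C, H, W)
        attn = self.attention(feature_map)                 # (batch, S, H, W), one map per slot
        attn = attn.flatten(2).softmax(dim=-1)             # normalise over spatial positions
        feats = feature_map.flatten(2)                     # (batch, C, H*W)
        # Weighted sum of features per slot -> a batch of instance features.
        return torch.einsum("bsn,bcn->bsc", attn, feats)   # (batch, S, C)


pool = MaskAttentionPooling()
fmap = torch.randn(2, 256, 32, 32)
instance_feats = pool(fmap)
print(instance_feats.shape)  # (2, 16, 256), ready for a lightweight sequence decoder
```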

* Accepted to AAAI2021. The code will be published soon 

TRIE: End-to-End Text Reading and Information Extraction for Document Understanding

May 27, 2020
Peng Zhang, Yunlu Xu, Zhanzhan Cheng, Shiliang Pu, Jing Lu, Liang Qiao, Yi Niu, Fei Wu

Since real-world ubiquitous documents (e.g., invoices, tickets, resumes and leaflets) contain rich information, automatic document image understanding has become a hot topic. Most existing works decouple the problem into two separate tasks: (1) text reading for detecting and recognizing texts in the images and (2) information extraction for analyzing and extracting key elements from the previously extracted plain text. However, they mainly focus on improving the information extraction task, while neglecting the fact that text reading and information extraction are mutually correlated. In this paper, we propose a unified end-to-end text reading and information extraction network, where the two tasks can reinforce each other. Specifically, the multimodal visual and textual features of text reading are fused for information extraction and, in turn, the semantics in information extraction contribute to the optimization of text reading. On three real-world datasets with diverse document images (from fixed layout to variable layout, from structured text to semi-structured text), our proposed method significantly outperforms state-of-the-art methods in both efficiency and accuracy.
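
The toy sketch below illustrates only the end-to-end coupling described above: a character head and an entity head sit on a shared trunk and are trained with one joint objective, so gradients from information extraction also update the reading features. The module choices and shapes are assumptions, not the TRIE network.

```python
# A highly simplified sketch of joint text reading + information extraction
# trained end to end; shapes and module choices are assumptions.
import torch
import torch.nn as nn


class JointReadExtract(nn.Module):
    def __init__(self, d_model=256, vocab=100, num_entities=5):
        super().__init__()
        self.reading_trunk = nn.GRU(d_model, d_model, batch_first=True)  # stands in for the text reader
        self.char_head = nn.Linear(d_model, vocab)           # text-reading output (characters)
        self.entity_head = nn.Linear(d_model, num_entities)  # information-extraction output (field tags)

    def forward(self, region_feats):                          # (batch, regions, d_model)
        shared, _ = self.reading_trunk(region_feats)
        return self.char_head(shared), self.entity_head(shared)


model = JointReadExtract()
regions = torch.randn(2, 10, 256)
char_logits, entity_logits = model(regions)
char_targets = torch.randint(0, 100, (2, 10))
entity_targets = torch.randint(0, 5, (2, 10))
ce = nn.CrossEntropyLoss()
# Single joint objective: gradients from extraction flow back into the shared trunk.
loss = ce(char_logits.flatten(0, 1), char_targets.flatten()) + \
       ce(entity_logits.flatten(0, 1), entity_targets.flatten())
print(loss.item())
```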
