Dongbao Yang

Masked and Permuted Implicit Context Learning for Scene Text Recognition

May 25, 2023
Xiaomeng Yang, Zhi Qiao, Jin Wei, Yu Zhou, Ye Yuan, Zhilong Ji, Dongbao Yang, Weiping Wang

Scene Text Recognition (STR) is a challenging task due to variations in text style, shape, and background. Incorporating linguistic information is an effective way to enhance the robustness of STR models. Existing methods rely on permuted language modeling (PLM) or masked language modeling (MLM) to learn contextual information implicitly, either by training an ensemble of permuted autoregressive (AR) LMs or through an iterative non-autoregressive (NAR) decoding procedure. However, these methods have limitations: PLM's AR decoding lacks information about future characters, while MLM provides global information about the entire text but neglects dependencies among the predicted characters. In this paper, we propose a Masked and Permuted Implicit Context Learning Network for STR, which unifies PLM and MLM within a single decoding architecture and inherits the advantages of both approaches. We adopt the training procedure of PLM and, to integrate MLM, incorporate word-length information into the decoding process by introducing a specific number of mask tokens. Experimental results demonstrate that the proposed model achieves state-of-the-art performance on standard benchmarks with both AR and NAR decoding procedures.
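
As a rough illustration of the two ingredients the abstract mentions, the sketch below shows one way a permuted (PLM-style) attention mask and length-aware mask-token queries could be built in PyTorch; the shapes, names, and specific construction are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only (not the authors' code): combining a permuted
# factorization order with length-aware [MASK] queries for an STR decoder.
import torch

def permuted_attention_mask(length: int) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True if position i may attend to j.

    A random permutation defines the factorization order: each position may
    only attend to positions predicted earlier in that order (PLM-style).
    """
    order = torch.randperm(length)             # random factorization order
    rank = torch.empty(length, dtype=torch.long)
    rank[order] = torch.arange(length)         # rank[i] = step at which i is predicted
    return rank.unsqueeze(1) > rank.unsqueeze(0)  # attend only to earlier-ranked positions

def build_queries(length: int, d_model: int, mask_embedding: torch.Tensor,
                  pos_embedding: torch.Tensor) -> torch.Tensor:
    """MLM-style queries: one [MASK] token per character, so the decoder
    knows the word length; positional embeddings distinguish the slots."""
    return mask_embedding.expand(length, d_model) + pos_embedding[:length]

# Toy usage with made-up dimensions.
d_model, max_len = 256, 25
mask_emb = torch.zeros(1, d_model)
pos_emb = torch.randn(max_len, d_model)
queries = build_queries(7, d_model, mask_emb, pos_emb)   # (7, 256)
attn_mask = permuted_attention_mask(7)                    # (7, 7) boolean
print(queries.shape, attn_mask.shape)
```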

Multi-View Correlation Distillation for Incremental Object Detection

Jul 05, 2021
Dongbao Yang, Yu Zhou, Weiping Wang

In real applications, new object classes often emerge after the detection model has been trained on a prepared dataset with fixed classes. Due to the storage burden and the privacy of old data, it is sometimes impractical to train the model from scratch with both old and new data. Fine-tuning the old model with only new data leads to the well-known phenomenon of catastrophic forgetting, which severely degrades the performance of modern object detectors. In this paper, we propose a novel Multi-View Correlation Distillation (MVCD) based incremental object detection method, which explores the correlations in the feature space of a two-stage object detector (Faster R-CNN). To better transfer the knowledge learned from the old classes and maintain the ability to learn new classes, we design correlation distillation losses from channel-wise, point-wise, and instance-wise views to regularize the learning of the incremental model. A new metric named Stability-Plasticity-mAP is proposed to better evaluate both the stability on old classes and the plasticity on new classes in incremental object detection. Extensive experiments conducted on VOC2007 and COCO demonstrate that MVCD can effectively learn to detect objects of new classes and mitigate the problem of catastrophic forgetting.
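
To make the idea of correlation distillation concrete, here is a minimal PyTorch sketch of a channel-wise variant: the channel-correlation structure of the frozen old detector's features is matched by the incremental model. The tensor shapes, normalization, and loss choices are my assumptions, not the released MVCD code.

```python
# Hedged sketch (assumptions, not the MVCD implementation) of a channel-wise
# correlation distillation loss between old and incremental backbone features.
import torch
import torch.nn.functional as F

def channel_correlation(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) -> per-image C x C channel correlation matrix."""
    b, c, h, w = feat.shape
    flat = feat.view(b, c, h * w)
    flat = F.normalize(flat, dim=2)                 # unit-norm each channel response
    return torch.bmm(flat, flat.transpose(1, 2))    # (B, C, C) cosine correlations

def channel_distill_loss(feat_old: torch.Tensor, feat_new: torch.Tensor) -> torch.Tensor:
    """Match the channel-correlation structure of the old and new backbones."""
    return F.mse_loss(channel_correlation(feat_new), channel_correlation(feat_old))

# Toy usage with random features standing in for backbone outputs.
old = torch.randn(2, 64, 32, 32)
new = torch.randn(2, 64, 32, 32)
print(channel_distill_loss(old, new).item())
```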

Two-Level Residual Distillation based Triple Network for Incremental Object Detection

Jul 27, 2020
Dongbao Yang, Yu Zhou, Dayan Wu, Can Ma, Fei Yang, Weiping Wang

Modern object detection methods based on convolutional neural networks suffer from severe catastrophic forgetting when learning new classes without the original data. Due to time consumption, storage burden, and the privacy of old data, it is inadvisable to train the model from scratch with both old and new data when new object classes emerge after the model has been trained. In this paper, we propose a novel incremental object detector based on Faster R-CNN that continuously learns new object classes without using old data. It is a triple network in which an old model and a residual model act as assistants, helping the incremental model learn new classes without forgetting previously learned knowledge. To better maintain the discrimination between features of old and new classes, the residual model is jointly trained on the new classes during the incremental learning procedure. In addition, a corresponding distillation scheme is designed to guide the training process, consisting of a two-level residual distillation loss and a joint classification distillation loss. Extensive experiments on VOC2007 and COCO are conducted, and the results demonstrate that the proposed method can effectively learn to incrementally detect objects of new classes and that the problem of catastrophic forgetting is mitigated in this context.
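
The sketch below illustrates, under my own assumptions, how a residual-style feature distillation term and a Hinton-style classification distillation on the old-class logits might be combined; it is not the paper's implementation, and the exact form of the two-level loss may differ.

```python
# Rough sketch (assumed form, not the paper's code) of residual feature
# distillation plus classification distillation for an incremental detector.
import torch
import torch.nn.functional as F

def residual_feature_loss(feat_incremental, feat_old, feat_residual):
    """Encourage the incremental features to stay close to the old features
    plus the residual model's correction for the new classes (assumption)."""
    return F.mse_loss(feat_incremental, feat_old + feat_residual)

def classification_distill_loss(logits_incremental_old_classes, logits_old, T: float = 2.0):
    """Standard soft-target distillation on the old-class logits."""
    p_teacher = F.softmax(logits_old / T, dim=1)
    log_p_student = F.log_softmax(logits_incremental_old_classes / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

# Toy usage.
f_inc, f_old, f_res = (torch.randn(4, 256) for _ in range(3))
l_inc_old, l_old = torch.randn(4, 20), torch.randn(4, 20)
loss = residual_feature_loss(f_inc, f_old, f_res) + classification_distill_loss(l_inc_old, l_old)
print(loss.item())
```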

Self-Training for Domain Adaptive Scene Text Detection

May 23, 2020
Yudi Chen, Wei Wang, Yu Zhou, Fei Yang, Dongbao Yang, Weiping Wang

Although deep-learning-based scene text detection has achieved great progress, well-trained detectors suffer severe performance degradation in different domains. In general, a tremendous amount of data is indispensable for training a detector in the target domain, but data collection and annotation are expensive and time-consuming. To address this problem, we propose a self-training framework to automatically mine hard examples with pseudo-labels from unannotated videos or images. To reduce the noise in these hard examples, a novel text mining module is implemented based on the fusion of detection and tracking results. An image-to-video generation method is then designed for cases where videos are unavailable and only images can be used. Experimental results on standard benchmarks, including ICDAR2015, MSRA-TD500, and ICDAR2017 MLT, demonstrate the effectiveness of our self-training method. A simple Mask R-CNN adapted with self-training and fine-tuned on real data achieves results comparable to, or even better than, state-of-the-art methods.
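
One plausible (assumed) realization of fusing detection and tracking to mine reliable pseudo-labels is sketched below: a detection is kept only when it is confident and overlaps a tracked box, which suppresses spurious single-frame responses. The thresholds and the IoU-based rule are illustrative, not taken from the paper.

```python
# Illustrative sketch (not the paper's text mining module) of pseudo-label
# filtering by fusing per-frame detections with tracking results.
import torch
from torchvision.ops import box_iou

def mine_pseudo_labels(det_boxes: torch.Tensor, det_scores: torch.Tensor,
                       track_boxes: torch.Tensor,
                       score_thr: float = 0.8, iou_thr: float = 0.5) -> torch.Tensor:
    """det_boxes: (N, 4) xyxy, det_scores: (N,), track_boxes: (M, 4) xyxy.
    Returns the detections that are both confident and confirmed by tracking."""
    confident = det_scores >= score_thr
    if track_boxes.numel() == 0:
        return det_boxes[confident]
    overlaps = box_iou(det_boxes, track_boxes)          # (N, M) pairwise IoU
    confirmed = overlaps.max(dim=1).values >= iou_thr
    return det_boxes[confident & confirmed]

# Toy usage.
dets = torch.tensor([[0., 0., 10., 10.], [50., 50., 60., 60.]])
scores = torch.tensor([0.9, 0.95])
tracks = torch.tensor([[1., 1., 11., 11.]])
print(mine_pseudo_labels(dets, scores, tracks))
```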

SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition

May 22, 2020
Zhi Qiao, Yu Zhou, Dongbao Yang, Yucan Zhou, Weiping Wang

Scene text recognition is an active research topic in computer vision. Recently, many recognition methods based on the encoder-decoder framework have been proposed, and they can handle scene text with perspective distortion and curved shapes. Nevertheless, they still face many challenges, such as image blur, uneven illumination, and incomplete characters. We argue that most encoder-decoder methods are based on local visual features without explicit global semantic information. In this work, we propose a semantics enhanced encoder-decoder framework to robustly recognize low-quality scene text. The semantic information is used both in the encoder module for supervision and in the decoder module for initialization. In particular, the state-of-the-art ASTER method is integrated into the proposed framework as an exemplar. Extensive experiments demonstrate that the proposed framework is more robust to low-quality text images and achieves state-of-the-art results on several benchmark datasets.

* CVPR 2020 
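
A minimal sketch, under assumed dimensions and module names, of the general idea described in the abstract: predict a global semantic vector from the pooled visual features, supervise it with a pre-trained word embedding, and use it to initialize the decoder. This is not the authors' code.

```python
# Hedged sketch (assumed modules and shapes, not the SEED implementation) of a
# semantic head that is supervised by a pre-trained word embedding and used to
# initialize the decoder state.
import torch
import torch.nn as nn

class SemanticHead(nn.Module):
    def __init__(self, feat_dim: int = 512, sem_dim: int = 300, dec_dim: int = 512):
        super().__init__()
        self.predict_semantic = nn.Linear(feat_dim, sem_dim)   # visual -> semantic space
        self.init_decoder = nn.Linear(sem_dim, dec_dim)        # semantic -> decoder init state

    def forward(self, pooled_visual: torch.Tensor):
        semantic = self.predict_semantic(pooled_visual)         # supervised against a word embedding
        decoder_init = self.init_decoder(semantic)
        return semantic, decoder_init

# Toy usage.
head = SemanticHead()
visual = torch.randn(2, 512)                 # globally pooled encoder feature (assumed)
target_embedding = torch.randn(2, 300)       # stand-in for a pre-trained word embedding
semantic, dec_init = head(visual)
sem_loss = nn.functional.cosine_embedding_loss(
    semantic, target_embedding, torch.ones(2))  # pull predicted semantics toward the embedding
print(sem_loss.item(), dec_init.shape)
```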
Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning

Jan 02, 2020
Dezhao Luo, Chang Liu, Yu Zhou, Dongbao Yang, Can Ma, Qixiang Ye, Weiping Wang

We propose a novel self-supervised method, referred to as Video Cloze Procedure (VCP), to learn rich spatio-temporal representations. VCP first generates "blanks" by withholding video clips and then creates "options" by applying spatio-temporal operations to the withheld clips. Finally, it fills the blanks with the "options" and learns representations by predicting the categories of the operations applied to the clips. VCP can act as either a proxy task or a target task in self-supervised learning. As a proxy task, it converts rich self-supervised representations into video clip operations (options), which enhances the flexibility and reduces the complexity of representation learning. As a target task, it can assess learned representation models in a uniform and interpretable manner. With VCP, we train spatio-temporal representation models (3D-CNNs) and apply them to action recognition and video retrieval tasks. Experiments on commonly used benchmarks show that the trained models outperform state-of-the-art self-supervised models by significant margins.

* AAAI 2020 (Oral) 
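
The cloze-style pretext can be illustrated with a toy sketch: apply a randomly chosen spatio-temporal operation to a clip and use the operation id as the classification target. The specific operations below are assumptions for illustration and need not match those used in VCP.

```python
# Toy sketch (assumed operations, not necessarily those used in VCP) of the
# cloze-style pretext: transform a clip and predict which operation was applied.
import random
import torch

def apply_operation(clip: torch.Tensor, op_id: int) -> torch.Tensor:
    """clip: (C, T, H, W). Three illustrative operations plus identity."""
    if op_id == 1:
        return clip.flip(dims=[1])                       # temporal reversal
    if op_id == 2:
        return torch.rot90(clip, k=1, dims=[2, 3])       # spatial rotation
    if op_id == 3:
        perm = torch.randperm(clip.shape[1])
        return clip[:, perm]                              # frame shuffling
    return clip                                            # identity

def make_cloze_sample(clip: torch.Tensor):
    """Return the transformed clip (an "option") and the operation label."""
    op_id = random.randint(0, 3)
    return apply_operation(clip, op_id), op_id

# Toy usage: the label would supervise a 3D-CNN classifier.
clip = torch.randn(3, 16, 112, 112)
option, label = make_cloze_sample(clip)
print(option.shape, label)
```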
Curved Text Detection in Natural Scene Images with Semi- and Weakly-Supervised Learning

Aug 27, 2019
Xugong Qin, Yu Zhou, Dongbao Yang, Weiping Wang

Detecting curved text in the wild is very challenging. Most recent state-of-the-art methods are segmentation-based and require pixel-level annotations. We propose a novel scheme to train an accurate text detector using only a small amount of pixel-level annotated data together with a large amount of data annotated with rectangles, or even unlabeled data. A baseline model is first obtained by training with the pixel-level annotated data and is then used to annotate the unlabeled or weakly labeled data. For weakly-supervised learning, a novel strategy is proposed that utilizes the ground-truth bounding boxes to generate pseudo mask annotations. Experimental results on CTW1500 and Total-Text demonstrate that our method can substantially reduce the requirement for pixel-level annotated data. Our method also generalizes well across the two datasets. The performance of the proposed method is comparable with state-of-the-art methods while using only 10% pixel-level annotated data and 90% rectangle-level weakly annotated data.

* Accepted by ICDAR 2019 
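
One plausible reading of the box-guided pseudo-mask idea is sketched below: the baseline model's text probability map is binarized and clipped to the annotated rectangle, so the weak box supervision bounds the pseudo mask. This is an assumption about the mechanism, not the paper's exact strategy.

```python
# Hedged sketch (assumed mechanism, not the paper's exact strategy) of turning
# a rectangle annotation plus a baseline prediction into a pseudo mask label.
import torch

def box_to_pseudo_mask(pred_mask: torch.Tensor, box: torch.Tensor,
                       prob_thr: float = 0.5) -> torch.Tensor:
    """pred_mask: (H, W) text probability map from the baseline detector.
    box: (4,) rectangle annotation as (x1, y1, x2, y2). Returns a binary mask."""
    h, w = pred_mask.shape
    keep = torch.zeros(h, w, dtype=torch.bool)
    x1, y1, x2, y2 = box.long().tolist()
    keep[y1:y2, x1:x2] = True                      # restrict to the annotated rectangle
    return (pred_mask >= prob_thr) & keep          # binarize and clip to the box

# Toy usage.
prob = torch.rand(64, 64)
rect = torch.tensor([10, 10, 40, 30])
pseudo = box_to_pseudo_mask(prob, rect)
print(pseudo.sum().item(), pseudo.shape)
```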