Cong Yao

SynthText3D: Synthesizing Scene Text Images from 3D Virtual Worlds

Jul 13, 2019

Minghui Liao, Boyu Song, Minghang He, Shangbang Long, Cong Yao, Xiang Bai

Abstract: With the development of deep neural networks, the demand for large amounts of annotated training data has become a performance bottleneck in many fields of research and applications. Image synthesis can generate annotated images automatically and at no cost, and has gained increasing attention recently. In this paper, we propose to synthesize scene text images from 3D virtual worlds, where precise descriptions of scenes, editable illumination/visibility, and realistic physics are provided. Unlike previous methods, which paste rendered text onto static 2D images, our method renders the 3D virtual scene and the text instances as a whole. In this way, complex perspective transforms, various illuminations, and occlusions can be realized in our synthesized scene text images. Moreover, the same text instances can be produced from various viewpoints by randomly moving and rotating the virtual camera, which acts as human eyes. Experiments on standard scene text detection benchmarks using the generated synthetic data demonstrate the effectiveness and superiority of the proposed method. The code and synthetic data will be made available at https://github.com/MhLiao/SynthText3D

Intra-Ensemble in Neural Networks

Apr 09, 2019

Yuan Gao, Zixiang Cai, Yimin Chen, Wenke Chen, Kan Yang, Chen Sun, Cong Yao

Abstract: Improving model performance is always a key problem in machine learning, including deep learning. However, stand-alone neural networks suffer from diminishing marginal gains when more layers are stacked. At the same time, ensembling is a useful technique to further enhance model performance. Nevertheless, training several independent stand-alone deep neural networks multiplies the resource cost. In this work, we propose Intra-Ensemble, an end-to-end strategy with stochastic training operations to train several sub-networks simultaneously within one neural network. The additional parameter size is marginal since the majority of parameters are mutually shared. Meanwhile, stochastic training increases the diversity of sub-networks with weight sharing, which significantly enhances intra-ensemble performance. Extensive experiments demonstrate the applicability of intra-ensemble on various kinds of datasets and network architectures. Our models achieve results comparable to state-of-the-art architectures on CIFAR-10 and CIFAR-100.

Scene Text Detection and Recognition: The Deep Learning Era

Dec 06, 2018

Shangbang Long, Xin He, Cong Yao

Abstract: With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area in computer vision, scene text detection and recognition has inevitably been influenced by this wave of revolution, consequently entering the era of deep learning. In recent years, the community has witnessed substantial advancements in mindset, approach and performance. This survey aims to summarize and analyze the major changes and significant progress of scene text detection and recognition in the deep learning era. Through this article, we aim to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; (3) look ahead into future trends. Specifically, we emphasize the dramatic differences brought by deep learning and the grand challenges that still remain. We expect that this review paper will serve as a reference for researchers in this field. Related resources are also collected and compiled in our GitHub repository: https://github.com/Jyouhou/SceneTextPapers.

Scene Text Detection with Supervised Pyramid Context Network

Nov 21, 2018

Enze Xie, Yuhang Zang, Shuai Shao, Gang Yu, Cong Yao, Guangyao Li

Abstract: Scene text detection methods based on deep learning have achieved remarkable results over the past years. However, due to the high diversity and complexity of natural scenes, previous state-of-the-art text detection methods may still produce a considerable number of false positives when applied to images captured in real-world environments. To tackle this issue, mainly inspired by Mask R-CNN, we propose in this paper an effective model for scene text detection, which is based on Feature Pyramid Network (FPN) and instance segmentation. We propose a supervised pyramid context network (SPCNET) to precisely locate text regions while suppressing false positives. Benefiting from the guidance of semantic information and the shared FPN, SPCNET obtains significantly enhanced performance while introducing marginal extra computation. Experiments on standard datasets demonstrate that our SPCNET clearly outperforms state-of-the-art methods. Specifically, it achieves an F-measure of 92.1% on ICDAR2013, 87.2% on ICDAR2015, 74.1% on ICDAR2017 MLT and 82.9% on Total-Text.

* Accepted by AAAI 2019
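
For readers unfamiliar with the metric quoted in the abstract above, the F-measure is the harmonic mean of precision and recall. The following is a minimal sketch of that formula only; the function name `f_measure` and the sample values are illustrative assumptions, and the benchmark-specific rules for matching detections to ground truth are not shown.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in [0, 1])."""
    if precision + recall == 0:
        # No correct detections at all: define the score as zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical precision/recall values, not taken from the paper.
score = f_measure(0.9, 0.8)
print(f"F-measure: {score:.3f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a detector cannot reach a high F-measure by trading many false positives for recall, which is why suppressing false positives matters for this metric.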

ICDAR2017 Competition on Reading Chinese Text in the Wild (RCTW-17)

Sep 26, 2018

Baoguang Shi, Cong Yao, Minghui Liao, Mingkun Yang, Pei Xu, Linyan Cui, Serge Belongie, Shijian Lu, Xiang Bai

Abstract: Chinese is the most widely used language in the world. Algorithms that read Chinese text in natural images facilitate applications of various kinds. Despite the large potential value, past datasets and competitions have primarily focused on English, which has very different characteristics from Chinese. This report introduces RCTW, a new competition that focuses on Chinese text reading. The competition features a large-scale dataset with 12,263 annotated images. Two tasks, namely text localization and end-to-end recognition, are set up. The competition took place from January 20 to May 31, 2017. 23 valid submissions were received from 19 teams. This report includes the dataset description, task definitions, evaluation protocols, and a summary and analysis of the results. Through this competition, we call for more future research on the Chinese text reading problem. The official website for the competition is http://rctw.vlrlab.net

Scene Text Recognition from Two-Dimensional Perspective

Sep 18, 2018

Minghui Liao, Jian Zhang, Zhaoyi Wan, Fengming Xie, Jiajun Liang, Pengyuan Lyu, Cong Yao, Xiang Bai

Abstract: Inspired by speech recognition, recent state-of-the-art algorithms mostly treat scene text recognition as a sequence prediction problem. Though achieving excellent performance, these methods usually neglect an important fact: text in images is actually distributed in two-dimensional space. This property is quite different from that of speech, which is essentially a one-dimensional signal. In principle, directly compressing features of text into a one-dimensional form may lose useful information and introduce extra noise. In this paper, we approach scene text recognition from a two-dimensional perspective. A simple yet effective model, called Character Attention Fully Convolutional Network (CA-FCN), is devised for recognizing text of arbitrary shapes. Scene text recognition is realized with a semantic segmentation network, in which an attention mechanism for characters is adopted. Combined with a word formation module, CA-FCN can simultaneously recognize the script and predict the position of each character. Experiments demonstrate that the proposed algorithm outperforms previous methods on both regular and irregular text datasets. Moreover, it is shown to be more robust to imprecise localizations in the text detection phase, which are very common in practice.

* 9 pages, 7 figures

Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes

Aug 01, 2018

Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, Xiang Bai

Abstract: Recently, models based on deep neural networks have dominated the fields of scene text detection and recognition. In this paper, we investigate the problem of scene text spotting, which aims at simultaneous text detection and recognition in natural images. An end-to-end trainable neural network model for scene text spotting is proposed. The proposed model, named Mask TextSpotter, is inspired by the newly published work Mask R-CNN. Unlike previous methods that also accomplish text spotting with end-to-end trainable deep neural networks, Mask TextSpotter takes advantage of a simple and smooth end-to-end learning procedure, in which precise text detection and recognition are acquired via semantic segmentation. Moreover, it is superior to previous methods in handling text instances of irregular shapes, for example, curved text. Experiments on ICDAR2013, ICDAR2015 and Total-Text demonstrate that the proposed method achieves state-of-the-art results in both scene text detection and end-to-end text recognition tasks.

* To appear in ECCV 2018

TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes

Jul 04, 2018

Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, Cong Yao

Abstract: Driven by deep neural networks and large-scale datasets, scene text detection methods have progressed substantially over the past years, continuously refreshing the performance records on various standard benchmarks. However, limited by the representations (axis-aligned rectangles, rotated rectangles or quadrangles) adopted to describe text, existing methods may fall short when dealing with much more free-form text instances, such as curved text, which are actually very common in real-world scenarios. To tackle this problem, we propose a more flexible representation for scene text, termed TextSnake, which is able to effectively represent text instances in horizontal, oriented and curved forms. In TextSnake, a text instance is described as a sequence of ordered, overlapping disks centered at symmetric axes, each of which is associated with a potentially variable radius and orientation. Such geometric attributes are estimated via a Fully Convolutional Network (FCN) model. In experiments, the text detector based on TextSnake achieves state-of-the-art or comparable performance on Total-Text and SCUT-CTW1500, the two newly published benchmarks with special emphasis on curved text in natural images, as well as on the widely used datasets ICDAR 2015 and MSRA-TD500. Specifically, TextSnake outperforms the baseline on Total-Text by more than 40% in F-measure.

* 17 pages, accepted to ECCV2018
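
The disk-based representation described in the abstract above can be illustrated with a small data structure. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Disk`, `point_in_text`, `rasterize` and `arc_disks`, and all coordinates, are hypothetical, and the FCN that would estimate these attributes is not modeled.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Disk:
    cx: float      # x of the disk center on the text center line
    cy: float      # y of the disk center on the text center line
    radius: float  # local half-height of the text
    theta: float   # local orientation of the center line, in radians


def point_in_text(disks: List[Disk], x: float, y: float) -> bool:
    """A point belongs to the text region if any disk covers it."""
    return any(math.hypot(x - d.cx, y - d.cy) <= d.radius for d in disks)


def rasterize(disks: List[Disk], height: int, width: int) -> List[List[bool]]:
    """Union of all disks as a boolean mask indexed as mask[y][x]."""
    return [[point_in_text(disks, x, y) for x in range(width)]
            for y in range(height)]


def arc_disks(n: int = 9, radius: float = 4.0) -> List[Disk]:
    """Hypothetical curved text: ordered disk centers along a half circle."""
    disks = []
    for i in range(n):
        t = math.pi * i / (n - 1)
        cx = 20 - 15 * math.cos(t)
        cy = 25 - 15 * math.sin(t)
        # Orientation is the tangent direction of the arc at this center.
        theta = math.atan2(-math.cos(t), math.sin(t))
        disks.append(Disk(cx, cy, radius, theta))
    return disks


disks = arc_disks()
mask = rasterize(disks, 40, 40)
```

In this toy example adjacent centers are closer together than twice the radius, so the disks overlap and their union forms one connected curved band, which a single rectangle or quadrangle could not describe tightly.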

Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation

Feb 27, 2018

Pengyuan Lyu, Cong Yao, Wenhao Wu, Shuicheng Yan, Xiang Bai

Abstract: Previous state-of-the-art scene text detection methods based on deep learning can be roughly classified into two categories. The first category treats scene text as a type of general object and follows the general object detection paradigm, localizing scene text by regressing text box locations, but is troubled by the arbitrary orientations and large aspect ratios of scene text. The second segments text regions directly, but mostly needs complex post-processing. In this paper, we present a method that combines the ideas of the two types of methods while avoiding their shortcomings. We propose to detect scene text by localizing corner points of text bounding boxes and segmenting text regions in relative positions. At inference, candidate boxes are generated by sampling and grouping corner points, and are then scored by segmentation maps and suppressed by NMS. Compared with previous methods, our method can handle long oriented text naturally and does not need complex post-processing. The experiments on ICDAR2013, ICDAR2015, MSRA-TD500, MLT and COCO-Text demonstrate that the proposed algorithm achieves better or comparable results in both accuracy and efficiency. Based on VGG16, it achieves an F-measure of 84.3% on ICDAR2015 and 81.5% on MSRA-TD500.

* To appear in CVPR2018
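
The final suppression step mentioned in the abstract above (NMS) can be sketched as greedy non-maximum suppression. This is a simplified illustration under stated assumptions: it uses axis-aligned boxes given as `(x1, y1, x2, y2)`, whereas the paper deals with oriented boxes, and the names `iou` and `nms`, the threshold and the sample boxes are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap a better-scored kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical candidates: the first two overlap heavily, the third is apart.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In the method described above, the scores would come from the segmentation maps rather than being given directly; here they are supplied by hand to keep the sketch self-contained.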

EAST: An Efficient and Accurate Scene Text Detector

Jul 10, 2017

Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, Jiajun Liang

Abstract: Previous approaches for scene text detection have already achieved promising performance across various benchmarks. However, they usually fall short when dealing with challenging scenarios, even when equipped with deep neural network models, because the overall performance is determined by the interplay of multiple stages and components in the pipelines. In this work, we propose a simple yet powerful pipeline that yields fast and accurate text detection in natural scenes. The pipeline directly predicts words or text lines of arbitrary orientations and quadrilateral shapes in full images, eliminating unnecessary intermediate steps (e.g., candidate aggregation and word partitioning), with a single neural network. The simplicity of our pipeline allows concentrating efforts on designing loss functions and neural network architecture. Experiments on standard datasets including ICDAR 2015, COCO-Text and MSRA-TD500 demonstrate that the proposed algorithm significantly outperforms state-of-the-art methods in terms of both accuracy and efficiency. On the ICDAR 2015 dataset, the proposed algorithm achieves an F-score of 0.7820 at 13.2 fps at 720p resolution.

* Accepted to CVPR 2017, fix equation (3)
