Shangbang Long

ICDAR 2023 Competition on Hierarchical Text Detection and Recognition

May 16, 2023
Shangbang Long, Siyang Qin, Dmitry Panteleev, Alessandro Bissacco, Yasuhisa Fujii, Michalis Raptis

We organize a competition on hierarchical text detection and recognition. The competition aims to promote research into deep learning models and systems that can jointly perform text detection and recognition and geometric layout analysis. We present details of the proposed competition organization, including tasks, datasets, evaluations, and schedule. During the competition period (from January 2, 2023 to April 1, 2023), at least 50 submissions from more than 20 teams were made across the two proposed tasks. Considering the number of teams and submissions, we conclude that the HierText competition has been held successfully. In this report, we also present the competition results and the insights drawn from them.

* ICDAR 2023 competition report by organizers (accepted and to be published officially later) 

FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction

May 04, 2023
Chen-Yu Lee, Chun-Liang Li, Hao Zhang, Timothy Dozat, Vincent Perot, Guolong Su, Xiang Zhang, Kihyuk Sohn, Nikolai Glushnev, Renshen Wang, Joshua Ainslie, Shangbang Long, Siyang Qin, Yasuhisa Fujii, Nan Hua, Tomas Pfister

The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend masked language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay for all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without relying on a sophisticated and separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on the FUNSD, CORD, SROIE, and Payment benchmarks with a more compact model size.
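
As an illustration of the kind of agreement objective described above (a minimal sketch, not the exact FormNetV2 loss), a graph contrastive term in PyTorch might pull together node embeddings from two modality-corrupted views of the same form graph; the function name and temperature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def graph_contrastive_loss(z_view1, z_view2, temperature=0.1):
    """NT-Xent-style agreement loss between node embeddings of two
    augmented views of the same graph. Shapes: [num_nodes, dim], with
    row i of both views corresponding to the same node."""
    z1 = F.normalize(z_view1, dim=-1)
    z2 = F.normalize(z_view2, dim=-1)
    logits = z1 @ z2.t() / temperature                  # pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Node i in view 1 should be most similar to node i in view 2, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```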

* Accepted to ACL 2023 

Towards End-to-End Unified Scene Text Detection and Layout Analysis

Mar 28, 2022
Shangbang Long, Siyang Qin, Dmitry Panteleev, Alessandro Bissacco, Yasuhisa Fujii, Michalis Raptis

Scene text detection and document layout analysis have long been treated as two separate tasks in different image domains. In this paper, we bring them together and introduce the task of unified scene text detection and layout analysis. The first hierarchical scene text dataset is introduced to enable this novel research task. We also propose a novel method that can simultaneously detect scene text and form text clusters in a unified way. Comprehensive experiments show that our unified model achieves better performance than multiple well-designed baseline methods. Additionally, this model achieves state-of-the-art results on multiple scene text detection datasets without the need for complex post-processing. Dataset and code: https://github.com/google-research-datasets/hiertext.
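
For readers who want to explore the hierarchical annotations, a minimal sketch of walking the word -> line -> paragraph structure is shown below; the file name and field names are assumptions based on one reading of the public format, so consult the repository above for the authoritative schema.

```python
import json

# Hypothetical file name; field names may differ from the released schema.
with open("hiertext_train.json") as f:
    data = json.load(f)

for ann in data["annotations"]:          # one entry per image
    for paragraph in ann["paragraphs"]:  # geometric layout level
        for line in paragraph["lines"]:  # text line level
            line_text = " ".join(word["text"] for word in line["words"])
            print(ann["image_id"], line_text)
```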

* To appear at CVPR 2022 

UnrealText: Synthesizing Realistic Scene Text Images from the Unreal World

Mar 24, 2020
Shangbang Long, Cong Yao

Synthetic data has been a critical tool for training scene text detection and recognition models. On the one hand, synthetic word images have proven to be a successful substitute for real images in training scene text recognizers. On the other hand, scene text detectors still heavily rely on a large amount of manually annotated real-world images, which are expensive to obtain. In this paper, we introduce UnrealText, an efficient image synthesis method that renders realistic images via a 3D graphics engine. The 3D engine provides realistic appearance by rendering the scene and text as a whole, and allows for better text region proposals through access to precise scene information, e.g., surface normals and even object meshes. Comprehensive experiments verify its effectiveness on both scene text detection and recognition. We also generate a multilingual version for future research into multilingual scene text detection and recognition. The code and the generated datasets are released at: https://github.com/Jyouhou/UnrealText/ .
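
To make the idea of normal-aware region proposals concrete, here is an illustrative check (not the paper's actual algorithm) that a candidate box lies on a roughly planar surface, using the unit normal map rendered by the engine; the angle threshold is an arbitrary assumption.

```python
import numpy as np

def is_flat_region(normals, box, angle_thresh_deg=10.0):
    """Accept a candidate text region only if the surface normals inside it
    agree with their mean direction, i.e. the patch is roughly planar.
    normals: H x W x 3 array of unit surface normals from the 3D engine.
    box: (x0, y0, x1, y1) pixel coordinates of the candidate region."""
    x0, y0, x1, y1 = box
    patch = normals[y0:y1, x0:x1].reshape(-1, 3)
    mean_n = patch.mean(axis=0)
    mean_n /= np.linalg.norm(mean_n) + 1e-8
    cosines = patch @ mean_n
    max_dev = np.degrees(np.arccos(np.clip(cosines.min(), -1.0, 1.0)))
    return max_dev < angle_thresh_deg
```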

* Accepted to CVPR 2020 

A New Perspective for Flexible Feature Gathering in Scene Text Recognition Via Character Anchor Pooling

Feb 10, 2020
Shangbang Long, Yushuo Guan, Kaigui Bian, Cong Yao

Irregular scene text recognition has attracted much attention from the research community, mainly due to the complex shapes of text in natural scenes. However, recent methods either rely on shape-sensitive modules such as bounding box regression, or discard sequence learning. To tackle these issues, we propose a pair of coupled modules, termed the Character Anchoring Module (CAM) and the Anchor Pooling Module (APM), to extract high-level semantics from two-dimensional space to form feature sequences. The proposed CAM localizes text in a shape-insensitive way by anchoring characters individually. APM then interpolates and gathers features flexibly along the character anchors, which enables sequence learning. The two complementary modules realize a harmonious unification of spatial information and sequence learning. With the proposed modules, our recognition system surpasses previous state-of-the-art scores on irregular and perspective text datasets, including ICDAR 2015, CUTE, and Total-Text, while matching state-of-the-art performance on regular text datasets.
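
The core of anchor pooling, stripped of the surrounding machinery, is sampling a 2D feature map at an ordered set of anchor points to obtain a feature sequence; the PyTorch sketch below is a simplification for illustration, not the paper's exact APM.

```python
import torch
import torch.nn.functional as F

def gather_along_anchors(feature_map, anchors):
    """Bilinearly sample a feature map at a sequence of character anchors.
    feature_map: [1, C, H, W]; anchors: [T, 2] (x, y) pixel coordinates.
    Returns an ordered feature sequence of shape [T, C]."""
    _, _, H, W = feature_map.shape
    # Normalize pixel coordinates to [-1, 1], as required by grid_sample.
    norm = anchors.clone().float()
    norm[:, 0] = norm[:, 0] / (W - 1) * 2 - 1
    norm[:, 1] = norm[:, 1] / (H - 1) * 2 - 1
    grid = norm.view(1, 1, -1, 2)                       # [1, 1, T, 2]
    sampled = F.grid_sample(feature_map, grid, align_corners=True)
    return sampled.squeeze(0).squeeze(1).t()            # [T, C]
```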

* To appear at ICASSP 2020 

Alchemy: Techniques for Rectification Based Irregular Scene Text Recognition

Aug 30, 2019
Shangbang Long, Yushuo Guan, Bingxuan Wang, Kaigui Bian, Cong Yao

Reading text from natural images is challenging due to the great variety in text fonts, colors, sizes, and complex backgrounds. Perspective distortion and the non-linear spatial arrangement of characters make it even more difficult. While rectification-based methods are intuitively grounded and have pushed the envelope so far, their potential is far from being fully exploited. In this paper, we present a bag of tricks that prove to significantly improve the performance of rectification-based methods. On curved text datasets, our method achieves an accuracy of 89.6% on CUTE-80 and 76.3% on Total-Text, improving over the previous state of the art by 6.3% and 14.7%, respectively. Furthermore, our combination of tricks helps us win the ICDAR 2019 Arbitrary-Shaped Text Challenge (Latin script), achieving an accuracy of 74.3% on the held-out test set. We release our code as well as data samples for further exploration at https://github.com/Jyouhou/ICDAR2019-ArT-Recognition-Alchemy

* Technical report for participation in ICDAR2019-ArT recognition track 

SynthText3D: Synthesizing Scene Text Images from 3D Virtual Worlds

Jul 13, 2019
Minghui Liao, Boyu Song, Minghang He, Shangbang Long, Cong Yao, Xiang Bai

With the development of deep neural networks, the demand for a significant amount of annotated training data has become a performance bottleneck in many fields of research and application. Image synthesis can generate annotated images automatically and freely, and has therefore gained increasing attention recently. In this paper, we propose to synthesize scene text images from 3D virtual worlds, where precise descriptions of scenes, editable illumination/visibility, and realistic physics are provided. Unlike previous methods, which paste rendered text onto static 2D images, our method renders the 3D virtual scene and text instances as a whole. In this way, complex perspective transforms, various illuminations, and occlusions can be realized in our synthesized scene text images. Moreover, the same text instances can be produced from various viewpoints by randomly moving and rotating the virtual camera, which acts as the human eye. Experiments on standard scene text detection benchmarks using the generated synthetic data demonstrate the effectiveness and superiority of the proposed method. The code and synthetic data will be made available at https://github.com/MhLiao/SynthText3D
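
The viewpoint-variation step can be pictured as jittering the virtual camera around an anchor pose; the sketch below is only illustrative, and the offset and angle ranges are assumptions rather than values from the paper.

```python
import numpy as np

def random_camera_pose(anchor_position, max_offset=1.0,
                       max_yaw_deg=30.0, max_pitch_deg=15.0, rng=None):
    """Sample a perturbed camera pose so the same text instance is rendered
    from a different position and orientation. anchor_position: (x, y, z)."""
    rng = rng or np.random.default_rng()
    position = np.asarray(anchor_position, dtype=float) + \
        rng.uniform(-max_offset, max_offset, size=3)
    yaw = rng.uniform(-max_yaw_deg, max_yaw_deg)
    pitch = rng.uniform(-max_pitch_deg, max_pitch_deg)
    return position, yaw, pitch
```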

Scene Text Detection and Recognition: The Deep Learning Era

Dec 06, 2018
Shangbang Long, Xin He, Cong Yao

With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area in computer vision, scene text detection and recognition has been inescapably influenced by this wave of revolution, consequently entering the era of deep learning. In recent years, the community has witnessed substantial advancements in mindset, approach, and performance. This survey aims to summarize and analyze the major changes and significant progress of scene text detection and recognition in the deep learning era. In this article, we aim to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; and (3) look ahead to future trends. Specifically, we emphasize the dramatic differences brought by deep learning and the grand challenges that still remain. We expect this review to serve as a reference for researchers in the field. Related resources are also collected and compiled in our GitHub repository: https://github.com/Jyouhou/SceneTextPapers.

Automatic Judgment Prediction via Legal Reading Comprehension

Sep 18, 2018
Shangbang Long, Cunchao Tu, Zhiyuan Liu, Maosong Sun

Automatic judgment prediction aims to predict judicial results based on case materials. It has been studied for decades, mainly by lawyers and judges, and is considered a novel and promising application of artificial intelligence techniques in the legal field. Most existing methods follow the text classification framework, which fails to model the complex interactions among complementary case materials. To address this issue, we formalize the task as Legal Reading Comprehension (LRC) according to the legal scenario. Following the working protocol of human judges, LRC predicts the final judgment results based on three types of information: the fact description, the plaintiffs' pleas, and law articles. Moreover, we propose a novel LRC model, AutoJudge, which captures the complex semantic interactions among facts, pleas, and laws. In experiments, we construct a real-world civil case dataset for LRC. Experimental results on this dataset demonstrate that our model achieves significant improvements over state-of-the-art models. We will publish all source code and datasets of this work on GitHub for further research.
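
To illustrate the three-input setup (a toy sketch, not the actual AutoJudge architecture), a model might encode the fact description, the pleas, and the law articles separately and fuse them for a binary support/reject decision; all module names and sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class ThreeStreamJudge(nn.Module):
    """Encode facts, pleas, and law articles separately, then fuse the three
    representations for a binary judgment prediction."""
    def __init__(self, vocab_size, emb_dim=128, hid=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.enc = nn.GRU(emb_dim, hid, batch_first=True)
        self.cls = nn.Linear(3 * hid, 2)

    def encode(self, token_ids):
        _, h = self.enc(self.emb(token_ids))   # h: [1, batch, hid]
        return h.squeeze(0)

    def forward(self, facts, pleas, laws):
        fused = torch.cat(
            [self.encode(facts), self.encode(pleas), self.encode(laws)], dim=-1)
        return self.cls(fused)                 # logits: [batch, 2]
```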

* 10 pages, 4 figures, 6 tables 

TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes

Jul 04, 2018
Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, Cong Yao

Driven by deep neural networks and large-scale datasets, scene text detection methods have progressed substantially over the past years, continuously refreshing the performance records on various standard benchmarks. However, limited by the representations (axis-aligned rectangles, rotated rectangles, or quadrangles) adopted to describe text, existing methods may fall short when dealing with much more free-form text instances, such as curved text, which are actually very common in real-world scenarios. To tackle this problem, we propose a more flexible representation for scene text, termed TextSnake, which is able to effectively represent text instances in horizontal, oriented, and curved forms. In TextSnake, a text instance is described as a sequence of ordered, overlapping disks centered on its symmetric axis, each associated with a potentially varying radius and orientation. These geometric attributes are estimated via a Fully Convolutional Network (FCN) model. In experiments, the text detector based on TextSnake achieves state-of-the-art or comparable performance on Total-Text and SCUT-CTW1500, two newly published benchmarks with a special emphasis on curved text in natural images, as well as the widely used datasets ICDAR 2015 and MSRA-TD500. Specifically, TextSnake outperforms the baseline on Total-Text by more than 40% in F-measure.
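
To make the disk-based representation concrete, here is a minimal sketch (not the paper's post-processing) that rasterizes a text region mask from an ordered list of disks along the symmetric axis; the input format is assumed for illustration.

```python
import numpy as np

def disks_to_mask(disks, height, width):
    """Rasterize a text region from ordered, overlapping disks.
    disks: iterable of (center_x, center_y, radius) along the symmetric axis.
    Returns a boolean mask of shape (height, width)."""
    yy, xx = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=bool)
    for cx, cy, r in disks:
        mask |= (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2
    return mask
```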

* 17 pages, accepted to ECCV2018 