Guan Pang

DISGO: Automatic End-to-End Evaluation for Scene Text OCR

Aug 25, 2023
Mei-Yuh Hwang, Yangyang Shi, Ankit Ramchandani, Guan Pang, Praveen Krishnan, Lucas Kabela, Frank Seide, Samyak Datta, Jun Liu

This paper discusses the challenges of optical character recognition (OCR) on natural scenes, which is harder than OCR on documents due to unconstrained content and varied image backgrounds. We propose to uniformly use word error rate (WER) as a new measure for evaluating scene-text OCR, covering both end-to-end (e2e) performance and the performance of individual system components. We name the e2e metric DISGO WER, as it accounts for Deletion, Insertion, Substitution, and Grouping/Ordering errors. Finally, we propose using the concept of super blocks to automatically compute BLEU scores for e2e OCR machine translation. The small public SCUT test set is used to demonstrate the WER performance of a modularized OCR system.

* 9 pages 
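
The e2e metric above reduces OCR evaluation to counting word-level errors. As a rough illustration only (not the paper's implementation), the sketch below computes a plain word-level WER from a Levenshtein alignment over substitutions, deletions, and insertions; DISGO's additional grouping/ordering penalty and the super-block BLEU computation are not modeled here.

```python
# Minimal word-level WER sketch: edit distance over words, counting
# substitutions, deletions, and insertions against the reference.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                              # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),    # substitution (or match)
                dp[i - 1][j] + 1,                                 # deletion
                dp[i][j - 1] + 1,                                 # insertion
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# word_error_rate("open 24 hours", "open 24 hrs") -> 0.333...
```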

Text-Conditional Contextualized Avatars For Zero-Shot Personalization

Apr 14, 2023
Samaneh Azadi, Thomas Hayes, Akbar Shah, Guan Pang, Devi Parikh, Sonal Gupta

Recent large-scale text-to-image generation models have made significant improvements in the quality, realism, and diversity of the synthesized images and enable users to control the created content through language. However, the personalization aspect of these generative models is still challenging and under-explored. In this work, we propose a pipeline that enables personalization of image generation with avatars capturing a user's identity in a delightful way. Our pipeline is zero-shot, avatar texture and style agnostic, and does not require training on the avatar at all; it scales to millions of users, each of whom can generate a scene with their avatar. To render the avatar in a pose faithful to the given text prompt, we propose a novel text-to-3D pose diffusion model trained on a curated large-scale dataset of in-the-wild human poses, significantly improving on the performance of SOTA text-to-motion models. We show, for the first time, how to leverage large-scale image datasets to learn human 3D pose parameters and overcome the limitations of motion capture datasets.

MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration

Apr 28, 2022
Thomas Hayes, Songyang Zhang, Xi Yin, Guan Pang, Sasha Sheng, Harry Yang, Songwei Ge, Qiyuan Hu, Devi Parikh

Multimodal video-audio-text understanding and generation can benefit from datasets that are narrow but rich. The narrowness allows bite-sized challenges that the research community can make progress on. The richness ensures we are making progress on the core challenges. To this end, we present MUGEN, a large-scale video-audio-text dataset collected using the open-sourced platform game CoinRun [11]. We made substantial modifications to make the game richer by introducing audio and enabling new interactions. We trained RL agents with different objectives to navigate the game and interact with 13 objects and characters. This allows us to automatically extract a large collection of diverse videos and associated audio. We sample 375K video clips (3.2s each) and collect text descriptions from human annotators. Each video has additional annotations that are extracted automatically from the game engine, such as accurate semantic maps for each frame and templated textual descriptions. Altogether, MUGEN can help drive progress on many tasks in multimodal understanding and generation. We benchmark representative approaches on tasks involving video-audio-text retrieval and generation. Our dataset and code are released at: https://mugen-org.github.io/.

Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer

Apr 07, 2022
Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, Devi Parikh

Videos are created to express emotion, exchange information, and share experiences. Video synthesis has intrigued researchers for a long time. Despite the rapid progress driven by advances in visual synthesis, most existing studies focus on improving the frames' quality and the transitions between them, while little progress has been made in generating longer videos. In this paper, we present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames. Our evaluation shows that our model, trained on 16-frame video clips from standard benchmarks (UCF-101, Sky Time-lapse, and Taichi-HD), can generate diverse, coherent, and high-quality long videos. We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio. Videos and code can be found at https://songweige.github.io/projects/tats/index.html.

TextStyleBrush: Transfer of Text Aesthetics from a Single Example

Jun 15, 2021
Praveen Krishnan, Rama Kovvuri, Guan Pang, Boris Vassilev, Tal Hassner

We present a novel approach for disentangling the content of a text image from all aspects of its appearance. The appearance representation we derive can then be applied to new content, enabling one-shot transfer of the source style. We learn this disentanglement in a self-supervised manner. Our method processes entire word boxes, without requiring segmentation of text from background, per-character processing, or assumptions on string lengths. We show results in text domains that were previously handled by specialized methods, e.g., scene text and handwritten text. To these ends, we make a number of technical contributions: (1) We disentangle the style and content of a textual image into a non-parametric, fixed-dimensional vector. (2) We propose a novel approach inspired by StyleGAN but conditioned on the example style at different resolutions and on the content. (3) We present novel self-supervised training criteria which preserve both source style and target content using a pre-trained font classifier and text recognizer. Finally, (4) we introduce Imgur5K, a new challenging dataset of handwritten word images. We offer numerous qualitative photo-realistic results of our method and further show that it surpasses previous work in quantitative tests on scene text and handwriting datasets, as well as in a user study.

* 18 pages, 13 figures 
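
Contribution (3) above combines a content signal from a pre-trained text recognizer with a style signal from a pre-trained font classifier. The sketch below is a hedged guess at how such a combined self-supervised objective could look in PyTorch; the function names, shapes, and loss weights are illustrative assumptions, not the paper's actual losses.

```python
import torch.nn.functional as F

def style_content_losses(generated, source_style_image, target_text_labels,
                         recognizer, font_features, w_content=1.0, w_style=1.0):
    """generated / source_style_image: (B, C, H, W) word images.
    recognizer(images) -> per-character logits of shape (B, T, num_chars).
    font_features(images) -> style embeddings from a pre-trained font classifier.
    target_text_labels: (B, T) character indices of the target string."""
    # Content term: the generated image should be read as the target string.
    char_logits = recognizer(generated)
    content_loss = F.cross_entropy(char_logits.flatten(0, 1),
                                   target_text_labels.flatten())
    # Style term: match font-classifier features of the generated image
    # to those of the single source style example.
    style_loss = F.mse_loss(font_features(generated),
                            font_features(source_style_image).detach())
    return w_content * content_loss + w_style * style_loss
```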

TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text

May 12, 2021
Amanpreet Singh, Guan Pang, Mandy Toh, Jing Huang, Wojciech Galuba, Tal Hassner

A crucial component of the scene-text-based reasoning required for the TextVQA and TextCaps datasets is detecting and recognizing the text present in images with an optical character recognition (OCR) system. Current systems are held back by the unavailability of ground-truth text annotations for these datasets, as well as by the lack of scene text detection and recognition datasets on real images, which prevents progress in OCR and in evaluating scene-text-based reasoning in isolation from OCR systems. In this work, we propose TextOCR, an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images from the TextVQA dataset. We show that current state-of-the-art text-recognition (OCR) models fail to perform well on TextOCR and that training on TextOCR helps achieve state-of-the-art performance on multiple other OCR datasets as well. We use a TextOCR-trained OCR model to create the PixelM4C model, which can perform scene-text-based reasoning on an image in an end-to-end fashion, allowing us to revisit several design choices and achieve new state-of-the-art performance on the TextVQA dataset.

* To appear in CVPR 2021. 15 pages, 7 figures. First three authors contributed equally. More info at https://textvqa.org/textocr 

A Multiplexed Network for End-to-End, Multilingual OCR

Mar 29, 2021
Jing Huang, Guan Pang, Rama Kovvuri, Mandy Toh, Kevin J Liang, Praveen Krishnan, Xi Yin, Tal Hassner

Recent advances in OCR have shown that an end-to-end (E2E) training pipeline that includes both detection and recognition leads to the best results. However, many existing methods focus primarily on Latin-alphabet languages, often even only on case-insensitive English characters. In this paper, we propose an E2E approach, Multiplexed Multilingual Mask TextSpotter, that performs script identification at the word level and handles different scripts with different recognition heads, all while maintaining a unified loss that simultaneously optimizes script identification and the multiple recognition heads. Experiments show that our method outperforms a single-head model with a similar number of parameters on end-to-end recognition tasks, and achieves state-of-the-art results on the MLT17 and MLT19 joint text detection and script identification benchmarks. We believe our work is a step towards an end-to-end trainable and scalable multilingual, multi-purpose OCR system. Our code and model will be released.
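
As a hedged sketch of the multiplexing idea (PyTorch), the module below routes word-level features through one of several per-script recognition heads chosen by a script classifier; the feature shapes, linear heads, and hard argmax routing are illustrative assumptions rather than the paper's actual architecture or training-time routing. A unified loss would then add a cross-entropy on `script_logits` to the recognition loss of whichever head handled each word.

```python
import torch
import torch.nn as nn

class MultiplexedRecognizer(nn.Module):
    """Word-level script identification head plus one recognition head per script."""
    def __init__(self, feat_dim: int, charsets: dict):      # e.g. {"latin": 97, "arabic": 75}
        super().__init__()
        self.scripts = list(charsets)
        self.script_head = nn.Linear(feat_dim, len(charsets))
        self.rec_heads = nn.ModuleDict(
            {script: nn.Linear(feat_dim, n_chars) for script, n_chars in charsets.items()})

    def forward(self, word_feats: torch.Tensor):             # (B, T, feat_dim)
        script_logits = self.script_head(word_feats.mean(dim=1))   # (B, num_scripts)
        chosen = script_logits.argmax(dim=1)                  # hard routing at inference
        char_logits = [self.rec_heads[self.scripts[s]](word_feats[b])  # (T, charset_size)
                       for b, s in enumerate(chosen.tolist())]
        return script_logits, char_logits
```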

img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation

Dec 14, 2020
Vitor Albiero, Xingyu Chen, Xi Yin, Guan Pang, Tal Hassner

We propose real-time, six degrees of freedom (6DoF), 3D face pose estimation without face detection or landmark localization. We observe that estimating the 6DoF rigid transformation of a face is a simpler problem than facial landmark detection, which is often used for 3D face alignment. In addition, 6DoF offers more information than face bounding box labels. We leverage these observations to make multiple contributions: (a) We describe an easily trained, efficient, Faster R-CNN-based model which regresses 6DoF pose for all faces in a photo, without preliminary face detection. (b) We explain how pose is converted and kept consistent between the input photo and arbitrary crops created while training and evaluating our model. (c) Finally, we show how face poses can replace detection bounding box training labels. Tests on AFLW2000-3D and BIWI show that our method runs in real time and outperforms state-of-the-art (SotA) face pose estimators. Remarkably, our method also surpasses SotA models of comparable complexity on the WIDER FACE detection benchmark, despite not being optimized on bounding box labels.

* Joint first authorship: Vitor Albiero and Xingyu Chen 
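
Contribution (c) above implies that a 6DoF pose can stand in for a detection box. As a rough, hypothetical illustration (not the paper's conversion), the snippet below projects a crude generic 3D face model under a predicted rotation/translation and takes the bounding box of the projected points; the 3D points and camera intrinsics are placeholders.

```python
import numpy as np
import cv2

# Crude stand-in for a generic 3D face model (model coordinates, metres).
GENERIC_FACE_3D = np.array([
    [-0.07,  0.05, 0.00],   # left eye
    [ 0.07,  0.05, 0.00],   # right eye
    [ 0.00,  0.00, 0.05],   # nose tip
    [-0.05, -0.07, 0.00],   # left mouth corner
    [ 0.05, -0.07, 0.00],   # right mouth corner
], dtype=np.float32)

def pose_to_bbox(rvec, tvec, camera_matrix):
    """Project the generic face model under the predicted 6DoF pose
    (Rodrigues rotation vector + translation) and box the projected points."""
    pts, _ = cv2.projectPoints(GENERIC_FACE_3D, rvec, tvec, camera_matrix, None)
    pts = pts.reshape(-1, 2)
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    return float(x0), float(y0), float(x1), float(y1)
```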

SelfText Beyond Polygon: Unconstrained Text Detection with Box Supervision and Dynamic Self-Training

Dec 05, 2020
Weijia Wu, Enze Xie, Ruimao Zhang, Wenhai Wang, Guan Pang, Zhen Li, Hong Zhou, Ping Luo

Although a polygon is a more accurate representation than an upright bounding box for text detection, polygon annotations are extremely expensive and challenging to obtain. Unlike existing works that employ fully supervised training with polygon annotations, we propose a novel text detection system, SelfText Beyond Polygon (SBP), with Bounding Box Supervision (BBS) and Dynamic Self-Training (DST), which trains a polygon-based text detector with only a limited set of upright bounding box annotations. For BBS, we first use synthetic data with character-level annotations to train a Skeleton Attention Segmentation Network (SASN). The box-level annotations are then used to guide the generation of high-quality, polygon-like pseudo labels, which can be used to train any detector. In this way, our method achieves the same performance as text detectors trained with polygon annotations (i.e., both reach an 85.0% F-score for PSENet on ICDAR2015). For DST, by dynamically removing false alarms, we are able to leverage limited labeled data as well as massive unlabeled data to further outperform the expensive baseline. We hope SBP can provide a new perspective for text detection that saves huge labeling costs. Code is available at: github.com/weijiawu/SBP.
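
The DST component above hinges on repeatedly filtering the detector's own predictions before retraining on them. The loop below is a hedged, generic self-training sketch under that reading; `predict`, `retrain`, and the confidence-threshold schedule are placeholders, not the paper's actual procedure.

```python
def dynamic_self_training(predict, retrain, labeled_set, unlabeled_images,
                          rounds=3, start_thresh=0.9, decay=0.05):
    """predict(image) -> [(polygon, score), ...]
    retrain(training_samples) -> an updated predict function."""
    train_pool = list(labeled_set)
    for r in range(rounds):
        thresh = max(start_thresh - r * decay, 0.5)      # relax the filter over rounds
        pseudo = []
        for image in unlabeled_images:
            # Keep only confident detections; dropping the rest removes false alarms.
            kept = [poly for poly, score in predict(image) if score >= thresh]
            if kept:
                pseudo.append((image, kept))
        predict = retrain(train_pool + pseudo)           # train on real + pseudo labels
    return predict
```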
