In this work, we study the problem of word-level confidence calibration for scene-text recognition (STR). Although confidence calibration has been an active research area for several decades, calibration of structured and sequence predictions has been scarcely explored. We analyze several recent STR methods and show that they are consistently overconfident. We then focus on calibrating STR models at the word level rather than the character level. In particular, we demonstrate that for attention-based decoders, calibrating individual character predictions increases word-level calibration error compared to an uncalibrated model. In addition, we apply existing calibration methodologies as well as new sequence-based extensions to numerous STR models, reducing calibration error by a factor of up to nearly 7. Finally, we show consistently improved accuracy by applying our proposed sequence calibration method as a preprocessing step to beam search.
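To make "word-level calibration error" concrete, the following is a minimal sketch rather than the authors' evaluation code. It assumes two things the abstract does not state: that a word's confidence is the product of its per-character top probabilities, and that calibration error is measured with the commonly used expected calibration error (ECE) over equal-width bins. The function names `word_confidence` and `expected_calibration_error` are hypothetical.

```python
import math


def word_confidence(char_probs):
    """Assumed word confidence: product of the per-character top probabilities."""
    return math.prod(char_probs)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Illustrative (made-up) per-character probabilities for three predicted words,
# and whether each predicted word matched the ground truth.
per_word_char_probs = [[0.9, 0.9, 0.9], [0.99, 0.99], [0.8, 0.7, 0.9, 0.95]]
word_is_correct = [True, False, True]

word_confs = [word_confidence(p) for p in per_word_char_probs]
print(expected_calibration_error(word_confs, word_is_correct))
```

Because the word confidence here is a product over characters, rescaling each character probability changes the word-level confidence multiplicatively, which is one reason character-level and word-level calibration need not agree.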
We present a detector for curved text in natural images. We model scene text instances as tubes around their medial axes and introduce a parametrization-invariant loss function. We train a two-stage curved text detector and evaluate it on the curved text benchmarks CTW-1500 and Total-Text. Our approach matches or improves upon state-of-the-art results, notably on CTW-1500 by over 8 percentage points in F-score.
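As a rough illustration of the "tube around a medial axis" idea, the sketch below represents a text instance as a polyline axis with a radius at each vertex and tests whether a point falls inside the tube. This representation is an assumption made for illustration only: the abstract does not specify how the axis or radius is parametrized, and the loss function is not shown here. The names `point_segment_distance` and `in_tube` are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Return (distance, t) from point p to segment ab, with t in [0, 1] the projection parameter."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay), 0.0
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    cx, cy = ax + t * dx, ay + t * dy
    return math.hypot(px - cx, py - cy), t


def in_tube(p, axis, radii):
    """Simplified membership test: compare the distance to the closest point on each
    axis segment with the radius linearly interpolated at that point."""
    for i in range(len(axis) - 1):
        dist, t = point_segment_distance(p, axis[i], axis[i + 1])
        radius = (1.0 - t) * radii[i] + t * radii[i + 1]
        if dist <= radius:
            return True
    return False


# An L-shaped axis with a constant radius of 1 (made-up values).
axis = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
radii = [1.0, 1.0, 1.0]
print(in_tube((5.0, 0.5), axis, radii))  # inside the first segment's tube
print(in_tube((5.0, 2.0), axis, radii))  # too far from the axis
```

The membership test depends only on the shape of the axis and the radii along it, not on how the polyline's vertices happen to be spaced, which loosely mirrors the motivation for a parametrization-invariant loss.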