In recent years, attention-based scene text recognition methods have been very popular and attracted the interest of many researchers. Attention-based methods can adaptively focus attention on a small area or even single point during decoding, in which the attention matrix is nearly one-hot distribution. Furthermore, the whole feature maps will be weighted and summed by all attention matrices during inference, causing huge redundant computations. In this paper, we propose an efficient attention-free Single-Point Decoding Network (dubbed SPDN) for scene text recognition, which can replace the traditional attention-based decoding network. Specifically, we propose Single-Point Sampling Module (SPSM) to efficiently sample one key point on the feature map for decoding one character. In this way, our method can not only precisely locate the key point of each character but also remove redundant computations. Based on SPSM, we design an efficient and novel single-point decoding network to replace the attention-based decoding network. Extensive experiments on publicly available benchmarks verify that our SPDN can greatly improve decoding efficiency without sacrificing performance.
Fine-grained recognition, e.g., vehicle identification or bird classification, naturally has specific hierarchical labels, where fine levels are always much harder to be classified than coarse levels. However, most of the recent deep learning based methods neglect the semantic structure of fine-grained objects, and do not take advantages of the traditional fine-grained recognition techniques (e.g. coarse-to-fine classification). In this paper, we propose a novel framework, i.e., semantic bilinear pooling, for fine-grained recognition with hierarchical multi-label learning. This framework can adaptively learn the semantic information from the hierarchical labels. Specifically, a generalized softmax loss is designed for the training of the proposed framework, in order to fully exploit the semantic priors via considering the relevance between adjacent levels. A variety of experiments on several public datasets show that our proposed method has very impressive performance with low feature dimensions compared to other state-of-the-art methods.