Scene text recognition (STR) methods have struggled to attain high accuracy and fast inference speed. Autoregressive (AR)-based STR model uses the previously recognized characters to decode the next character iteratively. It shows superiority in terms of accuracy. However, the inference speed is slow also due to this iteration. Alternatively, parallel decoding (PD)-based STR model infers all the characters in a single decoding pass. It has advantages in terms of inference speed but worse accuracy, as it is difficult to build a robust recognition context in such a pass. In this paper, we first present an empirical study of AR decoding in STR. In addition to constructing a new AR model with the top accuracy, we find out that the success of AR decoder lies also in providing guidance on visual context perception rather than language modeling as claimed in existing studies. As a consequence, we propose Context Perception Parallel Decoder (CPPD) to decode the character sequence in a single PD pass. CPPD devises a character counting module and a character ordering module. Given a text instance, the former infers the occurrence count of each character, while the latter deduces the character reading order and placeholders. Together with the character prediction task, they construct a context that robustly tells what the character sequence is and where the characters appear, well mimicking the context conveyed by AR decoding. Experiments on both English and Chinese benchmarks demonstrate that CPPD models achieve highly competitive accuracy. Moreover, they run approximately 7x faster than their AR counterparts, and are also among the fastest recognizers. The code will be released soon.
Massive rumors usually appear along with breaking news or trending topics, seriously hindering the truth. Existing rumor detection methods are mostly focused on the same domain, and thus have poor performance in cross-domain scenarios due to domain shift. In this work, we propose an end-to-end instance-wise and prototype-wise contrastive learning model with a cross-attention mechanism for cross-domain rumor detection. The model not only performs cross-domain feature alignment but also enforces target samples to align with the corresponding prototypes of a given source domain. Since target labels in a target domain are unavailable, we use a clustering-based approach with carefully initialized centers by a batch of source domain samples to produce pseudo labels. Moreover, we use a cross-attention mechanism on a pair of source data and target data with the same labels to learn domain-invariant representations. Because the samples in a domain pair tend to express similar semantic patterns, especially on the people's attitudes (e.g., supporting or denying) towards the same category of rumors, the discrepancy between a pair of the source domain and target domain will be decreased. We conduct experiments on four groups of cross-domain datasets and show that our proposed model achieves state-of-the-art performance.
Many complex systems in the real world can be characterized by attributed networks. To mine the potential information in these networks, deep embedded clustering, which obtains node representations and clusters simultaneously, has been paid much attention in recent years. Under the assumption of consistency for data in different views, the cluster structure of network topology and that of node attributes should be consistent for an attributed network. However, many existing methods ignore this property, even though they separately encode node representations from network topology and node attributes meanwhile clustering nodes on representation vectors learnt from one of the views. Therefore, in this study, we propose an end-to-end deep embedded clustering model for attributed networks. It utilizes graph autoencoder and node attribute autoencoder to respectively learn node representations and cluster assignments. In addition, a distribution consistency constraint is introduced to maintain the latent consistency of cluster distributions of two views. Extensive experiments on several datasets demonstrate that the proposed model achieves significantly better or competitive performance compared with the state-of-the-art methods. The source code can be found at https://github.com/Zhengymm/DCP.
Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription. This hybrid architecture, although accurate, is complex and less efficient. In this study, we propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework, which dispenses with the sequential modeling entirely. The method, termed SVTR, firstly decomposes an image text into small patches named character components. Afterward, hierarchical stages are recurrently carried out by component-level mixing, merging and/or combining. Global and local mixing blocks are devised to perceive the inter-character and intra-character patterns, leading to a multi-grained character component perception. Thus, characters are recognized by a simple linear prediction. Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR. SVTR-L (Large) achieves highly competitive accuracy in English and outperforms existing methods by a large margin in Chinese, while running faster. In addition, SVTR-T (Tiny) is an effective and much smaller model, which shows appealing speed at inference. The code is publicly available at https://github.com/PaddlePaddle/PaddleOCR.
Exploring meaningful structural regularities embedded in networks is a key to understanding and analyzing the structure and function of a network. The node-attribute information can help improve such understanding and analysis. However, most of the existing methods focus on detecting traditional communities, i.e., groupings of nodes with dense internal connections and sparse external ones. In this paper, based on the connectivity behavior of nodes and homogeneity of attributes, we propose a principle model (named GNAN), which can generate both topology information and attribute information. The new model can detect not only community structure, but also a range of other types of structure in networks, such as bipartite structure, core-periphery structure, and their mixture structure, which are collectively referred to as generalized structure. The proposed model that combines topological information and node-attribute information can detect communities more accurately than the model that only uses topology information. The dependency between attributes and communities can be automatically learned by our model and thus we can ignore the attributes that do not contain useful information. The model parameters are inferred by using the expectation-maximization algorithm. And a case study is provided to show the ability of our model in the semantic interpretability of communities. Experiments on both synthetic and real-world networks show that the new model is competitive with other state-of-the-art models.
Attention mechanism has been proven effective on natural language processing. This paper proposes an attention boosted natural language inference model named aESIM by adding word attention and adaptive direction-oriented attention mechanisms to the traditional Bi-LSTM layer of natural language inference models, e.g. ESIM. This makes the inference model aESIM has the ability to effectively learn the representation of words and model the local subsentential inference between pairs of premise and hypothesis. The empirical studies on the SNLI, MultiNLI and Quora benchmarks manifest that aESIM is superior to the original ESIM model.