Scene text detection has witnessed rapid development in recent years. However, there still exists two main challenges: 1) many methods suffer from false positives in their text representations; 2) the large scale variance of scene texts makes it hard for network to learn samples. In this paper, we propose the ContourNet, which effectively handles these two problems taking a further step toward accurate arbitrary-shaped text detection. At first, a scale-insensitive Adaptive Region Proposal Network (Adaptive-RPN) is proposed to generate text proposals by only focusing on the Intersection over Union (IoU) values between predicted and ground-truth bounding boxes. Then a novel Local Orthogonal Texture-aware Module (LOTM) models the local texture information of proposal features in two orthogonal directions and represents text region with a set of contour points. Considering that the strong unidirectional or weakly orthogonal activation is usually caused by the monotonous texture characteristic of false-positive patterns (e.g. streaks.), our method effectively suppresses these false positives by only outputting predictions with high response value in both orthogonal directions. This gives more accurate description of text regions. Extensive experiments on three challenging datasets (Total-Text, CTW1500 and ICDAR2015) verify that our method achieves the state-of-the-art performance. Code is available at https://github.com/wangyuxin87/ContourNet.
Taking full advantage of the information from both vision and language is critical for the video captioning task. Existing models lack adequate visual representation due to the neglect of interaction between object, and sufficient training for content-related words due to long-tailed problems. In this paper, we propose a complete video captioning system including both a novel model and an effective training strategy. Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation. Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model. The ELM generates more semantically similar word proposals which extend the ground-truth words used for training to deal with the long-tailed problem. Experimental evaluations on three benchmarks: MSVD, MSR-VTT and VATEX show the proposed ORG-TRL system achieves state-of-the-art performance. Extensive ablation studies and visualizations illustrate the effectiveness of our system.
Both the Dictionary Learning (DL) and Convolutional Neural Networks (CNN) are powerful image representation learning systems based on different mechanisms and principles, however whether we can seamlessly integrate them to improve the per-formance is noteworthy exploring. To address this issue, we propose a novel generalized end-to-end representation learning architecture, dubbed Convolutional Dictionary Pair Learning Network (CDPL-Net) in this paper, which integrates the learning schemes of the CNN and dictionary pair learning into a unified framework. Generally, the architecture of CDPL-Net includes two convolutional/pooling layers and two dictionary pair learn-ing (DPL) layers in the representation learning module. Besides, it uses two fully-connected layers as the multi-layer perception layer in the nonlinear classification module. In particular, the DPL layer can jointly formulate the discriminative synthesis and analysis representations driven by minimizing the batch based reconstruction error over the flatted feature maps from the convolution/pooling layer. Moreover, DPL layer uses l1-norm on the analysis dictionary so that sparse representation can be delivered, and the embedding process will also be robust to noise. To speed up the training process of DPL layer, the efficient stochastic gradient descent is used. Extensive simulations on real databases show that our CDPL-Net can deliver enhanced performance over other state-of-the-art methods.
In this paper, we investigate the unsupervised deep representation learning issue and technically propose a novel framework called Deep Self-representative Concept Factorization Network (DSCF-Net), for clustering deep features. To improve the representation and clustering abilities, DSCF-Net explicitly considers discovering hidden deep semantic features, enhancing the robustness proper-ties of the deep factorization to noise and preserving the local man-ifold structures of deep features. Specifically, DSCF-Net seamlessly integrates the robust deep concept factorization, deep self-expressive representation and adaptive locality preserving feature learning into a unified framework. To discover hidden deep repre-sentations, DSCF-Net designs a hierarchical factorization architec-ture using multiple layers of linear transformations, where the hierarchical representation is performed by formulating the prob-lem as optimizing the basis concepts in each layer to improve the representation indirectly. DSCF-Net also improves the robustness by subspace recovery for sparse error correction firstly and then performs the deep factorization in the recovered visual subspace. To obtain locality-preserving representations, we also present an adaptive deep self-representative weighting strategy by using the coefficient matrix as the adaptive reconstruction weights to keep the locality of representations. Extensive comparison results with several other related models show that DSCF-Net delivers state-of-the-art performance on several public databases.
In this paper, we propose a robust representation learning model called Adaptive Structure-constrained Low-Rank Coding (AS-LRC) for the latent representation of data. To recover the underlying subspaces more accurately, AS-LRC seamlessly integrates an adaptive weighting based block-diagonal structure-constrained low-rank representation and the group sparse salient feature extraction into a unified framework. Specifically, AS-LRC performs the latent decomposition of given data into a low-rank reconstruction by a block-diagonal codes matrix, a group sparse locality-adaptive salient feature part and a sparse error part. To enforce the block-diagonal structures adaptive to different real datasets for the low-rank recovery, AS-LRC clearly computes an auto-weighting matrix based on the locality-adaptive features and multiplies by the low-rank coefficients for direct minimization at the same time. This encourages the codes to be block-diagonal and can avoid the tricky issue of choosing optimal neighborhood size or kernel width for the weight assignment, suffered in most local geometrical structures-preserving low-rank coding methods. In addition, our AS-LRC selects the L2,1-norm on the projection for extracting group sparse features rather than learning low-rank features by Nuclear-norm regularization, which can make learnt features robust to noise and outliers in samples, and can also make the feature coding process efficient. Extensive visualizations and numerical results demonstrate the effectiveness of our AS-LRC for image representation and recovery.
We propose a novel and unsupervised representation learning model, i.e., Robust Block-Diagonal Adaptive Locality-constrained Latent Representation (rBDLR). rBDLR is able to recover multi-subspace structures and extract the adaptive locality-preserving salient features jointly. Leveraging on the Frobenius-norm based latent low-rank representation model, rBDLR jointly learns the coding coefficients and salient features, and improves the results by enhancing the robustness to outliers and errors in given data, preserving local information of salient features adaptively and ensuring the block-diagonal structures of the coefficients. To improve the robustness, we perform the latent representation and adaptive weighting in a recovered clean data space. To force the coefficients to be block-diagonal, we perform auto-weighting by minimizing the reconstruction error based on salient features, constrained using a block-diagonal regularizer. This ensures that a strict block-diagonal weight matrix can be obtained and salient features will possess the adaptive locality preserving ability. By minimizing the difference between the coefficient and weights matrices, we can obtain a block-diagonal coefficients matrix and it can also propagate and exchange useful information between salient features and coefficients. Extensive results demonstrate the superiority of rBDLR over other state-of-the-art methods.