
"Topic": models, code, and papers

A multi-task convolutional neural network for mega-city analysis using very high resolution satellite imagery and geospatial data

Feb 26, 2017
Fan Zhang, Bo Du, Liangpei Zhang

Mega-city analysis with very high resolution (VHR) satellite images has been drawing increasing interest in the fields of city planning and social investigation. It is known that accurate land-use, urban density, and population distribution information is the key to mega-city monitoring and environmental studies. Therefore, how to generate land-use, urban density, and population distribution maps at a fine scale using VHR satellite images has become a hot topic. Previous studies have focused solely on individual tasks with elaborate hand-crafted features and have ignored the relationship between different tasks. In this study, we aim to propose a universal framework which can: 1) automatically learn the internal feature representation from the raw image data; and 2) simultaneously produce fine-scale land-use, urban density, and population distribution maps. For the first target, a deep convolutional neural network (CNN) is applied to learn the hierarchical feature representation from the raw image data. For the second target, a novel CNN-based universal framework is proposed to process the VHR satellite images and generate the land-use, urban density, and population distribution maps. To the best of our knowledge, this is the first CNN-based mega-city analysis method which can process a VHR remote sensing image with such a large data volume. A VHR satellite image (1.2 m spatial resolution) of the center of Wuhan covering an area of 2606 km² was used to evaluate the proposed method. The experimental results confirm that the proposed method can achieve promising accuracy for land-use, urban density, and population distribution maps.
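The shared-trunk, multi-head idea behind target 2) can be sketched in a few lines of numpy. This is a toy illustration under assumed sizes (32×32 RGB patches, a 64-d shared feature, 5 land-use classes), not the paper's network: a single linear projection stands in for the deep CNN trunk, and three heads predict a land-use distribution, an urban-density value and a population value from the same features.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_features(patch, W_shared):
    # Stand-in for the shared CNN trunk: one linear projection plus ReLU.
    # The paper learns a deep hierarchy of convolutional features instead.
    return np.maximum(patch.ravel() @ W_shared, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes: 32x32 RGB patch, 64-d shared feature, 5 land-use classes.
W_shared  = rng.standard_normal((32 * 32 * 3, 64)) * 0.01
W_landuse = rng.standard_normal((64, 5)) * 0.01   # land-use classification head
W_density = rng.standard_normal((64, 1)) * 0.01   # urban-density regression head
W_pop     = rng.standard_normal((64, 1)) * 0.01   # population regression head

patch = rng.standard_normal((32, 32, 3))          # one VHR image patch
f = shared_features(patch, W_shared)
landuse_probs = softmax(f @ W_landuse)
density = float((f @ W_density)[0])
population = float((f @ W_pop)[0])
```

The point is structural: all three maps are predicted from one learned representation, so the tasks share features and supervision rather than being solved in isolation.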


Single Image Super Resolution via Manifold Approximation

Mar 10, 2015
Chinh Dang, Hayder Radha

Image super-resolution remains an important research topic to overcome the limitations of physical acquisition systems, and to support the development of high-resolution displays. Previous example-based super-resolution approaches mainly focus on analyzing the co-occurrence properties of low-resolution and high-resolution patches. Recently, we proposed a novel single image super-resolution approach based on linear manifold approximation of the high-resolution image-patch space [1]. The image super-resolution problem is then formulated as an optimization problem of searching for the best-matched high-resolution patch in the manifold for a given low-resolution patch. We developed a novel technique based on the l1-norm sparse graph to learn a set of low-dimensional affine spaces, or tangent subspaces, of the high-resolution patch manifold. The optimization problem is then solved based on the learned set of tangent subspaces. In this paper, we build on our recent work as follows. First, we consider and analyze each tangent subspace as one point in a Grassmann manifold, which helps to compute geodesic pairwise distances among these tangent subspaces. Second, we develop a min-max algorithm to select an optimal subset of tangent subspaces. This optimal subset reduces the computational cost while still preserving the quality of the reconstructed high-resolution image. Third, to further achieve lower computational complexity, we perform hierarchical clustering on the optimal subset based on Grassmann manifold distances. Finally, we analytically prove the validity of the proposed Grassmann-distance based clustering. A comparison of the obtained results with other state-of-the-art methods clearly indicates the viability of the new proposed framework.

* This paper has been withdrawn by the author due to a crucial sign error in equation 1 
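The core matching step the abstract describes is easy to picture: the query patch is projected onto each candidate affine (tangent) subspace and the one with the smallest residual wins. A minimal numpy sketch with random stand-in subspaces — the l1-norm sparse-graph learning, Grassmann distances and min-max subset selection are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n_sub = 16, 3, 5  # patch dimension, subspace dimension, subspace count

# Each tangent subspace is an affine space: a mean patch plus an orthonormal
# basis of k directions (here random; the paper learns them from data).
subspaces = []
for _ in range(n_sub):
    mean = rng.standard_normal(d)
    basis, _ = np.linalg.qr(rng.standard_normal((d, k)))
    subspaces.append((mean, basis))

def best_subspace(x, subspaces):
    # Project x onto each affine subspace and pick the smallest residual.
    residuals = [np.linalg.norm(x - (m + B @ (B.T @ (x - m))))
                 for m, B in subspaces]
    return int(np.argmin(residuals))
```

A point constructed on one of the subspaces projects onto it with zero residual, so the search recovers the correct index.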


Solving Linear Equations by Classical Jacobi-SR Based Hybrid Evolutionary Algorithm with Uniform Adaptation Technique

Apr 08, 2013
R. M. Jalal Uddin Jamali, M. M. A. Hashem, M. Mahfuz Hasan, Md. Bazlar Rahman

Solving a set of simultaneous linear equations is probably the most important topic in numerical methods. For solving linear equations, iterative methods are preferred over direct methods, especially when the coefficient matrix is sparse. The rate of convergence of an iterative method can be increased by using the Successive Relaxation (SR) technique, but the SR technique is highly sensitive to the relaxation factor, ω. Recently, hybridization of the classical Gauss-Seidel based successive relaxation technique with evolutionary computation techniques has successfully been used to solve large sets of linear equations, in which the relaxation factors are self-adapted. In this paper, a new hybrid algorithm is proposed in which uniform adaptive evolutionary computation techniques and the classical Jacobi based SR technique are used instead of the classical Gauss-Seidel based SR technique. The proposed Jacobi-SR based uniform adaptive hybrid algorithm can inherently be implemented efficiently in a parallel processing environment, whereas Gauss-Seidel-SR based hybrid algorithms cannot. The convergence theorem and adaptation theorem of the proposed algorithm are proved theoretically, and the performance of the proposed Jacobi-SR based uniform adaptive hybrid evolutionary algorithm is compared with the Gauss-Seidel-SR based uniform adaptive hybrid evolutionary algorithm, as well as with both the classical Jacobi-SR method and the Gauss-Seidel-SR method, in the experimental domain. The proposed Jacobi-SR based hybrid algorithm outperforms the Gauss-Seidel-SR based hybrid algorithm, as well as both classical methods, in terms of convergence speed and effectiveness.

* Journal of Engineering Science, Vol. 1, No. 2, pp. 11-24, 2010 (ISSN: 2075-4914) 
* 14 Pages, 5 Figures and 7 Tables 
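The classical Jacobi-SR update at the heart of the hybrid algorithm is x_{k+1} = x_k + ω D⁻¹(b − A x_k), with D = diag(A). A minimal Python sketch with a fixed ω — in the paper ω is self-adapted by the uniform adaptive evolutionary technique rather than supplied by hand:

```python
import numpy as np

def jacobi_sr(A, b, omega=1.0, iters=200):
    # Jacobi iteration with relaxation: x_{k+1} = x_k + omega * D^{-1}(b - A x_k),
    # where D = diag(A). Every component is updated from x_k alone, so one
    # sweep is embarrassingly parallel (unlike Gauss-Seidel sweeps, which use
    # already-updated components within the same sweep).
    x = np.zeros_like(b, dtype=float)
    d = np.diag(A).astype(float)
    for _ in range(iters):
        x = x + omega * (b - A @ x) / d
    return x

# Small diagonally dominant system for illustration.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = jacobi_sr(A, b, omega=1.0)
```

The independence of the component updates is exactly why the Jacobi-based variant parallelizes naturally while the Gauss-Seidel-based one does not.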


Semantic Segmentation for Urban-Scene Images

Oct 20, 2021
Shorya Sharma

Urban-scene image segmentation is an important and trending topic in computer vision, with wide use cases such as autonomous driving [1]. Starting with the breakthrough work of Long et al. [2], which introduced Fully Convolutional Networks (FCNs), the development of novel architectures and practical uses of neural networks in semantic segmentation has been expedited over the past five years. Aside from seeking solutions in general model design for the information shrinkage due to pooling, urban-scene images themselves have intrinsic features such as positional patterns [3]. Our project seeks an advanced and integrated solution that specifically targets urban-scene image semantic segmentation among the most novel approaches in the current field. We re-implement the cutting-edge model DeepLabv3+ [4] with a ResNet-101 [5] backbone as our strong baseline model. Based upon DeepLabv3+, we incorporate HANet [3] to account for the vertical spatial priors in urban-scene image tasks. To boost model efficiency and performance, we further explore the Atrous Spatial Pyramid Pooling (ASPP) layer in DeepLabv3+ and infuse a computationally efficient variation, the "Waterfall" Atrous Spatial Pooling (WASP) [6] architecture, into our model. We find that our two-step integrated model progressively improves the mean Intersection-Over-Union (mIoU) score over the baseline model. In particular, HANet successfully identifies height-driven patterns and improves the per-class IoU of common class labels in urban scenarios, such as fence and bus. We also demonstrate the improvement in model efficiency with the help of WASP, in terms of computation time during training and parameter reduction from the original ASPP module.

* arXiv admin note: text overlap with arXiv:2003.05128, arXiv:1912.03183 by other authors 
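The ASPP/WASP distinction can be illustrated with a single-channel numpy sketch: ASPP runs dilated convolutions at several rates in parallel on the same input, while the waterfall variant cascades the branches so each reuses the previous branch's output. This is a structural toy (one channel, 3×3 kernels, hand-rolled dilation), not the DeepLabv3+ or WASP implementation:

```python
import numpy as np

def dilated_conv2d(x, w, rate):
    # Same-size dilated convolution: single channel, zero padding, odd kernel.
    k = w.shape[0]
    pad = rate * (k // 2)
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):
            out += w[i, j] * xp[i * rate:i * rate + x.shape[0],
                                j * rate:j * rate + x.shape[1]]
    return out

def aspp(x, weights, rates=(1, 6, 12, 18)):
    # ASPP: parallel dilated-conv branches over the same input, stacked.
    return np.stack([dilated_conv2d(x, w, r) for w, r in zip(weights, rates)])

def wasp(x, weights, rates=(1, 6, 12, 18)):
    # WASP: each branch feeds the next (waterfall), so intermediate results
    # are reused while the effective field of view still grows with the rates.
    outs, h = [], x
    for w, r in zip(weights, rates):
        h = dilated_conv2d(h, w, r)
        outs.append(h)
    return np.stack(outs)
```

The cascade is what allows WASP to reach a large receptive field with fewer parameters than the fully parallel ASPP branches.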


Enhancing Object Detection for Autonomous Driving by Optimizing Anchor Generation and Addressing Class Imbalance

Apr 08, 2021
Manuel Carranza-García, Pedro Lara-Benítez, Jorge García-Gutiérrez, José C. Riquelme

Object detection has been one of the most active topics in computer vision in recent years. Recent works have mainly focused on pushing the state of the art on the general-purpose COCO benchmark. However, the use of such detection frameworks in specific applications such as autonomous driving is still an area to be addressed. This study presents an enhanced 2D object detector based on Faster R-CNN that is better suited for the context of autonomous vehicles. Two main aspects are improved: the anchor generation procedure and the performance drop in minority classes. The default uniform anchor configuration is not suitable in this scenario due to the perspective projection of the vehicle cameras. Therefore, we propose a perspective-aware methodology that divides the image into key regions via clustering and uses evolutionary algorithms to optimize the base anchors for each of them. Furthermore, we add a module that enhances the precision of the second-stage header network by including the spatial information of the candidate regions proposed in the first stage. We also explore different re-weighting strategies to address the foreground-foreground class imbalance, showing that the use of a reduced version of focal loss can significantly improve the detection of difficult and underrepresented objects in two-stage detectors. Finally, we design an ensemble model to combine the strengths of the different learning strategies. Our proposal is evaluated on the Waymo Open Dataset, which is the most extensive and diverse to date. The results demonstrate an average accuracy improvement of 6.13% mAP when using the best single model, and of 9.69% mAP with the ensemble. The proposed modifications over Faster R-CNN do not increase the computational cost and can easily be extended to optimize other anchor-based detection frameworks.

* Neurocomputing, 2021 
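The re-weighting idea builds on the standard binary focal loss of Lin et al., sketched below; the paper uses a reduced variant whose exact form is not reproduced here, and the `alpha`/`gamma` values are illustrative defaults, not the paper's settings:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    # Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
    # Easy, well-classified examples get a small (1 - p_t)**gamma factor,
    # so difficult and underrepresented classes dominate the gradient.
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```

A confident correct prediction contributes far less loss than a confident wrong one, which is the mechanism exploited for the foreground-foreground imbalance.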


Disentangling Latent Space for Unsupervised Semantic Face Editing

Nov 05, 2020
Kanglin Liu, Gaofeng Cao, Fei Zhou, Bozhi Liu, Jiang Duan, Guoping Qiu

Editing facial images created by StyleGAN is a popular research topic with important applications. By editing the latent vectors, it is possible to control facial attributes such as smile, age, etc. However, facial attributes are entangled in the latent space, which makes it very difficult to independently control a specific attribute without affecting the others. The key to developing neat semantic control is to completely disentangle the latent space and perform image editing in an unsupervised manner. In this paper, we present a new technique termed Structure-Texture Independent Architecture with Weight Decomposition and Orthogonal Regularization (STIA-WO) to disentangle the latent space. The GAN model applying STIA-WO is referred to as STGAN-WO. STGAN-WO performs weight decomposition by utilizing the style vector to construct a fully controllable weight matrix for controlling the image synthesis, and utilizes orthogonal regularization to ensure that each entry of the style vector only controls one factor of variation. To further disentangle the facial attributes, STGAN-WO introduces a structure-texture independent architecture which utilizes two independently and identically distributed (i.i.d.) latent vectors to control the synthesis of the texture and structure components in a disentangled way. Unsupervised semantic editing is achieved by moving the latent code in the coarse layers along its orthogonal directions to change texture-related attributes, or by changing the latent code in the fine layers to manipulate structure-related ones. We present experimental results which show that our new STGAN-WO can achieve better attribute editing than state-of-the-art methods (The code is available at

* 11 pages, 8 figures 
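The orthogonal regularization term is concrete enough to sketch: it penalizes ‖WᵀW − I‖²_F, which is zero exactly when the columns of the weight matrix are orthonormal, so each style-vector entry steers an independent direction. A minimal numpy version (the surrounding weight-decomposition machinery is not reproduced):

```python
import numpy as np

def orthogonal_penalty(W):
    # ||W^T W - I||_F^2: zero exactly when the columns of W are orthonormal.
    # Driving this toward zero makes each entry of a style vector multiplying
    # W control an independent factor of variation.
    g = W.T @ W - np.eye(W.shape[1])
    return float(np.sum(g * g))
```

An orthonormal basis (e.g. the Q factor of a QR decomposition) incurs essentially zero penalty, while a rank-deficient matrix is penalized heavily.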


Equivalent Classification Mapping for Weakly Supervised Temporal Action Localization

Aug 18, 2020
Le Yang, Dingwen Zhang, Tao Zhao, Junwei Han

Weakly supervised temporal action localization is a newly emerging yet widely studied topic in recent years. The existing methods can be categorized into two localization-by-classification pipelines, i.e., the pre-classification pipeline and the post-classification pipeline. The pre-classification pipeline first performs classification on each video snippet and then aggregates the snippet-level classification scores to obtain the video-level classification score, while the post-classification pipeline aggregates the snippet-level features first and then predicts the video-level classification score based on the aggregated feature. Although the classifiers in these two pipelines are used in different ways, the role they play is exactly the same: to classify the given features to identify the corresponding action categories. To this end, an ideal classifier can make both pipelines work. This inspires us to simultaneously learn these two pipelines in a unified framework to obtain an effective classifier. Specifically, in the proposed learning framework, we implement two parallel network streams to model the two localization-by-classification pipelines simultaneously and make the two network streams share the same classifier, thus achieving the novel Equivalent Classification Mapping (ECM) mechanism. Considering that an ideal classifier would make the classification results of the two network streams identical, and would make the frame-level classification scores obtained from the pre-classification pipeline consistent with the feature aggregation weights in the post-classification pipeline, we further introduce an equivalent classification loss and an equivalent weight transition module to endow the proposed learning framework with such properties. Comprehensive experiments are carried out on three benchmarks, and the proposed ECM achieves superior performance over other state-of-the-art methods.
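Why one shared classifier can serve both pipelines is easiest to see in the linear case: with mean aggregation, classifying every snippet and then averaging the scores gives exactly the same video-level score as averaging the features and classifying once. A toy numpy illustration (the paper's streams are deep networks with learned aggregation weights, where the equivalence is instead encouraged by the ECM losses rather than exact):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, C = 10, 8, 4                 # snippets, feature dim, action classes
X = rng.standard_normal((T, d))    # snippet-level features of one video
W = rng.standard_normal((d, C))    # the single classifier shared by both streams

# Pre-classification pipeline: classify every snippet, then aggregate scores.
pre_score = (X @ W).mean(axis=0)

# Post-classification pipeline: aggregate features, then classify once.
post_score = X.mean(axis=0) @ W
```

The two orderings commute by linearity, which is the intuition behind sharing one classifier across both streams.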


Towards Fine-grained Human Pose Transfer with Detail Replenishing Network

May 26, 2020
Lingbo Yang, Pan Wang, Chang Liu, Zhanning Gao, Peiran Ren, Xinfeng Zhang, Shanshe Wang, Siwei Ma, Xiansheng Hua, Wen Gao

Human pose transfer (HPT) is an emerging research topic with huge potential in fashion design, media production, online advertising and virtual reality. For these applications, the visual realism of fine-grained appearance details is crucial for production quality and user engagement. However, existing HPT methods often suffer from three fundamental issues: detail deficiency, content ambiguity and style inconsistency, which severely degrade the visual quality and realism of the generated images. Aiming towards real-world applications, we develop a more challenging yet practical HPT setting, termed Fine-grained Human Pose Transfer (FHPT), with a higher focus on semantic fidelity and detail replenishment. Concretely, we analyze the potential design flaws of existing methods via an illustrative example, and establish the core FHPT methodology by combining the ideas of content synthesis and feature transfer in a mutually-guided fashion. Thereafter, we substantiate the proposed methodology with a Detail Replenishing Network (DRN) and a corresponding coarse-to-fine model training scheme. Moreover, we build up a complete suite of fine-grained evaluation protocols to address the challenges of FHPT in a comprehensive manner, including semantic analysis, structural detection and perceptual quality assessment. Extensive experiments on the DeepFashion benchmark dataset have verified the power of the proposed method against state-of-the-art works, with a 12%-14% gain on top-10 retrieval recall, 5% higher joint localization accuracy, and a nearly 40% gain on face identity preservation. Moreover, the evaluation results offer further insights into the subject matter, which could inspire many promising future works along this direction.

* IEEE TIP submission 


CNN2Gate: Toward Designing a General Framework for Implementation of Convolutional Neural Networks on FPGA

Apr 10, 2020
Alireza Ghaffari, Yvon Savaria

Convolutional Neural Networks (CNNs) have a major impact on our society because of the numerous services they provide. On the other hand, they require considerable computing power. To satisfy these requirements, it is possible to use graphics processing units (GPUs). However, high power consumption and limited external I/Os constrain their usability and suitability in industrial and mission-critical scenarios. Recently, the number of studies that utilize FPGAs to implement CNNs has been increasing rapidly, due to the lower power consumption and easy reconfigurability offered by these platforms. Because of the research efforts put into topics such as architecture, synthesis and optimization, new challenges are arising in integrating such hardware solutions with high-level machine learning software libraries. This paper introduces an integrated framework (CNN2Gate) that supports compilation of a CNN model for an FPGA target. CNN2Gate exploits the OpenCL synthesis workflow for FPGAs offered by commercial vendors. CNN2Gate is capable of parsing CNN models from several popular high-level machine learning libraries such as Keras, PyTorch and Caffe2. CNN2Gate extracts the computation flow of the layers, in addition to the weights and biases, and applies a "given" fixed-point quantization. Furthermore, it writes this information in the proper format for OpenCL synthesis tools, which are then used to build and run the project on the FPGA. CNN2Gate performs design-space exploration using a reinforcement learning agent and automatically fits the design on different FPGAs with limited logic resources. This paper reports the results of automatic synthesis and design-space exploration of AlexNet and VGG-16 on various Intel FPGA platforms. CNN2Gate achieves a latency of 205 ms for VGG-16 and 18 ms for AlexNet on the FPGA.
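The "given" fixed-point quantization step can be sketched generically: scale each float weight by 2^frac_bits, round, and clamp to the signed integer range. This is a standard scheme with illustrative bit widths (Q8.8 in a 16-bit word), not necessarily the exact format CNN2Gate emits:

```python
import numpy as np

def to_fixed_point(w, frac_bits=8, total_bits=16):
    # Quantize float weights to signed fixed-point: scale by 2**frac_bits,
    # round to the nearest integer, clamp to the representable range.
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return np.clip(np.round(w * scale), lo, hi).astype(np.int32)

def from_fixed_point(q, frac_bits=8):
    # Recover the (quantized) float values, e.g. for error checking on the host.
    return q.astype(float) / (1 << frac_bits)
```

As long as no clamping occurs, the round trip bounds the per-weight error by 2^-(frac_bits+1).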


On the performance of different excitation-residual blocks for Acoustic Scene Classification

Mar 20, 2020
Javier Naranjo-Alcazar, Sergi Perez-Castanos, Pedro Zuccarello, Maximo Cobos

Acoustic Scene Classification (ASC) is a problem in the field of machine listening whose objective is to classify/tag an audio clip with a predefined label describing a scene location. Interest in this topic has grown so much over the years that an annual international challenge (Detection and Classification of Acoustic Scenes and Events) is held to propose novel solutions. Solutions to these problems often incorporate methods such as data augmentation or ensembles of various models. Although the main line of research in the state of the art usually implements these methods, considerable improvements and state-of-the-art results can also be achieved only by modifying the architecture of convolutional neural networks (CNNs). In this work, we propose two novel squeeze-excitation blocks to improve the accuracy of an ASC framework by modifying the architecture of the residual block in a CNN, together with an analysis of several state-of-the-art blocks. The main idea of squeeze-excitation blocks is to learn spatial and channel-wise feature maps independently instead of jointly, as standard CNNs do. This is done by global grouping operators, linear operators and a final calibration between the input of the block and the relationships obtained by that block. The behavior of the block that implements these operators, and therefore of the entire neural network, can be modified depending on the input to the block, the residual configurations and the non-linear activations, that is, at what point of the block they are performed. The analysis has been carried out using the TAU Urban Acoustic Scenes 2019 dataset presented in the DCASE 2019 edition. All configurations discussed in this document exceed the baseline proposed by the DCASE organization by 13 percentage points. In turn, the novel configurations proposed in this paper exceed the residual configurations proposed in previous works.

* Preprint 
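The squeeze-excitation pattern the proposed blocks build on can be sketched in numpy: squeeze with a global average pool, excite through a small bottleneck, and recalibrate the channels with sigmoid gates. A single-map toy version with hypothetical bottleneck weights `W1`, `W2`; the paper's contribution lies in varying the grouping operators, residual wiring and activation placement around this core:

```python
import numpy as np

def squeeze_excitation(x, W1, W2):
    # x: (C, H, W) feature map; W1: (C, r), W2: (r, C) bottleneck weights.
    z = x.mean(axis=(1, 2))              # squeeze: one statistic per channel
    h = np.maximum(z @ W1, 0.0)          # excitation bottleneck with ReLU
    s = 1.0 / (1.0 + np.exp(-(h @ W2)))  # sigmoid gates in (0, 1), one per channel
    return x * s[:, None, None]          # recalibrate the channels of the input
```

Because each gate lies in (0, 1), the block can only attenuate channels, and which channels it attenuates depends on the input itself — the data-dependent behavior the abstract refers to.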
