We study network pruning which aims to remove redundant channels/kernels and hence speed up the inference of deep networks. Existing pruning methods either train from scratch with sparsity constraints or minimize the reconstruction error between the feature maps of the pre-trained models and the compressed ones. Both strategies suffer from some limitations: the former kind is computationally expensive and difficult to converge, while the latter kind optimizes the reconstruction error but ignores the discriminative power of channels. In this paper, we propose a simple-yet-effective method called discrimination-aware channel pruning (DCP) to choose the channels that actually contribute to the discriminative power. Note that a channel often consists of a set of kernels. Besides the redundancy in channels, some kernels in a channel may also be redundant and fail to contribute to the discriminative power of the network, resulting in kernel level redundancy. To solve this, we propose a discrimination-aware kernel pruning (DKP) method to further compress deep networks by removing redundant kernels. To prevent DCP/DKP from selecting redundant channels/kernels, we propose a new adaptive stopping condition, which helps to automatically determine the number of selected channels/kernels and often results in more compact models with better performance. Extensive experiments on both image classification and face recognition demonstrate the effectiveness of our methods. For example, on ILSVRC-12, the resultant ResNet-50 model with 30% reduction of channels even outperforms the baseline model by 0.36% in terms of Top-1 accuracy. The pruned MobileNetV1 and MobileNetV2 achieve 1.93x and 1.42x inference acceleration on a mobile device, respectively, with negligible performance degradation. The source code and the pre-trained models are available at https://github.com/SCUT-AILab/DCP.
Designing effective architectures is one of the key factors behind the success of deep neural networks. Existing deep architectures are either manually designed or automatically searched by some Neural Architecture Search (NAS) methods. However, even a well-searched architecture may still contain many non-significant or redundant modules or operations (e.g., convolution or pooling), which may not only incur substantial memory consumption and computation cost but also deteriorate the performance. Thus, it is necessary to optimize the operations inside an architecture to improve the performance without introducing extra computation cost. Unfortunately, such a constrained optimization problem is NP-hard. To make the problem feasible, we cast the optimization problem into a Markov decision process (MDP) and seek to learn a Neural Architecture Transformer (NAT) to replace the redundant operations with the more computationally efficient ones (e.g., skip connection or directly removing the connection). Based on MDP, we learn NAT by exploiting reinforcement learning to obtain the optimization policies w.r.t. different architectures. To verify the effectiveness of the proposed strategies, we apply NAT on both hand-crafted architectures and NAS based architectures. Extensive experiments on two benchmark datasets, i.e., CIFAR-10 and ImageNet, demonstrate that the transformed architecture by NAT significantly outperforms both its original form and those architectures optimized by existing methods.
Generating images via the generative adversarial network (GAN) has attracted much attention recently. However, most of the existing GAN-based methods can only produce low-resolution images of limited quality. Directly generating high-resolution images using GANs is nontrivial, and often produces problematic images with incomplete objects. To address this issue, we develop a novel GAN called Auto-Embedding Generative Adversarial Network (AEGAN), which simultaneously encodes the global structure features and captures the fine-grained details. In our network, we use an autoencoder to learn the intrinsic high-level structure of real images and design a novel denoiser network to provide photo-realistic details for the generated images. In the experiments, we are able to produce 512x512 images of promising quality directly from the input noise. The resultant images exhibit better perceptual photo-realism, i.e., with sharper structure and richer details, than other baselines on several datasets, including Oxford-102 Flowers, Caltech-UCSD Birds (CUB), High-Quality Large-scale CelebFaces Attributes (CelebA-HQ), Large-scale Scene Understanding (LSUN) and ImageNet.
Channel pruning is one of the predominant approaches for deep model compression. Existing pruning methods either train from scratch with sparsity constraints on channels, or minimize the reconstruction error between the pre-trained feature maps and the compressed ones. Both strategies suffer from some limitations: the former kind is computationally expensive and difficult to converge, whilst the latter kind optimizes the reconstruction error but ignores the discriminative power of channels. To overcome these drawbacks, we investigate a simple-yet-effective method, called discrimination-aware channel pruning, to choose those channels that really contribute to discriminative power. To this end, we introduce additional losses into the network to increase the discriminative power of intermediate layers and then select the most discriminative channels for each layer by considering the additional loss and the reconstruction error. Last, we propose a greedy algorithm to conduct channel selection and parameter optimization in an iterative way. Extensive experiments demonstrate the effectiveness of our method. For example, on ILSVRC-12, our pruned ResNet-50 with 30% reduction of channels even outperforms the original model by 0.39% in top-1 accuracy.
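The iterative channel selection and parameter optimization described in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: a single linear layer stands in for a convolutional layer, a plain squared reconstruction error stands in for the joint loss (the additional discriminative loss is omitted), and scoring channels by the norm of their gradient is an assumption made for illustration. The function name `greedy_channel_selection` and its parameters are hypothetical.

```python
import numpy as np

def greedy_channel_selection(X, Y, num_keep):
    """Toy greedy selection of input channels for a linear layer.

    X: (n, c) input features, one column per channel.
    Y: (n, k) target feature maps from the unpruned model.
    Assumption: squared reconstruction error replaces the joint loss.
    """
    n, c = X.shape
    W = np.zeros((c, Y.shape[1]))
    selected = []
    taken = np.zeros(c, dtype=bool)
    for _ in range(num_keep):
        # Gradient of 0.5 * mean squared error with respect to W.
        grad = X.T @ (X @ W - Y) / n
        # Score each channel by the norm of its gradient row (assumed criterion).
        scores = np.linalg.norm(grad, axis=1)
        scores[taken] = -np.inf  # never re-select a channel
        best = int(np.argmax(scores))
        selected.append(best)
        taken[best] = True
        # Parameter optimization: refit the weights of the selected channels only.
        sol = np.linalg.lstsq(X[:, selected], Y, rcond=None)[0]
        W = np.zeros_like(W)
        W[selected] = sol
    return sorted(selected), W
```

Alternating between picking one channel and refitting the kept weights mirrors the "selection, then optimization" loop in spirit; a faithful version would add the discriminative loss to the objective and operate on convolutional kernels.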
Deep neural networks have exhibited promising performance in image super-resolution (SR) owing to their ability to learn the non-linear mapping from low-resolution (LR) images to high-resolution (HR) images. However, most deep learning methods employ feed-forward architectures, and thus the dependencies between LR and HR images are not fully exploited, leading to limited learning performance. Moreover, most deep learning based SR methods use the pixel-wise reconstruction error as the loss, which may fail to capture high-frequency information and produce perceptually unsatisfying results, whilst the recent perceptual loss relies on a pre-trained deep model and may not generalize well. In this paper, we introduce a mask that separates the image into low- and high-frequency parts based on image gradient magnitude, and then devise a gradient-sensitive loss that captures the structures in the image without sacrificing the recovery of low-frequency content. Moreover, by investigating the duality in SR, we develop a dual reconstruction network (DRN) to improve the SR performance. We provide a theoretical analysis of the generalization performance of our method and demonstrate its effectiveness and superiority with thorough experiments.
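A minimal sketch of the gradient-magnitude mask and a loss built on it may help make the idea concrete. The abstract does not specify how the mask is thresholded or how the two parts are weighted, so the relative threshold `rel_threshold`, the weights `w_high` and `w_low`, the L1 error, and both function names are assumptions introduced here for illustration only.

```python
import numpy as np

def gradient_mask(hr, rel_threshold=0.5):
    """Binary mask: 1 where the gradient magnitude is high (edges), else 0.

    Assumption: pixels whose gradient magnitude exceeds rel_threshold times
    the maximum magnitude are treated as the high-frequency part.
    """
    gy, gx = np.gradient(hr.astype(float))
    mag = np.hypot(gx, gy)
    return (mag > rel_threshold * mag.max()).astype(float)

def gradient_sensitive_loss(sr, hr, w_high=2.0, w_low=1.0, rel_threshold=0.5):
    """Weighted L1 error: high-frequency pixels get w_high, the rest w_low."""
    mask = gradient_mask(hr, rel_threshold)
    err = np.abs(sr.astype(float) - hr.astype(float))
    weights = w_high * mask + w_low * (1.0 - mask)
    return float(np.mean(weights * err))
```

Under these assumptions, an error on an edge pixel costs more than the same error in a flat region, while flat regions still contribute to the loss, so low-frequency content is not ignored.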
Generative adversarial networks (GANs) aim to generate realistic data from some prior distribution (e.g., Gaussian noise). However, such a prior distribution is often independent of real data and thus may lose semantic information (e.g., geometric structure or content in images) of the data. In practice, the semantic information might be represented by some latent distribution learned from data, which, however, is hard to sample from in GANs. In this paper, rather than sampling from the pre-defined prior distribution, we propose a Local Coordinate Coding (LCC) based sampling method to improve GANs. We derive a generalization bound for LCC based GANs and prove that a low-dimensional input is sufficient to achieve good generalization. Extensive experiments on various real-world datasets demonstrate the effectiveness of the proposed method.
Depth is one of the key factors behind the great success of convolutional neural networks (CNNs), and the vanishing-gradient issue has been largely addressed by various networks, e.g., ResNet. However, when a network becomes very deep, the supervision information from the loss function vanishes along the long backpropagation path, especially for the shallow layers. Intermediate layers therefore receive less supervision information, so the model becomes highly redundant and may over-fit. To address this, we propose a model, called AuxNet, that introduces auxiliary outputs at intermediate layers. Unlike existing approaches, we propose a multi-path training method that propagates not only gradients but also sufficient supervision information from multiple auxiliary outputs. The proposed AuxNet with the multi-path training method gives rise to more compact networks that outperform their very deep equivalents (i.e., ResNet). For example, AuxNet with 44 layers performs better than the ResNet equivalent with 110 layers on several benchmark data sets, i.e., CIFAR-10, CIFAR-100 and SVHN.
Single-pixel cameras based on the concepts of compressed sensing (CS) leverage the inherent structure of images to retrieve them with far fewer measurements and operate efficiently over a significantly broader spectral range than conventional silicon-based cameras. Recently, the photonic time-stretch (PTS) technique has facilitated the emergence of high-speed single-pixel cameras. This significant breakthrough in the imaging speed of single-pixel cameras enables the observation of fast dynamic phenomena. However, according to CS theory, image reconstruction is an iterative process that consumes enormous amounts of computational time and cannot be performed in real time. To address this challenge, we propose a novel single-pixel imaging technique that can produce high-quality images through rapid acquisition of their effective spatial Fourier spectrum. We employ phase-shifting sinusoidal structured illumination instead of random illumination for spectrum acquisition and apply the inverse Fourier transform to the obtained spectrum for image restoration. We evaluate the performance of our prototype system by recognizing quick response (QR) codes and by flow cytometric screening of cells. A frame rate of 625 kHz and a compression ratio of 10% are experimentally demonstrated in accordance with the recognition rate of the QR code. An imaging flow cytometer enabling high-content screening with an unprecedented throughput of 100,000 cells/s is also demonstrated. For real-time imaging applications, the proposed single-pixel microscope can reduce the time required for image reconstruction by two orders of magnitude, and can be widely applied in industrial quality control and label-free biomedical imaging.
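The acquisition principle in the abstract above, sinusoidal illumination followed by an inverse Fourier transform, can be simulated in a few lines. The sketch below assumes the common four-step phase-shifting scheme (phases 0, π/2, π, 3π/2) and full sampling of the spectrum; the abstract does not state the number of phase steps, and the function names and the pattern offset `a` and contrast `b` are hypothetical. Each "measurement" is the single-pixel detector reading, i.e., the sum of the scene multiplied by one pattern.

```python
import numpy as np

def acquire_spectrum(image, a=0.5, b=0.5):
    """Simulate four-step phase-shifting acquisition of the Fourier spectrum.

    Assumption: patterns are a + b*cos(theta + phase) with four phase shifts.
    Returns the complex spectrum in the same layout as np.fft.fft2(image).
    """
    h, w = image.shape
    y, x = np.mgrid[0:h, 0:w]
    phases = (0.0, np.pi / 2, np.pi, 3 * np.pi / 2)
    spectrum = np.zeros((h, w), dtype=complex)
    for v in range(h):
        for u in range(w):
            theta = 2 * np.pi * (u * x / w + v * y / h)
            # One single-pixel reading per phase-shifted pattern.
            d = [np.sum(image * (a + b * np.cos(theta + p))) for p in phases]
            # Differencing cancels the offset a and isolates the coefficient.
            spectrum[v, u] = ((d[0] - d[2]) + 1j * (d[1] - d[3])) / (2 * b)
    return spectrum

def reconstruct(spectrum):
    """Restore the image with a single inverse FFT (no iterative solver)."""
    return np.fft.ifft2(spectrum).real
```

Because reconstruction is one inverse FFT rather than an iterative CS solver, its cost is fixed and small. A compressed acquisition in this sketch would simply skip some `(v, u)` pairs and leave those spectrum entries at zero.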