This work presents an in-depth analysis of the majority of the deep neural networks (DNNs) proposed in the state of the art for image recognition. For each DNN multiple performance indices are observed, such as recognition accuracy, model complexity, computational complexity, memory usage, and inference time. The behavior of such performance indices and some combinations of them are analyzed and discussed. To measure the indices we experiment the use of DNNs on two different computer architectures, a workstation equipped with a NVIDIA Titan X Pascal and an embedded system based on a NVIDIA Jetson TX1 board. This experimentation allows a direct comparison between DNNs running on machines with very different computational capacity. This study is useful for researchers to have a complete view of what solutions have been explored so far and in which research directions are worth exploring in the future; and for practitioners to select the DNN architecture(s) that better fit the resource constraints of practical deployments and applications. To complete this work, all the DNNs, as well as the software used for the analysis, are available online.
In a previous work, it was shown that there is a curious problem with the benchmark ColorChecker dataset for illuminant estimation. To wit, this dataset has at least 3 different sets of ground-truths. Typically, for a single algorithm a single ground-truth is used. But then different algorithms, whose performance is measured with respect to different ground-truths, are compared against each other and then ranked. This makes no sense. We show in this paper that there are also errors in how each ground-truth set was calculated. As a result, all performance rankings based on the ColorChecker dataset - and there are scores of these - are inaccurate. In this paper, we re-generate a new 'recommended' set of ground-truth based on the calculation methodology described by Shi and Funt. We then review the performance evaluation of a range of illuminant estimation algorithms. Compared with the legacy ground-truths, we find that the difference in how algorithms perform can be large, with many local rankings of algorithms being reversed. Finally, we draw the readers attention to our new 'open' data repository which, we hope, will allow the ColorChecker set to be rehabilitated and once again to become a useful benchmark for illuminant estimation algorithms.
Recent research has widely explored the problem of aesthetics assessment of images with generic content. However, few approaches have been specifically designed to predict the aesthetic quality of images containing human faces, which make up a massive portion of photos in the web. This paper introduces a method for aesthetic quality assessment of images with faces. We exploit three different Convolutional Neural Networks to encode information regarding perceptual quality, global image aesthetics, and facial attributes; then, a model is trained to combine these features to explicitly predict the aesthetics of images containing faces. Experimental results show that our approach outperforms existing methods for both binary, i.e. low/high, and continuous aesthetic score prediction on four different databases in the state-of-the-art.
Given the existence of many change detection algorithms, each with its own peculiarities and strengths, we propose a combination strategy, that we termed IUTIS (In Unity There Is Strength), based on a genetic Programming framework. This combination strategy is aimed at leveraging the strengths of the algorithms and compensate for their weakness. In this paper we show our findings in applying the proposed strategy in two different scenarios. The first scenario is purely performance-based. The second scenario performance and efficiency must be balanced. Results demonstrate that starting from simple algorithms we can achieve comparable results with respect to more complex state-of-the-art change detection algorithms, while keeping the computational complexity affordable for real-time applications.
In this work we present a method to improve the pruning step of the current state-of-the-art methodology to compress neural networks. The novelty of the proposed pruning technique is in its differentiability, which allows pruning to be performed during the backpropagation phase of the network training. This enables an end-to-end learning and strongly reduces the training time. The technique is based on a family of differentiable pruning functions and a new regularizer specifically designed to enforce pruning. The experimental results show that the joint optimization of both the thresholds and the network weights permits to reach a higher compression rate, reducing the number of weights of the pruned network by a further 14% to 33% with respect to the current state-of-the-art. Furthermore, we believe that this is the first study where the generalization capabilities in transfer learning tasks of the features extracted by a pruned network are analyzed. To achieve this goal, we show that the representations learned using the proposed pruning methodology maintain the same effectiveness and generality of those learned by the corresponding non-compressed network on a set of different recognition tasks.
In this paper we propose a method for logo recognition using deep learning. Our recognition pipeline is composed of a logo region proposal followed by a Convolutional Neural Network (CNN) specifically trained for logo classification, even if they are not precisely localized. Experiments are carried out on the FlickrLogos-32 database, and we evaluate the effect on recognition performance of synthetic versus real data augmentation, and image pre-processing. Moreover, we systematically investigate the benefits of different training choices such as class-balancing, sample-weighting and explicit modeling the background class (i.e. no-logo regions). Experimental results confirm the feasibility of the proposed method, that outperforms the methods in the state of the art.
In this work we investigate the use of deep learning for distortion-generic blind image quality assessment. We report on different design choices, ranging from the use of features extracted from pre-trained Convolutional Neural Networks (CNNs) as a generic image description, to the use of features extracted from a CNN fine-tuned for the image quality task. Our best proposal, named DeepBIQ, estimates the image quality by average pooling the scores predicted on multiple sub-regions of the original image. The score of each sub-region is computed using a Support Vector Regression (SVR) machine taking as input features extracted using a CNN fine-tuned for category-based image quality assessment. Experimental results on the LIVE In the Wild Image Quality Challenge Database and on the LIVE Image Quality Assessment Database show that DeepBIQ outperforms the state-of-the-art methods compared, having a Linear Correlation Coefficient (LCC) with human subjective scores of almost 0.91 and 0.98 respectively. Furthermore, in most of the cases, the quality score predictions of DeepBIQ are closer to the average observer than those of a generic human observer.
This paper introduces a new method for face verification across large age gaps and also a dataset containing variations of age in the wild, the Large Age-Gap (LAG) dataset, with images ranging from child/young to adult/old. The proposed method exploits a deep convolutional neural network (DCNN) pre-trained for the face recognition task on a large dataset and then fine-tuned for the large age-gap face verification task. Finetuning is performed in a Siamese architecture using a contrastive loss function. A feature injection layer is introduced to boost verification accuracy, showing the ability of the DCNN to learn a similarity metric leveraging external features. Experimental results on the LAG dataset show that our method is able to outperform the face verification solutions in the state of the art considered.
In this paper we present a method for the estimation of the color of the illuminant in RAW images. The method includes a Convolutional Neural Network that has been specially designed to produce multiple local estimates. A multiple illuminant detector determines whether or not the local outputs of the network must be aggregated into a single estimate. We evaluated our method on standard datasets with single and multiple illuminants, obtaining lower estimation errors with respect to those obtained by other general purpose methods in the state of the art.
In this paper we propose a strategy for semi-supervised image classification that leverages unsupervised representation learning and co-training. The strategy, that is called CURL from Co-trained Unsupervised Representation Learning, iteratively builds two classifiers on two different views of the data. The two views correspond to different representations learned from both labeled and unlabeled data and differ in the fusion scheme used to combine the image features. To assess the performance of our proposal, we conducted several experiments on widely used data sets for scene and object recognition. We considered three scenarios (inductive, transductive and self-taught learning) that differ in the strategy followed to exploit the unlabeled data. As image features we considered a combination of GIST, PHOG, and LBP as well as features extracted from a Convolutional Neural Network. Moreover, two embodiments of CURL are investigated: one using Ensemble Projection as unsupervised representation learning coupled with Logistic Regression, and one based on LapSVM. The results show that CURL clearly outperforms other supervised and semi-supervised learning methods in the state of the art.