We propose a new way of thinking about deep neural networks, in which the linear and non-linear components of the network are naturally derived and justified in terms of principles in probability theory. In particular, the models constructed in our framework assign probabilities to uncertain realizations, leading to Kullback-Leibler Divergence (KLD) as the linear layer. In our model construction, we also arrive at a structure similar to ReLU activation supported with Bayes' theorem. The non-linearities in our framework are normalization layers with ReLU and Sigmoid as element-wise approximations. Additionally, the pooling function is derived as a marginalization of spatial random variables according to the mechanics of the framework. As such, Max Pooling is an approximation to the aforementioned marginalization process. Since our models are comprised of finite state distributions (FSD) as variables and parameters, exact computation of information-theoretic quantities such as entropy and KLD is possible, thereby providing more objective measures to analyze networks. Unlike existing designs that rely on heuristics, the proposed framework restricts subjective interpretations of CNNs and sheds light on the functionality of neural networks from a completely new perspective.
Discovering and clustering subspaces in high-dimensional data is a fundamental problem of machine learning with a wide range of applications in data mining, computer vision, and pattern recognition. Earlier methods divided the problem into two separate stages of finding the similarity matrix and finding clusters. Similar to some recent works, we integrate these two steps using a joint optimization approach. We make the following contributions: (i) we estimate the reliability of the cluster assignment for each point before assigning a point to a subspace. We group the data points into two groups of "certain" and "uncertain", with the assignment of latter group delayed until their subspace association certainty improves. (ii) We demonstrate that delayed association is better suited for clustering subspaces that have ambiguities, i.e. when subspaces intersect or data are contaminated with outliers/noise. (iii) We demonstrate experimentally that such delayed probabilistic association leads to a more accurate self-representation and final clusters. The proposed method has higher accuracy both for points that exclusively lie in one subspace, and those that are on the intersection of subspaces. (iv) We show that delayed association leads to huge reduction of computational cost, since it allows for incremental spectral clustering.
We propose a new deep learning approach for automatic detection and segmentation of fluid within retinal OCT images. The proposed framework utilizes both ResNet and Encoder-Decoder neural network architectures. When training the network, we apply a novel data augmentation method called myopic warping together with standard rotation-based augmentation to increase the training set size to 45 times the original amount. Finally, the network output is post-processed with an energy minimization algorithm (graph cut) along with a few other knowledge guided morphological operations to finalize the segmentation process. Based on OCT imaging data and its ground truth from the RETOUCH challenge, the proposed system achieves dice indices of 0.522, 0.682, and 0.612, and average absolute volume differences of 0.285, 0.115, and 0.156 mm$^3$ for intaretinal fluid, subretinal fluid, and pigment epithelial detachment respectively.
Self-similarity was recently introduced as a measure of inter-class congruence for classification of actions. Herein, we investigate the dual problem of intra-class dissimilarity for classification of action styles. We introduce self-dissimilarity matrices that discriminate between same actions performed by different subjects regardless of viewing direction and camera parameters. We investigate two frameworks using these invariant style dissimilarity measures based on Principal Component Analysis (PCA) and Fisher Discriminant Analysis (FDA). Extensive experiments performed on IXMAS dataset indicate remarkably good discriminant characteristics for the proposed invariant measures for gender recognition from video data.
In this paper, we show that different body parts do not play equally important roles in recognizing a human action in video data. We investigate to what extent a body part plays a role in recognition of different actions and hence propose a generic method of assigning weights to different body points. The approach is inspired by the strong evidence in the applied perception community that humans perform recognition in a foveated manner, that is they recognize events or objects by only focusing on visually significant aspects. An important contribution of our method is that the computation of the weights assigned to body parts is invariant to viewing directions and camera parameters in the input data. We have performed extensive experiments to validate the proposed approach and demonstrate its significance. In particular, results show that considerable improvement in performance is gained by taking into account the relative importance of different body parts as defined by our approach.
This paper presents a new approach for tackling the shift-invariance problem in the discrete Haar domain, without trading off any of its desirable properties, such as compression, separability, orthogonality, and symmetry. The paper presents several key theoretical contributions. First, we derive closed form expressions for phase shifting in the Haar domain both in partially decimated and fully decimated transforms. Second, it is shown that the wavelet coefficients of the shifted signal can be computed solely by using the coefficients of the original transformed signal. Third, we derive closed-form expressions for non-integer shifts, which have not been previously reported in the literature. Fourth, we establish the complexity of the proposed phase shifting approach using the derived analytic expressions. As an application example of these results, we apply the new formulae to image rotation and interpolation, and evaluate its performance against standard methods.
This paper focuses on real-time all-frequency image-based rendering using an innovative solution for run-time computation of light transport. The approach is based on new results derived for non-linear phase shifting in the Haar wavelet domain. Although image-based methods for real-time rendering of dynamic glossy objects have been proposed, they do not truly scale to all possible frequencies and high sampling rates without trading storage, glossiness, or computational time, while varying both lighting and viewpoint. This is due to the fact that current approaches are limited to precomputed radiance transfer (PRT), which is prohibitively expensive in terms of memory requirements and real-time rendering when both varying light and viewpoint changes are required together with high sampling rates for high frequency lighting of glossy material. On the other hand, current methods cannot handle object rotation, which is one of the paramount issues for all PRT methods using wavelets. This latter problem arises because the precomputed data are defined in a global coordinate system and encoded in the wavelet domain, while the object is rotated in a local coordinate system. At the root of all the above problems is the lack of efficient run-time solution to the nontrivial problem of rotating wavelets (a non-linear phase-shift), which we solve in this paper.
Image search and retrieval engines rely heavily on textual annotation in order to match word queries to a set of candidate images. A system that can automatically annotate images with meaningful text can be highly beneficial for such engines. Currently, the approaches to develop such systems try to establish relationships between keywords and visual features of images. In this paper, We make three main contributions to this area: (i) We transform this problem from the low-level keyword space to the high-level semantics space that we refer to as the "{\em image theme}", (ii) Instead of treating each possible keyword independently, we use latent Dirichlet allocation to learn image themes from the associated texts in a training phase. Images are then annotated with image themes rather than keywords, using a modified continuous relevance model, which takes into account the spatial coherence and the visual continuity among images of common theme. (iii) To achieve more coherent annotations among images of common theme, we have integrated ConceptNet in learning the semantics of images, and hence augment image descriptions beyond annotations provided by humans. Images are thus further annotated by a few most significant words of the prominent image theme. Our extensive experiments show that a coherent theme-based image annotation using high-level semantics results in improved precision and recall as compared with equivalent classical keyword annotation systems.
In this paper, we present a new mathematical foundation for image-based lighting. Using a simple manipulation of the local coordinate system, we derive a closed-form solution to the light integral equation under distant environment illumination. We derive our solution for different BRDF's such as lambertian and Phong-like. The method is free of noise, and provides the possibility of using the full spectrum of frequencies captured by images taken from the environment. This allows for the color of the rendered object to be toned according to the color of the light in the environment. Experimental results also show that one can gain an order of magnitude or higher in rendering time compared to Monte Carlo quadrature methods and spherical harmonics.
Most multispectral remote sensors (e.g. QuickBird, IKONOS, and Landsat 7 ETM+) provide low-spatial high-spectral resolution multispectral (MS) or high-spatial low-spectral resolution panchromatic (PAN) images, separately. In order to reconstruct a high-spatial/high-spectral resolution multispectral image volume, either the information in MS and PAN images are fused (i.e. pansharpening) or super-resolution reconstruction (SRR) is used with only MS images captured on different dates. Existing methods do not utilize temporal information of MS and high spatial resolution of PAN images together to improve the resolution. In this paper, we propose a multiframe SRR algorithm using pansharpened MS images, taking advantage of both temporal and spatial information available in multispectral imagery, in order to exceed spatial resolution of given PAN images. We first apply pansharpening to a set of multispectral images and their corresponding PAN images captured on different dates. Then, we use the pansharpened multispectral images as input to the proposed wavelet-based multiframe SRR method to yield full volumetric SRR. The proposed SRR method is obtained by deriving the subband relations between multitemporal MS volumes. We demonstrate the results on Landsat 7 ETM+ images comparing our method to conventional techniques.