In this work, we propose the Sparse Multi-Family Deep Scattering Network (SMF-DSN), a novel architecture that exploits the interpretability of the Deep Scattering Network (DSN) while improving its expressive power. The DSN extracts salient and interpretable features from signals by cascading wavelet transforms and complex modulus nonlinearities, and obtains a representation of the data via a translation-invariant operator. First, leveraging the development of highly specialized wavelet filters over the past decades, we propose a multi-family approach to the DSN. In particular, we propose to cross multiple wavelet transforms at each layer of the network, thereby increasing feature diversity and removing the need for an expert to select the appropriate filter. Second, we develop an optimal thresholding strategy suited to the DSN that regularizes the network and controls possible instabilities induced by the signals, such as non-stationary noise. Our systematic and principled solution sparsifies the network's latent representation by acting as a local mask that distinguishes between activity and noise. The SMF-DSN thus enhances the DSN by (i) increasing the diversity of the scattering coefficients and (ii) improving its robustness with respect to non-stationary noise.
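To make the multi-family idea concrete, here is a minimal NumPy sketch of one scattering layer that crosses two filter banks, takes the complex modulus, applies a fixed soft-threshold (standing in for the paper's optimal thresholding strategy), and averages for translation invariance. The Gabor filter banks and the threshold value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gabor_bank(n, n_filters, sigma):
    """Bank of complex band-pass filters in the frequency domain, one per center frequency."""
    freqs = np.fft.fftfreq(n)
    centers = np.linspace(0.05, 0.45, n_filters)
    return np.stack([np.exp(-((freqs - xi) ** 2) / (2 * sigma ** 2)) for xi in centers])

def scattering_layer(x, filter_banks, threshold=0.0):
    """One multi-family scattering layer: wavelet transform -> complex modulus -> soft-threshold.

    `filter_banks` is a list of frequency-domain banks (one per wavelet family); crossing the
    families increases the diversity of the scattering coefficients.
    """
    X = np.fft.fft(x)
    activations = []
    for bank in filter_banks:
        U = np.abs(np.fft.ifft(bank * X, axis=-1))   # complex modulus of each filtered signal
        U = np.maximum(U - threshold, 0.0)           # sparsifying local mask (activity vs. noise)
        activations.append(U)
    return np.concatenate(activations, axis=0)

# toy usage: two "families" (narrow-band and wide-band Gabor filters) on a noisy chirp
t = np.linspace(0, 1, 1024)
x = np.cos(2 * np.pi * (50 * t + 40 * t ** 2)) + 0.3 * np.random.randn(1024)
banks = [gabor_bank(1024, 8, 0.01), gabor_bank(1024, 8, 0.05)]
S1 = scattering_layer(x, banks, threshold=0.1).mean(axis=-1)  # averaged (translation-invariant) coefficients
```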
Kernels derived from deep neural networks (DNNs) in the infinite-width limit provide not only high performance in a range of machine learning tasks but also new theoretical insights into DNN training dynamics and generalization. In this paper, we extend the family of kernels associated with recurrent neural networks (RNNs), which were previously derived only for simple RNNs, to more complex architectures, namely bidirectional RNNs and RNNs with average pooling. We also develop a fast GPU implementation to exploit the full practical potential of these kernels. While RNNs are typically applied only to time-series data, we demonstrate that classifiers using RNN-based kernels outperform a range of baseline methods on 90 non-time-series datasets from the UCI data repository.
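As an illustration of how such a kernel is used for classification once the Gram matrix is available, here is a minimal sketch with scikit-learn's precomputed-kernel SVM. The `rnn_kernel` function is a hypothetical placeholder (an RBF kernel so the snippet runs); a real implementation would evaluate the infinite-width RNN kernel, e.g. on GPU.

```python
import numpy as np
from sklearn.svm import SVC

def rnn_kernel(X1, X2):
    """Placeholder for the RNN-based kernel Gram matrix; here an RBF kernel stands in
    so the example runs end to end."""
    d = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d / X1.shape[1])

# non-time-series data treated as fixed-length feature vectors
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 20)), rng.integers(0, 2, 100)
X_test = rng.normal(size=(30, 20))

K_train = rnn_kernel(X_train, X_train)        # Gram matrix between training points
K_test = rnn_kernel(X_test, X_train)          # cross Gram matrix for prediction
clf = SVC(kernel="precomputed").fit(K_train, y_train)
y_pred = clf.predict(K_test)
```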
High dimensionality poses many challenges to the use of data, from visualization and interpretation to prediction and storage for historical preservation. Techniques abound to reduce the dimensionality of fixed-length sequences, yet these methods rarely generalize to variable-length sequences. To address this gap, we extend existing kernel-based methods to variable-length sequences via the Recurrent Neural Tangent Kernel (RNTK). Since a deep neural network with ReLU activation is a Max-Affine Spline Operator (MASO), we dub our approach the Max-Affine Spline Kernel (MASK). We demonstrate how MASK can be used to extend principal components analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE), and we apply these new algorithms to separate synthetic time-series data sampled from second-order differential equations.
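A minimal sketch of the kernelized-PCA step, assuming only a precomputed Gram matrix: the dimensionality reduction depends on the data solely through K, so swapping in a kernel defined on pairs of variable-length sequences (the RNTK in the paper; a toy prefix-dot-product kernel here) extends PCA to such data.

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Kernel PCA from a precomputed Gram matrix K (the RNTK would play this role)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = H @ K @ H                               # double-center the Gram matrix
    eigval, eigvec = np.linalg.eigh(Kc)
    order = np.argsort(eigval)[::-1][:n_components]
    # project onto the leading eigenvectors, scaled by the square roots of the eigenvalues
    return eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0.0))

# sequences of different lengths: any kernel defined on pairs of such sequences works
seqs = [np.sin(np.linspace(0, 2 * np.pi * f, l)) for f, l in [(1, 50), (1, 80), (3, 60), (3, 90)]]

def toy_seq_kernel(a, b):                        # illustrative stand-in for the RNTK
    m = min(len(a), len(b))
    return float(a[:m] @ b[:m]) / m

K = np.array([[toy_seq_kernel(a, b) for b in seqs] for a in seqs])
embedding = kernel_pca(K, n_components=2)        # 4 x 2 low-dimensional coordinates
```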
Deep Autoencoders (AEs) provide a versatile framework to learn a compressed, interpretable, or structured representation of data. As such, AEs have been used extensively for denoising, compression, and data completion, as well as for pre-training Deep Networks (DNs) for tasks such as classification. By providing a careful analysis of current AEs from a spline perspective, we can interpret the input-output mapping, which in turn allows us to derive conditions for generalization and reconstruction guarantees. By assuming a Lie group structure on the data at hand, we derive a novel regularization of AEs that, for the first time, ensures the generalization of AEs in the finite training set case. We validate our theoretical analysis by demonstrating how this regularization significantly increases the generalization of AEs on various datasets.
Most current computer vision datasets are composed of disconnected sets, such as images from different classes. We prove that distributions of this type of data cannot be represented with a continuous generative network without error. They can be represented in two ways: with an ensemble of networks or with a single network whose latent space is truncated. We show that ensembles are more desirable than truncated distributions in practice. We construct a regularized optimization problem that establishes the relationship between a single continuous GAN, an ensemble of GANs, conditional GANs, and Gaussian Mixture GANs. This regularization can be computed efficiently, and we show empirically that our framework has a performance sweet spot that can be found with hyperparameter tuning. This ensemble framework achieves better performance than a single continuous GAN or cGAN while maintaining fewer total parameters.
The study of deep networks (DNs) in the infinite-width limit, via the so-called Neural Tangent Kernel (NTK) approach, has provided new insights into the dynamics of learning, generalization, and the impact of initialization. One key DN architecture remains to be kernelized, namely, the Recurrent Neural Network (RNN). In this paper we introduce and study the Recurrent Neural Tangent Kernel (RNTK), which provides new insights into the behavior of overparametrized RNNs, including how different time steps are weighted by the RNTK to form the output under different initialization parameters and nonlinearity choices, and how inputs of different lengths are treated. We demonstrate via a number of experiments that the RNTK offers significant performance gains over other kernels, including standard NTKs, across a range of data sets. A unique benefit of the RNTK is that it is agnostic to the length of the input, in stark contrast to other kernels.
Deep Generative Networks (DGNs) with probabilistic modeling of their output and latent space are currently trained via Variational Autoencoders (VAEs). In the absence of a known analytical form for the posterior and likelihood expectation, VAEs resort to approximations, including (Amortized) Variational Inference (AVI) and Monte-Carlo (MC) sampling. We exploit the Continuous Piecewise Affine (CPA) property of modern DGNs to derive their posterior and marginal distributions as well as the latter's first moments. These findings enable us to derive an analytical Expectation-Maximization (EM) algorithm that allows gradient-free DGN learning. We demonstrate empirically that EM training of DGNs produces higher likelihoods than VAE training. Our findings will guide the design of new VAE AVI schemes that better approximate the true posterior, and they open avenues to apply standard statistical tools for model comparison, anomaly detection, and missing data imputation.
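The building block that makes such closed-form inference possible is that, within one region of a CPA generator, the map is exactly affine, and an affine map with Gaussian prior and Gaussian output noise has an exact Gaussian posterior. The sketch below computes those posterior moments for a single affine piece; it is an illustration of this ingredient under the stated Gaussian assumptions, not the paper's full EM algorithm.

```python
import numpy as np

def affine_gaussian_posterior(x, A, b, sigma2=1.0):
    """Exact posterior p(z | x) for x = A z + b + eps, with z ~ N(0, I) and eps ~ N(0, sigma2 I).

    Within one region of a continuous piecewise-affine DGN the generator is exactly such an
    affine map, so these moments are the kind of quantity an analytical E-step relies on.
    """
    d = A.shape[1]
    precision = np.eye(d) + A.T @ A / sigma2      # posterior precision matrix
    cov = np.linalg.inv(precision)                # posterior covariance
    mean = cov @ A.T @ (x - b) / sigma2           # posterior mean
    return mean, cov

# toy usage with a 2-dimensional latent space and a 5-dimensional output
rng = np.random.default_rng(0)
A, b = rng.normal(size=(5, 2)), rng.normal(size=5)
x = A @ rng.normal(size=2) + b + 0.1 * rng.normal(size=5)
mu_post, cov_post = affine_gaussian_posterior(x, A, b, sigma2=0.01)
```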
We develop an interpretable and learnable Wigner-Ville distribution that produces a super-resolved quadratic signal representation for time-series analysis. Our approach has two main hallmarks. First, it interpolates between known time-frequency representations (TFRs) in that it can reach super-resolution, with time and frequency resolution beyond what the Heisenberg uncertainty principle prescribes and thus beyond commonly employed TFRs. Second, it is interpretable thanks to an explicit low-dimensional and physical parameterization of the Wigner-Ville distribution. We demonstrate that our approach is able to learn highly adapted TFRs and can tackle various large-scale classification tasks, where we reach state-of-the-art performance compared to baseline and learned TFRs.
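For reference, the fixed (non-learnable) quadratic representation that this work builds on is the Wigner-Ville distribution; a minimal discrete pseudo-WVD sketch follows. It forms the instantaneous autocorrelation at each time and Fourier transforms over the lag, and is only the classical starting point, not the learnable, parameterized version proposed in the paper.

```python
import numpy as np

def pseudo_wvd(x, window=128):
    """Discrete pseudo Wigner-Ville distribution of a (possibly analytic) signal x."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    half = window // 2
    W = np.zeros((window, N))
    for n in range(N):
        # instantaneous autocorrelation r[m] = x[n+m] * conj(x[n-m]) over valid lags
        taumax = min(n, N - 1 - n, half - 1)
        taus = np.arange(-taumax, taumax + 1)
        r = np.zeros(window, dtype=complex)
        r[taus % window] = x[n + taus] * np.conj(x[n - taus])
        # Fourier transform over the lag variable gives the frequency axis
        # (up to the usual factor-of-two frequency scaling of the WVD)
        W[:, n] = np.real(np.fft.fft(r))
    return W

# toy usage on an analytic linear chirp
t = np.linspace(0, 1, 256)
x = np.exp(2j * np.pi * (30 * t + 40 * t ** 2))
W = pseudo_wvd(x, window=64)   # 64 frequency bins x 256 time samples
```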
SymJAX is a symbolic programming version of JAX that simplifies graph input/output/updates and provides additional functionalities for general machine learning and deep learning applications. From a user perspective, SymJAX provides an à la Theano experience with fast graph optimization/compilation and broad hardware support, along with Lasagne-like deep learning functionalities.
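A minimal usage sketch of the Theano-like workflow, written against my understanding of the SymJAX README (symjax.tensor, Variable, gradients, function); the exact names and signatures are an assumption and may differ across versions.

```python
# Assumed SymJAX API: build a symbolic graph, take gradients symbolically,
# then compile a function that evaluates the cost and applies an update.
import symjax as sj
import symjax.tensor as T

mu = T.Variable(T.random.normal((), seed=1))     # variable to optimize
cost = T.exp(-(mu - 1) ** 2)                     # symbolic cost, built as in Theano
g = sj.gradients(cost, [mu])[0]                  # gradient is itself a symbolic tensor

f = sj.function(outputs=cost, updates={mu: mu - 0.2 * g})  # compile graph + update rule
for _ in range(10):
    f()                                          # each call evaluates cost and updates mu
```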
We connect a large class of Generative Deep Networks (GDNs) with spline operators in order to derive their properties, limitations, and new opportunities. By characterizing the latent space partition and the dimension and angularity of the generated manifold, we relate the manifold dimension and approximation error to the sample size. The per-region affine subspace of the manifold defines a local coordinate basis; we provide necessary and sufficient conditions relating those basis vectors to disentanglement. We also derive the output probability density mapped onto the generated manifold in terms of the latent space density, which enables the computation of key statistics such as its Shannon entropy. This finding also enables the computation of the GDN likelihood, which provides a new mechanism for model comparison as well as a quality measure for (generated) samples under the learned distribution. We demonstrate how low-entropy and/or multimodal distributions are not naturally modeled by GDNs and are a cause of training instabilities.
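For concreteness, the density result rests on the change-of-variables formula for injective affine maps; the notation $A_\omega$, $b_\omega$ below is introduced here for illustration and is not taken from the abstract. On a latent-space region $\omega$ where the generator acts as $g(z) = A_\omega z + b_\omega$ with $A_\omega$ of full column rank,
\[
p_G\big(g(z)\big) \;=\; p_z(z)\,\det\!\big(A_\omega^\top A_\omega\big)^{-1/2}, \qquad z \in \omega,
\]
so the density on the generated manifold, and hence quantities such as its Shannon entropy and the GDN likelihood, follow from the latent density and the per-region slope matrices.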