Yuval Kluger

Exponential weight averaging as damped harmonic motion

Oct 20, 2023
Jonathan Patsenker, Henry Li, Yuval Kluger

The exponential moving average (EMA) is a commonly used statistic for providing stable estimates of stochastic quantities in deep learning optimization. Recently, EMA has seen considerable use in generative models, where it is computed with respect to the model weights and significantly improves the stability of the inference model both during and after training. While the practice of weight averaging at the end of training is well-studied and known to improve estimates of local optima, the benefits of EMA over the course of training are less understood. In this paper, we derive an explicit connection between EMA and a damped harmonic system between two particles, where one particle (the EMA weights) is drawn to the other (the model weights) via an idealized zero-length spring. We then leverage this physical analogy to analyze the effectiveness of EMA and propose an improved training algorithm, which we call BELAY. Finally, we demonstrate theoretically and empirically several advantages enjoyed by BELAY over standard EMA.

* 10 pages, 7 figures. ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems. 2023 
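For concreteness, below is the standard EMA weight update that the paper analyzes, as a minimal PyTorch-style sketch; the BELAY algorithm itself is not reproduced here, and the decay value is illustrative.

```python
import torch

@torch.no_grad()
def update_ema(ema_params, model_params, decay=0.999):
    # Standard EMA update: ema <- decay * ema + (1 - decay) * model.
    # In the paper's analogy, the EMA weights behave like a particle tethered
    # to the model weights by an idealized, damped, zero-length spring.
    for p_ema, p in zip(ema_params, model_params):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```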

Multi-modal Differentiable Unsupervised Feature Selection

Mar 16, 2023
Junchen Yang, Ofir Lindenbaum, Yuval Kluger, Ariel Jaffe

Multi-modal high-throughput biological data presents a great scientific opportunity and a significant computational challenge. In multi-modal measurements, every sample is observed simultaneously by two or more sets of sensors. In such settings, many observed variables in both modalities are often nuisance variables that carry no information about the phenomenon of interest. Here, we propose a multi-modal unsupervised feature selection framework: identifying informative variables based on coupled high-dimensional measurements. Our method is designed to identify features associated with two types of latent low-dimensional structures: (i) shared structures that govern the observations in both modalities and (ii) differential structures that appear in only one modality. To that end, we propose two Laplacian-based scoring operators. We combine the scores with differentiable gates that mask nuisance features and enhance the accuracy of the structure captured by the graph Laplacian. The performance of the new scheme is illustrated using synthetic and real datasets, including an extended biological application to single-cell multi-omics.
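As background, the classical single-modality Laplacian score that the proposed operators generalize can be sketched as follows; the paper's multi-modal scoring operators and differentiable gates are not reproduced here, and the RBF bandwidth is illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def laplacian_score(X, sigma=1.0):
    """Classical Laplacian score per feature (lower = more consistent with the
    graph structure). X is an (n_samples, n_features) matrix."""
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma ** 2))  # RBF affinity
    D = np.diag(W.sum(axis=1))
    L = D - W                                                   # graph Laplacian
    Xc = X - X.mean(axis=0)                                     # center each feature
    num = np.einsum("if,ij,jf->f", Xc, L, Xc)                   # f^T L f per feature
    den = np.einsum("if,ij,jf->f", Xc, D, Xc)                   # f^T D f per feature
    return num / (den + 1e-12)
```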

Autoregressive Generative Modeling with Noise Conditional Maximum Likelihood Estimation

Oct 19, 2022
Henry Li, Yuval Kluger

We introduce a simple modification to the standard maximum likelihood estimation (MLE) framework. Rather than maximizing a single unconditional likelihood of the data under the model, we maximize a family of \textit{noise conditional} likelihoods consisting of the data perturbed by a continuum of noise levels. We find that models trained this way are more robust to noise, obtain higher test likelihoods, and generate higher quality images. They can also be sampled from via a novel score-based sampling scheme which combats the classical \textit{covariate shift} problem that occurs during sample generation in autoregressive models. Applying this augmentation to autoregressive image models, we obtain 3.32 bits per dimension on the ImageNet 64x64 dataset, and substantially improve the quality of generated samples in terms of the Fréchet Inception Distance (FID) -- from 37.50 to 12.09 on the CIFAR-10 dataset.

* 18 pages, 10 figures, 2 tables 
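A minimal sketch of the training objective as described: each batch is perturbed at a randomly drawn noise level, and the model is asked for the likelihood conditioned on that level. The `model.log_prob(x, sigma)` interface and the noise-level range are assumptions made for illustration, not the paper's API.

```python
import torch

def noise_conditional_nll(model, x, sigma_min=0.01, sigma_max=1.0):
    """One step of a noise conditional MLE objective (sketch)."""
    # Draw one noise level per sample from a continuum of levels.
    sigma = torch.empty(x.shape[0], device=x.device).uniform_(sigma_min, sigma_max)
    sigma_b = sigma.view(-1, *([1] * (x.dim() - 1)))   # broadcastable shape
    x_noisy = x + sigma_b * torch.randn_like(x)        # perturb the data
    # Maximize the likelihood of the perturbed data, conditioned on sigma.
    return -model.log_prob(x_noisy, sigma).mean()      # hypothetical model interface
```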

ManiFeSt: Manifold-based Feature Selection for Small Data Sets

Jul 18, 2022
David Cohen, Tal Shnitzer, Yuval Kluger, Ronen Talmon

In this paper, we present a new method for few-sample supervised feature selection (FS). Our method first learns the manifold of the feature space of each class using kernels that capture multi-feature associations. Then, based on Riemannian geometry, a composite kernel is computed, extracting the differences between the learned feature associations. Finally, an FS score based on spectral analysis is proposed. Considering multi-feature associations makes our method multivariate by design. This, in turn, allows for the extraction of the hidden manifold underlying the features and avoids overfitting, facilitating few-sample FS. We showcase the efficacy of our method on illustrative examples and several benchmarks, where it demonstrates higher accuracy in selecting informative features than competing methods. In addition, we show that our FS leads to improved classification and better generalization when applied to test data.

* 22 pages, 10 figures 
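One simplistic reading of the pipeline, sketched below under loose assumptions: per-class feature-association kernels are compared via their matrix logarithms (a log-Euclidean stand-in for the Riemannian comparison), and features are scored by their weight in the leading eigenvectors of the difference. The paper's actual composite kernel and FS score differ.

```python
import numpy as np

def manifest_style_score(X1, X2, reg=1e-3, k=3):
    """Toy feature-selection score built from two classes' feature-association
    kernels. X1, X2 are (n_samples, n_features) matrices for the two classes."""
    def feature_kernel(X):
        C = np.corrcoef(X, rowvar=False)          # feature-by-feature associations
        return C + reg * np.eye(C.shape[0])       # regularize to keep it positive definite
    def sym_logm(K):                              # matrix log via eigendecomposition
        vals, vecs = np.linalg.eigh(K)
        return (vecs * np.log(vals)) @ vecs.T
    diff = sym_logm(feature_kernel(X1)) - sym_logm(feature_kernel(X2))
    vals, vecs = np.linalg.eigh(diff)
    lead = vecs[:, np.argsort(np.abs(vals))[-k:]] # eigenvectors with largest |eigenvalue|
    return (lead ** 2).sum(axis=1)                # per-feature score
```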

Neural Inverse Transform Sampler

Jun 22, 2022
Henry Li, Yuval Kluger

Any explicit functional representation $f$ of a density is hampered by two main obstacles when we wish to use it as a generative model: designing $f$ so that sampling is fast, and estimating $Z = \int f$ so that $Z^{-1}f$ integrates to 1. This becomes increasingly complicated as $f$ itself becomes complicated. In this paper, we show that when modeling one-dimensional conditional densities with a neural network, $Z$ can be exactly and efficiently computed by letting the network represent the cumulative distribution function of a target density, and applying a generalized fundamental theorem of calculus. We also derive a fast algorithm for sampling from the resulting representation by the inverse transform method. By extending these principles to higher dimensions, we introduce the \textbf{Neural Inverse Transform Sampler (NITS)}, a novel deep learning framework for modeling and sampling from general, multidimensional, compactly-supported probability densities. NITS is a highly expressive density estimator that boasts end-to-end differentiability, fast sampling, and exact and cheap likelihood evaluation. We demonstrate the applicability of NITS by applying it to realistic, high-dimensional density estimation tasks: likelihood-based generative modeling on the CIFAR-10 dataset, and density estimation on the UCI suite of benchmark datasets, where NITS produces compelling results rivaling or surpassing the state of the art.

* 13 pages, 3 figures 
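A one-dimensional sketch of the two core ideas: represent the CDF with a network, recover the density by differentiating it, and sample by numerically inverting it. Monotonicity of the network is not enforced in this sketch, and the actual NITS construction (and its multidimensional extension) is more involved.

```python
import torch

class CDFNet(torch.nn.Module):
    """Network representing a CDF F(x) on [0, 1] (monotonicity not enforced here)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(1, hidden), torch.nn.Tanh(), torch.nn.Linear(hidden, 1)
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))          # squash outputs into (0, 1)

def density(cdf, x):
    """p(x) = dF/dx via automatic differentiation, so no normalizing constant is needed."""
    x = x.clone().requires_grad_(True)
    return torch.autograd.grad(cdf(x).sum(), x, create_graph=True)[0]

def sample(cdf, n, iters=40):
    """Inverse transform sampling: solve F(x) = u by bisection on [0, 1]."""
    u = torch.rand(n, 1)
    lo, hi = torch.zeros(n, 1), torch.ones(n, 1)
    for _ in range(iters):
        mid = (lo + hi) / 2
        go_right = cdf(mid) < u                    # root lies to the right of mid
        lo = torch.where(go_right, mid, lo)
        hi = torch.where(go_right, hi, mid)
    return (lo + hi) / 2
```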

Deep Unsupervised Feature Selection by Discarding Nuisance and Correlated Features

Oct 11, 2021
Uri Shaham, Ofir Lindenbaum, Jonathan Svirsky, Yuval Kluger

Modern datasets often contain large subsets of correlated features and nuisance features, which are unrelated or only loosely related to the main underlying structures of the data. Nuisance features can be identified using the Laplacian score criterion, which evaluates the importance of a given feature via its consistency with the graph Laplacian's leading eigenvectors. We demonstrate that in the presence of a large number of nuisance features, the Laplacian must be computed on the subset of selected features rather than on the complete feature set. To this end, we propose a fully differentiable approach for unsupervised feature selection, utilizing the Laplacian score criterion to avoid selecting nuisance features. We employ an autoencoder architecture to cope with correlated features, trained to reconstruct the data from the subset of selected features. We build on the recently proposed concrete layer, which allows controlling the number of selected features via the architectural design, simplifying the optimization process. Experimenting on several real-world datasets, we demonstrate that our proposed approach outperforms similar approaches designed to avoid only correlated or only nuisance features, but not both. Several state-of-the-art clustering results are reported.
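For illustration, here is a common form of the concrete selector layer mentioned in the abstract (a relaxed one-hot choice over the input features per output unit); the Laplacian-score term and the reconstruction decoder from the paper sit on top of such a layer and are not shown.

```python
import torch

class ConcreteSelect(torch.nn.Module):
    """Concrete selector layer: each of k outputs holds a relaxed one-hot
    distribution over the d input features (Gumbel-softmax at training time,
    hard argmax selection at evaluation time)."""
    def __init__(self, d, k, temperature=1.0):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(k, d))
        self.temperature = temperature

    def forward(self, x):                           # x: (batch, d)
        if self.training:
            gumbel = -torch.log(-torch.log(torch.rand_like(self.logits) + 1e-20) + 1e-20)
            w = torch.softmax((self.logits + gumbel) / self.temperature, dim=-1)
        else:
            w = torch.nn.functional.one_hot(self.logits.argmax(dim=-1),
                                            num_classes=x.shape[-1]).float()
        return x @ w.t()                            # (batch, k) selected features
```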

Probabilistic Robust Autoencoders for Anomaly Detection

Oct 01, 2021
Yariv Aizenbud, Ofir Lindenbaum, Yuval Kluger

Empirical observations often contain anomalies (or outliers) that contaminate the data. Accurate identification of anomalous samples is crucial for the success of downstream data analysis tasks. To automatically identify anomalies, we propose a new type of autoencoder (AE), which we term the Probabilistic Robust Autoencoder (PRAE). PRAE is designed to simultaneously remove outliers and identify a low-dimensional representation for the inlier samples. We first describe the Robust AE (RAE) as a model that aims to split the data into inlier samples, from which a low-dimensional representation is learned via an AE, and anomalous (outlier) samples, which are excluded because they do not fit the low-dimensional representation. The RAE minimizes the reconstruction error of the AE while attempting to include as many observations as possible. This can be realized by subtracting from the reconstruction term an $\ell_0$ norm counting the number of selected observations. Since the $\ell_0$ norm is not differentiable, we propose two probabilistic relaxations of the RAE approach and demonstrate that they can effectively identify anomalies. We prove that the solution to PRAE is equivalent to the solution of RAE, and demonstrate using extensive simulations that PRAE is on par with state-of-the-art methods for anomaly detection.
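A loose sketch of the gated objective described above: per-sample gates in [0, 1] serve as a relaxation of the $\ell_0$ counting term, downweighting suspected outliers in the reconstruction loss while being pushed toward including as many samples as possible. The paper's two probabilistic relaxations are not reproduced here, and the penalty weight is illustrative.

```python
import torch

def prae_style_loss(x, x_hat, gates, lam=1.0):
    """x, x_hat: (batch, d) inputs and AE reconstructions; gates: (batch,) in [0, 1]."""
    per_sample_err = ((x - x_hat) ** 2).mean(dim=1)  # reconstruction error per sample
    recon = (gates * per_sample_err).mean()          # outliers (gate ~ 0) are excluded
    keep_penalty = lam * (1.0 - gates).mean()        # relaxed stand-in for the l0 term
    return recon + keep_penalty
```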

Locally Sparse Networks for Interpretable Predictions

Jun 11, 2021
Junchen Yang, Ofir Lindenbaum, Yuval Kluger

Despite the enormous success of neural networks, they are still hard to interpret and often overfit when applied to low-sample-size (LSS) datasets. To tackle these obstacles, we propose a framework for training locally sparse neural networks, where the local sparsity is learned via a sample-specific gating mechanism that identifies the subset of most relevant features for each measurement. The sample-specific sparsity is predicted via a \textit{gating} network, which is trained in tandem with the \textit{prediction} network. By learning these feature subsets together with the weights of the prediction model, we obtain an interpretable neural network that can handle LSS data and remove nuisance variables that are irrelevant for the supervised learning task. Using both synthetic and real-world datasets, we demonstrate that our method outperforms state-of-the-art models when predicting the target function with far fewer features per instance.
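A minimal sketch of the two-network design described above: a gating network predicts a per-sample feature mask that multiplies the input to the prediction network. The paper uses stochastic gates with a sparsity penalty rather than the plain sigmoid gates shown here.

```python
import torch

class LocallySparseNet(torch.nn.Module):
    def __init__(self, d, hidden=32, out=1):
        super().__init__()
        self.gating_net = torch.nn.Sequential(       # predicts per-sample gates
            torch.nn.Linear(d, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, d), torch.nn.Sigmoid(),
        )
        self.prediction_net = torch.nn.Sequential(   # consumes the gated features
            torch.nn.Linear(d, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, out),
        )

    def forward(self, x):                            # x: (batch, d)
        gates = self.gating_net(x)                   # sample-specific feature mask
        return self.prediction_net(x * gates), gates
```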

Spectral Top-Down Recovery of Latent Tree Models

Feb 26, 2021
Yariv Aizenbud, Ariel Jaffe, Meng Wang, Amber Hu, Noah Amsel, Boaz Nadler, Joseph T. Chang, Yuval Kluger

Modeling the distribution of high-dimensional data by a latent tree graphical model is a common approach in multiple scientific domains. A common task is to infer the underlying tree structure given only observations of the terminal nodes. Many algorithms for tree recovery are computationally intensive, which limits their applicability to trees of moderate size. For large trees, a common approach, termed divide-and-conquer, is to recover the tree structure in two steps. First, the structure is recovered separately for multiple randomly selected subsets of the terminal nodes. Second, the resulting subtrees are merged to form a full tree. Here, we develop Spectral Top-Down Recovery (STDR), a divide-and-conquer approach for inference of large latent tree models. Unlike previous methods, STDR's partitioning step is non-random. Instead, it is based on the Fiedler vector of a suitable Laplacian matrix related to the observed nodes. We prove that under certain conditions this partitioning is consistent with the tree structure. This, in turn, leads to a significantly simpler procedure for merging the small subtrees. We prove that STDR is statistically consistent and bound the number of samples required to accurately recover the tree with high probability. Using simulated data from several common tree models in phylogenetics, we demonstrate that STDR has a significant advantage in terms of runtime, with improved or similar accuracy.
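The non-random partitioning step can be illustrated with a standard Fiedler-vector split of an affinity matrix; the specific Laplacian built from the observed nodes in the paper, and the subsequent subtree recovery and merging, are not shown.

```python
import numpy as np

def fiedler_partition(W):
    """Split nodes into two groups by the sign of the Fiedler vector, i.e. the
    eigenvector of the graph Laplacian with the second-smallest eigenvalue.
    W is a symmetric (n, n) affinity matrix over the observed (terminal) nodes."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    _, eigvecs = np.linalg.eigh(L)                   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]
    return np.flatnonzero(fiedler >= 0), np.flatnonzero(fiedler < 0)
```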

Deep Gated Canonical Correlation Analysis

Oct 12, 2020
Ofir Lindenbaum, Moshe Salhov, Amir Averbuch, Yuval Kluger

Canonical Correlation Analysis (CCA) models can extract informative correlated representations from multimodal unlabelled data. Despite their success, CCA models may break down if the number of variables exceeds the number of samples. We propose Deep Gated-CCA, a method for learning correlated representations based on a sparse subset of variables from two observed modalities. The proposed procedure learns two non-linear transformations and simultaneously gates the input variables to identify a subset of the most correlated variables. The non-linear transformations are learned by training two neural networks to maximize a shared correlation loss defined on their outputs. Gating is obtained by adding an approximate $\ell_0$ regularization term applied to the input variables. This approximation relies on a recently proposed continuous, Gaussian-based relaxation of Bernoulli variables, which act as gates. We demonstrate the efficacy of the method using several synthetic and real examples. Most notably, the method outperforms other linear and non-linear CCA models.
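The Gaussian-based relaxation of Bernoulli gates mentioned in the abstract can be sketched roughly as below: each gate is z = clip(mu + eps, 0, 1) with eps ~ N(0, sigma^2), and the expected number of open gates serves as the approximate $\ell_0$ penalty. The shared correlation loss between the two networks' outputs is not shown, and the initialization and sigma value are illustrative.

```python
import torch

class StochasticGates(torch.nn.Module):
    """Gaussian-based relaxation of Bernoulli gates applied to the input variables."""
    def __init__(self, d, sigma=0.5):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.full((d,), 0.5))
        self.sigma = sigma

    def forward(self, x):                            # x: (batch, d)
        eps = self.sigma * torch.randn_like(self.mu) if self.training else 0.0
        z = torch.clamp(self.mu + eps, 0.0, 1.0)     # gate values in [0, 1]
        return x * z

    def expected_l0(self):
        # P(gate > 0) = Phi(mu / sigma); summing gives the approximate l0 penalty.
        return torch.distributions.Normal(0.0, 1.0).cdf(self.mu / self.sigma).sum()
```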
