Michael C. Mozer

Can Neural Network Memorization Be Localized?

Jul 18, 2023
Pratyush Maini, Michael C. Mozer, Hanie Sedghi, Zachary C. Lipton, J. Zico Kolter, Chiyuan Zhang

Recent efforts at explaining the interplay of memorization and generalization in deep overparametrized networks have posited that neural networks $\textit{memorize}$ "hard" examples in the final few layers of the model. Memorization refers to the ability to correctly predict on $\textit{atypical}$ examples of the training set. In this work, we show that rather than being confined to individual layers, memorization is a phenomenon confined to a small set of neurons in various layers of the model. First, via three experimental sources of converging evidence, we find that most layers are redundant for the memorization of examples and the layers that contribute to example memorization are, in general, not the final layers. The three sources are $\textit{gradient accounting}$ (measuring the contribution to the gradient norms from memorized and clean examples), $\textit{layer rewinding}$ (replacing specific model weights of a converged model with previous training checkpoints), and $\textit{retraining}$ (training rewound layers only on clean examples). Second, we ask a more generic question: can memorization be localized $\textit{anywhere}$ in a model? We discover that memorization is often confined to a small number of neurons or channels (around 5) of the model. Based on these insights, we propose a new form of dropout -- $\textit{example-tied dropout}$ -- that enables us to direct the memorization of examples to an a priori determined set of neurons. By dropping out these neurons, we are able to reduce the accuracy on memorized examples from $100\%\to3\%$, while also reducing the generalization gap.
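The example-tied dropout mechanism lends itself to a compact sketch. Below is a minimal numpy illustration (the function names and the per-example assignment scheme are illustrative, not the paper's implementation): a fixed pool of "generalization" neurons is always active, each training example is tied to a small deterministic subset of "memorization" neurons, and dropping the memorization pool at inference suppresses memorized predictions.

```python
import numpy as np

def example_tied_mask(example_ids, n_gen, n_mem_total, mem_per_example, seed=0):
    """Binary activation mask over one layer's neurons for a batch.

    Neurons [0, n_gen) are generalization neurons, always active.
    Neurons [n_gen, n_gen + n_mem_total) are memorization neurons; each
    example is tied to a fixed subset of `mem_per_example` of them.
    """
    n_units = n_gen + n_mem_total
    masks = np.zeros((len(example_ids), n_units), dtype=np.float32)
    masks[:, :n_gen] = 1.0  # generalization neurons always on
    for row, ex_id in enumerate(example_ids):
        ex_rng = np.random.default_rng(seed + ex_id)  # deterministic per example
        tied = ex_rng.choice(n_mem_total, size=mem_per_example, replace=False)
        masks[row, n_gen + tied] = 1.0
    return masks

def inference_mask(n_gen, n_mem_total):
    """At test time, drop every memorization neuron."""
    return np.concatenate([np.ones(n_gen), np.zeros(n_mem_total)]).astype(np.float32)
```

Because the tied subset is a deterministic function of the example id, the same example reactivates the same memorization neurons on every epoch.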

* Accepted at ICML 2023 

Spotlight Attention: Robust Object-Centric Learning With a Spatial Locality Prior

May 31, 2023
Ayush Chakravarthy, Trang Nguyen, Anirudh Goyal, Yoshua Bengio, Michael C. Mozer

The aim of object-centric vision is to construct an explicit representation of the objects in a scene. This representation is obtained via a set of interchangeable modules called \emph{slots} or \emph{object files} that compete for local patches of an image. The competition has a weak inductive bias to preserve spatial continuity; consequently, one slot may claim patches scattered diffusely throughout the image. In contrast, the inductive bias of human vision is strong, to the degree that attention has classically been described with a spotlight metaphor. We incorporate a spatial-locality prior into state-of-the-art object-centric vision models and obtain significant improvements in segmenting objects in both synthetic and real-world datasets. Similar to human visual attention, the combination of image content and spatial constraints yield robust unsupervised object-centric learning, including less sensitivity to model hyperparameters.
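A spatial-locality prior of this kind can be sketched as a distance penalty on the slot-attention logits before slots compete for each patch. This is an illustrative numpy sketch under a Gaussian-falloff assumption, not the paper's exact formulation:

```python
import numpy as np

def spatial_prior_attention(logits, patch_xy, slot_xy, sigma=0.15):
    """Slot-attention weights with a Gaussian spatial-locality penalty.

    logits:   (num_slots, num_patches) content-based attention scores
    patch_xy: (num_patches, 2) normalized patch centers
    slot_xy:  (num_slots, 2) current slot position estimates
    """
    logits = np.asarray(logits, dtype=float)
    d2 = ((slot_xy[:, None, :] - patch_xy[None, :, :]) ** 2).sum(-1)  # (S, P)
    penalized = logits - d2 / (2.0 * sigma ** 2)  # down-weight distant patches
    penalized -= penalized.max(axis=0, keepdims=True)  # numerical stability
    w = np.exp(penalized)
    return w / w.sum(axis=0, keepdims=True)  # slots compete for each patch
```

With equal content scores, the nearer slot wins each patch, so a slot cannot claim patches scattered diffusely across the image.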

* 16 pages, 3 figures, under review at NeurIPS 2023 

Layer-Stack Temperature Scaling

Nov 18, 2022
Amr Khalifa, Michael C. Mozer, Hanie Sedghi, Behnam Neyshabur, Ibrahim Alabdulmohsin

Recent works demonstrate that early layers in a neural network contain useful information for prediction. Inspired by this, we show that extending temperature scaling across all layers improves both calibration and accuracy. We call this procedure "layer-stack temperature scaling" (LATES). Informally, LATES grants each layer a weighted vote during inference. We evaluate it on five popular convolutional neural network architectures both in- and out-of-distribution and observe a consistent improvement over temperature scaling in terms of accuracy, calibration, and AUC. All conclusions are supported by comprehensive statistical analyses. Since LATES neither retrains the architecture nor introduces many more parameters, its advantages can be reaped without requiring additional data beyond what is used in temperature scaling. Finally, we show that combining LATES with Monte Carlo Dropout matches state-of-the-art results on CIFAR10/100.
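Informally, the weighted vote can be sketched as a per-layer temperature followed by a weighted average of the layers' softmax outputs. The names and probe setup below are illustrative; the paper fits the temperatures and weights on held-out data:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def lates_predict(layer_logits, temperatures, weights):
    """Combine per-layer predictions as a weighted vote.

    layer_logits: list of (batch, classes) logits, one per layer probe
    temperatures: one temperature per layer
    weights:      nonnegative per-layer vote weights (normalized here)
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * softmax(l / ti)
               for wi, ti, l in zip(w, temperatures, layer_logits))
```

Setting the last layer's weight to 1 and the rest to 0 recovers ordinary temperature scaling, which is why the method needs no extra data beyond what temperature scaling already uses.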

* 10 pages, 7 figures, 3 tables 

An Empirical Study on Clustering Pretrained Embeddings: Is Deep Strictly Better?

Nov 09, 2022
Tyler R. Scott, Ting Liu, Michael C. Mozer, Andrew C. Gallagher

Recent research in clustering face embeddings has found that unsupervised, shallow, heuristic-based methods -- including $k$-means and hierarchical agglomerative clustering -- underperform supervised, deep, inductive methods. While the reported improvements are indeed impressive, experiments are mostly limited to face datasets, where the clustered embeddings are highly discriminative or well-separated by class (Recall@1 above 90% and often nearing ceiling), and the experimental methodology seemingly favors the deep methods. We conduct a large-scale empirical study of 17 clustering methods across three datasets and obtain several robust findings. Notably, deep methods are surprisingly fragile for embeddings with more uncertainty, where they merely match, or even underperform, shallow, heuristic-based methods. When embeddings are highly discriminative, deep methods do outperform the baselines, consistent with past results, but the margin between methods is much smaller than previously reported. We believe our benchmarks broaden the scope of supervised clustering methods beyond the face domain and can serve as a foundation on which these methods could be improved. To enable reproducibility, we include all necessary details in the appendices, and plan to release the code.
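The shallow baseline in question, $k$-means, fits in a few lines of numpy. This is a minimal Lloyd's-algorithm sketch, not the exact configuration used in the study:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means (Lloyd's algorithm) over embedding vectors X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        # assign each point to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        # move each center to the mean of its assigned points
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return assign, centers
```

On well-separated embeddings (the regime in most face benchmarks), even this heuristic recovers the class structure, which is part of why the reported deep-vs-shallow margins deserved re-examination.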

SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos

Jun 15, 2022
Gamaleldin F. Elsayed, Aravindh Mahendran, Sjoerd van Steenkiste, Klaus Greff, Michael C. Mozer, Thomas Kipf

The visual world can be parsimoniously characterized in terms of distinct entities with sparse interactions. Discovering this compositional structure in dynamic visual scenes has proven challenging for end-to-end computer vision approaches unless explicit instance-level supervision is provided. Slot-based models leveraging motion cues have recently shown great promise in learning to represent, segment, and track objects without direct supervision, but they still fail to scale to complex real-world multi-object videos. In an effort to bridge this gap, we take inspiration from human development and hypothesize that information about scene geometry in the form of depth signals can facilitate object-centric learning. We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation. By further leveraging best practices for model scaling, we are able to train SAVi++ to segment complex dynamic scenes recorded with moving cameras, containing both static and moving objects of diverse appearance on naturalistic backgrounds, without the need for segmentation supervision. Finally, we demonstrate that by using sparse depth signals obtained from LiDAR, SAVi++ is able to learn emergent object segmentation and tracking from videos in the real-world Waymo Open dataset.
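The sparse depth signal can be sketched as a masked regression loss that supervises predicted depth only at pixels where targets (e.g. LiDAR returns) exist. This is illustrative; SAVi++'s actual training objective has additional terms:

```python
import numpy as np

def sparse_depth_loss(pred_depth, target_depth, valid_mask):
    """Mean squared depth error over only the pixels with a valid target.

    pred_depth, target_depth, valid_mask: arrays of the same (H, W) shape;
    valid_mask is 1 where a sparse depth target exists, 0 elsewhere.
    """
    diff2 = (pred_depth - target_depth) ** 2 * valid_mask
    return diff2.sum() / np.maximum(valid_mask.sum(), 1)  # avoid divide-by-zero
```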

* Project page at https://slot-attention-video.github.io/savi++/ 

Overcoming Temptation: Incentive Design For Intertemporal Choice

Mar 14, 2022
Shruthi Sukumar, Adrian F. Ward, Camden Elliott-Williams, Shabnam Hakimi, Michael C. Mozer

Individuals are often faced with temptations that can lead them astray from long-term goals. We are interested in developing interventions that steer individuals toward making good initial decisions and then maintaining those decisions over time. In the realm of financial decision making, a particularly successful approach is the prize-linked savings account: individuals are incentivized to make deposits by tying deposits to a periodic lottery that awards bonuses to the savers. Although these lotteries have been very effective in motivating savers across the globe, they are a one-size-fits-all solution. We investigate whether customized bonuses can be more effective. We formalize a delayed-gratification task as a Markov decision problem and characterize individuals as rational agents subject to temporal discounting, a cost associated with effort, and fluctuations in willpower. Our theory is able to explain key behavioral findings in intertemporal choice. We created an online delayed-gratification game in which the player scores points by selecting a queue to wait in and then performing a series of actions to advance to the front. Data collected from the game is fit to the model, and the instantiated model is then used to optimize predicted player performance over a space of incentives. We demonstrate that customized incentive structures can improve an individual's goal-directed decision making.
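The delayed-gratification queue admits a small value-iteration sketch: with temporal discounting and a per-step effort cost, the agent persists only when the discounted reward at the front of the queue outweighs the accumulated effort. This is a toy version; the paper's model additionally includes willpower fluctuations:

```python
import numpy as np

def queue_values(n_steps, reward, effort_cost, gamma):
    """Value iteration for a toy delayed-gratification queue.

    State s = steps remaining to the front. Action 'advance' pays an
    effort cost and moves to s - 1; 'quit' ends the episode with value 0.
    The reward is collected at s == 0; gamma is the temporal discount.
    """
    V = np.zeros(n_steps + 1)
    V[0] = reward  # at the front, the reward is received
    for s in range(1, n_steps + 1):
        advance = -effort_cost + gamma * V[s - 1]
        V[s] = max(advance, 0.0)  # the agent quits if advancing has negative value
    return V
```

A state with value 0 marks the point where the rational (discounting, effort-averse) agent gives up, which is exactly the behavior a customized bonus schedule would aim to prevent.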

Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning

Jan 10, 2022
Utku Evci, Vincent Dumoulin, Hugo Larochelle, Michael C. Mozer

Transfer-learning methods aim to improve performance in a data-scarce target domain using a model pretrained on a data-rich source domain. A cost-efficient strategy, linear probing, involves freezing the source model and training a new classification head for the target domain. This strategy is outperformed by a more costly but state-of-the-art method -- fine-tuning all parameters of the source model to the target domain -- possibly because fine-tuning allows the model to leverage useful information from intermediate layers which is otherwise discarded by the later pretrained layers. We explore the hypothesis that these intermediate layers might be directly exploited. We propose a method, Head-to-Toe probing (Head2Toe), that selects features from all layers of the source model to train a classification head for the target domain. In evaluations on VTAB-1k, Head2Toe matches performance obtained with fine-tuning on average while reducing training and storage costs a hundredfold or more, and critically, for out-of-distribution transfer, Head2Toe outperforms fine-tuning.
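Head2Toe's core recipe -- concatenate features from every layer, select a relevant subset, and fit a linear head -- can be sketched as follows. The feature-scoring rule here is a simple class-correlation proxy standing in for the paper's group-lasso-based selection:

```python
import numpy as np

def head2toe_style_head(layer_feats, y_onehot, keep_frac=0.1, l2=1e-3):
    """Select relevant features across all layers and fit a linear head.

    layer_feats: list of (n, d_i) feature matrices, one per layer
    y_onehot:    (n, classes) one-hot labels for the target task
    """
    X = np.concatenate(layer_feats, axis=1)
    X = (X - X.mean(0)) / (X.std(0) + 1e-8)  # standardize each feature
    # relevance proxy: max absolute correlation with any (centered) class
    score = np.abs(X.T @ (y_onehot - y_onehot.mean(0))).max(1)
    k = max(1, int(keep_frac * X.shape[1]))
    idx = np.argsort(-score)[:k]  # keep the top fraction of features
    Xs = X[:, idx]
    # ridge-regularized least-squares head on the selected features
    W = np.linalg.solve(Xs.T @ Xs + l2 * np.eye(k), Xs.T @ y_onehot)
    return idx, W
```

The storage savings come from `idx`: only the selected fraction of intermediate features must be extracted and kept per target task, rather than a full fine-tuned copy of the backbone.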

Online Unsupervised Learning of Visual Representations and Categories

Sep 13, 2021
Mengye Ren, Tyler R. Scott, Michael L. Iuzzolino, Michael C. Mozer, Richard Zemel

Real-world learning scenarios involve a nonstationary distribution of classes with sequential dependencies among the samples, in contrast to the standard machine learning formulation of drawing samples independently from a fixed, typically uniform distribution. Furthermore, real-world interactions demand learning on-the-fly from few or no class labels. In this work, we propose an unsupervised model that simultaneously performs online visual representation learning and few-shot learning of new categories without relying on any class labels. Our model is a prototype-based memory network with a control component that determines when to form a new class prototype. We formulate it as an online Gaussian mixture model, where components are created online with only a single new example, and assignments do not have to be balanced, which permits an approximation to natural imbalanced distributions from uncurated raw data. Learning includes a contrastive loss that encourages different views of the same image to be assigned to the same prototype. The result is a mechanism that forms categorical representations of objects in nonstationary environments. Experiments show that our method can learn from an online stream of visual input data and is significantly better at category recognition compared to state-of-the-art self-supervised learning methods.
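The control component can be sketched as a threshold rule: assign an incoming example to its nearest prototype when it is close enough, otherwise spawn a new prototype from that single example. This is a simplified stand-in for the paper's online Gaussian mixture, without the contrastive representation-learning loss:

```python
import numpy as np

class OnlinePrototypes:
    """Online prototype memory with single-example component creation."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.protos = []   # running-mean vector per prototype
        self.counts = []   # number of examples absorbed per prototype

    def observe(self, x):
        """Assign x to a prototype (returning its index), creating one if needed."""
        x = np.asarray(x, dtype=float)
        if self.protos:
            d = [np.linalg.norm(x - p) for p in self.protos]
            j = int(np.argmin(d))
            if d[j] < self.threshold:
                self.counts[j] += 1
                # incremental update of the running mean
                self.protos[j] += (x - self.protos[j]) / self.counts[j]
                return j
        self.protos.append(x.copy())  # new component from a single example
        self.counts.append(1)
        return len(self.protos) - 1
```

Because components are created from a single example and counts are never rebalanced, the memory naturally tracks imbalanced, nonstationary class distributions.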

* 29 pages 

Learning Neural Causal Models with Active Interventions

Sep 06, 2021
Nino Scherrer, Olexa Bilaniuk, Yashas Annadani, Anirudh Goyal, Patrick Schwab, Bernhard Schölkopf, Michael C. Mozer, Yoshua Bengio, Stefan Bauer, Nan Rosemary Ke

Discovering causal structures from data is a challenging inference problem of fundamental importance in all areas of science. The appealing scaling properties of neural networks have recently led to a surge of interest in differentiable neural network-based methods for learning causal structures from data. So far differentiable causal discovery has focused on static datasets of observational or interventional origin. In this work, we introduce an active intervention-targeting mechanism which enables a quick identification of the underlying causal structure of the data-generating process. Our method significantly reduces the required number of interactions compared with random intervention targeting and is applicable for both discrete and continuous optimization formulations of learning the underlying directed acyclic graph (DAG) from data. We examine the proposed method across a wide range of settings and demonstrate superior performance on multiple benchmarks from simulated to real-world data.
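One simple form of intervention targeting is to intervene on the node about whose incident edges the current posterior is most uncertain. This is an illustrative disagreement heuristic over posterior DAG samples, not the paper's exact acquisition criterion:

```python
import numpy as np

def pick_intervention(adj_samples):
    """Pick the node whose incident-edge beliefs are most uncertain.

    adj_samples: (num_samples, n, n) binary adjacency matrices drawn from
    the current posterior over candidate DAGs; entry (i, j) = 1 means i -> j.
    """
    A = np.asarray(adj_samples, dtype=float)
    p = A.mean(axis=0)                      # marginal edge probabilities
    eps = 1e-12
    # Bernoulli entropy per edge: high where the samples disagree
    H = -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))
    node_score = H.sum(axis=0) + H.sum(axis=1)  # edges into + out of each node
    return int(np.argmax(node_score))
```

Intervening where the posterior disagrees most is what lets active targeting resolve the DAG in far fewer interactions than random intervention choices.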

Soft Calibration Objectives for Neural Networks

Jul 30, 2021
Archit Karandikar, Nicholas Cain, Dustin Tran, Balaji Lakshminarayanan, Jonathon Shlens, Michael C. Mozer, Becca Roelofs

Optimal decision making requires that classifiers produce uncertainty estimates consistent with their empirical accuracy. However, deep neural networks are often under- or over-confident in their predictions. Consequently, methods have been developed to improve the calibration of their predictive uncertainty both during training and post-hoc. In this work, we propose differentiable losses to improve calibration based on a soft (continuous) version of the binning operation underlying popular calibration-error estimators. When incorporated into training, these soft calibration losses achieve state-of-the-art single-model ECE across multiple datasets with less than 1% decrease in accuracy. For instance, we observe an 82% reduction in ECE (70% relative to the post-hoc rescaled ECE) in exchange for a 0.7% relative decrease in accuracy relative to the cross entropy baseline on CIFAR-100. When incorporated post-training, the soft-binning-based calibration error objective improves upon temperature scaling, a popular recalibration method. Overall, experiments across losses and datasets demonstrate that using calibration-sensitive procedures yields better uncertainty estimates under dataset shift than the standard practice of using a cross entropy loss and post-hoc recalibration methods.
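The soft-binning idea can be sketched directly: replace the hard bin assignment in the standard binned ECE estimator with soft membership weights, making the estimate differentiable in the confidences. This is an illustrative numpy version; the paper's training losses differ in detail:

```python
import numpy as np

def soft_binned_ece(confidences, accuracies, n_bins=10, temperature=0.01):
    """Soft (differentiable) version of the binned ECE estimator.

    Each prediction contributes to every bin with a soft membership weight
    (a softmax over negative squared distance to the bin centers) instead
    of a hard argmax assignment.
    """
    centers = (np.arange(n_bins) + 0.5) / n_bins
    d2 = (confidences[:, None] - centers[None, :]) ** 2
    m = np.exp(-d2 / temperature)
    m = m / m.sum(1, keepdims=True)            # (N, bins) soft memberships
    bin_mass = m.sum(0)                        # soft counts per bin
    bin_conf = (m * confidences[:, None]).sum(0) / (bin_mass + 1e-12)
    bin_acc = (m * accuracies[:, None]).sum(0) / (bin_mass + 1e-12)
    gaps = np.abs(bin_conf - bin_acc)
    return float((bin_mass / len(confidences) * gaps).sum())
```

As `temperature` shrinks, the memberships approach the hard binning of the standard estimator; at moderate values the quantity stays smooth, so it can serve as a training-time penalty rather than only a post-hoc diagnostic.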

* 17 pages total, 10 page main paper, 5 page appendix, 10 figures total, 8 figures in main paper, 2 figures in appendix 