Ari Morcos

Linking average- and worst-case perturbation robustness via class selectivity and dimensionality

Oct 14, 2020

Matthew L. Leavitt, Ari Morcos

Abstract: Representational sparsity is known to affect robustness to input perturbations in deep neural networks (DNNs), but less is known about how the semantic content of representations affects robustness. Class selectivity (the variability of a unit's responses across data classes or dimensions) is one way of quantifying the sparsity of semantic representations. Given recent evidence that class selectivity may not be necessary for, and can even impair, generalization, we investigated whether it also confers robustness (or vulnerability) to perturbations of input data. We found that class selectivity leads to increased vulnerability to average-case (naturalistic) perturbations in ResNet18 and ResNet20, as measured using Tiny ImageNetC and CIFAR10C, respectively. Networks regularized to have lower levels of class selectivity are more robust to average-case perturbations, while networks with higher class selectivity are more vulnerable. In contrast, we found that class selectivity increases robustness to worst-case (i.e. white box adversarial) perturbations, suggesting that while decreasing class selectivity is helpful for average-case robustness, it is harmful for worst-case robustness. To explain this difference, we studied the dimensionality of the networks' representations: we found that the dimensionality of early-layer representations is inversely proportional to a network's class selectivity, and that adversarial samples cause a larger increase in early-layer dimensionality than corrupted samples. We also found that the input-unit gradient was more variable across samples and units in high-selectivity networks compared to low-selectivity networks. These results lead to the conclusion that units participate more consistently in low-selectivity regimes compared to high-selectivity regimes, effectively creating a larger attack surface and hence vulnerability to worst-case perturbations.

* arXiv admin note: text overlap with arXiv:2007.04440

CURI: A Benchmark for Productive Concept Learning Under Uncertainty

Oct 06, 2020

Ramakrishna Vedantam, Arthur Szlam, Maximilian Nickel, Ari Morcos, Brenden Lake

Abstract: Humans can learn and reason under substantial uncertainty in a space of infinitely many concepts, including structured relational concepts ("a scene with objects that have the same color") and ad-hoc categories defined through goals ("objects that could fall on one's head"). In contrast, standard classification benchmarks: 1) consider only a fixed set of category labels, 2) do not evaluate compositional concept learning, and 3) do not explicitly capture a notion of reasoning under uncertainty. We introduce a new few-shot, meta-learning benchmark, Compositional Reasoning Under Uncertainty (CURI), to bridge this gap. CURI evaluates different aspects of productive and systematic generalization, including abstract understandings of disentangling, productive generalization, learning boolean operations, variable binding, etc. Importantly, it also defines a model-independent "compositionality gap" to evaluate the difficulty of generalizing out-of-distribution along each of these axes. Extensive evaluations across a range of modeling choices spanning different modalities (image, schemas, and sounds), splits, privileged auxiliary concept information, and choices of negatives reveal substantial scope for modeling advances on the proposed task. All code and datasets will be available online.

Analyzing Visual Representations in Embodied Navigation Tasks

Mar 12, 2020

Erik Wijmans, Julian Straub, Dhruv Batra, Irfan Essa, Judy Hoffman, Ari Morcos

Abstract: Recent advances in deep reinforcement learning require a large amount of training data and generally result in representations that are overspecialized to the target task. In this work, we present a methodology to study the underlying potential causes for this specialization. We use the recently proposed projection-weighted Canonical Correlation Analysis (PWCCA) to measure the similarity of visual representations learned in the same environment by performing different tasks. We then leverage our proposed methodology to examine the task dependence of visual representations learned on related but distinct embodied navigation tasks. Surprisingly, we find that slight differences in task have no measurable effect on the visual representation for both SqueezeNet and ResNet architectures. We then empirically demonstrate that visual representations learned on one task can be effectively transferred to a different task.

Selectivity considered harmful: evaluating the causal impact of class selectivity in DNNs

Mar 03, 2020

Matthew L. Leavitt, Ari Morcos

Abstract: Class selectivity, typically defined as how different a neuron's responses are across different classes of stimuli or data samples, is a common metric used to interpret the function of individual neurons in biological and artificial neural networks. However, it remains an open question whether it is necessary and/or sufficient for deep neural networks (DNNs) to learn class selectivity in individual units. In order to investigate the causal impact of class selectivity on network function, we directly regularize for or against class selectivity. Using this regularizer, we were able to reduce mean class selectivity across units in convolutional neural networks by a factor of 2.5 with no impact on test accuracy, and reduce it nearly to zero with only a small (~2%) change in test accuracy. In contrast, increasing class selectivity beyond the levels naturally learned during training had rapid and disastrous effects on test accuracy. These results indicate that class selectivity in individual units is neither sufficient nor strictly necessary for DNN performance, and more generally encourage caution when focusing on the properties of single units as representative of the mechanisms by which DNNs function.
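
As a rough illustration of the kind of metric this abstract describes, the sketch below computes one commonly used form of a class selectivity index for a single unit: the gap between the unit's largest class-conditional mean activation and the mean of its remaining class-conditional means, normalized by their sum. The function name, the `eps` stabilizer, and the assumption that activations are non-negative (for example, post-ReLU) are illustrative choices and are not taken from the abstract, which does not state the exact formula used.

```python
from collections import defaultdict


def class_selectivity_index(activations, labels, eps=1e-7):
    """Selectivity of ONE unit, given one activation per sample.

    Assumes non-negative activations (e.g. post-ReLU) and at least
    two distinct class labels. Hypothetical helper for illustration.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a, y in zip(activations, labels):
        sums[y] += a
        counts[y] += 1

    # Mean activation of the unit for each class.
    class_means = [sums[c] / counts[c] for c in sums]
    if len(class_means) < 2:
        raise ValueError("need samples from at least two classes")

    mu_max = max(class_means)
    # Mean over all classes except the most strongly driving one.
    mu_rest = (sum(class_means) - mu_max) / (len(class_means) - 1)
    return (mu_max - mu_rest) / (mu_max + mu_rest + eps)


labels = [0, 0, 1, 1, 2, 2]
# A unit that fires only for class 0 scores close to 1.
selective = class_selectivity_index([1.0, 1.0, 0.0, 0.0, 0.0, 0.0], labels)
# A unit that fires equally for every class scores 0.
flat = class_selectivity_index([0.5] * 6, labels)
```

Under this definition the index ranges from 0 (identical mean response to every class) to nearly 1 (response to a single class only), which is the quantity a regularizer of the kind described would push up or down.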

Representation Learning Through Latent Canonicalizations

Feb 26, 2020

Or Litany, Ari Morcos, Srinath Sridhar, Leonidas Guibas, Judy Hoffman

Abstract: We seek to learn a representation on a large annotated data source that generalizes to a target domain using limited new supervision. Many prior approaches to this problem have focused on learning "disentangled" representations so that as individual factors vary in a new domain, only a portion of the representation need be updated. In this work, we seek the generalization power of disentangled representations, but relax the requirement of explicit latent disentanglement and instead encourage linearity of individual factors of variation by requiring them to be manipulable by learned linear transformations. We dub these transformations latent canonicalizers, as they aim to modify the value of a factor to a pre-determined (but arbitrary) canonical value (e.g., recoloring the image foreground to black). Assuming a source domain with access to meta-labels specifying the factors of variation within an image, we demonstrate experimentally that our method helps reduce the number of observations needed to generalize to a similar target domain when compared to a number of supervised baselines.

Pruning Convolutional Neural Networks with Self-Supervision

Jan 10, 2020

Mathilde Caron, Ari Morcos, Piotr Bojanowski, Julien Mairal, Armand Joulin

Abstract: Convolutional neural networks trained without supervision come close to matching performance with supervised pre-training, but sometimes at the cost of an even higher number of parameters. Extracting subnetworks from these large unsupervised convnets with preserved performance is of particular interest to make them less computationally intensive. Typical pruning methods operate during training on a task while trying to maintain the performance of the pruned network on the same task. However, in self-supervised feature learning, the training objective is agnostic to the transferability of the representation to downstream tasks. Thus, preserving performance for this objective does not ensure that the pruned subnetwork remains effective for solving downstream tasks. In this work, we investigate the use of standard pruning methods, developed primarily for supervised learning, for networks trained without labels (i.e. on self-supervised tasks). We show that pruned masks obtained with or without labels reach comparable performance when re-trained on labels, suggesting that pruning operates similarly for self-supervised and supervised learning. Interestingly, we also find that pruning preserves the transfer performance of self-supervised subnetwork representations.

Decentralized Distributed PPO: Solving PointGoal Navigation

Nov 01, 2019

Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, Dhruv Batra

Abstract: We present Decentralized Distributed Proximal Policy Optimization (DD-PPO), a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever "stale"), making it conceptually simple and easy to implement. In our experiments on training virtual robots to navigate in Habitat-Sim, DD-PPO exhibits near-linear scaling -- achieving a speedup of 107x on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs. This massive-scale training not only sets the state of the art on the Habitat Autonomous Navigation Challenge 2019, but essentially "solves" the task -- near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS+Compass sensor. Fortuitously, error vs computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks -- the analog of "ImageNet pre-training + task-specific fine-tuning" for embodied AI. Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models + code will be publicly available).
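
The scaling figures quoted in this abstract can be sanity-checked with simple arithmetic. The sketch below is only a back-of-the-envelope check; the helper names are hypothetical, and the 3-day figure is used as an upper bound because the abstract says "under 3 days".

```python
def scaling_efficiency(speedup, num_workers):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / num_workers


def gpu_days(wall_clock_days, num_gpus):
    """Total GPU-time consumed, in GPU-days."""
    return wall_clock_days * num_gpus


# 107x speedup on 128 GPUs, as quoted in the abstract.
efficiency = scaling_efficiency(107, 128)

# At most 3 days of wall-clock time on 64 GPUs.
budget = gpu_days(3, 64)

print(f"scaling efficiency: {efficiency:.3f}")  # 0.836
print(f"GPU-days (upper bound): {budget}")      # 192
```

A speedup of 107x on 128 GPUs corresponds to roughly 84% of ideal linear scaling, and 192 GPU-days is a little over six months of single-GPU time, which is consistent with the "over 6 months of GPU-time" stated above.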

Luck Matters: Understanding Training Dynamics of Deep ReLU Networks

Jun 10, 2019

Yuandong Tian, Tina Jiang, Qucheng Gong, Ari Morcos

Abstract: We analyze the dynamics of training deep ReLU networks and their implications on generalization capability. Using a teacher-student setting, we discovered a novel relationship between the gradient received by hidden student nodes and the activations of teacher nodes for deep ReLU networks. With this relationship and the assumption of small overlapping teacher node activations, we prove that (1) student nodes whose weights are initialized to be close to teacher nodes converge to them at a faster rate, and (2) in over-parameterized regimes and the 2-layer case, while a small set of lucky nodes do converge to the teacher nodes, the fan-out weights of other nodes converge to zero. This framework provides insight into multiple puzzling phenomena in deep learning like over-parameterization, implicit regularization, lottery tickets, etc. We verify our assumption by showing that the majority of BatchNorm biases of pre-trained VGG11/16 models are negative. Experiments on (1) random deep teacher networks with Gaussian inputs, (2) a teacher network pre-trained on CIFAR-10 and (3) extensive ablation studies validate our multiple theoretical predictions.
