Ari S. Morcos

One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers

Jun 06, 2019

Ari S. Morcos, Haonan Yu, Michela Paganini, Yuandong Tian

Abstract: The success of lottery ticket initializations (Frankle and Carbin, 2019) suggests that small, sparsified networks can be trained so long as the network is initialized appropriately. Unfortunately, finding these "winning ticket" initializations is computationally expensive. One potential solution is to reuse the same winning tickets across a variety of datasets and optimizers. However, the generality of winning ticket initializations remains unclear. Here, we attempt to answer this question by generating winning tickets for one training configuration (optimizer and dataset) and evaluating their performance on another configuration. Perhaps surprisingly, we found that, within the natural images domain, winning ticket initializations generalized across a variety of datasets, including Fashion MNIST, SVHN, CIFAR-10/100, ImageNet, and Places365, often achieving performance close to that of winning tickets generated on the same dataset. Moreover, winning tickets generated using larger datasets consistently transferred better than those generated using smaller datasets. We also found that winning ticket initializations generalize across optimizers with high performance. These results suggest that winning ticket initializations contain inductive biases generic to neural networks more broadly which improve training across many settings and provide hope for the development of better initialization methods.
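
As a rough illustration of what a "winning ticket" is, the sketch below follows the general recipe associated with Frankle and Carbin (2019): prune the smallest-magnitude weights of a trained network, then reset the surviving weights to their original initial values. It is a one-shot NumPy toy on a single weight matrix, not the procedure used in this paper; the function names, the `sparsity` parameter, and the simulated "training" step are all assumptions made for illustration.

```python
import numpy as np

def magnitude_mask(trained, sparsity):
    """Return a 0/1 mask that keeps the largest-magnitude entries of `trained`."""
    k = int(round(sparsity * trained.size))  # number of weights to prune
    if k == 0:
        return np.ones_like(trained)
    # The k-th smallest absolute value is the pruning threshold.
    threshold = np.sort(np.abs(trained), axis=None)[k - 1]
    return (np.abs(trained) > threshold).astype(trained.dtype)

def winning_ticket(init, trained, sparsity):
    """Mask chosen from the trained weights, applied to the initial weights."""
    mask = magnitude_mask(trained, sparsity)
    return init * mask, mask

rng = np.random.default_rng(0)
init = rng.normal(size=(4, 4))
# Hypothetical stand-in for training: perturb the initial weights.
trained = init + rng.normal(scale=0.5, size=(4, 4))
ticket, mask = winning_ticket(init, trained, sparsity=0.75)
```

Transferring a ticket, in the sense studied here, would amount to training `ticket` (with `mask` held fixed) on a different dataset or with a different optimizer than the one that produced `trained`.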

Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP

Jun 06, 2019

Haonan Yu, Sergey Edunov, Yuandong Tian, Ari S. Morcos

Abstract: The lottery ticket hypothesis proposes that over-parameterization of deep neural networks (DNNs) aids training by increasing the probability of a "lucky" sub-network initialization being present rather than by helping the optimization process. This phenomenon is intriguing and suggests that initialization strategies for DNNs can be improved substantially, but the lottery ticket hypothesis has only previously been tested in the context of supervised learning for natural image tasks. Here, we evaluate whether "winning ticket" initializations exist in two different domains: reinforcement learning (RL) and natural language processing (NLP). For RL, we analyzed a number of discrete-action space tasks, including both classic control and pixel control. For NLP, we examined both recurrent LSTM models and large-scale Transformer models. Consistent with work in supervised image classification, we confirm that winning ticket initializations generally outperform parameter-matched random initializations, even at extreme pruning rates. Together, these results suggest that the lottery ticket hypothesis is not restricted to supervised learning of natural images, but rather represents a broader phenomenon in DNNs.

Learning to Make Analogies by Contrasting Abstract Relational Structure

Jan 31, 2019

Felix Hill, Adam Santoro, David G. T. Barrett, Ari S. Morcos, Timothy Lillicrap

Abstract: Analogical reasoning has been a principal focus of various waves of AI research. Analogy is particularly challenging for machines because it requires relational structures to be represented such that they can be flexibly applied across diverse domains of experience. Here, we study how analogical reasoning can be induced in neural networks that learn to perceive and reason about raw visual data. We find that the critical factor for inducing such a capacity is not an elaborate architecture, but rather, careful attention to the choice of data and the manner in which it is presented to the model. The most robust capacity for analogical reasoning is induced when networks learn analogies by contrasting abstract relational structures in their input domains, a training method that uses only the input data to force models to learn about important abstract features. Using this technique we demonstrate capacities for complex, visual and symbolic analogy making and generalisation in even the simplest neural network architectures.

Analyzing biological and artificial neural networks: challenges with opportunities for synergy?

Oct 31, 2018

David G. T. Barrett, Ari S. Morcos, Jakob H. Macke

Abstract: Deep neural networks (DNNs) transform stimuli across multiple processing stages to produce representations that can be used to solve complex tasks, such as object recognition in images. However, a full understanding of how they achieve this remains elusive. The complexity of biological neural networks substantially exceeds the complexity of DNNs, making it even more challenging to understand the representations that they learn. Thus, both machine learning and computational neuroscience are faced with a shared challenge: how can we analyze their representations in order to understand how they solve complex tasks? We review how data-analysis concepts and techniques developed by computational neuroscientists can be useful for analyzing representations in DNNs, and in turn, how recently developed techniques for analysis of DNNs can be useful for understanding representations in biological neural networks. We explore opportunities for synergy between the two fields, such as the use of DNNs as in-silico model systems for neuroscience, and how this synergy can lead to new hypotheses about the operating principles of biological neural networks.

Insights on representational similarity in neural networks with canonical correlation

Oct 23, 2018

Ari S. Morcos, Maithra Raghu, Samy Bengio

Abstract: Comparing different neural network representations and determining how representations evolve over time remain challenging open questions in our understanding of the function of neural networks. Comparing representations in neural networks is fundamentally difficult as the structure of representations varies greatly, even across groups of networks trained on identical tasks, and over the course of training. Here, we develop projection weighted CCA (Canonical Correlation Analysis) as a tool for understanding neural networks, building off of SVCCA, a recently proposed method (Raghu et al., 2017). We first improve the core method, showing how to differentiate between signal and noise, and then apply this technique to compare across a group of CNNs, demonstrating that networks which generalize converge to more similar representations than networks which memorize, that wider networks converge to more similar solutions than narrow networks, and that trained networks with identical topology but different learning rates converge to distinct clusters with diverse representations. We also investigate the representational dynamics of RNNs, across both training and sequential timesteps, finding that RNNs converge in a bottom-up pattern over the course of training and that the hidden state is highly variable over the course of a sequence, even when accounting for linear transforms. Together, these results provide new insights into the function of CNNs and RNNs, and demonstrate the utility of using CCA to understand representations.

* NIPS 2018
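
For readers unfamiliar with CCA as a similarity measure, the sketch below computes plain canonical correlations between two activation matrices using the standard QR-plus-SVD formulation. It does not implement the projection weighting or the SVCCA preprocessing described in the abstract; the function name, the random data, and the assumption of more samples than units with full column rank are illustrative choices only.

```python
import numpy as np

def cca_correlations(X, Y):
    """Canonical correlations between two (samples x units) activation matrices.

    Assumes more samples than units and full column rank in both inputs.
    """
    Xc = X - X.mean(axis=0)  # center each unit's activations
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)  # orthonormal basis for each representation
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the cosines of the principal angles
    # between the two subspaces, i.e. the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
A = rng.normal(size=(5, 5))  # arbitrary linear transform of the units
same_subspace = cca_correlations(X, X @ A)
unrelated = cca_correlations(X, rng.normal(size=(200, 5)))
```

Because CCA is invariant to invertible linear transforms, `same_subspace` should be close to 1 in every component, while `unrelated` should be much lower.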

Measuring abstract reasoning in neural networks

Jul 11, 2018

David G. T. Barrett, Felix Hill, Adam Santoro, Ari S. Morcos, Timothy Lillicrap

Abstract: Whether neural networks can learn abstract reasoning or whether they merely rely on superficial statistics is a topic of recent debate. Here, we propose a dataset and challenge designed to probe abstract reasoning, inspired by a well-known human IQ test. To succeed at this challenge, models must cope with various generalisation "regimes" in which the training and test data differ in clearly-defined ways. We show that popular models such as ResNets perform poorly, even when the training and test sets differ only minimally, and we present a novel architecture, with a structure designed to encourage reasoning, that does significantly better. When we vary the way in which the test questions and training data differ, we find that our model is notably proficient at certain forms of generalisation, but notably weak at others. We further show that the model's ability to generalise improves markedly if it is trained to predict symbolic explanations for its answers. Altogether, we introduce and explore ways to both measure and induce stronger abstract reasoning in neural networks. Our freely-available dataset should motivate further progress in this direction.

* ICML 2018

Human-level performance in first-person multiplayer games with population-based deep reinforcement learning

Jul 03, 2018

Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman(+8 more)

Abstract: Recent progress in artificial intelligence through reinforcement learning (RL) has shown great success on increasingly complex single-agent environments and two-player turn-based games. However, the real world contains multiple agents, each learning and acting independently to cooperate and compete with other agents, and environments reflecting this degree of complexity remain an open challenge. In this work, we demonstrate for the first time that an agent can achieve human-level performance in a popular 3D multiplayer first-person video game, Quake III Arena Capture the Flag, using only pixels and game points as input. These results were achieved by a novel two-tier optimisation process in which a population of independent RL agents is trained concurrently from thousands of parallel matches with agents playing in teams together and against each other on randomly generated environments. Each agent in the population learns its own internal reward signal to complement the sparse delayed reward from winning, and selects actions using a novel temporally hierarchical representation that enables the agent to reason at multiple timescales. During game-play, these agents display human-like behaviours such as navigating, following, and defending based on a rich learned representation that is shown to encode high-level game knowledge. In an extensive tournament-style evaluation the trained agents exceeded the win-rate of strong human players both as teammates and opponents, and proved far stronger than existing state-of-the-art agents. These results demonstrate a significant jump in the capabilities of artificial agents, bringing us closer to the goal of human-level intelligence.

Pooling is neither necessary nor sufficient for appropriate deformation stability in CNNs

May 25, 2018

Avraham Ruderman, Neil C. Rabinowitz, Ari S. Morcos, Daniel Zoran

Abstract: Many of our core assumptions about how neural networks operate remain empirically untested. One common assumption is that convolutional neural networks need to be stable to small translations and deformations to solve image recognition tasks. For many years, this stability was baked into CNN architectures by incorporating interleaved pooling layers. Recently, however, interleaved pooling has largely been abandoned. This raises a number of questions: Are our intuitions about deformation stability right at all? Is it important? Is pooling necessary for deformation invariance? If not, how is deformation invariance achieved in its absence? In this work, we rigorously test these questions, and find that deformation stability in convolutional networks is more nuanced than it first appears: (1) Deformation invariance is not a binary property; rather, different tasks require different degrees of deformation stability at different layers. (2) Deformation stability is not a fixed property of a network and is heavily adjusted over the course of training, largely through the smoothness of the convolutional filters. (3) Interleaved pooling layers are neither necessary nor sufficient for achieving the optimal form of deformation stability for natural image classification. (4) Pooling confers too much deformation stability for image classification at initialization, and during training, networks have to learn to counteract this inductive bias. Together, these findings provide new insights into the role of interleaved pooling and deformation invariance in CNNs, and demonstrate the importance of rigorous empirical testing of even our most basic assumptions about the working of neural networks.

* NIPS 2018 submission

On the importance of single directions for generalization

May 22, 2018

Ari S. Morcos, David G. T. Barrett, Neil C. Rabinowitz, Matthew Botvinick

Abstract: Despite their ability to memorize large datasets, deep neural networks often achieve good generalization performance. However, the differences between the learned solutions of networks which generalize and those which do not remain unclear. Additionally, the tuning properties of single directions (defined as the activation of a single unit or some linear combination of units in response to some input) have been highlighted, but their importance has not been evaluated. Here, we connect these lines of inquiry to demonstrate that a network's reliance on single directions is a good predictor of its generalization performance, across networks trained on datasets with different fractions of corrupted labels, across ensembles of networks trained on datasets with unmodified labels, across different hyperparameters, and over the course of training. While dropout only regularizes this quantity up to a point, batch normalization implicitly discourages single direction reliance, in part by decreasing the class selectivity of individual units. Finally, we find that class selectivity is a poor predictor of task importance, suggesting not only that networks which generalize well minimize their dependence on individual units by reducing their selectivity, but also that individually selective units may not be necessary for strong network performance.

* ICLR 2018 conference paper; added additional methodological details
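
The abstract refers to the "class selectivity" of individual units without defining it. The sketch below uses one commonly cited form of a class selectivity index, (mu_max - mu_rest) / (mu_max + mu_rest), where mu_max is the unit's highest class-conditional mean activation and mu_rest is the mean over the remaining classes. Treat this formula as an assumption about the metric rather than a statement of the paper's exact definition; the function name, the `eps` term, and the toy data are hypothetical, and non-negative (for example post-ReLU) activations are assumed.

```python
import numpy as np

def class_selectivity(activations, labels, eps=1e-12):
    """Selectivity index of one unit from its per-example activations.

    Assumes non-negative activations and at least two classes.
    Values near 0 mean equal average response to every class;
    values near 1 mean the unit responds to a single class only.
    """
    classes = np.unique(labels)
    class_means = np.array([activations[labels == c].mean() for c in classes])
    top = class_means.argmax()
    mu_max = class_means[top]
    mu_rest = np.delete(class_means, top).mean()
    # eps guards against division by zero for a silent unit.
    return (mu_max - mu_rest) / (mu_max + mu_rest + eps)

labels = np.array([0, 0, 1, 1, 2, 2])
selective_unit = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # fires for class 0 only
uniform_unit = np.ones(6)                                   # fires equally for all

high = class_selectivity(selective_unit, labels)
low = class_selectivity(uniform_unit, labels)
```

Here `high` is approximately 1 and `low` is approximately 0, matching the two extremes of the index.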
