Christopher Ré

Department of Computer Science, Stanford University

On the Downstream Performance of Compressed Word Embeddings

Sep 03, 2019

Avner May, Jian Zhang, Tri Dao, Christopher Ré

Abstract: Compressing word embeddings is important for deploying NLP models in memory-constrained settings. However, understanding what makes compressed embeddings perform well on downstream tasks is challenging: existing measures of compression quality often fail to distinguish between embeddings that perform well and those that do not. We thus propose the eigenspace overlap score as a new measure. We relate the eigenspace overlap score to downstream performance by developing generalization bounds for the compressed embeddings in terms of this score, in the context of linear and logistic regression. We then show that we can lower bound the eigenspace overlap score for a simple uniform quantization compression method, helping to explain the strong empirical performance of this method. Finally, we show that by using the eigenspace overlap score as a selection criterion between embeddings drawn from a representative set we compressed, we can efficiently identify the better performing embedding with up to $2\times$ lower selection error rates than the next best measure of compression quality, and avoid the cost of training a model for each task of interest.
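
The abstract names the eigenspace overlap score and uniform quantization but does not define either. The sketch below assumes the score is $\|U^\top \tilde{U}\|_F^2 / \max(d, k)$, where $U$ and $\tilde{U}$ hold the left singular vectors of the original ($n \times d$) and compressed ($n \times k$) embedding matrices, and it assumes both matrices have full column rank. The quantizer, its `clip` parameter, and the function names are illustrative choices, not the authors' implementation.

```python
import numpy as np

def eigenspace_overlap_score(X, X_tilde):
    """Assumed definition: ||U^T U_tilde||_F^2 / max(d, k).

    U and U_tilde are the left singular vectors of X and X_tilde.
    Both inputs are assumed to have full column rank.
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_tilde, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
    overlap = np.linalg.norm(U.T @ U_tilde, ord="fro") ** 2
    return overlap / max(X.shape[1], X_tilde.shape[1])

def uniform_quantize(X, bits, clip):
    """Clip to [-clip, clip], then round to 2**bits evenly spaced levels."""
    levels = 2 ** bits
    step = 2.0 * clip / (levels - 1)
    clipped = np.clip(X, -clip, clip)
    return np.round((clipped + clip) / step) * step - clip

# Synthetic stand-in for an embedding matrix (200 words, 16 dimensions).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
X_q = uniform_quantize(X, bits=2, clip=2.0)

score_self = eigenspace_overlap_score(X, X)   # identical subspaces
score_q = eigenspace_overlap_score(X, X_q)    # original vs. 2-bit version
```

Under this definition the score lies in [0, 1] and equals 1 when the two matrices span the same column space, which is why it can serve as a selection criterion without training a downstream model.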

* NeurIPS 2019 (Conference on Neural Information Processing Systems)

SysML: The New Frontier of Machine Learning Systems

May 01, 2019

Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung (+59 more)

Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, SysML, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.

Low-Memory Neural Network Training: A Technical Report

Apr 24, 2019

Nimit Sharad Sohoni, Christopher Richard Aberger, Megan Leszczynski, Jian Zhang, Christopher Ré

Abstract: Memory is increasingly often the bottleneck when training neural network models. Despite this, techniques to lower the overall memory requirements of training have been less widely studied compared to the extensive literature on reducing the memory requirements of inference. In this paper we study a fundamental question: How much memory is actually needed to train a neural network? To answer this question, we profile the overall memory usage of training on two representative deep learning benchmarks -- the WideResNet model for image classification and the DynamicConv Transformer model for machine translation -- and comprehensively evaluate four standard techniques for reducing the training memory requirements: (1) imposing sparsity on the model, (2) using low precision, (3) microbatching, and (4) gradient checkpointing. We explore how each of these techniques in isolation affects both the peak memory usage of training and the quality of the end model, and explore the memory, accuracy, and computation tradeoffs incurred when combining these techniques. Using appropriate combinations of these techniques, we show that it is possible to reduce the memory required to train a WideResNet-28-2 on CIFAR-10 by up to 60.7x with a 0.4% loss in accuracy, and reduce the memory required to train a DynamicConv model on IWSLT'14 German to English translation by up to 8.7x with a BLEU score drop of 0.15.
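
Of the four techniques listed, microbatching is the easiest to illustrate in a few lines. The sketch below is a toy example on least-squares regression, not the paper's experimental setup: it splits a batch into smaller chunks, accumulates each chunk's gradient, and recovers the full-batch gradient. All function names are hypothetical.

```python
import numpy as np

def mse_gradient(w, X, y, total_examples):
    """Gradient of (1/total_examples) * sum((X @ w - y)**2) for the rows given."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / total_examples

def microbatched_gradient(w, X, y, microbatch_size):
    """Accumulate the gradient chunk by chunk instead of over the whole batch."""
    n = X.shape[0]
    accumulated = np.zeros_like(w)
    for start in range(0, n, microbatch_size):
        stop = start + microbatch_size
        # Only one microbatch of intermediate values is needed at a time.
        accumulated += mse_gradient(w, X[start:stop], y[start:stop], n)
    return accumulated

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 5))
y = rng.standard_normal(64)
w = rng.standard_normal(5)

full = mse_gradient(w, X, y, X.shape[0])
micro = microbatched_gradient(w, X, y, microbatch_size=8)
```

The two gradients agree up to floating-point error because the loss is a sum over examples. In a neural network the memory saving comes from holding activations for only one microbatch at a time; layers that compute statistics across the batch, such as batch normalization, would break this exact equivalence.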

Medical device surveillance with electronic health records

Apr 03, 2019

Alison Callahan, Jason A Fries, Christopher Ré, James I Huddleston III, Nicholas J Giori, Scott Delp, Nigam H Shah

Abstract: Post-market medical device surveillance is a challenge facing manufacturers, regulatory agencies, and health care providers. Electronic health records are valuable sources of real world evidence to assess device safety and track device-related patient outcomes over time. However, distilling this evidence remains challenging, as information is fractured across clinical notes and structured records. Modern machine learning methods for machine reading promise to unlock increasingly complex information from text, but face barriers due to their reliance on large and expensive hand-labeled training sets. To address these challenges, we developed and validated state-of-the-art deep learning methods that identify patient outcomes from clinical notes without requiring hand-labeled training data. Using hip replacements as a test case, our methods accurately extracted implant details and reports of complications and pain from electronic health records with up to 96.3% precision, 98.5% recall, and 97.4% F1, improved classification performance by 12.7-53.0% over rule-based methods, and detected over 6 times as many complication events compared to using structured data alone. Using these events to assess complication-free survivorship of different implant systems, we found significant variation between implants, including for risk of revision surgery, which could not be detected using coded data alone. Patients with revision surgeries had more hip pain mentions in the post-hip replacement, pre-revision period compared to patients with no evidence of revision surgery (mean hip pain mentions 4.97 vs. 3.23; t = 5.14; p < 0.001). Some implant models were associated with higher or lower rates of hip pain mentions. Our methods complement existing surveillance mechanisms by requiring orders of magnitude less hand-labeled training data, offering a scalable solution for national medical device surveillance.

Cross-Modal Data Programming Enables Rapid Medical Machine Learning

Mar 26, 2019

Jared Dunnmon, Alexander Ratner, Nishith Khandwala, Khaled Saab, Matthew Markert, Hersh Sagreiya, Roger Goldman, Christopher Lee-Messer, Matthew Lungren, Daniel Rubin (+1 more)

Abstract: Labeling training datasets has become a key barrier to building medical machine learning models. One strategy is to generate training labels programmatically, for example by applying natural language processing pipelines to text reports associated with imaging studies. We propose cross-modal data programming, which generalizes this intuitive strategy in a theoretically-grounded way that enables simpler, clinician-driven input, reduces required labeling time, and improves with additional unlabeled data. In this approach, clinicians generate training labels for models defined over a target modality (e.g. images or time series) by writing rules over an auxiliary modality (e.g. text reports). The resulting technical challenge consists of estimating the accuracies and correlations of these rules; we extend a recent unsupervised generative modeling technique to handle this cross-modal setting in a provably consistent way. Across four applications in radiography, computed tomography, and electroencephalography, and using only several hours of clinician time, our approach matches or exceeds the efficacy of physician-months of hand-labeling with statistical significance, demonstrating a fundamentally faster and more flexible way of building machine learning models in medicine.

Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations

Mar 14, 2019

Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, Christopher Ré

Abstract: Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions. All of these transforms can be represented by dense matrix-vector multiplication, yet each has a specialized and highly efficient (subquadratic) algorithm. We ask to what extent hand-crafting these algorithms and implementations is necessary, what structural priors they encode, and how much knowledge is required to automatically learn a fast algorithm for a provided structured transform. Motivated by a characterization of fast matrix-vector multiplication as products of sparse matrices, we introduce a parameterization of divide-and-conquer methods that is capable of representing a large class of transforms. This generic formulation can automatically learn an efficient algorithm for many important transforms; for example, it recovers the $O(N \log N)$ Cooley-Tukey FFT algorithm to machine precision, for dimensions $N$ up to $1024$. Furthermore, our method can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations. On a standard task of compressing a single hidden-layer network, our method exceeds the classification accuracy of unconstrained matrices on CIFAR-10 by 3.9 points (the first time a structured approach has done so), with 4X faster inference speed and 40X fewer parameters.
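
To make "products of sparse matrices" concrete, the sketch below builds the DFT matrix by hand from the standard Cooley-Tukey recursion: a butterfly factor with two nonzeros per row, a block-diagonal matrix of two half-size DFTs, and an even/odd permutation. This is the fixed textbook factorization, not the learned parameterization from the paper, and the function names are hypothetical.

```python
import numpy as np

def butterfly_factor(n):
    """Sparse n x n factor [[I, D], [I, -D]] with D = diag(exp(-2*pi*i*k/n))."""
    half = n // 2
    twiddle = np.exp(-2j * np.pi * np.arange(half) / n)
    eye = np.eye(half)
    d = np.diag(twiddle)
    return np.block([[eye, d], [eye, -d]])

def even_odd_permutation(n):
    """Permutation matrix that moves even-indexed entries ahead of odd ones."""
    perm = np.concatenate([np.arange(0, n, 2), np.arange(1, n, 2)])
    return np.eye(n)[perm]

def dft_via_butterflies(n):
    """DFT matrix for n a power of two, built as a product of sparse factors."""
    if n == 1:
        return np.ones((1, 1), dtype=complex)
    half_dft = dft_via_butterflies(n // 2)
    zeros = np.zeros_like(half_dft)
    block_diag = np.block([[half_dft, zeros], [zeros, half_dft]])
    return butterfly_factor(n) @ block_diag @ even_odd_permutation(n)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
result = dft_via_butterflies(8) @ x
reference = np.fft.fft(x)
```

The product is multiplied out densely here only for checking against `np.fft.fft`. Applying the roughly $\log_2 N$ sparse factors one after another, each with $O(N)$ nonzeros, is what gives the $O(N \log N)$ cost mentioned in the abstract.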

Learning Dependency Structures for Weak Supervision Models

Mar 14, 2019

Paroma Varma, Frederic Sala, Ann He, Alexander Ratner, Christopher Ré

Abstract: Labeling training data is a key bottleneck in the modern machine learning pipeline. Recent weak supervision approaches combine labels from multiple noisy sources by estimating their accuracies without access to ground truth labels; however, estimating the dependencies among these sources is a critical challenge. We focus on a robust PCA-based algorithm for learning these dependency structures, establish improved theoretical recovery rates, and outperform existing methods on various real-world tasks. Under certain conditions, we show that the amount of unlabeled data needed can scale sublinearly or even logarithmically with the number of sources $m$, improving over previous efforts that ignore the sparsity pattern in the dependency structure and scale linearly in $m$. We provide an information-theoretic lower bound on the minimum sample complexity of the weak supervision setting. Our method outperforms weak supervision approaches that assume conditionally-independent sources by up to 4.64 F1 points and previous structure learning approaches by up to 4.41 F1 points on real-world relation extraction and image classification tasks.

Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale

Dec 02, 2018

Stephen H. Bach, Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia, Souvik Sen, Alexander Ratner, Braden Hancock, Houman Alborzi (+3 more)

Abstract: Labeling training data is one of the most costly bottlenecks in developing or modifying machine learning-based applications. We survey how resources from across an organization can be used as weak supervision sources for three classification tasks at Google, in order to bring development time and cost down by an order of magnitude. We build on the Snorkel framework, extending it as a new system, Snorkel DryBell, which integrates with Google's distributed production systems and enables engineers to develop and execute weak supervision strategies over millions of examples in less than thirty minutes. We find that Snorkel DryBell creates classifiers of comparable quality to ones trained using up to tens of thousands of hand-labeled examples, in part by leveraging organizational resources not servable in production which contribute an average 52% performance improvement to the weakly supervised classifiers.

Low-Precision Random Fourier Features for Memory-Constrained Kernel Approximation

Oct 31, 2018

Jian Zhang, Avner May, Tri Dao, Christopher Ré

Abstract: We investigate how to train kernel approximation methods that generalize well under a memory budget. Building on recent theoretical work, we define a measure of kernel approximation error which we find to be much more predictive of the empirical generalization performance of kernel approximation methods than conventional metrics. An important consequence of this definition is that a kernel approximation matrix must be high-rank to attain close approximation. Because storing a high-rank approximation is memory-intensive, we propose using a low-precision quantization of random Fourier features (LP-RFFs) to build a high-rank approximation under a memory budget. Theoretically, we show quantization has a negligible effect on generalization performance in important settings. Empirically, we demonstrate across four benchmark datasets that LP-RFFs can match the performance of full-precision RFFs and the Nyström method, with 3x-10x and 50x-460x less memory, respectively.
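
As a rough illustration of the idea, the sketch below computes standard random Fourier features for a Gaussian kernel and then rounds each feature to a small number of evenly spaced levels. The abstract does not describe the paper's quantization scheme, so the deterministic nearest-level rounding, the bit width, and every function name here are assumptions made for illustration.

```python
import numpy as np

def random_fourier_features(X, num_features, sigma, rng):
    """Features z(x) with z(x) @ z(y) approximating exp(-||x - y||^2 / (2 sigma^2))."""
    d = X.shape[1]
    W = rng.standard_normal((d, num_features)) / sigma
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

def quantize_features(Z, bits):
    """Round each feature to the nearest of 2**bits levels spanning its range."""
    bound = np.sqrt(2.0 / Z.shape[1])   # every feature lies in [-bound, bound]
    step = 2.0 * bound / (2 ** bits - 1)
    return np.round((Z + bound) / step) * step - bound

def gaussian_kernel(X, sigma):
    """Exact Gaussian kernel matrix, used only as a reference."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
Z = random_fourier_features(X, num_features=4096, sigma=1.5, rng=rng)
Z_q = quantize_features(Z, bits=4)

K_exact = gaussian_kernel(X, sigma=1.5)
K_full = Z @ Z.T        # full-precision approximation
K_low = Z_q @ Z_q.T     # approximation from 4-bit features
```

Storing 4 bits per feature instead of 32 allows about eight times as many features in the same memory, which is the trade the abstract describes: more, coarser features to obtain a higher-rank approximation under a fixed budget.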

Training Classifiers with Natural Language Explanations

Aug 25, 2018

Braden Hancock, Paroma Varma, Stephanie Wang, Martin Bringmann, Percy Liang, Christopher Ré

Abstract: Training accurate classifiers requires many labels, but each label provides only limited information (one bit for binary classification). In this work, we propose BabbleLabble, a framework for training classifiers in which an annotator provides a natural language explanation for each labeling decision. A semantic parser converts these explanations into programmatic labeling functions that generate noisy labels for an arbitrary amount of unlabeled data, which is used to train a classifier. On three relation extraction tasks, we find that users are able to train classifiers with comparable F1 scores from 5-100$\times$ faster by providing explanations instead of just labels. Furthermore, given the inherent imperfection of labeling functions, we find that a simple rule-based semantic parser suffices.
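
A minimal sketch of what "programmatic labeling functions" can look like is shown below for a hypothetical spouse-relation task. The three rules are written by hand rather than parsed from explanations, and the votes are combined by simple majority, which is a stand-in for, not a description of, the label aggregation used in the paper. All names and example sentences are invented for illustration.

```python
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote

def lf_spouse_word(sentence):
    """Vote True if the sentence mentions 'husband' or 'wife'."""
    text = sentence.lower()
    return True if ("husband" in text or "wife" in text) else ABSTAIN

def lf_married(sentence):
    """Vote True if the sentence contains the word 'married'."""
    return True if "married" in sentence.lower() else ABSTAIN

def lf_sibling(sentence):
    """Vote False if the sentence mentions a sibling relation."""
    text = sentence.lower()
    return False if ("brother" in text or "sister" in text) else ABSTAIN

LABELING_FUNCTIONS = [lf_spouse_word, lf_married, lf_sibling]

def majority_vote(sentence):
    """Combine the non-abstaining votes; abstain on ties or when nobody votes."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

sentences = [
    "Ann married Bob in 2010; her husband is a chef.",
    "Ann and her brother Bob run a bakery.",
    "Ann met Bob at a conference.",
    "Ann's husband is the brother of Bob.",
]
noisy_labels = [majority_vote(s) for s in sentences]
```

The resulting labels are noisy and incomplete by design; the point of the framework is that a classifier trained on many such labels can still perform well.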

* ACL 2018; v4 adds references and link to code
