Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Stefano Ermon

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Jun 10, 2022

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso(+435 more)

Abstract:Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

* 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

Via

Access Paper or Ask Questions

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

May 27, 2022

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

Figure 1 for FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Figure 2 for FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Figure 3 for FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Figure 4 for FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Abstract:Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3$\times$ speedup on GPT-2 (seq. length 1K), and 2.4$\times$ speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).

Via

Access Paper or Ask Questions

Training and Inference on Any-Order Autoregressive Models the Right Way

May 26, 2022

Andy Shih, Dorsa Sadigh, Stefano Ermon

Figure 1 for Training and Inference on Any-Order Autoregressive Models the Right Way

Figure 2 for Training and Inference on Any-Order Autoregressive Models the Right Way

Figure 3 for Training and Inference on Any-Order Autoregressive Models the Right Way

Figure 4 for Training and Inference on Any-Order Autoregressive Models the Right Way

Abstract:Conditional inference on arbitrary subsets of variables is a core problem in probabilistic inference with important applications such as masked language modeling and image inpainting. In recent years, the family of Any-Order Autoregressive Models (AO-ARMs) -- which includes popular models such as XLNet -- has shown breakthrough performance in arbitrary conditional tasks across a sweeping range of domains. But, in spite of their success, in this paper we identify significant improvements to be made to previous formulations of AO-ARMs. First, we show that AO-ARMs suffer from redundancy in their probabilistic model, i.e., they define the same distribution in multiple different ways. We alleviate this redundancy by training on a smaller set of univariate conditionals that still maintains support for efficient arbitrary conditional inference. Second, we upweight the training loss for univariate conditionals that are evaluated more frequently during inference. Our method leads to improved performance with no compromises on tractability, giving state-of-the-art likelihoods in arbitrary conditional modeling on text (Text8), image (CIFAR10, ImageNet32), and continuous tabular data domains.

Via

Access Paper or Ask Questions

Self-Similarity Priors: Neural Collages as Differentiable Fractal Representations

Apr 15, 2022

Michael Poli, Winnie Xu, Stefano Massaroli, Chenlin Meng, Kuno Kim, Stefano Ermon

Abstract:Many patterns in nature exhibit self-similarity: they can be compactly described via self-referential transformations. Said patterns commonly appear in natural and artificial objects, such as molecules, shorelines, galaxies and even images. In this work, we investigate the role of learning in the automated discovery of self-similarity and in its utilization for downstream tasks. To this end, we design a novel class of implicit operators, Neural Collages, which (1) represent data as the parameters of a self-referential, structured transformation, and (2) employ hypernetworks to amortize the cost of finding these parameters to a single forward pass. We investigate how to leverage the representations produced by Neural Collages in various tasks, including data compression and generation. Neural Collages image compressors are orders of magnitude faster than other self-similarity-based algorithms during encoding and offer compression rates competitive with implicit methods. Finally, we showcase applications of Neural Collages for fractal art and as deep generative models.

Via

Access Paper or Ask Questions

Tracking Urbanization in Developing Regions with Remote Sensing Spatial-Temporal Super-Resolution

Apr 04, 2022

Yutong He, William Zhang, Chenlin Meng, Marshall Burke, David B. Lobell, Stefano Ermon

Figure 1 for Tracking Urbanization in Developing Regions with Remote Sensing Spatial-Temporal Super-Resolution

Figure 2 for Tracking Urbanization in Developing Regions with Remote Sensing Spatial-Temporal Super-Resolution

Figure 3 for Tracking Urbanization in Developing Regions with Remote Sensing Spatial-Temporal Super-Resolution

Abstract:Automated tracking of urban development in areas where construction information is not available became possible with recent advancements in machine learning and remote sensing. Unfortunately, these solutions perform best on high-resolution imagery, which is expensive to acquire and infrequently available, making it difficult to scale over long time spans and across large geographies. In this work, we propose a pipeline that leverages a single high-resolution image and a time series of publicly available low-resolution images to generate accurate high-resolution time series for object tracking in urban construction. Our method achieves significant improvement in comparison to baselines using single image super-resolution, and can assist in extending the accessibility and scalability of building construction tracking across the developing world.

* Presented at Workshop on Machine Learning for the Developing World (ML4D) at the 35th Conference on Neural Information Processing Systems (NeurIPS) 2021

Via

Access Paper or Ask Questions

Generative Modeling Helps Weak Supervision (and Vice Versa)

Mar 22, 2022

Benedikt Boecking, Willie Neiswanger, Nicholas Roberts, Stefano Ermon, Frederic Sala, Artur Dubrawski

Figure 1 for Generative Modeling Helps Weak Supervision (and Vice Versa)

Figure 2 for Generative Modeling Helps Weak Supervision (and Vice Versa)

Figure 3 for Generative Modeling Helps Weak Supervision (and Vice Versa)

Figure 4 for Generative Modeling Helps Weak Supervision (and Vice Versa)

Abstract:Many promising applications of supervised machine learning face hurdles in the acquisition of labeled data in sufficient quantity and quality, creating an expensive bottleneck. To overcome such limitations, techniques that do not depend on ground truth labels have been developed, including weak supervision and generative modeling. While these techniques would seem to be usable in concert, improving one another, how to build an interface between them is not well-understood. In this work, we propose a model fusing weak supervision and generative adversarial networks. It captures discrete variables in the data alongside the weak supervision derived label estimate. Their alignment allows for better modeling of sample-dependent accuracies of the weak supervision sources, improving the unobserved ground truth estimate. It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels. Additionally, its learned discrete variables can be inspected qualitatively. The model outperforms baseline weak supervision label models on a number of multiclass classification datasets, improves the quality of generated images, and further improves end-model performance through data augmentation with synthetic samples.

Via

Access Paper or Ask Questions

Dual Diffusion Implicit Bridges for Image-to-Image Translation

Mar 16, 2022

Xuan Su, Jiaming Song, Chenlin Meng, Stefano Ermon

Figure 1 for Dual Diffusion Implicit Bridges for Image-to-Image Translation

Figure 2 for Dual Diffusion Implicit Bridges for Image-to-Image Translation

Figure 3 for Dual Diffusion Implicit Bridges for Image-to-Image Translation

Figure 4 for Dual Diffusion Implicit Bridges for Image-to-Image Translation

Abstract:Common image-to-image translation methods rely on joint training over data from both source and target domains. This excludes cases where domain data is private (e.g., in a federated setting), and often means that a new model has to be trained for a new pair of domains. We present Dual Diffusion Implicit Bridges (DDIBs), an image translation method based on diffusion models, that circumvents training on domain pairs. DDIBs allow translations between arbitrary pairs of source-target domains, given independently trained diffusion models on the respective domains. Image translation with DDIBs is a two-step process: DDIBs first obtain latent encodings for source images with the source diffusion model, and next decode such encodings using the target model to construct target images. Moreover, DDIBs enable cycle-consistency by default and is theoretically connected to optimal transport. Experimentally, we apply DDIBs on a variety of synthetic and high-resolution image datasets, demonstrating their utility in example-guided color transfer, image-to-image translation as well as their connections to optimal transport methods.

Via

Access Paper or Ask Questions

GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation

Mar 06, 2022

Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, Jian Tang

Figure 1 for GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation

Figure 2 for GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation

Figure 3 for GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation

Figure 4 for GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation

Abstract:Predicting molecular conformations from molecular graphs is a fundamental problem in cheminformatics and drug discovery. Recently, significant progress has been achieved with machine learning approaches, especially with deep generative models. Inspired by the diffusion process in classical non-equilibrium thermodynamics where heated particles will diffuse from original states to a noise distribution, in this paper, we propose a novel generative model named GeoDiff for molecular conformation prediction. GeoDiff treats each atom as a particle and learns to directly reverse the diffusion process (i.e., transforming from a noise distribution to stable conformations) as a Markov chain. Modeling such a generation process is however very challenging as the likelihood of conformations should be roto-translational invariant. We theoretically show that Markov chains evolving with equivariant Markov kernels can induce an invariant distribution by design, and further propose building blocks for the Markov kernels to preserve the desirable equivariance property. The whole framework can be efficiently trained in an end-to-end fashion by optimizing a weighted variational lower bound to the (conditional) likelihood. Experiments on multiple benchmarks show that GeoDiff is superior or comparable to existing state-of-the-art approaches, especially on large molecules.

* Published as a conference paper at ICLR 2022 (https://openreview.net/forum?id=PzcvxEMzvQC)

Via

Access Paper or Ask Questions

LISA: Learning Interpretable Skill Abstractions from Language

Feb 28, 2022

Divyansh Garg, Skanda Vaidyanath, Kuno Kim, Jiaming Song, Stefano Ermon

Figure 1 for LISA: Learning Interpretable Skill Abstractions from Language

Figure 2 for LISA: Learning Interpretable Skill Abstractions from Language

Figure 3 for LISA: Learning Interpretable Skill Abstractions from Language

Figure 4 for LISA: Learning Interpretable Skill Abstractions from Language

Abstract:Learning policies that effectually utilize language instructions in complex, multi-task environments is an important problem in imitation learning. While it is possible to condition on the entire language instruction directly, such an approach could suffer from generalization issues. To encode complex instructions into skills that can generalize to unseen instructions, we propose Learning Interpretable Skill Abstractions (LISA), a hierarchical imitation learning framework that can learn diverse, interpretable skills from language-conditioned demonstrations. LISA uses vector quantization to learn discrete skill codes that are highly correlated with language instructions and the behavior of the learned policy. In navigation and robotic manipulation environments, LISA is able to outperform a strong non-hierarchical baseline in the low data regime and compose learned skills to solve tasks containing unseen long-range instructions. Our method demonstrates a more natural way to condition on language in sequential decision-making problems and achieve interpretable and controllable behavior with the learned skills.

Via

Access Paper or Ask Questions

Denoising Diffusion Restoration Models

Feb 04, 2022

Bahjat Kawar, Michael Elad, Stefano Ermon, Jiaming Song

Figure 1 for Denoising Diffusion Restoration Models

Figure 2 for Denoising Diffusion Restoration Models

Figure 3 for Denoising Diffusion Restoration Models

Figure 4 for Denoising Diffusion Restoration Models

Abstract:Many interesting tasks in image restoration can be cast as linear inverse problems. A recent family of approaches for solving these problems uses stochastic algorithms that sample from the posterior distribution of natural images given the measurements. However, efficient solutions often require problem-specific supervised training to model the posterior, whereas unsupervised methods that are not problem-specific typically rely on inefficient iterative methods. This work addresses these issues by introducing Denoising Diffusion Restoration Models (DDRM), an efficient, unsupervised posterior sampling method. Motivated by variational inference, DDRM takes advantage of a pre-trained denoising diffusion generative model for solving any linear inverse problem. We demonstrate DDRM's versatility on several image datasets for super-resolution, deblurring, inpainting, and colorization under various amounts of measurement noise. DDRM outperforms the current leading unsupervised methods on the diverse ImageNet dataset in reconstruction quality, perceptual quality, and runtime, being 5x faster than the nearest competitor. DDRM also generalizes well for natural images out of the distribution of the observed ImageNet training set.

* Our code is available at https://github.com/bahjat-kawar/ddrm

Via

Access Paper or Ask Questions