Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Daniel Y. Fu

Hyena Hierarchy: Towards Larger Convolutional Language Models

Mar 06, 2023

Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré

Figure 1 for Hyena Hierarchy: Towards Larger Convolutional Language Models

Figure 2 for Hyena Hierarchy: Towards Larger Convolutional Language Models

Figure 3 for Hyena Hierarchy: Towards Larger Convolutional Language Models

Figure 4 for Hyena Hierarchy: Towards Larger Convolutional Language Models

Abstract:Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K.

* Additional results (PG-19, LAMBADA)

Via

Access Paper or Ask Questions

Simple Hardware-Efficient Long Convolutions for Sequence Modeling

Feb 13, 2023

Daniel Y. Fu, Elliot L. Epstein, Eric Nguyen, Armin W. Thomas, Michael Zhang, Tri Dao, Atri Rudra, Christopher Ré

Abstract:State space models (SSMs) have high performance on long sequence modeling but require sophisticated initialization techniques and specialized implementations for high quality and runtime performance. We study whether a simple alternative can match SSMs in performance and efficiency: directly learning long convolutions over the sequence. We find that a key requirement to achieving high performance is keeping the convolution kernels smooth. We find that simple interventions--such as squashing the kernel weights--result in smooth kernels and recover SSM performance on a range of tasks including the long range arena, image classification, language modeling, and brain data modeling. Next, we develop FlashButterfly, an IO-aware algorithm to improve the runtime performance of long convolutions. FlashButterfly appeals to classic Butterfly decompositions of the convolution to reduce GPU memory IO and increase FLOP utilization. FlashButterfly speeds up convolutions by 2.2$\times$, and allows us to train on Path256, a challenging task with sequence length 64K, where we set state-of-the-art by 29.1 points while training 7.2$\times$ faster than prior work. Lastly, we introduce an extension to FlashButterfly that learns the coefficients of the Butterfly decomposition, increasing expressivity without increasing runtime. Using this extension, we outperform a Transformer on WikiText103 by 0.2 PPL with 30% fewer parameters.

Via

Access Paper or Ask Questions

Hungry Hungry Hippos: Towards Language Modeling with State Space Models

Dec 28, 2022

Tri Dao, Daniel Y. Fu, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré

Abstract:State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly linearly in sequence length instead of quadratically, SSMs are still slower than Transformers due to poor hardware utilization. In this paper, we make progress on understanding the expressivity gap between SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and attention. First, we use synthetic language modeling tasks to understand the gap between SSMs and attention. We find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. To understand the impact on language modeling, we propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. Next, to improve the efficiency of training SSMs on modern hardware, we propose FlashConv. FlashConv uses a fused block FFT algorithm to improve efficiency on sequences up to 8K, and introduces a novel state passing algorithm that exploits the recurrent properties of SSMs to scale to longer sequences. FlashConv yields 2$\times$ speedup on the long-range arena benchmark and allows hybrid language models to generate text 1.6$\times$ faster than Transformers. Using FlashConv, we scale hybrid H3-attention language models up to 1.3B parameters on the Pile and find promising initial results, achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark.

Via

Access Paper or Ask Questions

Lost in Transmission: On the Impact of Networking Corruptions on Video Machine Learning Models

Jun 10, 2022

Trenton Chang, Daniel Y. Fu

Figure 1 for Lost in Transmission: On the Impact of Networking Corruptions on Video Machine Learning Models

Figure 2 for Lost in Transmission: On the Impact of Networking Corruptions on Video Machine Learning Models

Figure 3 for Lost in Transmission: On the Impact of Networking Corruptions on Video Machine Learning Models

Figure 4 for Lost in Transmission: On the Impact of Networking Corruptions on Video Machine Learning Models

Abstract:We study how networking corruptions--data corruptions caused by networking errors--affect video machine learning (ML) models. We discover apparent networking corruptions in Kinetics-400, a benchmark video ML dataset. In a simulation study, we investigate (1) what artifacts networking corruptions cause, (2) how such artifacts affect ML models, and (3) whether standard robustness methods can mitigate their negative effects. We find that networking corruptions cause visual and temporal artifacts (i.e., smeared colors or frame drops). These networking corruptions degrade performance on a variety of video ML tasks, but effects vary by task and dataset, depending on how much temporal context the tasks require. Lastly, we evaluate data augmentation--a standard defense for data corruptions--but find that it does not recover performance.

* 12 pages, 12 figures (with supplemental: 34 pages)

Via

Access Paper or Ask Questions

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

May 27, 2022

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

Figure 1 for FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Figure 2 for FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Figure 3 for FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Figure 4 for FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Abstract:Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3$\times$ speedup on GPT-2 (seq. length 1K), and 2.4$\times$ speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).

Via

Access Paper or Ask Questions

TABi: Type-Aware Bi-Encoders for Open-Domain Entity Retrieval

Apr 18, 2022

Megan Leszczynski, Daniel Y. Fu, Mayee F. Chen, Christopher Ré

Figure 1 for TABi: Type-Aware Bi-Encoders for Open-Domain Entity Retrieval

Figure 2 for TABi: Type-Aware Bi-Encoders for Open-Domain Entity Retrieval

Figure 3 for TABi: Type-Aware Bi-Encoders for Open-Domain Entity Retrieval

Figure 4 for TABi: Type-Aware Bi-Encoders for Open-Domain Entity Retrieval

Abstract:Entity retrieval--retrieving information about entity mentions in a query--is a key step in open-domain tasks, such as question answering or fact checking. However, state-of-the-art entity retrievers struggle to retrieve rare entities for ambiguous mentions due to biases towards popular entities. Incorporating knowledge graph types during training could help overcome popularity biases, but there are several challenges: (1) existing type-based retrieval methods require mention boundaries as input, but open-domain tasks run on unstructured text, (2) type-based methods should not compromise overall performance, and (3) type-based methods should be robust to noisy and missing types. In this work, we introduce TABi, a method to jointly train bi-encoders on knowledge graph types and unstructured text for entity retrieval for open-domain tasks. TABi leverages a type-enforced contrastive loss to encourage entities and queries of similar types to be close in the embedding space. TABi improves retrieval of rare entities on the Ambiguous Entity Retrieval (AmbER) sets, while maintaining strong overall retrieval performance on open-domain tasks in the KILT benchmark compared to state-of-the-art retrievers. TABi is also robust to incomplete type systems, improving rare entity retrieval over baselines with only 5% type coverage of the training dataset. We make our code publicly available at https://github.com/HazyResearch/tabi.

* Accepted to Findings of ACL 2022

Via

Access Paper or Ask Questions

Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning

Apr 15, 2022

Mayee F. Chen, Daniel Y. Fu, Avanika Narayan, Michael Zhang, Zhao Song, Kayvon Fatahalian, Christopher Ré

Figure 1 for Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning

Figure 2 for Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning

Figure 3 for Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning

Figure 4 for Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning

Abstract:An ideal learned representation should display transferability and robustness. Supervised contrastive learning (SupCon) is a promising method for training accurate models, but produces representations that do not capture these properties due to class collapse -- when all points in a class map to the same representation. Recent work suggests that "spreading out" these representations improves them, but the precise mechanism is poorly understood. We argue that creating spread alone is insufficient for better representations, since spread is invariant to permutations within classes. Instead, both the correct degree of spread and a mechanism for breaking this invariance are necessary. We first prove that adding a weighted class-conditional InfoNCE loss to SupCon controls the degree of spread. Next, we study three mechanisms to break permutation invariance: using a constrained encoder, adding a class-conditional autoencoder, and using data augmentation. We show that the latter two encourage clustering of latent subclasses under more realistic conditions than the former. Using these insights, we show that adding a properly-weighted class-conditional InfoNCE loss and a class-conditional autoencoder to SupCon achieves 11.1 points of lift on coarse-to-fine transfer across 5 standard datasets and 4.7 points on worst-group robustness on 3 datasets, setting state-of-the-art on CelebA by 11.5 points.

Via

Access Paper or Ask Questions

Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision

Mar 24, 2022

Mayee F. Chen, Daniel Y. Fu, Dyah Adila, Michael Zhang, Frederic Sala, Kayvon Fatahalian, Christopher Ré

Figure 1 for Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision

Figure 2 for Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision

Figure 3 for Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision

Figure 4 for Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision

Abstract:Foundation models offer an exciting new paradigm for constructing models with out-of-the-box embeddings and a few labeled examples. However, it is not clear how to best apply foundation models without labeled data. A potential approach is to fuse foundation models with weak supervision frameworks, which use weak label sources -- pre-trained models, heuristics, crowd-workers -- to construct pseudolabels. The challenge is building a combination that best exploits the signal available in both foundation models and weak sources. We propose Liger, a combination that uses foundation model embeddings to improve two crucial elements of existing weak supervision techniques. First, we produce finer estimates of weak source quality by partitioning the embedding space and learning per-part source accuracies. Second, we improve source coverage by extending source votes in embedding space. Despite the black-box nature of foundation models, we prove results characterizing how our approach improves performance and show that lift scales with the smoothness of label distributions in embedding space. On six benchmark NLP and video tasks, Liger outperforms vanilla weak supervision by 14.1 points, weakly-supervised kNN and adapters by 11.8 points, and kNN and adapters supervised by traditional hand labels by 7.2 points.

Via

Access Paper or Ask Questions

Train and You'll Miss It: Interactive Model Iteration with Weak Supervision and Pre-Trained Embeddings

Jun 26, 2020

Mayee F. Chen, Daniel Y. Fu, Frederic Sala, Sen Wu, Ravi Teja Mullapudi, Fait Poms, Kayvon Fatahalian, Christopher Ré

Figure 1 for Train and You'll Miss It: Interactive Model Iteration with Weak Supervision and Pre-Trained Embeddings

Figure 2 for Train and You'll Miss It: Interactive Model Iteration with Weak Supervision and Pre-Trained Embeddings

Figure 3 for Train and You'll Miss It: Interactive Model Iteration with Weak Supervision and Pre-Trained Embeddings

Figure 4 for Train and You'll Miss It: Interactive Model Iteration with Weak Supervision and Pre-Trained Embeddings

Abstract:Our goal is to enable machine learning systems to be trained interactively. This requires models that perform well and train quickly, without large amounts of hand-labeled data. We take a step forward in this direction by borrowing from weak supervision (WS), wherein models can be trained with noisy sources of signal instead of hand-labeled data. But WS relies on training downstream deep networks to extrapolate to unseen data points, which can take hours or days. Pre-trained embeddings can remove this requirement. We do not use the embeddings as features as in transfer learning (TL), which requires fine-tuning for high performance, but instead use them to define a distance function on the data and extend WS source votes to nearby points. Theoretically, we provide a series of results studying how performance scales with changes in source coverage, source accuracy, and the Lipschitzness of label distributions in the embedding space, and compare this rate to standard WS without extension and TL without fine-tuning. On six benchmark NLP and video tasks, our method outperforms WS without extension by 4.1 points, TL without fine-tuning by 12.8 points, and traditionally-supervised deep networks by 13.1 points, and comes within 0.7 points of state-of-the-art weakly-supervised deep networks-all while training in less than half a second.

Via

Access Paper or Ask Questions

Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods

Feb 27, 2020

Daniel Y. Fu, Mayee F. Chen, Frederic Sala, Sarah M. Hooper, Kayvon Fatahalian, Christopher Ré

Figure 1 for Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods

Figure 2 for Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods

Figure 3 for Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods

Figure 4 for Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods

Abstract:Weak supervision is a popular method for building machine learning models without relying on ground truth annotations. Instead, it generates probabilistic training labels by estimating the accuracies of multiple noisy labeling sources (e.g., heuristics, crowd workers). Existing approaches use latent variable estimation to model the noisy sources, but these methods can be computationally expensive, scaling superlinearly in the data. In this work, we show that, for a class of latent variable models highly applicable to weak supervision, we can find a closed-form solution to model parameters, obviating the need for iterative solutions like stochastic gradient descent (SGD). We use this insight to build FlyingSquid, a weak supervision framework that runs orders of magnitude faster than previous weak supervision approaches and requires fewer assumptions. In particular, we prove bounds on generalization error without assuming that the latent variable model can exactly parameterize the underlying data distribution. Empirically, we validate FlyingSquid on benchmark weak supervision datasets and find that it achieves the same or higher quality compared to previous approaches without the need to tune an SGD procedure, recovers model parameters 170 times faster on average, and enables new video analysis and online learning applications.

Via

Access Paper or Ask Questions