Henryk Michalewski

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Nov 30, 2021
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, Augustus Odena

Large pre-trained language models perform remarkably well on tasks that can be done "in one pass", such as generating realistic text or synthesizing computer programs. However, they struggle with tasks that require unbounded multi-step computation, such as adding integers or executing programs. Surprisingly, we find that these same models are able to perform complex multi-step computations -- even in the few-shot regime -- when asked to perform the operation "step by step", showing the results of intermediate computations. In particular, we train transformers to perform multi-step computations by asking them to emit intermediate computation steps into a "scratchpad". On a series of increasingly complex tasks ranging from long addition to the execution of arbitrary programs, we show that scratchpads dramatically improve the ability of language models to perform multi-step computations.
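To make the scratchpad idea concrete, here is a minimal sketch of how a training target for long addition might be rendered with its intermediate steps. The text layout (`Input:`, `<scratch>`, `Target:`) and the function name are assumptions made for illustration; the abstract does not specify the format used in the paper.

```python
def addition_scratchpad(a, b):
    """Render a + b (non-negative ints) as a digit-by-digit trace plus the answer.

    The layout below is a hypothetical format, not the one from the paper.
    """
    xs, ys = str(a)[::-1], str(b)[::-1]  # least-significant digit first
    lines = [f"Input: {a} + {b}", "<scratch>"]
    carry, digits = 0, []
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        total = x + y + carry
        digits.append(str(total % 10))
        # Each line exposes one intermediate result the model would emit.
        lines.append(
            f"{x} + {y} + carry {carry} = {total} -> write {total % 10}, carry {total // 10}"
        )
        carry = total // 10
    if carry:
        digits.append(str(carry))
    answer = "".join(reversed(digits))
    lines += ["</scratch>", f"Target: {answer}"]
    return "\n".join(lines)


print(addition_scratchpad(29, 57))
```

A model trained on such strings would be asked to produce everything after the `Input:` line, so the final answer is conditioned on the emitted steps rather than computed in a single pass.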

Sparse is Enough in Scaling Transformers

Nov 24, 2021
Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva

Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study sparse variants for all layers in the Transformer and propose Scaling Transformers, a family of next-generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer as we scale up the model size. Surprisingly, the sparse layers are enough to obtain the same perplexity as the standard Transformer with the same number of parameters. We also integrate with prior sparsity approaches to attention and enable fast inference on long sequences even with limited memory. This results in performance competitive with the state of the art on long text summarization.

* NeurIPS 2021

Off-Policy Correction For Multi-Agent Reinforcement Learning

Nov 22, 2021
Michał Zawalski, Błażej Osiński, Henryk Michalewski, Piotr Miłoś

Multi-agent reinforcement learning (MARL) provides a framework for problems involving multiple interacting agents. Despite apparent similarity to the single-agent case, multi-agent problems are often harder to train and analyze theoretically. In this work, we propose MA-Trace, a new on-policy actor-critic algorithm, which extends V-Trace to the MARL setting. The key advantage of our algorithm is its high scalability in a multi-worker setting. To this end, MA-Trace utilizes importance sampling as an off-policy correction method, which allows distributing the computations with no impact on the quality of training. Furthermore, our algorithm is theoretically grounded: we prove a fixed-point theorem that guarantees convergence. We evaluate the algorithm extensively on the StarCraft Multi-Agent Challenge, a standard benchmark for multi-agent algorithms. MA-Trace achieves high performance on all its tasks and exceeds state-of-the-art results on some of them.
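The importance-sampling correction mentioned above can be illustrated with the single-agent V-trace value targets that MA-Trace is said to extend. This is a sketch of plain V-trace only, not of MA-Trace itself; the function name, argument names and default clipping thresholds are assumptions for illustration.

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute single-agent V-trace value targets for one trajectory.

    rewards[t], values[t] and ratios[t] refer to step t, where ratios[t] is the
    importance weight pi(a_t | x_t) / mu(a_t | x_t) between the learner policy
    pi and the behaviour policy mu. bootstrap_value is V of the state after
    the last step.
    """
    n = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    targets = [0.0] * n
    correction = 0.0  # holds v_{t+1} - V(x_{t+1}) while iterating backwards
    for t in reversed(range(n)):
        rho = min(rho_bar, ratios[t])  # clipped weight on the TD error
        c = min(c_bar, ratios[t])      # clipped weight on the propagated trace
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        correction = delta + gamma * c * correction
        targets[t] = values[t] + correction
    return targets


# With ratios equal to 1 the targets reduce to ordinary n-step returns.
print(vtrace_targets([1.0, 1.0], [0.0, 0.0], 0.0, [1.0, 1.0], gamma=1.0))
```

Clipping the ratios bounds the variance that arises when workers act with a slightly stale policy, which is what makes this kind of correction attractive in a distributed, multi-worker setup.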

Hierarchical Transformers Are More Efficient Language Models

Oct 26, 2021
Piotr Nawrot, Szymon Tworkowski, Michał Tyrolski, Łukasz Kaiser, Yuhuai Wu, Christian Szegedy, Henryk Michalewski

Transformer models yield impressive results on many NLP and sequence modeling tasks. Remarkably, Transformers can handle long sequences, which allows them to produce long coherent outputs: full paragraphs produced by GPT-3 or well-structured images produced by DALL-E. These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility. We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences. To verify this claim, we first study different ways to downsample and upsample activations in Transformers so as to make them hierarchical. We use the best performing upsampling and downsampling layers to create Hourglass, a hierarchical Transformer language model. Hourglass improves upon the Transformer baseline given the same amount of computation and can yield the same results as Transformers more efficiently. In particular, Hourglass sets a new state of the art for Transformer models on the ImageNet32 generation task and improves language modeling efficiency on the widely studied enwik8 benchmark.
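As a rough illustration of what downsampling and upsampling activations means, the sketch below shortens a sequence by average pooling, applies an inner computation on the shorter sequence, and expands the result back by repetition with a residual connection. Average pooling and repetition are assumed here as the simplest options; the abstract does not say which variants performed best, and all function names are hypothetical.

```python
def downsample(seq, k):
    """Average every k consecutive vectors; len(seq) must be divisible by k."""
    return [
        [sum(column) / k for column in zip(*seq[i:i + k])]
        for i in range(0, len(seq), k)
    ]


def upsample(short, k):
    """Repeat each vector k times to restore the original length."""
    return [list(vec) for vec in short for _ in range(k)]


def hierarchical_block(seq, k, inner=lambda s: s):
    """Shorten, run `inner` on the shorter sequence, expand, add a residual."""
    expanded = upsample(inner(downsample(seq, k)), k)
    return [[a + b for a, b in zip(x, u)] for x, u in zip(seq, expanded)]


tokens = [[1.0], [3.0], [5.0], [7.0]]
print(hierarchical_block(tokens, k=2))
```

The efficiency argument is that `inner` (in a real model, a stack of Transformer layers) runs on a sequence k times shorter. This sketch ignores causality: pooling mixes each position with later ones in its group, so an autoregressive model would need extra care to avoid leaking future tokens.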

Program Synthesis with Large Language Models

Aug 16, 2021
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton

This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.
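A figure such as "solutions to 59.6 percent of the problems" suggests counting a task as solved when at least one sampled program passes that task's tests. The sketch below shows that kind of check under the assumption that each task comes with assert-style test statements; the helper names and the toy task are hypothetical, and the abstract does not describe the evaluation harness.

```python
def passes_tests(program, tests):
    """Return True if `program` defines code that satisfies every test statement.

    Both arguments are source strings. Running untrusted generated code with
    exec is unsafe outside a sandbox, and this sketch has no timeout.
    """
    namespace = {}
    try:
        exec(program, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception:
        return False
    return True


def fraction_solved(samples_per_task, tests_per_task):
    """Share of tasks for which at least one sampled program passes its tests."""
    solved = sum(
        any(passes_tests(program, tests) for program in samples)
        for samples, tests in zip(samples_per_task, tests_per_task)
    )
    return solved / len(samples_per_task)


good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]
print(fraction_solved([[bad, good], [bad]], [tests, tests]))
```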

* Jacob and Augustus contributed equally

Measuring and Improving BERT's Mathematical Abilities by Predicting the Order of Reasoning

Jun 07, 2021
Piotr Piękos, Henryk Michalewski, Mateusz Malinowski

Imagine you are in a supermarket. You have two bananas in your basket and want to buy four apples. How many fruits do you have in total? This seemingly straightforward question can be challenging for data-driven language models, even if trained at scale. However, we would expect such generic language models to possess some mathematical abilities in addition to typical linguistic competence. Towards this goal, we investigate if a commonly used language model, BERT, possesses such mathematical abilities and, if so, to what degree. For that, we fine-tune BERT on a popular dataset for math word problems, AQuA-RAT, and conduct several tests to understand learned representations better. Since we teach models trained on natural language to do formal mathematics, we hypothesize that such models would benefit from training on semi-formal steps that explain how math results are derived. To better accommodate such training, we also propose new pretext tasks for learning mathematical rules. We call them (Neighbor) Reasoning Order Prediction (ROP or NROP). With this new model, we achieve significantly better outcomes than data-driven baselines and results on par with more tailored models. We also show how to reduce positional bias in such models.
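One plausible reading of the task names is a binary classification problem: given the steps of a rationale, decide whether two of them have been swapped, with the "Neighbor" variant restricted to adjacent steps. The sketch below builds such training examples under that assumption; the function name, the 50 percent swap rate and the label convention are illustrative choices, not details taken from the abstract.

```python
import random


def make_order_example(steps, rng, neighbors_only=False):
    """Return (steps, label): label 1 if two steps were swapped, else 0.

    Hypothetical data construction for an order-prediction pretext task.
    """
    steps = list(steps)
    if len(steps) < 2 or rng.random() < 0.5:
        return steps, 0  # keep the original order
    if neighbors_only:
        i = rng.randrange(len(steps) - 1)  # swap a pair of adjacent steps
        j = i + 1
    else:
        i, j = rng.sample(range(len(steps)), 2)  # swap any two distinct steps
    steps[i], steps[j] = steps[j], steps[i]
    return steps, 1


rationale = ["2 bananas are in the basket", "4 apples are added", "2 + 4 = 6"]
print(make_order_example(rationale, random.Random(0), neighbors_only=True))
```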

* The paper has been accepted to the ACL-IJCNLP 2021 conference

Q-Value Weighted Regression: Reinforcement Learning with Limited Data

Feb 12, 2021
Piotr Kozakowski, Łukasz Kaiser, Henryk Michalewski, Afroz Mohiuddin, Katarzyna Kańska

Sample efficiency and performance in the offline setting have emerged as significant challenges of deep reinforcement learning. We introduce Q-Value Weighted Regression (QWR), a simple RL algorithm that excels in these aspects. QWR is an extension of Advantage Weighted Regression (AWR), an off-policy actor-critic algorithm that performs very well on continuous control tasks, also in the offline setting, but has low sample efficiency and struggles with high-dimensional observation spaces. We perform an analysis of AWR that explains its shortcomings and use these insights to motivate QWR. We show experimentally that QWR matches the state-of-the-art algorithms on tasks with both continuous and discrete actions. In particular, QWR yields results on par with SAC on the MuJoCo suite and, with the same set of hyperparameters, yields results on par with a highly tuned Rainbow implementation on a set of Atari games. We also verify that QWR performs well in the offline RL setting.
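The weighting idea behind advantage-weighted regression can be sketched as follows: each action's log-likelihood term in the policy loss is scaled by a clipped exponential of its advantage. Computing the advantage from Q-values of sampled actions, as done below, is an assumption suggested by the name QWR rather than a detail stated in the abstract; the temperature and clipping constants are likewise hypothetical.

```python
import math


def estimate_state_value(q_of_sampled_actions):
    """Approximate V(s) as the mean Q-value over actions sampled from the policy."""
    return sum(q_of_sampled_actions) / len(q_of_sampled_actions)


def regression_weights(q_values, state_value, temperature=1.0, max_weight=20.0):
    """Clipped exponentiated advantages, one weight per action.

    An action whose Q-value equals the state value gets weight 1; better
    actions get larger weights, capped at max_weight for stability.
    """
    return [
        min(math.exp((q - state_value) / temperature), max_weight)
        for q in q_values
    ]


q_values = [1.0, 2.0, 3.0]
v = estimate_state_value(q_values)
print(regression_weights(q_values, v))
```

A policy trained by weighted regression would then maximize the sum of `weight * log pi(a | s)` over the batch, so actions judged better than average are imitated more strongly.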

CARLA Real Traffic Scenarios -- novel training ground and benchmark for autonomous driving

Dec 16, 2020
Błażej Osiński, Piotr Miłoś, Adam Jakubowski, Paweł Zięcina, Michał Martyniak, Christopher Galias, Antonia Breuer, Silviu Homoceanu, Henryk Michalewski

This work introduces interactive traffic scenarios in the CARLA simulator, which are based on real-world traffic. We concentrate on tactical tasks lasting several seconds, which are especially challenging for current control methods. The CARLA Real Traffic Scenarios (CRTS) is intended to be a training and testing ground for autonomous driving systems. To this end, we open-source the code under a permissive license and present a set of baseline policies. CRTS combines the realism of traffic scenarios and the flexibility of simulation. We use it to train agents using a reinforcement learning algorithm. We show how to obtain competitive policies and evaluate experimentally how observation types and reward schemes affect the training process and the resulting agent's behavior.

Neural heuristics for SAT solving

May 27, 2020
Sebastian Jaszczur, Michał Łuszczyk, Henryk Michalewski

We use neural graph networks with a message-passing architecture and an attention mechanism to enhance the branching heuristic in two SAT-solving algorithms. We report improvements of learned neural heuristics compared with two standard human-designed heuristics.
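To show where a branching heuristic sits inside a solver, here is a small DPLL-style search with the heuristic passed in as a function. DPLL is used only as a familiar example of a branching solver; the abstract does not name the two algorithms that were studied. The hand-written `most_frequent_literal` is a stand-in for the slot a learned model would fill, and all names are hypothetical.

```python
def simplify(clauses, literal):
    """Assume `literal` is true: drop satisfied clauses, shrink the others."""
    result = []
    for clause in clauses:
        if literal in clause:
            continue
        result.append([l for l in clause if l != -literal])
    return result


def dpll(clauses, choose_literal, assignment=()):
    """Return a list of true literals satisfying `clauses`, or None if unsatisfiable.

    Clauses are lists of non-zero ints; a negative int is a negated variable.
    """
    if not clauses:
        return list(assignment)
    if any(len(clause) == 0 for clause in clauses):
        return None
    for clause in clauses:  # unit propagation: a one-literal clause is forced
        if len(clause) == 1:
            unit = clause[0]
            return dpll(simplify(clauses, unit), choose_literal, assignment + (unit,))
    literal = choose_literal(clauses)  # the branching decision
    for choice in (literal, -literal):
        model = dpll(simplify(clauses, choice), choose_literal, assignment + (choice,))
        if model is not None:
            return model
    return None


def most_frequent_literal(clauses):
    """Simple hand-designed heuristic: branch on the most common literal."""
    counts = {}
    for clause in clauses:
        for literal in clause:
            counts[literal] = counts.get(literal, 0) + 1
    return max(counts, key=counts.get)


formula = [[1, 2], [-1, 2], [-2, 3]]
print(dpll(formula, most_frequent_literal))
```

Swapping `most_frequent_literal` for a function backed by a trained model changes only which literal is tried first; the search itself stays complete, so a poor heuristic costs time rather than correctness.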

Simulation-based reinforcement learning for real-world autonomous driving

Dec 26, 2019
Błażej Osiński, Adam Jakubowski, Piotr Miłoś, Paweł Zięcina, Christopher Galias, Silviu Homoceanu, Henryk Michalewski

We use synthetic data and a reinforcement learning algorithm to train a system controlling a full-size real-world vehicle in a number of restricted driving scenarios. The driving policy uses RGB images as input. We analyze how design decisions about perception, control and training impact the real-world performance.
