Ilya Sutskever

Tony

Evaluating Large Language Models Trained on Code

Jul 14, 2021

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman (+48 more)

Abstract: We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.

* corrected typos, added references, added authors, added acknowledgements
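
The abstract reports solve rates under repeated sampling but does not spell out how such a rate is computed. A minimal sketch, assuming the combinatorial pass@k estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the unit tests. The function name and the example counts are illustrative and are not taken from the abstract.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct.

    n: total samples generated for a problem
    c: number of those samples that passed the tests
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains only failing samples.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


# Hypothetical counts: 4 samples, 1 of which passed, budget of 2.
print(pass_at_k(n=4, c=1, k=2))  # 0.5
```

Averaging this quantity over all problems in a benchmark gives a single score for a given k.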

Zero-Shot Text-to-Image Generation

Feb 26, 2021

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever

Abstract: Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
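
To make "a single stream of data" concrete, here is a small sketch of one way text and image tokens could be joined for next-token prediction. The vocabulary offset, the function names, and the token values are assumptions for illustration; the abstract does not describe the tokenizers or the exact layout.

```python
def build_token_stream(text_tokens, image_tokens, text_vocab_size):
    """Concatenate text and image tokens into one sequence.

    Image token ids are shifted by text_vocab_size (an assumed convention)
    so that the two vocabularies do not collide in the shared stream.
    """
    if any(t < 0 or t >= text_vocab_size for t in text_tokens):
        raise ValueError("text token outside the text vocabulary")
    if any(t < 0 for t in image_tokens):
        raise ValueError("image token ids must be non-negative")
    return list(text_tokens) + [text_vocab_size + t for t in image_tokens]


def next_token_pairs(stream):
    """Return (context, target) pairs for autoregressive training."""
    return [(stream[:i], stream[i]) for i in range(1, len(stream))]


# Hypothetical ids: two text tokens followed by two image tokens.
stream = build_token_stream([1, 2], [0, 5], text_vocab_size=10)
print(stream)  # [1, 2, 10, 15]
```

At sampling time the text tokens would be supplied as the prefix and the model would generate the image tokens one at a time.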

Learning Transferable Visual Models From Natural Language Supervision

Feb 26, 2021

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark (+2 more)

Abstract: State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
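
The pre-training task of "predicting which caption goes with which image" can be illustrated with a symmetric contrastive loss over a batch of paired embeddings. This is a rough pure-Python sketch, not the released implementation: the embeddings are toy vectors, the function names are hypothetical, and the temperature of 0.07 is an assumed value.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over image-to-text and text-to-image scores.

    Row i of image_embs is assumed to be paired with row i of text_embs.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity between every image and every caption in the batch.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # For each image, the correct caption is the one at the same index.
    loss_images = -sum(_log_softmax(logits[r])[r] for r in range(n)) / n
    # For each caption, the correct image is the one at the same index.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_texts = -sum(_log_softmax(columns[c])[c] for c in range(n)) / n
    return (loss_images + loss_texts) / 2


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # True
```

Correctly paired embeddings give a loss near zero, while mismatched pairs are penalized heavily, which is the signal that pushes matching images and captions together.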

Generative Language Modeling for Automated Theorem Proving

Sep 07, 2020

Stanislas Polu, Ilya Sutskever

Abstract: We explore the application of transformer-based language models to automated theorem proving. This work is motivated by the possibility that a major limitation of automated theorem provers compared to humans -- the generation of original mathematical terms -- might be addressable via generation from language models. We present an automated prover and proof assistant, GPT-f, for the Metamath formalization language, and analyze its performance. GPT-f found new short proofs that were accepted into the main Metamath library, which is, to our knowledge, the first time a deep-learning based system has contributed proofs that were adopted by a formal mathematics community.

* 15+5 pages

Language Models are Few-Shot Learners

Jun 05, 2020

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell (+21 more)

Abstract: Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

* 40+32 pages
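
The few-shot setting described above, where the task and its demonstrations are "specified purely via text", amounts to assembling a prompt string and asking the model to continue it. A minimal sketch follows; the "Input:" and "Output:" labels, the function name, and the translation examples are hypothetical formatting choices, not the prompt format used in the paper.

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Assemble a text prompt from a task description, worked examples,
    and a final query that the model is expected to complete.

    demonstrations: list of (input_text, output_text) pairs
    """
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final example is left open so the model fills in the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("house", "maison")],
    "cat",
)
print(prompt)
```

No model weights change in this setup; adapting to a new task only means changing the text passed in.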

Jukebox: A Generative Model for Music

Apr 30, 2020

Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever

Abstract: We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes. We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. We are releasing thousands of non cherry-picked samples at https://jukebox.openai.com, along with model weights and code at https://github.com/openai/jukebox.
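
Compressing audio "to discrete codes" relies on a vector-quantization step: each continuous latent vector is replaced by the index of its nearest codebook entry. The sketch below shows only that lookup, with a toy codebook and hypothetical names; the encoder, the decoder, and the multi-scale structure mentioned in the abstract are not modeled here.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook vector,
    using squared Euclidean distance."""
    codes = []
    for z in latents:
        distances = [sum((a - b) ** 2 for a, b in zip(z, entry))
                     for entry in codebook]
        codes.append(distances.index(min(distances)))
    return codes


def dequantize(codes, codebook):
    """Look up the codebook vector for each discrete code."""
    return [codebook[c] for c in codes]


# Toy two-entry codebook and two latent vectors (illustrative values).
codebook = [[0.0, 0.0], [1.0, 1.0]]
codes = quantize([[0.1, -0.1], [0.9, 1.2]], codebook)
print(codes)  # [0, 1]
```

The resulting integer sequence is what an autoregressive model can then be trained on, in place of raw audio samples.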

Dota 2 with Large Scale Deep Reinforcement Learning

Dec 13, 2019

OpenAI: Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer (+17 more)

Abstract: On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.

Deep Double Descent: Where Bigger Models and More Data Hurt

Dec 04, 2019

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever

Abstract: We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better. Moreover, we show that double descent occurs not just as a function of model size, but also as a function of the number of training epochs. We unify the above phenomena by defining a new complexity measure we call the effective model complexity and conjecture a generalized double descent with respect to this measure. Furthermore, our notion of model complexity allows us to identify certain regimes where increasing (even quadrupling) the number of train samples actually hurts test performance.

* G.K. and Y.B. contributed equally

Generating Long Sequences with Sparse Transformers

Apr 23, 2019

Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

Abstract: Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.
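
The drop from quadratic cost to $O(n \sqrt{n})$ can be checked by counting attended positions under a sparse pattern. The sketch below assumes a strided pattern with stride close to $\sqrt{n}$, where each position attends to a recent local window plus every stride-th earlier position. The abstract does not specify the pattern, so the function names and the choice to take the union of the two sets are illustrative assumptions.

```python
import math


def strided_attention_indices(i, stride):
    """Positions j <= i that position i attends to under an assumed
    strided pattern: a local window of `stride` recent positions plus
    every earlier position whose distance from i is a multiple of `stride`."""
    local = set(range(max(0, i - stride + 1), i + 1))
    strided = {j for j in range(i + 1) if (i - j) % stride == 0}
    return local | strided


def count_attended_pairs(n):
    """Total (query, key) pairs for a length-n sequence, stride ~ sqrt(n)."""
    stride = max(1, math.isqrt(n))
    return sum(len(strided_attention_indices(i, stride)) for i in range(n))


n = 1024
sparse_pairs = count_attended_pairs(n)
dense_pairs = n * (n + 1) // 2  # full causal attention
print(sparse_pairs < dense_pairs)  # True
```

Each position attends to at most about $2\sqrt{n}$ others in this pattern, so the total grows as $n \sqrt{n}$ and not as $n^2$.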

FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models

Oct 22, 2018

Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, David Duvenaud

Abstract: A promising class of generative models maps points from a simple distribution to a complex distribution through an invertible neural network. Likelihood-based training of these models requires restricting their architectures to allow cheap computation of Jacobian determinants. Alternatively, the Jacobian trace can be used if the transformation is specified by an ordinary differential equation. In this paper, we use Hutchinson's trace estimator to give a scalable unbiased estimate of the log-density. The result is a continuous-time invertible generative model with unbiased density estimation and one-pass sampling, while allowing unrestricted neural network architectures. We demonstrate our approach on high-dimensional density estimation, image generation, and variational inference, achieving the state-of-the-art among exact likelihood methods with efficient sampling.

* 8 Pages, 6 figures
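
Hutchinson's trace estimator, named in the abstract, approximates the trace of a matrix using only matrix-vector products: for random vectors v with independent entries of mean zero and unit variance, the average of vᵀAv is an unbiased estimate of tr(A). The sketch below uses an explicit small matrix as a stand-in for the Jacobian and Rademacher (±1) probe vectors; the function names, sample count, and seed are illustrative assumptions.

```python
import random


def matvec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def hutchinson_trace(matvec_fn, dim, num_samples=1000, seed=0):
    """Estimate tr(A) given only a function that computes A @ v."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        av = matvec_fn(v)
        total += sum(a * b for a, b in zip(v, av))
    return total / num_samples


# Toy symmetric matrix with exact trace 5.0 (illustrative values).
A = [[2.0, 1.0], [1.0, 3.0]]
estimate = hutchinson_trace(lambda v: matvec(A, v), dim=2, num_samples=4000)
print(abs(estimate - 5.0) < 0.5)  # True
```

The appeal in the continuous-flow setting is that the full Jacobian never has to be formed; only products with a probe vector are required.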
