Andrej Karpathy

GPT-4o System Card

Oct 25, 2024

OpenAI: Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda (+409 more)

Abstract: GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications

Jan 19, 2017

Tim Salimans, Andrej Karpathy, Xi Chen, Diederik P. Kingma

Abstract: PixelCNNs are a recently proposed class of powerful generative models with tractable likelihood. Here we discuss our implementation of PixelCNNs which we make available at https://github.com/openai/pixel-cnn. Our implementation contains a number of modifications to the original model that both simplify its structure and improve its performance. 1) We use a discretized logistic mixture likelihood on the pixels, rather than a 256-way softmax, which we find to speed up training. 2) We condition on whole pixels, rather than R/G/B sub-pixels, simplifying the model structure. 3) We use downsampling to efficiently capture structure at multiple resolutions. 4) We introduce additional short-cut connections to further speed up optimization. 5) We regularize the model using dropout. Finally, we present state-of-the-art log likelihood results on CIFAR-10 to demonstrate the usefulness of these modifications.
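
As a rough illustration of the first modification in the abstract above, the sketch below computes the probability of a single integer pixel value under a discretized mixture of logistic distributions. It is a minimal sketch, not the code from the linked repository: the function names, the 0..255 pixel scale, and the example parameters are assumptions made here, and the handling of the edge values 0 and 255 (assigning them the full tail mass) follows the usual formulation of this likelihood.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function (the CDF of a standard logistic).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def discretized_logistic_mixture_pmf(x, weights, means, scales):
    """Probability of integer pixel value x (0..255) under a logistic mixture.

    Each component contributes the logistic CDF mass of the bin
    [x - 0.5, x + 0.5]; the bins at 0 and 255 absorb the left and right tails.
    All parameter names here are illustrative assumptions.
    """
    total = 0.0
    for w, mu, s in zip(weights, means, scales):
        upper = 1.0 if x == 255 else sigmoid((x + 0.5 - mu) / s)
        lower = 0.0 if x == 0 else sigmoid((x - 0.5 - mu) / s)
        total += w * (upper - lower)
    return total


# Hypothetical two-component mixture over one pixel intensity.
weights, means, scales = [0.3, 0.7], [40.0, 200.0], [5.0, 12.0]
pmf = [discretized_logistic_mixture_pmf(x, weights, means, scales) for x in range(256)]
print(sum(pmf))  # close to 1: the 256 bins partition the real line
```

Compared with a 256-way softmax, this parameterization needs only a few numbers per component (weight, mean, scale) and ties neighboring intensities together through a smooth underlying density.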

DenseCap: Fully Convolutional Localization Networks for Dense Captioning

Nov 24, 2015

Justin Johnson, Andrej Karpathy, Li Fei-Fei

Abstract: We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions consist of a single word, and image captioning when one predicted region covers the full image. To address the localization and description task jointly we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a Convolutional Network, a novel dense localization layer, and a Recurrent Neural Network language model that generates the label sequences. We evaluate our network on the Visual Genome dataset, which comprises 94,000 images and 4,100,000 region-grounded captions. We observe both speed and accuracy improvements over baselines based on current state of the art approaches in both generation and retrieval settings.

Visualizing and Understanding Recurrent Networks

Nov 17, 2015

Andrej Karpathy, Justin Johnson, Li Fei-Fei

Abstract: Recurrent Neural Networks (RNNs), and specifically a variant with Long Short-Term Memory (LSTM), are enjoying renewed interest as a result of successful applications in a wide range of machine learning problems that involve sequential data. However, while LSTMs provide exceptional results in practice, the source of their performance and their limitations remain rather poorly understood. Using character-level language models as an interpretable testbed, we aim to bridge this gap by providing an analysis of their representations, predictions and error types. In particular, our experiments reveal the existence of interpretable cells that keep track of long-range dependencies such as line lengths, quotes and brackets. Moreover, our comparative analysis with finite horizon n-gram models traces the source of the LSTM improvements to long-range structural dependencies. Finally, we provide analysis of the remaining errors and suggest areas for further study.

* changing style, adding references, minor changes to text
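
To make the "finite horizon" baseline in the abstract above concrete, here is a minimal sketch of a character-level n-gram model with add-alpha smoothing. It is an illustrative assumption, not the paper's implementation: the function names, the smoothing scheme, and the toy text are all chosen here for the example.

```python
from collections import Counter, defaultdict


def train_char_ngram(text, n):
    """Count next-character frequencies for every context of n - 1 characters."""
    counts = defaultdict(Counter)
    for i in range(n - 1, len(text)):
        context = text[i - (n - 1):i]
        counts[context][text[i]] += 1
    return counts


def next_char_prob(counts, context, char, vocab_size, alpha=1.0):
    """Add-alpha smoothed probability of `char` given the last n - 1 characters."""
    seen = counts.get(context, Counter())
    return (seen[char] + alpha) / (sum(seen.values()) + alpha * vocab_size)


# Toy example: a 3-gram model only ever sees the previous two characters.
counts = train_char_ngram("abababab", n=3)
print(next_char_prob(counts, "ab", "a", vocab_size=2))  # (3 + 1) / (3 + 2) = 0.8
```

Because the context is capped at n - 1 characters, such a model cannot know whether a quote or bracket opened further back is still unclosed, which is the kind of long-range structure the abstract attributes to the LSTM.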

Deep Visual-Semantic Alignments for Generating Image Descriptions

Apr 14, 2015

Andrej Karpathy, Li Fei-Fei

Abstract: We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.

ImageNet Large Scale Visual Recognition Challenge

Jan 30, 2015

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein (+2 more)

Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.

* 43 pages, 16 figures. v3 includes additional comparisons with PASCAL VOC (per-category comparisons in Table 3, distribution of localization difficulty in Fig 16), a list of queries used for obtaining object detection images (Appendix C), and some additional references

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

Jun 22, 2014

Andrej Karpathy, Armand Joulin, Li Fei-Fei

Abstract: We introduce a model for bidirectional retrieval of images and sentences through a multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. In addition to a ranking objective seen in previous work, this allows us to add a new fragment alignment objective that learns to directly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments significantly improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions since the inferred inter-modal fragment alignment is explicit.
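
The "ranking objective seen in previous work" mentioned in the abstract above is commonly written as a bidirectional max-margin loss over image-sentence scores. The sketch below shows that generic form only; it is an assumption for illustration, not the paper's exact objective, and it does not cover the fragment alignment term. The function name, the score matrix layout, and the margin value are hypothetical.

```python
def bidirectional_ranking_loss(scores, margin=1.0):
    """Generic max-margin ranking loss over a square score matrix.

    scores[k][l] is the similarity of image k and sentence l; matching pairs
    sit on the diagonal. Each true pair should outscore every mismatched
    pair in its row (same image, wrong sentence) and in its column
    (wrong image, same sentence) by at least `margin`.
    """
    n = len(scores)
    loss = 0.0
    for k in range(n):
        true_score = scores[k][k]
        for l in range(n):
            if l == k:
                continue
            # Image k ranked against the wrong sentence l.
            loss += max(0.0, scores[k][l] - true_score + margin)
            # Sentence k ranked against the wrong image l.
            loss += max(0.0, scores[l][k] - true_score + margin)
    return loss


# Well-separated pairs incur no loss; a confusable pair is penalized.
print(bidirectional_ranking_loss([[2.0, 0.0], [0.0, 2.0]]))  # 0.0
print(bidirectional_ranking_loss([[1.0, 1.0], [0.0, 1.0]]))  # 2.0
```

A loss of this shape only constrains whole images and whole sentences, which is why the abstract describes adding a separate objective at the fragment level.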
