
Jenia Jitsev

OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

Aug 07, 2023
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, Ludwig Schmidt

We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80% and 89% of the corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite. We share our models and code at https://github.com/mlfoundations/open_flamingo.
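
Flamingo-style models condition on interleaved images and text, so few-shot evaluation amounts to building a prompt in which each in-context demonstration pairs an image placeholder with its caption. A minimal sketch of such prompt construction, assuming the `<image>` and `<|endofchunk|>` special-token convention from the OpenFlamingo repository:

```python
# Sketch of building an interleaved image-text few-shot prompt. The
# "<image>" / "<|endofchunk|>" token names follow the OpenFlamingo
# repository's convention; the actual images are supplied to the model
# separately and cross-attended at each "<image>" position.

def build_few_shot_prompt(demonstrations, query_text):
    """Interleave caption demonstrations with image placeholders.

    Each demonstration contributes one "<image>" marker followed by its
    caption and an end-of-chunk token; the query image is appended last,
    left open for the model to complete.
    """
    parts = []
    for caption in demonstrations:
        parts.append(f"<image>{caption}<|endofchunk|>")
    parts.append(f"<image>{query_text}")
    return "".join(parts)

prompt = build_few_shot_prompt(
    ["An image of a cat.", "An image of a dog."],
    "An image of",
)
print(prompt)
```

The number of `<image>` markers must match the number of images passed to the vision encoder; the final chunk is deliberately unterminated so generation continues the query caption.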

DataComp: In search of the next generation of multimodal datasets

May 03, 2023
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alex Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, Ludwig Schmidt

Large multimodal datasets have been instrumental in recent breakthroughs such as CLIP, Stable Diffusion, and GPT-4. At the same time, datasets rarely receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a benchmark where the training code is fixed and researchers innovate by proposing new training sets. We provide a testbed for dataset experiments centered around a new candidate pool of 12.8B image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing on 38 downstream test sets. Our benchmark consists of multiple scales, with four candidate pool sizes and associated compute budgets ranging from 12.8M to 12.8B samples seen during training. This multi-scale design facilitates the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow is a promising way of improving multimodal datasets. We introduce DataComp-1B, a dataset created by applying a simple filtering algorithm to the 12.8B candidate pool. The resulting 1.4B subset enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet. Our new ViT-L/14 model outperforms a larger ViT-g/14 trained on LAION-2B by 0.7 percentage points while requiring 9x less training compute. We also outperform OpenAI's CLIP ViT-L/14, trained with the same compute budget as our model, by 3.7 percentage points. These gains highlight the potential for improving model performance by carefully curating training sets. We view DataComp-1B as only the first step and hope that DataComp paves the way toward the next generation of multimodal datasets.
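
The "simple filtering algorithm" family the abstract refers to typically scores each image-text pair by the cosine similarity of its CLIP image and text embeddings and keeps pairs above a threshold. A hedged sketch of that operation, with random embeddings standing in for real CLIP features (the threshold value here is illustrative, not the one used for DataComp-1B):

```python
import numpy as np

# CLIP-score filtering sketch: keep image-text pairs whose unit-normalized
# CLIP image and text embeddings have cosine similarity above a threshold.
# Embeddings below are synthetic stand-ins; in practice they come from a
# pretrained CLIP model applied to the candidate pool.

rng = np.random.default_rng(0)
n, d = 1000, 64
img_emb = rng.normal(size=(n, d))
txt_emb = img_emb + 0.5 * rng.normal(size=(n, d))  # correlated pairs

img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)
txt_emb /= np.linalg.norm(txt_emb, axis=1, keepdims=True)

clip_scores = np.sum(img_emb * txt_emb, axis=1)  # per-pair cosine similarity
threshold = 0.3                                  # illustrative cutoff
keep = clip_scores > threshold
filtered_indices = np.flatnonzero(keep)
print(f"kept {keep.sum()} / {n} pairs")
```

At pool scale this runs as a streaming pass over precomputed embeddings; the filtering decision itself stays this cheap.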

A Comparative Study on Generative Models for High Resolution Solar Observation Imaging

Apr 14, 2023
Mehdi Cherti, Alexander Czernik, Stefan Kesselheim, Frederic Effenberger, Jenia Jitsev

Solar activity is one of the main drivers of variability in our solar system and the key source of space weather phenomena that affect Earth and near-Earth space. The extensive record of high resolution extreme ultraviolet (EUV) observations from the Solar Dynamics Observatory (SDO) offers an unprecedented, very large dataset of solar images. In this work, we make use of this comprehensive dataset to investigate the capabilities of current state-of-the-art generative models to accurately capture the data distribution behind the observed solar activity states. Starting from StyleGAN-based methods, we uncover severe deficits of this model family in handling fine-scale details of solar images when training on high resolution samples, contrary to training on natural face images. When switching to the diffusion-based generative model family, we observe strong improvements in fine-scale detail generation. For the GAN family, we are able to achieve similar improvements in fine-scale generation when turning to ProjectedGANs, which use multi-scale discriminators with a pre-trained frozen feature extractor. We conduct ablation studies to clarify the mechanisms responsible for proper fine-scale handling. Using distributed training on supercomputers, we are able to train generative models at up to 1024x1024 resolution that produce high quality samples which human experts find indistinguishable from real observations, as suggested by the evaluation we conduct. We make all code, models and workflows used in this study publicly available at https://github.com/SLAMPAI/generative-models-for-highres-solar-images.

Reproducible scaling laws for contrastive language-image learning

Dec 14, 2022
Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, Jenia Jitsev

Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale experiments are becoming increasingly expensive. However, previous work on scaling laws has primarily used private data & models or focused on uni-modal language or vision learning. To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository. Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks including zero-shot classification, retrieval, linear probing, and end-to-end fine-tuning. We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures and similar training recipes. We open-source our evaluation workflow and all models, including the largest public CLIP models, to ensure reproducibility and make scaling laws research more accessible. Source code and instructions to reproduce this study will be available at https://github.com/LAION-AI/scaling-laws-openclip
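
The power-law fitting underlying such studies is a linear regression in log-log space: if error behaves as E = a * C^(-b) in compute C, then log E is linear in log C with slope -b. A minimal sketch on synthetic data (the constants a, b and the noise level are illustrative, not values from the paper):

```python
import numpy as np

# Power-law fitting sketch: generate synthetic (compute, error) points
# following err = a * C**(-b), then recover the exponent by linear
# regression in log-log space.

rng = np.random.default_rng(42)
a_true, b_true = 2.0, 0.15
compute = np.logspace(15, 21, 13)                     # e.g. training FLOPs
error = a_true * compute ** (-b_true)
error *= np.exp(0.01 * rng.normal(size=error.shape))  # small log-space noise

slope, intercept = np.polyfit(np.log(compute), np.log(error), 1)
b_fit, a_fit = -slope, np.exp(intercept)
print(f"fitted exponent b ~ {b_fit:.3f} (true {b_true})")
```

Fitting in log space weights all scales equally, which is why scaling-law plots are drawn on log-log axes; the fitted exponent is then used to extrapolate to compute budgets beyond those actually trained.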

* Preprint. Under review 
Towards prediction of turbulent flows at high Reynolds numbers using high performance computing data and deep learning

Oct 28, 2022
Mathis Bode, Michael Gauding, Jens Henrik Göbbert, Baohao Liao, Jenia Jitsev, Heinz Pitsch

In this paper, deep learning (DL) methods are evaluated in the context of turbulent flows. Various generative adversarial networks (GANs) are discussed with respect to their suitability for understanding and modeling turbulence. Wasserstein GANs (WGANs) are then chosen to generate small-scale turbulence. Highly resolved direct numerical simulation (DNS) turbulent data is used for training the WGANs and the effect of network parameters, such as learning rate and loss function, is studied. Qualitatively good agreement between DNS input data and generated turbulent structures is shown. A quantitative statistical assessment of the predicted turbulent fields is performed.
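
The WGAN objective mentioned above trains a critic to maximize the gap E[D(real)] - E[D(fake)] while the generator maximizes E[D(fake)]. A minimal numerical sketch of the two losses (the critic scores here are fixed toy values purely for illustration; a real WGAN trains a neural critic under a Lipschitz constraint, via weight clipping or a gradient penalty):

```python
import numpy as np

# Wasserstein GAN losses as minimized quantities: the critic loss is the
# negated Wasserstein estimate, the generator loss is the negated mean
# critic score on generated samples.

def critic_loss(scores_real, scores_fake):
    # Minimizing this maximizes E[D(real)] - E[D(fake)].
    return -(np.mean(scores_real) - np.mean(scores_fake))

def generator_loss(scores_fake):
    # Minimizing this pushes fake samples toward higher critic scores.
    return -np.mean(scores_fake)

real = np.array([1.2, 0.9, 1.1])   # toy critic scores on DNS data
fake = np.array([-0.8, -1.0, -0.7])  # toy critic scores on generated fields
print(critic_loss(real, fake), generator_loss(fake))
```

Unlike the original GAN loss, the Wasserstein estimate correlates with sample quality during training, which is one reason WGANs are a common choice for physics data such as turbulent fields.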

* LNCS 11203, pp. 614-623, 2018  
LAION-5B: An open large-scale dataset for training next generation image-text models

Oct 16, 2022
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, Jenia Jitsev

Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on the expensive accurate labels used in standard unimodal supervised vision learning. The resulting models showed capabilities of strong text-guided image generation and transfer to downstream tasks, while performing remarkably well at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and Imagen have made further improvements. Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available for the broader research community. To address this problem and democratize research on large-scale multi-modal models, we present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English text. We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further experiments enabled by an openly available dataset of this scale. Additionally, we provide several nearest neighbor indices, an improved web interface for dataset exploration and subset generation, and detection scores for watermark, NSFW, and toxic content. Announcement page: https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/
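
The nearest-neighbor indices mentioned above enable retrieval over CLIP embeddings: for unit-normalized vectors, cosine-similarity search reduces to a matrix product plus a top-k selection. An illustrative brute-force sketch (the released LAION indices use approximate search to scale to billions of entries; this shows only the underlying operation, on synthetic embeddings):

```python
import numpy as np

# Exact k-nearest-neighbor search over unit-normalized embeddings.

def knn_search(query, database, k):
    """Return indices of the k most cosine-similar database rows."""
    sims = database @ query                    # cosine sim for unit vectors
    topk = np.argpartition(-sims, k - 1)[:k]   # unordered top-k candidates
    return topk[np.argsort(-sims[topk])]       # sorted by similarity

rng = np.random.default_rng(1)
db = rng.normal(size=(500, 32))
db /= np.linalg.norm(db, axis=1, keepdims=True)

# Query: a lightly perturbed copy of entry 123, so it should rank first.
q = db[123] + 0.01 * rng.normal(size=32)
q /= np.linalg.norm(q)

neighbors = knn_search(q, db, k=5)
print(neighbors)
```

Because image and text embeddings share CLIP's joint space, the same search serves both image-to-image and text-to-image retrieval over the dataset.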

* 36th Conference on Neural Information Processing Systems (NeurIPS 2022), Track on Datasets and Benchmarks. OpenReview: https://openreview.net/forum?id=M3Y74vmsMcY 
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Nov 03, 2021
Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, Aran Komatsuzaki

Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g. CLIP, DALL-E) have recently attracted a surge of interest, showing remarkable capability to perform zero- or few-shot learning and transfer even in the absence of per-sample labels on target image data. Despite this trend, to date there have been no publicly available datasets of sufficient scale for training such models from scratch. To address this issue, in a community effort we build and publicly release LAION-400M, a dataset of 400 million CLIP-filtered image-text pairs, together with their CLIP embeddings and kNN indices that allow efficient similarity search.

* Short version. Accepted at Data Centric AI NeurIPS Workshop 2021 
Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images

Jun 09, 2021
Mehdi Cherti, Jenia Jitsev

Transfer learning aims to exploit pre-trained models for more efficient follow-up training on a wide range of downstream tasks and datasets, enabling successful training even on small data. A recent line of work posits strong benefits for model generalization and transfer when model size, data size, and compute budget are increased during pre-training. However, it remains largely unclear whether the observed transfer improvement due to increased scale also holds when source and target data distributions are far apart from each other. In this work we conduct large-scale pre-training on large source datasets of either natural (ImageNet-21k/1k) or medical chest X-Ray images and compare full and few-shot transfer using different target datasets from both natural and medical imaging domains. Our observations provide evidence that while pre-training and transfer on closely related datasets show a clear benefit of increasing model and data size during pre-training, such benefits are not clearly visible when source and target datasets are further apart. These observations hold across both full and few-shot transfer and indicate that scaling laws pointing to improved generalization and transfer with increasing model and data size are incomplete: they should be revised to take into account the type and proximity of the source and target data in order to correctly predict the effect of model and data scale during pre-training on transfer. Remarkably, in full-shot transfer to a large chest X-Ray imaging target (PadChest), the largest model pre-trained on ImageNet-21k slightly outperforms the best models pre-trained on large chest X-Ray imaging data. This indicates the possibility of obtaining high-quality models for domain-specific transfer even without access to large domain-specific data, by instead pre-training on comparably very large, generic source data.
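
A common way to measure few-shot transfer of the kind compared above is to freeze the pre-trained feature extractor, take k labeled examples per target class, and classify queries by distance to the class centroids in feature space. A toy sketch with synthetic Gaussian clusters standing in for real pretrained-network activations (all sizes and scales here are illustrative):

```python
import numpy as np

# Few-shot transfer sketch: nearest-centroid classification on frozen
# features. Each class is a Gaussian cluster in a d-dimensional feature
# space; k support examples per class define the centroids.

rng = np.random.default_rng(7)
n_classes, k_shot, d = 5, 10, 16
centers = rng.normal(scale=3.0, size=(n_classes, d))

# k labeled support examples per class, plus 20 query examples per class.
support = centers[:, None, :] + rng.normal(size=(n_classes, k_shot, d))
queries = centers[:, None, :] + rng.normal(size=(n_classes, 20, d))

centroids = support.mean(axis=1)                # (n_classes, d)
q = queries.reshape(-1, d)
labels = np.repeat(np.arange(n_classes), 20)

dists = np.linalg.norm(q[:, None, :] - centroids[None], axis=2)
pred = dists.argmin(axis=1)
accuracy = (pred == labels).mean()
print(f"few-shot accuracy: {accuracy:.2f}")
```

In this setting the quality of the frozen features (here, how well separated the class clusters are) entirely determines accuracy, which is why few-shot transfer is a sensitive probe of what large-scale pre-training actually learned.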

* Preprint. Under review 