
Alexander Kolesnikov

Dima

Sigmoid Loss for Language Image Pre-Training

Mar 30, 2023

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer

Abstract: We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes. With only four TPUv4 chips, we can train a Base CLIP model at 4k batch size and a Large LiT model at 20k batch size; the latter achieves 84.5% ImageNet zero-shot accuracy in two days. This disentanglement of the batch size from the loss further allows us to study the impact of examples vs pairs and of the negative-to-positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient. We hope our research motivates further explorations in improving the quality and efficiency of language-image pre-training.

* Xiaohua and Lucas contributed equally. arXiv v2: fix typo in pseudocode
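The pairwise loss described in the abstract can be illustrated with a minimal sketch. It assumes L2-normalized embeddings, a label of +1 for matching image-text pairs and -1 for all other pairs, and fixed illustrative values for the temperature `t` and bias `b`; the function names and these constants are assumptions made for this example, not taken from the listing.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over all image-text pairs in a mini-batch.

    t (temperature) and b (bias) are illustrative constants here (assumption).
    Every pair is scored independently, so no softmax normalization over
    the whole batch is needed.
    """
    n = len(img_emb)
    zimg = [l2_normalize(v) for v in img_emb]
    ztxt = [l2_normalize(v) for v in txt_emb]
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = t * sum(a * c for a, c in zip(zimg[i], ztxt[j])) + b
            label = 1.0 if i == j else -1.0  # +1 on the diagonal, -1 elsewhere
            total -= log_sigmoid(label * logit)
    return total / n


aligned = sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a much lower loss than the swapped ones, because each term depends only on its own pair.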


Tuning computer vision models with task rewards

Feb 16, 2023

André Susano Pinto, Alexander Kolesnikov, Yuge Shi, Lucas Beyer, Xiaohua Zhai

Abstract: Misalignment between model predictions and intended usage can be detrimental for the deployment of computer vision models. The issue is exacerbated when the task involves complex structured outputs, as it becomes harder to design procedures which address this misalignment. In natural language processing, this is often addressed using reinforcement learning techniques that align models with a task reward. We adopt this approach and show its surprising effectiveness across multiple computer vision tasks, such as object detection, panoptic segmentation, colorization and image captioning. We believe this approach has the potential to be widely useful for better aligning models with a diverse range of computer vision tasks.

* 11 pages
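The abstract does not spell out the reinforcement learning procedure. As a rough illustration of optimizing a task reward, the sketch below uses the generic log-derivative (REINFORCE) estimator for a softmax distribution over a handful of discrete outputs, with the mean sampled reward as a baseline. The toy reward, the baseline choice, and all names are assumptions made for this example.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def reinforce_grad(logits, reward_fn, num_samples=1000, seed=0):
    """Monte Carlo estimate of d E[reward] / d logits.

    Uses the log-derivative trick with the mean sampled reward as a
    baseline (an assumption chosen for this sketch).
    """
    rng = random.Random(seed)
    probs = softmax(logits)
    k = len(logits)
    samples = rng.choices(range(k), weights=probs, k=num_samples)
    rewards = [reward_fn(s) for s in samples]
    baseline = sum(rewards) / num_samples
    grad = [0.0] * k
    for s, r in zip(samples, rewards):
        for i in range(k):
            # d log p(s) / d logit_i = 1[i == s] - p_i
            indicator = 1.0 if i == s else 0.0
            grad[i] += (r - baseline) * (indicator - probs[i])
    return [g / num_samples for g in grad]


# Toy task reward: only output 2 is "correct".
grad = reinforce_grad([0.0, 0.0, 0.0], lambda out: 1.0 if out == 2 else 0.0)
```

Ascending this gradient raises the logit of the rewarded output and lowers the others, without requiring the reward itself to be differentiable.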


Scaling Vision Transformers to 22 Billion Parameters

Feb 10, 2023

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin (+32 more)


Abstract: The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.


FlexiViT: One Model for All Patch Sizes

Dec 15, 2022

Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, Filip Pavetic

Abstract: Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at https://github.com/google-research/big_vision

* All authors made significant technical contributions
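The speed/accuracy tradeoff in the abstract follows from how patch size determines sequence length. The sketch below shows that relationship and a per-step random choice of patch size; the image size, the list of patch sizes, and the function names are illustrative assumptions. A real model would also have to adapt its patch-embedding and position-embedding parameters to each size, which this sketch omits.

```python
import random

# Illustrative values (assumptions): a square input and patch sizes that divide it.
IMAGE_SIZE = 240
PATCH_SIZES = [8, 10, 12, 15, 16, 20, 24, 30, 40, 48]


def num_tokens(image_size, patch_size):
    """Number of patch tokens for a square image cut into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("patch size must divide the image size")
    return (image_size // patch_size) ** 2


def sample_patch_size(rng, patch_sizes=PATCH_SIZES):
    """Pick a patch size uniformly at random for one training step."""
    return rng.choice(patch_sizes)


rng = random.Random(0)
for step in range(3):
    p = sample_patch_size(rng)
    print(f"step {step}: patch size {p}, sequence length {num_tokens(IMAGE_SIZE, p)}")
```

Because self-attention cost grows with sequence length, going from 48-pixel to 8-pixel patches in this setup raises the token count from 25 to 900, which is the compute budget knob the abstract refers to.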


PaLI: A Jointly-Scaled Multilingual Language-Image Model

Sep 16, 2022

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer (+19 more)


Abstract: Effective scaling and a flexible task interface enable large language models to excel at many tasks. PaLI (Pathways Language and Image model) extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pretrained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.


UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes

May 27, 2022

Alexander Kolesnikov, André Susano Pinto, Lucas Beyer, Xiaohua Zhai, Jeremiah Harmsen, Neil Houlsby


Abstract: We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks. In contrast to previous models, UViM has the same functional form for all tasks; it requires no task-specific modifications which require extensive human expertise. The approach involves two components: (I) a base model (feed-forward) which is trained to directly predict raw vision outputs, guided by a learned discrete code and (II) a language model (autoregressive) that is trained to generate the guiding code. These components complement each other: the language model is well-suited to modeling structured interdependent data, while the base model is efficient at dealing with high-dimensional outputs. We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks: panoptic segmentation, depth prediction and image colorization, where we achieve competitive and near state-of-the-art results. Our experimental results suggest that UViM is a promising candidate for a unified modeling approach in computer vision.

* Alexander and André share first authorship; all authors made significant technical contributions to this work


Better plain ViT baselines for ImageNet-1k

May 03, 2022

Lucas Beyer, Xiaohua Zhai, Alexander Kolesnikov


Abstract: It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel at ImageNet-1k scale data. Surprisingly, we find this is not the case and standard data augmentation is sufficient. This note presents a few minor modifications to the original Vision Transformer (ViT) vanilla training setting that dramatically improve the performance of plain ViT models. Notably, 90 epochs of training surpass 76% top-1 accuracy in under seven hours on a TPUv3-8, similar to the classic ResNet50 baseline, and 300 epochs of training reach 80% in less than one day.

* Code available at https://github.com/google-research/big_vision


LiT: Zero-Shot Transfer with Locked-image Text Tuning

Nov 15, 2021

Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, Lucas Beyer


Abstract: This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning "Locked-image Text tuning" (LiT-tuning), which just teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT-tuned model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT-tuning is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT-tuned model achieves 84.5% zero-shot transfer accuracy on the ImageNet test set, and 81.1% on the challenging out-of-distribution ObjectNet test set.

* Xiaohua, Xiao, Basil, Andreas and Lucas contributed equally
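Zero-shot classification with aligned image and text towers can be sketched as a nearest-neighbor lookup in the shared embedding space. The example below assumes the embeddings have already been produced by some image and text encoders; the toy vectors, the cosine-similarity scoring, and the function names are assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_classify(image_emb, class_text_embs):
    """Return (best class index, cosine similarities to every class text)."""
    z = l2_normalize(image_emb)
    scores = []
    for text_emb in class_text_embs:
        t = l2_normalize(text_emb)
        scores.append(sum(a * b for a, b in zip(z, t)))
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores


# Toy embeddings (assumption): index 0 might stand for "a photo of a cat",
# index 1 for "a photo of a dog".
class_text_embs = [[1.0, 0.0], [0.0, 1.0]]
pred, scores = zero_shot_classify([0.9, 0.1], class_text_embs)
```

New classes can be added by embedding new text prompts, with no change to the image model, which is what makes the transfer zero-shot.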


How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

Jun 18, 2021

Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, Lucas Beyer


Abstract: Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation ("AugReg" for short) when training on smaller training datasets. We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget. As one result of this study we find that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data: we train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.

* Andreas, Alex, Xiaohua and Lucas contributed equally. We release more than 50,000 ViT models trained under diverse settings on various datasets. We believe this to be a treasure trove for model analysis. Available at https://github.com/google-research/vision_transformer and https://github.com/rwightman/pytorch-image-models


Knowledge distillation: A good teacher is patient and consistent

Jun 09, 2021

Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, Alexander Kolesnikov


Abstract: There is a growing discrepancy in computer vision between large-scale models that achieve state-of-the-art performance and models that are affordable in practical applications. In this paper we address this issue and significantly bridge the gap between these two types of models. Throughout our empirical investigation we do not aim to necessarily propose a new method, but strive to identify a robust and effective recipe for making state-of-the-art large scale models affordable in practice. We demonstrate that, when performed correctly, knowledge distillation can be a powerful tool for reducing the size of large models without compromising their performance. In particular, we uncover that there are certain implicit design choices, which may drastically affect the effectiveness of distillation. Our key contribution is the explicit identification of these design choices, which were not previously articulated in the literature. We back up our findings by a comprehensive empirical study, demonstrate compelling results on a wide range of vision datasets and, in particular, obtain a state-of-the-art ResNet-50 model for ImageNet, which achieves 82.8% top-1 accuracy.

* Lucas, Xiaohua, Amélie, Larisa, and Alex contributed equally
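The abstract does not state the distillation objective. The sketch below assumes the common formulation in which the student is trained to match the teacher's softmax output by minimizing a KL divergence, with both sets of logits computed on the same input; the temperature parameter and function names are assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax (shifted by the max for stability)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between the two predicted class distributions.

    Both logit vectors are assumed to come from the same (augmented) input.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )


teacher = [2.0, 0.5, -1.0]
loss_same = distillation_loss(teacher, [2.0, 0.5, -1.0])
loss_diff = distillation_loss(teacher, [-1.0, 0.5, 2.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it pulls the smaller model's predictions toward the larger one's.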
