Daniel Keysers

PaliGemma 2: A Family of Versatile VLMs for Transfer

Dec 04, 2024

Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long (+8 more)

Abstract: PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma including different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.

PaliGemma: A versatile 3B VLM for transfer

Jul 10, 2024

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello (+25 more)

Abstract: PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.

PaLI-3 Vision Language Models: Smaller, Faster, Stronger

Oct 17, 2023

Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski (+9 more)

Abstract: This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters, and achieve a new state-of-the-art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models.

Video OWL-ViT: Temporally-consistent open-world localization in video

Aug 22, 2023

Georg Heigold, Matthias Minderer, Alexey Gritsenko, Alex Bewley, Daniel Keysers, Mario Lučić, Fisher Yu, Thomas Kipf

Abstract: We present an architecture and a training recipe that adapts pre-trained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization, applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pre-training, can be transferred successfully to open-world localization across diverse videos.

* ICCV 2023
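
The recurrence described in this abstract (output tokens for one frame become the object queries for the next) can be illustrated with a minimal control-flow sketch. Everything below is an assumption made for illustration: `track_through_video` and `toy_decoder` are hypothetical names, queries are plain floats rather than token vectors, and the toy decoder merely averages a query with a per-frame feature. It is not the model's actual transformer decoder.

```python
def track_through_video(frames, initial_queries, decode_frame):
    """Run a per-frame decoder, feeding each frame's outputs forward as queries."""
    queries = initial_queries
    outputs = []
    for frame in frames:
        tokens = decode_frame(queries, frame)  # one decoder pass per frame
        outputs.append(tokens)
        queries = tokens  # output tokens become the next frame's object queries
    return outputs


def toy_decoder(queries, frame_feature):
    """Hypothetical stand-in for a decoder: blend each query with the frame feature."""
    return [0.5 * q + 0.5 * frame_feature for q in queries]


# Two frames with the same feature, two object queries.
per_frame_tokens = track_through_video([1.0, 1.0], [0.0, 2.0], toy_decoder)
print(per_frame_tokens)
```

Because each query slot is carried from frame to frame, slot `i` refers to the same object throughout the clip in this sketch, which is the intuition behind temporal consistency without a separate tracking step.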

PaLI-X: On Scaling up a Multilingual Vision and Language Model

May 29, 2023

Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay (+33 more)

Abstract: We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.

Scaling Vision Transformers to 22 Billion Parameters

Feb 10, 2023

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin (+32 more)

Abstract: The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.

LiT: Zero-Shot Transfer with Locked-image Text Tuning

Nov 15, 2021

Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, Lucas Beyer

Abstract: This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning "Locked-image Text tuning" (LiT-tuning), which just teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT-tuned model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT-tuning is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT-tuned model achieves 84.5% zero-shot transfer accuracy on the ImageNet test set, and 81.1% on the challenging out-of-distribution ObjectNet test set.

* Xiaohua, Xiao, Basil, Andreas and Lucas contributed equally
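
To make the idea of contrastive alignment concrete, here is a minimal sketch of a symmetric softmax contrastive loss over a batch of paired image and text embeddings. The loss form, the temperature value, and the function names are assumptions chosen for illustration, not details taken from the abstract. In a locked-image setup the image embeddings would be treated as constants and only the text model producing `text_emb` would be updated; the sketch computes the loss only and performs no training.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric contrastive loss; pair i of images and texts is the positive."""
    img = [l2_normalize(v) for v in image_emb]  # locked tower: constants here
    txt = [l2_normalize(v) for v in text_emb]   # unlocked tower: would be trained
    n = len(img)
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
               for j in range(n)] for i in range(n)]
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    logits_t = [list(col) for col in zip(*logits)]
    text_to_image = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


locked_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(locked_images, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = contrastive_loss(locked_images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, shuffled_loss)
```

Text embeddings that match their paired images give a much lower loss than mismatched ones, which is the signal a text model would follow when learning to read out the fixed image representation.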

The Impact of Reinitialization on Generalization in Convolutional Neural Networks

Sep 01, 2021

Ibrahim Alabdulmohsin, Hartmut Maennel, Daniel Keysers

Abstract: Recent results suggest that reinitializing a subset of the parameters of a neural network during training can improve generalization, particularly for small training sets. We study the impact of different reinitialization methods in several convolutional architectures across 12 benchmark image classification datasets, analyzing their potential gains and highlighting limitations. We also introduce a new layerwise reinitialization algorithm that outperforms previous methods and suggest explanations of the observed improved generalization. First, we show that layerwise reinitialization increases the margin on the training examples without increasing the norm of the weights, hence leading to an improvement in margin-based generalization bounds for neural networks. Second, we demonstrate that it settles in flatter local minima of the loss surface. Third, it encourages learning general rules and discourages memorization by placing emphasis on the lower layers of the neural network. Our takeaway message is that the accuracy of convolutional neural networks can be improved for small datasets using bottom-up layerwise reinitialization, where the number of reinitialized layers may vary depending on the available compute budget.

* 12 figures, 7 tables

Continental-Scale Building Detection from High Resolution Satellite Imagery

Jul 29, 2021

Wojciech Sirko, Sergii Kashubin, Marvin Ritter, Abigail Annkah, Yasser Salah Eddine Bouchareb, Yann Dauphin, Daniel Keysers, Maxim Neumann, Moustapha Cisse, John Quinn

Abstract: Identifying the locations and footprints of buildings is vital for many practical and scientific purposes. Such information can be particularly useful in developing regions where alternative data sources may be scarce. In this work, we describe a model training pipeline for detecting buildings across the entire continent of Africa, using 50 cm satellite imagery. Starting with the U-Net model, widely used in satellite image analysis, we study variations in architecture, loss functions, regularization, pre-training, self-training and post-processing that increase instance segmentation performance. Experiments were carried out using a dataset of 100k satellite images across Africa containing 1.75M manually labelled building instances, and further datasets for pre-training and self-training. We report novel methods for improving performance of building detection with this type of model, including the use of mixup (mAP +0.12) and self-training with soft KL loss (mAP +0.06). The resulting pipeline obtains good results even on a wide variety of challenging rural and urban contexts, and was used to create the Open Buildings dataset of 516M Africa-wide detected footprints.
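
For readers unfamiliar with mixup, which the abstract credits with a +0.12 mAP gain, the general technique blends two training examples and their labels with a weight drawn from a Beta distribution. The sketch below shows that generic form only. How the paper applies it to segmentation is not stated in the abstract, so the flat pixel lists, the per-pixel mask labels, the `alpha` value, and the function name are all assumptions.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two examples and their labels with weight lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return mixed_x, mixed_y, lam


# Two toy "images" of two pixels each, with per-pixel building masks as labels.
rng = random.Random(0)  # seeded so the example is repeatable
image, mask, lam = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], rng=rng)
print(lam, image, mask)
```

The mixed mask is a soft label: each pixel carries the same blend weight as the corresponding pixel of the mixed image.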

A Generalized Lottery Ticket Hypothesis

Jul 26, 2021

Ibrahim Alabdulmohsin, Larisa Markeeva, Daniel Keysers, Ilya Tolstikhin

Abstract: We introduce a generalization to the lottery ticket hypothesis in which the notion of "sparsity" is relaxed by choosing an arbitrary basis in the space of parameters. We present evidence that the original results reported for the canonical basis continue to hold in this broader setting. We describe how structured pruning methods, including pruning units or factorizing fully-connected layers into products of low-rank matrices, can be cast as particular instances of this "generalized" lottery ticket hypothesis. The investigations reported here are preliminary and are provided to encourage further research along this direction.

* Workshop on Sparsity in Neural Networks: Advancing Understanding and Practice (SNN'21). Updates: New curve on Figure 2(left) and discussion on Li et al
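
The notion of sparsity in an arbitrary basis can be illustrated with a small sketch: express a weight vector in an orthonormal basis, keep the `k` largest-magnitude coefficients, and map back. With the canonical basis this reduces to ordinary magnitude pruning. The function name, the orthonormality assumption, and the one-shot top-`k` rule are choices made here for illustration; the sketch does not reproduce the paper's pruning and retraining procedure.

```python
import math


def prune_in_basis(weights, basis, k):
    """Keep the k largest coefficients of `weights` in an orthonormal `basis`.

    `basis` is a list of orthonormal row vectors (an assumption of this sketch).
    """
    coeffs = [sum(b * w for b, w in zip(row, weights)) for row in basis]
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    pruned = [c if i in keep else 0.0 for i, c in enumerate(coeffs)]
    # Map the sparse coefficients back to the original parameter space.
    return [sum(pruned[i] * basis[i][j] for i in range(len(basis)))
            for j in range(len(weights))]


weights = [1.0, 0.9]
canonical = [[1.0, 0.0], [0.0, 1.0]]
s = 1.0 / math.sqrt(2.0)
rotated = [[s, s], [s, -s]]

canonical_pruned = prune_in_basis(weights, canonical, k=1)
rotated_pruned = prune_in_basis(weights, rotated, k=1)
print(canonical_pruned, rotated_pruned)
```

In the canonical basis one weight is zeroed outright, while in the rotated basis a single retained coefficient still reconstructs both weights closely, showing how the choice of basis changes what "sparse" means.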
