Piotr Bojanowski

WILLOW, LIENS

The Hidden Uniform Cluster Prior in Self-Supervised Learning

Oct 13, 2022

Mahmoud Assran, Randall Balestriero, Quentin Duval, Florian Bordes, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Nicolas Ballas

Abstract: A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN). We show that the formulation of all these methods contains an overlooked prior to learn features that enable uniform clustering of the data. While this prior has led to remarkably semantic representations when pretraining on class-balanced data, such as ImageNet, we demonstrate that it can hamper performance when pretraining on class-imbalanced data. By moving away from conventional uniformity priors and instead preferring power-law distributed feature clusters, we show that one can improve the quality of the learned representations on real-world class-imbalanced datasets. To demonstrate this, we develop an extension of the Masked Siamese Networks (MSN) method to support the use of arbitrary feature priors.
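
To make the contrast between a uniform prior and a power-law prior concrete, here is a minimal sketch of one way such a prior could be expressed: a KL divergence between the mini-batch mean of the cluster predictions and a chosen target distribution. The function names, the exponent, and the use of KL divergence are assumptions made for illustration, not the authors' implementation.

```python
import math

def power_law_prior(num_clusters, exponent=1.0):
    """Hypothetical helper: normalized power-law distribution over cluster indices."""
    weights = [1.0 / (k ** exponent) for k in range(1, num_clusters + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def mean_prediction(batch_probs):
    """Average the per-sample cluster probabilities over the mini-batch."""
    n = len(batch_probs)
    k = len(batch_probs[0])
    return [sum(row[j] for row in batch_probs) / n for j in range(k)]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def prior_regularizer(batch_probs, prior):
    """Penalty that is zero when the batch-mean prediction matches the prior."""
    return kl_divergence(mean_prediction(batch_probs), prior)

# A batch whose mean prediction is uniform over 4 clusters.
batch = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
uniform = [0.25] * 4
skewed = power_law_prior(4, exponent=1.0)
penalty_uniform = prior_regularizer(batch, uniform)  # no penalty under a uniform prior
penalty_skewed = prior_regularizer(batch, skewed)    # positive under a power-law prior
```

With a uniform target the penalty pushes every mini-batch toward equally sized clusters; swapping in the power-law target removes that pressure for long-tailed data.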

Walk the Random Walk: Learning to Discover and Reach Goals Without Supervision

Jun 23, 2022

Lina Mezghani, Sainbayar Sukhbaatar, Piotr Bojanowski, Karteek Alahari

Abstract: Learning a diverse set of skills by interacting with an environment without any external supervision is an important challenge. In particular, obtaining a goal-conditioned agent that can reach any given state is useful in many applications. We propose a novel method for training such a goal-conditioned agent without any external rewards or any domain knowledge. We use a random walk to train a reachability network that predicts the similarity between two states. This reachability network is then used to build a goal memory containing past observations that are diverse and well-balanced. Finally, we train a goal-conditioned policy network with goals sampled from the goal memory and reward it using the reachability network and the goal memory. All the components are kept updated throughout training as the agent discovers and learns new goals. We apply our method to continuous control navigation and robotic manipulation tasks.

Masked Siamese Networks for Label-Efficient Learning

Apr 14, 2022

Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas

Abstract: We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations. Our approach matches the representation of an image view containing randomly masked patches to the representation of the original unmasked image. This self-supervised pre-training strategy is particularly scalable when applied to Vision Transformers since only the unmasked patches are processed by the network. As a result, MSNs improve the scalability of joint-embedding architectures, while producing representations of a high semantic level that perform competitively on low-shot image classification. For instance, on ImageNet-1K, with only 5,000 annotated images, our base MSN model achieves 72.4% top-1 accuracy, and with 1% of ImageNet-1K labels, we achieve 75.7% top-1 accuracy, setting a new state-of-the-art for self-supervised learning on this benchmark. Our code is publicly available.
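
The scalability argument rests on dropping masked patches before they reach the network. The sketch below illustrates that step only; the helper name, the 70% mask ratio, and the 14 x 14 patch grid are illustrative assumptions rather than values taken from the abstract.

```python
import random

def random_mask_patches(patches, mask_ratio, rng):
    """Hypothetical helper: keep a random subset of patches and drop the rest.

    Returns the kept patches and their original indices, so that a
    Vision Transformer could still attach the right position embeddings.
    """
    num_keep = max(1, round(len(patches) * (1.0 - mask_ratio)))
    keep = sorted(rng.sample(range(len(patches)), num_keep))
    return [patches[i] for i in keep], keep

# 196 stand-in patches (a 14 x 14 grid); integers stand in for patch embeddings.
patches = list(range(196))
kept, kept_indices = random_mask_patches(patches, mask_ratio=0.7, rng=random.Random(0))
# Only len(kept) tokens would be fed to the encoder for the masked view.
```

Because self-attention cost grows with the number of tokens, processing only the kept patches is where the savings for the masked view would come from.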

Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision

Feb 22, 2022

Priya Goyal, Quentin Duval, Isaac Seessel, Mathilde Caron, Ishan Misra, Levent Sagun, Armand Joulin, Piotr Bojanowski

Abstract: Discriminative self-supervised learning allows training models on any random group of internet images, and possibly recovering salient information that helps differentiate between the images. Applied to ImageNet, this leads to object-centric features that perform on par with supervised features on most object-centric downstream tasks. In this work, we ask whether, using this ability, we can learn any salient and more representative information present in a diverse, unbounded set of images from across the globe. To do so, we train models on billions of random images without any data pre-processing or prior assumptions about what we want the model to learn. We scale our model size to a dense 10 billion parameters to avoid underfitting on a large data size. We extensively study and validate our model performance on over 50 benchmarks including fairness, robustness to distribution shift, geographical diversity, fine-grained recognition, image copy detection and many image classification datasets. The resulting model not only captures semantic information well, it also captures information about artistic style and learns salient information such as geolocations and multilingual word embeddings based on visual content only. More importantly, we discover that such a model is more robust, more fair, less harmful and less biased than supervised models or models trained on object-centric datasets such as ImageNet.

Augmenting Convolutional networks with attention-based aggregation

Dec 27, 2021

Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Piotr Bojanowski, Armand Joulin, Gabriel Synnaeve, Hervé Jégou

Abstract: We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling by an attention-based aggregation layer akin to a single transformer block, that weights how the patches are involved in the classification decision. We combine this learned aggregation layer with a simplistic patch-based convolutional network parametrized by 2 parameters (width and depth). In contrast with a pyramidal design, this architecture family maintains the input patch resolution across all the layers. It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption, as shown by our experiments on various computer vision tasks: object classification, image segmentation and detection.
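
As a rough illustration of replacing average pooling with attention-based aggregation, the sketch below pools patch features with softmax weights computed against a single query vector. It is a simplification under stated assumptions: the key and value projections are taken to be the identity, there is one head, and `attention_pool` is a hypothetical name.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(patches, query):
    """Weighted average of patch features; weights come from query-patch similarity.

    patches: list of N feature vectors of length d
    query:   one vector of length d (learned in a real model; fixed here)
    """
    d = len(query)
    scores = [sum(q * p for q, p in zip(query, patch)) / math.sqrt(d) for patch in patches]
    weights = softmax(scores)
    pooled = [sum(w * patch[j] for w, patch in zip(weights, patches)) for j in range(d)]
    return pooled, weights

patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# A zero query gives equal weights, which reduces to plain average pooling.
avg_pooled, avg_weights = attention_pool(patches, query=[0.0, 0.0])
# A query aligned with the first channel favors patches that activate it.
focused_pooled, focused_weights = attention_pool(patches, query=[10.0, 0.0])
```

The returned weights are what would show how strongly each patch contributes to the classification decision.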

Towards Unsupervised Dense Information Retrieval with Contrastive Learning

Dec 16, 2021

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, Edouard Grave

Abstract: Information retrieval is an important component in natural language processing, for knowledge intensive tasks such as question answering and fact checking. Recently, information retrieval has seen the emergence of dense retrievers, based on neural networks, as an alternative to classical sparse methods based on term-frequency. These models have obtained state-of-the-art results on datasets and tasks where large training sets are available. However, they do not transfer well to new domains or applications with no training data, and are often outperformed by term-frequency methods such as BM25 which are not supervised. Thus, a natural question is whether it is possible to train dense retrievers without supervision. In this work, we explore the limits of contrastive learning as a way to train unsupervised dense retrievers, and show that it leads to strong retrieval performance. More precisely, we show on the BEIR benchmark that our model outperforms BM25 on 11 out of 15 datasets. Furthermore, when a few thousand examples are available, we show that fine-tuning our model on these leads to strong improvements compared to BM25. Finally, when used as pre-training before fine-tuning on the MS-MARCO dataset, our technique obtains state-of-the-art results on the BEIR benchmark.
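
For readers unfamiliar with contrastive training of retrievers, the following is a minimal sketch of an InfoNCE-style loss over one query and a set of candidate passages. The abstract does not spell out the loss, so the dot-product scoring, the temperature value, and the function name are assumptions chosen for illustration.

```python
import math

def info_nce(query, keys, positive_index, temperature=0.05):
    """Contrastive loss: negative log-probability of the positive key.

    query: embedding of the query (list of floats)
    keys:  embeddings of candidate passages; one is the positive,
           the others act as negatives
    """
    logits = [sum(q * k for q, k in zip(query, key)) / temperature for key in keys]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_index]

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
loss_correct = info_nce(query, keys, positive_index=0, temperature=1.0)
loss_wrong = info_nce(query, keys, positive_index=1, temperature=1.0)
```

The loss is small when the query scores highest against its positive passage and grows when a negative scores higher, which is the signal that would train the encoder without labels.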

XCiT: Cross-Covariance Image Transformers

Jun 18, 2021

Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek (+1 more)

Abstract: Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e., words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
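
The sketch below illustrates why attention across channels scales linearly with the number of tokens: the attention map has shape d x d regardless of how many tokens N there are. It is a single-head simplification in plain Python; the learned projections are omitted, and the column normalization, the fixed temperature, and the softmax axis are assumptions about one plausible formulation rather than a reproduction of the authors' code.

```python
import math

def l2_normalize_columns(x):
    """Scale each channel (column) of an N x d matrix to unit L2 norm over tokens."""
    n, d = len(x), len(x[0])
    norms = [math.sqrt(sum(x[t][c] ** 2 for t in range(n))) or 1.0 for c in range(d)]
    return [[x[t][c] / norms[c] for c in range(d)] for t in range(n)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_covariance_attention(q, k, v, temperature=1.0):
    """q, k, v: N x d matrices (tokens by channels). Returns (N x d output, d x d map)."""
    n, d = len(q), len(q[0])
    qn, kn = l2_normalize_columns(q), l2_normalize_columns(k)
    # Entry (i, j) compares channel i of the queries with channel j of the keys,
    # summed over tokens: the cost is O(N * d^2), linear in N.
    attn = [
        softmax([sum(qn[t][i] * kn[t][j] for t in range(n)) / temperature for j in range(d)])
        for i in range(d)
    ]
    # Each output channel is a mixture of the value channels, applied per token.
    out = [[sum(attn[i][j] * v[t][j] for j in range(d)) for i in range(d)] for t in range(n)]
    return out, attn

q = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
k = [[0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5], [1.0, 0.0]]
v = [[float(t), float(t)] for t in range(5)]
out, attn = cross_covariance_attention(q, k, v)
```

Doubling the number of tokens would double the work of building `attn` and `out`, whereas token-to-token self-attention would need an N x N map.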

Emerging Properties in Self-Supervised Vision Transformers

May 24, 2021

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin

Abstract: In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of a momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
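
Two ingredients named in the abstract, a momentum encoder and self-distillation with no labels, can be sketched in a few lines. This is a simplified illustration: parameters are flat lists of floats, and the momentum and temperature values are assumptions for the example, not settings stated in the abstract.

```python
import math

def softmax(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ema_update(teacher_params, student_params, momentum=0.996):
    """Momentum encoder: the teacher is a moving average of the student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher_params, student_params)]

def self_distillation_loss(student_logits, teacher_logits, student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher's distribution and the student's.

    No labels are involved: the target comes from the teacher network.
    """
    target = softmax(teacher_logits, teacher_temp)
    pred = softmax(student_logits, student_temp)
    return -sum(t * math.log(p + 1e-12) for t, p in zip(target, pred))

new_teacher = ema_update([1.0, 2.0], [3.0, 4.0], momentum=0.9)
loss_agree = self_distillation_loss([2.0, 0.0], [2.0, 0.0])
loss_disagree = self_distillation_loss([0.0, 2.0], [2.0, 0.0])
```

In a training loop under these assumptions, gradients would flow only through the student, and `ema_update` would be applied to the teacher after each step.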

* 21 pages

ResMLP: Feedforward networks for image classification with data-efficient training

May 07, 2021

Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou

Abstract: We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet. We will share our code based on the Timm library and pre-trained models.
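
The two alternating sublayers described above can be sketched with plain matrix products. This is a minimal illustration under assumptions: biases and any normalization layers are omitted, GELU is assumed as the activation, and all names and shapes are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def resmlp_block(x, w_patch, w1, w2):
    """x: N x d (patches by channels); w_patch: N x N; w1: d x h; w2: h x d."""
    # (i) Cross-patch linear layer: the same N x N mixing is applied to every channel.
    x = add(x, matmul(w_patch, x))
    # (ii) Two-layer feed-forward network over channels, applied to each patch separately.
    hidden = [[gelu(value) for value in row] for row in matmul(x, w1)]
    return add(x, matmul(hidden, w2))

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]          # 3 patches, 2 channels
zeros_patch = [[0.0] * 3 for _ in range(3)]
identity_patch = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
w1 = [[0.0] * 4 for _ in range(2)]
w2 = [[0.0] * 2 for _ in range(4)]
passthrough = resmlp_block(x, zeros_patch, w1, w2)   # residual paths only
doubled = resmlp_block(x, identity_patch, w1, w2)    # x + I x from the first sublayer
```

With all weights at zero the block returns its input unchanged, which shows the residual structure the abstract refers to.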

Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples

Apr 28, 2021

Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, Michael Rabbat

Abstract: This paper proposes a novel method of learning by predicting view assignments with support samples (PAWS). The method trains a model to minimize a consistency loss, which ensures that different views of the same unlabeled instance are assigned similar pseudo-labels. The pseudo-labels are generated non-parametrically, by comparing the representations of the image views to those of a set of randomly sampled labeled images. The distance between the view representations and labeled representations is used to provide a weighting over class labels, which we interpret as a soft pseudo-label. By non-parametrically incorporating labeled samples in this way, PAWS extends the distance-metric loss used in self-supervised methods such as BYOL and SwAV to the semi-supervised setting. Despite the simplicity of the approach, PAWS outperforms other semi-supervised methods across architectures, setting a new state-of-the-art for a ResNet-50 on ImageNet trained with either 10% or 1% of the labels, reaching 75.5% and 66.5% top-1 respectively. PAWS requires 4x to 12x less training than the previous best methods.
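
The non-parametric pseudo-labelling step can be illustrated as a similarity-weighted vote over labeled support samples. The sketch below assumes cosine similarity, a softmax with a fixed temperature, and a plain cross-entropy as the consistency loss; these choices and all names are illustrative assumptions, and any sharpening of the targets is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_pseudo_label(view, support, support_labels, num_classes, temperature=0.1):
    """Weight each support sample by similarity to the view, then sum weights per class."""
    weights = softmax([cosine(view, s) / temperature for s in support])
    probs = [0.0] * num_classes
    for w, label in zip(weights, support_labels):
        probs[label] += w
    return probs

def consistency_loss(target, pred, eps=1e-12):
    """Cross-entropy between the pseudo-labels of two views of the same image."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

support = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
support_labels = [0, 1, 0]
probs = soft_pseudo_label([1.0, 0.0], support, support_labels, num_classes=2)
loss_same = consistency_loss(probs, probs)
loss_uninformed = consistency_loss(probs, [0.5, 0.5])
```

No classifier weights are involved in producing `probs`; the class distribution comes entirely from distances to the labeled support set.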
