Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tinne Tuytelaars

Two Complementary Perspectives to Continual Learning: Ask Not Only What to Optimize, But Also How

Nov 08, 2023

Timm Hess, Tinne Tuytelaars, Gido M. van de Ven

Abstract:Recent years have seen considerable progress in the continual training of deep neural networks, predominantly thanks to approaches that add replay or regularization terms to the loss function to approximate the joint loss over all tasks so far. However, we show that even with a perfect approximation to the joint loss, these approaches still suffer from temporary but substantial forgetting when starting to train on a new task. Motivated by this 'stability gap', we propose that continual learning strategies should focus not only on the optimization objective, but also on the way this objective is optimized. While there is some continual learning work that alters the optimization trajectory (e.g., using gradient projection techniques), this line of research is positioned as alternative to improving the optimization objective, while we argue it should be complementary. To evaluate the merits of our proposition, we plan to combine replay-approximated joint objectives with gradient projection-based optimization routines to test whether the addition of the latter provides benefits in terms of (1) alleviating the stability gap, (2) increasing the learning efficiency and (3) improving the final learning outcome.

* Pre-registered report, accepted at the 1st ContinualAI Unconference. Full paper with the results of the proposed experiment is expected to follow by June 2024

Via

Access Paper or Ask Questions

Revisiting Evaluation Metrics for Semantic Segmentation: Optimization and Evaluation of Fine-grained Intersection over Union

Oct 30, 2023

Zifu Wang, Maxim Berman, Amal Rannen-Triki, Philip H. S. Torr, Devis Tuia, Tinne Tuytelaars, Luc Van Gool, Jiaqian Yu, Matthew B. Blaschko

Abstract:Semantic segmentation datasets often exhibit two types of imbalance: \textit{class imbalance}, where some classes appear more frequently than others and \textit{size imbalance}, where some objects occupy more pixels than others. This causes traditional evaluation metrics to be biased towards \textit{majority classes} (e.g. overall pixel-wise accuracy) and \textit{large objects} (e.g. mean pixel-wise accuracy and per-dataset mean intersection over union). To address these shortcomings, we propose the use of fine-grained mIoUs along with corresponding worst-case metrics, thereby offering a more holistic evaluation of segmentation techniques. These fine-grained metrics offer less bias towards large objects, richer statistical information, and valuable insights into model and dataset auditing. Furthermore, we undertake an extensive benchmark study, where we train and evaluate 15 modern neural networks with the proposed metrics on 12 diverse natural and aerial segmentation datasets. Our benchmark study highlights the necessity of not basing evaluations on a single metric and confirms that fine-grained mIoUs reduce the bias towards large objects. Moreover, we identify the crucial role played by architecture designs and loss functions, which lead to best practices in optimizing fine-grained metrics. The code is available at \href{https://github.com/zifuwanggg/JDTLosses}{https://github.com/zifuwanggg/JDTLosses}.

* NeurIPS 2023

Via

Access Paper or Ask Questions

CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping

Oct 11, 2023

Tim Lebailly, Thomas Stegmüller, Behzad Bozorgtabar, Jean-Philippe Thiran, Tinne Tuytelaars

Figure 1 for CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping

Figure 2 for CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping

Figure 3 for CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping

Figure 4 for CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping

Abstract:Leveraging nearest neighbor retrieval for self-supervised representation learning has proven beneficial with object-centric images. However, this approach faces limitations when applied to scene-centric datasets, where multiple objects within an image are only implicitly captured in the global representation. Such global bootstrapping can lead to undesirable entanglement of object representations. Furthermore, even object-centric datasets stand to benefit from a finer-grained bootstrapping approach. In response to these challenges, we introduce a novel Cross-Image Object-Level Bootstrapping method tailored to enhance dense visual representation learning. By employing object-level nearest neighbor bootstrapping throughout the training, CrIBo emerges as a notably strong and adequate candidate for in-context learning, leveraging nearest neighbor retrieval at test time. CrIBo shows state-of-the-art performance on the latter task while being highly competitive in more standard downstream segmentation tasks. Our code and pretrained models will be publicly available upon acceptance.

Via

Access Paper or Ask Questions

Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels

Sep 10, 2023

Bo Wan, Tinne Tuytelaars

Figure 1 for Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels

Figure 2 for Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels

Figure 3 for Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels

Figure 4 for Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels

Abstract:In this paper, we investigate the task of zero-shot human-object interaction (HOI) detection, a novel paradigm for identifying HOIs without the need for task-specific annotations. To address this challenging task, we employ CLIP, a large-scale pre-trained vision-language model (VLM), for knowledge distillation on multiple levels. Specifically, we design a multi-branch neural network that leverages CLIP for learning HOI representations at various levels, including global images, local union regions encompassing human-object pairs, and individual instances of humans or objects. To train our model, CLIP is utilized to generate HOI scores for both global images and local union regions that serve as supervision signals. The extensive experiments demonstrate the effectiveness of our novel multi-level CLIP knowledge integration strategy. Notably, the model achieves strong performance, which is even comparable with some fully-supervised and weakly-supervised methods on the public HICO-DET benchmark.

Via

Access Paper or Ask Questions

Ref-DVGO: Reflection-Aware Direct Voxel Grid Optimization for an Improved Quality-Efficiency Trade-Off in Reflective Scene Reconstruction

Aug 21, 2023

Georgios Kouros, Minye Wu, Shubham Shrivastava, Sushruth Nagesh, Punarjay Chakravarty, Tinne Tuytelaars

Figure 1 for Ref-DVGO: Reflection-Aware Direct Voxel Grid Optimization for an Improved Quality-Efficiency Trade-Off in Reflective Scene Reconstruction

Figure 2 for Ref-DVGO: Reflection-Aware Direct Voxel Grid Optimization for an Improved Quality-Efficiency Trade-Off in Reflective Scene Reconstruction

Figure 3 for Ref-DVGO: Reflection-Aware Direct Voxel Grid Optimization for an Improved Quality-Efficiency Trade-Off in Reflective Scene Reconstruction

Figure 4 for Ref-DVGO: Reflection-Aware Direct Voxel Grid Optimization for an Improved Quality-Efficiency Trade-Off in Reflective Scene Reconstruction

Abstract:Neural Radiance Fields (NeRFs) have revolutionized the field of novel view synthesis, demonstrating remarkable performance. However, the modeling and rendering of reflective objects remain challenging problems. Recent methods have shown significant improvements over the baselines in handling reflective scenes, albeit at the expense of efficiency. In this work, we aim to strike a balance between efficiency and quality. To this end, we investigate an implicit-explicit approach based on conventional volume rendering to enhance the reconstruction quality and accelerate the training and rendering processes. We adopt an efficient density-based grid representation and reparameterize the reflected radiance in our pipeline. Our proposed reflection-aware approach achieves a competitive quality efficiency trade-off compared to competing methods. Based on our experimental results, we propose and discuss hypotheses regarding the factors influencing the results of density-based methods for reconstructing reflective objects. The source code is available at https://github.com/gkouros/ref-dvgo.

* 5 pages, 4 figures, 3 tables, ICCV TRICKY 2023 Workshop

Via

Access Paper or Ask Questions

Visually-Aware Context Modeling for News Image Captioning

Aug 16, 2023

Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens

Abstract:The goal of News Image Captioning is to generate an image caption according to the content of both a news article and an image. To leverage the visual information effectively, it is important to exploit the connection between the context in the articles/captions and the images. Psychological studies indicate that human faces in images draw higher attention priorities. On top of that, humans often play a central role in news stories, as also proven by the face-name co-occurrence pattern we discover in existing News Image Captioning datasets. Therefore, we design a face-naming module for faces in images and names in captions/articles to learn a better name embedding. Apart from names, which can be directly linked to an image area (faces), news image captions mostly contain context information that can only be found in the article. Humans typically address this by searching for relevant information from the article based on the image. To emulate this thought process, we design a retrieval strategy using CLIP to retrieve sentences that are semantically close to the image. We conduct extensive experiments to demonstrate the efficacy of our framework. Without using additional paired data, we establish the new state-of-the-art performance on two News Image Captioning datasets, exceeding the previous state-of-the-art by 5 CIDEr points. We will release code upon acceptance.

Via

Access Paper or Ask Questions

Multimodal Distillation for Egocentric Action Recognition

Jul 18, 2023

Gorjan Radevski, Dusan Grujicic, Marie-Francine Moens, Matthew Blaschko, Tinne Tuytelaars

Figure 1 for Multimodal Distillation for Egocentric Action Recognition

Figure 2 for Multimodal Distillation for Egocentric Action Recognition

Figure 3 for Multimodal Distillation for Egocentric Action Recognition

Figure 4 for Multimodal Distillation for Egocentric Action Recognition

Abstract:The focal point of egocentric video understanding is modelling hand-object interactions. Standard models, e.g. CNNs or Vision Transformers, which receive RGB frames as input perform well. However, their performance improves further by employing additional input modalities that provide complementary cues, such as object detections, optical flow, audio, etc. The added complexity of the modality-specific modules, on the other hand, makes these models impractical for deployment. The goal of this work is to retain the performance of such a multimodal approach, while using only the RGB frames as input at inference time. We demonstrate that for egocentric action recognition on the Epic-Kitchens and the Something-Something datasets, students which are taught by multimodal teachers tend to be more accurate and better calibrated than architecturally equivalent models trained on ground truth labels in a unimodal or multimodal fashion. We further adopt a principled multimodal knowledge distillation framework, allowing us to deal with issues which occur when applying multimodal knowledge distillation in a naive manner. Lastly, we demonstrate the achieved reduction in computational complexity, and show that our approach maintains higher performance with the reduction of the number of input views. We release our code at https://github.com/gorjanradevski/multimodal-distillation.

* Accepted at ICCV 2023; Codebase released at https://github.com/gorjanradevski/multimodal-distillation

Via

Access Paper or Ask Questions

Unbalanced Optimal Transport: A Unified Framework for Object Detection

Jul 05, 2023

Henri De Plaen, Pierre-François De Plaen, Johan A. K. Suykens, Marc Proesmans, Tinne Tuytelaars, Luc Van Gool

Abstract:During training, supervised object detection tries to correctly match the predicted bounding boxes and associated classification scores to the ground truth. This is essential to determine which predictions are to be pushed towards which solutions, or to be discarded. Popular matching strategies include matching to the closest ground truth box (mostly used in combination with anchors), or matching via the Hungarian algorithm (mostly used in anchor-free methods). Each of these strategies comes with its own properties, underlying losses, and heuristics. We show how Unbalanced Optimal Transport unifies these different approaches and opens a whole continuum of methods in between. This allows for a finer selection of the desired properties. Experimentally, we show that training an object detection model with Unbalanced Optimal Transport is able to reach the state-of-the-art both in terms of Average Precision and Average Recall as well as to provide a faster initial convergence. The approach is well suited for GPU implementation, which proves to be an advantage for large-scale models.

* Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023)

Via

Access Paper or Ask Questions

EffSeg: Efficient Fine-Grained Instance Segmentation using Structure-Preserving Sparsity

Jul 04, 2023

Cédric Picron, Tinne Tuytelaars

Figure 1 for EffSeg: Efficient Fine-Grained Instance Segmentation using Structure-Preserving Sparsity

Figure 2 for EffSeg: Efficient Fine-Grained Instance Segmentation using Structure-Preserving Sparsity

Figure 3 for EffSeg: Efficient Fine-Grained Instance Segmentation using Structure-Preserving Sparsity

Figure 4 for EffSeg: Efficient Fine-Grained Instance Segmentation using Structure-Preserving Sparsity

Abstract:Many two-stage instance segmentation heads predict a coarse 28x28 mask per instance, which is insufficient to capture the fine-grained details of many objects. To address this issue, PointRend and RefineMask predict a 112x112 segmentation mask resulting in higher quality segmentations. Both methods however have limitations by either not having access to neighboring features (PointRend) or by performing computation at all spatial locations instead of sparsely (RefineMask). In this work, we propose EffSeg performing fine-grained instance segmentation in an efficient way by using our Structure-Preserving Sparsity (SPS) method based on separately storing the active features, the passive features and a dense 2D index map containing the feature indices. The goal of the index map is to preserve the 2D spatial configuration or structure between the features such that any 2D operation can still be performed. EffSeg achieves similar performance on COCO compared to RefineMask, while reducing the number of FLOPs by 71% and increasing the FPS by 29%. Code will be released.

Via

Access Paper or Ask Questions

Continual Learning with Pretrained Backbones by Tuning in the Input Space

Jun 08, 2023

Simone Marullo, Matteo Tiezzi, Marco Gori, Stefano Melacci, Tinne Tuytelaars

Abstract:The intrinsic difficulty in adapting deep learning models to non-stationary environments limits the applicability of neural networks to real-world tasks. This issue is critical in practical supervised learning settings, such as the ones in which a pre-trained model computes projections toward a latent space where different task predictors are sequentially learned over time. As a matter of fact, incrementally fine-tuning the whole model to better adapt to new tasks usually results in catastrophic forgetting, with decreasing performance over the past experiences and losing valuable knowledge from the pre-training stage. In this paper, we propose a novel strategy to make the fine-tuning procedure more effective, by avoiding to update the pre-trained part of the network and learning not only the usual classification head, but also a set of newly-introduced learnable parameters that are responsible for transforming the input data. This process allows the network to effectively leverage the pre-training knowledge and find a good trade-off between plasticity and stability with modest computational efforts, thus especially suitable for on-the-edge settings. Our experiments on four image classification problems in a continual learning setting confirm the quality of the proposed approach when compared to several fine-tuning procedures and to popular continual learning methods.

Via

Access Paper or Ask Questions