Thomas B. Moeslund

Aalborg University

An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders

Jun 04, 2024

Scott C. Lowe, Joakim Bruslund Haurum, Sageev Oore, Thomas B. Moeslund, Graham W. Taylor

Abstract: Can pretrained models generalize to new datasets without any retraining? We deploy pretrained image models on datasets they were not trained for, and investigate whether their embeddings form meaningful clusters. Our suite of benchmarking experiments uses encoders pretrained solely on ImageNet-1k with either supervised or self-supervised training techniques, deployed on image datasets that were not seen during training, and clustered with conventional clustering algorithms. This evaluation provides new insights into the embeddings of self-supervised models, which prioritize different features than supervised models. Supervised encoders typically offer more utility than SSL encoders within the training domain, and vice versa far outside of it; however, fine-tuned encoders demonstrate the opposite trend. Clustering provides a way to evaluate the utility of self-supervised learned representations orthogonal to existing methods such as kNN. Additionally, we find the silhouette score when measured in a UMAP-reduced space is highly correlated with clustering performance, and can therefore be used as a proxy for clustering performance on data with no ground truth labels. Our code implementation is available at https://github.com/scottclowe/zs-ssl-clustering/.
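As a rough illustration of the silhouette score mentioned in the abstract, the following minimal Python sketch computes the mean silhouette coefficient for a set of labelled points. It is not taken from the authors' repository: the function name `silhouette_score`, the toy 2D points, and the use of plain Euclidean distance are assumptions, and the UMAP reduction step is omitted, so the points merely stand in for already-reduced embeddings.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    # Group point indices by their cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the members of a cluster, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        a = mean_dist(i, own)  # cohesion: mean distance within the own cluster
        b = min(mean_dist(i, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy stand-in for reduced embeddings: two well-separated groups (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # assignment matching the visible groups
bad_labels = [0, 1, 0, 1, 0, 1]   # assignment that mixes the groups

print(silhouette_score(points, good_labels))
print(silhouette_score(points, bad_labels))
```

The score needs no ground-truth labels, only the points and a cluster assignment, which is what makes it usable as a label-free proxy. Here the assignment that matches the two groups scores higher than the mixed one.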

Foundation Models for Video Understanding: A Survey

May 06, 2024

Neelu Madan, Andreas Moegelmose, Rajat Modi, Yogesh S. Rawat, Thomas B. Moeslund

Abstract: Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs achieve this by capturing robust and generic features from video data. This survey analyzes over 200 video foundational models, offering a comprehensive overview of benchmarks and evaluation metrics across 14 distinct video tasks categorized into 3 main categories. Additionally, we offer an in-depth performance analysis of these models for the 6 most common video tasks. We categorize ViFMs into three categories: 1) Image-based ViFMs, which adapt existing image models for video tasks, 2) Video-based ViFMs, which utilize video-specific encoding methods, and 3) Universal Foundational Models (UFMs), which combine multiple modalities (image, video, audio, text, etc.) within a single framework. By comparing the performance of various ViFMs on different tasks, this survey offers valuable insights into their strengths and weaknesses, guiding future advancements in video understanding. Our analysis surprisingly reveals that image-based foundation models consistently outperform video-based models on most video understanding tasks. Additionally, UFMs, which leverage diverse modalities, demonstrate superior performance on video tasks. We share the comprehensive list of ViFMs studied in this work at: https://github.com/NeeluMadan/ViFM_Survey.git

T-DEED: Temporal-Discriminability Enhancer Encoder-Decoder for Precise Event Spotting in Sports Videos

Apr 11, 2024

Artur Xarles, Sergio Escalera, Thomas B. Moeslund, Albert Clapés

Abstract: In this paper, we introduce T-DEED, a Temporal-Discriminability Enhancer Encoder-Decoder for Precise Event Spotting in sports videos. T-DEED addresses multiple challenges in the task, including the need for discriminability among frame representations, high output temporal resolution to maintain prediction precision, and the necessity to capture information at different temporal scales to handle events with varying dynamics. It tackles these challenges through its specifically designed architecture, featuring an encoder-decoder for leveraging multiple temporal scales and achieving high output temporal resolution, along with temporal modules designed to increase token discriminability. Leveraging these characteristics, T-DEED achieves SOTA performance on the FigureSkating and FineDiving datasets. Code is available at https://github.com/arturxe2/T-DEED.

OpenTrench3D: A Photogrammetric 3D Point Cloud Dataset for Semantic Segmentation of Underground Utilities

Apr 11, 2024

Lasse H. Hansen, Simon B. Jensen, Mark P. Philipsen, Andreas Møgelmose, Lars Bodum, Thomas B. Moeslund

Abstract: Identifying and classifying underground utilities is an important task for efficient and effective urban planning and infrastructure maintenance. We present OpenTrench3D, a novel and comprehensive 3D Semantic Segmentation point cloud dataset, designed to advance research and development in underground utility surveying and mapping. OpenTrench3D covers a completely novel domain for public 3D point cloud datasets and is unique in its focus, scope, and cost-effective capturing method. The dataset consists of 310 point clouds collected across 7 distinct areas. These include 5 water utility areas and 2 district heating utility areas. The inclusion of different geographical areas and main utilities (water and district heating utilities) makes OpenTrench3D particularly valuable for inter-domain transfer learning experiments. We provide benchmark results for the dataset using three state-of-the-art semantic segmentation models, PointNeXt, PointVector and PointMetaBase. Benchmarks are conducted by training on data from water areas, fine-tuning on district heating area 1 and evaluating on district heating area 2. The dataset is publicly available. With OpenTrench3D, we seek to foster innovation and progress in the field of 3D semantic segmentation in applications related to detection and documentation of underground utilities as well as in transfer learning methods in general.

ASTRA: An Action Spotting TRAnsformer for Soccer Videos

Apr 02, 2024

Artur Xarles, Sergio Escalera, Thomas B. Moeslund, Albert Clapés

Abstract: In this paper, we introduce ASTRA, a Transformer-based model designed for the task of Action Spotting in soccer matches. ASTRA addresses several challenges inherent in the task and dataset, including the requirement for precise action localization, the presence of a long-tail data distribution, non-visibility in certain actions, and inherent label noise. To do so, ASTRA incorporates (a) a Transformer encoder-decoder architecture to achieve the desired output temporal resolution and to produce precise predictions, (b) a balanced mixup strategy to handle the long-tail distribution of the data, (c) an uncertainty-aware displacement head to capture the label variability, and (d) input audio signal to enhance detection of non-visible actions. Results demonstrate the effectiveness of ASTRA, achieving a tight Average-mAP of 66.82 on the test set. Moreover, in the SoccerNet 2023 Action Spotting challenge, we secure the 3rd position with an Average-mAP of 70.21 on the challenge set.

A noisy elephant in the room: Is your out-of-distribution detector robust to label noise?

Apr 02, 2024

Galadrielle Humblot-Renaux, Sergio Escalera, Thomas B. Moeslund

Abstract: The ability to detect unfamiliar or unexpected images is essential for safe deployment of computer vision systems. In the context of classification, the task of detecting images outside of a model's training domain is known as out-of-distribution (OOD) detection. While there has been a growing research interest in developing post-hoc OOD detection methods, there has been comparably little discussion around how these methods perform when the underlying classifier is not trained on a clean, carefully curated dataset. In this work, we take a closer look at 20 state-of-the-art OOD detection methods in the (more realistic) scenario where the labels used to train the underlying classifier are unreliable (e.g. crowd-sourced or web-scraped labels). Extensive experiments across different datasets, noise types & levels, architectures and checkpointing strategies provide insights into the effect of class label noise on OOD detection, and show that poor separation between incorrectly classified ID samples vs. OOD samples is an overlooked yet important limitation of existing methods. Code: https://github.com/glhr/ood-labelnoise

* Accepted at CVPR 2024

Raw Instinct: Trust Your Classifiers and Skip the Conversion

Mar 21, 2024

Christos Kantas, Bjørk Antoniussen, Mathias V. Andersen, Rasmus Munksø, Shobhit Kotnala, Simon B. Jensen, Andreas Møgelmose, Lau Nørgaard, Thomas B. Moeslund

Abstract: Using RAW images in computer vision problems is surprisingly underexplored, considering that converting from RAW to RGB does not introduce any new capture information. In this paper, we show that a sufficiently advanced classifier can yield equivalent results on RAW input compared to RGB and present a new public dataset consisting of RAW images and the corresponding converted RGB images. Classifying images directly from RAW is attractive, as it allows for skipping the conversion to RGB, lowering computation time significantly. Two CNN classifiers are used to classify the images in both formats, confirming that classification performance can indeed be preserved. We furthermore show that the total computation time from RAW image data to classification results for RAW images can be up to 8.46 times faster than RGB. These results contribute to the evidence found in related works that using RAW images as direct input to computer vision algorithms looks very promising.

* 2023 IEEE 6th International Conference on Pattern Recognition and Artificial Intelligence (PRAI)
* https://www.kaggle.com/datasets/mathiasviborg/raw-instinct

SIDU-TXT: An XAI Algorithm for NLP with a Holistic Assessment Approach

Feb 05, 2024

Mohammad N. S. Jahromi, Satya. M. Muddamsetty, Asta Sofie Stage Jarlner, Anna Murphy Høgenhaug, Thomas Gammeltoft-Hansen, Thomas B. Moeslund

Abstract: Explainable AI (XAI) aids in deciphering 'black-box' models. While several methods have been proposed and evaluated primarily in the image domain, the exploration of explainability in the text domain remains a growing research area. In this paper, we delve into the applicability of XAI methods for the text domain. In this context, the 'Similarity Difference and Uniqueness' (SIDU) XAI method, recognized for its superior capability in localizing entire salient regions in image-based classification, is extended to textual data. The extended method, SIDU-TXT, utilizes feature activation maps from 'black-box' models to generate heatmaps at a granular, word-based level, thereby providing explanations that highlight contextually significant textual elements crucial for model predictions. Given the absence of a unified standard for assessing XAI methods, this study applies a holistic three-tiered comprehensive evaluation framework: Functionally-Grounded, Human-Grounded and Application-Grounded, to assess the effectiveness of the proposed SIDU-TXT across various experiments. We find that, in a sentiment analysis task on a movie review dataset, SIDU-TXT excels in both functionally and human-grounded evaluations, demonstrating superior performance through quantitative and qualitative analyses compared to benchmarks like Grad-CAM and LIME. In the application-grounded evaluation within the sensitive and complex legal domain of asylum decision-making, SIDU-TXT and Grad-CAM demonstrate comparable performances, each with its own set of strengths and weaknesses. However, both methods fall short of entirely fulfilling the sophisticated criteria of expert expectations, highlighting the imperative need for additional research in XAI methods suitable for such domains.

* Preprint submitted to Elsevier on Jan 5th, 2024

SoccerNet 2023 Challenges Results

Sep 12, 2023

Anthony Cioppa, Silvio Giancola, Vladimir Somers, Floriane Magera, Xin Zhou, Hassan Mkhallati, Adrien Deliège, Jan Held, Carlos Hinojosa, Amir M. Mansourian (+92 more)

Abstract: The SoccerNet 2023 challenges were the third annual video understanding challenges organized by the SoccerNet team. For this third edition, the challenges were composed of seven vision-based tasks split into three main themes. The first theme, broadcast video understanding, is composed of three high-level tasks related to describing events occurring in the video broadcasts: (1) action spotting, focusing on retrieving all timestamps related to global actions in soccer, (2) ball action spotting, focusing on retrieving all timestamps related to the soccer ball change of state, and (3) dense video captioning, focusing on describing the broadcast with natural language and anchored timestamps. The second theme, field understanding, relates to the single task of (4) camera calibration, focusing on retrieving the intrinsic and extrinsic camera parameters from images. The third and last theme, player understanding, is composed of three low-level tasks related to extracting information about the players: (5) re-identification, focusing on retrieving the same players across multiple views, (6) multiple object tracking, focusing on tracking players and the ball through unedited video streams, and (7) jersey number recognition, focusing on recognizing the jersey number of players from tracklets. Compared to the previous editions of the SoccerNet challenges, tasks (2), (3), and (7) are novel, including new annotations and data, task (4) was enhanced with more data and annotations, and task (6) now focuses on end-to-end approaches. More information on the tasks, challenges, and leaderboards is available at https://www.soccer-net.org. Baselines and development kits can be found at https://github.com/SoccerNet.

CL-MAE: Curriculum-Learned Masked Autoencoders

Aug 31, 2023

Neelu Madan, Nicolae-Catalin Ristea, Kamal Nasrollahi, Thomas B. Moeslund, Radu Tudor Ionescu

Abstract: Masked image modeling has been demonstrated as a powerful pretext task for generating robust representations that can be effectively generalized across multiple downstream tasks. Typically, this approach involves randomly masking patches (tokens) in input images, with the masking strategy remaining unchanged during training. In this paper, we propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task. We conjecture that, by gradually increasing the task complexity, the model can learn more sophisticated and transferable representations. To facilitate this, we introduce a novel learnable masking module that possesses the capability to generate masks of different complexities, and integrate the proposed module into masked autoencoders (MAE). Our module is jointly trained with the MAE, while adjusting its behavior during training, transitioning from a partner to the MAE (optimizing the same reconstruction loss) to an adversary (optimizing the opposite loss), while passing through a neutral state. The transition between these behaviors is smooth, being regulated by a factor that is multiplied with the reconstruction loss of the masking module. The resulting training procedure generates an easy-to-hard curriculum. We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE. The empirical results on five downstream tasks confirm our conjecture, demonstrating that curriculum learning can be successfully used to self-supervise masked autoencoders.
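The partner-to-adversary transition described in the abstract can be pictured with a small Python sketch. This is only one possible reading and not the authors' implementation: the abstract does not give the schedule, so the linear decay from +1 to -1, the function names `curriculum_factor` and `masking_module_loss`, and the step-based parameterization are all assumptions made for illustration.

```python
def curriculum_factor(step, total_steps):
    """Hypothetical linear schedule: +1 (partner) -> 0 (neutral) -> -1 (adversary)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp training progress to the range [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - 2.0 * progress


def masking_module_loss(reconstruction_loss, step, total_steps):
    """Loss seen by the masking module: the factor times the reconstruction loss."""
    return curriculum_factor(step, total_steps) * reconstruction_loss


# Early on the module minimizes the same loss as the MAE (easy masks);
# late in training the sign flips, so it is rewarded for masks that are hard to reconstruct.
for step in (0, 25, 50, 75, 100):
    print(step, curriculum_factor(step, 100), masking_module_loss(2.0, step, 100))
```

Under this assumed schedule, a positive factor aligns the masking module with the MAE, a factor of zero leaves it neutral, and a negative factor makes it adversarial, which yields the easy-to-hard ordering the abstract describes.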
