KTH Royal Institute of Technology, Stockholm, Sweden
Abstract:Predicting future states in uncertain environments, such as wildfire spread, medical diagnosis, or autonomous driving, requires models that can consider multiple plausible outcomes. While diffusion models can effectively learn such multi-modal distributions, naively sampling from these models is computationally inefficient, potentially requiring hundreds of samples to find low-probability modes that may still be operationally relevant. In this work, we address the challenge of sample-efficient ambiguous segmentation by evaluating several training-free sampling methods that encourage diverse predictions. We adapt two techniques, particle guidance and SPELL, originally designed for the generation of diverse natural images, to discrete segmentation tasks, and additionally propose a simple clustering-based technique. We validate these approaches on the LIDC medical dataset, a modified version of the Cityscapes dataset, and MMFire, a new simulation-based wildfire spread dataset introduced in this paper. Compared to naive sampling, these approaches increase the HM IoU* metric by up to 7.5% on MMFire and 16.4% on Cityscapes, demonstrating that training-free methods can be used to efficiently increase the sample diversity of segmentation diffusion models with little cost to image quality and runtime. Code and dataset: https://github.com/SebastianGer/wildfire-spread-scenarios
Abstract:Many segmentation tasks, such as medical image segmentation or future state prediction, are inherently ambiguous, meaning that multiple predictions are equally correct. Current methods typically rely on generative models to capture this uncertainty. However, identifying the underlying modes of the distribution with these methods is computationally expensive, requiring large numbers of samples and post-hoc clustering. In this paper, we shift the focus from stochastic sampling to the direct generation of likely outcomes. We introduce mode proposal models, a deterministic framework that efficiently produces a fixed-size set of proposal masks in a single forward pass. To handle superfluous proposals, we adapt a confidence mechanism, traditionally used in object detection, to the high-dimensional space of segmentation masks. Our approach significantly reduces inference time while achieving higher ground-truth coverage than existing generative models. Furthermore, we demonstrate that our model can be trained without knowing the full distribution of outcomes, making it applicable to real-world datasets. Finally, we show that by decomposing the velocity field of a pre-trained flow model, we can efficiently estimate prior mode probabilities for our proposals.




Abstract:Wildfire monitoring and prediction are essential for understanding wildfire behaviour. With extensive Earth observation data, these tasks can be integrated and enhanced through multi-task deep learning models. We present a comprehensive multi-temporal remote sensing dataset for active fire detection, daily wildfire monitoring, and next-day wildfire prediction. Covering wildfire events in the contiguous U.S. from January 2017 to October 2021, the dataset includes 3552 surface reflectance images and auxiliary data such as weather, topography, land cover, and fuel information, totalling 71 GB. The lifecycle of each wildfire is documented, with labels for active fires (AF) and burned areas (BA), supported by manual quality assurance of AF and BA test labels. The dataset supports three tasks: a) active fire detection, b) daily burned area mapping, and c) wildfire progression prediction. Detection tasks use pixel-wise classification of multi-spectral, multi-temporal images, while prediction tasks integrate satellite and auxiliary data to model fire dynamics. This dataset and its benchmarks provide a foundation for advancing wildfire research using deep learning.
Abstract:Geospatial Foundation Models (GFMs) have emerged as powerful tools for extracting representations from Earth observation data, but their evaluation remains inconsistent and narrow. Existing works often evaluate on suboptimal downstream datasets and tasks, that are often too easy or too narrow, limiting the usefulness of the evaluations to assess the real-world applicability of GFMs. Additionally, there is a distinct lack of diversity in current evaluation protocols, which fail to account for the multiplicity of image resolutions, sensor types, and temporalities, which further complicates the assessment of GFM performance. In particular, most existing benchmarks are geographically biased towards North America and Europe, questioning the global applicability of GFMs. To overcome these challenges, we introduce PANGAEA, a standardized evaluation protocol that covers a diverse set of datasets, tasks, resolutions, sensor modalities, and temporalities. It establishes a robust and widely applicable benchmark for GFMs. We evaluate the most popular GFMs openly available on this benchmark and analyze their performance across several domains. In particular, we compare these models to supervised baselines (e.g. UNet and vanilla ViT), and assess their effectiveness when faced with limited labeled data. Our findings highlight the limitations of GFMs, under different scenarios, showing that they do not consistently outperform supervised models. PANGAEA is designed to be highly extensible, allowing for the seamless inclusion of new datasets, models, and tasks in future research. By releasing the evaluation code and benchmark, we aim to enable other researchers to replicate our experiments and build upon our work, fostering a more principled evaluation protocol for large pre-trained geospatial models. The code is available at https://github.com/VMarsocci/pangaea-bench.
Abstract:We construct a strong baseline method for building damage detection by starting with the highly-engineered winning solution of the xView2 competition, and gradually stripping away components. This way, we obtain a much simpler method, while retaining adequate performance. We expect the simplified solution to be more widely and easily applicable. This expectation is based on the reduced complexity, as well as the fact that we choose hyperparameters based on simple heuristics, that transfer to other datasets. We then re-arrange the xView2 dataset splits such that the test locations are not seen during training, contrary to the competition setup. In this setting, we find that both the complex and the simplified model fail to generalize to unseen locations. Analyzing the dataset indicates that this failure to generalize is not only a model-based problem, but that the difficulty might also be influenced by the unequal class distributions between events. Code, including the baseline model, is available under https://github.com/PaulBorneP/Xview2_Strong_Baseline




Abstract:Domain-specific variants of contrastive learning can construct positive pairs from two distinct images, as opposed to augmenting the same image twice. Unlike in traditional contrastive methods, this can result in positive pairs not matching perfectly. Similar to false negative pairs, this could impede model performance. Surprisingly, we find that downstream semantic segmentation is either robust to the noisy pairs or even benefits from them. The experiments are conducted on the remote sensing dataset xBD, and a synthetic segmentation dataset, on which we have full control over the noise parameters. As a result, practitioners should be able to use such domain-specific contrastive methods without having to filter their positive pairs beforehand.