Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Bernt Schiele

Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking

Oct 02, 2024

Mattia Segu, Luigi Piccinelli, Siyuan Li, Yung-Hsu Yang, Bernt Schiele, Luc Van Gool

Figure 1 for Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking

Figure 2 for Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking

Figure 3 for Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking

Figure 4 for Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking

Abstract:Multiple object tracking in complex scenarios - such as coordinated dance performances, team sports, or dynamic animal groups - presents unique challenges. In these settings, objects frequently move in coordinated patterns, occlude each other, and exhibit long-term dependencies in their trajectories. However, it remains a key open research question on how to model long-range dependencies within tracklets, interdependencies among tracklets, and the associated temporal occlusions. To this end, we introduce Samba, a novel linear-time set-of-sequences model designed to jointly process multiple tracklets by synchronizing the multiple selective state-spaces used to model each tracklet. Samba autoregressively predicts the future track query for each sequence while maintaining synchronized long-term memory representations across tracklets. By integrating Samba into a tracking-by-propagation framework, we propose SambaMOTR, the first tracker effectively addressing the aforementioned issues, including long-range dependencies, tracklet interdependencies, and temporal occlusions. Additionally, we introduce an effective technique for dealing with uncertain observations (MaskObs) and an efficient training recipe to scale SambaMOTR to longer sequences. By modeling long-range dependencies and interactions among tracked objects, SambaMOTR implicitly learns to track objects accurately through occlusions without any hand-crafted heuristics. Our approach significantly surpasses prior state-of-the-art on the DanceTrack, BFT, and SportsMOT datasets.

Via

Access Paper or Ask Questions

Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs

Sep 25, 2024

Mattia Segu, Luigi Piccinelli, Siyuan Li, Luc Van Gool, Fisher Yu, Bernt Schiele

Figure 1 for Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs

Figure 2 for Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs

Figure 3 for Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs

Figure 4 for Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs

Abstract:The supervision of state-of-the-art multiple object tracking (MOT) methods requires enormous annotation efforts to provide bounding boxes for all frames of all videos, and instance IDs to associate them through time. To this end, we introduce Walker, the first self-supervised tracker that learns from videos with sparse bounding box annotations, and no tracking labels. First, we design a quasi-dense temporal object appearance graph, and propose a novel multi-positive contrastive objective to optimize random walks on the graph and learn instance similarities. Then, we introduce an algorithm to enforce mutually-exclusive connective properties across instances in the graph, optimizing the learned topology for MOT. At inference time, we propose to associate detected instances to tracklets based on the max-likelihood transition state under motion-constrained bi-directional walks. Walker is the first self-supervised tracker to achieve competitive performance on MOT17, DanceTrack, and BDD100K. Remarkably, our proposal outperforms the previous self-supervised trackers even when drastically reducing the annotation requirements by up to 400x.

* ECCV 2024

Via

Access Paper or Ask Questions

Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets

Aug 22, 2024

Wolfgang Boettcher, Lukas Hoyer, Ozan Unal, Jan Eric Lenssen, Bernt Schiele

Figure 1 for Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets

Figure 2 for Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets

Figure 3 for Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets

Figure 4 for Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets

Abstract:In this work, we introduce Scribbles for All, a label and training data generation algorithm for semantic segmentation trained on scribble labels. Training or fine-tuning semantic segmentation models with weak supervision has become an important topic recently and was subject to significant advances in model quality. In this setting, scribbles are a promising label type to achieve high quality segmentation results while requiring a much lower annotation effort than usual pixel-wise dense semantic segmentation annotations. The main limitation of scribbles as source for weak supervision is the lack of challenging datasets for scribble segmentation, which hinders the development of novel methods and conclusive evaluations. To overcome this limitation, Scribbles for All provides scribble labels for several popular segmentation datasets and provides an algorithm to automatically generate scribble labels for any dataset with dense annotations, paving the way for new insights and model advancements in the field of weakly supervised segmentation. In addition to providing datasets and algorithm, we evaluate state-of-the-art segmentation models on our datasets and show that models trained with our synthetic labels perform competitively with respect to models trained on manual labels. Thus, our datasets enable state-of-the-art research into methods for scribble-labeled semantic segmentation. The datasets, scribble generation algorithm, and baselines are publicly available at https://github.com/wbkit/Scribbles4All

* under review

Via

Access Paper or Ask Questions

MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment

Jul 31, 2024

Anurag Das, Xinting Hu, Li Jiang, Bernt Schiele

Figure 1 for MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment

Figure 2 for MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment

Figure 3 for MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment

Figure 4 for MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment

Abstract:Recent approaches have shown that large-scale vision-language models such as CLIP can improve semantic segmentation performance. These methods typically aim for pixel-level vision-language alignment, but often rely on low resolution image features from CLIP, resulting in class ambiguities along boundaries. Moreover, the global scene representations in CLIP text embeddings do not directly correlate with the local and detailed pixel-level features, making meaningful alignment more difficult. To address these limitations, we introduce MTA-CLIP, a novel framework employing mask-level vision-language alignment. Specifically, we first propose Mask-Text Decoder that enhances the mask representations using rich textual data with the CLIP language model. Subsequently, it aligns mask representations with text embeddings using Mask-to-Text Contrastive Learning. Furthermore, we introduce MaskText Prompt Learning, utilizing multiple context-specific prompts for text embeddings to capture diverse class representations across masks. Overall, MTA-CLIP achieves state-of-the-art, surpassing prior works by an average of 2.8% and 1.3% on on standard benchmark datasets, ADE20k and Cityscapes, respectively.

* accepted at ECCV 2024

Via

Access Paper or Ask Questions

Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery

Jul 19, 2024

Sukrut Rao, Sweta Mahajan, Moritz Böhle, Bernt Schiele

Figure 1 for Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery

Figure 2 for Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery

Figure 3 for Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery

Figure 4 for Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery

Abstract:Concept Bottleneck Models (CBMs) have recently been proposed to address the 'black-box' problem of deep neural networks, by first mapping images to a human-understandable concept space and then linearly combining concepts for classification. Such models typically require first coming up with a set of concepts relevant to the task and then aligning the representations of a feature extractor to map to these concepts. However, even with powerful foundational feature extractors like CLIP, there are no guarantees that the specified concepts are detectable. In this work, we leverage recent advances in mechanistic interpretability and propose a novel CBM approach -- called Discover-then-Name-CBM (DN-CBM) -- that inverts the typical paradigm: instead of pre-selecting concepts based on the downstream classification task, we use sparse autoencoders to first discover concepts learnt by the model, and then name them and train linear probes for classification. Our concept extraction strategy is efficient, since it is agnostic to the downstream task, and uses concepts already known to the model. We perform a comprehensive evaluation across multiple datasets and CLIP architectures and show that our method yields semantically meaningful concepts, assigns appropriate names to them that make them easy to interpret, and yields performant and interpretable CBMs. Code available at https://github.com/neuroexplicit-saar/discover-then-name.

* 40 pages, 21 figures, 6 tables, European Conference on Computer Vision (ECCV) 2024

Via

Access Paper or Ask Questions

Toward a Diffusion-Based Generalist for Dense Vision Tasks

Jun 29, 2024

Yue Fan, Yongqin Xian, Xiaohua Zhai, Alexander Kolesnikov, Muhammad Ferjad Naeem, Bernt Schiele, Federico Tombari

Figure 1 for Toward a Diffusion-Based Generalist for Dense Vision Tasks

Figure 2 for Toward a Diffusion-Based Generalist for Dense Vision Tasks

Figure 3 for Toward a Diffusion-Based Generalist for Dense Vision Tasks

Figure 4 for Toward a Diffusion-Based Generalist for Dense Vision Tasks

Abstract:Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown image itself can be used as a natural interface for general-purpose visual perception and demonstrated inspiring results. In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image generation and re-purpose pre-trained diffusion models for it. However, directly applying off-the-shelf latent diffusion models leads to a quantization issue. Thus, we propose to perform diffusion in pixel space and provide a recipe for finetuning pre-trained text-to-image diffusion models for dense vision tasks. In experiments, we evaluate our method on four different types of tasks and show competitive performance to the other vision generalists.

* Published at CVPR 2024 as a workshop paper

Via

Access Paper or Ask Questions

Sp2360: Sparse-view 360 Scene Reconstruction using Cascaded 2D Diffusion Priors

May 26, 2024

Soumava Paul, Christopher Wewer, Bernt Schiele, Jan Eric Lenssen

Figure 1 for Sp2360: Sparse-view 360 Scene Reconstruction using Cascaded 2D Diffusion Priors

Figure 2 for Sp2360: Sparse-view 360 Scene Reconstruction using Cascaded 2D Diffusion Priors

Figure 3 for Sp2360: Sparse-view 360 Scene Reconstruction using Cascaded 2D Diffusion Priors

Figure 4 for Sp2360: Sparse-view 360 Scene Reconstruction using Cascaded 2D Diffusion Priors

Abstract:We aim to tackle sparse-view reconstruction of a 360 3D scene using priors from latent diffusion models (LDM). The sparse-view setting is ill-posed and underconstrained, especially for scenes where the camera rotates 360 degrees around a point, as no visual information is available beyond some frontal views focused on the central object(s) of interest. In this work, we show that pretrained 2D diffusion models can strongly improve the reconstruction of a scene with low-cost fine-tuning. Specifically, we present SparseSplat360 (Sp2360), a method that employs a cascade of in-painting and artifact removal models to fill in missing details and clean novel views. Due to superior training and rendering speeds, we use an explicit scene representation in the form of 3D Gaussians over NeRF-based implicit representations. We propose an iterative update strategy to fuse generated pseudo novel views with existing 3D Gaussians fitted to the initial sparse inputs. As a result, we obtain a multi-view consistent scene representation with details coherent with the observed inputs. Our evaluation on the challenging Mip-NeRF360 dataset shows that our proposed 2D to 3D distillation algorithm considerably improves the performance of a regularized version of 3DGS adapted to a sparse-view setting and outperforms existing sparse-view reconstruction methods in 360 scene reconstruction. Qualitatively, our method generates entire 360 scenes from as few as 9 input views, with a high degree of foreground and background detail.

* 18 pages, 10 figures, 4 tables

Via

Access Paper or Ask Questions

X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization

Mar 28, 2024

Anna Kukleva, Fadime Sener, Edoardo Remelli, Bugra Tekin, Eric Sauser, Bernt Schiele, Shugao Ma

Figure 1 for X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization

Figure 2 for X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization

Figure 3 for X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization

Figure 4 for X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization

Abstract:Lately, there has been growing interest in adapting vision-language models (VLMs) to image and third-person video classification due to their success in zero-shot recognition. However, the adaptation of these models to egocentric videos has been largely unexplored. To address this gap, we propose a simple yet effective cross-modal adaptation framework, which we call X-MIC. Using a video adapter, our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space. Our novel adapter architecture retains and improves generalization of the pre-trained VLMs by disentangling learnable temporal modeling and frozen visual encoder. This results in an enhanced alignment of text embeddings to each egocentric video, leading to a significant improvement in cross-dataset generalization. We evaluate our approach on the Epic-Kitchens, Ego4D, and EGTEA datasets for fine-grained cross-dataset action generalization, demonstrating the effectiveness of our method. Code is available at https://github.com/annusha/xmic

* CVPR 2024

Via

Access Paper or Ask Questions

OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning

Mar 27, 2024

Noor Ahmed, Anna Kukleva, Bernt Schiele

Abstract:Few-Shot Class-Incremental Learning (FSCIL) introduces a paradigm in which the problem space expands with limited data. FSCIL methods inherently face the challenge of catastrophic forgetting as data arrives incrementally, making models susceptible to overwriting previously acquired knowledge. Moreover, given the scarcity of labeled samples available at any given time, models may be prone to overfitting and find it challenging to strike a balance between extensive pretraining and the limited incremental data. To address these challenges, we propose the OrCo framework built on two core principles: features' orthogonality in the representation space, and contrastive learning. In particular, we improve the generalization of the embedding space by employing a combination of supervised and self-supervised contrastive losses during the pretraining phase. Additionally, we introduce OrCo loss to address challenges arising from data limitations during incremental sessions. Through feature space perturbations and orthogonality between classes, the OrCo loss maximizes margins and reserves space for the following incremental data. This, in turn, ensures the accommodation of incoming classes in the feature space without compromising previously acquired knowledge. Our experimental results showcase state-of-the-art performance across three benchmark datasets, including mini-ImageNet, CIFAR100, and CUB datasets. Code is available at https://github.com/noorahmedds/OrCo

Via

Access Paper or Ask Questions

latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction

Mar 24, 2024

Christopher Wewer, Kevin Raj, Eddy Ilg, Bernt Schiele, Jan Eric Lenssen

Abstract:We present latentSplat, a method to predict semantic Gaussians in a 3D latent space that can be splatted and decoded by a light-weight generative 2D architecture. Existing methods for generalizable 3D reconstruction either do not enable fast inference of high resolution novel views due to slow volume rendering, or are limited to interpolation of close input views, even in simpler settings with a single central object, where 360-degree generalization is possible. In this work, we combine a regression-based approach with a generative model, moving towards both of these capabilities within the same method, trained purely on readily available real video data. The core of our method are variational 3D Gaussians, a representation that efficiently encodes varying uncertainty within a latent space consisting of 3D feature Gaussians. From these Gaussians, specific instances can be sampled and rendered via efficient Gaussian splatting and a fast, generative decoder network. We show that latentSplat outperforms previous works in reconstruction quality and generalization, while being fast and scalable to high-resolution data.

* Project website: https://geometric-rl.mpi-inf.mpg.de/latentsplat/

Via

Access Paper or Ask Questions