Markus Marks

Diffusion-Based Action Recognition Generalizes to Untrained Domains

Sep 10, 2025

Rogerio Guimaraes, Frank Xiao, Pietro Perona, Markus Marks

Abstract: Humans can recognize the same actions despite large context and viewpoint variations, such as differences between species (walking in spiders vs. horses), viewpoints (egocentric vs. third-person), and contexts (real life vs. movies). Current deep learning models struggle with such generalization. We propose using features generated by a Vision Diffusion Model (VDM), aggregated via a transformer, to achieve human-like action recognition across these challenging conditions. We find that generalization is enhanced by the use of a model conditioned on earlier timesteps of the diffusion process to highlight semantic information over pixel-level details in the extracted features. We experimentally explore the generalization properties of our approach in classifying actions across animal species, across different viewing angles, and different recording contexts. Our model sets a new state-of-the-art across all three generalization benchmarks, bringing machine action recognition closer to human-like robustness. Project page: https://www.vision.caltech.edu/actiondiff/ Code: https://github.com/frankyaoxiao/ActionDiff

SAVeD: Learning to Denoise Low-SNR Video for Improved Downstream Performance

Mar 31, 2025

Suzanne Stathatos, Michael Hobley, Markus Marks, Pietro Perona

Abstract: Foundation models excel at vision tasks in natural images but fail in low signal-to-noise ratio (SNR) videos, such as underwater sonar, ultrasound, and microscopy. We introduce Spatiotemporal Augmentations and denoising in Video for Downstream Tasks (SAVeD), a self-supervised method that denoises low-SNR sensor videos and is trained using only the raw noisy data. By leveraging differences in foreground and background motion, SAVeD enhances object visibility using an encoder-decoder with a temporal bottleneck. Our approach improves classification, detection, tracking, and counting, outperforming state-of-the-art video denoising methods with lower resource requirements. Project page: https://suzanne-stathatos.github.io/SAVeD Code page: https://github.com/suzanne-stathatos/SAVeD

Probing the Mid-level Vision Capabilities of Self-Supervised Learning

Nov 25, 2024

Xuweiyi Chen, Markus Marks, Zezhou Cheng

Abstract: Mid-level vision capabilities, such as generic object localization and 3D geometric understanding, are not only fundamental to human vision but are also crucial for many real-world applications of computer vision. These abilities emerge with minimal supervision during the early stages of human visual development. Despite their significance, current self-supervised learning (SSL) approaches are primarily designed and evaluated for high-level recognition tasks, leaving their mid-level vision capabilities largely unexamined. In this study, we introduce a suite of benchmark protocols to systematically assess mid-level vision capabilities and present a comprehensive, controlled evaluation of 22 prominent SSL models across 8 mid-level vision tasks. Our experiments reveal a weak correlation between mid-level and high-level task performance. We also identify several SSL methods with highly imbalanced performance across mid-level and high-level capabilities, as well as some that excel in both. Additionally, we investigate key factors contributing to mid-level vision performance, such as pretraining objectives and network architectures. Our study provides a holistic and timely view of what SSL models have learned, complementing existing research that primarily focuses on high-level vision tasks. We hope our findings guide future SSL research to benchmark models not only on high-level vision tasks but on mid-level as well.

* Project Page: https://midvision-probe.cs.virginia.edu/

Learning Keypoints for Multi-Agent Behavior Analysis using Self-Supervision

Sep 14, 2024

Daniel Khalil, Christina Liu, Pietro Perona, Jennifer J. Sun, Markus Marks

Abstract: The study of social interactions and collective behaviors through multi-agent video analysis is crucial in biology. While self-supervised keypoint discovery has emerged as a promising solution to reduce the need for manual keypoint annotations, existing methods often struggle with videos containing multiple interacting agents, especially those of the same species and color. To address this, we introduce B-KinD-multi, a novel approach that leverages pre-trained video segmentation models to guide keypoint discovery in multi-agent scenarios. This eliminates the need for time-consuming manual annotations on new experimental settings and organisms. Extensive evaluations demonstrate improved keypoint regression and downstream behavioral classification in videos of flies, mice, and rats. Furthermore, our method generalizes well to other species, including ants, bees, and humans, highlighting its potential for broad applications in automated keypoint annotation for multi-agent behavior analysis. Code is available at: https://danielpkhalil.github.io/B-KinD-Multi

A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification

Jul 18, 2024

Markus Marks, Manuel Knott, Neehar Kondapaneni, Elijah Cole, Thijs Defraeye, Fernando Perez-Cruz, Pietro Perona

Abstract: Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels. The model is forced to learn about the data structure or context by solving a pretext task. With SSL, models can learn from abundant and cheap unlabeled data, significantly reducing the cost of training models where labels are expensive or inaccessible. In Computer Vision, SSL is widely used as pre-training followed by a downstream task, such as supervised transfer, few-shot learning on smaller labeled data sets, and/or unsupervised clustering. Unfortunately, it is infeasible to evaluate SSL methods on all possible downstream tasks and objectively measure the quality of the learned representation. Instead, SSL methods are evaluated using in-domain evaluation protocols, such as fine-tuning, linear probing, and k-nearest neighbors (kNN). However, it is not well understood how well these evaluation protocols estimate the representation quality of a pre-trained model for different downstream tasks under different conditions, such as dataset, metric, and model architecture. We study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types. Our study includes eleven common image datasets and 26 models that were pre-trained with different SSL methods or have different model backbones. We find that in-domain linear/kNN probing protocols are, on average, the best general predictors for out-of-domain performance. We further investigate the importance of batch normalization and evaluate how robust correlations are for different kinds of dataset domain shifts. We challenge assumptions about the relationship between discriminative and generative self-supervised methods, finding that most of their performance differences can be explained by changes to model backbones.
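The kNN probing protocol named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the cosine-similarity metric, the majority vote, and the default `k` are all assumptions chosen for illustration. It assumes features have already been extracted from a frozen pre-trained backbone and are given as plain lists of floats.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def knn_probe_accuracy(train_feats, train_labels, test_feats, test_labels, k=3):
    """Hypothetical kNN probe: classify each test feature by majority vote
    among its k most cosine-similar training features, then report accuracy.
    The backbone is never updated; only its frozen features are compared."""
    train_unit = [l2_normalize(f) for f in train_feats]
    correct = 0
    for feat, label in zip(test_feats, test_labels):
        query = l2_normalize(feat)
        # Cosine similarity reduces to a dot product on unit-length vectors.
        sims = [sum(a * b for a, b in zip(query, t)) for t in train_unit]
        nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        votes = Counter(train_labels[i] for i in nearest)
        prediction = votes.most_common(1)[0][0]
        correct += int(prediction == label)
    return correct / len(test_labels)


if __name__ == "__main__":
    # Toy 2-D "features" standing in for frozen backbone outputs.
    train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    train_y = [0, 0, 1, 1]
    test_x = [[1.0, 0.2], [0.2, 1.0]]
    test_y = [0, 1]
    print(knn_probe_accuracy(train_x, train_y, test_x, test_y, k=3))
```

Because no parameters are trained, a probe of this kind is cheap to run across many models and datasets, which is one plausible reason such protocols are attractive for large comparisons like the one described above.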

Less is More: Discovering Concise Network Explanations

May 24, 2024

Neehar Kondapaneni, Markus Marks, Oisin MacAodha, Pietro Perona

Abstract: We introduce Discovering Conceptual Network Explanations (DCNE), a new approach for generating human-comprehensible visual explanations to enhance the interpretability of deep neural image classifiers. Our method automatically finds visual explanations that are critical for discriminating between classes. This is achieved by simultaneously optimizing three criteria: the explanations should be few, diverse, and human-interpretable. Our approach builds on the recently introduced Concept Relevance Propagation (CRP) explainability method. While CRP is effective at describing individual neuronal activations, it generates too many concepts, which impacts human comprehension. Instead, DCNE selects the few most important explanations. We introduce a new evaluation dataset centered on the challenging task of classifying birds, enabling us to compare the alignment of DCNE's explanations to those of human expert-defined ones. Compared to existing eXplainable Artificial Intelligence (XAI) methods, DCNE has a desirable trade-off between conciseness and completeness when summarizing network explanations. It produces 1/30 of CRP's explanations while only resulting in a slight reduction in explanation quality. DCNE represents a step forward in making neural network decisions accessible and interpretable to humans, providing a valuable tool for both researchers and practitioners in XAI and model alignment.

* 9 pages, 5 figures; ICLR Re-Align Workshop 2024; Project Page: https://www.vision.caltech.edu/dcne/; Github: https://github.com/nkondapa/DiscoveringConciseNetworkExplanations

Text-image Alignment for Diffusion-based Perception

Oct 04, 2023

Neehar Kondapaneni, Markus Marks, Manuel Knott, Rogério Guimarães, Pietro Perona

Abstract: Diffusion models are generative models with impressive text-to-image synthesis capabilities and have spurred a new wave of creative methods for classical machine learning tasks. However, the best way to harness the perceptual knowledge of these generative models for visual tasks is still an open question. Specifically, it is unclear how to use the prompting interface when applying diffusion backbones to vision tasks. We find that automatically generated captions can improve text-image alignment and significantly enhance a model's cross-attention maps, leading to better perceptual performance. Our approach improves upon the current SOTA in diffusion-based semantic segmentation on ADE20K and the current overall SOTA in depth estimation on NYUv2. Furthermore, our method generalizes to the cross-domain setting; we use model personalization and caption modifications to align our model to the target domain and find improvements over unaligned baselines. Our object detection model, trained on Pascal VOC, achieves SOTA results on Watercolor2K. Our segmentation method, trained on Cityscapes, achieves SOTA results on Dark Zurich-val and Nighttime Driving. Project page: https://www.vision.caltech.edu/tadp/

Robust Disentanglement of a Few Factors at a Time

Oct 26, 2020

Benjamin Estermann, Markus Marks, Mehmet Fatih Yanik

Abstract: Disentanglement is at the forefront of unsupervised learning, as disentangled representations of data improve generalization, interpretability, and performance in downstream tasks. Current unsupervised approaches remain inapplicable for real-world datasets since they are highly variable in their performance and fail to reach levels of disentanglement of (semi-)supervised approaches. We introduce population-based training (PBT) for improving consistency in training variational autoencoders (VAEs) and demonstrate the validity of this approach in a supervised setting (PBT-VAE). We then use Unsupervised Disentanglement Ranking (UDR) as an unsupervised heuristic to score models in our PBT-VAE training and show how models trained this way tend to consistently disentangle only a subset of the generative factors. Building on this observation, we introduce the recursive rPU-VAE approach. We train the model until convergence, remove the learned factors from the dataset, and reiterate. In doing so, we can label subsets of the dataset with the learned factors and subsequently use these labels to train one model that fully disentangles the whole dataset. With this approach, we show striking improvement in state-of-the-art unsupervised disentanglement performance and robustness across multiple datasets and metrics.

* The first two authors contributed equally. Code is available at https://github.com/besterma/robust_disentanglement
