Jan van Gemert

Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video

Jan 22, 2026

Pascal Benschop, Justin Dauwels, Jan van Gemert

Abstract:Spatial reasoning in vision language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. By releasing data and code, we aim to provide reproducible diagnostics and seed exploration of lightweight spatial priors to complement large-scale pretraining.

LayoutGKN: Graph Similarity Learning of Floor Plans

Sep 03, 2025

Casper van Engelenburg, Jan van Gemert, Seyran Khademi

Abstract: Floor plans depict building layouts and are often represented as graphs to capture the underlying spatial relationships. Comparison of these graphs is critical for applications like search, clustering, and data visualization. The most successful methods to compare graphs, i.e., graph matching networks, rely on costly intermediate cross-graph node-level interactions and are therefore slow at inference time. We introduce LayoutGKN, a more efficient approach that postpones the cross-graph node-level interactions to the end of the joint embedding architecture. We do so by using a differentiable graph kernel as a distance function on the final learned node-level embeddings. We show that LayoutGKN computes similarity comparably to or better than graph matching networks while significantly increasing the speed. Code and data are open: https://github.com/caspervanengelenburg/LayoutGKN.
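
The idea of comparing two graphs only through their final node embeddings can be sketched as follows. This is a minimal illustration, not the LayoutGKN implementation: the abstract does not say which graph kernel is used, so the RBF node kernel, the cosine-style normalization, and the names `rbf`, `graph_kernel`, and `kernel_similarity` are all assumptions made for this example.

```python
import math

def rbf(u, v, gamma=1.0):
    """RBF kernel between two node embeddings (assumed node kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def graph_kernel(nodes_a, nodes_b, gamma=1.0):
    """Sum the node kernel over all cross-graph node pairs.

    This is the only place where nodes of the two graphs interact,
    so each graph can be embedded independently beforehand.
    """
    return sum(rbf(u, v, gamma) for u in nodes_a for v in nodes_b)

def kernel_similarity(nodes_a, nodes_b, gamma=1.0):
    """Normalized kernel value in (0, 1]; equals 1 for identical node sets."""
    k_ab = graph_kernel(nodes_a, nodes_b, gamma)
    k_aa = graph_kernel(nodes_a, nodes_a, gamma)
    k_bb = graph_kernel(nodes_b, nodes_b, gamma)
    return k_ab / math.sqrt(k_aa * k_bb)

# Hypothetical 2-D node embeddings for two small floor-plan graphs.
plan_a = [[0.0, 0.0], [1.0, 0.0]]
plan_b = [[0.0, 0.0], [5.0, 5.0]]
sim_same = kernel_similarity(plan_a, plan_a)
sim_diff = kernel_similarity(plan_a, plan_b)
```

Because the cross-graph work is a single pairwise sum over precomputed embeddings, the embeddings of a database of floor plans could be cached and reused across queries, which is one plausible source of the speed-up the abstract describes.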

* BMVC (2025)

Data-Efficient Challenges in Visual Inductive Priors: A Retrospective

Jun 10, 2025

Robert-Jan Bruintjes, Attila Lengyel, Osman Semih Kayhan, Davide Zambrano, Nergis Tömen, Hadi Jamali-Rad, Jan van Gemert

Abstract:Deep Learning requires large amounts of data to train models that work well. In data-deficient settings, performance can be degraded. We investigate which Deep Learning methods benefit training models in a data-deficient setting, by organizing the "VIPriors: Visual Inductive Priors for Data-Efficient Deep Learning" workshop series, featuring four editions of data-impaired challenges. These challenges address the problem of training deep learning models for computer vision tasks with limited data. Participants are limited to training models from scratch using a low number of training samples and are not allowed to use any form of transfer learning. We aim to stimulate the development of novel approaches that incorporate prior knowledge to improve the data efficiency of deep learning models. Successful challenge entries make use of large model ensembles that mix Transformers and CNNs, as well as heavy data augmentation. Novel prior knowledge-based methods contribute to success in some entries.

Making Every Event Count: Balancing Data Efficiency and Accuracy in Event Camera Subsampling

May 27, 2025

Hesam Araghi, Jan van Gemert, Nergis Tömen

Abstract:Event cameras offer high temporal resolution and power efficiency, making them well-suited for edge AI applications. However, their high event rates present challenges for data transmission and processing. Subsampling methods provide a practical solution, but their effect on downstream visual tasks remains underexplored. In this work, we systematically evaluate six hardware-friendly subsampling methods using convolutional neural networks for event video classification on various benchmark datasets. We hypothesize that events from high-density regions carry more task-relevant information and are therefore better suited for subsampling. To test this, we introduce a simple causal density-based subsampling method, demonstrating improved classification accuracy in sparse regimes. Our analysis further highlights key factors affecting subsampling performance, including sensitivity to hyperparameters and failure cases in scenarios with large event count variance. These findings provide insights for utilization of hardware-efficient subsampling strategies that balance data efficiency and task accuracy. The code for this paper will be released at: https://github.com/hesamaraghi/event-camera-subsampling-methods.
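
A causal density-based subsampler can be sketched as below. The abstract gives no algorithmic details, so this is an illustration under stated assumptions rather than the authors' method: events are `(t, x, y)` tuples sorted by time, and the `radius`, `window`, and `min_neighbors` parameters and the keep rule are hypothetical. The only property taken from the abstract is that the rule is causal (it looks at past events only) and favours high-density regions.

```python
from collections import deque

def density_subsample(events, radius=1, window=10.0, min_neighbors=2):
    """Keep an event only if enough earlier events occurred nearby.

    events: iterable of (t, x, y) tuples sorted by timestamp t.
    An event is kept when at least `min_neighbors` past events lie within
    `radius` pixels (Chebyshev distance) and `window` time units.
    All parameters are illustrative assumptions.
    """
    recent = deque()  # past events still inside the time window
    kept = []
    for t, x, y in events:
        # Drop past events that have fallen out of the time window.
        while recent and t - recent[0][0] > window:
            recent.popleft()
        # Count past events in the spatial neighbourhood (causal: past only).
        neighbors = sum(
            1 for (_, px, py) in recent
            if abs(px - x) <= radius and abs(py - y) <= radius
        )
        if neighbors >= min_neighbors:
            kept.append((t, x, y))
        # Every event, kept or not, contributes to later density estimates.
        recent.append((t, x, y))
    return kept

# Three clustered events, one isolated event, then one more in the cluster.
events = [(0, 5, 5), (1, 5, 6), (2, 6, 5), (3, 50, 50), (4, 5, 5)]
kept = density_subsample(events)
```

In this toy stream the isolated event at `(50, 50)` is dropped, while the later events inside the cluster are kept once enough nearby history has accumulated.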

Learning to Adapt to Position Bias in Vision Transformer Classifiers

May 19, 2025

Robert-Jan Bruintjes, Jan van Gemert

Abstract: How discriminative position information is for image classification depends on the data. On the one hand, the camera position is arbitrary and objects can appear anywhere in the image, arguing for translation invariance. On the other hand, position information is key for exploiting capture/center bias and scene layout, e.g., the sky is up. We show that position bias, the degree to which a dataset is more easily solved when positional information on input features is used, plays a crucial role in the performance of Vision Transformer image classifiers. To investigate, we propose Position-SHAP, a direct measure of position bias obtained by extending SHAP to work with position embeddings. We show various levels of position bias in different datasets, and find that the optimal choice of position embedding depends on the position bias apparent in the dataset. We therefore propose Auto-PE, a single-parameter position embedding extension, which allows the position embedding to modulate its norm, enabling the unlearning of position information. Auto-PE combines with existing PEs to match or improve accuracy on classification datasets.

ARC: Anchored Representation Clouds for High-Resolution INR Classification

Mar 19, 2025

Joost Luijmes, Alexander Gielisse, Roman Knyazhitskiy, Jan van Gemert

Abstract:Implicit neural representations (INRs) encode signals in neural network weights as a memory-efficient representation, decoupling sampling resolution from the associated resource costs. Current INR image classification methods are demonstrated on low-resolution data and are sensitive to image-space transformations. We attribute these issues to the global, fully-connected MLP neural network architecture encoding of current INRs, which lack mechanisms for local representation: MLPs are sensitive to absolute image location and struggle with high-frequency details. We propose ARC: Anchored Representation Clouds, a novel INR architecture that explicitly anchors latent vectors locally in image-space. By introducing spatial structure to the latent vectors, ARC captures local image data which in our testing leads to state-of-the-art implicit image classification of both low- and high-resolution images and increased robustness against image-space translation. Code can be found at https://github.com/JLuij/anchored_representation_clouds.

* Accepted at the ICLR 2025 Workshop on Neural Network Weights as a New Data Modality

Local Attention Transformers for High-Detail Optical Flow Upsampling

Dec 09, 2024

Alexander Gielisse, Nergis Tömen, Jan van Gemert

Abstract: Most recent works on optical flow use convex upsampling as the last step to obtain high-resolution flow. In this work, we show and discuss several issues and limitations of this widely adopted convex upsampling approach. We propose a series of changes to address these issues. First, we propose to decouple the weights for the final convex upsampler, making it easier to find the correct convex combination. For the same reason, we also provide extra contextual features to the convex upsampler. Then, we increase the convex mask size by using an attention-based alternative convex upsampler: Transformers for Convex Upsampling. This upsampler is based on the observation that convex upsampling can be reformulated as attention, and we propose to use local attention masks as a drop-in replacement for convex masks to increase the mask size. We provide empirical evidence that a larger mask size increases the likelihood of the existence of the convex combination. Lastly, we propose an alternative training scheme to remove bilinear interpolation artifacts from the model output. Our proposed ideas could theoretically be applied to almost every current state-of-the-art optical flow architecture. In the FlyingChairs + FlyingThings3D training setting, we reduce the Sintel Clean training end-point-error of RAFT from 1.42 to 1.26, that of GMA from 1.31 to 1.18, and that of FlowFormer from 0.94 to 0.90, solely by adapting the convex upsampler.
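
The observation that convex upsampling can be read as attention can be illustrated for a single output pixel. The sketch below is an assumption-laden toy, not the paper's upsampler: the function names are hypothetical, the mask logits would in practice be predicted by the network, and any rescaling of flow magnitudes by the upsampling factor is omitted. It shows only that a softmax over neighbour logits yields non-negative weights summing to one, so the output is a convex combination of the neighbouring coarse flow vectors, which play the role of attention values.

```python
import math

def softmax(logits):
    """Numerically stable softmax: non-negative weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def convex_upsample_pixel(neighbor_flows, mask_logits):
    """Flow at one fine pixel as a convex combination of coarse neighbours.

    neighbor_flows: list of (u, v) coarse flow vectors (e.g. a 3x3 window).
    mask_logits: one unnormalized score per neighbour (assumed to be
    predicted by the network in a real model).
    """
    weights = softmax(mask_logits)  # attention weights over the window
    u = sum(w * f[0] for w, f in zip(weights, neighbor_flows))
    v = sum(w * f[1] for w, f in zip(weights, neighbor_flows))
    return (u, v)

# Hypothetical 3x3 neighbourhood of coarse flow vectors.
flows = [(float(i), float(-i)) for i in range(9)]
uniform = convex_upsample_pixel(flows, [0.0] * 9)            # plain average
peaked = convex_upsample_pixel(flows, [0.0] * 8 + [50.0])    # near last neighbour
```

Since the result always lies inside the convex hull of the neighbours in the window, a larger window enlarges the set of reachable outputs, which is consistent with the abstract's argument for larger masks.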

* Note: this work is an extension of my Master's thesis, available as "Optical Flow Upsamplers Ignore Details: Neighborhood Attention Transformers for Convex Upsampling"

Learning Physics From Video: Unsupervised Physical Parameter Estimation for Continuous Dynamical Systems

Oct 02, 2024

Alejandro Castañeda Garcia, Jan van Gemert, Daan Brinks, Nergis Tömen

Abstract: Extracting physical dynamical system parameters from videos is of great interest to applications in natural science and technology. State-of-the-art automatic parameter estimation from video relies on training supervised deep networks on large datasets. Such datasets require labels, which are difficult to acquire. While some unsupervised techniques, which depend on frame prediction, exist, they suffer from long training times and instability under different initializations, and are limited to hand-picked motion problems. In this work, we propose a method to estimate the physical parameters of any known, continuous governing equation from single videos; our solution is suitable for different dynamical systems beyond motion and is robust to initialization compared to previous approaches. Moreover, we remove the need for frame prediction by implementing a KL-divergence-based loss function in the latent space, which avoids convergence to trivial solutions and reduces model size and compute.

Deep activity propagation via weight initialization in spiking neural networks

Oct 01, 2024

Aurora Micheli, Olaf Booij, Jan van Gemert, Nergis Tömen

Abstract:Spiking Neural Networks (SNNs) and neuromorphic computing offer bio-inspired advantages such as sparsity and ultra-low power consumption, providing a promising alternative to conventional networks. However, training deep SNNs from scratch remains a challenge, as SNNs process and transmit information by quantizing the real-valued membrane potentials into binary spikes. This can lead to information loss and vanishing spikes in deeper layers, impeding effective training. While weight initialization is known to be critical for training deep neural networks, what constitutes an effective initial state for a deep SNN is not well-understood. Existing weight initialization methods designed for conventional networks (ANNs) are often applied to SNNs without accounting for their distinct computational properties. In this work we derive an optimal weight initialization method specifically tailored for SNNs, taking into account the quantization operation. We show theoretically that, unlike standard approaches, this method enables the propagation of activity in deep SNNs without loss of spikes. We demonstrate this behavior in numerical simulations of SNNs with up to 100 layers across multiple time steps. We present an in-depth analysis of the numerical conditions, regarding layer width and neuron hyperparameters, which are necessary to accurately apply our theoretical findings. Furthermore, our experiments on MNIST demonstrate higher accuracy and faster convergence when using the proposed weight initialization scheme. Finally, we show that the newly introduced weight initialization is robust against variations in several network and neuron hyperparameters.

Pushing the boundaries of event subsampling in event-based video classification using CNNs

Sep 13, 2024

Hesam Araghi, Jan van Gemert, Nergis Tömen

Abstract:Event cameras offer low-power visual sensing capabilities ideal for edge-device applications. However, their high event rate, driven by high temporal details, can be restrictive in terms of bandwidth and computational resources. In edge AI applications, determining the minimum amount of events for specific tasks can allow reducing the event rate to improve bandwidth, memory, and processing efficiency. In this paper, we study the effect of event subsampling on the accuracy of event data classification using convolutional neural network (CNN) models. Surprisingly, across various datasets, the number of events per video can be reduced by an order of magnitude with little drop in accuracy, revealing the extent to which we can push the boundaries in accuracy vs. event rate trade-off. Additionally, we also find that lower classification accuracy in high subsampling rates is not solely attributable to information loss due to the subsampling of the events, but that the training of CNNs can be challenging in highly subsampled scenarios, where the sensitivity to hyperparameters increases. We quantify training instability across multiple event-based classification datasets using a novel metric for evaluating the hyperparameter sensitivity of CNNs in different subsampling settings. Finally, we analyze the weight gradients of the network to gain insight into this instability.
