Tin Stribor Sohn

Improving Driver Satisfaction with a Driving Function Learning from Implicit Human Feedback -- a Test Group Study

Feb 14, 2026

Robin Schwager, Andrea Anastasio, Simon Hartmann, Andreas Ronellenfitsch, Michael Grimm, Tim Brühl, Tin Stribor Sohn, Tim Dieter Eberhardt, Sören Hohmann

Abstract: During the use of advanced driver assistance systems, drivers frequently intervene in the active driving function and adjust the system's behavior to their personal wishes. These active driver-initiated takeovers contain feedback about deviations in the driving function's behavior from the drivers' personal preferences. This feedback should be utilized to optimize and personalize the driving function's behavior. In this work, the adjustment of the speed profile of a Predictive Longitudinal Driving Function (PLDF) on a pre-defined route is highlighted. An algorithm is introduced which iteratively adjusts the PLDF's speed profile by taking into account both the original speed profile of the PLDF and the driver demonstration. This approach allows for personalization in a traded control scenario during active use of the PLDF. The applicability of the proposed algorithm is tested in a driving simulator-based test group study with 43 participants. The study finds a significant increase in driver satisfaction and a significant reduction in the intervention frequency when using the proposed adaptive PLDF. Additionally, feedback from the participants was gathered to identify further potential for optimizing the proposed system.
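The abstract does not state the update rule, so the following is only a minimal sketch of one way an iterative adjustment could combine the existing speed profile with a driver demonstration. It assumes a convex blend applied only at route points where the driver took over. The function name `update_speed_profile`, the `blend` parameter, and the sample speeds are hypothetical and do not come from the paper.

```python
from typing import List, Optional


def update_speed_profile(
    current: List[float],
    demonstration: List[Optional[float]],
    blend: float = 0.5,
) -> List[float]:
    """Move the current profile toward the driven speed at takeover points.

    ASSUMPTION: a convex blend stands in for the paper's actual update rule.
    `demonstration[i]` is the driven speed at route point i, or None where
    the driver did not intervene.
    """
    if len(current) != len(demonstration):
        raise ValueError("profiles must cover the same route points")
    if not 0.0 <= blend <= 1.0:
        raise ValueError("blend must lie in [0, 1]")

    updated = []
    for planned, driven in zip(current, demonstration):
        if driven is None:
            # No takeover here: keep the planned speed unchanged.
            updated.append(planned)
        else:
            # Takeover: weight the planned speed against the demonstrated one.
            updated.append((1.0 - blend) * planned + blend * driven)
    return updated


# Hypothetical data: speeds in km/h at four points of a fixed route.
profile = [50.0, 50.0, 70.0, 70.0]
drive = [None, 46.0, 60.0, None]

# Repeating the same demonstration on three drives moves the profile
# step by step toward the driver's preference.
for _ in range(3):
    profile = update_speed_profile(profile, drive, blend=0.5)
```

With `blend=0.5`, each drive halves the remaining gap between the planned and the demonstrated speed at the takeover points, while points without a takeover keep the original profile.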

Embodied4C: Measuring What Matters for Embodied Vision-Language Navigation

Dec 19, 2025

Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax

Abstract: Vision-language navigation requires agents to reason and act under constraints of embodiment. While vision-language models (VLMs) demonstrate strong generalization, current benchmarks provide limited understanding of how embodiment -- i.e., the choice of physical platform, sensor configuration, and modality alignment -- influences perception, reasoning, and control. We introduce Embodied4C, a closed-loop benchmark designed as a Turing test for embodied reasoning. The benchmark evaluates the core embodied capabilities of VLMs across three heterogeneous embodiments -- autonomous vehicles, aerial drones, and robotic manipulators -- through approximately 1.1K one-shot reasoning questions and 58 goal-directed navigation tasks. These tasks jointly assess four foundational dimensions: semantic, spatial, temporal, and physical reasoning. Each embodiment presents dynamic sensor configurations and environment variations to probe generalization beyond platform-specific adaptation. To prevent embodiment overfitting, Embodied4C integrates domain-far queries targeting abstract and cross-context reasoning. Comprehensive evaluation across ten state-of-the-art VLMs and four embodied control baselines shows that cross-modal alignment and instruction tuning matter more than scale, while spatial and temporal reasoning remain the primary bottleneck for reliable embodied competence.

SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning

Dec 18, 2025

Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax

Abstract: Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as a 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing global reference alignment and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, setting new state-of-the-art performance in several settings and highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.

R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space

Dec 17, 2025

Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax

Abstract: Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
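To make the idea of retrieval by semantic, spatial, and temporal keys concrete, here is a minimal sketch of a memory of observations anchored in metric space and time, filtered by all three keys. It is an illustration under stated assumptions, not the R4 implementation: the `Observation` record, the `retrieve` function, and the sample data are hypothetical, and a plain substring match stands in for whatever semantic matching the paper actually uses.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class Observation:
    """Hypothetical memory entry: a description anchored in space and time."""
    description: str
    position: Point    # metres in a shared map frame (assumed)
    timestamp: float   # seconds since the start of the episode (assumed)


def retrieve(
    memory: List[Observation],
    keyword: Optional[str] = None,
    near: Optional[Point] = None,
    radius: float = float("inf"),
    time_window: Optional[Tuple[float, float]] = None,
) -> List[Observation]:
    """Return observations matching every key that was supplied, newest first."""
    hits = []
    for obs in memory:
        # Semantic key (ASSUMPTION: case-insensitive substring match).
        if keyword is not None and keyword.lower() not in obs.description.lower():
            continue
        # Spatial key: Euclidean distance to a query point.
        if near is not None and dist(obs.position, near) > radius:
            continue
        # Temporal key: inclusive time interval.
        if time_window is not None:
            start, end = time_window
            if not start <= obs.timestamp <= end:
                continue
        hits.append(obs)
    return sorted(hits, key=lambda o: o.timestamp, reverse=True)


memory = [
    Observation("red car parked at the curb", (2.0, 1.0, 0.0), 10.0),
    Observation("pedestrian crossing", (2.5, 1.5, 0.0), 12.0),
    Observation("red car turning left", (40.0, 3.0, 0.0), 95.0),
]

# "Which red car was within 5 m of the origin during the first minute?"
hits = retrieve(memory, keyword="red car", near=(0.0, 0.0, 0.0),
                radius=5.0, time_window=(0.0, 60.0))
```

In this sketch the retrieved observations would then be passed to the VLM as context. Omitting a key simply disables that filter, so the same function also answers purely spatial or purely temporal queries.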

A Framework for a Capability-driven Evaluation of Scenario Understanding for Multimodal Large Language Models in Autonomous Driving

Mar 14, 2025

Tin Stribor Sohn, Philipp Reis, Maximilian Dillitzer, Johannes Bach, Jason J. Corso, Eric Sax

Abstract: Multimodal large language models (MLLMs) hold the potential to enhance autonomous driving by combining domain-independent world knowledge with context-specific language guidance. Their integration into autonomous driving systems shows promising results in isolated proof-of-concept applications, while their performance is evaluated on selective singular aspects of perception, reasoning, or planning. To leverage their full potential, a systematic framework for evaluating MLLMs in the context of autonomous driving is required. This paper proposes a holistic framework for a capability-driven evaluation of MLLMs in autonomous driving. The framework structures scenario understanding along four core capability dimensions: semantic, spatial, temporal, and physical. These dimensions are derived from the general requirements of autonomous driving systems, human driver cognition, and language-based reasoning. The framework further organises the domain into context layers, processing modalities, and downstream tasks such as language-based interaction and decision-making. To illustrate the framework's applicability, two exemplary traffic scenarios are analysed, grounding the proposed dimensions in realistic driving situations. The framework provides a foundation for the structured evaluation of MLLMs' potential for scenario understanding in autonomous driving.

* Submitted to IEEE IAVVC 2025, Under Review

An Analysis of Driver-Initiated Takeovers during Assisted Driving and their Effect on Driver Satisfaction

Apr 19, 2024

Robin Schwager, Michael Grimm, Xin Liu, Lukas Ewecker, Tim Bruehl, Tin Stribor Sohn, Soeren Hohmann

Abstract: During the use of Advanced Driver Assistance Systems (ADAS), drivers can intervene in the active function and take back control for various reasons. However, the specific reasons for driver-initiated takeovers in naturalistic driving are still not well understood. To learn more about the reasons behind these takeovers, a test group study was conducted in which 17 participants used a predictive longitudinal driving function for their daily commutes and annotated the reasons for their takeovers during active function use. In this paper, the recorded takeovers are analyzed and the different reasons for them are highlighted. The results show that the reasons can be divided into three main categories. The most common category consists of takeovers which aim to adjust the behavior of the ADAS within its Operational Design Domain (ODD) in order to better match the drivers' personal preferences. Other reasons include takeovers due to leaving the ADAS's ODD and corrections of incorrect sensing state information. Using the questionnaire results of the test group study, it was found that the number and frequency of takeovers, especially within the ADAS's ODD, have a significant negative impact on driver satisfaction. Therefore, driver satisfaction with the ADAS could be increased by adapting its behavior to the drivers' wishes and thereby lowering the number of takeovers within the ODD. The information contained in the takeover behavior of the drivers could be used as feedback for the ADAS. Finally, it is shown that there are considerable differences in the takeover behavior of different drivers, which shows a need for ADAS individualization.

* Submitted to and accepted by IV 2024. Accepted paper version with minor changes before incorporating peer reviews
