Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Athul M. Mathew

GazeVLM: A Vision-Language Model for Multi-Task Gaze Understanding

Nov 09, 2025

Athul M. Mathew, Haithem Hermassi, Thariq Khalid, Arshad Ali Khan, Riad Souissi

Figure 1 for GazeVLM: A Vision-Language Model for Multi-Task Gaze Understanding

Figure 2 for GazeVLM: A Vision-Language Model for Multi-Task Gaze Understanding

Figure 3 for GazeVLM: A Vision-Language Model for Multi-Task Gaze Understanding

Figure 4 for GazeVLM: A Vision-Language Model for Multi-Task Gaze Understanding

Abstract:Gaze understanding unifies the detection of people, their gaze targets, and objects of interest into a single framework, offering critical insight into visual attention and intent estimation. Although prior research has modelled gaze cues in visual scenes, a unified system is still needed for gaze understanding using both visual and language prompts. This paper introduces GazeVLM, a novel Vision-Language Model (VLM) for multi-task gaze understanding in images, addressing person detection, gaze target detection, and gaze object identification. While other transformer-based methods exist for gaze analysis, GazeVLM represents, to our knowledge, the first application of a VLM to these combined tasks, allowing for selective execution of each task. Through the integration of visual (RGB and depth) and textual modalities, our ablation study on visual input combinations revealed that a fusion of RGB images with HHA-encoded depth maps, guided by text prompts, yields superior performance. We also introduce an object-level gaze detection metric for gaze object identification ($AP_{ob}$). Through experiments, GazeVLM demonstrates significant improvements, notably achieving state-of-the-art evaluation scores on GazeFollow and VideoAttentionTarget datasets.

Via

Access Paper or Ask Questions

Leveraging Multi-Modal Saliency and Fusion for Gaze Target Detection

Apr 27, 2025

Athul M. Mathew, Arshad Ali Khan, Thariq Khalid, Faroq AL-Tam, Riad Souissi

Abstract:Gaze target detection (GTD) is the task of predicting where a person in an image is looking. This is a challenging task, as it requires the ability to understand the relationship between the person's head, body, and eyes, as well as the surrounding environment. In this paper, we propose a novel method for GTD that fuses multiple pieces of information extracted from an image. First, we project the 2D image into a 3D representation using monocular depth estimation. We then extract a depth-infused saliency module map, which highlights the most salient (\textit{attention-grabbing}) regions in image for the subject in consideration. We also extract face and depth modalities from the image, and finally fuse all the extracted modalities to identify the gaze target. We quantitatively evaluated our method, including the ablation analysis on three publicly available datasets, namely VideoAttentionTarget, GazeFollow and GOO-Real, and showed that it outperforms other state-of-the-art methods. This suggests that our method is a promising new approach for GTD.

* accepted at NeurIPS 2023 Gaze Meets ML Workshop

Via

Access Paper or Ask Questions

Ego Vehicle Speed Estimation using 3D Convolution with Masked Attention

Dec 11, 2022

Athul M. Mathew, Thariq Khalid

Abstract:Speed estimation of an ego vehicle is crucial to enable autonomous driving and advanced driver assistance technologies. Due to functional and legacy issues, conventional methods depend on in-car sensors to extract vehicle speed through the Controller Area Network bus. However, it is desirable to have modular systems that are not susceptible to external sensors to execute perception tasks. In this paper, we propose a novel 3D-CNN with masked-attention architecture to estimate ego vehicle speed using a single front-facing monocular camera. To demonstrate the effectiveness of our method, we conduct experiments on two publicly available datasets, nuImages and KITTI. We also demonstrate the efficacy of masked-attention by comparing our method with a traditional 3D-CNN.

* 13 pages, 6 figures

Via

Access Paper or Ask Questions