Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Gregory Zelinsky

LVLMs and Humans Ground Differently in Referential Communication

Jan 28, 2026

Peter Zeng, Weiling Li, Amie Paige, Zhengxiang Wang, Panagiotis Kaliosis, Dimitris Samaras, Gregory Zelinsky, Susan Brennan, Owen Rambow

Abstract:For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. Here, we present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact with multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We release the online pipeline for data collection, the tools and analyses for accuracy, efficiency, and lexical overlap, and a corpus of 356 dialogues (89 pairs over 4 rounds each) that unmasks LVLMs' limitations in interactively resolving referring expressions, a crucial skill that underlies human language use.

* 24 pages, 16 figures, preprint

Via

Access Paper or Ask Questions

Few-shot Personalized Scanpath Prediction

Apr 07, 2025

Ruoyu Xue, Jingyi Xu, Sounak Mondal, Hieu Le, Gregory Zelinsky, Minh Hoai, Dimitris Samaras

Abstract:A personalized model for scanpath prediction provides insights into the visual preferences and attention patterns of individual subjects. However, existing methods for training scanpath prediction models are data-intensive and cannot be effectively personalized to new individuals with only a few available examples. In this paper, we propose few-shot personalized scanpath prediction task (FS-PSP) and a novel method to address it, which aims to predict scanpaths for an unseen subject using minimal support data of that subject's scanpath behavior. The key to our method's adaptability is the Subject-Embedding Network (SE-Net), specifically designed to capture unique, individualized representations for each subject's scanpaths. SE-Net generates subject embeddings that effectively distinguish between subjects while minimizing variability among scanpaths from the same individual. The personalized scanpath prediction model is then conditioned on these subject embeddings to produce accurate, personalized results. Experiments on multiple eye-tracking datasets demonstrate that our method excels in FS-PSP settings and does not require any fine-tuning steps at test time. Code is available at: https://github.com/cvlab-stonybrook/few-shot-scanpath

* Accepted by CVPR 2025,20 pages, 10 figures

Via

Access Paper or Ask Questions

Look Hear: Gaze Prediction for Speech-directed Human Attention

Jul 28, 2024

Sounak Mondal, Seoyoung Ahn, Zhibo Yang, Niranjan Balasubramanian, Dimitris Samaras, Gregory Zelinsky, Minh Hoai

Figure 1 for Look Hear: Gaze Prediction for Speech-directed Human Attention

Figure 2 for Look Hear: Gaze Prediction for Speech-directed Human Attention

Figure 3 for Look Hear: Gaze Prediction for Speech-directed Human Attention

Figure 4 for Look Hear: Gaze Prediction for Speech-directed Human Attention

Abstract:For computer systems to effectively interact with humans using spoken language, they need to understand how the words being generated affect the users' moment-by-moment attention. Our study focuses on the incremental prediction of attention as a person is seeing an image and hearing a referring expression defining the object in the scene that should be fixated by gaze. To predict the gaze scanpaths in this incremental object referral task, we developed the Attention in Referral Transformer model or ART, which predicts the human fixations spurred by each word in a referring expression. ART uses a multimodal transformer encoder to jointly learn gaze behavior and its underlying grounding tasks, and an autoregressive transformer decoder to predict, for each word, a variable number of fixations based on fixation history. To train ART, we created RefCOCO-Gaze, a large-scale dataset of 19,738 human gaze scanpaths, corresponding to 2,094 unique image-expression pairs, from 220 participants performing our referral task. In our quantitative and qualitative analyses, ART not only outperforms existing methods in scanpath prediction, but also appears to capture several human attention patterns, such as waiting, scanning, and verification.

* Accepted for ECCV 2024

Via

Access Paper or Ask Questions

Affinity-based Attention in Self-supervised Transformers Predicts Dynamics of Object Grouping in Humans

Jun 01, 2023

Hossein Adeli, Seoyoung Ahn, Nikolaus Kriegeskorte, Gregory Zelinsky

Figure 1 for Affinity-based Attention in Self-supervised Transformers Predicts Dynamics of Object Grouping in Humans

Figure 2 for Affinity-based Attention in Self-supervised Transformers Predicts Dynamics of Object Grouping in Humans

Figure 3 for Affinity-based Attention in Self-supervised Transformers Predicts Dynamics of Object Grouping in Humans

Figure 4 for Affinity-based Attention in Self-supervised Transformers Predicts Dynamics of Object Grouping in Humans

Abstract:The spreading of attention has been proposed as a mechanism for how humans group features to segment objects. However, such a mechanism has not yet been implemented and tested in naturalistic images. Here, we leverage the feature maps from self-supervised vision Transformers and propose a model of human object-based attention spreading and segmentation. Attention spreads within an object through the feature affinity signal between different patches of the image. We also collected behavioral data on people grouping objects in natural images by judging whether two dots are on the same object or on two different objects. We found that our models of affinity spread that were built on feature maps from the self-supervised Transformers showed significant improvement over baseline and CNN based models on predicting reaction time patterns of humans, despite not being trained on the task or with any other object labels. Our work provides new benchmarks for evaluating models of visual representation learning including Transformers.

Via

Access Paper or Ask Questions

Predicting Human Attention using Computational Attention

Apr 04, 2023

Zhibo Yang, Sounak Mondal, Seoyoung Ahn, Gregory Zelinsky, Minh Hoai, Dimitris Samaras

Figure 1 for Predicting Human Attention using Computational Attention

Figure 2 for Predicting Human Attention using Computational Attention

Figure 3 for Predicting Human Attention using Computational Attention

Figure 4 for Predicting Human Attention using Computational Attention

Abstract:Most models of visual attention are aimed at predicting either top-down or bottom-up control, as studied using different visual search and free-viewing tasks. We propose Human Attention Transformer (HAT), a single model predicting both forms of attention control. HAT is the new state-of-the-art (SOTA) in predicting the scanpath of fixations made during target-present and target-absent search, and matches or exceeds SOTA in the prediction of taskless free-viewing fixation scanpaths. HAT achieves this new SOTA by using a novel transformer-based architecture and a simplified foveated retina that collectively create a spatio-temporal awareness akin to the dynamic visual working memory of humans. Unlike previous methods that rely on a coarse grid of fixation cells and experience information loss due to fixation discretization, HAT features a dense-prediction architecture and outputs a dense heatmap for each fixation, thus avoiding discretizing fixations. HAT sets a new standard in computational attention, which emphasizes both effectiveness and generality. HAT's demonstrated scope and applicability will likely inspire the development of new attention models that can better predict human behavior in various attention-demanding scenarios.

Via

Access Paper or Ask Questions

Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention

Mar 27, 2023

Sounak Mondal, Zhibo Yang, Seoyoung Ahn, Dimitris Samaras, Gregory Zelinsky, Minh Hoai

Figure 1 for Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention

Figure 2 for Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention

Figure 3 for Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention

Figure 4 for Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention

Abstract:Predicting human gaze is important in Human-Computer Interaction (HCI). However, to practically serve HCI applications, gaze prediction models must be scalable, fast, and accurate in their spatial and temporal gaze predictions. Recent scanpath prediction models focus on goal-directed attention (search). Such models are limited in their application due to a common approach relying on trained target detectors for all possible objects, and the availability of human gaze data for their training (both not scalable). In response, we pose a new task called ZeroGaze, a new variant of zero-shot learning where gaze is predicted for never-before-searched objects, and we develop a novel model, Gazeformer, to solve the ZeroGaze problem. In contrast to existing methods using object detector modules, Gazeformer encodes the target using a natural language model, thus leveraging semantic similarities in scanpath prediction. We use a transformer-based encoder-decoder architecture because transformers are particularly useful for generating contextual representations. Gazeformer surpasses other models by a large margin on the ZeroGaze setting. It also outperforms existing target-detection models on standard gaze prediction for both target-present and target-absent search tasks. In addition to its improved performance, Gazeformer is more than five times faster than the state-of-the-art target-present visual search model.

* IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

Via

Access Paper or Ask Questions

Target-absent Human Attention

Jul 04, 2022

Zhibo Yang, Sounak Mondal, Seoyoung Ahn, Gregory Zelinsky, Minh Hoai, Dimitris Samaras

Figure 1 for Target-absent Human Attention

Figure 2 for Target-absent Human Attention

Figure 3 for Target-absent Human Attention

Figure 4 for Target-absent Human Attention

Abstract:The prediction of human gaze behavior is important for building human-computer interactive systems that can anticipate a user's attention. Computer vision models have been developed to predict the fixations made by people as they search for target objects. But what about when the image has no target? Equally important is to know how people search when they cannot find a target, and when they would stop searching. In this paper, we propose the first data-driven computational model that addresses the search-termination problem and predicts the scanpath of search fixations made by people searching for targets that do not appear in images. We model visual search as an imitation learning problem and represent the internal knowledge that the viewer acquires through fixations using a novel state representation that we call Foveated Feature Maps (FFMs). FFMs integrate a simulated foveated retina into a pretrained ConvNet that produces an in-network feature pyramid, all with minimal computational overhead. Our method integrates FFMs as the state representation in inverse reinforcement learning. Experimentally, we improve the state of the art in predicting human target-absent search behavior on the COCO-Search18 dataset

* Accepted to ECCV2022

Via

Access Paper or Ask Questions

Recurrent Attention Models with Object-centric Capsule Representation for Multi-object Recognition

Oct 11, 2021

Hossein Adeli, Seoyoung Ahn, Gregory Zelinsky

Figure 1 for Recurrent Attention Models with Object-centric Capsule Representation for Multi-object Recognition

Figure 2 for Recurrent Attention Models with Object-centric Capsule Representation for Multi-object Recognition

Figure 3 for Recurrent Attention Models with Object-centric Capsule Representation for Multi-object Recognition

Figure 4 for Recurrent Attention Models with Object-centric Capsule Representation for Multi-object Recognition

Abstract:The visual system processes a scene using a sequence of selective glimpses, each driven by spatial and object-based attention. These glimpses reflect what is relevant to the ongoing task and are selected through recurrent processing and recognition of the objects in the scene. In contrast, most models treat attention selection and recognition as separate stages in a feedforward process. Here we show that using capsule networks to create an object-centric hidden representation in an encoder-decoder model with iterative glimpse attention yields effective integration of attention and recognition. We evaluate our model on three multi-object recognition tasks; highly overlapping digits, digits among distracting clutter and house numbers, and show that it learns to effectively move its glimpse window, recognize and reconstruct the objects, all with only the classification as supervision. Our work takes a step toward a general architecture for how to integrate recurrent object-centric representation into the planning of attentional glimpses.

Via

Access Paper or Ask Questions

A Study of Human Gaze Behavior During Visual Crowd Counting

Sep 27, 2020

Raji Annadi, Yupei Chen, Viresh Ranjan, Dimitris Samaras, Gregory Zelinsky, Minh Hoai

Figure 1 for A Study of Human Gaze Behavior During Visual Crowd Counting

Figure 2 for A Study of Human Gaze Behavior During Visual Crowd Counting

Figure 3 for A Study of Human Gaze Behavior During Visual Crowd Counting

Figure 4 for A Study of Human Gaze Behavior During Visual Crowd Counting

Abstract:In this paper, we describe our study on how humans allocate their attention during visual crowd counting. Using an eye tracker, we collect gaze behavior of human participants who are tasked with counting the number of people in crowd images. Analyzing the collected gaze behavior of ten human participants on thirty crowd images, we observe some common approaches for visual counting. For an image of a small crowd, the approach is to enumerate over all people or groups of people in the crowd, and this explains the high level of similarity between the fixation density maps of different human participants. For an image of a large crowd, our participants tend to focus on one section of the image, count the number of people in that section, and then extrapolate to the other sections. In terms of count accuracy, our human participants are not as good at the counting task, compared to the performance of the current state-of-the-art computer algorithms. Interestingly, there is a tendency to under count the number of people in all crowd images. Gaze behavior data and images can be downloaded from https://www3.cs.stonybrook.edu/~minhhoai/projects/crowd_counting_gaze/.

Via

Access Paper or Ask Questions

Predicting Goal-directed Human Attention Using Inverse Reinforcement Learning

Jun 25, 2020

Zhibo Yang, Lihan Huang, Yupei Chen, Zijun Wei, Seoyoung Ahn, Gregory Zelinsky, Dimitris Samaras, Minh Hoai

Figure 1 for Predicting Goal-directed Human Attention Using Inverse Reinforcement Learning

Figure 2 for Predicting Goal-directed Human Attention Using Inverse Reinforcement Learning

Figure 3 for Predicting Goal-directed Human Attention Using Inverse Reinforcement Learning

Figure 4 for Predicting Goal-directed Human Attention Using Inverse Reinforcement Learning

Abstract:Being able to predict human gaze behavior has obvious importance for behavioral vision and for computer vision applications. Most models have mainly focused on predicting free-viewing behavior using saliency maps, but these predictions do not generalize to goal-directed behavior, such as when a person searches for a visual target object. We propose the first inverse reinforcement learning (IRL) model to learn the internal reward function and policy used by humans during visual search. The viewer's internal belief states were modeled as dynamic contextual belief maps of object locations. These maps were learned by IRL and then used to predict behavioral scanpaths for multiple target categories. To train and evaluate our IRL model we created COCO-Search18, which is now the largest dataset of high-quality search fixations in existence. COCO-Search18 has 10 participants searching for each of 18 target-object categories in 6202 images, making about 300,000 goal-directed fixations. When trained and evaluated on COCO-Search18, the IRL model outperformed baseline models in predicting search fixation scanpaths, both in terms of similarity to human search behavior and search efficiency. Finally, reward maps recovered by the IRL model reveal distinctive target-dependent patterns of object prioritization, which we interpret as a learned object context.

* 16 pages, 13 figures, CVPR 2020

Via

Access Paper or Ask Questions