Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xucong Zhang

AmbientEye: A Dataset for Pupil Segmentation under Natural Ambient Infrared Illumination

Jun 02, 2026

Mingyu Han, Hyunyoung Han, Nitheekulawatn Thommakoon, Gangtae Park, Jieun Han, Xucong Zhang, Ian Oakley

Abstract:Eye tracking is essential for smart glasses, as it provides insight into user attention for ambient intelligence applications. However, most existing eye-tracking systems rely on active infrared (IR) illumination, creating practical barriers to all-day outdoor use due to power consumption. In this paper, we investigate whether passive IR cameras alone, without any active IR light source, can enable reliable pupil detection in unconstrained outdoor environments, where ambient sunlight serves as the sole illumination source. To support this investigation, we introduce AmbientEye, a large-scale dataset of 2,606,225 eye images collected from 35 participants from 19 countries. It is captured outdoors under natural sunlight with two off-axis camera configurations and two sun-orientation conditions. We provide high-quality pupil annotation through SAM2 automatic segmentation, followed by refinement by human annotators. We benchmark a state-of-the-art pupil segmentation algorithm on our dataset and compare its performance with that on existing datasets under controlled IR illumination. Results reveal a substantial drop in pupil segmentation performance from 0.928 on controlled IR datasets to 0.767 on AmbientEye. This performance gap highlights the challenge of the ambient-light setting. This positions AmbientEye as a first benchmark for an unexplored and highly practical eye-tracking scenario.

* 12 pages, 7 figures

Via

Access Paper or Ask Questions

PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction

Apr 09, 2026

Zhi-Yi Lin, Thomas Markhorst, Jouh Yeong Chew, Xucong Zhang

Abstract:Human-like multimodal reaction generation is essential for natural group interactions between humans and embodied AI. However, existing approaches are limited to single-modality or speaking-only responses in dyadic interactions, making them unsuitable for realistic social scenarios. Many also overlook nonverbal cues and complex dynamics of polyadic interactions, both critical for engagement and conversational coherence. In this work, we present PolySLGen, an online framework for Polyadic multimodal Speaking and Listening reaction Generation. Given past conversation and motion from all participants, PolySLGen generates a future speaking or listening reaction for a target participant, including speech, body motion, and speaking state score. To model group interactions effectively, we propose a pose fusion module and a social cue encoder that jointly aggregate motion and social signals from the group. Extensive experiments, along with quantitative and qualitative evaluations, show that PolySLGen produces contextually appropriate and temporally coherent multi-modal reactions, outperforming several adapted and state-of-the-art baselines in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism.

Via

Access Paper or Ask Questions

MuPPet: Multi-person 2D-to-3D Pose Lifting

Apr 08, 2026

Thomas Markhorst, Zhi-Yi Lin, Jouh Yeong Chew, Jan van Gemert, Xucong Zhang

Abstract:Multi-person social interactions are inherently built on coherence and relationships among all individuals within the group, making multi-person localization and body pose estimation essential to understanding these social dynamics. One promising approach is 2D-to-3D pose lifting which provides a 3D human pose consisting of rich spatial details by building on the significant advances in 2D pose estimation. However, the existing 2D-to-3D pose lifting methods often neglect inter-person relationships or cannot handle varying group sizes, limiting their effectiveness in multi-person settings. We propose MuPPet, a novel multi-person 2D-to-3D pose lifting framework that explicitly models inter-person correlations. To leverage these inter-person dependencies, our approach introduces Person Encoding to structure individual representations, Permutation Augmentation to enhance training diversity, and Dynamic Multi-Person Attention to adaptively model correlations between individuals. Extensive experiments on group interaction datasets demonstrate MuPPet significantly outperforms state-of-the-art single- and multi-person 2D-to-3D pose lifting methods, and improves robustness in occlusion scenarios. Our findings highlight the importance of modeling inter-person correlations, paving the way for accurate and socially-aware 3D pose estimation. Our code is available at: https://github.com/Thomas-Markhorst/MuPPet

* Accepted at CVPRw 2026

Via

Access Paper or Ask Questions

GGPT: Geometry Grounded Point Transformer

Mar 11, 2026

Yutong Chen, Yiming Wang, Xucong Zhang, Sergey Prokudin, Siyu Tang

Abstract:Recent feed-forward networks have achieved remarkable progress in sparse-view 3D reconstruction by predicting dense point maps directly from RGB images. However, they often suffer from geometric inconsistencies and limited fine-grained accuracy due to the absence of explicit multi-view constraints. We introduce the Geometry-Grounded Point Transformer (GGPT), a framework that augments feed-forward reconstruction with reliable sparse geometric guidance. We first propose an improved Structure-from-Motion pipeline based on dense feature matching and lightweight geometric optimisation to efficiently estimate accurate camera poses and partial 3D point clouds from sparse input views. Building on this foundation, we propose a geometry-guided 3D point transformer that refines dense point maps under explicit partial-geometry supervision using an optimised guidance encoding. Extensive experiments demonstrate that our method provides a principled mechanism for integrating geometric priors with dense feed-forward predictions, producing reconstructions that are both geometrically consistent and spatially complete, recovering fine structures and filling gaps in textureless areas. Trained solely on ScanNet++ with VGGT predictions, GGPT generalises across architectures and datasets, substantially outperforming state-of-the-art feed-forward 3D reconstruction models in both in-domain and out-of-domain settings.

* CVPR 2026, Project website: https://chenyutongthu.github.io/research/ggpt

Via

Access Paper or Ask Questions

UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training

Feb 04, 2025

Jiawei Qin, Xucong Zhang, Yusuke Sugano

Figure 1 for UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training

Figure 2 for UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training

Figure 3 for UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training

Figure 4 for UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training

Abstract:Despite decades of research on data collection and model architectures, current gaze estimation models face significant challenges in generalizing across diverse data domains. While recent advances in self-supervised pre-training have shown remarkable potential for improving model generalization in various vision tasks, their effectiveness in gaze estimation remains unexplored due to the geometric nature of the gaze regression task. We propose UniGaze, which leverages large-scale, in-the-wild facial datasets through self-supervised pre-training for gaze estimation. We carefully curate multiple facial datasets that capture diverse variations in identity, lighting, background, and head poses. By directly applying Masked Autoencoder (MAE) pre-training on normalized face images with a Vision Transformer (ViT) backbone, our UniGaze learns appropriate feature representations within the specific input space required by downstream gaze estimation models. Through comprehensive experiments using challenging cross-dataset evaluation and novel protocols, including leave-one-dataset-out and joint-dataset settings, we demonstrate that UniGaze significantly improves generalization across multiple data domains while minimizing reliance on costly labeled data. The source code and pre-trained models will be released upon acceptance.

Via

Access Paper or Ask Questions

PrivateGaze: Preserving User Privacy in Black-box Mobile Gaze Tracking Services

Aug 01, 2024

Lingyu Du, Jinyuan Jia, Xucong Zhang, Guohao Lan

Figure 1 for PrivateGaze: Preserving User Privacy in Black-box Mobile Gaze Tracking Services

Figure 2 for PrivateGaze: Preserving User Privacy in Black-box Mobile Gaze Tracking Services

Figure 3 for PrivateGaze: Preserving User Privacy in Black-box Mobile Gaze Tracking Services

Figure 4 for PrivateGaze: Preserving User Privacy in Black-box Mobile Gaze Tracking Services

Abstract:Eye gaze contains rich information about human attention and cognitive processes. This capability makes the underlying technology, known as gaze tracking, a critical enabler for many ubiquitous applications and has triggered the development of easy-to-use gaze estimation services. Indeed, by utilizing the ubiquitous cameras on tablets and smartphones, users can readily access many gaze estimation services. In using these services, users must provide their full-face images to the gaze estimator, which is often a black box. This poses significant privacy threats to the users, especially when a malicious service provider gathers a large collection of face images to classify sensitive user attributes. In this work, we present PrivateGaze, the first approach that can effectively preserve users' privacy in black-box gaze tracking services without compromising gaze estimation performance. Specifically, we proposed a novel framework to train a privacy preserver that converts full-face images into obfuscated counterparts, which are effective for gaze estimation while containing no privacy information. Evaluation on four datasets shows that the obfuscated image can protect users' private information, such as identity and gender, against unauthorized attribute classification. Meanwhile, when used directly by the black-box gaze estimator as inputs, the obfuscated images lead to comparable tracking performance to the conventional, unprotected full-face images.

Via

Access Paper or Ask Questions

GazeHTA: End-to-end Gaze Target Detection with Head-Target Association

Apr 19, 2024

Zhi-Yi Lin, Jouh Yeong Chew, Jan van Gemert, Xucong Zhang

Abstract:We propose an end-to-end approach for gaze target detection: predicting a head-target connection between individuals and the target image regions they are looking at. Most of the existing methods use independent components such as off-the-shelf head detectors or have problems in establishing associations between heads and gaze targets. In contrast, we investigate an end-to-end multi-person Gaze target detection framework with Heads and Targets Association (GazeHTA), which predicts multiple head-target instances based solely on input scene image. GazeHTA addresses challenges in gaze target detection by (1) leveraging a pre-trained diffusion model to extract scene features for rich semantic understanding, (2) re-injecting a head feature to enhance the head priors for improved head understanding, and (3) learning a connection map as the explicit visual associations between heads and gaze targets. Our extensive experimental results demonstrate that GazeHTA outperforms state-of-the-art gaze target detection methods and two adapted diffusion-based baselines on two standard datasets.

Via

Access Paper or Ask Questions

3D Kinematics Estimation from Video with a Biomechanical Model and Synthetic Training Data

Feb 26, 2024

Zhi-Yi Lin, Bofan Lyu, Judith Cueto Fernandez, Eline van der Kruk, Ajay Seth, Xucong Zhang

Figure 1 for 3D Kinematics Estimation from Video with a Biomechanical Model and Synthetic Training Data

Figure 2 for 3D Kinematics Estimation from Video with a Biomechanical Model and Synthetic Training Data

Figure 3 for 3D Kinematics Estimation from Video with a Biomechanical Model and Synthetic Training Data

Figure 4 for 3D Kinematics Estimation from Video with a Biomechanical Model and Synthetic Training Data

Abstract:Accurate 3D kinematics estimation of human body is crucial in various applications for human health and mobility, such as rehabilitation, injury prevention, and diagnosis, as it helps to understand the biomechanical loading experienced during movement. Conventional marker-based motion capture is expensive in terms of financial investment, time, and the expertise required. Moreover, due to the scarcity of datasets with accurate annotations, existing markerless motion capture methods suffer from challenges including unreliable 2D keypoint detection, limited anatomic accuracy, and low generalization capability. In this work, we propose a novel biomechanics-aware network that directly outputs 3D kinematics from two input views with consideration of biomechanical prior and spatio-temporal information. To train the model, we create synthetic dataset ODAH with accurate kinematics annotations generated by aligning the body mesh from the SMPL-X model and a full-body OpenSim skeletal model. Our extensive experiments demonstrate that the proposed approach, only trained on synthetic data, outperforms previous state-of-the-art methods when evaluated across multiple datasets, revealing a promising direction for enhancing video-based human motion capture.

Via

Access Paper or Ask Questions

Unsupervised Gaze-aware Contrastive Learning with Subject-specific Condition

Sep 08, 2023

Lingyu Du, Xucong Zhang, Guohao Lan

Figure 1 for Unsupervised Gaze-aware Contrastive Learning with Subject-specific Condition

Figure 2 for Unsupervised Gaze-aware Contrastive Learning with Subject-specific Condition

Figure 3 for Unsupervised Gaze-aware Contrastive Learning with Subject-specific Condition

Figure 4 for Unsupervised Gaze-aware Contrastive Learning with Subject-specific Condition

Abstract:Appearance-based gaze estimation has shown great promise in many applications by using a single general-purpose camera as the input device. However, its success is highly depending on the availability of large-scale well-annotated gaze datasets, which are sparse and expensive to collect. To alleviate this challenge we propose ConGaze, a contrastive learning-based framework that leverages unlabeled facial images to learn generic gaze-aware representations across subjects in an unsupervised way. Specifically, we introduce the gaze-specific data augmentation to preserve the gaze-semantic features and maintain the gaze consistency, which are proven to be crucial for effective contrastive gaze representation learning. Moreover, we devise a novel subject-conditional projection module that encourages a share feature extractor to learn gaze-aware and generic representations. Our experiments on three public gaze estimation datasets show that ConGaze outperforms existing unsupervised learning solutions by 6.7% to 22.5%; and achieves 15.1% to 24.6% improvement over its supervised learning-based counterpart in cross-dataset evaluations.

Via

Access Paper or Ask Questions

Investigation of Architectures and Receptive Fields for Appearance-based Gaze Estimation

Aug 18, 2023

Yunhan Wang, Xiangwei Shi, Shalini De Mello, Hyung Jin Chang, Xucong Zhang

Figure 1 for Investigation of Architectures and Receptive Fields for Appearance-based Gaze Estimation

Figure 2 for Investigation of Architectures and Receptive Fields for Appearance-based Gaze Estimation

Figure 3 for Investigation of Architectures and Receptive Fields for Appearance-based Gaze Estimation

Abstract:With the rapid development of deep learning technology in the past decade, appearance-based gaze estimation has attracted great attention from both computer vision and human-computer interaction research communities. Fascinating methods were proposed with variant mechanisms including soft attention, hard attention, two-eye asymmetry, feature disentanglement, rotation consistency, and contrastive learning. Most of these methods take the single-face or multi-region as input, yet the basic architecture of gaze estimation has not been fully explored. In this paper, we reveal the fact that tuning a few simple parameters of a ResNet architecture can outperform most of the existing state-of-the-art methods for the gaze estimation task on three popular datasets. With our extensive experiments, we conclude that the stride number, input image resolution, and multi-region architecture are critical for the gaze estimation performance while their effectiveness dependent on the quality of the input face image. We obtain the state-of-the-art performances on three datasets with 3.64 on ETH-XGaze, 4.50 on MPIIFaceGaze, and 9.13 on Gaze360 degrees gaze estimation error by taking ResNet-50 as the backbone.

Via

Access Paper or Ask Questions