Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kevin Desai

ThermalTap: Passive Application Fingerprinting in VR Headsets via Thermal Side Channels

May 13, 2026

Mahsin Bin Akram, A H M Nazmus Sakib, OFM Riaz Rahman Aranya, Raveen Wijewickrama, Kevin Desai, Murtuza Jadliwala

Abstract:Standalone virtual reality (VR) headsets process highly sensitive personal, professional, and health-related data, yet their susceptibility to non-contact physical side channels remains largely unexplored. Existing side-channel attacks typically require malicious software execution or physical access to peripherals, making them conspicuous and potentially patchable. This paper introduces ThermalTap, the first passive, non-contact side-channel attack that fingerprints VR applications solely from the long-wave infrared (LWIR) radiation emitted by the headset chassis. By treating a headset's thermal signature as a high-fidelity proxy for internal computational workloads, ThermalTap enables remote application inference at meter-scale distances without any device interaction. To achieve robust performance in real-world settings, the system combines a commodity thermal camera with a multi-modal sensor suite (capturing ambient temperature, humidity, and airflow) to normalize environmental noise. We evaluate ThermalTap using six applications across three commercial standalone headsets. In indoor settings, ThermalTap identifies applications with over 90% accuracy using only 10 seconds of thermal camera data. Under outdoor conditions, with longer session-level observations, several applications remain identifiable despite environmental variability, with the strongest outdoor application reaching 81% accuracy. Our findings establish thermal radiation as a fundamental and unavoidable privacy risk for immersive systems, exposing a critical security gap that bypasses current software-level protections and physical access controls.

Via

Access Paper or Ask Questions

AG-EgoPose: Leveraging Action-Guided Motion and Kinematic Joint Encoding for Egocentric 3D Pose Estimation

Mar 26, 2026

Md Mushfiqur Azam, John Quarles, Kevin Desai

Abstract:Egocentric 3D human pose estimation remains challenging due to severe perspective distortion, limited body visibility, and complex camera motion inherent in first-person viewpoints. Existing methods typically rely on single-frame analysis or limited temporal fusion, which fails to effectively leverage the rich motion context available in egocentric videos. We introduce AG-EgoPose, a novel dual-stream framework that integrates short- and long-range motion context with fine-grained spatial cues for robust pose estimation from fisheye camera input. Our framework features two parallel streams: A spatial stream uses a weight-sharing ResNet-18 encoder-decoder to generate 2D joint heatmaps and corresponding joint-specific spatial feature tokens. Simultaneously, a temporal stream uses a ResNet-50 backbone to extract visual features, which are then processed by an action recognition backbone to capture the motion dynamics. These complementary representations are fused and refined in a transformer decoder with learnable joint tokens, which allows for the joint-level integration of spatial and temporal evidence while maintaining anatomical constraints. Experiments on real-world datasets demonstrate that AG-EgoPose achieves state-of-the-art performance in both quantitative and qualitative metrics. Code is available at: https://github.com/Mushfiq5647/AG-EgoPose.

Via

Access Paper or Ask Questions

To Agree or To Be Right? The Grounding-Sycophancy Tradeoff in Medical Vision-Language Models

Mar 23, 2026

OFM Riaz Rahman Aranya, Kevin Desai

Abstract:Vision-language models (VLMs) adapted to the medical domain have shown strong performance on visual question answering benchmarks, yet their robustness against two critical failure modes, hallucination and sycophancy, remains poorly understood, particularly in combination. We evaluate six VLMs (three general-purpose, three medical-specialist) on three medical VQA datasets and uncover a grounding-sycophancy tradeoff: models with the lowest hallucination propensity are the most sycophantic, while the most pressure-resistant model hallucinates more than all medical-specialist models. To characterize this tradeoff, we propose three metrics: L-VASE, a logit-space reformulation of VASE that avoids its double-normalization; CCS, a confidence-calibrated sycophancy score that penalizes high-confidence capitulation; and Clinical Safety Index (CSI), a unified safety index that combines grounding, autonomy, and calibration via a geometric mean. Across 1,151 test cases, no model achieves a CSI above 0.35, indicating that none of the evaluated 7-8B parameter VLMs is simultaneously well-grounded and robust to social pressure. Our findings suggest that joint evaluation of both properties is necessary before these models can be considered for clinical use. Code is available at https://github.com/UTSA-VIRLab/AgreeOrRight

Via

Access Paper or Ask Questions

Toward Faithful Segmentation Attribution via Benchmarking and Dual-Evidence Fusion

Mar 23, 2026

Abu Noman Md Sakib, OFM Riaz Rahman Aranya, Kevin Desai, Zijie Zhang

Abstract:Attribution maps for semantic segmentation are almost always judged by visual plausibility. Yet looking convincing does not guarantee that the highlighted pixels actually drive the model's prediction, nor that attribution credit stays within the target region. These questions require a dedicated evaluation protocol. We introduce a reproducible benchmark that tests intervention-based faithfulness, off-target leakage, perturbation robustness, and runtime on Pascal VOC and SBD across three pretrained backbones. To further demonstrate the benchmark, we propose Dual-Evidence Attribution (DEA), a lightweight correction that fuses gradient evidence with region-level intervention signals through agreement-weighted fusion. DEA increases emphasis where both sources agree and retains causal support when gradient responses are unstable. Across all completed runs, DEA consistently improves deletion-based faithfulness over gradient-only baselines and preserves strong robustness, at the cost of additional compute from intervention passes. The benchmark exposes a faithfulness-stability tradeoff among attribution families that is entirely hidden under visual evaluation, providing a foundation for principled method selection in segmentation explainability. Code is available at https://github.com/anmspro/DEA.

* CVPR 2026

Via

Access Paper or Ask Questions

SRA-Seg: Synthetic to Real Alignment for Semi-Supervised Medical Image Segmentation

Feb 03, 2026

OFM Riaz Rahman Aranya, Kevin Desai

Abstract:Synthetic data, an appealing alternative to extensive expert-annotated data for medical image segmentation, consistently fails to improve segmentation performance despite its visual realism. The reason being that synthetic and real medical images exist in different semantic feature spaces, creating a domain gap that current semi-supervised learning methods cannot bridge. We propose SRA-Seg, a framework explicitly designed to align synthetic and real feature distributions for medical image segmentation. SRA-Seg introduces a similarity-alignment (SA) loss using frozen DINOv2 embeddings to pull synthetic representations toward their nearest real counterparts in semantic space. We employ soft edge blending to create smooth anatomical transitions and continuous labels, eliminating the hard boundaries from traditional copy-paste augmentation. The framework generates pseudo-labels for synthetic images via an EMA teacher model and applies soft-segmentation losses that respect uncertainty in mixed regions. Our experiments demonstrate strong results: using only 10% labeled real data and 90% synthetic unlabeled data, SRA-Seg achieves 89.34% Dice on ACDC and 84.42% on FIVES, significantly outperforming existing semi-supervised methods and matching the performance of methods using real unlabeled data.

Via

Access Paper or Ask Questions

TRACE: Temporal Radiology with Anatomical Change Explanation for Grounded X-ray Report Generation

Feb 03, 2026

OFM Riaz Rahman Aranya, Kevin Desai

Abstract:Temporal comparison of chest X-rays is fundamental to clinical radiology, enabling detection of disease progression, treatment response, and new findings. While vision-language models have advanced single-image report generation and visual grounding, no existing method combines these capabilities for temporal change detection. We introduce Temporal Radiology with Anatomical Change Explanation (TRACE), the first model that jointly performs temporal comparison, change classification, and spatial localization. Given a prior and current chest X-ray, TRACE generates natural language descriptions of interval changes (worsened, improved, stable) while grounding each finding with bounding box coordinates. TRACE demonstrates effective spatial localization with over 90% grounding accuracy, establishing a foundation for this challenging new task. Our ablation study uncovers an emergent capability: change detection arises only when temporal comparison and spatial grounding are jointly learned, as neither alone enables meaningful change detection. This finding suggests that grounding provides a spatial attention mechanism essential for temporal reasoning.

Via

Access Paper or Ask Questions

PMPNet: Pixel Movement Prediction Network for Monocular Depth Estimation in Dynamic Scenes

Nov 04, 2024

Kebin Peng, John Quarles, Kevin Desai

Figure 1 for PMPNet: Pixel Movement Prediction Network for Monocular Depth Estimation in Dynamic Scenes

Figure 2 for PMPNet: Pixel Movement Prediction Network for Monocular Depth Estimation in Dynamic Scenes

Figure 3 for PMPNet: Pixel Movement Prediction Network for Monocular Depth Estimation in Dynamic Scenes

Figure 4 for PMPNet: Pixel Movement Prediction Network for Monocular Depth Estimation in Dynamic Scenes

Abstract:In this paper, we propose a novel method for monocular depth estimation in dynamic scenes. We first explore the arbitrariness of object's movement trajectory in dynamic scenes theoretically. To overcome the arbitrariness, we use assume that points move along a straight line over short distances and then summarize it as a triangular constraint loss in two dimensional Euclidean space. To overcome the depth inconsistency problem around the edges, we propose a deformable support window module that learns features from different shapes of objects, making depth value more accurate around edge area. The proposed model is trained and tested on two outdoor datasets - KITTI and Make3D, as well as an indoor dataset - NYU Depth V2. The quantitative and qualitative results reported on these datasets demonstrate the success of our proposed model when compared against other approaches. Ablation study results on the KITTI dataset also validate the effectiveness of the proposed pixel movement prediction module as well as the deformable support window module.

Via

Access Paper or Ask Questions

Mazed and Confused: A Dataset of Cybersickness, Working Memory, Mental Load, Physical Load, and Attention During a Real Walking Task in VR

Sep 10, 2024

Jyotirmay Nag Setu, Joshua M Le, Ripan Kumar Kundu, Barry Giesbrecht, Tobias Höllerer, Khaza Anuarul Hoque, Kevin Desai, John Quarles

Figure 1 for Mazed and Confused: A Dataset of Cybersickness, Working Memory, Mental Load, Physical Load, and Attention During a Real Walking Task in VR

Figure 2 for Mazed and Confused: A Dataset of Cybersickness, Working Memory, Mental Load, Physical Load, and Attention During a Real Walking Task in VR

Figure 3 for Mazed and Confused: A Dataset of Cybersickness, Working Memory, Mental Load, Physical Load, and Attention During a Real Walking Task in VR

Figure 4 for Mazed and Confused: A Dataset of Cybersickness, Working Memory, Mental Load, Physical Load, and Attention During a Real Walking Task in VR

Abstract:Virtual Reality (VR) is quickly establishing itself in various industries, including training, education, medicine, and entertainment, in which users are frequently required to carry out multiple complex cognitive and physical activities. However, the relationship between cognitive activities, physical activities, and familiar feelings of cybersickness is not well understood and thus can be unpredictable for developers. Researchers have previously provided labeled datasets for predicting cybersickness while users are stationary, but there have been few labeled datasets on cybersickness while users are physically walking. Thus, from 39 participants, we collected head orientation, head position, eye tracking, images, physiological readings from external sensors, and the self-reported cybersickness severity, physical load, and mental load in VR. Throughout the data collection, participants navigated mazes via real walking and performed tasks challenging their attention and working memory. To demonstrate the dataset's utility, we conducted a case study of training classifiers in which we achieved 95% accuracy for cybersickness severity classification. The noteworthy performance of the straightforward classifiers makes this dataset ideal for future researchers to develop cybersickness detection and reduction models. To better understand the features that helped with classification, we performed SHAP(SHapley Additive exPlanations) analysis, highlighting the importance of eye tracking and physiological measures for cybersickness prediction while walking. This open dataset can allow future researchers to study the connection between cybersickness and cognitive loads and develop prediction models. This dataset will empower future VR developers to design efficient and effective Virtual Environments by improving cognitive load management and minimizing cybersickness.

Via

Access Paper or Ask Questions

EncodeNet: A Framework for Boosting DNN Accuracy with Entropy-driven Generalized Converting Autoencoder

Apr 21, 2024

Hasanul Mahmud, Kevin Desai, Palden Lama, Sushil K. Prasad

Figure 1 for EncodeNet: A Framework for Boosting DNN Accuracy with Entropy-driven Generalized Converting Autoencoder

Figure 2 for EncodeNet: A Framework for Boosting DNN Accuracy with Entropy-driven Generalized Converting Autoencoder

Figure 3 for EncodeNet: A Framework for Boosting DNN Accuracy with Entropy-driven Generalized Converting Autoencoder

Figure 4 for EncodeNet: A Framework for Boosting DNN Accuracy with Entropy-driven Generalized Converting Autoencoder

Abstract:Image classification is a fundamental task in computer vision, and the quest to enhance DNN accuracy without inflating model size or latency remains a pressing concern. We make a couple of advances in this regard, leading to a novel EncodeNet design and training framework. The first advancement involves Converting Autoencoders, a novel approach that transforms images into an easy-to-classify image of its class. Our prior work that applied the Converting Autoencoder and a simple classifier in tandem achieved moderate accuracy over simple datasets, such as MNIST and FMNIST. However, on more complex datasets like CIFAR-10, the Converting Autoencoder has a large reconstruction loss, making it unsuitable for enhancing DNN accuracy. To address these limitations, we generalize the design of Converting Autoencoders by leveraging a larger class of DNNs, those with architectures comprising feature extraction layers followed by classification layers. We incorporate a generalized algorithmic design of the Converting Autoencoder and intraclass clustering to identify representative images, leading to optimized image feature learning. Next, we demonstrate the effectiveness of our EncodeNet design and training framework, improving the accuracy of well-trained baseline DNNs while maintaining the overall model size. EncodeNet's building blocks comprise the trained encoder from our generalized Converting Autoencoders transferring knowledge to a lightweight classifier network - also extracted from the baseline DNN. Our experimental results demonstrate that EncodeNet improves the accuracy of VGG16 from 92.64% to 94.05% on CIFAR-10 and RestNet20 from 74.56% to 76.04% on CIFAR-100. It outperforms state-of-the-art techniques that rely on knowledge distillation and attention mechanisms, delivering higher accuracy for models of comparable size.

* 15 pages

Via

Access Paper or Ask Questions

A Survey on 3D Egocentric Human Pose Estimation

Mar 26, 2024

Md Mushfiqur Azam, Kevin Desai

Abstract:Egocentric human pose estimation aims to estimate human body poses and develop body representations from a first-person camera perspective. It has gained vast popularity in recent years because of its wide range of applications in sectors like XR-technologies, human-computer interaction, and fitness tracking. However, to the best of our knowledge, there is no systematic literature review based on the proposed solutions regarding egocentric 3D human pose estimation. To that end, the aim of this survey paper is to provide an extensive overview of the current state of egocentric pose estimation research. In this paper, we categorize and discuss the popular datasets and the different pose estimation models, highlighting the strengths and weaknesses of different methods by comparative analysis. This survey can be a valuable resource for both researchers and practitioners in the field, offering insights into key concepts and cutting-edge solutions in egocentric pose estimation, its wide-ranging applications, as well as the open problems with future scope.

Via

Access Paper or Ask Questions