Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wiktor Mucha

Towards Egocentric 3D Hand Pose Estimation in Unseen Domains

Jan 10, 2026

Wiktor Mucha, Michael Wray, Martin Kampel

Abstract:We present V-HPOT, a novel approach for improving the cross-domain performance of 3D hand pose estimation from egocentric images across diverse, unseen domains. State-of-the-art methods demonstrate strong performance when trained and tested within the same domain. However, they struggle to generalise to new environments due to limited training data and depth perception -- overfitting to specific camera intrinsics. Our method addresses this by estimating keypoint z-coordinates in a virtual camera space, normalised by focal length and image size, enabling camera-agnostic depth prediction. We further leverage this invariance to camera intrinsics to propose a self-supervised test-time optimisation strategy that refines the model's depth perception during inference. This is achieved by applying a 3D consistency loss between predicted and in-space scale-transformed hand poses, allowing the model to adapt to target domain characteristics without requiring ground truth annotations. V-HPOT significantly improves 3D hand pose estimation performance in cross-domain scenarios, achieving a 71% reduction in mean pose error on the H2O dataset and a 41% reduction on the AssemblyHands dataset. Compared to state-of-the-art methods, V-HPOT outperforms all single-stage approaches across all datasets and competes closely with two-stage methods, despite needing approximately x3.5 to x14 less data.

* Accepted at WACV 2026

Via

Access Paper or Ask Questions

REST-HANDS: Rehabilitation with Egocentric Vision Using Smartglasses for Treatment of Hands after Surviving Stroke

Sep 30, 2024

Wiktor Mucha, Kentaro Tanaka, Martin Kampel

Abstract:Stroke represents the third cause of death and disability worldwide, and is recognised as a significant global health problem. A major challenge for stroke survivors is persistent hand dysfunction, which severely affects the ability to perform daily activities and the overall quality of life. In order to regain their functional hand ability, stroke survivors need rehabilitation therapy. However, traditional rehabilitation requires continuous medical support, creating dependency on an overburdened healthcare system. In this paper, we explore the use of egocentric recordings from commercially available smart glasses, specifically RayBan Stories, for remote hand rehabilitation. Our approach includes offline experiments to evaluate the potential of smart glasses for automatic exercise recognition, exercise form evaluation and repetition counting. We present REST-HANDS, the first dataset of egocentric hand exercise videos. Using state-of-the-art methods, we establish benchmarks with high accuracy rates for exercise recognition (98.55%), form evaluation (86.98%), and repetition counting (mean absolute error of 1.33). Our study demonstrates the feasibility of using egocentric video from smart glasses for remote rehabilitation, paving the way for further research.

* Accepted at ACVR ECCV 2024

Via

Access Paper or Ask Questions

SHARP: Segmentation of Hands and Arms by Range using Pseudo-Depth for Enhanced Egocentric 3D Hand Pose Estimation and Action Recognition

Aug 19, 2024

Wiktor Mucha, Michael Wray, Martin Kampel

Figure 1 for SHARP: Segmentation of Hands and Arms by Range using Pseudo-Depth for Enhanced Egocentric 3D Hand Pose Estimation and Action Recognition

Figure 2 for SHARP: Segmentation of Hands and Arms by Range using Pseudo-Depth for Enhanced Egocentric 3D Hand Pose Estimation and Action Recognition

Figure 3 for SHARP: Segmentation of Hands and Arms by Range using Pseudo-Depth for Enhanced Egocentric 3D Hand Pose Estimation and Action Recognition

Figure 4 for SHARP: Segmentation of Hands and Arms by Range using Pseudo-Depth for Enhanced Egocentric 3D Hand Pose Estimation and Action Recognition

Abstract:Hand pose represents key information for action recognition in the egocentric perspective, where the user is interacting with objects. We propose to improve egocentric 3D hand pose estimation based on RGB frames only by using pseudo-depth images. Incorporating state-of-the-art single RGB image depth estimation techniques, we generate pseudo-depth representations of the frames and use distance knowledge to segment irrelevant parts of the scene. The resulting depth maps are then used as segmentation masks for the RGB frames. Experimental results on H2O Dataset confirm the high accuracy of the estimated pose with our method in an action recognition task. The 3D hand pose, together with information from object detection, is processed by a transformer-based action recognition network, resulting in an accuracy of 91.73%, outperforming all state-of-the-art methods. Estimations of 3D hand pose result in competitive performance with existing methods with a mean pose error of 28.66 mm. This method opens up new possibilities for employing distance information in egocentric 3D hand pose estimation without relying on depth sensors.

* Accepted at 27th International Conference on Pattern Recognition (ICPR)

Via

Access Paper or Ask Questions

TEXT2TASTE: A Versatile Egocentric Vision System for Intelligent Reading Assistance Using Large Language Model

Apr 14, 2024

Wiktor Mucha, Florin Cuconasu, Naome A. Etori, Valia Kalokyri, Giovanni Trappolini

Abstract:The ability to read, understand and find important information from written text is a critical skill in our daily lives for our independence, comfort and safety. However, a significant part of our society is affected by partial vision impairment, which leads to discomfort and dependency in daily activities. To address the limitations of this part of society, we propose an intelligent reading assistant based on smart glasses with embedded RGB cameras and a Large Language Model (LLM), whose functionality goes beyond corrective lenses. The video recorded from the egocentric perspective of a person wearing the glasses is processed to localise text information using object detection and optical character recognition methods. The LLM processes the data and allows the user to interact with the text and responds to a given query, thus extending the functionality of corrective lenses with the ability to find and summarize knowledge from the text. To evaluate our method, we create a chat-based application that allows the user to interact with the system. The evaluation is conducted in a real-world setting, such as reading menus in a restaurant, and involves four participants. The results show robust accuracy in text retrieval. The system not only provides accurate meal suggestions but also achieves high user satisfaction, highlighting the potential of smart glasses and LLMs in assisting people with special needs.

* Accepted at ICCHP 2024

Via

Access Paper or Ask Questions

In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition

Apr 14, 2024

Wiktor Mucha, Martin Kampel

Figure 1 for In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition

Figure 2 for In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition

Figure 3 for In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition

Figure 4 for In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition

Abstract:Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort. Existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or wearing an uncomfortable depth sensor. In contrast, there has been insufficient research in understanding 2D hand pose for egocentric action recognition, despite the availability of user-friendly smart glasses in the market capable of capturing a single RGB image. Our study aims to fill this research gap by exploring the field of 2D hand pose estimation for egocentric action recognition, making two contributions. Firstly, we introduce two novel approaches for 2D hand pose estimation, namely EffHandNet for single-hand estimation and EffHandEgoNet, tailored for an egocentric perspective, capturing interactions between hands and objects. Both methods outperform state-of-the-art models on H2O and FPHA public benchmarks. Secondly, we present a robust action recognition architecture from 2D hand and object poses. This method incorporates EffHandEgoNet, and a transformer-based action recognition method. Evaluated on H2O and FPHA datasets, our architecture has a faster inference time and achieves an accuracy of 91.32% and 94.43%, respectively, surpassing state of the art, including 3D-based methods. Our work demonstrates that using 2D skeletal data is a robust approach for egocentric action understanding. Extensive evaluation and ablation studies show the impact of the hand pose estimation approach, and how each input affects the overall performance.

* Accepted at: The 18th IEEE International Conference on Automatic Face and Gesture Recognition

Via

Access Paper or Ask Questions

Human Action Recognition in Egocentric Perspective Using 2D Object and Hands Pose

Jun 08, 2023

Wiktor Mucha, Martin Kampel

Figure 1 for Human Action Recognition in Egocentric Perspective Using 2D Object and Hands Pose

Figure 2 for Human Action Recognition in Egocentric Perspective Using 2D Object and Hands Pose

Figure 3 for Human Action Recognition in Egocentric Perspective Using 2D Object and Hands Pose

Abstract:Egocentric action recognition is essential for healthcare and assistive technology that relies on egocentric cameras because it allows for the automatic and continuous monitoring of activities of daily living (ADLs) without requiring any conscious effort from the user. This study explores the feasibility of using 2D hand and object pose information for egocentric action recognition. While current literature focuses on 3D hand pose information, our work shows that using 2D skeleton data is a promising approach for hand-based action classification, might offer privacy enhancement, and could be less computationally demanding. The study uses a state-of-the-art transformer-based method to classify sequences and achieves validation results of 94%, outperforming other existing solutions. The accuracy of the test subset drops to 76%, indicating the need for further generalization improvement. This research highlights the potential of 2D hand and object pose information for action recognition tasks and offers a promising alternative to 3D-based methods.

Via

Access Paper or Ask Questions

State of the Art of Audio- and Video-Based Solutions for AAL

Jul 05, 2022

Slavisa Aleksic, Michael Atanasov, Jean Calleja Agius, Kenneth Camilleri, Anto Cartolovni, Pau Climent-Peerez, Sara Colantonio, Stefania Cristina, Vladimir Despotovic, Hazim Kemal Ekenel(+29 more)

Abstract:The report illustrates the state of the art of the most successful AAL applications and functions based on audio and video data, namely (i) lifelogging and self-monitoring, (ii) remote monitoring of vital signs, (iii) emotional state recognition, (iv) food intake monitoring, activity and behaviour recognition, (v) activity and personal assistance, (vi) gesture recognition, (vii) fall detection and prevention, (viii) mobility assessment and frailty recognition, and (ix) cognitive and motor rehabilitation. For these application scenarios, the report illustrates the state of play in terms of scientific advances, available products and research project. The open challenges are also highlighted.

Via

Access Paper or Ask Questions