Abstract:We propose Int3DNet, a scene-aware network that predicts 3D intention areas directly from scene geometry and head-hand motion cues, enabling robust human intention prediction without explicit object-level perception. In Mixed Reality (MR), intention prediction is critical as it enables the system to anticipate user actions and respond proactively, reducing interaction delays and ensuring seamless user experiences. Our method employs a cross attention fusion of sparse motion cues and scene point clouds, offering a novel approach that directly interprets the user's spatial intention within the scene. We evaluated Int3DNet on MoGaze and CIRCLE datasets, which are public datasets for full-body human-scene interactions, showing consistent performance across time horizons of up to 1500 ms and outperforming the baselines, even in diverse and unseen scenes. Moreover, we demonstrate the usability of proposed method through a demonstration of efficient visual question answering (VQA) based on intention areas. Int3DNet provides reliable 3D intention areas derived from head-hand motion and scene geometry, thus enabling seamless interaction between humans and MR systems through proactive processing of intention areas.




Abstract:Existing datasets for 3D hand-object interaction are limited either in the data cardinality, data variations in interaction scenarios, or the quality of annotations. In this work, we present a comprehensive new training dataset for hand-object interaction called HOGraspNet. It is the only real dataset that captures full grasp taxonomies, providing grasp annotation and wide intraclass variations. Using grasp taxonomies as atomic actions, their space and time combinatorial can represent complex hand activities around objects. We select 22 rigid objects from the YCB dataset and 8 other compound objects using shape and size taxonomies, ensuring coverage of all hand grasp configurations. The dataset includes diverse hand shapes from 99 participants aged 10 to 74, continuous video frames, and a 1.5M RGB-Depth of sparse frames with annotations. It offers labels for 3D hand and object meshes, 3D keypoints, contact maps, and \emph{grasp labels}. Accurate hand and object 3D meshes are obtained by fitting the hand parametric model (MANO) and the hand implicit function (HALO) to multi-view RGBD frames, with the MoCap system only for objects. Note that HALO fitting does not require any parameter tuning, enabling scalability to the dataset's size with comparable accuracy to MANO. We evaluate HOGraspNet on relevant tasks: grasp classification and 3D hand pose estimation. The result shows performance variations based on grasp type and object class, indicating the potential importance of the interaction space captured by our dataset. The provided data aims at learning universal shape priors or foundation models for 3D hand-object interaction. Our dataset and code are available at https://hograspnet2024.github.io/.