Abstract:The advancement of robot learning is currently hindered by the scarcity of large-scale, high-quality datasets. While established data collection methods such as teleoperation and universal manipulation interfaces dominate current datasets, they suffer from inherent limitations in scalability and real-world deployability. Human egocentric video collection, by contrast, has emerged as a promising approach to enable scalable, natural and in-the-wild data collection. As such, we present EgoLive, a large-scale, high-quality egocentric dataset designed explicitly for robot manipulation learning. EgoLive establishes three distinctive technical advantages over existing egocentric datasets: first, it represents the largest open-source annotated egocentric dataset focused on real-world task-oriented human routines to date; second, it delivers leading data quality via a customized head-mounted capture device and comprehensive high-precision multi-modal annotations; third, all data is collected exclusively in unconstrained real-world scenarios and encompasses vertical field human working data, including home service, retail, and other practical work scenarios, providing superior diversity and ecological validity. With the introduction of EgoLive, we aim to provide the research community with a scalable, high-quality dataset that accelerates breakthroughs in generalizable robotic models and facilitates the real-world deployment of robot systems.
Abstract:The Visual-Inertial Simultaneous Localization and Mapping (VI-SLAM) algorithms which are mostly based on static assumption are widely used in fields such as robotics, UAVs, VR, and autonomous driving. To overcome the localization risks caused by dynamic landmarks in most VI-SLAM systems, a robust visual-inertial motion prior SLAM system, named (IDY-VINS), is proposed in this paper which effectively handles dynamic landmarks using inertial motion prior for dynamic environments to varying degrees. Specifically, potential dynamic landmarks are preprocessed during the feature tracking phase by the probabilistic model of landmarks' minimum projection errors which are obtained from inertial motion prior and epipolar constraint. Subsequently, a bundle adjustment (BA) residual is proposed considering the minimum projection error prior for dynamic candidate landmarks. This residual is integrated into a sliding window based nonlinear optimization process to estimate camera poses, IMU states and landmark positions while minimizing the impact of dynamic candidate landmarks that deviate from the motion prior. Finally, experimental results demonstrate that our proposed system outperforms state-of-the-art methods in terms of localization accuracy and time cost by robustly mitigating the influence of dynamic landmarks.