Abstract: The prevailing paradigm of perceptive humanoid locomotion relies heavily on active depth sensors. However, this depth-centric approach discards the rich semantic and dense appearance cues of the visual world, severing low-level control from the high-level reasoning essential for general embodied intelligence. While monocular RGB offers a ubiquitous, information-dense alternative, end-to-end reinforcement learning from raw 2D pixels suffers from extreme sample inefficiency and catastrophic sim-to-real collapse due to the inherent loss of geometric scale. To break this deadlock, we propose GeoLoco, a purely RGB-driven locomotion framework that lifts monocular images into scale-aware 3D latent representations by harnessing the geometric priors of a frozen, scale-aware Visual Foundation Model (VFM). Rather than naively concatenating features, we design a proprioceptive-query multi-head cross-attention mechanism that dynamically attends to task-critical topological features conditioned on the robot's real-time gait phase. Crucially, to prevent the policy from overfitting to superficial textures, we introduce a dual-head auxiliary learning scheme: this explicit regularization forces the high-dimensional latent space to align with the physical terrain geometry, enabling robust zero-shot sim-to-real transfer. Trained exclusively in simulation, GeoLoco transfers zero-shot to the Unitree G1 humanoid and successfully negotiates challenging terrains.
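The abstract does not specify the attention mechanism's implementation, but the core idea of a proprioceptive-query cross-attention over frozen VFM features can be sketched as follows. This is a minimal single-head numpy illustration: all dimensions, weight shapes, and the use of dot-product attention are assumptions, not details from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def proprio_cross_attention(proprio, vfm_tokens, Wq, Wk, Wv):
    """Cross-attention where the query comes from the robot's proprioceptive
    state (joint angles, gait phase, ...) and the keys/values come from frozen
    VFM patch features. Returns one terrain feature vector per timestep.

    proprio:    (d_p,)   proprioceptive state vector
    vfm_tokens: (N, d_f) N patch features from the frozen VFM
    Wq, Wk, Wv: assumed learned projections into a shared dimension d
    """
    q = proprio @ Wq                          # (d,)   state-conditioned query
    k = vfm_tokens @ Wk                       # (N, d) keys from image features
    v = vfm_tokens @ Wv                       # (N, d) values from image features
    scores = (k @ q) / np.sqrt(q.shape[-1])   # scaled dot-product scores
    attn = softmax(scores)                    # (N,)   attention over patches
    return attn @ v                           # (d,)   attended terrain feature

# Hypothetical sizes: 32-d proprioception, 196 VFM tokens of dimension 384.
rng = np.random.default_rng(0)
d_p, d_f, d, N = 32, 384, 64, 196
out = proprio_cross_attention(
    rng.normal(size=d_p),
    rng.normal(size=(N, d_f)),
    rng.normal(size=(d_p, d)) * 0.1,
    rng.normal(size=(d_f, d)) * 0.1,
    rng.normal(size=(d_f, d)) * 0.1,
)
assert out.shape == (d,)
```

In a multi-head variant, the same pattern would be repeated per head with independent projections and the head outputs concatenated; conditioning on gait phase would simply mean that the phase variable is part of `proprio`.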




Abstract: The autonomous control of flippers plays an important role in enhancing the intelligent operation of tracked robots in complex environments. Existing methods rely mainly on hand-crafted control models; in this paper, we introduce a novel approach that leverages deep reinforcement learning (DRL) for autonomous flipper control on complex terrains. Specifically, we propose a new DRL network named AT-D3QN, which ensures safe and smooth flipper control for tracked robots. It comprises two modules: a feature extraction and fusion module that extracts and integrates robot and environment state features, and a deep Q-learning control generation module that incorporates expert knowledge to obtain a smooth and efficient control strategy. To train the network, a novel reward function is proposed that considers both learning efficiency and traversal smoothness. A simulation environment is constructed using the Pymunk physics engine for training, and the trained model is then applied directly to a more realistic Gazebo simulation for quantitative analysis. The consistently high performance of the proposed approach validates its superiority over manual teleoperation.
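The abstract names the network AT-D3QN but does not describe its internals; the "D3QN" family conventionally combines a dueling Q-value decomposition with double-DQN targets, and that standard machinery can be sketched as follows. All weight shapes and the linear heads are illustrative assumptions; the actual AT-D3QN architecture (including its attention and expert-knowledge components) is not specified here.

```python
import numpy as np

def dueling_q(features, Wv, Wa):
    """Dueling Q-decomposition: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a).

    features: (d,)            fused robot/environment state features
    Wv:       (d, 1)          assumed linear state-value head
    Wa:       (d, n_actions)  assumed linear advantage head
    """
    V = features @ Wv                # (1,)          scalar state value
    A = features @ Wa                # (n_actions,)  per-action advantages
    return V + A - A.mean()          # (n_actions,)  identifiable Q-values

def double_dqn_target(r, gamma, feat_next, online, target):
    """Double-DQN target: the online net selects the next action,
    the target net evaluates it, reducing overestimation bias."""
    a_star = int(np.argmax(dueling_q(feat_next, *online)))
    return r + gamma * dueling_q(feat_next, *target)[a_star]

# Tiny deterministic example with hypothetical sizes (4-d features, 3 actions).
d, n_actions = 4, 3
feat = np.ones(d)
Wv = np.full((d, 1), 0.5)
Wa = np.arange(d * n_actions, dtype=float).reshape(d, n_actions) * 0.01
q = dueling_q(feat, Wv, Wa)
assert q.shape == (n_actions,)
```

A smoothness-aware reward, as the abstract describes, would typically add a penalty on abrupt flipper-angle changes to `r` before computing this target, but its exact form is not given in the abstract.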