Abstract:Recent advances in visuomotor policy learning have enabled robots to perform control directly from visual inputs. Yet, extending such end-to-end learning from single-arm to bimanual manipulation remains challenging due to the need for both independent perception and coordinated interaction between arms. Existing methods typically favor one side -- either decoupling the two arms to avoid interference or enforcing strong cross-arm coupling for coordination -- thus lacking a unified treatment. We propose CUBic, a Coordinated and Unified framework for Bimanual perception and control that reformulates bimanual coordination as a unified perceptual modeling problem. CUBic learns a shared tokenized representation bridging perception and control, where independence and coordination emerge intrinsically from structure rather than from hand-crafted coupling. Our approach integrates three components: unidirectional perception aggregation, bidirectional perception coordination through two codebooks with shared mapping, and a unified perception-to-control diffusion policy. Extensive experiments on the RoboTwin benchmark show that CUBic consistently surpasses standard baselines, achieving marked improvements in coordination accuracy and task success rates over state-of-the-art visuomotor baselines.
Abstract:Real-world robot manipulation in dynamic unstructured environments requires lifelong adaptability to evolving objects, scenes and tasks. Traditional imitation learning relies on static training paradigms, which are ill-suited for lifelong adaptation. Although Continual Imitation Learnin (CIL) enables incremental task adaptation while preserving learned knowledge, current CIL methods primarily overlook the intrinsic skill characteristics of robot manipulation or depend on manually defined and rigid skills, leading to suboptimal cross-task knowledge transfer. To address these issues, we propose Skill Prompts-based HiErarchical Continual Imitation Learning (SPECI), a novel end-to-end hierarchical CIL policy architecture for robot manipulation. The SPECI framework consists of a multimodal perception and fusion module for heterogeneous sensory information encoding, a high-level skill inference module for dynamic skill extraction and selection, and a low-level action execution module for precise action generation. To enable efficient knowledge transfer on both skill and task levels, SPECI performs continual implicit skill acquisition and reuse via an expandable skill codebook and an attention-driven skill selection mechanism. Furthermore, we introduce mode approximation to augment the last two modules with task-specific and task-sharing parameters, thereby enhancing task-level knowledge transfer. Extensive experiments on diverse manipulation task suites demonstrate that SPECI consistently outperforms state-of-the-art CIL methods across all evaluated metrics, revealing exceptional bidirectional knowledge transfer and superior overall performance.