Abstract:Recent work leverages the capabilities and commonsense priors of generative models for robot control. In this paper, we present an agentic control system in which a reasoning-capable language model plans and executes tasks by selecting and invoking robot skills within an iterative planner and executor loop. We deploy the system on two physical robot platforms in two settings: (i) tabletop grasping, placement, and box insertion in indoor mobile manipulation (Mobipick) and (ii) autonomous agricultural navigation and sensing (Valdemar). Both settings involve uncertainty, partial observability, sensor noise, and ambiguous natural-language commands. The system exposes structured introspection of its planning and decision process, reacts to exogenous events via explicit event checks, and supports operator interventions that modify or redirect ongoing execution. Across both platforms, our proof-of-concept experiments reveal substantial fragility, including non-deterministic suboptimal behavior, instruction-following errors, and high sensitivity to prompt specification. At the same time, the architecture is flexible: transfer to a different robot and task domain largely required updating the system prompt (domain model, affordances, and action catalogue) and re-binding the same tool interface to the platform-specific skill API.




Abstract:Manipulating objects with robotic hands is a complicated task. Not only the pose of the robot's end effector, but also the fingers of the hand need to be controlled and coordinated. Using human demonstrations of movements is an intuitive and data-efficient way of guiding the robot's behavior. We propose a modular framework with an automatic embodiment mapping to transfer human hand motions to robotic systems and use motion capture to record human motion. We evaluate our approach on eight challenging tasks, in which a robotic arm with a mounted robotic hand needs to grasp and manipulate deformable objects or small, fragile material.