Abstract: Spatial reasoning is a fundamental capability for embodied intelligence, especially for fine-grained manipulation tasks such as robotic assembly. While recent vision-language models (VLMs) exhibit preliminary spatial awareness, they largely rely on coarse 2D perception and lack the ability to perform accurate reasoning over 3D geometry, which is crucial for precise assembly operations. To address this limitation, we propose AssemLM, a spatial multimodal large language model tailored for robotic assembly. AssemLM integrates assembly manuals, point clouds, and textual instructions to reason about and predict task-critical 6D assembly poses, enabling explicit geometric understanding throughout the assembly process. To effectively bridge raw 3D perception and high-level reasoning, we adopt a specialized point cloud encoder to capture fine-grained geometric and rotational features, which are then integrated into the multimodal language model to support accurate 3D spatial reasoning for assembly tasks. In addition, we construct AssemBench, a large-scale dataset and benchmark for assembly-oriented spatial reasoning, comprising over 900K multimodal samples with precise 6D pose annotations. AssemBench extends spatial reasoning evaluation beyond 2D and grounding tasks into full 3D geometric inference, filling a critical gap in existing embodied AI benchmarks. Extensive experiments demonstrate that AssemLM achieves state-of-the-art performance in 6D pose reasoning across diverse assembly scenarios. Furthermore, real-robot evaluations show that our model can support fine-grained and multi-step assembly execution in real-world settings, highlighting its potential for robotic assembly applications.
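The abstract does not specify how 6D pose predictions are scored, but a common convention is to measure rotation error as the geodesic distance on SO(3) and translation error as Euclidean distance. The following is a minimal, self-contained sketch of those two standard metrics (the function names and the pure-Python matrix helpers are illustrative, not from the paper):

```python
import math

def mat_T(R):
    # transpose of a 3x3 rotation matrix given as nested lists
    return [[R[j][i] for j in range(3)] for i in range(3)]

def mat_mul(A, B):
    # product of two 3x3 matrices
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotation_error_deg(R_pred, R_gt):
    # geodesic distance on SO(3): arccos((trace(R_pred^T R_gt) - 1) / 2)
    M = mat_mul(mat_T(R_pred), R_gt)
    tr = M[0][0] + M[1][1] + M[2][2]
    c = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp for numerical safety
    return math.degrees(math.acos(c))

def translation_error(t_pred, t_gt):
    # Euclidean distance between predicted and ground-truth translations
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_pred, t_gt)))

# identity vs. a 90-degree rotation about the z-axis
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
Rz90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
print(rotation_error_deg(I, Rz90))            # 90.0
print(translation_error([0, 0, 0], [0.03, 0.04, 0.0]))  # 0.05
```

In practice a predicted pose is counted correct when both errors fall under task-specific thresholds (e.g. a few degrees and a few millimeters for tight-tolerance assembly).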




Abstract: For robotic manipulation, existing robotics datasets and simulation benchmarks predominantly cater to robot-arm platforms. However, for humanoid robots equipped with dual arms and dexterous hands, simulation tasks and high-quality demonstrations are notably lacking. Bimanual dexterous manipulation is inherently more complex, as it requires coordinated arm movements and hand operations, making autonomous data collection challenging. This paper presents HumanoidGen, an automated task creation and demonstration collection framework that leverages atomic dexterous operations and LLM reasoning to generate relational constraints. Specifically, we provide spatial annotations for both assets and dexterous hands based on the atomic operations, and employ an LLM planner to generate a chain of actionable spatial constraints for arm movements based on object affordances and scenes. To further improve planning ability, we employ a variant of Monte Carlo tree search to enhance LLM reasoning for long-horizon tasks and cases with insufficient annotations. In experiments, we create a novel benchmark with augmented scenarios to evaluate the quality of the collected data. The results show that the performance of the 2D and 3D diffusion policies scales with the generated dataset. The project page is https://openhumanoidgen.github.io.
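The abstract names a Monte Carlo tree search variant for strengthening LLM planning but does not detail it. The core machinery shared by most MCTS variants is UCB1-based selection plus reward backpropagation; the sketch below illustrates only that generic skeleton (the `Node` layout, constant `c=1.4`, and function names are assumptions, not the paper's design, and the LLM-driven expansion/rollout steps are omitted):

```python
import math

class Node:
    """One search-tree node; `state` would hold a partial plan."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb1(node, c=1.4):
    # UCB1: mean reward (exploitation) + visit-count bonus (exploration)
    if node.visits == 0:
        return float("inf")  # always try unvisited children first
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def select(node):
    # descend to a leaf by repeatedly taking the child with the best UCB1 score
    while node.children:
        node = max(node.children, key=ucb1)
    return node

def backpropagate(node, reward):
    # push the simulation reward back up to the root
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

# toy run: two candidate sub-plans, one rewarded, one not
root = Node("root")
a, b = Node("plan_a", root), Node("plan_b", root)
root.children = [a, b]
backpropagate(a, 1.0)
backpropagate(b, 0.0)
print(select(root).state)  # plan_a
```

In an LLM-planning setting, expansion would query the model for candidate next constraints and the reward would come from checking whether the resulting motion is feasible in simulation.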




Abstract: Robotic assembly tasks are open challenges due to the long task horizon and complex part relations. Behavior trees (BTs) are increasingly used in robot task planning for their modularity and flexibility, but manually designing them can be effort-intensive. Large language models (LLMs) have recently been applied in robotic task planning for generating action sequences, but their ability to generate BTs has not been fully investigated. To this end, we propose LLM as BT-planner, a novel framework that leverages LLMs for BT generation in robotic assembly task planning and execution. Four in-context learning methods are introduced to utilize the natural language processing and inference capabilities of LLMs to produce task plans in BT format, reducing manual effort and ensuring robustness and comprehensibility. We also evaluate the performance of fine-tuned, fewer-parameter LLMs on the same tasks. Experiments in simulated and real-world settings show that our framework enhances LLMs' performance in BT generation, improving success rates through both in-context learning and supervised fine-tuning.
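For readers unfamiliar with the BT formalism the abstract builds on, the two standard control-flow nodes are Sequence (fail-fast "and") and Fallback (succeed-fast "or"). The sketch below is a generic minimal BT, not the paper's actual format; the node and action names are illustrative, and the RUNNING status used by full BT implementations is omitted for brevity:

```python
SUCCESS, FAILURE = "SUCCESS", "FAILURE"

class Sequence:
    """Ticks children left to right; fails on the first failing child."""
    def __init__(self, children):
        self.children = children
    def tick(self):
        for c in self.children:
            if c.tick() == FAILURE:
                return FAILURE
        return SUCCESS

class Fallback:
    """Ticks children left to right; succeeds on the first succeeding child."""
    def __init__(self, children):
        self.children = children
    def tick(self):
        for c in self.children:
            if c.tick() == SUCCESS:
                return SUCCESS
        return FAILURE

class Action:
    """Leaf node wrapping a condition or primitive skill."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
    def tick(self):
        return SUCCESS if self.fn() else FAILURE

# toy assembly plan: try to insert directly, otherwise regrasp and retry
tree = Sequence([
    Action("pick_part", lambda: True),
    Fallback([
        Action("insert_part", lambda: False),   # first attempt fails
        Sequence([
            Action("regrasp", lambda: True),
            Action("insert_part_retry", lambda: True),
        ]),
    ]),
])
print(tree.tick())  # SUCCESS
```

The modularity claim in the abstract follows directly from this structure: an LLM can emit or edit one subtree (e.g. the recovery branch) without touching the rest of the plan.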




Abstract: In this work, we propose an LLM-based BT generation framework that leverages the strengths of both LLMs and behavior trees (BTs) for sequential manipulation planning. To enable human-robot collaborative task planning and enhance intuitive robot programming by nonexperts, the framework takes human instructions to initiate the generation of action sequences and human feedback to refine BT generation at runtime. All presented methods within the framework are tested on a real robotic assembly example, which uses a gear set model from the Siemens Robot Assembly Challenge. We use a single manipulator with a tool-changing mechanism, a common practice in flexible manufacturing, to facilitate robust grasping of a large variety of objects. Experimental results are evaluated regarding success rate, logical coherence, executability, time consumption, and token consumption. To our knowledge, this is the first human-guided LLM-based BT generation framework that unifies various plausible ways of using LLMs to fully generate BTs that are executable on the real testbed and take into account granular knowledge of tool use.