Michael
Abstract:Enabling robots to navigate open-world environments via natural language is critical for general-purpose autonomy. Yet, Vision-Language Navigation has relied on end-to-end policies trained on expensive, embodiment-specific robot data. While recent foundation models trained on vast simulation data show promise, the challenge of scaling and generalizing due to the limited scene diversity and visual fidelity in simulation persists. To address this gap, we propose ImagiNav, a novel modular paradigm that decouples visual planning from robot actuation, enabling the direct utilization of diverse in-the-wild navigation videos. Our framework operates as a hierarchy: a Vision-Language Model first decomposes instructions into textual subgoals; a finetuned generative video model then imagines the future video trajectory towards that subgoal; finally, an inverse dynamics model extracts the trajectory from the imagined video, which can then be tracked via a low-level controller. We additionally develop a scalable data pipeline of in-the-wild navigation videos auto-labeled via inverse dynamics and a pretrained Vision-Language Model. ImagiNav demonstrates strong zero-shot transfer to robot navigation without requiring robot demonstrations, paving the way for generalist robots that learn navigation directly from unlabeled, open-world data.
Abstract:Diffusion-based robot navigation policies trained on large-scale imitation learning datasets, can generate multi-modal trajectories directly from the robot's visual observations, bypassing the traditional localization-mapping-planning pipeline and achieving strong zero-shot generalization. However, their performance remains constrained by the coverage of offline datasets, and when deployed in unseen settings, distribution shift often leads to accumulated trajectory errors and safety-critical failures. Adapting diffusion policies with reinforcement learning is challenging because their iterative denoising structure hinders effective gradient backpropagation, while also making the training of an additional value network computationally expensive and less stable. To address these issues, we propose a reinforcement learning fine-tuning framework tailored for diffusion-based navigation. The method leverages the inherent multi-trajectory sampling mechanism of diffusion models and adopts Group Relative Policy Optimization (GRPO), which estimates relative advantages across sampled trajectories without requiring a separate value network. To preserve pretrained representations while enabling adaptation, we freeze the visual encoder and selectively update the higher decoder layers and action head, enhancing safety-aware behaviors through online environmental feedback. On the PointGoal task in Isaac Sim, our approach improves the Success Rate from 52.0% to 58.7% and SPL from 0.49 to 0.54 on unseen scenes, while reducing collision frequency. Additional experiments show that the fine-tuned policy transfers zero-shot to a real quadruped platform and maintains stable performance in geometrically out-of-distribution environments, suggesting improved adaptability and safe generalization to new domains.
Abstract:Large-scale Vision-Language-Action (VLA) models offer semantic generalization but suffer from high inference latency, limiting them to low-frequency batch-and-execute paradigm. This frequency mismatch creates an execution blind spot, causing failures in dynamic environments where targets move during the open-loop execution window. We propose TIDAL (Temporally Interleaved Diffusion and Action Loop), a hierarchical framework that decouples semantic reasoning from high-frequency actuation. TIDAL operates as a backbone-agnostic module for diffusion-based VLAs, using a dual-frequency architecture to redistribute the computational budget. Specifically, a low-frequency macro-intent loop caches semantic embeddings, while a high-frequency micro-control loop interleaves single-step flow integration with execution. This design enables approximately 9 Hz control updates on edge hardware (vs. approximately 2.4 Hz baselines) without increasing marginal overhead. To handle the resulting latency shift, we introduce a temporally misaligned training strategy where the policy learns predictive compensation using stale semantic intent alongside real-time proprioception. Additionally, we address the insensitivity of static vision encoders to velocity by incorporating a differential motion predictor. TIDAL is architectural, making it orthogonal to system-level optimizations. Experiments show a 2x performance gain over open-loop baselines in dynamic interception tasks. Despite a marginal regression in static success rates, our approach yields a 4x increase in feedback frequency and extends the effective horizon of semantic embeddings beyond the native action chunk size. Under non-paused inference protocols, TIDAL remains robust where standard baselines fail due to latency.
Abstract:We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements. (1) Scenario Awareness: 57.9% of 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning based on assumptions; (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both breadth and depth of financial documents; (3) Multi-Step Computation: Problems demand 11-step reasoning on average (5.3 extraction + 5.7 calculation steps), with 65.0% requiring cross-page evidence (2.4 pages average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variations on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.
Abstract:Navigating in crowded environments requires the robot to be equipped with high-level reasoning and planning techniques. Existing works focus on developing complex and heavyweight planners while ignoring the role of human intelligence. Since humans are highly capable agents who are also widely available in a crowd navigation setting, we propose an alternative scheme where the robot utilises people as planners to benefit from their effective planning decisions and social behaviours. Through a set of rule-based evaluations, we identify suitable human leaders who exhibit the potential to guide the robot towards its goal. Using a simple base planner, the robot follows the selected leader through shorthorizon subgoals that are designed to be straightforward to achieve. We demonstrate through both simulated and real-world experiments that our novel framework generates safe and efficient robot plans compared to existing planners, even without predictive or data-driven modules. Our method also brings human-like robot behaviours without explicitly defining traffic rules and social norms. Code will be available at https://github.com/centiLinda/PeopleAsPlanner.git.
Abstract:Traditional unmanned aerial vehicle (UAV) swarm missions rely heavily on expensive custom-made drones with onboard perception or external positioning systems, limiting their widespread adoption in research and education. To address this issue, we propose AirSwarm. AirSwarm democratizes multi-drone coordination using low-cost commercially available drones such as Tello or Anafi, enabling affordable swarm aerial robotics research and education. Key innovations include a hierarchical control architecture for reliable multi-UAV coordination, an infrastructure-free visual SLAM system for precise localization without external motion capture, and a ROS-based software framework for simplified swarm development. Experiments demonstrate cm-level tracking accuracy, low-latency control, communication failure resistance, formation flight, and trajectory tracking. By reducing financial and technical barriers, AirSwarm makes multi-robot education and research more accessible. The complete instructions and open source code will be available at




Abstract:Multi-robot navigation in complex environments relies on inter-robot communication and mutual observations for coordination and situational awareness. This paper studies the multi-robot navigation problem in unknown environments with line-of-sight (LoS) connectivity constraints. While previous works are limited to known environment models to derive the LoS constraints, this paper eliminates such requirements by directly formulating the LoS constraints between robots from their real-time point cloud measurements, leveraging point cloud visibility analysis techniques. We propose a novel LoS-distance metric to quantify both the urgency and sensitivity of losing LoS between robots considering potential robot movements. Moreover, to address the imbalanced urgency of losing LoS between two robots, we design a fusion function to capture the overall urgency while generating gradients that facilitate robots' collaborative movement to maintain LoS. The LoS constraints are encoded into a potential function that preserves the positivity of the Fiedler eigenvalue of the robots' network graph to ensure connectivity. Finally, we establish a LoS-constrained exploration framework that integrates the proposed connectivity controller. We showcase its applications in multi-robot exploration in complex unknown environments, where robots can always maintain the LoS connectivity through distributed sensing and communication, while collaboratively mapping the unknown environment. The implementations are open-sourced at https://github.com/bairuofei/LoS_constrained_navigation.




Abstract:Multi-axle autonomous mobile robots (AMRs) are set to revolutionize the future of robotics in logistics. As the backbone of next-generation solutions, these robots face a critical challenge: managing and minimizing the swept volume during turns while maintaining precise control. Traditional systems designed for standard vehicles often struggle with the complex dynamics of multi-axle configurations, leading to inefficiency and increased safety risk in confined spaces. Our innovative framework overcomes these limitations by combining swept volume minimization with Signed Distance Field (SDF) path planning and model predictive control (MPC) for independent wheel steering. This approach not only plans paths with an awareness of the swept volume but actively minimizes it in real-time, allowing each axle to follow a precise trajectory while significantly reducing the space the vehicle occupies. By predicting future states and adjusting the turning radius of each wheel, our method enhances both maneuverability and safety, even in the most constrained environments. Unlike previous works, our solution goes beyond basic path calculation and tracking, offering real-time path optimization with minimal swept volume and efficient individual axle control. To our knowledge, this is the first comprehensive approach to tackle these challenges, delivering life-saving improvements in control, efficiency, and safety for multi-axle AMRs. Furthermore, we will open-source our work to foster collaboration and enable others to advance safer, more efficient autonomous systems.




Abstract:This paper considers the collaborative graph exploration problem in GPS-denied environments, where a group of robots are required to cover a graph environment while maintaining reliable pose estimations in collaborative simultaneous localization and mapping (SLAM). Considering both objectives presents challenges for multi-robot pathfinding, as it involves the expensive covariance inference for SLAM uncertainty evaluation, especially considering various combinations of robots' paths. To reduce the computational complexity, we propose an efficient two-stage strategy where exploration paths are first generated for quick coverage, and then enhanced by adding informative and distance-efficient loop-closing actions, called loop edges, along the paths for reliable pose estimation. We formulate the latter problem as a non-monotone submodular maximization problem by relating SLAM uncertainty with pose graph topology, which (1) facilitates more efficient evaluation of SLAM uncertainty than covariance inference, and (2) allows the application of approximation algorithms in submodular optimization to provide optimality guarantees. We further introduce the ordering heuristics to improve objective values while preserving the optimality bound. Simulation experiments over randomly generated graph environments verify the efficiency of our methods in finding paths for quick coverage and enhanced pose graph reliability, and benchmark the performance of the approximation algorithms and the greedy-based algorithm in the loop edge selection problem. Our implementations will be open-source at https://github.com/bairuofei/CGE.
Abstract:Autonomous exploration requires the robot to explore an unknown environment while constructing an accurate map with the SLAM (Simultaneous Localization and Mapping) techniques. Without prior information, the exploratory performance is usually conservative due to the limited planning horizon. This paper exploits a prior topo-metric graph of the environment to benefit both the exploration efficiency and the pose graph accuracy in SLAM. Based on recent advancements in relating pose graph reliability with graph topology, we are able to formulate both objectives into a SLAM-aware path planning problem over the prior graph, which finds a fast exploration path with informative loop closures that globally stabilize the pose graph. Furthermore, we derive theoretical thresholds to speed up the greedy algorithm to the problem, which significantly prune non-optimal loop closures in iterations. The proposed planner is incorporated into a hierarchical exploration framework, with flexible features including path replanning and online prior map update that adds additional information to the prior graph. Extensive experiments indicate that our method has comparable exploration efficiency to others while consistently maintaining higher mapping accuracy in various environments. Our implementations will be open-source on GitHub.