Drone navigation is the process of autonomously controlling drones to navigate and fly in different environments.
Designing provably safe control is a core problem in trustworthy autonomy. However, most prior work in this regard assumes either that the system dynamics are known or deterministic, or that the state and action space are finite, significantly limiting application scope. We address this limitation by developing a probabilistic verification framework for unknown dynamical systems which combines conformal prediction with reachability analysis. In particular, we use conformal prediction to obtain valid uncertainty intervals for the unknown dynamics at each time step, with reachability then verifying whether safety is maintained within the conformal uncertainty bounds. Next, we develop an algorithmic approach for training control policies that optimize nominal reward while also maximizing the planning horizon with sound probabilistic safety guarantees. We evaluate the proposed approach in seven safe control settings spanning four domains -- cartpole, lane following, drone control, and safe navigation -- for both affine and nonlinear safety specifications. Our experiments show that the policies we learn achieve the strongest provable safety guarantees while still maintaining high average reward.
In this paper, we propose a novel semi-autonomous image sampling strategy, called stealthy coverage control, for human-enabled 3D structure reconstruction. The present mission involves a fundamental problem: while the number of images required to accurately reconstruct a 3D model depends on the structural complexity of the target scene to be reconstructed, it is not realistic to assume prior knowledge of the spatially non-uniform structural complexity. We approach this issue by leveraging human flexible reasoning and situational recognition capabilities. Specifically, we design a semi-autonomous system that leaves identification of regions that need more images and navigation of the drones to such regions to a human operator. To this end, we first present a way to reflect the human intention in autonomous coverage control. Subsequently, in order to avoid operational conflicts between manual control and autonomous coverage control, we develop the stealthy coverage control that decouples the drone motion for efficient image sampling from navigation by the human. Simulation studies on a Unity/ROS2-based simulator demonstrate that the present semi-autonomous system outperforms the one without human interventions in the sense of the reconstructed model quality.
Autonomous Unmanned Underwater Vehicles (UUVs) enable military and civilian covert operations in coastal areas without relying on support vessels or Global Navigation Satellite Systems (GNSS). Such operations are critical when surface access is not possible and stealthy navigation is required in restricted environments such as protected zones or dangerous areas under access ban. GNSS denied navigation is then essential to maintaining concealment as surfacing could expose UUVs to detection. To ensure a precise fleet positioning a constellation of beacons deployed by aerial or surface drones establish a synthetic landmark network that will guide the fleet of UUVs along an optimized path from the continental shelf to the goal on the shore. These beacons either submerged or floating emit acoustic signals for UUV localisation and navigation. A hierarchical planner generates an adaptive route for the drones executing primitive actions while continuously monitoring and replanning as needed to maintain trajectory accuracy.
Glass surface ubiquitous in both daily life and professional environments presents a potential threat to vision-based systems, such as robot and drone navigation. To solve this challenge, most recent studies have shown significant interest in Video Glass Surface Detection (VGSD). We observe that objects in the reflection (or transmission) layer appear farther from the glass surfaces. Consequently, in video motion scenarios, the notable reflected (or transmitted) objects on the glass surface move slower than objects in non-glass regions within the same spatial plane, and this motion inconsistency can effectively reveal the presence of glass surfaces. Based on this observation, we propose a novel network, named MVGD-Net, for detecting glass surfaces in videos by leveraging motion inconsistency cues. Our MVGD-Net features three novel modules: the Cross-scale Multimodal Fusion Module (CMFM) that integrates extracted spatial features and estimated optical flow maps, the History Guided Attention Module (HGAM) and Temporal Cross Attention Module (TCAM), both of which further enhances temporal features. A Temporal-Spatial Decoder (TSD) is also introduced to fuse the spatial and temporal features for generating the glass region mask. Furthermore, for learning our network, we also propose a large-scale dataset, which comprises 312 diverse glass scenarios with a total of 19,268 frames. Extensive experiments demonstrate that our MVGD-Net outperforms relevant state-of-the-art methods.
Object-Goal Navigation (ObjectNav) requires an agent to autonomously explore an unknown environment and navigate toward target objects specified by a semantic label. While prior work has primarily studied zero-shot ObjectNav under 2D locomotion, extending it to aerial platforms with 3D locomotion capability remains underexplored. Aerial robots offer superior maneuverability and search efficiency, but they also introduce new challenges in spatial perception, dynamic control, and safety assurance. In this paper, we propose AION for vision-based aerial ObjectNav without relying on external localization or global maps. AION is an end-to-end dual-policy reinforcement learning (RL) framework that decouples exploration and goal-reaching behaviors into two specialized policies. We evaluate AION on the AI2-THOR benchmark and further assess its real-time performance in IsaacSim using high-fidelity drone models. Experimental results show that AION achieves superior performance across comprehensive evaluation metrics in exploration, navigation efficiency, and safety. The video can be found at https://youtu.be/TgsUm6bb7zg.
The use of artificial intelligence (AI) for drone control can have a transformative impact on drone capabilities, especially when real world information can be integrated with drone sensing, command, and control, part of a growing field of physical AI. Large language models (LLMs) can be advantageous if trained at scale on general knowledge, but especially and in particular when the training data includes information such as detailed map geography topology of the entire planet, as well as the ability to access real time situational data such as weather. However, challenges remain in the interface between drones and LLMs in general, with each application requiring a tedious, labor intensive effort to connect the LLM trained knowledge to drone command and control. Here, we solve that problem, using an interface strategy that is LLM agnostic and drone agnostic, providing the first universal, versatile, comprehensive and easy to use drone control interface. We do this using the new model context protocol (MCP) standard, an open standard that provides a universal way for AI systems to access external data, tools, and services. We develop and deploy a cloud based Linux machine hosting an MCP server that supports the Mavlink protocol, an ubiquitous drone control language used almost universally by millions of drones including Ardupilot and PX4 framework.We demonstrate flight control of a real unmanned aerial vehicle. In further testing, we demonstrate extensive flight planning and control capability in a simulated drone, integrated with a Google Maps MCP server for up to date, real time navigation information. This demonstrates a universal approach to integration of LLMs with drone command and control, a paradigm that leverages and exploits virtually all of modern AI industry with drone technology in an easy to use interface that translates natural language to drone control.
Micro-Unmanned Aerial Vehicles (UAVs) are rapidly expanding into tasks from inventory to environmental sensing, yet their short endurance and unreliable navigation in GPS-denied spaces limit deployment. Lighter-Than-Air (LTA) drones offer an energy-efficient alternative: they use a helium envelope to provide buoyancy, which enables near-zero-power drain during hovering and much longer operation. LTAs are promising, but their design is complex, and they lack integrated solutions to enable sustained autonomous operations and navigation with simple, low-infrastructure. We propose a compact, self-sustaining LTA drone that uses light for both energy harvesting and navigation. Our contributions are threefold: (i) a high-fidelity simulation framework to analyze LTA aerodynamics and select a stable, efficient configuration; (ii) a framework to integrate solar cells on the envelope to provide net-positive energy; and (iii) a point-and-go navigation system with three light-seeking algorithms operating on a single light beacon. Our LTA-analysis, together with the integrated solar panels, not only saves energy while flying, but also enables sustainable operation: providing 1 minute of flying time for every 4 minutes of energy harvesting, under illuminations of 80klux. We also demonstrate robust single-beacon navigation towards a light source that can be up to 7m away, in indoor and outdoor environments, even with moderate winds. The resulting system indicates a plausible path toward persistent, autonomous operation for indoor and outdoor monitoring. More broadly, this work provides a practical pathway for translating the promise of LTA drones into a persistent, self-sustaining aerial system.
As aerial platforms evolve from passive observers to active manipulators, the challenge shifts toward designing intuitive interfaces that allow non-expert users to command these systems naturally. This work introduces a novel concept of autonomous aerial manipulation system capable of interpreting high-level natural language commands to retrieve objects and deliver them to a human user. The system is intended to integrate a MediaPipe based on Grounding DINO and a Vision-Language-Action (VLA) model with a custom-built drone equipped with a 1-DOF gripper and an Intel RealSense RGB-D camera. VLA performs semantic reasoning to interpret the intent of a user prompt and generates a prioritized task queue for grasping of relevant objects in the scene. Grounding DINO and dynamic A* planning algorithm are used to navigate and safely relocate the object. To ensure safe and natural interaction during the handover phase, the system employs a human-centric controller driven by MediaPipe. This module provides real-time human pose estimation, allowing the drone to employ visual servoing to maintain a stable, distinct position directly in front of the user, facilitating a comfortable handover. We demonstrate the system's efficacy through real-world experiments for localization and navigation, which resulted in a 0.164m, 0.070m, and 0.084m of max, mean euclidean, and root-mean squared errors, respectively, highlighting the feasibility of VLA for aerial manipulation operations.
Recent advances in large Vision-Language Models (VLMs) have provided rich semantic understanding that empowers drones to search for open-set objects via natural language instructions. However, prior systems struggle to integrate VLMs into practical aerial systems due to orders-of-magnitude frequency mismatch between VLM inference and real-time planning, as well as VLMs' limited 3D scene understanding. They also lack a unified mechanism to balance semantic guidance with motion efficiency in large-scale environments. To address these challenges, we present AirHunt, an aerial object navigation system that efficiently locates open-set objects with zero-shot generalization in outdoor environments by seamlessly fusing VLM semantic reasoning with continuous path planning. AirHunt features a dual-pathway asynchronous architecture that establishes a synergistic interface between VLM reasoning and path planning, enabling continuous flight with adaptive semantic guidance that evolves through motion. Moreover, we propose an active dual-task reasoning module that exploits geometric and semantic redundancy to enable selective VLM querying, and a semantic-geometric coherent planning module that dynamically reconciles semantic priorities and motion efficiency in a unified framework, enabling seamless adaptation to environmental heterogeneity. We evaluate AirHunt across diverse object navigation tasks and environments, demonstrating a higher success rate with lower navigation error and reduced flight time compared to state-of-the-art methods. Real-world experiments further validate AirHunt's practical capability in complex and challenging environments. Code and dataset will be made publicly available before publication.
Marker-based landing is widely used in drone delivery and return-to-base systems for its simplicity and reliability. However, most approaches assume idealized landing site visibility and sensor performance, limiting robustness in complex urban settings. We present a simulation-based evaluation suite on the AirSim platform with systematically varied urban layouts, lighting, and weather to replicate realistic operational diversity. Using onboard camera sensors (RGB for marker detection and depth for obstacle avoidance), we benchmark two heuristic coverage patterns and a reinforcement learning-based agent, analyzing how exploration strategy and scene complexity affect success rate, path efficiency, and robustness. Results underscore the need to evaluate marker-based autonomous landing under diverse, sensor-relevant conditions to guide the development of reliable aerial navigation systems.