Tony
Abstract:Vision-based policies are widely applied in robotics for tasks such as manipulation and locomotion. On lightweight mobile robots, however, they face a trilemma of limited scene transferability, restricted onboard computation resources, and sensor hardware cost. To address these issues, we propose a knowledge distillation approach that transfers knowledge from an information-rich, appearance invariant omniview depth policy to a lightweight monocular policy. The key idea is to train the student not only to mimic the expert actions but also to align with the latent embeddings of the omni view depth teacher. Experiments demonstrate that omni-view and depth inputs improve the scene transfer and navigation performance, and that the proposed distillation method enhances the performance of a singleview monocular policy, compared with policies solely imitating actions. Real world experiments further validate the effectiveness and practicality of our approach. Code will be released publicly.
Abstract:In this paper, we study multi robot laser tag, a simplified yet practical shooting-game-style task. Classic modular approaches on these tasks face challenges such as limited observability and reliance on depth mapping and inter robot communication. To overcome these issues, we present an end-to-end visuomotor policy that maps images directly to robot actions. We train a high performing teacher policy with multi agent reinforcement learning and distill its knowledge into a vision-based student policy. Technical designs, including a permutation-invariant feature extractor and depth heatmap input, improve performance over standard architectures. Our policy outperforms classic methods by 16.7% in hitting accuracy and 6% in collision avoidance, and is successfully deployed on real robots. Code will be released publicly.
Abstract:Flight control for autonomous micro aerial vehicles (MAVs) is evolving from steady flight near equilibrium points toward more aggressive aerobatic maneuvers, such as flips, rolls, and Power Loop. Although reinforcement learning (RL) has shown great potential in these tasks, conventional RL methods often suffer from low data efficiency and limited generalization. This challenge becomes more pronounced in multi-task scenarios where a single policy is required to master multiple maneuvers. In this paper, we propose a novel end-to-end multi-task reinforcement learning framework, called GEAR (Geometric Equivariant Aerobatics Reinforcement), which fully exploits the inherent SO(2) rotational symmetry in MAV dynamics and explicitly incorporates this property into the policy network architecture. By integrating an equivariant actor network, FiLM-based task modulation, and a multi-head critic, GEAR achieves both efficiency and flexibility in learning diverse aerobatic maneuvers, enabling a data-efficient, robust, and unified framework for aerobatic control. GEAR attains a 98.85\% success rate across various aerobatic tasks, significantly outperforming baseline methods. In real-world experiments, GEAR demonstrates stable execution of multiple maneuvers and the capability to combine basic motion primitives to complete complex aerobatics.
Abstract:Monocular vision-based target motion estimation is a fundamental challenge in numerous applications. This work introduces a novel bearing-box approach that fully leverages modern 3D detection measurements that are widely available nowadays but have not been well explored for motion estimation so far. Unlike existing methods that rely on restrictive assumptions such as isotropic target shape and lateral motion, our bearing-box estimator can estimate both the target's motion and its physical size without these assumptions by exploiting the information buried in a 3D bounding box. When applied to multi-rotor micro aerial vehicles (MAVs), the estimator yields an interesting advantage: it further removes the need for higher-order motion assumptions by exploiting the unique coupling between MAV's acceleration and thrust. This is particularly significant, as higher-order motion assumptions are widely believed to be necessary in state-of-the-art bearing-based estimators. We support our claims with rigorous observability analyses and extensive experimental validation, demonstrating the estimator's superior performance in real-world scenarios.
Abstract:An aerial manipulator, comprising a multirotor base and a robotic arm, is subject to significant dynamic coupling between these two components. Therefore, achieving precise and robust motion control is a challenging yet important objective. Here, we propose a novel prescribed performance motion control framework based on variable-gain extended state observers (ESOs), referred to as PreGME. The method includes variable-gain ESOs for real-time estimation of dynamic coupling and a prescribed performance flight control that incorporates error trajectory constraints. Compared with existing methods, the proposed approach exhibits the following two characteristics. First, the adopted variable-gain ESOs can accurately estimate rapidly varying dynamic coupling. This enables the proposed method to handle manipulation tasks that require aggressive motion of the robotic arm. Second, by prescribing the performance, a preset error trajectory is generated to guide the system evolution along this trajectory. This strategy allows the proposed method to ensure the tracking error remains within the prescribed performance envelope, thereby achieving high-precision control. Experiments on a real platform, including aerial staff twirling, aerial mixology, and aerial cart-pulling experiments, are conducted to validate the effectiveness of the proposed method. Experimental results demonstrate that even under the dynamic coupling caused by rapid robotic arm motion (end-effector velocity: 1.02 m/s, acceleration: 5.10 m/s$^2$), the proposed method achieves high tracking performance.
Abstract:This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but -- more importantly -- is more useful for real-world queries. We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5's performance in three of ChatGPT's most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm -- our defined threshold for High capability -- we have chosen to take a precautionary approach.
Abstract:Multi-task multi-agent reinforcement learning (MT-MARL) has recently gained attention for its potential to enhance MARL's adaptability across multiple tasks. However, it is challenging for existing multi-task learning methods to handle complex problems, as they are unable to handle unrelated tasks and possess limited knowledge transfer capabilities. In this paper, we propose a hierarchical approach that efficiently addresses these challenges. The high-level module utilizes a skill graph, while the low-level module employs a standard MARL algorithm. Our approach offers two contributions. First, we consider the MT-MARL problem in the context of unrelated tasks, expanding the scope of MTRL. Second, the skill graph is used as the upper layer of the standard hierarchical approach, with training independent of the lower layer, effectively handling unrelated tasks and enhancing knowledge transfer capabilities. Extensive experiments are conducted to validate these advantages and demonstrate that the proposed method outperforms the latest hierarchical MAPPO algorithms. Videos and code are available at https://github.com/WindyLab/MT-MARL-SG




Abstract:Existing micro aerial vehicle (MAV) detection methods mainly rely on the target's appearance features in RGB images, whose diversity makes it difficult to achieve generalized MAV detection. We notice that different types of MAVs share the same distinctive features in event streams due to their high-speed rotating propellers, which are hard to see in RGB images. This paper studies how to detect different types of MAVs from an event camera by fully exploiting the features of propellers in the original event stream. The proposed method consists of three modules to extract the salient and spatio-temporal features of the propellers while filtering out noise from background objects and camera motion. Since there are no existing event-based MAV datasets, we introduce a novel MAV dataset for the community. This is the first event-based MAV dataset comprising multiple scenarios and different types of MAVs. Without training, our method significantly outperforms state-of-the-art methods and can deal with challenging scenarios, achieving a precision rate of 83.0\% (+30.3\%) and a recall rate of 81.5\% (+36.4\%) on the proposed testing dataset. The dataset and code are available at: https://github.com/WindyLab/EvDetMAV.




Abstract:Recently, photo-realistic novel view synthesis from multi-view images, such as neural radiance field (NeRF) and 3D Gaussian Splatting (3DGS), have garnered widespread attention due to their superior performance. However, most works rely on low dynamic range (LDR) images, which limits the capturing of richer scene details. Some prior works have focused on high dynamic range (HDR) scene reconstruction, typically require capturing of multi-view sharp images with different exposure times at fixed camera positions during exposure times, which is time-consuming and challenging in practice. For a more flexible data acquisition, we propose a one-stage method: \textbf{CasualHDRSplat} to easily and robustly reconstruct the 3D HDR scene from casually captured videos with auto-exposure enabled, even in the presence of severe motion blur and varying unknown exposure time. \textbf{CasualHDRSplat} contains a unified differentiable physical imaging model which first applies continuous-time trajectory constraint to imaging process so that we can jointly optimize exposure time, camera response function (CRF), camera poses, and sharp 3D HDR scene. Extensive experiments demonstrate that our approach outperforms existing methods in terms of robustness and rendering quality. Our source code will be available at https://github.com/WU-CVGL/CasualHDRSplat




Abstract:Reinforcement learning (RL) algorithms can enable high-maneuverability in unmanned aerial vehicles (MAVs), but transferring them from simulation to real-world use is challenging. Variable-pitch propeller (VPP) MAVs offer greater agility, yet their complex dynamics complicate the sim-to-real transfer. This paper introduces a novel RL framework to overcome these challenges, enabling VPP MAVs to perform advanced aerial maneuvers in real-world settings. Our approach includes real-to-sim transfer techniques-such as system identification, domain randomization, and curriculum learning to create robust training simulations and a sim-to-real transfer strategy combining a cascade control system with a fast-response low-level controller for reliable deployment. Results demonstrate the effectiveness of this framework in achieving zero-shot deployment, enabling MAVs to perform complex maneuvers such as flips and wall-backtracking.