Abstract: We present HaoMo Vision-Language Model (HMVLM), an end-to-end driving framework that implements the slow branch of a cognitively inspired fast-slow architecture. A fast controller outputs low-level steering, throttle, and brake commands, while a slow planner, a large vision-language model, generates high-level intents such as "yield to pedestrian" or "merge after the truck" without compromising latency. HMVLM introduces three upgrades: (1) selective five-view prompting with an embedded 4 s history of ego kinematics, (2) multi-stage chain-of-thought (CoT) prompting that enforces a Scene Understanding -> Driving Decision -> Trajectory Inference reasoning flow, and (3) spline-based trajectory post-processing that removes late-stage jitter and sharp turns. Trained on the Waymo Open Dataset, HMVLM with these upgrades achieves a Rater Feedback Score (RFS) of 7.7367, securing 2nd place in the 2025 Waymo Vision-based End-to-End (E2E) Driving Challenge and surpassing the public baseline by 2.77%.
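The spline-based post-processing in upgrade (3) can be illustrated with a minimal sketch. The abstract does not say which spline family or parameters HMVLM uses, so the clamped uniform cubic B-spline, the function name `smooth_trajectory`, and the `samples_per_segment` parameter below are all assumptions made for illustration. The predicted waypoints are treated as control points, which pulls the curve inside their convex hull and damps jitter and sharp corners.

```python
def smooth_trajectory(waypoints, samples_per_segment=4):
    """Approximate 2D waypoints with a clamped uniform cubic B-spline.

    Hypothetical helper: the spline type is an assumption, not HMVLM's
    documented method.
    """
    if len(waypoints) < 2:
        return [(float(x), float(y)) for x, y in waypoints]
    # Pad both ends so the curve starts and ends on the original endpoints.
    pts = [waypoints[0]] * 2 + list(waypoints) + [waypoints[-1]] * 2
    smoothed = []
    for i in range(len(pts) - 3):
        p0, p1, p2, p3 = pts[i:i + 4]
        for s in range(samples_per_segment):
            t = s / samples_per_segment
            # Uniform cubic B-spline basis functions (they sum to 1).
            b0 = (1 - t) ** 3 / 6
            b1 = (3 * t**3 - 6 * t**2 + 4) / 6
            b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6
            b3 = t**3 / 6
            x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
            y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
            smoothed.append((x, y))
    # The loop samples t in [0, 1), so add the final endpoint explicitly.
    smoothed.append((float(waypoints[-1][0]), float(waypoints[-1][1])))
    return smoothed


# A jittery path that zig-zags laterally by +/- 0.5 m.
raw = [(0, 0), (1, 0.5), (2, -0.5), (3, 0.5), (4, 0)]
path = smooth_trajectory(raw)
```

Because the curve approximates the waypoints rather than interpolating them, the interior points are not reproduced exactly; an interpolating or smoothing spline with a tunable penalty would be a reasonable alternative when waypoint fidelity matters more.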
Abstract: While end-to-end autonomous driving has advanced significantly, prevailing methods remain fundamentally misaligned with human cognitive principles in both perception and planning. In this paper, we propose CogAD, a novel end-to-end autonomous driving model that emulates the hierarchical cognition mechanisms of human drivers. CogAD implements dual hierarchical mechanisms: global-to-local context processing for human-like perception, and intent-conditioned multi-mode trajectory generation for cognitively inspired planning. The proposed method has three principal advantages: comprehensive environmental understanding through hierarchical perception, robust planning exploration enabled by multi-level planning, and diverse yet reasonable multi-modal trajectory generation facilitated by dual-level uncertainty modeling. Extensive experiments on nuScenes and Bench2Drive demonstrate that CogAD achieves state-of-the-art performance in end-to-end planning, with particular superiority in long-tail scenarios and robust generalization to complex real-world driving conditions.
Abstract: The point pair feature (PPF) is widely used for 6D pose estimation. In this paper, we propose an efficient 6D pose estimation method based on the PPF framework. We introduce a targeted down-sampling strategy that concentrates on edge areas, enabling efficient feature extraction from complex geometry. We also propose a pose hypothesis validation approach that resolves symmetric ambiguity by computing the edge matching degree. We evaluate our method on two challenging datasets and one dataset collected in the real world, demonstrating its superiority in pose estimation of geometrically complex, occluded, and symmetrical objects. We further validate our method by applying it to simulated punctures.
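For readers unfamiliar with the PPF, the sketch below computes the standard four-dimensional feature for a pair of oriented points: the distance between the points, the angle each normal makes with the connecting vector, and the angle between the two normals. This is the commonly used PPF definition, not the paper's edge-focused down-sampling or hypothesis validation, and the function names are chosen here for illustration.

```python
import math


def _angle(u, v):
    """Angle in radians between two 3D vectors (0.0 if either is zero)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos)))


def point_pair_feature(p1, n1, p2, n2):
    """Return (||d||, angle(n1, d), angle(n2, d), angle(n1, n2)), d = p2 - p1."""
    d = tuple(b - a for a, b in zip(p1, p2))
    dist = math.sqrt(sum(c * c for c in d))
    return (dist, _angle(n1, d), _angle(n2, d), _angle(n1, n2))


# Two oriented points; the feature is unchanged by any rigid motion.
f = point_pair_feature((0, 0, 0), (0, 0, 1), (1, 1, 0), (0, 1, 0))
```

The feature depends only on relative geometry, so applying the same rotation and translation to both oriented points leaves it unchanged. That invariance is what allows model and scene point pairs to be matched through a shared lookup table.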
Abstract: Correspondence-based point cloud registration is a cornerstone of robotics perception and computer vision. It seeks to estimate the best rigid transformation aligning two point clouds from a set of putative correspondences. However, because 3D keypoint matching approaches have limited robustness, the correspondences often contain outliers, possibly in large numbers, which makes robust registration methods imperative. Unfortunately, existing robust methods have their own limitations (e.g., high computational cost or limited robustness) at high or extreme outlier ratios, which may make them unsuitable for practical use. In this paper, we present a novel, fast, deterministic, and guaranteed robust solver, named TriVoC (Triple-layered Voting with Consensus maximization), for the robust registration problem. We decompose the selection of minimal 3-point sets into three consecutive layers, and in each layer we design an efficient voting and correspondence sorting framework based on the pairwise equal-length constraint. In this manner, the 3-point sets can be selected independently from the reduced correspondence sets according to the sorted sequence, which significantly lowers the computational cost while providing a strong guarantee of achieving the largest consensus set (as the final inlier set), as long as a probabilistic termination condition is fulfilled. Various experiments show that TriVoC is robust against up to 99% outliers, highly accurate, time-efficient even at extreme outlier ratios, and practical for real-world applications, showing performance superior to other state-of-the-art competitors.
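The pairwise equal-length constraint rests on the fact that a rigid transformation preserves distances: for two inlier correspondences, the distance between the source points should equal the distance between the matched target points, up to noise. The sketch below is a simplified, single-layer illustration of voting and sorting on that constraint, not the TriVoC algorithm itself. The function name, the `noise_bound` parameter, and the `2 * noise_bound` tolerance (which assumes each target point is perturbed by at most `noise_bound`) are assumptions introduced here.

```python
import math
from itertools import combinations


def vote_by_length_consistency(src, dst, noise_bound):
    """Vote for correspondences whose pairwise lengths agree.

    src[i] is matched to dst[i]. Returns (votes, order), where order
    lists correspondence indices from most to least voted.
    """
    n = len(src)
    votes = [0] * n
    for i, j in combinations(range(n), 2):
        len_src = math.dist(src[i], src[j])
        len_dst = math.dist(dst[i], dst[j])
        # Assumed tolerance: each target point moves by at most noise_bound.
        if abs(len_src - len_dst) <= 2 * noise_bound:
            votes[i] += 1
            votes[j] += 1
    # sorted() is stable, so ties keep their original index order.
    order = sorted(range(n), key=lambda k: -votes[k])
    return votes, order


# Four inliers (pure translation by (1, 2, 3)) followed by two outliers.
src = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (2, 2, 2), (3, 1, 0)]
dst = [(1, 2, 3), (2, 2, 3), (1, 3, 3), (1, 2, 4), (10, 10, 10), (-5, 7, 2)]
votes, order = vote_by_length_consistency(src, dst, noise_bound=0.01)
```

In this toy case the inliers collect three votes each and the outliers none, so minimal 3-point sets drawn from the top of the sorted sequence are far more likely to be outlier-free than randomly sampled ones.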