Abstract:Bird's-eye-view (BEV) representations derived from multi-camera input have become a central interface for online high-definition (HD) map construction. However, most approaches rely solely on ego-centric supervision, requiring large-scale scene structure to be inferred from incomplete observations, occlusions, and diminishing information density at long range, where perspective effects and spatial sparsity hinder consistent structural reasoning. We introduce Cross-View Supervision (CVS), a representation learning paradigm that transfers geometric and topological priors from an ego-aligned overhead perspective into camera-based BEV encoders. Rather than adding auxiliary semantic losses, CVS aligns representations in a shared BEV feature space and distills globally consistent structural knowledge from a perspective-privileged teacher into the ego-centric backbone. This supervision enhances structural coherence without modifying the inference architecture or requiring overhead input at test time. Experiments on nuScenes using ego-aligned aerial imagery from the AID4AD cross-view extension demonstrate consistent improvements over StreamMapNet while maintaining identical camera-only inference. CVS yields +3.9\,mAP in the standard $60\times30\,\mathrm{m}$ region and +9.9\,mAP in the extended $100\times50\,\mathrm{m}$ setting, corresponding to a 44\% relative gain at long range. These results highlight perspective-privileged structural supervision as a promising training principle for improving BEV representation learning in HD map construction.
Abstract:Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent with the model's own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training, reducing NLI contradiction by up to 42.6% and establishing a strong zero-training baseline. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection into the next stage's input embeddings. These projectors are jointly trained with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct) while updating only approximately 0.5% of parameters. The implicit variant achieves a statistically significant 34% reduction in planning-stage NLI contradiction (bootstrap 95% CIs, p < 0.05) and increases cross-stage entailment by 50%, evaluated with a multilingual NLI classifier to account for mixed-language outputs. Planning language quality also improves (CIDEr +30.3%), but lexical overlap and structural consistency degrade due to the absence of driving-domain pretraining. Since the two variants use different base models, we present them as complementary case studies: explicit context passing provides a strong training-free baseline for surface consistency, while implicit gated projection delivers significant planning-stage semantic gains, suggesting domain adaptation as a plausible next ingredient for full-spectrum improvement.
Abstract:The three-dimensional bin packing problem (3D-BPP) is a longstanding challenge in operations research and logistics. Classical heuristics and constructive methods can generate packings quickly, but often fail to address industrial constraints such as stability, balance, and handling feasibility. Metaheuristics such as genetic algorithms (GAs) provide flexibility and the ability to optimize across multiple objectives; however, pure GA approaches frequently struggle with efficiency, parameter sensitivity, and scalability to industrial order sizes. This gap is especially evident when scaling to real-world pallet dimensions, where even state-of-the-art algorithms often fail to achieve robust, deployable solutions. We propose a KPI-driven GA-based pipeline for industrial 3D-BPP that integrates key performance indicators directly into a multi-objective fitness function. The methodology combines a layer-based chromosome representation with domain-specific operators and constructive heuristics to balance efficiency and feasibility. On the BED-BPP benchmark of 1,500 real-world orders, our Hybrid-GA pipeline consistently outperforms heuristic- and learning-based state-of-the-art methods, achieving up to 35% higher space utilization and 15 to 20% stronger surface support, with lower variance across orders. These improvements come at a modest runtime cost but remain feasible for batch-scale deployment, yielding stable, balanced, and space-efficient packings.
Abstract:Lane detection is a fundamental task in autonomous driving. While the problem is typically formulated as the detection of continuous boundaries, we study the problem of detecting lane boundaries that are sparsely marked by 2D points with many false positives. This problem arises in the Formula Student Driverless (FSD) competition and is challenging due to its inherent ambiguity. Previous methods are inefficient and unable to find long-horizon solutions. We propose a deterministic algorithm called CLC that uses backtracking graph search with a learned likelihood function to overcome these limitations. We impose geometric constraints on the lane candidates to guarantee a geometrically sound lane. Our exhaustive search leads to finding the global optimum in 45% of instances, and the algorithm is overall robust to up to 50% false positives. Our algorithm runs in less than 15 ms on a single CPU core, meeting the low latency requirements of autonomous racing. We extensively evaluate our method on real data and realistic racetrack layouts, and show that it outperforms the state-of-the-art by detecting long lanes over 100 m with few (0.6%) critical failures. This allows our autonomous racecar to drive close to its physical limits on a previously unknown racetrack without being limited by perception. We release our dataset with realistic Formula Student racetracks to enable further research.