Abstract: The application of Large Language Models (LLMs) for Automated Algorithm Discovery (AAD), particularly for optimisation heuristics, is an emerging field of research. This emergence necessitates robust, standardised benchmarking practices to rigorously evaluate the capabilities and limitations of LLM-driven AAD methods and the resulting generated algorithms, especially given the opacity of their design process and known issues with existing benchmarks. To address this need, we introduce BLADE (Benchmark suite for LLM-driven Automated Design and Evolution), a modular and extensible framework specifically designed for benchmarking LLM-driven AAD methods in a continuous black-box optimisation context. BLADE integrates collections of benchmark problems (including MA-BBOB and SBOX-COST, among others) with instance generators and textual descriptions aimed at capability-focused testing, such as generalisation, specialisation, and information exploitation. It offers flexible experimental setup options and standardised logging for reproducibility and fair comparison, incorporates methods for analysing the AAD process (e.g., Code Evolution Graphs and various visualisation approaches), and facilitates comparison against human-designed baselines through integration with established tools like IOHanalyser and IOHexplainer. BLADE provides an `out-of-the-box' solution to systematically evaluate LLM-driven AAD approaches. The framework is demonstrated through two distinct use cases exploring mutation prompt strategies and function specialisation.
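For orientation, the sketch below illustrates the kind of evaluation loop such a suite standardises (LLM-driven design step, fixed budget, uniform logging). All names are hypothetical placeholders, not BLADE's actual API.

```python
# Hypothetical sketch of a standardised AAD evaluation loop; the names
# below are illustrative placeholders, not BLADE's actual API.
import json
import time

def evaluate_aad_method(generate_algorithm, problems, budget, seed, log_path):
    """Run an LLM-driven AAD method on each problem and log results uniformly."""
    records = []
    for problem in problems:
        algorithm = generate_algorithm(problem.description)  # LLM-driven design step
        best = algorithm.optimize(problem, budget=budget, seed=seed)
        records.append({"problem": problem.name, "seed": seed,
                        "best_fitness": best, "timestamp": time.time()})
    with open(log_path, "w") as f:  # standardised logging for fair comparison
        json.dump(records, f, indent=2)
    return records
```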
Abstract: We study how large language models can be used in combination with evolutionary computation techniques to automatically discover optimization algorithms for the design of photonic structures. Building on the Large Language Model Evolutionary Algorithm (LLaMEA) framework, we introduce structured prompt engineering tailored to multilayer photonic problems such as Bragg mirror design, ellipsometry inverse analysis, and solar-cell antireflection coatings. We systematically explore multiple evolutionary strategies, including (1+1), (1+5), and (2+10), among others, to balance exploration and exploitation. Our experiments show that LLM-generated algorithms, evolved on small-scale problem instances, can match or surpass established methods like quasi-oppositional differential evolution on large-scale, realistic problem instances. Notably, LLaMEA's self-debugging mutation loop, augmented by automatically extracted problem-specific insights, achieves strong anytime performance and reliable convergence across diverse problem scales. This work demonstrates the feasibility of domain-focused LLM prompts and evolutionary approaches in solving optical design tasks, paving the way for rapid, automated photonic inverse design.
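These strategy labels follow the standard (mu + lambda) evolution strategy scheme; a minimal sketch of that selection loop follows, where `llm_mutate` and `evaluate` are assumed stand-ins for LLaMEA's LLM-driven mutation and benchmark scoring, not the framework's actual API.

```python
# Minimal sketch of (mu + lambda) selection over candidate algorithms;
# `llm_mutate` and `evaluate` are assumed placeholders, not LLaMEA's API.
import random

def mu_plus_lambda(population, mu, lam, llm_mutate, evaluate, generations):
    for _ in range(generations):
        parents = sorted(population, key=evaluate)[:mu]  # keep mu best (minimisation)
        offspring = [llm_mutate(random.choice(parents)) for _ in range(lam)]
        population = parents + offspring  # "+" selection: parents survive
    return min(population, key=evaluate)
```

Here (1+1) corresponds to mu=1, lam=1 (greedy, exploitative), while (2+10) keeps two parents and samples ten offspring per generation, trading more exploration for a higher evaluation cost.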
Abstract: Existing end-to-end autonomous driving (AD) algorithms typically follow the Imitation Learning (IL) paradigm, which faces challenges such as causal confusion and the open-loop gap. In this work, we establish a 3DGS-based closed-loop Reinforcement Learning (RL) training paradigm. By leveraging 3DGS techniques, we construct a photorealistic digital replica of the real physical world, enabling the AD policy to extensively explore the state space and learn to handle out-of-distribution scenarios through large-scale trial and error. To enhance safety, we design specialized rewards that guide the policy to effectively respond to safety-critical events and understand real-world causal relationships. For better alignment with human driving behavior, IL is incorporated into RL training as a regularization term. We introduce a closed-loop evaluation benchmark consisting of diverse, previously unseen 3DGS environments. Compared to IL-based methods, the resulting policy, RAD, achieves stronger performance in most closed-loop metrics, notably a 3x lower collision rate. Abundant closed-loop results are presented at https://hgao-cv.github.io/RAD.
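The abstract does not spell out the combined objective; one common way to write such an IL-regularized RL loss, assuming a behavior-cloning term over human demonstrations $\mathcal{D}_{\mathrm{human}}$ weighted by $\lambda$ (not necessarily RAD's exact formulation), is:

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{RL}}(\theta)
\;+\; \lambda \, \mathbb{E}_{(s,a)\sim\mathcal{D}_{\mathrm{human}}}
\bigl[ -\log \pi_\theta(a \mid s) \bigr].
```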
Abstract: Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures without requiring pre-trained RNN-based LLMs or vision encoders. We propose a seeding strategy to carve Mamba layers from the trained Transformer and a three-stage distillation recipe, which can effectively transfer knowledge from the Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures that combine Transformer and Mamba layers for customizable efficiency-performance trade-offs. Distilled from the Transformer-based decoder-only HoVLE, mmMamba-linear achieves competitive performance against existing linear and quadratic-complexity VLMs, while mmMamba-hybrid further improves performance significantly, approaching HoVLE's capabilities. At 103K tokens, mmMamba-linear demonstrates a 20.6$\times$ speedup and 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves a 13.5$\times$ speedup and 60.2% memory savings. Code and models are released at https://github.com/hustvl/mmMamba
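As a rough illustration of what one stage of such a distillation recipe can look like, the sketch below combines feature matching against the frozen teacher with temperature-scaled logit matching; the loss terms, weights, and staging are assumptions, not mmMamba's published recipe.

```python
# Illustrative single distillation step (feature + logit matching);
# the exact losses and three-stage schedule of mmMamba are not specified here.
import torch
import torch.nn.functional as F

def distill_step(teacher_hidden, student_hidden, teacher_logits, student_logits,
                 alpha=1.0, beta=1.0, tau=2.0):
    hidden_loss = F.mse_loss(student_hidden, teacher_hidden)  # match activations
    kd_loss = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                       F.softmax(teacher_logits / tau, dim=-1),
                       reduction="batchmean") * tau * tau      # match logits
    return alpha * hidden_loss + beta * kd_loss
```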
Abstract: Next-generation wireless networks are conceived to provide reliable and high-data-rate communication services for diverse scenarios, such as vehicle-to-vehicle, unmanned aerial vehicle, and satellite networks. The severe Doppler spreads in the underlying time-varying channels induce destructive inter-carrier interference (ICI) in the extensively adopted orthogonal frequency division multiplexing (OFDM) waveform, leading to severe performance degradation. This calls for a new air interface design that can accommodate the severe delay-Doppler spreads in highly dynamic channels while possessing sufficient flexibility to cater to various applications. This article provides a comprehensive overview of a promising chirp-based waveform named affine frequency division multiplexing (AFDM). It features two tunable parameters and achieves the optimal diversity order in doubly dispersive channels (DDCs). We study the fundamental principle of AFDM, illustrating its intrinsic suitability for DDCs. On this basis, several potential applications of AFDM are explored. Furthermore, the major challenges of AFDM and their corresponding solutions are presented, followed by several future research directions. Finally, we draw instructive conclusions about AFDM, aiming to inspire its further development.
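In the AFDM literature, the transmitted signal is commonly written as the inverse discrete affine Fourier transform of the data symbols $x[m]$, with $c_1$ and $c_2$ being the two tunable chirp parameters referenced above (sign and normalization conventions vary between papers):

```latex
s[n] \;=\; \frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} x[m] \,
e^{\, j 2\pi \left( c_1 n^2 + \frac{mn}{N} + c_2 m^2 \right)},
\qquad n = 0, \dots, N-1.
```

Tuning $c_1$ to the channel's maximum Doppler shift is what lets the effective delay-Doppler channel response separate in the DAFT domain, which underlies the optimal diversity claim.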
Abstract: The integration of Large Language Models (LLMs) with evolutionary computation (EC) has introduced a promising paradigm for automating the design of metaheuristic algorithms. However, existing frameworks, such as the Large Language Model Evolutionary Algorithm (LLaMEA), often lack precise control over mutation mechanisms, leading to inefficiencies in solution space exploration and potentially suboptimal convergence. This paper introduces a novel approach to mutation control within LLM-driven evolutionary frameworks, inspired by the theory of genetic algorithms. Specifically, we propose dynamic mutation prompts that adaptively regulate mutation rates, leveraging a heavy-tailed power-law distribution to balance exploration and exploitation. Experiments using GPT-3.5-turbo and GPT-4o models demonstrate that GPT-3.5-turbo fails to adhere to the specific mutation instructions, while GPT-4o is able to adapt its mutation behaviour based on the engineered dynamic prompts. Further experiments show that introducing these dynamic rates can improve the convergence speed and adaptability of LLaMEA when using GPT-4o. This work sets the starting point for better-controlled LLM-based mutations in code optimization tasks, paving the way for further advancements in automated metaheuristic design.
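Heavy-tailed mutation rates of this kind are typically drawn from a discrete power law, as in the "fast GA" literature; the sketch below shows such a sampler, while the way the sampled rate is phrased into the LLM prompt is an assumption, not the paper's exact template.

```python
# Sample a mutation strength k from a heavy-tailed power law P(k) ~ k^(-beta):
# mostly small, local edits, with occasional large, exploratory jumps.
import random

def sample_mutation_rate(n, beta=1.5):
    """Draw k in {1, ..., n // 2} with probability proportional to k^(-beta)."""
    ks = list(range(1, n // 2 + 1))
    weights = [k ** (-beta) for k in ks]
    return random.choices(ks, weights=weights, k=1)[0]

# Hypothetical use in a dynamic mutation prompt:
# rate = sample_mutation_rate(n=20)
# prompt = f"Refine the algorithm by changing roughly {rate} of its 20 components."
```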
Abstract: Recently, the diffusion model has emerged as a powerful generative technique for robotic policy learning, capable of modeling multi-mode action distributions. Leveraging its capability for end-to-end autonomous driving is a promising direction. However, the numerous denoising steps in the robotic diffusion policy and the more dynamic, open-world nature of traffic scenes pose substantial challenges for generating diverse driving actions at real-time speed. To address these challenges, we propose a novel truncated diffusion policy that incorporates prior multi-mode anchors and truncates the diffusion schedule, enabling the model to learn denoising from an anchored Gaussian distribution to the multi-mode driving action distribution. Additionally, we design an efficient cascade diffusion decoder for enhanced interaction with conditional scene context. The proposed model, DiffusionDrive, demonstrates a 10$\times$ reduction in denoising steps compared to the vanilla diffusion policy, delivering superior diversity and quality in just 2 steps. On the planning-oriented NAVSIM dataset, with the aligned ResNet-34 backbone, DiffusionDrive achieves 88.1 PDMS without bells and whistles, setting a new record, while running at a real-time speed of 45 FPS on an NVIDIA 4090. Qualitative results on challenging scenarios further confirm that DiffusionDrive can robustly generate diverse plausible driving actions. Code and model will be available at https://github.com/hustvl/DiffusionDrive.
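Conceptually, truncation means denoising starts from Gaussians centred on the prior anchors rather than from pure noise, so only a few steps are needed; the sketch below captures this idea, with `denoiser`, `sigma`, and the schedule as placeholders rather than DiffusionDrive's actual implementation.

```python
# Conceptual sketch of truncated diffusion sampling from multi-mode anchors;
# `denoiser` and the noise schedule are placeholders, not the released model.
import torch

def truncated_sample(denoiser, anchors, scene_context, sigma=0.5, num_steps=2):
    x = anchors + sigma * torch.randn_like(anchors)  # anchored Gaussian init
    for t in reversed(range(num_steps)):             # truncated schedule (few steps)
        x = denoiser(x, t, scene_context)            # predict less-noisy actions
    return x                                         # diverse candidate trajectories
```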
Abstract: This study addresses the challenge of fleet design optimization in the context of heterogeneous multi-robot fleets, aiming to obtain feasible designs that balance performance and costs. In the domain of autonomous multi-robot exploration, reinforcement learning agents play a central role, offering adaptability to complex terrains and facilitating collaboration among robots. However, modifying the fleet composition results in changes in the learned behavior, and training multi-robot systems using multi-agent reinforcement learning is expensive. Therefore, an exhaustive evaluation of each potential fleet design is infeasible. To tackle these hurdles, we introduce Bayesian Optimization for Fleet Design (BOFD), a framework leveraging multi-objective Bayesian Optimization to explore fleets on the Pareto front of performance and cost while accounting for uncertainty in the design space. Moreover, we establish a sub-linear bound for cumulative regret, supporting BOFD's robustness and efficacy. Extensive benchmark experiments in synthetic and simulated environments demonstrate the superiority of our framework over state-of-the-art methods, achieving efficient fleet designs with minimal fleet evaluations.
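A minimal sketch of the multi-objective Bayesian optimization loop underlying such a framework is given below, using Gaussian process surrogates for performance and cost and a random scalarization as a stand-in for BOFD's actual acquisition function.

```python
# Sketch of one multi-objective BO proposal step; the scalarized UCB-style
# acquisition is an assumption, not BOFD's published acquisition function.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def propose_fleet(designs_tried, perf, cost, candidates, rng):
    gp_perf = GaussianProcessRegressor().fit(designs_tried, perf)
    gp_cost = GaussianProcessRegressor().fit(designs_tried, cost)
    w = rng.uniform()                                   # random scalarization weight
    mu_p, sd_p = gp_perf.predict(candidates, return_std=True)
    mu_c, sd_c = gp_cost.predict(candidates, return_std=True)
    # Optimistic scalarized score: favor high performance and low cost.
    score = w * (mu_p + sd_p) - (1 - w) * (mu_c - sd_c)
    return candidates[np.argmax(score)]
```

Sweeping the weight across iterations traces out designs along the performance-cost Pareto front while the GP uncertainty keeps the number of expensive fleet evaluations small.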
Abstract: The recent development of online static map element (a.k.a. HD map) construction algorithms has raised a vast demand for data with ground-truth annotations. However, available public datasets currently cannot provide high-quality training data in terms of consistency and accuracy. For instance, the manually labelled (and labour-intensive) nuScenes dataset still contains misalignments and inconsistencies between the HD maps and images (e.g., around 8.03 pixels of reprojection error on average). To this end, we present CAMAv2: a vision-centric approach for Consistent and Accurate Map Annotation. Without LiDAR inputs, our proposed framework can still generate high-quality 3D annotations of static map elements. Specifically, the annotations achieve high reprojection accuracy across all surrounding cameras and are spatio-temporally consistent across the whole sequence. We apply our proposed framework to the popular nuScenes dataset to provide efficient and highly accurate annotations. Compared with the original nuScenes static map elements, our CAMAv2 annotations achieve lower reprojection errors (e.g., 4.96 vs. 8.03 pixels). Models trained with annotations from CAMAv2 also achieve lower reprojection errors (e.g., 5.62 vs. 8.43 pixels).
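The pixel figures above are reprojection errors; the sketch below shows the standard way this metric is computed, projecting a 3D map point into a camera with intrinsics K and pose (R, t) and measuring the distance to the annotated 2D location (a generic formulation, not CAMAv2's exact pipeline).

```python
# Standard pinhole reprojection error in pixels for a single annotated point.
import numpy as np

def reprojection_error(point_3d, pixel_2d, K, R, t):
    p_cam = R @ point_3d + t              # world -> camera coordinates
    uvw = K @ p_cam                       # camera -> image plane (homogeneous)
    uv = uvw[:2] / uvw[2]                 # perspective division
    return np.linalg.norm(uv - pixel_2d)  # distance to the annotation, in pixels
```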
Abstract: With the increasing demand for multi-carrier communication in high-mobility scenarios, there is an urgent need to design new multi-carrier waveforms that can resist large delay-Doppler spreads. Various transform-domain multi-carrier waveforms have been proposed for fast time-varying channels, including orthogonal time frequency space (OTFS), orthogonal chirp division multiplexing (OCDM), and affine frequency division multiplexing (AFDM). Among these, AFDM is a strong candidate owing to its low implementation complexity and ability to achieve optimal diversity. This paper unifies the waveforms based on the discrete affine Fourier transform (DAFT) through the chirp slope factor "k" in the time-frequency representation, yielding a common design framework for high-mobility communications. This framework is then employed to verify that the bit error rate performance of DAFT-based waveforms can be enhanced at sufficiently high signal-to-noise ratio (SNR) by adjusting the chirp slope factor "k".
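For reference, one common form of the N-point DAFT underlying these waveforms is shown below; its chirp parameters $c_1$ and $c_2$ determine the chirp slope, and (up to sign and phase conventions) $c_1 = c_2 = 0$ recovers the DFT used by OFDM, while $c_1 = c_2 = \frac{1}{2N}$ yields an OCDM-like Fresnel transform. The exact mapping of these parameters to the paper's slope factor "k" is an assumption here.

```latex
X[m] \;=\; \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x[n] \,
e^{-j 2\pi \left( c_1 n^2 + \frac{mn}{N} + c_2 m^2 \right)},
\qquad m = 0, \dots, N-1.
```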