Abstract:While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. To quantify performance, we propose the Quality-Yield Index (QYI), a metric that captures both solution pass rate and quality. Even top models like GPT-o4-mini-high and Gemini-2.5-Pro attain QYI scores of only 0.6, well below the expert baseline of 1. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.
Abstract:On-demand ride-pooling has emerged as a popular urban transportation solution, addressing the efficiency limitations of traditional ride-hailing services by grouping multiple riding requests with spatiotemporal proximity into a single vehicle. Although numerous algorithms have been developed for the Ride-pool Assignment Problem (RAP) -- a core component of ride-pooling systems, there is a lack of open-source implementations, making it difficult to benchmark these algorithms on a common dataset and objective. In this paper, we present the implementation details of a ride-pool simulator that encompasses several key ride-pool assignment algorithms, along with associated components such as vehicle routing and rebalancing. We also open-source a highly optimized and modular C++ codebase, designed to facilitate the extension of new algorithms and features. Additionally, we introduce a family of swapping-based local-search heuristics to enhance existing ride-pool assignment algorithms, achieving a better balance between performance and computational efficiency. Extensive experiments on a large-scale, real-world dataset from Manhattan, NYC reveal that while all selected algorithms perform comparably, the newly proposed Multi-Round Linear Assignment with Cyclic Exchange (LA-MR-CE) algorithm achieves a state-of-the-art service rate with significantly reduced computational time. Furthermore, an in-depth analysis suggests that a performance barrier exists for all myopic ride-pool assignment algorithms due to the system's capacity bottleneck, and incorporating future information could be key to overcoming this limitation.