Benchmarking


ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems

Apr 02, 2026
Via arXiv

PhiNet: Speaker Verification with Phonetic Interpretability

Apr 02, 2026
Via arXiv

Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning

Apr 02, 2026
Via arXiv

NED-Tree: Bridging the Semantic Gap with Nonlinear Element Decomposition Tree for LLM Nonlinear Optimization Modeling

Apr 02, 2026
Via arXiv

PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance

Apr 02, 2026
Via arXiv

Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection

Apr 02, 2026
Via arXiv

Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning

Apr 02, 2026
Via arXiv

A virtual-variable-length method for robust inverse kinematics of multi-segment continuum robots

Apr 02, 2026
Via arXiv

UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models

Apr 02, 2026
Via arXiv

Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models

Apr 02, 2026
Via arXiv