Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

Mar 27, 2026

Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents

Mar 27, 2026

Can AI Scientist Agents Learn from Lab-in-the-Loop Feedback? Evidence from Iterative Perturbation Discovery

Mar 27, 2026

A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

Mar 27, 2026

VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation

Mar 27, 2026

MobileDev-Bench: A Comprehensive Benchmark for Evaluating Language Models on Mobile Application Development

Mar 26, 2026

Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy

Mar 26, 2026

Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

Mar 26, 2026

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Mar 25, 2026

BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents

Mar 25, 2026