GPT-4


Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance?

Mar 26, 2026
Via arXiv

NeuroVLM-Bench: Evaluation of Vision-Enabled Large Language Models for Clinical Reasoning in Neurological Disorders

Mar 25, 2026
Via arXiv

LLMORPH: Automated Metamorphic Testing of Large Language Models

Mar 24, 2026
Via arXiv

CAPITU: A Benchmark for Evaluating Instruction-Following in Brazilian Portuguese with Literary Context

Mar 23, 2026
Via arXiv

CayleyPy-4: AI-Holography. Towards analogs of holographic string dualities for AI tasks

Mar 23, 2026
Via arXiv

Errors in AI-Assisted Retrieval of Medical Literature: A Comparative Study

Mar 21, 2026
Via arXiv

A Subgoal-driven Framework for Improving Long-Horizon LLM Agents

Mar 20, 2026
Via arXiv

RPMS: Enhancing LLM-Based Embodied Planning through Rule-Augmented Memory Synergy

Mar 18, 2026
Via arXiv

Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning

Mar 18, 2026
Via arXiv

Security Assessment and Mitigation Strategies for Large Language Models: A Comprehensive Defensive Framework

Mar 17, 2026
Via arXiv