GPT-4 Turbo


Clinical Validation of Medical-based Large Language Model Chatbots on Ophthalmic Patient Queries with LLM-based Evaluation

Feb 05, 2026
Via arXiv

Can LLMs Do Rocket Science? Exploring the Limits of Complex Reasoning with GTOC 12

Feb 03, 2026
Via arXiv

Hidden in Plain Text: Measuring LLM Deception Quality Against Human Baselines Using Social Deduction Games

Jan 20, 2026
Via arXiv

T3: Benchmarking Sycophancy and Skepticism in Causal Judgment

Jan 13, 2026
Via arXiv

PrivGemo: Privacy-Preserving Dual-Tower Graph Retrieval for Empowering LLM Reasoning with Memory Augmentation

Jan 13, 2026
Via arXiv

LLM Agent Framework for Intelligent Change Analysis in Urban Environment using Remote Sensing Imagery

Jan 06, 2026
Via arXiv

HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense

Jan 07, 2026
Via arXiv

MDToC: Metacognitive Dynamic Tree of Concepts for Boosting Mathematical Problem-Solving of Large Language Models

Dec 21, 2025
Via arXiv

A Self-Improving Architecture for Dynamic Safety in Large Language Models

Nov 10, 2025
Via arXiv

Measuring Epistemic Humility in Multimodal Large Language Models

Sep 11, 2025
Via arXiv