Dhruv Kumar

Context Over Content: Exposing Evaluation Faking in Automated Judges

Apr 16, 2026
Via arXiv

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

Apr 16, 2026
Via arXiv

IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages

Apr 15, 2026
Via arXiv

BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection

Apr 13, 2026
Via arXiv

Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

Apr 08, 2026
Via arXiv

LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo

Apr 07, 2026
Via arXiv

LLM-as-a-Judge for Time Series Explanations

Apr 02, 2026
Via arXiv

Infinite Problem Generator: Verifiably Scaling Physics Reasoning Data with Agentic Workflows

Mar 15, 2026
Via arXiv

Trust Regions Sell, But Who's Buying? Overlap Geometry as an Alternative Trust Region for Policy Optimization

Feb 06, 2026
Via arXiv

The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation

Jan 29, 2026
Via arXiv