Shafiq Joty

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators

Apr 21, 2025

A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems

Apr 12, 2025

ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering

Apr 10, 2025

Adaptation of Large Language Models

Apr 04, 2025

Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings

Mar 19, 2025

Multi$^2$: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing

Feb 27, 2025

Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding

Feb 17, 2025

Demystifying Domain-adaptive Post-training for Financial LLMs

Jan 09, 2025

StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs

Dec 23, 2024

Preference Optimization for Reasoning with Pseudo Feedback

Nov 25, 2024