Muhammad Khalifa

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

Jan 21, 2026

Process Reward Models That Think

Apr 23, 2025

MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Apr 13, 2025

If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs

Dec 05, 2024

Retrieval or Global Context Understanding? On Many-Shot In-Context Learning for Long-Context Evaluation

Nov 11, 2024

FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs

Oct 03, 2024

Learning to Reason via Program Generation, Emulation, and Search

May 28, 2024

Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Apr 26, 2024

Source-Aware Training Enables Knowledge Attribution in Language Models

Apr 11, 2024

LitCab: Lightweight Calibration of Language Models on Outputs of Varied Lengths

Oct 30, 2023