Hyungjoo Chae

ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions

May 29, 2025

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

May 21, 2025

Rethinking Reward Model Evaluation Through the Lens of Reward Overoptimization

May 19, 2025

Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

Oct 17, 2024

Evaluating Robustness of Reward Models for Mathematical Reasoning

Oct 02, 2024

Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code

Sep 29, 2024

Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics

Jun 20, 2024

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

Jun 09, 2024

Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models

Apr 03, 2024

Evidence-Focused Fact Summarization for Knowledge-Augmented Zero-Shot Question Answering

Mar 05, 2024