Graham Neubig

Carnegie Mellon University

VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

Apr 15, 2025
Via arXiv

Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering

Apr 10, 2025
Via arXiv

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

Apr 09, 2025
Via arXiv

Inducing Programmatic Skills for Agentic Tasks

Apr 09, 2025
Via arXiv

M-Prometheus: A Suite of Open Multilingual LLM Judges

Apr 07, 2025
Via arXiv

Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators

Mar 25, 2025
Via arXiv

Overtrained Language Models Are Harder to Fine-Tune

Mar 24, 2025
Via arXiv

Benchmarking Failures in Tool-Augmented Language Models

Mar 18, 2025
Via arXiv

Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention

Mar 11, 2025
Via arXiv

Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions

Mar 05, 2025
Via arXiv