Jordan Lee Boyd-Graber

DRACULA: Hunting for the Actions Users Want Deep Research Agents to Execute

Apr 26, 2026

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

Apr 23, 2026

Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

Mar 17, 2026

BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks

Feb 05, 2026

Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation

Jun 18, 2025

VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos

May 02, 2025

Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators

Mar 09, 2025

GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration

Feb 27, 2025

Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

Feb 19, 2025

Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL

Feb 18, 2025