
Dieuwke Hupkes


The Llama 4 Herd: Architecture, Training, Evaluation, and Deployment Notes

Jan 15, 2026

MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages

Apr 15, 2025

Compute Optimal Scaling of Skills: Knowledge vs Reasoning

Mar 13, 2025

Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks

Feb 24, 2025

MLGym: A New Framework and Benchmark for Advancing AI Research Agents

Feb 20, 2025

Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models

Nov 21, 2024

Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?

Nov 06, 2024

The Llama 3 Herd of Models

Jul 31, 2024

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Jun 18, 2024

Quantifying Variance in Evaluation Benchmarks

Jun 14, 2024