Samuel Albanie

Michael Pokorny

How Long Is a Piece of String? A Brief Empirical Analysis of Tokenizers

Jan 16, 2026
Via arXiv

A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks

Jun 09, 2025
Via arXiv

Control Tax: The Price of Keeping AI in Check

Jun 05, 2025
Via arXiv

A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility

Apr 09, 2025
Via arXiv

An Approach to Technical AGI Safety and Security

Apr 02, 2025
Via arXiv

Humanity's Last Exam

Jan 24, 2025
Via arXiv

Beyond Outcomes: Transparent Assessment of LLM Reasoning in Games

Dec 18, 2024
Via arXiv

How to Merge Your Multimodal Models Over Time?

Dec 09, 2024
Via arXiv

ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

Dec 09, 2024
Via arXiv

Active Data Curation Effectively Distills Large-Scale Multimodal Models

Nov 27, 2024
Via arXiv