Ofir Press

VideoGameBench: Can Vision-Language Models complete popular video games?

May 23, 2025

SWE-smith: Scaling Data for Software Engineering Agents

Apr 30, 2025

SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

Oct 04, 2024

EnIGMA: Enhanced Interactive Generative Model Agent for CTF Challenges

Sep 24, 2024

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Jul 22, 2024

SciCode: A Research Coding Benchmark Curated by Scientists

Jul 18, 2024

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Oct 10, 2023

How Language Model Hallucinations Can Snowball

May 22, 2023

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Nov 09, 2022

What Language Model to Train if You Have One Million GPU Hours?

Nov 08, 2022