John Yang

OpenThoughts: Data Recipes for Reasoning Models

Jun 05, 2025
Via arXiv

LongCodeBench: Evaluating Coding LLMs at 1M Context Windows

May 12, 2025
Via arXiv

SWE-smith: Scaling Data for Software Engineering Agents

Apr 30, 2025
Via arXiv

MMTEB: Massive Multilingual Text Embedding Benchmark

Feb 19, 2025
Via arXiv

Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration

Dec 20, 2024
Via arXiv

SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

Oct 04, 2024
Via arXiv

EnIGMA: Enhanced Interactive Generative Model Agent for CTF Challenges

Sep 24, 2024
Via arXiv

ReduceFormer: Attention with Tensor Reduction by Summation

Jun 11, 2024
Via arXiv

DevBench: A Comprehensive Benchmark for Software Development

Mar 15, 2024
Via arXiv

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Oct 10, 2023
Via arXiv