Jiahe Jin

DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research

May 25, 2025
Via arXiv

Efficient Agent Training for Computer Use

May 20, 2025
Via arXiv

Generative AI Act II: Test Time Scaling Drives Cognition Engineering

Apr 21, 2025
Via arXiv

Revisiting 3D LLM Benchmarks: Are We Really Testing 3D Capabilities?

Feb 12, 2025
Via arXiv

PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World

Dec 23, 2024
Via arXiv

BeHonest: Benchmarking Honesty of Large Language Models

Jun 19, 2024
Via arXiv