Hangyu Guo

WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics

Mar 11, 2026
Via arXiv

AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

Feb 26, 2026
Via arXiv

PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering

Feb 12, 2026
Via arXiv

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Feb 11, 2026
Via arXiv

R-Align: Enhancing Generative Reward Models through Rationale-Centric Meta-Judging

Feb 06, 2026
Via arXiv

STEP3-VL-10B Technical Report

Jan 15, 2026
Via arXiv

DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models

Apr 25, 2025
Via arXiv

WiS Platform: Enhancing Evaluation of LLM-Based Multi-Agent Systems Through Game-Based Analysis

Dec 04, 2024
Via arXiv

PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

Dec 02, 2024
Via arXiv

Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models

Nov 13, 2024
Via arXiv