Xiangru Tang

Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations

May 27, 2025
Via arXiv

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

May 26, 2025
Via arXiv

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

May 21, 2025
Via arXiv

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Mar 31, 2025
Via arXiv

LocAgent: Graph-Guided LLM Agents for Code Localization

Mar 12, 2025
Via arXiv

MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning

Mar 10, 2025
Via arXiv

MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

Mar 03, 2025
Via arXiv

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Jan 21, 2025
Via arXiv

ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning

Jan 11, 2025
Via arXiv

ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain

Nov 23, 2024
Via arXiv