Zijian Chen

A^3: Towards Advertising Aesthetic Assessment

Mar 25, 2026
Via arXiv

UniDial-EvalKit: A Unified Toolkit for Evaluating Multi-Faceted Conversational Abilities

Mar 24, 2026
Via arXiv

InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design in Real Social Systems

Mar 16, 2026
Via arXiv

AgentIR: Reasoning-Aware Retrieval for Deep Research Agents

Mar 05, 2026
Via arXiv

SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond

Mar 02, 2026
Via arXiv

STAR: Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction

Feb 12, 2026
Via arXiv

GTPred: Benchmarking MLLMs for Interpretable Geo-localization and Time-of-capture Prediction

Jan 19, 2026
Via arXiv

KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old?

Jan 13, 2026
Via arXiv

Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models

Nov 12, 2025
Via arXiv

MACEval: A Multi-Agent Continual Evaluation Network for Large Models

Nov 12, 2025
Via arXiv