Xiyu Ren

Toward Trustworthy Evaluation of Sustainability Rating Methodologies: A Human-AI Collaborative Framework for Benchmark Dataset Construction

Feb 19, 2026
Via arXiv

MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

May 15, 2025
Via arXiv

ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge Frequency Control and Uncertainty

Dec 28, 2024
Via arXiv