Michael R. Lyu

CODECRASH: Stress Testing LLM Reasoning under Structural and Semantic Perturbations

Apr 19, 2025

Fact-or-Fair: A Checklist for Behavioral Testing of AI Models on Fairness-Related Queries

Feb 09, 2025

How Should I Build A Benchmark?

Jan 18, 2025

MRWeb: An Exploration of Generating Multi-Page Resource-Aware Web Code from UI Designs

Dec 19, 2024

XRZoo: A Large-Scale and Versatile Dataset of Extended Reality (XR) Applications

Dec 10, 2024

C$^2$LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation

Dec 06, 2024

On the Shortcut Learning in Multilingual Neural Machine Translation

Nov 15, 2024

Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation?

Nov 05, 2024

Enhancing Temporal Modeling of Video LLMs via Time Gating

Oct 08, 2024

Learning to Ask: When LLMs Meet Unclear Instruction

Aug 31, 2024