
Bill Yuchen Lin


The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism (Jul 15, 2024)

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs (Jun 26, 2024)

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates (Jun 17, 2024)

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences (Jun 16, 2024)

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing (Jun 12, 2024)

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models (Jun 09, 2024)

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild (Jun 07, 2024)

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (May 02, 2024)

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? (Apr 09, 2024)

RewardBench: Evaluating Reward Models for Language Modeling (Mar 20, 2024)