Bill Yuchen Lin

Shammie

Latent Action Pretraining from Videos

Oct 15, 2024

CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs

Oct 03, 2024

Visual Perception in Text Strings

Oct 02, 2024

HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions

Sep 26, 2024

SimulBench: Evaluating Language Models with Creative Simulation Tasks

Sep 11, 2024

OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation

Jul 26, 2024

The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism

Jul 15, 2024

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Jun 26, 2024

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates

Jun 17, 2024

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences

Jun 16, 2024