Jason Wei

Tony

HealthBench: Evaluating Large Language Models Towards Improved Human Health

May 13, 2025
Via arXiv

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Apr 16, 2025
Via arXiv

OpenAI o1 System Card

Dec 21, 2024
Via arXiv

Deliberative Alignment: Reasoning Enables Safer Language Models

Dec 20, 2024
Via arXiv

Measuring short-form factuality in large language models

Nov 07, 2024
Via arXiv

GPT-4o System Card

Oct 25, 2024
Via arXiv

FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

Oct 05, 2023
Via arXiv

Flan-MoE: Scaling Instruction-Finetuned Language Models with Sparse Mixture of Experts

May 24, 2023
Via arXiv

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

May 22, 2023
Via arXiv

Larger language models do in-context learning differently

Mar 08, 2023
Via arXiv