
Shayne Longpre

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Jun 05, 2025
Via arXiv

The Leaderboard Illusion

Apr 29, 2025
Via arXiv

International AI Safety Report

Jan 29, 2025
Via arXiv

Towards Best Practices for Open Datasets for LLM Training

Jan 14, 2025
Via arXiv

Bridging the Data Provenance Gap Across Text, Speech and Video

Dec 19, 2024
Via arXiv

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Dec 04, 2024
Via arXiv

A Systematic Review of NeurIPS Dataset Management Practices

Oct 31, 2024
Via arXiv

To Err is AI: A Case Study Informing LLM Flaw Reporting Practices

Oct 15, 2024
Via arXiv

Consent in Crisis: The Rapid Decline of the AI Data Commons

Jul 24, 2024
Via arXiv

The Foundation Model Transparency Index v1.1: May 2024

Jul 17, 2024
Via arXiv