Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Stella Biderman

Consent in Crisis: The Rapid Decline of the AI Data Commons

Jul 24, 2024

Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter(+39 more)

Figure 1 for Consent in Crisis: The Rapid Decline of the AI Data Commons

Figure 2 for Consent in Crisis: The Rapid Decline of the AI Data Commons

Figure 3 for Consent in Crisis: The Rapid Decline of the AI Data Commons

Figure 4 for Consent in Crisis: The Rapid Decline of the AI Data Commons

Abstract:General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use preferences are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crises in data consent, for both developers and creators. The foreclosure of much of the open web will impact not only commercial AI, but also non-commercial AI and academic research.

* 41 pages (13 main), 5 figures, 9 tables

Via

Access Paper or Ask Questions

LLM Circuit Analyses Are Consistent Across Training and Scale

Jul 15, 2024

Curt Tigges, Michael Hanna, Qinan Yu, Stella Biderman

Figure 1 for LLM Circuit Analyses Are Consistent Across Training and Scale

Figure 2 for LLM Circuit Analyses Are Consistent Across Training and Scale

Figure 3 for LLM Circuit Analyses Are Consistent Across Training and Scale

Figure 4 for LLM Circuit Analyses Are Consistent Across Training and Scale

Abstract:Most currently deployed large language models (LLMs) undergo continuous training or additional finetuning. By contrast, most research into LLMs' internal mechanisms focuses on models at one snapshot in time (the end of pre-training), raising the question of whether their results generalize to real-world settings. Existing studies of mechanisms over time focus on encoder-only or toy models, which differ significantly from most deployed models. In this study, we track how model mechanisms, operationalized as circuits, emerge and evolve across 300 billion tokens of training in decoder-only LLMs, in models ranging from 70 million to 2.8 billion parameters. We find that task abilities and the functional components that support them emerge consistently at similar token counts across scale. Moreover, although such components may be implemented by different attention heads over time, the overarching algorithm that they implement remains. Surprisingly, both these algorithms and the types of components involved therein can replicate across model scale. These results suggest that circuit analyses conducted on small models at the end of pre-training can provide insights that still apply after additional pre-training and over model scale.

Via

Access Paper or Ask Questions

The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

Jun 26, 2024

Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San(+13 more)

Figure 1 for The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

Figure 2 for The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

Figure 3 for The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

Abstract:Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding, precise and limitation-aware artifact documentation, efficient model training, advance awareness of the environmental impact from training, careful model evaluation of capabilities, risks, and claims, as well as responsible model release, licensing and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list, enabled us to review the AI development ecosystem, revealing what tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.

Via

Access Paper or Ask Questions

Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

Jun 25, 2024

USVSN Sai Prashanth, Alvin Deng, Kyle O'Brien, Jyothir S V, Mohammad Aflah Khan, Jaydeep Borkar, Christopher A. Choquette-Choo, Jacob Ray Fuehne, Stella Biderman, Tracy Ke(+2 more)

Figure 1 for Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

Figure 2 for Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

Figure 3 for Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

Figure 4 for Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

Abstract:Memorization in language models is typically treated as a homogenous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither. We demonstrate the usefulness of our taxonomy by using it to construct a predictive model for memorization. By analyzing dependencies and inspecting the weights of the predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category.

Via

Access Paper or Ask Questions

Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Jun 06, 2024

Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo

Figure 1 for Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Figure 2 for Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Figure 3 for Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Figure 4 for Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Abstract:Predictable behavior from scaling advanced AI systems is an extremely desirable property. Although a well-established literature exists on how pretraining performance scales, the literature on how particular downstream capabilities scale is significantly muddier. In this work, we take a step back and ask: why has predicting specific downstream capabilities with scale remained elusive? While many factors are certainly responsible, we identify a new factor that makes modeling scaling behavior on widely used multiple-choice question-answering benchmarks challenging. Using five model families and twelve well-established multiple-choice benchmarks, we show that downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively degrade the statistical relationship between performance and scale. We then reveal the mechanism causing this degradation: downstream metrics require comparing the correct choice against a small number of specific incorrect choices, meaning accurately predicting downstream capabilities requires predicting not just how probability mass concentrates on the correct choice with scale, but also how probability mass fluctuates on specific incorrect choices with scale. We empirically study how probability mass on the correct choice co-varies with probability mass on incorrect choices with increasing compute, suggesting that scaling laws for incorrect choices might be achievable. Our work also explains why pretraining scaling laws are commonly regarded as more predictable than downstream capabilities and contributes towards establishing scaling-predictable evaluations of frontier AI models.

Via

Access Paper or Ask Questions

Lessons from the Trenches on Reproducible Evaluation of Language Models

May 23, 2024

Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive(+19 more)

Figure 1 for Lessons from the Trenches on Reproducible Evaluation of Language Models

Figure 2 for Lessons from the Trenches on Reproducible Evaluation of Language Models

Figure 3 for Lessons from the Trenches on Reproducible Evaluation of Language Models

Figure 4 for Lessons from the Trenches on Reproducible Evaluation of Language Models

Abstract:Effective evaluation of language models remains an open challenge in NLP. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers. First, we provide an overview of common challenges faced in language model evaluation. Second, we delineate best practices for addressing or lessening the impact of these challenges on research. Third, we present the Language Model Evaluation Harness (lm-eval): an open source library for independent, reproducible, and extensible evaluation of language models that seeks to address these issues. We describe the features of the library as well as case studies in which the library has been used to alleviate these methodological concerns.

Via

Access Paper or Ask Questions

Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence

Apr 10, 2024

Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Xingjian Du, Teddy Ferdinan, Haowen Hou(+18 more)

Figure 1 for Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence

Figure 2 for Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence

Figure 3 for Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence

Figure 4 for Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence

Abstract:We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture. Our architectural design advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism that improve expressivity while maintaining the inference efficiency characteristics of RNNs. We introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokenizer based on greedy matching for enhanced multilinguality. We trained four Eagle models, ranging from 0.46 to 7.5 billion parameters, and two Finch models with 1.6 and 3.1 billion parameters and find that they achieve competitive performance across a wide variety of benchmarks. We release all our models on HuggingFace under the Apache 2.0 license. Models at: https://huggingface.co/RWKV Training code at: https://github.com/RWKV/RWKV-LM Inference code at: https://github.com/RWKV/ChatRWKV Time-parallel training code at: https://github.com/RWKV/RWKV-infctx-trainer

Via

Access Paper or Ask Questions

Holographic Global Convolutional Networks for Long-Range Prediction Tasks in Malware Detection

Mar 23, 2024

Mohammad Mahmudul Alam, Edward Raff, Stella Biderman, Tim Oates, James Holt

Figure 1 for Holographic Global Convolutional Networks for Long-Range Prediction Tasks in Malware Detection

Figure 2 for Holographic Global Convolutional Networks for Long-Range Prediction Tasks in Malware Detection

Figure 3 for Holographic Global Convolutional Networks for Long-Range Prediction Tasks in Malware Detection

Figure 4 for Holographic Global Convolutional Networks for Long-Range Prediction Tasks in Malware Detection

Abstract:Malware detection is an interesting and valuable domain to work in because it has significant real-world impact and unique machine-learning challenges. We investigate existing long-range techniques and benchmarks and find that they're not very suitable in this problem area. In this paper, we introduce Holographic Global Convolutional Networks (HGConv) that utilize the properties of Holographic Reduced Representations (HRR) to encode and decode features from sequence elements. Unlike other global convolutional methods, our method does not require any intricate kernel computation or crafted kernel design. HGConv kernels are defined as simple parameters learned through backpropagation. The proposed method has achieved new SOTA results on Microsoft Malware Classification Challenge, Drebin, and EMBER malware benchmarks. With log-linear complexity in sequence length, the empirical results demonstrate substantially faster run-time by HGConv compared to other methods achieving far more efficient scaling even with sequence length $\geq 100,000$.

* To appear in Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS) 2024, Valencia, Spain

Via

Access Paper or Ask Questions

On the Societal Impact of Open Foundation Models

Feb 27, 2024

Sayash Kapoor, Rishi Bommasani, Kevin Klyman, Shayne Longpre, Ashwin Ramaswami, Peter Cihon, Aspen Hopkins, Kevin Bankston, Stella Biderman, Miranda Bogen(+15 more)

Figure 1 for On the Societal Impact of Open Foundation Models

Figure 2 for On the Societal Impact of Open Foundation Models

Abstract:Foundation models are powerful technologies: how they are released publicly directly shapes their societal impact. In this position paper, we focus on open foundation models, defined here as those with broadly available model weights (e.g. Llama 2, Stable Diffusion XL). We identify five distinctive properties (e.g. greater customizability, poor monitoring) of open foundation models that lead to both their benefits and risks. Open foundation models present significant benefits, with some caveats, that span innovation, competition, the distribution of decision-making power, and transparency. To understand their risks of misuse, we design a risk assessment framework for analyzing their marginal risk. Across several misuse vectors (e.g. cyberattacks, bioweapons), we find that current research is insufficient to effectively characterize the marginal risk of open foundation models relative to pre-existing technologies. The framework helps explain why the marginal risk is low in some cases, clarifies disagreements about misuse risks by revealing that past work has focused on different subsets of the framework with different assumptions, and articulates a way forward for more constructive debate. Overall, our work helps support a more grounded assessment of the societal impact of open foundation models by outlining what research is needed to empirically validate their theoretical benefits and risks.

Via

Access Paper or Ask Questions

KMMLU: Measuring Massive Multitask Language Understanding in Korean

Feb 18, 2024

Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, Taekyoon Choi, Cheonbok Park, Kang Min Yoo, Stella Biderman

Figure 1 for KMMLU: Measuring Massive Multitask Language Understanding in Korean

Figure 2 for KMMLU: Measuring Massive Multitask Language Understanding in Korean

Figure 3 for KMMLU: Measuring Massive Multitask Language Understanding in Korean

Figure 4 for KMMLU: Measuring Massive Multitask Language Understanding in Korean

Abstract:We propose KMMLU, a new Korean benchmark with 35,030 expert-level multiple-choice questions across 45 subjects ranging from humanities to STEM. Unlike previous Korean benchmarks that are translated from existing English benchmarks, KMMLU is collected from original Korean exams, capturing linguistic and cultural aspects of the Korean language. We test 26 publically available and proprietary LLMs, identifying significant room for improvement. The best publicly available model achieves 50.54% on KMMLU, far below the average human performance of 62.6%. This model was primarily trained for English and Chinese, not Korean. Current LLMs tailored to Korean, such as Polyglot-Ko, perform far worse. Surprisingly, even the most capable proprietary LLMs, e.g., GPT-4 and HyperCLOVA X, achieve 59.95% and 53.40%, respectively. This suggests that further work is needed to improve Korean LLMs, and KMMLU offers the right tool to track this progress. We make our dataset publicly available on the Hugging Face Hub and integrate the benchmark into EleutherAI's Language Model Evaluation Harness.

* Under Review

Via

Access Paper or Ask Questions