Sarah Liaw

FOL-Pretrain: A complexity annotated corpus of first-order logic

May 20, 2025

Isabelle Lee, Sarah Liaw, Dani Yogatama

Abstract: Transformer-based large language models (LLMs) have demonstrated remarkable reasoning capabilities on tasks ranging from coding and solving mathematical problems to commonsense inference. While these tasks vary in complexity, they all require models to integrate and compute over structured information. Despite recent efforts to reverse-engineer LLM behavior through controlled experiments, our understanding of how these models internalize and execute complex algorithms remains limited. Progress has largely been confined to small-scale studies or shallow tasks such as basic arithmetic and grammatical pattern matching. One barrier to deeper understanding is the nature of pretraining data, which is vast, heterogeneous, and often poorly annotated, making it difficult to isolate mechanisms of reasoning. To bridge this gap, we introduce a large-scale, fully open, complexity-annotated dataset of first-order logic reasoning traces, designed to probe and analyze algorithmic reasoning in LLMs. The dataset consists of 3.5 billion tokens, including 8.8 million LLM-augmented, human-annotated examples and 7.5 million synthetically generated examples. Each synthetic example is verifiably correct, produced by a custom automated theorem solver, and accompanied by metadata tracing its algorithmic provenance. We aim to provide a scalable, interpretable artifact for studying how LLMs learn and generalize symbolic reasoning processes, paving the way for more transparent and targeted investigations into the algorithmic capabilities of modern models.

Learning local neighborhoods of non-Gaussian graphical models: A measure transport approach

Mar 18, 2025

Sarah Liaw, Rebecca Morrison, Youssef Marzouk, Ricardo Baptista

Abstract: Identifying the Markov properties or conditional independencies of a collection of random variables is a fundamental task in statistics for modeling and inference. Existing approaches often learn the structure of a probabilistic graphical model, which encodes these dependencies, by assuming that the variables follow a distribution with a simple parametric form. Moreover, the computational cost of many algorithms scales poorly for high-dimensional distributions, as they need to estimate all the edges in the graph simultaneously. In this work, we propose a scalable algorithm to infer the conditional independence relationships of each variable by exploiting the local Markov property. The proposed method, named Localized Sparsity Identification for Non-Gaussian Distributions (L-SING), estimates the graph by using flexible classes of transport maps to represent the conditional distribution for each variable. We show that L-SING includes existing approaches, such as neighborhood selection with Lasso, as a special case. We demonstrate the effectiveness of our algorithm in both Gaussian and non-Gaussian settings by comparing it to existing methods. Lastly, we show the scalability of the proposed approach by applying it to high-dimensional non-Gaussian examples, including a biological dataset with more than 150 variables.

* Accepted at AAAI 2025: 23 pages, 9 figures
