William Schuler

Frequency Explains the Inverse Correlation of Large Language Models' Size, Training Data Amount, and Surprisal's Fit to Reading Times

Feb 03, 2024
Byung-Doh Oh, Shisen Yue, William Schuler

Recent studies have shown that as Transformer-based language models become larger and are trained on very large amounts of data, the fit of their surprisal estimates to naturalistic human reading times degrades. The current work presents a series of analyses showing that word frequency is a key explanatory factor underlying these two trends. First, residual errors from four language model families on four corpora show that the inverse correlation between model size and fit to reading times is the strongest on the subset of least frequent words, which is driven by excessively accurate predictions of larger model variants. Additionally, training dynamics reveal that during later training steps, all model variants learn to predict rare words and that larger model variants do so more accurately, which explains the detrimental effect of both training data amount and model size on fit to reading times. Finally, a feature attribution analysis demonstrates that larger model variants are able to accurately predict rare words based on both an effectively longer context window size as well as stronger local associations compared to smaller model variants. Taken together, these results indicate that Transformer-based language models' surprisal estimates diverge from human-like expectations due to the superhumanly complex associations they learn for predicting rare words.

* EACL 2024
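
Surprisal, the quantity that this abstract and several below relate to reading times, is the negative log probability of a word given its preceding context. The following is a minimal sketch of that definition only; the function name and the probabilities are hypothetical illustrations, not values or code from any of the listed papers.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Return surprisal in bits: -log2 P(word | context)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


# Hypothetical conditional probabilities (assumed for illustration only).
# A rarer, less predictable word receives a higher surprisal.
example_probs = {"the": 0.5, "cat": 0.125, "zygote": 2 ** -10}

for word, prob in example_probs.items():
    print(f"{word}: {surprisal_bits(prob):.1f} bits")
```

A model that predicts a rare word more accurately assigns it a higher probability and therefore a lower surprisal, which is the sense in which the abstract describes larger models' predictions of rare words as excessively accurate.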

Token-wise Decomposition of Autoregressive Language Model Hidden States for Analyzing Model Predictions

May 17, 2023
Byung-Doh Oh, William Schuler

While there is much recent interest in studying why Transformer-based large language models make predictions the way they do, the complex computations performed within each layer have traditionally posed a strong bottleneck. To mitigate this shortcoming, this work presents a linear decomposition of final hidden states from autoregressive language models based on each initial input token, which is exact for virtually all contemporary Transformer architectures. This decomposition allows the definition of probability distributions that ablate the contribution of specific input tokens, which can be used to analyze their influence on model probabilities over a sequence of upcoming words with only one forward pass from the model. Using the change in next-word probability as a measure of importance, this work first examines which context words make the biggest contribution to language model predictions. Regression experiments suggest that Transformer-based language models rely primarily on collocational associations, followed by linguistic factors such as syntactic dependencies and coreference relationships in making next-word predictions. Additionally, analyses using these measures to predict syntactic dependencies and coreferent mention spans show that collocational association and repetitions of the same token, respectively, largely explain the language model's predictions on the tasks.

* ACL 2023

Transformer-Based LM Surprisal Predicts Human Reading Times Best with About Two Billion Training Tokens

Apr 22, 2023
Byung-Doh Oh, William Schuler

Recent psycholinguistic studies have drawn conflicting conclusions about the relationship between the quality of a language model and the ability of its surprisal estimates to predict human reading times, which has been speculated to be due to the large gap in both the amount of training data and model capacity across studies. The current work aims to consolidate these findings by evaluating surprisal estimates from Transformer-based language model variants that vary systematically in the amount of training data and model capacity on their ability to predict human reading times. The results show that surprisal estimates from most variants with contemporary model capacities provide the best fit after seeing about two billion training tokens, after which they begin to diverge from humanlike expectations. Additionally, newly-trained smaller model variants reveal a 'tipping point' at convergence, after which the decrease in language model perplexity begins to result in poorer fits to human reading times. These results suggest that the massive amount of training data is mainly responsible for the poorer fit achieved by surprisal from larger pre-trained language models, and that a certain degree of model capacity is necessary for Transformer-based language models to capture humanlike expectations.

Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?

Dec 23, 2022
Byung-Doh Oh, William Schuler

This work presents a detailed linguistic analysis into why larger Transformer-based pre-trained language models with more parameters and lower perplexity nonetheless yield surprisal estimates that are less predictive of human reading times. First, regression analyses show a strictly monotonic, positive log-linear relationship between perplexity and fit to reading times for the more recently released five GPT-Neo variants and eight OPT variants on two separate datasets, replicating earlier results limited to just GPT-2 (Oh et al., 2022). Subsequently, analysis of residual errors reveals a systematic deviation of the larger variants, such as underpredicting reading times of named entities and making compensatory overpredictions for reading times of function words such as modals and conjunctions. These results suggest that the propensity of larger Transformer-based models to 'memorize' sequences during training makes their surprisal estimates diverge from humanlike expectations, which warrants caution in using pre-trained language models to study human language processing.

* Transactions of the Association for Computational Linguistics (pre-MIT Press publication version)

Entropy- and Distance-Based Predictors From GPT-2 Attention Patterns Predict Reading Times Over and Above GPT-2 Surprisal

Dec 21, 2022
Byung-Doh Oh, William Schuler

Transformer-based large language models are trained to make predictions about the next word by aggregating representations of previous tokens through their self-attention mechanism. In the field of cognitive modeling, such attention patterns have recently been interpreted as embodying the process of cue-based retrieval, in which attention over multiple targets is taken to generate interference and latency during retrieval. Under this framework, this work first defines an entropy-based predictor that quantifies the diffuseness of self-attention, as well as distance-based predictors that capture the incremental change in attention patterns across timesteps. Moreover, following recent studies that question the informativeness of attention weights, we also experiment with alternative methods for incorporating vector norms into attention weights. Regression experiments using predictors calculated from the GPT-2 language model show that these predictors deliver a substantially better fit to held-out self-paced reading and eye-tracking data over a rigorous baseline including GPT-2 surprisal. Additionally, the distance-based predictors generally demonstrated higher predictive power, with effect sizes of up to 6.59 ms per standard deviation on self-paced reading times (compared to 2.82 ms for surprisal) and 1.05 ms per standard deviation on eye-gaze durations (compared to 3.81 ms for surprisal).

* EMNLP 2022
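
The abstract above describes an entropy-based predictor for the diffuseness of self-attention and distance-based predictors for the change in attention across timesteps. The sketch below illustrates the general idea on plain lists of attention weights. It is an assumption-laden illustration: the function names are hypothetical, the use of Manhattan distance and zero-padding for the newly added token are choices made here, and the paper's exact definitions (including its norm-based reweighting) may differ.

```python
import math
from typing import Sequence


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (bits) of attention weights over previous tokens.

    Higher values mean attention is spread more diffusely.
    """
    total = sum(weights)
    probs = [w / total for w in weights]  # renormalize defensively
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def attention_distance(prev: Sequence[float], curr: Sequence[float]) -> float:
    """Manhattan distance between attention at consecutive timesteps.

    The earlier timestep attends over fewer tokens, so it is padded with
    zeros (an assumption of this sketch) to match the later one.
    """
    if len(curr) < len(prev):
        raise ValueError("curr must cover at least as many tokens as prev")
    padded = list(prev) + [0.0] * (len(curr) - len(prev))
    return sum(abs(a - b) for a, b in zip(padded, curr))


# Hypothetical attention weights for one head at two consecutive timesteps.
step_t = [0.5, 0.5]
step_t_plus_1 = [0.25, 0.25, 0.5]

print(attention_entropy(step_t_plus_1))           # diffuseness at t+1
print(attention_distance(step_t, step_t_plus_1))  # shift from t to t+1
```

In a regression setting, per-word values like these would be entered as predictors alongside surprisal; how they are aggregated across heads and layers is left open here.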

A Deep Learning Approach to Analyzing Continuous-Time Systems

Sep 25, 2022
Cory Shain, William Schuler

Scientists often use observational time series data to study complex natural processes, from climate change to civil conflict to brain activity. But regression analyses of these data often assume simplistic dynamics. Recent advances in deep learning have yielded startling improvements to the performance of models of complex processes, from speech comprehension to nuclear physics to competitive gaming. But deep learning is generally not used for scientific analysis. Here, we bridge this gap by showing that deep learning can be used, not just to imitate, but to analyze complex processes, providing flexible function approximation while preserving interpretability. Our approach -- the continuous-time deconvolutional regressive neural network (CDRNN) -- relaxes standard simplifying assumptions (e.g., linearity, stationarity, and homoscedasticity) that are implausible for many natural systems and may critically affect the interpretation of data. We evaluate CDRNNs on incremental human language processing, a domain with complex continuous dynamics. We demonstrate dramatic improvements to predictive likelihood in behavioral and neuroimaging data, and we show that CDRNNs enable flexible discovery of novel patterns in exploratory analyses, provide robust control of possible confounds in confirmatory analyses, and open up research questions that are otherwise hard to study using observational data.

* Main article: 11 pages, 1 table, 3 figures; Supplementary Information: 51 pages, 11 tables, 28 figures

The Importance of Category Labels in Grammar Induction with Child-directed Utterances

Jun 20, 2020
Lifeng Jin, William Schuler

Recent progress in grammar induction has shown that grammar induction is possible without explicit assumptions of language-specific knowledge. However, evaluation of induced grammars usually has ignored phrasal labels, an essential part of a grammar. Experiments in this work using a labeled evaluation metric, RH, show that linguistically motivated predictions about grammar sparsity and use of categories can only be revealed through labeled evaluation. Furthermore, depth-bounding as an implementation of human memory constraints in grammar inducers is still effective with labeled evaluation on multilingual transcribed child-directed utterances.

* The 16th International Conference on Parsing Technologies (IWPT 2020)

Depth-bounding is effective: Improvements and evaluation of unsupervised PCFG induction

Sep 10, 2018
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William Schuler, Lane Schwartz

There have been several recent attempts to improve the accuracy of grammar induction systems by bounding the recursive complexity of the induction model (Ponvert et al., 2011; Noji and Johnson, 2016; Shain et al., 2016; Jin et al., 2018). Modern depth-bounded grammar inducers have been shown to be more accurate than early unbounded PCFG inducers, but this technique has never been compared against unbounded induction within the same system, in part because most previous depth-bounding models are built around sequence models, the complexity of which grows exponentially with the maximum allowed depth. The present work instead applies depth bounds within a chart-based Bayesian PCFG inducer (Johnson et al., 2007b), where bounding can be switched on and off, and then samples trees with and without bounding. Results show that depth-bounding is indeed significantly effective in limiting the search space of the inducer and thereby increasing the accuracy of the resulting parsing model. Moreover, parsing results on English, Chinese and German show that this bounded model with a new inference technique is able to produce parse trees more accurately than or competitively with state-of-the-art constituency-based grammar induction models.

* EMNLP 2018

Unsupervised Grammar Induction with Depth-bounded PCFG

Feb 26, 2018
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William Schuler, Lane Schwartz

There has been recent interest in applying cognitively or empirically motivated bounds on recursion depth to limit the search space of grammar induction models (Ponvert et al., 2011; Noji and Johnson, 2016; Shain et al., 2016). This work extends this depth-bounding approach to probabilistic context-free grammar induction (DB-PCFG), which has a smaller parameter space than hierarchical sequence models, and therefore more fully exploits the space reductions of depth-bounding. Results for this model on grammar acquisition from transcribed child-directed speech and newswire text exceed or are competitive with those of other models when evaluated on parse accuracy. Moreover, grammars acquired from this model demonstrate a consistent use of category labels, something which has not been demonstrated by other acquisition models.

* Accepted by Transactions of the Association for Computational Linguistics

Interleaved semantic interpretation in environment-based parsing

Jun 18, 2002
William Schuler

This paper extends a polynomial-time parsing algorithm that resolves structural ambiguity in input to a speech-based user interface by calculating and comparing the denotations of rival constituents, given some model of the interfaced application environment (Schuler 2001). The algorithm is extended to incorporate a full set of logical operators, including quantifiers and conjunctions, into this calculation without increasing the complexity of the overall algorithm beyond polynomial time, both in terms of the length of the input and the number of entities in the environment model.

* Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002)
