Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Laurence Aitchison

Function-Space Learning Rates

Feb 24, 2025

Edward Milsom, Ben Anson, Laurence Aitchison

Figure 1 for Function-Space Learning Rates

Figure 2 for Function-Space Learning Rates

Figure 3 for Function-Space Learning Rates

Figure 4 for Function-Space Learning Rates

Abstract:We consider layerwise function-space learning rates, which measure the magnitude of the change in a neural network's output function in response to an update to a parameter tensor. This contrasts with traditional learning rates, which describe the magnitude of changes in parameter space. We develop efficient methods to measure and set function-space learning rates in arbitrary neural networks, requiring only minimal computational overhead through a few additional backward passes that can be performed at the start of, or periodically during, training. We demonstrate two key applications: (1) analysing the dynamics of standard neural network optimisers in function space, rather than parameter space, and (2) introducing FLeRM (Function-space Learning Rate Matching), a novel approach to hyperparameter transfer across model scales. FLeRM records function-space learning rates while training a small, cheap base model, then automatically adjusts parameter-space layerwise learning rates when training larger models to maintain consistent function-space updates. FLeRM gives hyperparameter transfer across model width, depth, initialisation scale, and LoRA rank in various architectures including MLPs with residual connections and transformers with different layer normalisation schemes.

* 19 pages

Via

Access Paper or Ask Questions

Sparse Autoencoders Can Interpret Randomly Initialized Transformers

Jan 29, 2025

Thomas Heap, Tim Lawson, Lucy Farnik, Laurence Aitchison

Abstract:Sparse autoencoders (SAEs) are an increasingly popular technique for interpreting the internal representations of transformers. In this paper, we apply SAEs to 'interpret' random transformers, i.e., transformers where the parameters are sampled IID from a Gaussian rather than trained on text data. We find that random and trained transformers produce similarly interpretable SAE latents, and we confirm this finding quantitatively using an open-source auto-interpretability pipeline. Further, we find that SAE quality metrics are broadly similar for random and trained transformers. We find that these results hold across model sizes and layers. We discuss a number of number interesting questions that this work raises for the use of SAEs and auto-interpretability in the context of mechanistic interpretability.

Via

Access Paper or Ask Questions

Why you don't overfit, and don't need Bayes if you only train for one epoch

Nov 19, 2024

Laurence Aitchison

Abstract:Here, we show that in the data-rich setting where you only train on each datapoint once (or equivalently, you only train for one epoch), standard "maximum likelihood" training optimizes the true data generating process (DGP) loss, which is equivalent to the test loss. Further, we show that the Bayesian model average optimizes the same objective, albeit while taking the expectation over uncertainty induced by finite data. As standard maximum likelihood training in the single-epoch setting optimizes the same objective as Bayesian inference, we argue that we do not expect Bayesian inference to offer any advantages in terms of overfitting or calibration in these settings. This explains the diminishing importance of Bayes in areas such as LLMs, which are often trained with one (or very few) epochs.

Via

Access Paper or Ask Questions

Human-inspired Perspectives: A Survey on AI Long-term Memory

Nov 01, 2024

Zihong He, Weizhe Lin, Hao Zheng, Fan Zhang, Matt Jones, Laurence Aitchison, Xuhai Xu, Miao Liu, Per Ola Kristensson, Junxiao Shen

Figure 1 for Human-inspired Perspectives: A Survey on AI Long-term Memory

Figure 2 for Human-inspired Perspectives: A Survey on AI Long-term Memory

Figure 3 for Human-inspired Perspectives: A Survey on AI Long-term Memory

Figure 4 for Human-inspired Perspectives: A Survey on AI Long-term Memory

Abstract:With the rapid advancement of AI systems, their abilities to store, retrieve, and utilize information over the long term - referred to as long-term memory - have become increasingly significant. These capabilities are crucial for enhancing the performance of AI systems across a wide range of tasks. However, there is currently no comprehensive survey that systematically investigates AI's long-term memory capabilities, formulates a theoretical framework, and inspires the development of next-generation AI long-term memory systems. This paper begins by systematically introducing the mechanisms of human long-term memory, then explores AI long-term memory mechanisms, establishing a mapping between the two. Based on the mapping relationships identified, we extend the current cognitive architectures and propose the Cognitive Architecture of Self-Adaptive Long-term Memory (SALM). SALM provides a theoretical framework for the practice of AI long-term memory and holds potential for guiding the creation of next-generation long-term memory driven AI systems. Finally, we delve into the future directions and application prospects of AI long-term memory.

Via

Access Paper or Ask Questions

Stochastic Kernel Regularisation Improves Generalisation in Deep Kernel Machines

Oct 08, 2024

Edward Milsom, Ben Anson, Laurence Aitchison

Figure 1 for Stochastic Kernel Regularisation Improves Generalisation in Deep Kernel Machines

Figure 2 for Stochastic Kernel Regularisation Improves Generalisation in Deep Kernel Machines

Figure 3 for Stochastic Kernel Regularisation Improves Generalisation in Deep Kernel Machines

Figure 4 for Stochastic Kernel Regularisation Improves Generalisation in Deep Kernel Machines

Abstract:Recent work developed convolutional deep kernel machines, achieving 92.7% test accuracy on CIFAR-10 using a ResNet-inspired architecture, which is SOTA for kernel methods. However, this still lags behind neural networks, which easily achieve over 94% test accuracy with similar architectures. In this work we introduce several modifications to improve the convolutional deep kernel machine's generalisation, including stochastic kernel regularisation, which adds noise to the learned Gram matrices during training. The resulting model achieves 94.5% test accuracy on CIFAR-10. This finding has important theoretical and practical implications, as it demonstrates that the ability to perform well on complex tasks like image classification is not unique to neural networks. Instead, other approaches including deep kernel methods can achieve excellent performance on such tasks, as long as they have the capacity to learn representations from data.

* Neurips 2024 Camera Ready Version (without checklist)

Via

Access Paper or Ask Questions

Residual Stream Analysis with Multi-Layer SAEs

Sep 06, 2024

Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison

Figure 1 for Residual Stream Analysis with Multi-Layer SAEs

Figure 2 for Residual Stream Analysis with Multi-Layer SAEs

Figure 3 for Residual Stream Analysis with Multi-Layer SAEs

Figure 4 for Residual Stream Analysis with Multi-Layer SAEs

Abstract:Sparse autoencoders (SAEs) are a promising approach to interpreting the internal representations of transformer language models. However, standard SAEs are trained separately on each transformer layer, making it difficult to use them to study how information flows across layers. To solve this problem, we introduce the multi-layer SAE (MLSAE): a single SAE trained on the residual stream activation vectors from every transformer layer simultaneously. The residual stream is usually understood as preserving information across layers, so we expected to, and did, find individual SAE features that are active at multiple layers. Interestingly, while a single SAE feature is active at different layers for different prompts, for a single prompt, we find that a single feature is far more likely to be active at a single layer. For larger underlying models, we find that the cosine similarities between adjacent layers in the residual stream are higher, so we expect more features to be active at multiple layers. These results show that MLSAEs are a promising method to study information flow in transformers. We release our code to train and analyze MLSAEs at https://github.com/tim-lawson/mlsae.

* 16 pages, 12 figures

Via

Access Paper or Ask Questions

Machine learning emulation of precipitation from km-scale regional climate simulations using a diffusion model

Jul 19, 2024

Henry Addison, Elizabeth Kendon, Suman Ravuri, Laurence Aitchison, Peter AG Watson

Figure 1 for Machine learning emulation of precipitation from km-scale regional climate simulations using a diffusion model

Figure 2 for Machine learning emulation of precipitation from km-scale regional climate simulations using a diffusion model

Figure 3 for Machine learning emulation of precipitation from km-scale regional climate simulations using a diffusion model

Figure 4 for Machine learning emulation of precipitation from km-scale regional climate simulations using a diffusion model

Abstract:High-resolution climate simulations are very valuable for understanding climate change impacts and planning adaptation measures. This has motivated use of regional climate models at sufficiently fine resolution to capture important small-scale atmospheric processes, such as convective storms. However, these regional models have very high computational costs, limiting their applicability. We present CPMGEM, a novel application of a generative machine learning model, a diffusion model, to skilfully emulate precipitation simulations from such a high-resolution model over England and Wales at much lower cost. This emulator enables stochastic generation of high-resolution (8.8km), daily-mean precipitation samples conditioned on coarse-resolution (60km) weather states from a global climate model. The output is fine enough for use in applications such as flood inundation modelling. The emulator produces precipitation predictions with realistic intensities and spatial structures and captures most of the 21st century climate change signal. We show evidence that the emulator has skill for extreme events up to and including 1-in-100 year intensities. Potential applications include producing high-resolution precipitation predictions for large-ensemble climate simulations and downscaling different climate models and climate change scenarios to better sample uncertainty in climate changes at local-scale.

* 29 pages, 11 figures, 3 tables

Via

Access Paper or Ask Questions

Questionable practices in machine learning

Jul 17, 2024

Gavin Leech, Juan J. Vazquez, Misha Yagudin, Niclas Kupper, Laurence Aitchison

Figure 1 for Questionable practices in machine learning

Figure 2 for Questionable practices in machine learning

Figure 3 for Questionable practices in machine learning

Figure 4 for Questionable practices in machine learning

Abstract:Evaluating modern ML models is hard. The strong incentive for researchers and companies to report a state-of-the-art result on some metric often leads to questionable research practices (QRPs): bad practices which fall short of outright research fraud. We describe 43 such practices which can undermine reported results, giving examples where possible. Our list emphasises the evaluation of large language models (LLMs) on public benchmarks. We also discuss "irreproducible research practices", i.e. decisions that make it difficult or impossible for other researchers to reproduce, build on or audit previous research.

Via

Access Paper or Ask Questions

Using Neural Networks for Data Cleaning in Weather Datasets

Jun 21, 2024

Jack R. P. Hanslope, Laurence Aitchison

Abstract:In climate science, we often want to compare across different datasets. Difficulties can arise in doing this due to inevitable mismatches that arise between observational and reanalysis data, or even between different reanalyses. This misalignment can raise problems for any work that seeks to make inferences about one dataset from another. We considered tropical cyclone location as an example task with one dataset providing atmospheric conditions (ERA5) and another providing storm tracks (IBTrACS). We found that while the examples often aligned well, there were a considerable proportion (around 25%) which were not well aligned. We trained a neural network to map from the wind field to the storm location; in this setting misalignment in the datasets appears as "label noise" (i.e. the labelled storm location does not correspond to the underlying wind field). We found that this neural network trained only on the often noisy labels from IBTrACS had a denoising effect, and performed better than the IBTrACS labels themselves, as measured by human preferences. Remarkably, this even held true for training points, on which we might have expected the network to overfit to the IBTrACS predictions.

* 6 pages, 2 figures, ICML 2024 Workshop on Machine Learning for Earth System Modeling

Via

Access Paper or Ask Questions

Instruction Tuning With Loss Over Instructions

May 23, 2024

Zhengyan Shi, Adam X. Yang, Bin Wu, Laurence Aitchison, Emine Yilmaz, Aldo Lipani

Abstract:Instruction tuning plays a crucial role in shaping the outputs of language models (LMs) to desired styles. In this work, we propose a simple yet effective method, Instruction Modelling (IM), which trains LMs by applying a loss function to the instruction and prompt part rather than solely to the output part. Through experiments across 21 diverse benchmarks, we show that, in many scenarios, IM can effectively improve the LM performance on both NLP tasks (e.g., MMLU, TruthfulQA, and HumanEval) and open-ended generation benchmarks (e.g., MT-Bench and AlpacaEval). Remarkably, in the most advantageous case, IM boosts model performance on AlpacaEval 1.0 by over 100%. We identify two key factors influencing the effectiveness of IM: (1) The ratio between instruction length and output length in the training data; and (2) The number of training examples. We observe that IM is especially beneficial when trained on datasets with lengthy instructions paired with brief outputs, or under the Superficial Alignment Hypothesis (SAH) where a small amount of training examples are used for instruction tuning. Further analysis substantiates our hypothesis that the improvement can be attributed to reduced overfitting to instruction tuning datasets. Our work provides practical guidance for instruction tuning LMs, especially in low-resource scenarios.

* Code is available at https://github.com/ZhengxiangShi/InstructionModelling

Via

Access Paper or Ask Questions