Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Siegfried Handschuh

University of St.Gallen

Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder

Apr 10, 2026

Götz-Henrik Wiegand, Lorena Raichle, Rico Städeli, Tomas Hrycej, Bernhard Bermeitinger, Siegfried Handschuh

Abstract:Training Transformer language models is expensive, as performance typically improves with increasing dataset size and computational budget. Although scaling laws describe this trend at large scale, their implications in controlled, smaller-scale settings remain less explored. In this work, we isolate dataset-size effects using a strongly reduced attention-only decoder architecture. By training on progressively larger power-of-two subsets, we observe smooth performance improvements accompanied by clear diminishing returns, consistent with scaling-law behavior. Using only about 30% of the training data is sufficient to reach approximately 90% of the full-data validation token-level accuracy. These results provide actionable insights into dataset scaling in a controlled, component-isolated setting and offer practical guidance for balancing dataset size and computational cost in compute- and data-restricted environments, such as small research labs and exploratory model development.

* Presented as a paper at 3rd DATA-FM workshop @ ICLR 2026, Brazil. Published at 13th IEEE Swiss Conference on Data Science and AI (SDS 2026)

Via

Access Paper or Ask Questions

Efficient Neural Network Training via Subset Pretraining

Oct 21, 2024

Jan Spörer, Bernhard Bermeitinger, Tomas Hrycej, Niklas Limacher, Siegfried Handschuh

Figure 1 for Efficient Neural Network Training via Subset Pretraining

Figure 2 for Efficient Neural Network Training via Subset Pretraining

Figure 3 for Efficient Neural Network Training via Subset Pretraining

Figure 4 for Efficient Neural Network Training via Subset Pretraining

Abstract:In training neural networks, it is common practice to use partial gradients computed over batches, mostly very small subsets of the training set. This approach is motivated by the argument that such a partial gradient is close to the true one, with precision growing only with the square root of the batch size. A theoretical justification is with the help of stochastic approximation theory. However, the conditions for the validity of this theory are not satisfied in the usual learning rate schedules. Batch processing is also difficult to combine with efficient second-order optimization methods. This proposal is based on another hypothesis: the loss minimum of the training set can be expected to be well-approximated by the minima of its subsets. Such subset minima can be computed in a fraction of the time necessary for optimizing over the whole training set. This hypothesis has been tested with the help of the MNIST, CIFAR-10, and CIFAR-100 image classification benchmarks, optionally extended by training data augmentation. The experiments have confirmed that results equivalent to conventional training can be reached. In summary, even small subsets are representative if the overdetermination ratio for the given model parameter set sufficiently exceeds unity. The computing expense can be reduced to a tenth or less.

* To appear in KDIR 2024

Via

Access Paper or Ask Questions

Reducing the Transformer Architecture to a Minimum

Oct 17, 2024

Bernhard Bermeitinger, Tomas Hrycej, Massimo Pavone, Julianus Kath, Siegfried Handschuh

Figure 1 for Reducing the Transformer Architecture to a Minimum

Figure 2 for Reducing the Transformer Architecture to a Minimum

Figure 3 for Reducing the Transformer Architecture to a Minimum

Figure 4 for Reducing the Transformer Architecture to a Minimum

Abstract:Transformers are a widespread and successful model architecture, particularly in Natural Language Processing (NLP) and Computer Vision (CV). The essential innovation of this architecture is the Attention Mechanism, which solves the problem of extracting relevant context information from long sequences in NLP and realistic scenes in CV. A classical neural network component, a Multi-Layer Perceptron (MLP), complements the attention mechanism. Its necessity is frequently justified by its capability of modeling nonlinear relationships. However, the attention mechanism itself is nonlinear through its internal use of similarity measures. A possible hypothesis is that this nonlinearity is sufficient for modeling typical application problems. As the MLPs usually contain the most trainable parameters of the whole model, their omission would substantially reduce the parameter set size. Further components can also be reorganized to reduce the number of parameters. Under some conditions, query and key matrices can be collapsed into a single matrix of the same size. The same is true about value and projection matrices, which can also be omitted without eliminating the substance of the attention mechanism. Initially, the similarity measure was defined asymmetrically, with peculiar properties such as that a token is possibly dissimilar to itself. A possible symmetric definition requires only half of the parameters. We have laid the groundwork by testing widespread CV benchmarks: MNIST and CIFAR-10. The tests have shown that simplified transformer architectures (a) without MLP, (b) with collapsed matrices, and (c) symmetric similarity matrices exhibit similar performance as the original architecture, saving up to 90% of parameters without hurting the classification performance.

* 8 pages, to appear in KDIR2024

Via

Access Paper or Ask Questions

Make Deep Networks Shallow Again

Sep 15, 2023

Bernhard Bermeitinger, Tomas Hrycej, Siegfried Handschuh

Figure 1 for Make Deep Networks Shallow Again

Figure 2 for Make Deep Networks Shallow Again

Figure 3 for Make Deep Networks Shallow Again

Figure 4 for Make Deep Networks Shallow Again

Abstract:Deep neural networks have a good success record and are thus viewed as the best architecture choice for complex applications. Their main shortcoming has been, for a long time, the vanishing gradient which prevented the numerical optimization algorithms from acceptable convergence. A breakthrough has been achieved by the concept of residual connections -- an identity mapping parallel to a conventional layer. This concept is applicable to stacks of layers of the same dimension and substantially alleviates the vanishing gradient problem. A stack of residual connection layers can be expressed as an expansion of terms similar to the Taylor expansion. This expansion suggests the possibility of truncating the higher-order terms and receiving an architecture consisting of a single broad layer composed of all initially stacked layers in parallel. In other words, a sequential deep architecture is substituted by a parallel shallow one. Prompted by this theory, we investigated the performance capabilities of the parallel architecture in comparison to the sequential one. The computer vision datasets MNIST and CIFAR10 were used to train both architectures for a total of 6912 combinations of varying numbers of convolutional layers, numbers of filters, kernel sizes, and other meta parameters. Our findings demonstrate a surprising equivalence between the deep (sequential) and shallow (parallel) architectures. Both layouts produced similar results in terms of training and validation set loss. This discovery implies that a wide, shallow architecture can potentially replace a deep network without sacrificing performance. Such substitution has the potential to simplify network architectures, improve optimization efficiency, and accelerate the training process.

* to be published at KDIR2023, Rome

Via

Access Paper or Ask Questions

Discourse-Aware Text Simplification: From Complex Sentences to Linked Propositions

Aug 01, 2023

Christina Niklaus, Matthias Cetto, André Freitas, Siegfried Handschuh

Figure 1 for Discourse-Aware Text Simplification: From Complex Sentences to Linked Propositions

Figure 2 for Discourse-Aware Text Simplification: From Complex Sentences to Linked Propositions

Figure 3 for Discourse-Aware Text Simplification: From Complex Sentences to Linked Propositions

Figure 4 for Discourse-Aware Text Simplification: From Complex Sentences to Linked Propositions

Abstract:Sentences that present a complex syntax act as a major stumbling block for downstream Natural Language Processing applications whose predictive quality deteriorates with sentence length and complexity. The task of Text Simplification (TS) may remedy this situation. It aims to modify sentences in order to make them easier to process, using a set of rewriting operations, such as reordering, deletion, or splitting. State-of-the-art syntactic TS approaches suffer from two major drawbacks: first, they follow a very conservative approach in that they tend to retain the input rather than transforming it, and second, they ignore the cohesive nature of texts, where context spread across clauses or sentences is needed to infer the true meaning of a statement. To address these problems, we present a discourse-aware TS approach that splits and rephrases complex English sentences within the semantic context in which they occur. Based on a linguistically grounded transformation stage that uses clausal and phrasal disembedding mechanisms, complex sentences are transformed into shorter utterances with a simple canonical structure that can be easily analyzed by downstream applications. With sentence splitting, we thus address a TS task that has hardly been explored so far. Moreover, we introduce the notion of minimality in this context, as we aim to decompose source sentences into a set of self-contained minimal semantic units. To avoid breaking down the input into a disjointed sequence of statements that is difficult to interpret because important contextual information is missing, we incorporate the semantic context between the split propositions in the form of hierarchical structures and semantic relationships. In that way, we generate a semantic hierarchy of minimal propositions that leads to a novel representation of complex assertions that puts a semantic layer on top of the simplified sentences.

Via

Access Paper or Ask Questions

Analyzing FOMC Minutes: Accuracy and Constraints of Language Models

Apr 20, 2023

Wonseong Kim, Jan Frederic Spörer, Siegfried Handschuh

Figure 1 for Analyzing FOMC Minutes: Accuracy and Constraints of Language Models

Figure 2 for Analyzing FOMC Minutes: Accuracy and Constraints of Language Models

Figure 3 for Analyzing FOMC Minutes: Accuracy and Constraints of Language Models

Figure 4 for Analyzing FOMC Minutes: Accuracy and Constraints of Language Models

Abstract:This research article analyzes the language used in the official statements released by the Federal Open Market Committee (FOMC) after its scheduled meetings to gain insights into the impact of FOMC official statements on financial markets and economic forecasting. The study reveals that the FOMC is careful to avoid expressing emotion in their sentences and follows a set of templates to cover economic situations. The analysis employs advanced language modeling techniques such as VADER and FinBERT, and a trial test with GPT-4. The results show that FinBERT outperforms other techniques in predicting negative sentiment accurately. However, the study also highlights the challenges and limitations of using current NLP techniques to analyze FOMC texts and suggests the potential for enhancing language models and exploring alternative approaches.

* 15pages, 4 figures, 5 tables

Via

Access Paper or Ask Questions

Number of Attention Heads vs Number of Transformer-Encoders in Computer Vision

Sep 15, 2022

Tomas Hrycej, Bernhard Bermeitinger, Siegfried Handschuh

Figure 1 for Number of Attention Heads vs Number of Transformer-Encoders in Computer Vision

Figure 2 for Number of Attention Heads vs Number of Transformer-Encoders in Computer Vision

Figure 3 for Number of Attention Heads vs Number of Transformer-Encoders in Computer Vision

Figure 4 for Number of Attention Heads vs Number of Transformer-Encoders in Computer Vision

Abstract:Determining an appropriate number of attention heads on one hand and the number of transformer-encoders, on the other hand, is an important choice for Computer Vision (CV) tasks using the Transformer architecture. Computing experiments confirmed the expectation that the total number of parameters has to satisfy the condition of overdetermination (i.e., number of constraints significantly exceeding the number of parameters). Then, good generalization performance can be expected. This sets the boundaries within which the number of heads and the number of transformers can be chosen. If the role of context in images to be classified can be assumed to be small, it is favorable to use multiple transformers with a low number of heads (such as one or two). In classifying objects whose class may heavily depend on the context within the image (i.e., the meaning of a patch being dependent on other patches), the number of heads is equally important as that of transformers.

Via

Access Paper or Ask Questions

Training Neural Networks in Single vs Double Precision

Sep 15, 2022

Tomas Hrycej, Bernhard Bermeitinger, Siegfried Handschuh

Figure 1 for Training Neural Networks in Single vs Double Precision

Figure 2 for Training Neural Networks in Single vs Double Precision

Figure 3 for Training Neural Networks in Single vs Double Precision

Abstract:The commitment to single-precision floating-point arithmetic is widespread in the deep learning community. To evaluate whether this commitment is justified, the influence of computing precision (single and double precision) on the optimization performance of the Conjugate Gradient (CG) method (a second-order optimization algorithm) and RMSprop (a first-order algorithm) has been investigated. Tests of neural networks with one to five fully connected hidden layers and moderate or strong nonlinearity with up to 4 million network parameters have been optimized for Mean Square Error (MSE). The training tasks have been set up so that their MSE minimum was known to be zero. Computing experiments have disclosed that single-precision can keep up (with superlinear convergence) with double-precision as long as line search finds an improvement. First-order methods such as RMSprop do not benefit from double precision. However, for moderately nonlinear tasks, CG is clearly superior. For strongly nonlinear tasks, both algorithm classes find only solutions fairly poor in terms of mean square error as related to the output variance. CG with double floating-point precision is superior whenever the solutions have the potential to be useful for the application goal.

Via

Access Paper or Ask Questions

Uncovering More Shallow Heuristics: Probing the Natural Language Inference Capacities of Transformer-Based Pre-Trained Language Models Using Syllogistic Patterns

Jan 19, 2022

Reto Gubelmann, Siegfried Handschuh

Figure 1 for Uncovering More Shallow Heuristics: Probing the Natural Language Inference Capacities of Transformer-Based Pre-Trained Language Models Using Syllogistic Patterns

Figure 2 for Uncovering More Shallow Heuristics: Probing the Natural Language Inference Capacities of Transformer-Based Pre-Trained Language Models Using Syllogistic Patterns

Figure 3 for Uncovering More Shallow Heuristics: Probing the Natural Language Inference Capacities of Transformer-Based Pre-Trained Language Models Using Syllogistic Patterns

Figure 4 for Uncovering More Shallow Heuristics: Probing the Natural Language Inference Capacities of Transformer-Based Pre-Trained Language Models Using Syllogistic Patterns

Abstract:In this article, we explore the shallow heuristics used by transformer-based pre-trained language models (PLMs) that are fine-tuned for natural language inference (NLI). To do so, we construct or own dataset based on syllogistic, and we evaluate a number of models' performance on our dataset. We find evidence that the models rely heavily on certain shallow heuristics, picking up on symmetries and asymmetries between premise and hypothesis. We suggest that the lack of generalization observable in our study, which is becoming a topic of lively debate in the field, means that the PLMs are currently not learning NLI, but rather spurious heuristics.

Via

Access Paper or Ask Questions

Exploring the Promises of Transformer-Based LMs for the Representation of Normative Claims in the Legal Domain

Aug 25, 2021

Reto Gubelmann, Peter Hongler, Siegfried Handschuh

Figure 1 for Exploring the Promises of Transformer-Based LMs for the Representation of Normative Claims in the Legal Domain

Figure 2 for Exploring the Promises of Transformer-Based LMs for the Representation of Normative Claims in the Legal Domain

Figure 3 for Exploring the Promises of Transformer-Based LMs for the Representation of Normative Claims in the Legal Domain

Figure 4 for Exploring the Promises of Transformer-Based LMs for the Representation of Normative Claims in the Legal Domain

Abstract:In this article, we explore the potential of transformer-based language models (LMs) to correctly represent normative statements in the legal domain, taking tax law as our use case. In our experiment, we use a variety of LMs as bases for both word- and sentence-based clusterers that are then evaluated on a small, expert-compiled test-set, consisting of real-world samples from tax law research literature that can be clearly assigned to one of four normative theories. The results of the experiment show that clusterers based on sentence-BERT-embeddings deliver the most promising results. Based on this main experiment, we make first attempts at using the best performing models in a bootstrapping loop to build classifiers that map normative claims on one of these four normative theories.

* 11 pages, 3 figures

Via

Access Paper or Ask Questions