Dominik Scheinert

Distributed LLM Pretraining During Renewable Curtailment Windows: A Feasibility Study

Feb 26, 2026

Philipp Wiesner, Soeren Becker, Brett Cornick, Dominik Scheinert, Alexander Acker, Odej Kao

Abstract: Training large language models (LLMs) requires substantial compute and energy. At the same time, renewable energy sources regularly produce more electricity than the grid can absorb, leading to curtailment, the deliberate reduction of clean generation that would otherwise go to waste. These periods represent an opportunity: if training is aligned with curtailment windows, LLMs can be pretrained using electricity that is both clean and cheap. This technical report presents a system that performs full-parameter LLM training across geo-distributed GPU clusters during regional curtailment windows, elastically switching between local single-site training and federated multi-site synchronization as sites become available or unavailable. Our prototype trains a 561M-parameter transformer model across three clusters using the Flower federated learning framework, with curtailment periods derived from real-world marginal carbon intensity traces. Preliminary results show that curtailment-aware scheduling preserves training quality while reducing operational emissions to 5-12% of single-site baselines.

* Technical report

What happens when nanochat meets DiLoCo?

Nov 14, 2025

Alexander Acker, Soeren Becker, Sasho Nedelkoski, Dominik Scheinert, Odej Kao, Philipp Wiesner

Abstract: Although LLM training is typically centralized with high-bandwidth interconnects and large compute budgets, emerging methods target communication-constrained training in distributed environments. The model trade-offs introduced by this shift remain underexplored, and our goal is to study them. We use the open-source nanochat project, a compact 8K-line full-stack ChatGPT-like implementation containing tokenization, pretraining, fine-tuning, and serving, as a controlled baseline. We implement the DiLoCo algorithm as a lightweight wrapper over nanochat's training loop, performing multiple local steps per worker before synchronization with an outer optimizer, effectively reducing communication by orders of magnitude. This inner-outer training is compared against a standard data-parallel (DDP) setup. Because nanochat is small and inspectable, it enables controlled pipeline adaptations and allows direct comparison with the conventional centralized baseline. DiLoCo achieves stable convergence and competitive loss in pretraining but yields worse MMLU, GSM8K, and HumanEval scores after mid-training and SFT. We discover that using DiLoCo-pretrained weights and running mid- and post-training with DDP fails to recover performance, revealing irreversible representation drift from asynchronous updates that impairs downstream alignment. We provide this implementation as an official fork of nanochat on GitHub.

* 8 pages, 3 figures, technical report
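
The inner-outer scheme described in this abstract (several local steps per worker, then one synchronization through an outer optimizer) can be illustrated with a toy sketch. This is not the authors' nanochat fork. It is a minimal pure-Python illustration under stated assumptions: parameters are a flat list of floats, the inner optimizer is plain SGD, and the outer optimizer is SGD with heavy-ball momentum, all simplifications chosen here for brevity. The names `diloco_round` and `run_demo` are hypothetical.

```python
from typing import Callable, List, Tuple

Params = List[float]
GradFn = Callable[[Params], Params]


def diloco_round(
    global_params: Params,
    grad_fns: List[GradFn],      # one gradient function per worker (its local data)
    velocity: Params,            # outer-optimizer momentum state
    inner_steps: int = 5,
    inner_lr: float = 0.1,
    outer_lr: float = 0.7,
    momentum: float = 0.5,
) -> Tuple[Params, Params]:
    """One communication round: local training, then a single outer update."""
    deltas = []
    for grad_fn in grad_fns:
        # Each worker starts from the shared parameters and trains locally
        # without communicating (assumption: plain SGD as inner optimizer).
        local = list(global_params)
        for _ in range(inner_steps):
            grads = grad_fn(local)
            local = [p - inner_lr * g for p, g in zip(local, grads)]
        # The "pseudo-gradient" is the displacement produced by local training.
        deltas.append([gp - lp for gp, lp in zip(global_params, local)])

    # Synchronization: average the pseudo-gradients across workers.
    n = len(deltas)
    pseudo_grad = [sum(d[i] for d in deltas) / n for i in range(len(global_params))]

    # Outer update (assumption: heavy-ball momentum SGD as outer optimizer).
    new_velocity = [momentum * v + g for v, g in zip(velocity, pseudo_grad)]
    new_params = [p - outer_lr * v for p, v in zip(global_params, new_velocity)]
    return new_params, new_velocity


def run_demo(rounds: int = 50, inner_steps: int = 5) -> Tuple[Params, int]:
    """Two workers with toy losses (p - 1)^2 and (p - 3)^2; the joint optimum is 2."""
    grad_fns: List[GradFn] = [
        lambda p: [2.0 * (p[0] - 1.0)],
        lambda p: [2.0 * (p[0] - 3.0)],
    ]
    params, velocity = [0.0], [0.0]
    syncs = 0
    for _ in range(rounds):
        params, velocity = diloco_round(params, grad_fns, velocity, inner_steps=inner_steps)
        syncs += 1  # one synchronization per round, not per local step
    return params, syncs
```

In this toy setup each worker takes `rounds * inner_steps` local steps but synchronizes only `rounds` times, whereas a per-step data-parallel baseline would synchronize on every step. The hyperparameters are arbitrary illustrative values, not settings from the report.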

Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics

Aug 22, 2023

Dominik Scheinert, Philipp Wiesner, Thorsten Wittkopp, Lauritz Thamsen, Jonathan Will, Odej Kao

Abstract: Selecting the right resources for big data analytics jobs is hard because of the wide variety of configuration options like machine type and cluster size. As poor choices can have a significant impact on resource efficiency, cost, and energy usage, automated approaches are gaining popularity. Most existing methods rely on profiling recurring workloads to find near-optimal solutions over time. Due to the cold-start problem, this often leads to lengthy and costly profiling phases. However, big data analytics jobs across users can share many common properties: they often operate on similar infrastructure, using similar algorithms implemented in similar frameworks. The potential in sharing aggregated profiling runs to collaboratively address the cold-start problem is largely unexplored. We present Karasu, an approach to more efficient resource configuration profiling that promotes data sharing among users working with similar infrastructures, frameworks, algorithms, or datasets. Karasu trains lightweight performance models using aggregated runtime information of collaborators and combines them into an ensemble method to exploit inherent knowledge of the configuration search space. Moreover, Karasu allows the optimization of multiple objectives simultaneously. Our evaluation is based on performance data from diverse workload executions in a public cloud environment. We show that Karasu is able to significantly boost existing methods in terms of performance, search time, and cost, even when few comparable profiling runs are available that share only partial common characteristics with the target job.

* 10 pages, 9 figures

PULL: Reactive Log Anomaly Detection Based On Iterative PU Learning

Jan 25, 2023

Thorsten Wittkopp, Dominik Scheinert, Philipp Wiesner, Alexander Acker, Odej Kao

Abstract: Due to the complexity of modern IT services, failures can be manifold, occur at any stage, and are hard to detect. For this reason, anomaly detection applied to monitoring data such as logs allows gaining relevant insights to improve IT services steadily and eradicate failures. However, existing anomaly detection methods that provide high accuracy often rely on labeled training data, which are time-consuming to obtain in practice. Therefore, we propose PULL, an iterative log analysis method for reactive anomaly detection based on estimated failure time windows provided by monitoring systems instead of labeled data. Our attention-based model uses a novel objective function for weak supervision deep learning that accounts for imbalanced data and applies an iterative learning strategy for positive and unknown samples (PU learning) to identify anomalous logs. Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets and detects anomalous log messages with an F1-score of more than 0.99 even within imprecise failure time windows.

* published in the proceedings of the 56th Hawaii International Conference on System Sciences (HICSS 2023)

Probabilistic Time Series Forecasting for Adaptive Monitoring in Edge Computing Environments

Nov 24, 2022

Dominik Scheinert, Babak Sistani Zadeh Aghdam, Soeren Becker, Odej Kao, Lauritz Thamsen

Abstract: With more and more computation being shifted to the edge of the network, monitoring of critical infrastructures, such as intermediate processing nodes in autonomous driving, is further complicated by the typically resource-constrained environments. In order to reduce the resource overhead on the network link imposed by monitoring, various methods have been discussed that either follow a filtering approach for data-emitting devices or conduct dynamic sampling based on employed prediction models. Still, existing methods mainly require adaptive monitoring on edge devices, which demands device reconfigurations, utilizes additional resources, and limits the sophistication of employed models. In this paper, we propose a sampling-based and cloud-located approach that internally utilizes probabilistic forecasts and hence provides means of quantifying model uncertainties, which can be used for contextualized adaptations of sampling frequencies and consequently relieve constrained network resources. We evaluate our prototype implementation for the monitoring pipeline on a publicly available streaming dataset and demonstrate its positive impact on resource efficiency in a method comparison.

* 6 pages, 5 figures, 2 tables

Perona: Robust Infrastructure Fingerprinting for Resource-Efficient Big Data Analytics

Nov 15, 2022

Dominik Scheinert, Soeren Becker, Jonathan Bader, Lauritz Thamsen, Jonathan Will, Odej Kao

Abstract: Choosing a good resource configuration for big data analytics applications can be challenging, especially in cloud environments. Automated approaches are desirable as poor decisions can reduce performance and raise costs. The majority of existing automated approaches either build performance models from previous workload executions or conduct iterative resource configuration profiling until a near-optimal solution has been found. In doing so, they only obtain an implicit understanding of the underlying infrastructure, which is difficult to transfer to alternative infrastructures and, thus, profiling and modeling insights are not sustained beyond very specific situations. We present Perona, a novel approach to robust infrastructure fingerprinting for usage in the context of big data analytics. Perona employs common sets and configurations of benchmarking tools for target resources, so that resulting benchmark metrics are directly comparable and ranking is enabled. Insignificant benchmark metrics are discarded by learning a low-dimensional representation of the input metric vector, and previous benchmark executions are taken into consideration for context-awareness as well, which allows detecting resource degradation. We evaluate our approach both on data gathered from our own experiments as well as within related works for resource configuration optimization, demonstrating that Perona captures the characteristics from benchmark runs in a compact manner and produces representations that can be used directly.

* 8 pages, 5 figures, 3 tables

Magpie: Automatically Tuning Static Parameters for Distributed File Systems using Deep Reinforcement Learning

Jul 22, 2022

Houkun Zhu, Dominik Scheinert, Lauritz Thamsen, Kordian Gontarska, Odej Kao

Abstract: Distributed file systems are widely used nowadays, yet using their default configurations is often not optimal. At the same time, tuning configuration parameters is typically challenging and time-consuming. It demands expertise, and tuning operations can also be expensive. This is especially the case for static parameters, where changes take effect only after a restart of the system or workloads. We propose a novel approach, Magpie, which utilizes deep reinforcement learning to tune static parameters by strategically exploring and exploiting configuration parameter spaces. To boost the tuning of the static parameters, our method employs both server and client metrics of distributed file systems to understand the relationship between static parameters and performance. Our empirical evaluation results show that Magpie can noticeably improve the performance of the distributed file system Lustre, where our approach on average achieves 91.8% throughput gains against the default configuration after tuning towards the optimization of a single performance indicator, while it reaches 39.7% more throughput gains against the baseline.

* Accepted at the IEEE International Conference on Cloud Engineering (IC2E) 2022

A Taxonomy of Anomalies in Log Data

Nov 26, 2021

Thorsten Wittkopp, Philipp Wiesner, Dominik Scheinert, Odej Kao

Abstract: Log data anomaly detection is a core component in the area of artificial intelligence for IT operations. However, the large number of existing methods makes it hard to choose the right approach for a specific system. A better understanding of different kinds of anomalies, and which algorithms are suitable for detecting them, would support researchers and IT operators. Although a common taxonomy for anomalies already exists, it has not yet been applied specifically to log data, pointing out the characteristics and peculiarities in this domain. In this paper, we present a taxonomy for different kinds of log data anomalies and introduce a method for analyzing such anomalies in labeled datasets. We applied our taxonomy to the three common benchmark datasets Thunderbird, Spirit, and BGL, and trained five state-of-the-art unsupervised anomaly detection algorithms to evaluate their performance in detecting different kinds of anomalies. Our results show that the most common anomaly type is also the easiest to predict. Moreover, deep learning-based approaches outperform data mining-based approaches in all anomaly types, but especially when it comes to detecting contextual anomalies.

* Paper accepted and presented at AIOPS workshop 2021 co-located with ICSOC 2021

LogLAB: Attention-Based Labeling of Log Data Anomalies via Weak Supervision

Nov 25, 2021

Thorsten Wittkopp, Philipp Wiesner, Dominik Scheinert, Alexander Acker

Abstract: With increasing scale and complexity of cloud operations, automated detection of anomalies in monitoring data such as logs will be an essential part of managing future IT infrastructures. However, many methods based on artificial intelligence, such as supervised deep learning models, require large amounts of labeled training data to perform well. In practice, this data is rarely available because labeling log data is expensive, time-consuming, and requires a deep understanding of the underlying system. We present LogLAB, a novel modeling approach for automated labeling of log messages without requiring manual work by experts. Our method relies on estimated failure time windows provided by monitoring systems to produce precise labeled datasets in retrospect. It is based on the attention mechanism and uses a custom objective function for weak supervision deep learning techniques that accounts for imbalanced data. Our evaluation shows that LogLAB consistently outperforms nine benchmark approaches across three different datasets and maintains an F1-score of more than 0.98 even at large failure time windows.

* 19th International Conference on Service-Oriented Computing, 2021, 700-707
* Paper accepted at ICSOC 2021 and published by Springer

On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds

Nov 16, 2021

Dominik Scheinert, Alireza Alamgiralem, Jonathan Bader, Jonathan Will, Thorsten Wittkopp, Lauritz Thamsen

Abstract: With the growing amount of data, data processing workloads and the management of their resource usage become increasingly important. Since managing a dedicated infrastructure is in many situations infeasible or uneconomical, users progressively execute their respective workloads in the cloud. As the configuration of workloads and resources is often challenging, various methods have been proposed that either quickly profile towards a good configuration or determine one based on data from previous runs. Still, performance data to train such methods is often lacking and costly to collect. In this paper, we propose a collaborative approach for sharing anonymized workload execution traces among users, mining them for general patterns, and exploiting clusters of historical workloads for future optimizations. We evaluate our prototype implementation for mining workload execution graphs on a publicly available trace dataset and demonstrate the predictive value of workload clusters determined through traces only.

* 6 pages, 5 figures, 1 table
