Abstract: Modern deployments of Large Language Models (LLMs) increasingly require serving multiple models with diverse architectures, sizes, and specializations on shared, heterogeneous hardware. This setting introduces new challenges for resource allocation, dispatching, and scheduling, particularly under GPU memory constraints where partial CPU-GPU offloading and preemption become necessary. While existing systems primarily optimize throughput for a single model, comparatively little work addresses multi-model scheduling under these conditions. In this paper, we present an empirical study of how different LLMs behave across hardware platforms, focusing on the performance implications of layer offloading and preemption. We show that offloading leads to strongly non-linear and model-dependent degradation in decode throughput, with smaller models exhibiting sharper sensitivity to reduced GPU residency. We further demonstrate that preemption incurs substantial overhead, largely dominated by model state reload rather than key-value cache transfer, and that this cost varies significantly across models and hardware platforms. Additionally, we highlight the role of sequence length and interconnect bandwidth in amplifying data movement and execution inefficiencies. Based on these findings, we identify a set of key features that future schedulers must consider, including model-specific offloading sensitivity, workload characteristics, and the cost structure of preemption and data transfer. These insights provide guidance for the design of next-generation LLM serving systems capable of efficiently managing heterogeneous, multi-model workloads with hybrid CPU-GPU execution.
Abstract: Data collection in an IoT environment requires simple and effective communication solutions that address resource constraints and ensure network efficiency while achieving scalability. Efficiency is evaluated in terms of the timeliness of collected data (Age of Information), the energy spent per delivered unit of data, and the effectiveness of spectrum utilization. This paper addresses an adaptive random multiple access system in which a large number of devices send sporadic messages in a non-periodic pattern. In particular, our analysis highlights the potential of Successive Interference Cancellation and identifies an adaptive parameter setting that maximizes its benefits as the level of contention on the shared channel varies. An analytical model is defined that scales easily with the number of nodes and yields all the relevant metrics. The accuracy of the model is validated by comparing its predictions against simulations. The model is then used to assess the trade-off between Age of Information and energy consumption, revealing a sharp relationship between the two. The considered approach lends itself to many generalizations and to applications in massive machine-type communications and IoT networks.
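As an illustration of the timeliness metric named above, the following is a minimal sketch of how a time-averaged Age of Information could be computed from a log of updates. It is not the analytical model of the paper. The function name `average_aoi`, the `(generated_at, delivered_at)` log format, and the assumption that the receiver holds data generated at time 0 are all hypothetical choices made for this example.

```python
def average_aoi(updates, horizon):
    """Time-averaged Age of Information over [0, horizon].

    updates: iterable of (generated_at, delivered_at) pairs (hypothetical format).
    The age at time t is t minus the generation time of the freshest
    update delivered so far, so it grows with slope 1 between deliveries.
    """
    area = 0.0
    last_time = 0.0
    latest_gen = 0.0  # assumption: receiver starts with data generated at t = 0
    for generated_at, delivered_at in sorted(updates, key=lambda u: u[1]):
        if delivered_at > horizon:
            break
        # Trapezoid under the linearly growing age since the last delivery.
        age_start = last_time - latest_gen
        age_end = delivered_at - latest_gen
        area += 0.5 * (age_start + age_end) * (delivered_at - last_time)
        last_time = delivered_at
        # A stale update (older than what is already held) does not reduce the age.
        latest_gen = max(latest_gen, generated_at)
    # Final segment from the last delivery up to the horizon.
    age_start = last_time - latest_gen
    age_end = horizon - latest_gen
    area += 0.5 * (age_start + age_end) * (horizon - last_time)
    return area / horizon


if __name__ == "__main__":
    # Two updates, each delivered one time unit after it was generated.
    print(average_aoi([(1, 2), (3, 4)], horizon=4))
```

Less frequent transmissions save energy but lengthen the intervals over which the age grows, which is the trade-off the abstract refers to.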
Abstract: Limiting the cost of coordination and contention among a large number of nodes calls for grant-free approaches that exploit physical layer techniques to resolve collisions. Successive Interference Cancellation (SIC) is becoming a key building block of multiple access channel receivers, in an effort to support the massive Internet of Things (IoT). In this paper, we explore the large-scale performance of SIC in a theoretical framework. A general model of a SIC receiver is stated for a shared channel with $n$ transmitters. The asymptotic sum-rate performance is characterized as $n \rightarrow \infty$, for a suitably scaled target Signal to Noise plus Interference Ratio (SNIR). The probability distribution of the number of correctly decoded packets is shown to tend to a deterministic distribution as $n$ grows large. The asymptotic analysis is carried out for any probability distribution of the wireless channel gain, assuming that power control makes the average received power level the same for all nodes.
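To make the decoding rule concrete, here is a minimal sketch of an idealized SIC receiver. It is a textbook simplification, not the general model of the paper. It assumes perfect cancellation of decoded signals, a single SNIR threshold shared by all packets, and, in the usage example only, exponentially distributed received powers with unit mean. The names `sic_decode`, `threshold`, and `noise` are hypothetical.

```python
import random


def sic_decode(powers, threshold, noise=1.0):
    """Count packets decoded by an idealized SIC receiver.

    powers: received power of each of the n overlapping transmissions.
    A packet is decoded if its SNIR, computed against the noise plus all
    not-yet-decoded signals, reaches the threshold. Decoded signals are
    assumed to be cancelled perfectly (an idealization).
    """
    remaining = sorted(powers, reverse=True)  # strongest signal first
    interference = sum(remaining)
    decoded = 0
    for p in remaining:
        interference -= p  # power of the weaker, still-undecoded signals
        if p / (noise + interference) < threshold:
            # If the strongest remaining signal fails, every weaker one
            # sees more interference with less power, so decoding stops.
            break
        decoded += 1
    return decoded


if __name__ == "__main__":
    n = 1000
    # Assumption for illustration: unit-mean exponential received powers.
    powers = [random.expovariate(1.0) for _ in range(n)]
    # Hypothetical scaling of the target SNIR with n, chosen for this demo.
    print(sic_decode(powers, threshold=1.0 / n))
```

Running the example for growing `n` and repeated random draws gives an informal view of how the number of decoded packets behaves at scale, under the stated assumptions.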