Haidong Zhao

ML Inference Scheduling with Predictable Latency

Dec 24, 2025

Haidong Zhao, Nikolaos Georgantas

Abstract: Machine learning (ML) inference serving systems can schedule requests to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization may compromise latency-sensitive scheduling, as concurrent tasks contend for GPU resources and thereby introduce interference. Given that interference effects introduce unpredictability in scheduling, neglecting them may compromise SLO or deadline satisfaction. Nevertheless, existing interference prediction approaches remain limited in several respects, which may restrict their usefulness for scheduling. First, they are often coarse-grained, which ignores runtime co-location dynamics and thus restricts their accuracy in interference prediction. Second, they tend to use a static prediction model, which may not effectively cope with different workload characteristics. In this paper, we evaluate the potential limitations of existing interference prediction approaches, finding that coarse-grained methods can lead to noticeable deviations in prediction accuracy and that static models degrade considerably under changing workloads.

* Proceedings of the Middleware for Autonomous AIoT Systems in the Computing Continuum (MAIoT 2025)
* Accepted at MAIoT@Middleware 2025
