Gan Luo

Service-Induced Congestion in Memory-Constrained LLM Serving

Jun 14, 2026

Ruicheng Ao, Jing Dong, Gan Luo, David Simchi-Levi

Abstract: In large language model (LLM) serving, each request accumulates persistent graphics processing unit (GPU) memory during service as its key-value cache grows with every generated token. Under high concurrency, aggregate memory usage therefore increases endogenously over time: the service process itself creates future capacity pressure. When memory capacity is exceeded, systems evict active requests, discarding cached state and restarting them later, which wastes computation and reduces throughput. We develop a discrete-time dynamical model of memory-constrained LLM inference that captures admission, memory growth, and eviction under continuous batching. In the saturated-input regime, the system admits both eviction-free fixed points and limit cycles with evictions. For homogeneous workloads, we show that the eviction-free equilibrium is unstable and that, except for a Lebesgue-measure-zero exact-capture set, the system converges to a unique worst-case limit cycle that is asymptotically stable outside this exceptional set, with throughput losses as large as 50%. For heterogeneous workloads, we prove a stability criterion in the two-class common-input setting and explain how the survival-polynomial mechanism generalizes to multiple classes and heterogeneous input lengths. Under an input-dominated scaling regime, coprime decoding lengths stabilize the eviction-free equilibrium, while non-coprime lengths create synchronized modes that drive instability. These results characterize when workload heterogeneity desynchronizes completions and helps stabilize memory-constrained serving. More broadly, we identify service-induced congestion as a structural instability mechanism and derive scheduling design principles for sustaining high throughput.

* 101 pages
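
The admission, memory-growth, and eviction loop described in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' model: the function name `simulate`, greedy admission, newest-first eviction, and a single homogeneous request class are all illustrative choices that the abstract does not specify.

```python
def simulate(capacity, prompt_len, output_len, steps):
    """Toy discrete-time model of memory-constrained serving (illustrative only).

    Assumptions (not taken from the paper): requests are identical, the
    waiting queue is never empty (saturated input), admission is greedy,
    and eviction drops the most recently admitted requests first.
    Returns (completed_requests, evictions).
    """
    active = []      # tokens generated so far, one entry per active request
    completed = 0
    evictions = 0
    for _ in range(steps):
        # Admission: admit new requests while their prompt still fits.
        used = sum(prompt_len + g for g in active)
        while used + prompt_len <= capacity:
            active.append(0)
            used += prompt_len
        # Service: every active request generates one token, so its
        # key-value cache footprint grows by one unit.
        active = [g + 1 for g in active]
        # Completion: finished requests release their memory.
        completed += sum(1 for g in active if g >= output_len)
        active = [g for g in active if g < output_len]
        # Eviction: if growth pushed usage over capacity, discard the
        # newest requests together with all of their progress.
        while active and sum(prompt_len + g for g in active) > capacity:
            active.pop()
            evictions += 1
    return completed, evictions


print(simulate(10, 2, 2, 4))  # (6, 4)
```

In this small run, greedy admission fills memory, the next decoding step overflows it, and the system settles into a repeating admit-evict-complete pattern. That is a simple instance of service itself creating later capacity pressure; the numbers say nothing about the paper's quantitative results.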

Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

Apr 15, 2025

Ruicheng Ao, Gan Luo, David Simchi-Levi, Xinshang Wang

Abstract: Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure -- generating responses by processing text in segments and using a memory-heavy Key-Value (KV) cache -- demands significant computational resources, particularly under memory constraints. This paper formulates LLM inference optimization as a multi-stage online scheduling problem where sequential prompt arrivals and KV cache growth render conventional scheduling ineffective. We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design. Building on this, we propose the Waiting for Accumulated Inference Threshold (WAIT) algorithm, which uses multiple thresholds to schedule incoming prompts optimally when output lengths are known, and extend it to Nested WAIT for cases with unknown output lengths. Theoretical analysis shows that both algorithms achieve near-optimal performance against the fluid benchmark in heavy traffic conditions, balancing throughput, latency, and Time to First Token (TTFT). Experiments with the Llama-7B model on an A100 GPU using both synthetic and real-world datasets demonstrate improved throughput and reduced latency relative to established baselines like vLLM and Sarathi. This work bridges operations research and machine learning, offering a rigorous framework for the efficient deployment of LLMs under memory constraints.

* 42 pages, 18 figures

Course Concept Expansion in MOOCs with External Knowledge and Interactive Game

Sep 17, 2019

Jifan Yu, Chenyu Wang, Gan Luo, Lei Hou, Juanzi Li, Jie Tang, Zhiyuan Liu

Abstract: As Massive Open Online Courses (MOOCs) become increasingly popular, it is promising to automatically provide extracurricular knowledge for MOOC users. Suffering from semantic drift and a lack of knowledge guidance, existing methods cannot effectively expand course concepts in complex MOOC environments. In this paper, we first build a novel boundary when searching for new concepts via an external knowledge base and then utilize heterogeneous features to verify the high-quality results. In addition, to involve human effort in our model, we design an interactive optimization mechanism based on a game. Our experiments on four datasets from Coursera and XuetangX show that the proposed method achieves significant improvements (+0.19 MAP) over existing methods. The source code and datasets have been published.
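
The reported gain is measured in MAP (mean average precision). The sketch below shows one common way to compute it; the abstract does not state which variant the paper uses, so the normalization by the number of relevant items and the function names are assumptions made for illustration.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list of unique items.

    Assumed variant: the sum of precision values at each relevant hit,
    divided by the total number of relevant items.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical example data, not taken from the paper's datasets.
runs = [
    (["a", "b", "c"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),            # AP = 1/2
]
print(round(mean_average_precision(runs), 4))  # 0.6667
```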
