Duong Tung Nguyen

Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference

Apr 08, 2026

Jiaming Cheng, Duong Tung Nguyen

Abstract: Deploying large language model (LLM) inference at scale requires jointly selecting base models, provisioning heterogeneous GPUs, configuring parallelism, and distributing workloads under tight latency, accuracy, and budget constraints. Exact mixed-integer linear programming (MILP) approaches guarantee optimality but scale poorly. We propose two constraint-aware heuristics: a Greedy Heuristic (GH) for single-pass allocation, and an Adaptive Greedy Heuristic (AGH) that enhances GH via multi-start construction, relocate-based local search, and GPU consolidation. Three constraint-aware mechanisms (TP-aware feasibility selection, cost-per-effective-coverage ranking, and TP upgrade) ensure feasibility under tightly coupled memory, delay, error, and budget constraints. On workloads calibrated with the Azure LLM Inference Trace (2025), both heuristics produce feasible solutions in under one second, with AGH closely approaching optimal cost while achieving over 260x speedup on large-scale instances. Under out-of-sample stress tests with up to 1.5x parameter inflation, AGH maintains controlled SLO violations and stable cost, whereas the exact solver's placement degrades sharply.
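The abstract does not spell out the heuristics, so the following is only a generic sketch of a single-pass greedy allocation ranked by cost per effective coverage, not the authors' GH or AGH. The `Config` fields, the `greedy_allocate` function, and all numbers are hypothetical, and memory, accuracy, and tensor-parallelism constraints are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Hypothetical (model, GPU) deployment option; fields are illustrative."""
    name: str
    cost_per_hour: float
    throughput_rps: float
    latency_ms: float


def greedy_allocate(configs, demand_rps, slo_ms, budget):
    """Add one replica at a time, always choosing the SLO-feasible config
    with the lowest cost per unit of *remaining* demand it would cover.
    Returns (replica counts, total cost), or None if no feasible plan fits."""
    feasible = [c for c in configs
                if c.latency_ms <= slo_ms and c.throughput_rps > 0]
    if not feasible:
        return None
    allocation, cost, remaining = {}, 0.0, demand_rps
    while remaining > 0:
        # Effective coverage is capped by what is still unserved.
        best = min(feasible,
                   key=lambda c: c.cost_per_hour / min(c.throughput_rps, remaining))
        if cost + best.cost_per_hour > budget:
            return None  # budget exhausted before demand is covered
        allocation[best.name] = allocation.get(best.name, 0) + 1
        cost += best.cost_per_hour
        remaining -= best.throughput_rps
    return allocation, cost


configs = [
    Config("A", cost_per_hour=4.0, throughput_rps=100.0, latency_ms=50.0),
    Config("B", cost_per_hour=1.0, throughput_rps=20.0, latency_ms=80.0),
    Config("C", cost_per_hour=0.5, throughput_rps=30.0, latency_ms=300.0),  # violates SLO
]
plan = greedy_allocate(configs, demand_rps=130.0, slo_ms=100.0, budget=10.0)
```

Capping coverage by the remaining demand is what makes the large config attractive first and the small config attractive for the tail; a plain cost-per-throughput ranking would keep adding the large config even when most of its capacity would sit idle.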

Decentralized Federated Learning with Gradient Tracking over Time-Varying Directed Networks

Sep 25, 2024

Duong Thuy Anh Nguyen, Su Wang, Duong Tung Nguyen, Angelia Nedich, H. Vincent Poor

Abstract: We investigate the problem of agent-to-agent interaction in decentralized (federated) learning over time-varying directed graphs, and, in doing so, propose a consensus-based algorithm called DSGTm-TV. The proposed algorithm incorporates gradient tracking and heavy-ball momentum to distributively optimize a global objective function, while preserving local data privacy. Under DSGTm-TV, agents will update local model parameters and gradient estimates using information exchange with neighboring agents enabled through row- and column-stochastic mixing matrices, which we show guarantee both consensus and optimality. Our analysis establishes that DSGTm-TV exhibits linear convergence to the exact global optimum when exact gradient information is available, and converges in expectation to a neighborhood of the global optimum when employing stochastic gradients. Moreover, in contrast to existing methods, DSGTm-TV preserves convergence for networks with uncoordinated stepsizes and momentum parameters, for which we provide explicit bounds. These results enable agents to operate in a fully decentralized manner, independently optimizing their local hyper-parameters. We demonstrate the efficacy of our approach via comparisons with state-of-the-art baselines on real-world image classification and natural language processing tasks.
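As a rough illustration of the ingredients named in the abstract, the sketch below runs a generic gradient-tracking iteration with heavy-ball momentum, using a row-stochastic matrix to mix parameters and a column-stochastic matrix to mix gradient trackers, over two alternating directed graphs. It is not the DSGTm-TV update from the paper: the scalar quadratic objectives, the uniform weight rule in `mixing_matrices`, the two graphs, and the per-agent step sizes and momentum values are all assumptions chosen for a small, exact-gradient example.

```python
def mixing_matrices(in_neighbors):
    """Build a row-stochastic A and a column-stochastic B from in-neighbor
    lists (in_neighbors[i] holds the agents j with an edge j -> i)."""
    n = len(in_neighbors)
    out_degree = [0] * n
    for nbrs in in_neighbors:
        for j in nbrs:
            out_degree[j] += 1
    A = [[0.0] * n for _ in range(n)]
    B = [[0.0] * n for _ in range(n)]
    for i, nbrs in enumerate(in_neighbors):
        for j in list(nbrs) + [i]:            # self-loop included
            A[i][j] = 1.0 / (len(nbrs) + 1)   # receiver averages its inputs
            B[i][j] = 1.0 / (out_degree[j] + 1)  # sender splits its tracker
    return A, B


def run(targets, graphs, alphas, betas, iters=1000):
    """Minimize sum_i 0.5 * (x - targets[i])**2; the optimum is the mean."""
    n = len(targets)
    mats = [mixing_matrices(g) for g in graphs]

    def grad(i, x):
        return x - targets[i]

    x = [0.0] * n
    x_prev = x[:]
    y = [grad(i, x[i]) for i in range(n)]     # trackers start at local gradients
    for k in range(iters):
        A, B = mats[k % len(mats)]            # time-varying topology
        x_new = [sum(A[i][j] * x[j] for j in range(n))
                 - alphas[i] * y[i]
                 + betas[i] * (x[i] - x_prev[i])
                 for i in range(n)]
        y = [sum(B[i][j] * y[j] for j in range(n))
             + grad(i, x_new[i]) - grad(i, x[i])
             for i in range(n)]
        x_prev, x = x, x_new
    return x


graphs = [
    [[2], [0], [0, 1]],   # edges 2->0, 0->1, 0->2, 1->2
    [[1], [2], [0]],      # edges 1->0, 2->1, 0->2
]
result = run(targets=[1.0, 2.0, 6.0], graphs=graphs,
             alphas=[0.04, 0.05, 0.06], betas=[0.10, 0.15, 0.20])
```

Because every B is column-stochastic, the sum of the trackers always equals the sum of the current local gradients, which is why consensus among the agents lands on the minimizer of the global objective rather than on a weighted one.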

A Bandit Approach to Online Pricing for Heterogeneous Edge Resource Allocation

Feb 14, 2023

Jiaming Cheng, Duong Thuy Anh Nguyen, Lele Wang, Duong Tung Nguyen, Vijay K. Bhargava

Abstract: Edge Computing (EC) offers a superior user experience by positioning cloud resources in close proximity to end users. The challenge of allocating edge resources efficiently while maximizing profit for the EC platform remains a sophisticated problem, especially with the added complexity of the online arrival of resource requests. To address this challenge, we propose to cast the problem as a multi-armed bandit problem and develop two novel online pricing mechanisms, the Kullback-Leibler Upper Confidence Bound (KL-UCB) algorithm and the Min-Max Optimal algorithm, for heterogeneous edge resource allocation. These mechanisms operate in real-time and do not require prior knowledge of demand distribution, which can be difficult to obtain in practice. The proposed posted pricing schemes allow users to select and pay for their preferred resources, with the platform dynamically adjusting resource prices based on observed historical data. Numerical results show the advantages of the proposed mechanisms compared to several benchmark schemes derived from traditional bandit algorithms, including the Epsilon-Greedy, basic UCB, and Thompson Sampling algorithms.
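To make the KL-UCB idea concrete, here is a minimal sketch for a single resource with a finite set of candidate posted prices, where each price is an arm and a user either accepts or rejects it. This is an assumed simplification, not the paper's mechanism: the Bernoulli acceptance model, the exploration budget of log(t), scoring an arm as price times an optimistic acceptance probability, and the names `kl_ucb_bound` and `choose_price` are all illustrative.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped for safety."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_bound(p_hat, pulls, t, tol=1e-6):
    """Largest q in [p_hat, 1] with pulls * KL(p_hat, q) <= log(t), by bisection."""
    budget = math.log(max(t, 2)) / pulls
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bernoulli_kl(p_hat, mid) <= budget:
            lo = mid   # mid is still plausible, move the lower end up
        else:
            hi = mid
    return lo


def choose_price(prices, accepts, pulls, t):
    """Return the index of the price to post at round t."""
    for i, n in enumerate(pulls):
        if n == 0:
            return i   # post every price once before using the index
    scores = [prices[i] * kl_ucb_bound(accepts[i] / pulls[i], pulls[i], t)
              for i in range(len(prices))]
    return max(range(len(prices)), key=scores.__getitem__)
```

After each round, the platform would increment `pulls[i]` for the posted price and `accepts[i]` if the user bought, so the optimistic bound tightens as a price accumulates observations.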
