Title: HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing

May 18, 2025

Leyang Xue, Yao Fu, Luo Mai, Mahesh K. Marina

Figures 1 to 4 for HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing

Abstract: Giant Deep Neural Networks (DNNs) have become indispensable for accurate and robust support of large-scale cloud-based AI services. However, serving giant DNNs is prohibitively expensive from an energy consumption viewpoint, easily exceeding the cost of training, because of the enormous scale of the GPU clusters needed to hold giant DNN model partitions and replicas. Existing approaches can optimize either energy efficiency or inference accuracy, but not both. To overcome this status quo, we propose HybridServe, a novel hybrid DNN model serving system that serves multiple differently sized versions of the model (small to giant) in tandem. Through a confidence-based hybrid model serving dataflow, HybridServe prefers to serve inference requests with energy-efficient smaller models so long as accuracy is not compromised, thereby reducing the number of replicas needed for giant DNNs. HybridServe also features a dataflow planner that partitions and replicates candidate models efficiently to maximize serving system throughput. Experimental results using a prototype implementation of HybridServe show that it reduces energy footprint by up to 19.8x compared to state-of-the-art DNN model serving systems while matching the accuracy of serving solely with giant DNNs.
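The confidence-based serving idea described in the abstract can be illustrated with a minimal sketch. This is not the HybridServe implementation: the function name `cascade_predict`, the use of the maximum class probability as the confidence score, the fixed per-model thresholds, and the toy models are all assumptions made for illustration. The sketch only shows the general cascade pattern, in which a request is answered by the smallest model that is confident enough and is otherwise escalated to the next larger one.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a "model" is any callable that maps a request
# to a sequence of class probabilities.
Model = Callable[[str], Sequence[float]]


def cascade_predict(
    request: str,
    models: List[Model],
    thresholds: List[float],
) -> Tuple[int, int]:
    """Return (predicted_class, index_of_model_that_answered).

    `models` is ordered from smallest to largest. `thresholds[i]` is the
    minimum confidence at which model i is allowed to answer; the last
    model has no threshold because it always answers.
    """
    if len(thresholds) != len(models) - 1:
        raise ValueError("need exactly one threshold per non-final model")
    for idx, model in enumerate(models):
        probs = model(request)
        # Assumption: confidence is the maximum class probability.
        confidence = max(probs)
        label = max(range(len(probs)), key=probs.__getitem__)
        is_last = idx == len(models) - 1
        if is_last or confidence >= thresholds[idx]:
            return label, idx
    raise RuntimeError("unreachable: the last model always answers")


# Toy stand-ins for a small and a giant model (illustrative only).
def small_model(request: str) -> Sequence[float]:
    return [0.55, 0.45] if "hard" in request else [0.95, 0.05]


def giant_model(request: str) -> Sequence[float]:
    return [0.10, 0.90] if "hard" in request else [0.99, 0.01]


easy = cascade_predict("easy example", [small_model, giant_model], [0.8])
hard = cascade_predict("hard example", [small_model, giant_model], [0.8])
print(easy, hard)
```

In this toy setup the easy request is answered by the small model alone, while the hard request falls below the 0.8 threshold and is escalated to the giant model. Under this pattern, the larger model only sees the share of traffic that the smaller one cannot handle confidently, which is the intuition behind needing fewer giant-model replicas.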
