Yujeong Choi

Hera: A Heterogeneity-Aware Multi-Tenant Inference Server for Personalized Recommendations

Feb 23, 2023
Yujeong Choi, John Kim, Minsoo Rhu

While providing low latency is a fundamental requirement in deploying recommendation services, achieving high resource utilization is also crucial in cost-effectively maintaining the datacenter. Co-locating multiple workers of a model is an effective way to maximize query-level parallelism and server throughput, but the interference caused by concurrent workers at shared resources can prevent server queries from meeting their SLA. Hera utilizes the heterogeneous memory requirements of multi-tenant recommendation models to intelligently determine a productive set of co-located models and their resource allocation, providing fast response time while achieving high throughput. We show that Hera achieves an average 37.3% improvement in effective machine utilization, enabling a 26% reduction in required servers and significantly improving upon the baseline recommendation inference server.
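
The core idea of memory-aware co-location can be conveyed with a small sketch. The Python snippet below is a minimal, hypothetical greedy placement routine, not Hera's actual algorithm: models are described only by an assumed memory_gb footprint, and memory-heavy models are packed first so that each server's capacity is respected.

    from dataclasses import dataclass

    @dataclass
    class Model:
        name: str
        memory_gb: float  # assumed embedding-table footprint of one worker

    def colocate(models, capacity_gb):
        """Greedy, memory-aware grouping of model workers onto servers.

        Illustrative sketch only: packs memory-heavy models first, in the
        spirit of Hera's heterogeneity-aware placement (not the paper's
        actual policy).
        """
        servers = []
        for m in sorted(models, key=lambda m: m.memory_gb, reverse=True):
            for srv in servers:
                if srv["free_gb"] >= m.memory_gb:
                    srv["models"].append(m.name)
                    srv["free_gb"] -= m.memory_gb
                    break
            else:  # no existing server fits this model; provision a new one
                servers.append({"models": [m.name], "free_gb": capacity_gb - m.memory_gb})
        return servers

    # e.g. colocate([Model("dlrm-a", 40.0), Model("dlrm-b", 12.0)], capacity_gb=64.0)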

PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers

Feb 27, 2022
Yunseong Kim, Yujeong Choi, Minsoo Rhu

In cloud machine learning (ML) inference systems, providing low latency to end-users is of utmost importance. However, maximizing server utilization and system throughput is also crucial for ML service providers, as it helps lower the total cost of ownership. GPUs have often been criticized for ML inference usage, as their massive compute and memory throughput is hard to fully utilize under low-batch inference scenarios. To address this limitation, NVIDIA's recently announced Ampere GPU architecture provides features to "reconfigure" one large, monolithic GPU into multiple smaller "GPU partitions". This feature gives cloud ML service providers the ability to use the reconfigurable GPU not only for large-batch training but also for small-batch inference, with the potential to achieve high resource utilization. In this paper, we study this emerging GPU architecture with reconfigurability to develop a high-performance multi-GPU ML inference server. Our first proposition is a sophisticated partitioning algorithm for reconfigurable GPUs that systematically determines a heterogeneous set of multi-granular GPU partitions best suited for the inference server's deployment. Furthermore, we co-design an elastic scheduling algorithm, tailored to our heterogeneously partitioned GPU server, which effectively balances low latency and high GPU utilization.

* This is an extended version of our work accepted for publication at the 59th ACM/ESDA/IEEE Design Automation Conference (DAC), 2022.
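
As a rough illustration of the partitioning problem the paper targets (not PARIS itself), the sketch below heuristically chooses a mix of MIG-style partition sizes for a single 7-slice GPU from an observed batch-size histogram. The slice sizes match A100 MIG instances, while the sizing rule and the batch_histogram input are assumptions made only for this example.

    # Allowed compute-slice counts for A100 MIG instances.
    ALLOWED_SLICES = (1, 2, 3, 4, 7)

    def choose_partitions(batch_histogram, total_slices=7):
        """Pick a heterogeneous partition mix for one reconfigurable GPU.

        batch_histogram: {batch_size: fraction_of_queries}, a hypothetical
        traffic profile. This is a crude heuristic sketch, not the
        partitioning algorithm proposed in the paper.
        """
        partitions, remaining = [], total_slices
        # Serve the most demand-heavy batch sizes first with larger partitions.
        for batch, frac in sorted(batch_histogram.items(), key=lambda kv: -kv[0] * kv[1]):
            want = min((s for s in ALLOWED_SLICES if s >= batch // 8 + 1), default=7)
            if want <= remaining:
                partitions.append(want)
                remaining -= want
        partitions.extend([1] * remaining)  # fill leftover slices with 1-slice partitions
        return sorted(partitions, reverse=True)

    # e.g. choose_partitions({1: 0.5, 4: 0.3, 16: 0.2}) -> [3, 1, 1, 1, 1]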

LazyBatching: An SLA-aware Batching System for Cloud Machine Learning Inference

Oct 25, 2020
Yujeong Choi, Yunseong Kim, Minsoo Rhu

In cloud ML inference systems, batching is an essential technique for increasing throughput, which helps optimize the total cost of ownership. Prior graph batching combines individual DNN graphs into a single graph, allowing multiple inputs to be executed concurrently in parallel. We observe that such coarse-grained graph batching becomes suboptimal at handling dynamic inference request traffic, leaving significant performance on the table. This paper proposes LazyBatching, an SLA-aware batching system that considers both scheduling and batching at the granularity of individual graph nodes, rather than the entire graph, for flexible batching. We show that LazyBatching can intelligently determine the set of nodes that can be efficiently batched together, achieving an average 15x, 1.5x, and 5.5x improvement over graph batching in average response time, throughput, and SLA satisfaction, respectively.
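
A toy version of node-granularity, SLA-aware batching is sketched below. It only conveys the idea of batching requests that wait at the same graph node and scheduling the most deadline-urgent node first; the request objects (with a deadline field) and the run_node kernel are hypothetical, and the paper's actual policy is considerably more refined.

    def schedule_step(waiting, num_nodes, run_node):
        """One scheduling step of node-level batching (illustrative only).

        waiting: dict mapping node_id -> list of (request, activation) pairs
        that are ready to execute that graph node.
        """
        ready = {n: reqs for n, reqs in waiting.items() if reqs}
        if not ready:
            return []
        # Pick the node whose most urgent waiting request has the least slack.
        node = min(ready, key=lambda n: min(r.deadline for r, _ in ready[n]))
        batch = waiting.pop(node)
        outputs = run_node(node, [act for _, act in batch])  # hypothetical batched kernel
        finished = []
        for (req, _), out in zip(batch, outputs):
            if node + 1 == num_nodes:
                finished.append((req, out))  # inference complete for this request
            else:
                waiting.setdefault(node + 1, []).append((req, out))
        return finished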

NeuMMU: Architectural Support for Efficient Address Translations in Neural Processing Units

Nov 15, 2019
Bongjoon Hyun, Youngeun Kwon, Yujeong Choi, John Kim, Minsoo Rhu

To satisfy the compute and memory demands of deep neural networks, neural processing units (NPUs) are being widely utilized to accelerate deep learning algorithms. Similar to how GPUs have evolved from slave devices into a mainstream processor architecture, it is likely that NPUs will become first-class citizens in this fast-evolving heterogeneous architecture space. This paper makes a case for enabling address translation in NPUs to decouple the virtual and physical memory address spaces. Through a careful, data-driven application characterization study, we root-cause several limitations of prior GPU-centric address translation schemes and propose a memory management unit (MMU) that is tailored for NPUs. Compared to an oracular MMU design point, our proposal incurs only an average 0.06% performance overhead.
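
The underlying mechanism, virtual-to-physical translation backed by a TLB, can be shown with a toy model. The class below is purely illustrative: the page size, TLB capacity, single-level page table, and FIFO eviction are all arbitrary assumptions, not the MMU design proposed in the paper.

    PAGE_SIZE = 4096  # assumed page size in bytes

    class SimpleMMU:
        def __init__(self, page_table, tlb_entries=64):
            self.page_table = page_table  # dict: virtual page number -> physical frame
            self.tlb = {}                 # small cache of recent translations
            self.tlb_entries = tlb_entries
            self.hits = self.misses = 0

        def translate(self, vaddr):
            """Translate a virtual address to a physical address."""
            vpn, offset = divmod(vaddr, PAGE_SIZE)
            if vpn in self.tlb:            # TLB hit: no page-table walk needed
                self.hits += 1
                pfn = self.tlb[vpn]
            else:                          # TLB miss: walk the page table
                self.misses += 1
                pfn = self.page_table[vpn]
                if len(self.tlb) >= self.tlb_entries:
                    self.tlb.pop(next(iter(self.tlb)))  # crude FIFO eviction
                self.tlb[vpn] = pfn
            return pfn * PAGE_SIZE + offset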

PREMA: A Predictive Multi-task Scheduling Algorithm For Preemptible Neural Processing Units

Sep 06, 2019
Yujeong Choi, Minsoo Rhu

To amortize cost, cloud vendors providing DNN acceleration as a service to end-users employ consolidation and virtualization to share the underlying resources among multiple DNN service requests. This paper makes a case for a "preemptible" neural processing unit (NPU) and a "predictive" multi-task scheduler to meet the latency demands of high-priority inference while maintaining high throughput. We evaluate both the mechanisms that enable NPUs to be preemptible and the policies that utilize them to meet scheduling objectives. We show that preemptive NPU multi-tasking can achieve an average 7.8x, 1.4x, and 4.8x improvement in latency, throughput, and SLA satisfaction, respectively.
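
To make the scheduling idea concrete, here is a minimal sketch of priority-driven, preemption-aware dispatch. It assumes task objects with priority and predicted_runtime fields and a hypothetical NPU interface exposing current, launch(), and checkpoint(); PREMA's actual token-based, prediction-driven policy is more involved.

    import heapq
    from itertools import count

    _tiebreak = count()  # keeps heap comparisons from falling through to task objects

    def push(run_queue, task):
        # Higher priority first; among equal priorities, shorter predicted jobs first.
        heapq.heappush(run_queue,
                       (-task.priority, task.predicted_runtime, next(_tiebreak), task))

    def dispatch(npu, run_queue, arriving_task=None):
        """Decide what runs next on a preemptible NPU (illustrative sketch only)."""
        if arriving_task is not None:
            push(run_queue, arriving_task)
        if not run_queue:
            return
        top = run_queue[0][-1]
        if npu.current is None:
            heapq.heappop(run_queue)
            npu.launch(top)
        elif top.priority > npu.current.priority:
            heapq.heappop(run_queue)           # the urgent task leaves the queue
            push(run_queue, npu.checkpoint())  # hypothetical: save state, requeue preempted task
            npu.launch(top)                    # high-priority task runs immediately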
