Abstract: As cloud computing continues to evolve, the adoption of multi-NUMA (Non-Uniform Memory Access) architectures by cloud service providers has introduced new challenges in virtual machine (VM) scheduling. To address these challenges and more accurately reflect the complexities of modern cloud environments, we introduce the Dynamic VM Allocation problem in Multi-NUMA Physical Machines (DVAMP). We formally define both the offline and online versions of DVAMP as mixed-integer linear programming problems, providing a rigorous mathematical foundation for analysis. We derive a tight performance bound for greedy online algorithms, offering insight into the worst-case optimality gap as a function of the number of physical machines and the variability of VM lifetimes. To address the challenges posed by DVAMP, we propose SPANE (Symmetry-Preserving Architecture for Multi-NUMA Environments), a novel deep reinforcement learning approach that exploits the problem's inherent symmetries. SPANE produces results that are invariant under arbitrary permutations of physical machine states, improving learning efficiency and solution quality. Extensive experiments on the Huawei-East-1 dataset demonstrate that SPANE outperforms existing baselines, reducing average VM wait time by 45%. Our work contributes to cloud resource management by providing both theoretical insights and practical solutions for VM scheduling in multi-NUMA environments, addressing a critical gap in the literature and improving performance for real-world cloud systems.
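To make the permutation-symmetry idea concrete, the following is a minimal, hypothetical sketch (not the authors' SPANE implementation; all class and variable names are illustrative): a scoring network whose weights are shared across physical machines, so that permuting the order of PM states permutes the scores identically and leaves the selected placement unchanged.

# Hypothetical sketch of a permutation-symmetric placement scorer.
# The same weights score every PM state independently, so reordering the
# PM states only reorders the scores; the argmax still picks the same
# physical machine, which is the invariance property described above.
import torch
import torch.nn as nn

class SharedPMScorer(nn.Module):
    def __init__(self, pm_feat_dim: int, vm_feat_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pm_feat_dim + vm_feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, pm_states: torch.Tensor, vm_request: torch.Tensor) -> torch.Tensor:
        # pm_states: (num_pms, pm_feat_dim); vm_request: (vm_feat_dim,)
        vm = vm_request.expand(pm_states.size(0), -1)
        scores = self.net(torch.cat([pm_states, vm], dim=-1)).squeeze(-1)
        return scores  # argmax over scores gives the PM chosen for the VM

# Permuting the rows of pm_states permutes the scores in the same way, so
# scorer(pm_states[perm], vm).argmax() refers to the same physical machine.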
Abstract: Multi-tenancy in public clouds can lead to co-location interference on shared resources, which may degrade the performance of cloud applications. Cloud providers want to know when such events happen and how severe the degradation is, so that they can perform interference-aware migrations and alleviate the problem. However, virtual machines (VMs) in Infrastructure-as-a-Service public clouds are black boxes to providers, and application-level performance information cannot be acquired. This makes performance monitoring highly challenging, as cloud providers can only rely on low-level metrics such as CPU usage and hardware counters. We propose Alioth, a novel machine learning framework to monitor the performance degradation of cloud applications. To feed the data-hungry models, we first build interference generators and conduct comprehensive co-location experiments on a testbed to construct the Alioth-dataset, which reflects the complexity and dynamicity of real-world scenarios. We then construct Alioth by (1) augmenting features by recovering the low-level metrics that would be observed under no interference using denoising auto-encoders, (2) devising a transfer learning model based on a domain-adaptation neural network so that the model generalizes to test cases unseen in offline training, and (3) developing a SHAP explainer to automate feature selection and enhance model interpretability. Experiments show that Alioth achieves an average mean absolute error of 5.29% offline and 10.8% when testing on applications unseen in the training stage, outperforming the baseline methods. Alioth is also robust in signaling quality-of-service violations under dynamicity. Finally, we demonstrate a possible application of Alioth's interpretability, providing insights that benefit the decision-making of cloud operators. The dataset and code of Alioth have been released on GitHub.
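As an illustration of step (1), here is a minimal sketch of a denoising auto-encoder that maps metrics observed under co-location back to an estimate of the same metrics under no interference, with the reconstruction used as an augmented feature. This is an assumption-laden sketch, not the released Alioth code; the architecture sizes and function names are hypothetical.

# Illustrative sketch only (the released Alioth code may differ): a denoising
# auto-encoder trained on pairs of (metrics under co-location interference,
# metrics of the same workload in isolation).
import torch
import torch.nn as nn

class MetricDenoiser(nn.Module):
    def __init__(self, n_metrics: int, latent: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_metrics, 64), nn.ReLU(),
                                     nn.Linear(64, latent), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(),
                                     nn.Linear(64, n_metrics))

    def forward(self, noisy_metrics: torch.Tensor) -> torch.Tensor:
        # Reconstruct an estimate of the interference-free metric vector.
        return self.decoder(self.encoder(noisy_metrics))

def augment(model: MetricDenoiser, noisy_metrics: torch.Tensor) -> torch.Tensor:
    # Concatenate raw metrics with the recovered no-interference estimate to
    # form the augmented feature vector for the degradation predictor.
    with torch.no_grad():
        recovered = model(noisy_metrics)
    return torch.cat([noisy_metrics, recovered], dim=-1)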