Heterogeneity has become a mainstream architecture design choice for building High Performance Computing systems. However, heterogeneity poses significant challenges for achieving performance portability of execution. Adapting a program to a new heterogeneous platform is laborious and requires developers to manually explore a vast space of execution parameters. To address those challenges, this paper proposes new extensions to OpenMP for autonomous, machine learning-driven adaptation. Our solution includes a set of novel language constructs, compiler transformations, and runtime support. We propose a producer-consumer pattern to flexibly define multiple, different variants of OpenMP code regions to enable adaptation. Those regions are transparently profiled at runtime to autonomously learn optimizing machine learning models that dynamically select the fastest variant. Our approach significantly reduces users' efforts of programming adaptive applications on heterogeneous architectures by leveraging machine learning techniques and code generation capabilities of OpenMP compilation. Using a complete reference implementation in Clang/LLVM we evaluate three use-cases of adaptive CPU-GPU execution. Experiments with HPC proxy applications and benchmarks demonstrate that the proposed adaptive OpenMP extensions automatically choose the best performing code variants for various adaptation possibilities, in several different heterogeneous platforms of CPUs and GPUs.
This paper presents a methodology for using LLVM-based tools to tune the DCA++ (dynamical clusterapproximation) application that targets the new ARM A64FX processor. The goal is to describethe changes required for the new architecture and generate efficient single instruction/multiple data(SIMD) instructions that target the new Scalable Vector Extension instruction set. During manualtuning, the authors used the LLVM tools to improve code parallelization by using OpenMP SIMD,refactored the code and applied transformation that enabled SIMD optimizations, and ensured thatthe correct libraries were used to achieve optimal performance. By applying these code changes, codespeed was increased by 1.98X and 78 GFlops were achieved on the A64FX processor. The authorsaim to automatize parts of the efforts in the OpenMP Advisor tool, which is built on top of existingand newly introduced LLVM tooling.