Abstract: Performance prediction for OpenMP workloads on heterogeneous embedded SoCs is challenging because of complex interactions between task DAG structure, control-flow irregularity, cache and branch behavior, and thermal dynamics. Classical heuristics struggle under workload irregularity, tabular regressors discard structural information, and model-free RL risks overheating resource-constrained devices. We introduce GraphPerf-RT, the first surrogate that unifies task DAG topology, CFG-derived code semantics, and runtime context (per-core DVFS, thermal state, utilization) in a heterogeneous graph representation with typed edges encoding precedence, placement, and contention. Multi-task evidential heads predict makespan, energy, cache and branch misses, and utilization with calibrated uncertainty (Normal-Inverse-Gamma), enabling risk-aware scheduling that filters low-confidence rollouts. We validate GraphPerf-RT on three embedded ARM platforms (Jetson TX2, Jetson Orin NX, RUBIK Pi), achieving R^2 > 0.95 with well-calibrated uncertainty (ECE < 0.05). To demonstrate end-to-end scheduling utility, we integrate the surrogate with four RL methods on Jetson TX2: single-agent model-free (SAMFRL), single-agent model-based (SAMBRL), multi-agent model-free (MAMFRL-D3QN), and multi-agent model-based (MAMBRL-D3QN). Experiments across 5 seeds (200 episodes each) show that MAMBRL-D3QN with GraphPerf-RT as the world model achieves 66% makespan reduction (0.97 +/- 0.35 s) and 82% energy reduction (0.006 +/- 0.005 J) compared to model-free baselines, demonstrating that accurate, uncertainty-aware surrogates enable effective model-based planning on thermally constrained embedded systems.
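To make the uncertainty-based filtering concrete, the following is a minimal sketch of how a Normal-Inverse-Gamma (NIG) output could be turned into a rollout filter. It uses the standard evidential-regression moments (mean gamma, aleatoric variance beta / (alpha - 1), epistemic variance beta / (nu * (alpha - 1))). The class `NIG`, the function `filter_rollouts`, and the threshold `max_epistemic` are hypothetical names chosen for illustration; the abstract does not state the actual filtering rule used by GraphPerf-RT.

```python
from dataclasses import dataclass


@dataclass
class NIG:
    """Normal-Inverse-Gamma parameters for one predicted target (e.g. makespan)."""
    gamma: float  # predicted mean
    nu: float     # evidence for the mean, must be > 0
    alpha: float  # shape, must be > 1 for finite variances
    beta: float   # scale, must be > 0

    def aleatoric(self) -> float:
        # Expected data noise: E[sigma^2] = beta / (alpha - 1)
        return self.beta / (self.alpha - 1.0)

    def epistemic(self) -> float:
        # Variance of the mean: Var[mu] = beta / (nu * (alpha - 1))
        return self.beta / (self.nu * (self.alpha - 1.0))


def filter_rollouts(rollouts, max_epistemic):
    """Keep only rollouts whose epistemic variance is at or below the threshold.

    `rollouts` is a list of (label, NIG) pairs; the threshold is an assumed
    tuning knob, not a value taken from the paper.
    """
    return [label for label, nig in rollouts if nig.epistemic() <= max_epistemic]


if __name__ == "__main__":
    candidates = [
        ("plan_a", NIG(gamma=1.0, nu=2.0, alpha=3.0, beta=4.0)),   # epistemic 1.0
        ("plan_b", NIG(gamma=0.9, nu=10.0, alpha=3.0, beta=4.0)),  # epistemic 0.2
    ]
    print(filter_rollouts(candidates, max_epistemic=0.5))
```

In this sketch a larger `nu` (more evidence) shrinks the epistemic term while leaving the aleatoric term unchanged, which is why a planner might trust `plan_b` over `plan_a` even though both predict similar noise levels.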




Abstract: Generating realistic and diverse unstructured data is a significant challenge in reinforcement learning (RL), particularly in few-shot learning scenarios where data is scarce. Traditional RL methods often rely on extensive datasets or simulations, which are costly and time-consuming. In this paper, we introduce distribution-aware flow matching, a method designed to generate synthetic unstructured data tailored to a few-shot RL application: Dynamic Voltage and Frequency Scaling (DVFS) on embedded processors. The method leverages the sample efficiency of flow matching and incorporates statistical learning techniques such as bootstrapping to improve the generalization and robustness of the latent space. Additionally, we apply feature weighting through Random Forests to prioritize critical data aspects, thereby improving the precision of the generated synthetic data. This approach not only mitigates the overfitting and data correlation that affect unstructured data in traditional model-based RL but also aligns with the Law of Large Numbers, ensuring convergence to true empirical values and an optimal policy as the number of samples increases. Through extensive experiments on DVFS for low-energy processing, we demonstrate that our method provides stable convergence, measured by max Q-value, while improving frame rate by 30% at the earliest timestamps, making this RL model efficient in resource-constrained environments.
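The ingredients named above can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' implementation: it assumes the common linear interpolation path for flow matching (point x_t = (1 - t) * x0 + t * x1 with target velocity x1 - x0), plain bootstrap resampling with replacement, and a per-feature weighted squared error in which the weights stand in for Random Forest feature importances. All function names and the example weights are hypothetical.

```python
import random


def bootstrap(samples, rng):
    """Resample the dataset with replacement, keeping its original size."""
    return [rng.choice(samples) for _ in samples]


def flow_matching_pair(x0, x1, t):
    """Return the interpolated point x_t and its target velocity.

    Assumes a straight-line path from a noise sample x0 to a data sample x1.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def weighted_sq_error(target, predicted, weights):
    """Squared error per feature, scaled by an importance weight.

    The weights are placeholders for Random Forest feature importances.
    """
    return sum(w * (p - y) ** 2 for y, p, w in zip(target, predicted, weights))


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]  # toy DVFS-like feature vectors
    weights = [0.5, 0.5]                         # hypothetical importances

    for x1 in bootstrap(data, rng):
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # noise sample
        t = rng.random()                         # time in [0, 1)
        x_t, v_target = flow_matching_pair(x0, x1, t)
        v_pred = [0.0 for _ in v_target]         # stand-in for a learned model
        print(weighted_sq_error(v_target, v_pred, weights))
```

A real training loop would replace the zero-valued `v_pred` with the output of a learned velocity model evaluated at `(x_t, t)` and minimize the weighted error over many bootstrap replicates.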