A Reference Architecture for Datacenter Scheduler Programming: Design and Experiments

Emerging Research Track, 2023 ACM/SPEC International Conference on Performance Engineering (ICPE '23)

Abstract

Datacenters are the backbone of our digital society, used by industry, academic researchers, and public institutions, among others. To manage resources, datacenters rely on sophisticated schedulers. Each scheduler offers different capabilities, which users access through its API. However, there is no clear understanding of which programming abstractions these schedulers offer, nor of why they offer some and not others. Consequently, it is difficult to understand how schedulers differ and what performance costs their APIs impose. In this work, we study the programming abstractions offered by industrial schedulers, their shortcomings, and the performance costs of those shortcomings. We propose a general reference architecture for scheduler programming abstractions. Specifically, we analyze the programming abstractions of five popular industrial schedulers, compare their APIs, identify the missing abstractions, and carry out an exemplary experiment to demonstrate that schedulers sacrifice performance by under-implementing programming abstractions. In our experiments, we demonstrate that an API extension can improve task runtime by up to 23%. This work helps identify shortcomings and points of improvement in scheduler APIs and, most importantly, provides a reference architecture for existing and future schedulers.