Resource and Risk Management in Datacenters

International Symposium on Parallel and Distributed Computing (ISPDC-2019), Amsterdam

Abstract

Cloud datacenters are increasingly hosting business workloads. Such long-running, on-demand workloads raise important challenges in datacenter operation, requiring efficient online scheduling of workloads with unprecedented characteristics under strict service level agreements (SLAs). In this work, we propose an approach to manage the risk of not meeting SLAs. Our approach is based on portfolio scheduling, which is an online scheduling technique that dynamically selects a scheduling algorithm from a set (portfolio), subject to a possibly changing utility function. Ours is the first datacenter-scheduling approach to consider operational and disaster-recovery risks. Using trace-based simulation with traces collected from a commercial multi-datacenter environment, we give evidence that portfolio scheduling is able to mitigate risks significantly better than its constituent scheduling algorithms and better than datacenter engineers.
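
The selection step of portfolio scheduling can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the abstract does not specify the portfolio, the simulator, or the utility function, so the three policies (`fcfs`, `sjf`, `edf`), the single-machine model, and the names `Job`, `sla_violations`, `risk_utility`, and `select_policy` are all assumptions made for illustration. Each policy is tried on the current queue in a toy simulation, and the policy with the highest utility (here, the fewest missed SLA deadlines) is selected.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Job:
    name: str
    runtime: int   # estimated runtime, in abstract time units
    deadline: int  # SLA deadline, measured from the current time


Policy = Callable[[List[Job]], List[Job]]

# Hypothetical portfolio: each policy only decides the order of the queue.
PORTFOLIO: Dict[str, Policy] = {
    "fcfs": lambda queue: list(queue),                            # arrival order
    "sjf": lambda queue: sorted(queue, key=lambda j: j.runtime),  # shortest job first
    "edf": lambda queue: sorted(queue, key=lambda j: j.deadline), # earliest deadline first
}


def sla_violations(order: List[Job]) -> int:
    """Run jobs back to back on one machine (assumed model); count missed deadlines."""
    clock = 0
    missed = 0
    for job in order:
        clock += job.runtime
        if clock > job.deadline:
            missed += 1
    return missed


def risk_utility(order: List[Job]) -> float:
    """Assumed utility: fewer SLA violations is better."""
    return -sla_violations(order)


def select_policy(queue: List[Job],
                  utility: Callable[[List[Job]], float] = risk_utility) -> str:
    """Simulate every policy in the portfolio and return the name of the best one."""
    return max(PORTFOLIO, key=lambda name: utility(PORTFOLIO[name](queue)))


if __name__ == "__main__":
    queue = [Job("b", 1, 10), Job("c", 2, 10), Job("a", 3, 3)]
    print(select_policy(queue))
```

Because `utility` is a parameter, the selection can be repeated at each scheduling interval with a different utility function, which mirrors the "possibly changing utility function" described above. A realistic selector would replace the toy single-machine model with a trace-driven datacenter simulator.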