Massivizing High Performance Computing for AI and ML: VU on the Science, Design, and Engineering of AI and ML Ecosystems

ScalPerf 2022 / Sep 21, 2022


Wherever we look, our society is turning digital. Science and engineering, business-critical and economic operations, and online education and gaming rely increasingly on the effective digitalization of their processes. For digitalization to succeed, two key challenges must be addressed simultaneously: (1) enabling faster, better, and ethical analysis and decision-making through artificial intelligence (AI) and machine learning (ML), and (2) enabling scalable, more available, and more sustainable infrastructure for AI/ML and other Information and Communication Technology (ICT) operations, through large yet efficient, interoperable, and largely automated computer ecosystems. The latter is the grand challenge of massivizing computer systems. Inspired by this challenge and by over 15 years of experience with distributed computer systems, we focus on understanding, deploying, scaling, and evolving such computer ecosystems successfully, that is, satisficing performance, dependability, sustainability, and cost-effectiveness. We posit that we can achieve this through an ambitious, comprehensive research program, which starts from the idea that we can address this grand, fundamental challenge by focusing on computer ecosystems rather than merely on (individual, small-scale) computer systems. In this talk, we define computer ecosystems and differentiate them from mere systems. We formulate eight principles and introduce a reference architecture for computer ecosystems supporting AI/ML and beyond across the computing continuum, as a high-level, universal framework that may guide the science, design, and engineering of such ecosystems. We synthesize a framework of resource management and scheduling (RM&S) techniques, which we argue should be explored systematically in the next decade. Such techniques can be used not only to support better AI/ML processes, but also to improve the ICT infrastructure that runs them.
We show early results obtained experimentally, both through controlled real-world experiments using the GradeML framework and through what-if analysis using the OpenDC simulator. This work could lead in particular to better workflow, big data, and graph processing frameworks supporting AI/ML, and the creation of new processes and services that depend on them. This vision aligns with the Manifesto on Computer Systems and Networking Research in the Netherlands [1], which the speaker co-leads. Many of our examples come from real-world prototyping and experimentation, grand experiments in computer systems, and/or benchmarking and performance analysis work conducted with the Cloud group of SPEC RG [2].

[1] Future Computer Systems and Networking Research in the Netherlands: A Manifesto, 2022. [Online]
[2] SPEC RG Cloud

BIO

Alexandru Iosup is a full professor at Vrije Universiteit Amsterdam (VU), a high-quality research university in the Netherlands. He is the tenured chair of the Massivizing Computer Systems research group at the VU and a visiting researcher at TU Delft. He is also the elected chair of the SPEC-RG Cloud Group. His work in distributed systems and ecosystems includes over 150 peer-reviewed articles with high scientific impact, and has applications in cloud computing, big data, scientific and business-critical computing, and online gaming. His research has received prestigious recognition, including membership in the (Young) Royal Academy of Arts and Sciences of the Netherlands, the Netherlands ICT Researcher of the Year award, and a PhD from TU Delft. His leadership and innovation in education have led to various awards, including the prestigious Netherlands Higher-Education Teacher of the Year. He has received a knighthood for cultural and scientific merits. Contact Alexandru at @AIosup.