ICT.Open'25, Utrecht, The Netherlands, April 15, 2025
Wherever we look, our society is turning digital. Science and engineering, business-critical and government operations, and online education, shopping, and gaming rely increasingly on digitalization---on core processes powered by data, AI, and computers. For digitalization to succeed, we must integrate computer systems into larger _ecosystems_, managed effectively and efficiently. However successful they have been until now, we cannot take computer ecosystems for granted: their core does not yet rest on sound principles of science, design, and engineering, and we already see many warning signs about the scalability, dependability, and sustainability of engineered operations. Creating a core that can address these warning signs at the unprecedented scale and demand ahead is the fundamental challenge of massivizing computer systems.
In this talk, inspired by this challenge and by our more than 15 years of experience with distributed computer ecosystems, we focus on understanding, deploying, scaling, and evolving computer ecosystems. Together with our team and partners, we envision achieving this through an ambitious, comprehensive research, development, and innovation program, whose key is to start from combined computer ecosystems rather than from individual, small-scale computer systems.
Using illustrative use cases, we define core concepts, formulate principles, and introduce a reference architecture for computer ecosystems that support complex workloads and diverse resources and backend services across the computing continuum. We synthesize a framework of resource management and scheduling (RM&S) techniques, which we, alongside the community, should explore systematically in the next decade. We show early results obtained through controlled real-world experiments, long-term observation, and what-if analysis of short- and long-term scenarios, with a focus on reproducibility. We introduce a new operational instrument for RM&S: digital twins that can mimic operations and can conduct and explain meaningful counterfactual experiments. On the data side, we introduce the computing continuum Memex, which records long-term operational signals and makes them available efficiently and meaningfully.
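As a toy illustration of how a digital twin could support what-if analysis for RM&S, the sketch below replays a recorded workload trace through a simple first-come-first-served scheduling model and contrasts a factual resource configuration with a counterfactual one. All names and numbers here (`Task`, `replay`, `what_if`, the worker counts, the toy trace) are hypothetical illustrations, and the model is far simpler than the operational digital twins discussed in the talk.

```python
"""Minimal sketch of a digital-twin-style what-if experiment for RM&S.

All names are illustrative and are not part of any published artifact.
"""
from dataclasses import dataclass
import heapq


@dataclass
class Task:
    arrival: float   # seconds since the start of the recorded trace
    service: float   # seconds of work the task needs on one worker


def replay(trace: list[Task], workers: int) -> float:
    """Replay a recorded workload on `workers` identical machines under
    first-come-first-served scheduling; return the mean response time."""
    free_at = [0.0] * workers          # time at which each worker becomes free
    heapq.heapify(free_at)
    total_response = 0.0
    for task in sorted(trace, key=lambda t: t.arrival):
        earliest_free = heapq.heappop(free_at)
        start = max(task.arrival, earliest_free)
        finish = start + task.service
        heapq.heappush(free_at, finish)
        total_response += finish - task.arrival
    return total_response / len(trace)


def what_if(trace: list[Task], baseline: int, counterfactual: int) -> None:
    """Contrast the factual deployment with a counterfactual capacity."""
    print(f"mean response with {baseline} workers: "
          f"{replay(trace, baseline):.2f} s")
    print(f"mean response with {counterfactual} workers: "
          f"{replay(trace, counterfactual):.2f} s")


if __name__ == "__main__":
    # A toy excerpt of recorded operational signals (hypothetical data).
    trace = [Task(arrival=i * 0.5, service=2.0) for i in range(100)]
    what_if(trace, baseline=2, counterfactual=4)
```

In a full digital twin, the replay model would be calibrated against long-term operational signals such as those a computing continuum Memex would record, rather than against a synthetic trace as in this sketch.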
When pursued further, these conceptual and practical advances enable holistic improvements to applications, services, and entire computer ecosystems. They give us hope of curing the many ills that abound in today's ecosystems. So this is a call to the community spanning distributed systems, performance engineering, software engineering, data engineering, and more: there is much to discover and achieve!
This vision aligns with the IPN SIG on Future Computer Systems and Networking and its Manifesto [1] (but any mistakes remain the speaker's). Many of our examples come from real-world prototyping and experimentation, grand experiments in computer systems, and benchmarking and performance analysis work conducted with the Cloud group of SPEC RG [2]. Last but not least, the work presented here has been performed by the Massivizing Computer Systems team at the VU and our partners, whom the speaker gratefully acknowledges.
[1] Future Computer Systems and Networking Research in the Netherlands: A Manifesto, 2022. [Online]. Available: https://arxiv.org/pdf/2206.03259
[2] SPEC RG Cloud. [Online]. Available: https://research.spec.org/working-groups/rg-cloud/