We are Massivizing Computer Systems, that is, making computer systems scalable, reliable, performant, and more, while forming efficient, engineered ecosystems available to everyone. Our work improves large-scale infrastructure, datacenters, cloud computing, big data, and numerous popular applications in the Digital Economy.
Why? What? and How? Read further.
Our society depends today on computer systems. The Internet, banking, healthcare, energy management, the ICT operations of SMEs and large organizations, media and web services, online games and social media, and even governance are elements of a Digital Economy in which massive computer systems, such as modern datacenters, act as factories producing services consumed at massive scale. Post-Moore computer systems, in which both hardware and software have shifted rapidly from (internal) amplification towards (external) diversification, proliferation, and commoditization, have already resulted in a diversity of radically new designs serving new ways to create and consume digital artifacts. It is this change that drives the need for a new research discipline at the core of modern computer science.
Overall, we uncover the laws of operation of large-scale computer systems and ecosystems, create theories about that operation, and then use these to design, develop, deploy, evaluate, and benchmark useful distributed computing systems and ecosystems. We focus on the topics listed further below.
Overall, we employ techniques that combine best practices from the fields of large-scale distributed and parallel computing systems, performance engineering at scale, software engineering of large-scale software and with remote teams, complex systems, simulation studies, and real-world experimentation. We have spent over a decade perfecting these techniques, and we are still learning every day.
Overall, we cover many of the keywords you know, and/or that appear explicitly in EU H2020 and other funding calls. Here is a non-exhaustive list:
Cloud computing: IaaS, PaaS, and FaaS cloud computing; datacenter, fog, and edge computing; federated datacenters and clusters; hybrid cloud; multi-cloud and multi-cluster; reserved, spot, and on-demand services; serverless operation; heterogeneous resources; resource management and scheduling; scheduling policies; provisioning; allocation; portfolio scheduling (see the portfolio-scheduling sketch after this list); dynamic scheduling; horizontal and vertical scaling; auto-scaling; auto-tiering; elastic cloud management; software-defined operation; the consistency-availability-partition-tolerance trade-off space; workflow and dataflow; bags-of-tasks; tiered services; stream processing.
Big Data: all aspects of cloud computing and performance engineering, plus the convergence of HPC, cloud computing, and big data workloads; graph processing; MapReduce/Hadoop and dataflow/Spark (see the word-count sketch after this list); data processing; data partitioning and replication.
Performance engineering: auto-tuning; performance optimization; workload characterization and modeling; performance metrics; performance; scalability; elasticity; availability; reliability; dependability; energy-proportionality and energy-efficiency; measurements; observation; longitudinal studies; monitoring; performance studies; benchmarking (see the benchmarking sketch after this list); the platform-algorithm-workload trade-off space; performance analysis; bottleneck identification; anomaly detection; performance-issue analysis; software and hardware performance counters.
Large-scale test-labs and living labs: demonstrators; system deployment and configuration; experimental evaluation; DevOps in practice; build-and-test workloads; background workload; methodology; reproducibility.
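
To make one of the cloud-computing keywords concrete, here is a minimal portfolio-scheduling sketch in Python. It is an illustration, not our production code: the three policies, the toy single-machine simulator, and the average-wait metric are assumptions chosen for brevity. The core idea is the real one: keep a portfolio of scheduling policies, score each on the current workload, and apply the winner.

```python
import random

# Toy portfolio scheduler (illustrative names and metric, not production code):
# keep several scheduling policies, evaluate each on the queued tasks with a
# cheap simulation, and apply the policy that currently scores best.

def fifo(tasks):
    return list(tasks)

def shortest_job_first(tasks):
    return sorted(tasks, key=lambda t: t["runtime"])

def longest_job_first(tasks):
    return sorted(tasks, key=lambda t: -t["runtime"])

def simulate_avg_wait(order):
    """Average waiting time on a single machine, a toy evaluation metric."""
    clock, waits = 0.0, []
    for task in order:
        waits.append(clock)
        clock += task["runtime"]
    return sum(waits) / len(waits)

def portfolio_schedule(tasks, policies):
    """Pick the policy whose simulated schedule minimizes average wait."""
    best = min(policies, key=lambda p: simulate_avg_wait(p(tasks)))
    return best.__name__, best(tasks)

tasks = [{"id": i, "runtime": random.uniform(1, 10)} for i in range(20)]
policy, order = portfolio_schedule(tasks, [fifo, shortest_job_first, longest_job_first])
print(policy, [t["id"] for t in order[:5]])
```

The same structure generalizes: the portfolio can hold arbitrarily sophisticated policies, and the selection step can rerun periodically as the workload changes.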
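Similarly, for the big-data keywords, a toy word count shows the map-shuffle-reduce pattern behind MapReduce/Hadoop and, in generalized dataflow form, Spark. Plain Python stands in for a framework here; the function names are ours, not any framework's API.

```python
from collections import defaultdict

# Toy word count illustrating the map -> shuffle -> reduce pattern
# (no framework involved; plain Python stands in for Hadoop/Spark).

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the cloud runs big data", "big data needs the cloud"]
print(reduce_phase(shuffle(map_phase(docs))))
```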
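Finally, for the performance-engineering and methodology keywords, a minimal benchmarking harness: warm-up runs to discard cold-start effects, repeated measurements, and robust summary statistics instead of a single timing. The workload function and the warm-up/repeat counts are placeholders chosen for the example.

```python
import statistics
import time

# Minimal benchmarking harness, illustrating measurement methodology:
# warm up, repeat many times, and report robust summary statistics
# rather than trusting a single timing.

def workload():
    sum(i * i for i in range(100_000))  # placeholder workload

def benchmark(fn, warmup=3, repeats=30):
    for _ in range(warmup):  # discard cold-start effects
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "stdev_s": statistics.stdev(samples),
        "min_s": min(samples),
    }

print(benchmark(workload))
```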
Indeed, there are many general topics and specific research keywords. We see Massivizing Computer Systems as the natural evolution in scope of the former grid computing and, later, cloud computing communities. As part of those communities, we in large-scale distributed computing have not engaged sufficiently with the science and practice of software and performance engineering (and vice-versa). Everybody stood to lose. We do not want to repeat the same mistakes, and ours is a call to research for the combined communities.
@Large is shorthand for how we see systems operating in real situations: put together, they simply (mis-)behave, to the point where individual systems fail to preserve the properties and guarantees they deliver under laboratory conditions, and entire ecosystems under-perform or become undependable.