Research Topics

TL;DR: We are Massivizing Computer Systems.

We are Massivizing Computer Systems: making computer systems scalable, reliable, and performant, among other qualities, yet also forming efficient, engineered ecosystems available to everyone. Our work improves large-scale infrastructure, datacenters, cloud computing, big data, and numerous popular applications in the Digital Economy.

Why? What? And how? Read on.

Why Massivizing Computer Systems?

Our society today depends on computer systems. The Internet, banking, healthcare, energy management, the ICT operations of SMEs and large organizations, media and web services, online games and social media, and even governance are all elements of a Digital Economy in which massive computer systems, such as modern datacenters, act as factories producing services consumed at massive scale. Post-Moore computer systems, in which both hardware and software have shifted rapidly from (internal) amplification towards (external) diversification, proliferation, and commoditization, have already resulted in a diversity of radically new designs serving new ways to create and consume digital artifacts. It is this change that drives the need for a new research discipline at the core of modern computer science.

What is the research in Massivizing Computer Systems?

Which fields of computer science do we address?

Overall, we uncover the laws governing the operation of large-scale computer systems and ecosystems, create theories about that operation, and then use these to design, develop, deploy, evaluate, and benchmark useful distributed computing systems and ecosystems. We focus on:

  1. Core research in large-scale distributed computer systems, including the design of system architectures and mechanisms, resource management and scheduling approaches and algorithms, and mechanisms for resilience and reliability in computer systems (see the scheduling sketch after this list).
  2. System performance engineering, including performance analysis, performance tuning, workload characterization and modeling, and benchmarking.
  3. Observation and measurement of (large-scale) systems and ecosystems, in the lab and at large.
  4. Comparison of multiple, repeated real-world and/or simulation experiments.
  5. Understanding the socio-legal impact on how large-scale systems are designed and operate. (For example: can we leverage the pro-social behavior of crowds, such as the formation of amateur and specialist communities, to better plan capacity, provision resources dynamically, and anticipate the need for service?)
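To make point 1 concrete, here is a minimal portfolio-scheduling sketch in Python. A portfolio scheduler keeps several candidate policies and, per batch of tasks, picks the one an internal estimator predicts will perform best. Everything below, including the task model, the two policies, and the single-machine cost estimator, is a hypothetical simplification for illustration, not code from our systems.

```python
from dataclasses import dataclass

@dataclass
class Task:
    runtime_estimate: float  # seconds; assumed known (a strong simplification)

def fcfs(tasks):
    """First-Come First-Served: keep arrival order."""
    return list(tasks)

def sjf(tasks):
    """Shortest Job First: order by estimated runtime."""
    return sorted(tasks, key=lambda t: t.runtime_estimate)

def mean_wait(order):
    """Predicted mean waiting time if the tasks run back-to-back on one machine."""
    wait = elapsed = 0.0
    for task in order:
        wait += elapsed
        elapsed += task.runtime_estimate
    return wait / len(order) if order else 0.0

def portfolio_schedule(tasks, policies=(fcfs, sjf)):
    """Pick the policy whose predicted mean wait is lowest for this batch."""
    return min((policy(tasks) for policy in policies), key=mean_wait)

batch = [Task(5.0), Task(1.0), Task(3.0)]
print([t.runtime_estimate for t in portfolio_schedule(batch)])  # SJF wins: [1.0, 3.0, 5.0]
```

Real portfolio schedulers select among far richer policies, use simulation or learned models as estimators, and re-evaluate the choice periodically; the selection loop, however, looks much like the one above.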

How are we Massivizing Computer Systems?

What kind of (computer) science is this? What are its core products and processes?

Overall, we employ techniques that combine best practices from the fields of large-scale distributed and parallel computing systems, performance engineering at scale, software engineering of large-scale software and with remote teams, complex systems, simulation studies, and real-world experimentation. We have spent over a decade perfecting these techniques, and we are still learning every day.

  1. We operate in Pasteur's Quadrant: we seek fundamental knowledge, but are inspired by real-world applications and aim at societal impact through building real systems. Examples of direct fields of research include cloud computing and big data, where our research takes fundamental approaches with many useful applications: for industry and society at large, e-Science, business-critical workloads, and graph processing at scale; for end-consumers, online gaming and large-scale higher education.
  2. We focus on building and measuring real systems, and publish our results as free open-source software and free open-access data. When simulation brings advantages, for example for long-term or large-scale experiments, we also use simulation, typically in tandem with real-world experiments. We use mathematical modeling and analysis too, when they add value, for example in steady-state situations or when we seek fundamental properties of simplified settings (see the queueing sketch after this list).
  3. We go the extra mile to check our code, data, experiments, and articles. We ask others to read our pre-prints, and engage with the community long before we publish. We publish our scientific articles as much as possible in competitive, well-managed conferences and journals, and publish our software and data often. We ask new team members to re-use elements from their predecessors' research while the predecessors are still in the team, and make sure discrepancies are addressed as soon as they appear. We are pioneering methods to store and share scientific knowledge in our field.
  4. We invest much time in developing the scientific and engineering skills of our team members, because there can be no science or top-level engineering results without skills that already exceed what is currently taught in most academic institutions, even at Bachelor's and Master's level.
  5. We are at the start of a big-science process, in which our computer science team already interacts with many other computer science teams with similar or different backgrounds, and aims to develop inter-disciplinary links, especially around the application areas listed above. We already conduct part of this inter-disciplinary research within the SPEC and LDBC international organizations.
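As a small illustration of the steady-state modeling mentioned in point 2, the sketch below evaluates the textbook M/M/1 queue, a crude but classic model of a single service node; the workload numbers are hypothetical.

```python
def mm1_steady_state(arrival_rate: float, service_rate: float):
    """Standard M/M/1 steady-state formulas, valid only when arrival_rate < service_rate:
    utilization rho = lambda/mu; mean number in system N = rho/(1-rho);
    mean response time T = 1/(mu-lambda), consistent with Little's law N = lambda*T."""
    if arrival_rate >= service_rate:
        raise ValueError("Queue is unstable: no steady state exists.")
    rho = arrival_rate / service_rate
    return rho, rho / (1.0 - rho), 1.0 / (service_rate - arrival_rate)

# Hypothetical load: 80 requests/s arriving at a node that serves 100 requests/s.
rho, n, t = mm1_steady_state(80.0, 100.0)
print(f"utilization={rho:.2f}, mean in system={n:.1f}, mean response time={t*1000:.0f} ms")
# -> utilization=0.80, mean in system=4.0, mean response time=50 ms
```

Such closed-form models are only trustworthy for simplified settings, which is exactly why we pair them with simulation and real-world experiments.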

Are there specific research keywords related to Massivizing Computer Systems?

Overall, we cover many of the keywords you know and/or that appear explicitly in EU H2020 and other funding calls. Here is a non-exhaustive list:

Cloud computing: IaaS, PaaS, and FaaS cloud computing; datacenter, fog, and edge computing; federated datacenters and clusters; hybrid cloud; multi-cloud and multi-cluster; reserved, spot, and on-demand services; serverless operation; heterogeneous resources; resource management and scheduling; scheduling policies; provisioning; allocation; portfolio scheduling; dynamic scheduling; horizontal and vertical scaling; auto-scaling; auto-tiering; elastic cloud management; software-defined operation; the consistency-availability-partition-tolerance trade-off space; workflow and dataflow; bags-of-tasks; tiered services; stream processing.
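To make one of these keywords concrete, here is a minimal sketch of threshold-based horizontal auto-scaling; the thresholds, limits, and function names are hypothetical, and production autoscalers add cooldowns, hysteresis, and predictive policies.

```python
def autoscale(current_instances: int, avg_cpu: float,
              min_instances: int = 1, max_instances: int = 20,
              scale_out_at: float = 0.70, scale_in_at: float = 0.30) -> int:
    """One tick of a threshold-based horizontal scaling control loop."""
    if avg_cpu > scale_out_at and current_instances < max_instances:
        return current_instances + 1  # overloaded: add an instance (scale out)
    if avg_cpu < scale_in_at and current_instances > min_instances:
        return current_instances - 1  # underloaded: remove an instance (scale in)
    return current_instances          # within the target band: do nothing

print(autoscale(4, 0.85))  # -> 5
print(autoscale(4, 0.10))  # -> 3
```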

Big Data: all aspects of cloud computing and performance engineering, plus the convergence of HPC, cloud computing, and big data workloads; graph processing; MapReduce/Hadoop and dataflow/Spark; data processing; data partitioning and replication.
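For readers new to the MapReduce keyword, here is the programming model in plain Python, as a toy sketch only; Hadoop and Spark distribute and fault-protect these phases across a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data systems", "big systems at large"]
print(reduce_phase(shuffle_phase(map_phase(docs))))
# -> {'big': 2, 'data': 1, 'systems': 2, 'at': 1, 'large': 1}
```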

Performance engineering: auto-tuning; performance optimization; workload characterization and modeling; performance metrics; scalability; elasticity; availability; reliability; dependability; energy-proportionality and energy-efficiency; measurements; observation; longitudinal studies; monitoring; performance studies; benchmarking; the platform-algorithm-workload trade-off space; performance analysis; bottleneck identification; anomaly detection; performance-issue analysis; software and hardware performance counters.
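As a tiny example of the metrics vocabulary above, the sketch below summarizes a batch of measured latencies; the sample data are hypothetical, and real studies use far larger samples and report variability.

```python
import statistics

def percentile(sorted_samples, p):
    """Nearest-rank percentile of pre-sorted samples, for p in (0, 100]."""
    rank = max(1, round(p / 100 * len(sorted_samples)))
    return sorted_samples[rank - 1]

latencies_ms = sorted([12, 15, 11, 90, 14, 13, 250, 16, 12, 15])  # hypothetical
print(f"mean = {statistics.mean(latencies_ms):.1f} ms")
print(f"median = {percentile(latencies_ms, 50)} ms")
print(f"p95 = {percentile(latencies_ms, 95)} ms")  # the tail often dominates user experience
```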

Large-scale test-labs and Living labs: demonstrators; system deployment and configuration; experimental evaluation; DevOps in practice; build-and-test workloads; background workloads; methodology; reproducibility.

Why this scope of research? Does Massivizing Computer Systems respond to a practical need?

Indeed, there are many general topics and specific research keywords. We see Massivizing Computer Systems as the natural evolution in scope of the former grid computing and, later, cloud computing communities. As part of those communities, we, the large-scale distributed computing community, have not engaged sufficiently with the science and practice of software and performance engineering (and vice versa). Everybody stood to lose. We do not want to repeat the same mistakes, and ours is a call to research for the combined communities.

What is this @Large and how does it relate to Massivizing Computer Systems?

@Large is shorthand for how we see systems operating in real situations: when put together, they simply (mis-)behave, to the point where individual systems fail to preserve the properties and guarantees they deliver under laboratory conditions, and entire ecosystems under-perform or become undependable.