@Large Research

This course focuses on distributed computing systems. In general, debugging and tuning existing systems, and designing, implementing, and analyzing new distributed computing systems, remain vital and challenging tasks for both industry and academia.

Since the mid-1990s, computing has been undergoing a revolution in which collections of independent computers appear to users as a single, albeit distributed, computing system. Motivated by the advent of the Internet, the increase in the computation capacity of consumer computers, the commoditization of server-grade machines, energy constraints, and other factors, the distributed computing paradigm has permeated all fields that use computers. Current distributed computing applications range from social networks to banking, from peer-to-peer file-sharing to high-performance computing used in research, and from massively multiplayer online games to business-critical workloads. Important advances have helped to fuse heterogeneous resources into truly global distributed systems, for example in scientific computing, where distributed computation uses Big Data and distributed sensors to produce meaningful progress for humankind. In this course, we will focus on a number of these modern examples of distributed computing systems.

Although many distributed systems already exist, the list of conceptual and technical challenges they pose is long. Depending on requirements, even trivial communication between nodes of the distributed system can be challenging. The failure of a single node, or sometimes even a performance hiccup, can bring an entire system down; with it, other nodes or entire other systems may also crash, experiencing correlated and catastrophic failures. Data consistency and node coordination remain important challenges, made worse by the large scale of real-world deployments. Poor resource management and naive scheduling can lead to orders-of-magnitude higher operational costs and to energy consumption that we simply cannot spare. It is not uncommon for a modern distributed system to quickly rise and then fall in popularity, as exemplified by Pokémon Go in 2016. In this course, we will present real-world situations in which modern distributed systems have behaved poorly.

Addressing these challenges requires unique approaches and concepts. Separating concerns and breaking down problems into smaller cases often lead to limited success, because many properties of distributed systems can only be achieved end-to-end. Can anyone imagine a perfectly reliable production pipeline, if even one of its key stages can suffer failures? The capability gained by adding resources is often offset by the distributed nature of the system. Can anyone ignore the physical limitations of communication around the globe? In this course, we will focus on the unique approaches and principles of distributed systems, from specific architectures and communication protocols to specific concepts in resource management and scheduling, data consistency, fault tolerance, and performance.
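
The two rhetorical questions above can be made concrete with a little arithmetic. The sketch below is only an illustration under stated assumptions, not part of the course material: it assumes stage failures are independent, that each stage works 99% of the time, that signals travel through optical fiber at roughly 200,000 km/s, and that the two endpoints are about 20,000 km apart (roughly half of the Earth's circumference). The function names are hypothetical.

```python
import math


def pipeline_availability(stage_availabilities):
    """Probability that every stage of a pipeline works at the same time.

    Assumption: stages fail independently, so the probabilities multiply.
    """
    return math.prod(stage_availabilities)


def min_round_trip_ms(distance_km, signal_speed_km_per_s=200_000.0):
    """Lower bound on the round-trip time over a one-way distance.

    Assumption: the default speed approximates light in optical fiber;
    real networks add routing, queueing, and processing delays on top.
    """
    return 2.0 * distance_km / signal_speed_km_per_s * 1000.0


# Five stages that each work 99% of the time (illustrative numbers).
five_stage = pipeline_availability([0.99] * 5)

# Two endpoints roughly half the globe apart (illustrative distance).
half_globe_rtt = min_round_trip_ms(20_000)

print(f"End-to-end availability of five stages: {five_stage:.4f}")
print(f"Minimum round-trip time: {half_globe_rtt:.0f} ms")
```

Under these assumptions, the pipeline as a whole is less reliable than any single stage, and no amount of added hardware brings the round trip below the physical bound. This is why reliability and performance have to be reasoned about end-to-end.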

Syllabus

Study Goals

Upon completing this course, you will be able to:

Explain the basic concepts, objectives, and functions of distributed computing systems, e.g., communication, resource management and scheduling, data consistency, fault tolerance, and performance.
Compare the basic characteristics of distributed computing with those of other computing paradigms (i.e., centralized, parallel).
Identify proficiently the different flavors of modern distributed systems (i.e., peer-to-peer systems, cluster computing, grid computing, cloud computing, datacenters, distributed HPC, SDN, Big Data systems, IoT systems).
Analyze proficiently the trade-offs inherent in the design of modern distributed systems.
Design your portfolio distributed system, including many basic and some complex operations of modern distributed systems.
Implement and deploy your portfolio distributed system.
Analyze your portfolio distributed system.

Education Method and Contact Hours

Total: 6 ECs, comprising lectures, design, seminar, self-study, and practical work.
Lectures: 2h/week.
Design: 2h/week, systems and ecosystems focus.
Seminar: 2h/week, systems and ecosystems focus.
Self-study: scientific material, book, tech blogs.
Practical work (Lab): large assignment, focusing on analysis, design, development, deployment, and experimental evaluation. The assignment concludes with a technical report.

Gamification

This course uses gamification. In our course design, the core of gamification is not that you receive points for every step you take in learning the topics of the course, but that it encourages you to take each step without fear of failure and accommodates many types of student abilities and interests. Among other things, our approach to social gamification allows you to choose your own path of advancement.

Textbook

The course uses the following textbook:

Maarten van Steen and Andrew S. Tanenbaum, Distributed Systems, 3rd ed., online edition, 2017 (freely available). [Online] Available: https://www.distributed-systems.net/index.php/books/distributed-systems-3rd-edition-2017/

Additional Study Materials

Course and Lab guides are also provided via Blackboard. An introduction to the process of each activity, including its assessment, is given in the first session of that activity.

Assessment

(Mandatory) advanced lab assignments. (Turn in to TAs)
(Mandatory) written final exam, with multiple-choice and open-ended questions.
(Optional) in-class design exercises, oral and written.
(Optional) in-class seminar discussion, and written questions and commentary (sent to TAs).
(Optional) in-class discussion, during lectures.
(Optional) bonus and design-your-own lab assignments. (Turn in to SAs)