This course focuses on distributed computing systems. Debugging and tuning existing systems, and designing, implementing, and analyzing new ones, remain vital and challenging for both industry and academia.
Since the mid-1990s, computing has been undergoing a revolution in which collections of independent computers appear to users as a single, albeit distributed, computing system. Motivated by the advent of the Internet, the growing computational capacity of consumer computers, the commoditization of server-grade machines, energy constraints, and other factors, the distributed computing paradigm has permeated all fields that use computers. Current distributed computing applications range from social networks to banking, from peer-to-peer file sharing to high-performance computing used in research, and from massively multiplayer online games to business-critical workloads. Important advances have helped to fuse heterogeneous resources into truly global distributed systems, for example in scientific computing, where distributed computation uses Big Data and distributed sensors to produce meaningful progress for humankind. In this course, we will focus on a number of these modern examples of distributed computing systems.
Although many distributed systems already exist, the list of conceptual and technical challenges they pose is long. Depending on the requirements, even seemingly trivial communication between nodes of a distributed system can be challenging. The failure of a single node, or sometimes even a performance hiccup, can bring an entire system down; with it, other nodes or entire other systems may also crash, experiencing correlated and catastrophic failures. Data consistency and node coordination remain important challenges, made worse by the large scale of real-world deployments. Poor resource management and naive scheduling can lead to orders-of-magnitude higher operational costs and to energy consumption that we simply cannot spare. It is not uncommon for a modern distributed system to quickly rise and then fall in popularity, as exemplified in 2016 by Pokemon Go. In this course, we will present real-world situations in which modern distributed systems have behaved poorly.
Addressing these challenges requires unique approaches and concepts. Separating concerns and breaking problems down into smaller cases often lead to limited success, because many properties of distributed systems can only be achieved end-to-end. Can anyone imagine a perfectly reliable production pipeline if even one of its key stages can fail? The capability gained by adding resources is often offset by the distributed nature of the system. Can anyone ignore the physical limitations of communication around the globe? In this course, we will focus on the unique approaches and principles of distributed systems, from specific architectures and communication protocols to specific concepts in resource management and scheduling, data consistency, fault tolerance, and performance.
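The two rhetorical questions above can be made concrete with a little arithmetic. The sketch below is only an illustration under stated assumptions, not part of the course material: it assumes that pipeline stages fail independently, and that signals travel through optical fiber at roughly 200,000 km/s (about two thirds of the speed of light in vacuum). The function names and the example values are hypothetical.

```python
import math


def pipeline_reliability(stage_reliabilities):
    """Probability that every stage of a serial pipeline works.

    Assumption: stages fail independently, so the end-to-end
    reliability is the product of the per-stage reliabilities.
    """
    return math.prod(stage_reliabilities)


def min_round_trip_ms(distance_km, signal_speed_km_per_s=200_000.0):
    """Lower bound on the round-trip time over a one-way distance.

    Assumption: the default speed approximates light in optical fiber.
    Routing detours, queuing, and processing delays are ignored, so
    real round trips can only be slower than this bound.
    """
    return 2 * distance_km / signal_speed_km_per_s * 1000.0


if __name__ == "__main__":
    # Ten stages, each working 99% of the time (hypothetical values).
    stages = [0.99] * 10
    print(f"End-to-end reliability: {pipeline_reliability(stages):.4f}")

    # Roughly half of Earth's circumference, i.e. the far side of the globe.
    print(f"Minimum round trip: {min_round_trip_ms(20_000):.0f} ms")
```

Under these assumptions, ten stages that each look very reliable on their own already yield a noticeably less reliable pipeline, and no amount of added hardware brings a round trip to the other side of the globe below a couple of hundred milliseconds.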
Upon completing this course, you will be able to:
This course uses gamification. In our course design, the core of gamification is not that you receive points for every step you take in learning the course topics, but that it stimulates you to take each step without fear of failure and accommodates many types of student abilities and desires. Among other things, our approach to social gamification allows you to choose your own path of advancement.
The course textbook is:
Maarten van Steen and Andrew S. Tanenbaum, Distributed Systems, 3rd ed., online edition, 2017 (free for all). [Online]. Available: https://www.distributed-systems.net/index.php/books/distributed-systems-3rd-edition-2017/
Course and Lab guides are also provided via Blackboard. Introductions to the process of each activity, including assessment, appear in the first session of each activity.
X_400130