Invited Lecture, CS Dept., Technion, Israel
Big Data, loosely defined as the processing and preservation of data that may be too high-volume, volatile, or varied for regular data management systems, has become a topic of interest in a variety of domains, such as e-Government, online retail, and online gaming. In response, complex ecosystems have appeared, such as Hive-MapReduce-Hadoop-HDFS-etc., DryadLINQ-Dataflow-Dryad-CosmosFS-etc., and Meteor-PACT-Nephele-HDFS-etc.
In this presentation we focus on three main topics in Big Data processing and preservation: Graph Processing, Time-Based Data Analytics, and Data Preservation in Large-Scale Distributed Systems. We look at each from a systems perspective.
For graph processing, we propose and apply a method (benchmarking suite) for evaluating and comparing graph-processing platforms. Our method consists of a selection of representative metrics, datasets, and algorithmic classes; and of an empirical process that exhaustively evaluates each platform. We experiment with 6 platforms, with 5 classes of graph-processing algorithms, and with 7 graphs of up to 1.8 billion edges each. Besides per-platform insights, our results present the first comprehensive comparison of graph-processing platforms.
For time-based analytics, we focus on a use case derived from monitoring global systems. Over the past three years, we have designed and used BTWorld, which obtains information about the performance, scalability, and reliability of the global BitTorrent network via thousands of public trackers. We show how a multi-year, multi-TB data set covering information about tens of millions of files can pose significant challenges to today's Big Data processing tools.
For data preservation, we present an ambitious goal: logging and preserving the entire history of distributed systems. The implications of archiving large amounts of daily information for science and society have been clear since at least the 1940s, when Vannevar Bush defined the concept of the personal memex as an individual's device for storing and accessing all information and communication involving that individual. The benefits include learning about and eradicating human diseases, and giving people more time for creative and thought-related work by eliminating tasks that can be automated. Similarly, we posit that archiving large amounts of operational traces collected from the many distributed systems that currently underpin societies across the world would be beneficial for tuning today's systems and designing better systems in the future. What is the Distributed Systems Memex? How can such a Memex be designed and implemented? To address these and related questions, we discuss a bottom-up approach, in which we focus on the archival needs of specific application areas in which distributed systems are prominent and hope to gain sufficient understanding for the future. We therefore focus on the Grid Workloads Archive and on the Game, Peer-to-Peer, and Failure Trace Archives.
This work is based on recently published material [1-5], a digest of many publications from the past seven years, and several upcoming publications.
[1] Yong Guo, Alexandru Iosup: The Game Trace Archive. NetGames 2012: 1-6
[2] Derrick Kondo, Bahman Javadi, Alexandru Iosup, Dick H. J. Epema: The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems. CCGRID 2010: 398-407
[3] Boxun Zhang, Alexandru Iosup, Johan A. Pouwelse, Dick H. J. Epema: The Peer-to-Peer Trace Archive: Design and Comparative Trace Analysis. ACM CoNEXT Student Workshop 2010
[4] Maciej Wojciechowski, Mihai Capota, Johan A. Pouwelse, Alexandru Iosup: BTWorld: Towards Observing the Global BitTorrent File-Sharing Network. HPDC 2010: 581-588
[5] Alexandru Iosup, Hui Li, Mathieu Jan, Shanny Anoep, Catalin Dumitrescu, Lex Wolters, Dick H. J. Epema: The Grid Workloads Archive. Future Generation Comp. Syst. 24(7): 672-686 (2008)