Big Data Systems

Scott Brandt, Carlos Maltzahn


Class Meetings

Tuesdays and Thursdays (1/8/13 - 3/14/13), 10-11:45am PT, E2-506

Office Hours (contacts: Scott, Carlos)

Tuesdays: 1-3pm PT.

Schedule and Readings

Please retrieve the readings from the Web. Contact us if you have trouble finding them. The schedule and readings might change to accommodate guest speakers.

Guest speakers so far:

1/22:    Christopher Olston (Google)

2/12:    Dean Hildebrand (IBM Almaden)

2/21:    Pat Helland (Salesforce)

2/26:    Fatma Özcan (IBM Almaden)

2/28:    Vishy Vishwanathan (Purdue University)

3/5:      Bob Felderman (Google)

Tuesday, January 8, 2013


Thursday, January 10, 2013

Big Data Now

O’Reilly Media, “Big Data Now” (2nd edition)

Tuesday, January 15, 2013

Big Data Storage

K. V. Shvachko, “HDFS scalability: The limits to growth,” ;login:, vol. 35, no. 2, 2010.

S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn, “Ceph: A scalable, high-performance distributed file system,” in OSDI’06, (Seattle, WA), Nov. 2006.

[optional] D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel, “Finding a needle in Haystack: Facebook’s photo storage,” in OSDI’10, (Vancouver, Canada), October 4-6 2010.

Optional: CMPS 290H: Provenance

Thursday, January 17, 2013

Batch-oriented processing

J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in OSDI’04, (San Francisco, CA), Dec. 2004.

M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: Distributed data-parallel programs from sequential building blocks,” in EuroSys 2007, (Lisbon, Portugal), March 21-23 2007.

Tuesday, January 22, 2013

Guest Lecture: Christopher Olston (Google)

This talk gives an overview of my team's work on large-scale data processing at Yahoo! Research. The talk begins by introducing two data processing systems we helped develop: Pig, a dataflow programming environment and Hadoop-based runtime, and Nova, a workflow manager for Pig/Hadoop. The bulk of the talk focuses on debugging, and looks at what can be done before, during, and after execution of a data processing job:

* Pig's automatic EXAMPLE DATA GENERATOR is used before running a Pig job to get a feel for what it will do, enabling certain kinds of mistakes to be caught early and cheaply. The algorithm behind the example generator performs a combination of sampling and synthesis to balance several key factors of the example data it produces: realism, conciseness, and completeness.

* INSPECTOR GADGET is a framework for creating custom tools that monitor Pig job execution. We implemented a dozen user-requested tools, ranging from data integrity checks to crash-cause investigation to performance profiling, each in just a few hundred lines of code.

* IBIS is a system that collects metadata about what happened during data processing, for post-hoc analysis. The metadata is collected from multiple sub-systems (e.g., Nova, Pig, Hadoop) that deal with data and processing elements at different granularities (e.g., tables vs. records; relational operators vs. reduce task attempts) and offer disparate ways of querying it. IBIS integrates this metadata and presents a uniform and powerful query interface to users.
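As a rough illustration of the sampling-plus-synthesis idea behind the example generator (a toy sketch, not Olston et al.'s actual algorithm), consider a dataflow with a single filter: sample real input rows for realism and conciseness, then synthesize a record only if the sample leaves the filter with no passing example (completeness). All names and data below are hypothetical.

```python
import random

def example_data(rows, predicate, sample_size=3, synthesize=None):
    """Toy sketch of example-data generation for a one-operator
    (filter) dataflow: sample real rows, then synthesize a row if
    the filter would otherwise have no passing example."""
    sample = random.sample(rows, min(sample_size, len(rows)))
    if not any(predicate(r) for r in sample) and synthesize is not None:
        sample.append(synthesize())  # cover the filter's "true" branch
    return sample

# Hypothetical job: keep page views with latency over 1000 ms.
views = [{"url": "/a", "ms": 120}, {"url": "/b", "ms": 80},
         {"url": "/c", "ms": 95}, {"url": "/d", "ms": 40}]
examples = example_data(views, lambda r: r["ms"] > 1000,
                        synthesize=lambda: {"url": "/slow", "ms": 1500})
# No real row passes the filter, so one synthesized row is appended,
# guaranteeing the filter's output is non-empty in the example run.
```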

[optional reading] C. Olston, B. C. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig Latin: A not-so-foreign language for data processing,” in SIGMOD ’08, (Vancouver, Canada), June 9-12 2008.

Thursday, January 24, 2013

Batch-oriented processing

M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” in NSDI 2012, 2012.

C. Engle, A. Lupher, R. Xin, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica, “Shark: Fast data analysis using coarse-grained distributed memory,” in SIGMOD ’12, (Scottsdale, AZ), May 20-24 2012.

Tuesday, January 29, 2013

Batch-processing of Graphs and Stream Processing

G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: A system for large-scale graph processing,” in SIGMOD ’10, (Indianapolis, IN), June 6-11 2010.

L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, “S4: Distributed stream computing platform,” in KDCloud’11, 2011.

Thursday, January 31, 2013

Key/value Stores

F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A distributed storage system for structured data,” in OSDI’06, (Seattle, WA), November 2006.

G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, “Dynamo: Amazon’s highly available key-value store,” in Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP ’07), pp. 205–220, 2007.

Tuesday, February 5, 2013

Interactive transactions on distributed data

D. Peng and F. Dabek, “Large-scale incremental processing using distributed transactions and notifications,” in OSDI’10, (Vancouver, Canada), October 4-6 2010.

S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, “Dremel: Interactive analysis of web-scale datasets,” in Proc. of the 36th Int’l Conf on Very Large Data Bases, pp. 330–339, 2010.

Thursday, February 7, 2013

Performance Virtualization

J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Léon, Y. Li, A. Lloyd, and V. Yushprakh, “Megastore: Providing scalable, highly available storage for interactive services,” in CIDR ’11, (Asilomar, CA), January 9-12 2011.

J. C. Corbett et al., “Spanner: Google’s globally-distributed database,” in OSDI’12, 2012.

Tuesday, February 12, 2013

Guest Lecture: Dean Hildebrand (IBM)

F. Schmuck and R. Haskin, “GPFS: A shared-disk file system for large computing clusters,” in Proceedings of the 2002 Conference on File and Storage Technologies (FAST), pp. 231–244, USENIX, Jan. 2002.

D. Hildebrand and P. Honeyman, “Exporting storage systems in a scalable manner with pNFS,” in MSST ’05, (Monterey, CA), April 11-14 2005.

[Optional] R. Ananthanarayanan, M. Eshel, R. Haskin, M. Naik, F. Schmuck, and R. Tewari, “Panache: A parallel WAN cache for clustered filesystems,” ACM SIGOPS Operating Systems Review, vol. 42, pp. 48–53, January 2008.

Thursday, February 14, 2013

FAST 2013

Attendance at FAST 2013 (no class meeting)

Tuesday, February 19, 2013

Structured Data Processing

J. Buck, N. Watkins, J. LeFevre, K. Ioannidou, C. Maltzahn, N. Polyzotis, and S. A. Brandt, “SciHadoop: Array-based query processing in Hadoop,” in SC ’11, (Seattle, WA), November 2011.

T. Kaldewey, E. J. Shekita, and S. Tata, “Clydesdale: Structured data processing on MapReduce,” in EDBT 2012, (Berlin, Germany), March 26-30 2012.

Thursday, February 21, 2013

Guest Lecture: Pat Helland (Salesforce)

Talk 1: If You Have Too Much Data, then “Good Enough” Is Good Enough

Classic database systems offer crisp answers for a relatively small amount of data. These systems hold their data in one or a relatively small number of computers. With a tightly defined schema and transactional consistency, the results returned from queries are crisp and accurate.

New systems have humongous amounts of data content, change rates, and querying rates and take lots of computers to hold and process. The data quality and meaning are fuzzy. The schema, if present, is likely to vary across the data. The origin of the data may be suspect, and its staleness may vary.

Today's data systems coalesce data from many sources. The Internet, B2B, and enterprise application integration (EAI) combine data from different places. No computer is an island. This large amount of interconnectivity and interdependency has led to a relaxation of many database principles. Let's consider the ways in which today's answers differ from what we used to expect.

Talk 2: Immutability Changes Everything

For a number of decades, I've been saying "Computing Is Like Hubble's Universe, Everything Is Getting Farther Away from Everything Else". It used to be that everything you cared about ran on a single database and the transaction system presented you the abstraction of a singularity; your transaction happened at a single point in space (the database) and a single point in time (it looked like it was before or after all other transactions).

Now, we see a more complicated world. Across the Internet, we put up HTML documents or send SOAP calls and these are not in a transaction. Within a cluster, we typically write files in a file system and then read them later in a big map-reduce job that sucks up read-only files, crunches, and writes files as output. Even inside the emerging many-core systems, we see high-performance computation on shared memory but increasing cost to using semaphores. Indeed, it is clear that "Shared Memory Works Great as Long as You Don't Actually SHARE Memory".

There are emerging solutions which are based on immutable data. It seems we need to look back to our grandparents and how they managed distributed work in the days before telephones. We realize that "Accountants Don't Use Erasers" but rather accumulate immutable knowledge and then offer interpretations of their understanding based on the limited knowledge presented to them. This talk will explore a number of the ways in which our new distributed systems leverage write-once and read-many immutable data.
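The "accountants don't use erasers" idea can be sketched as an append-only log whose interpretations (here, an account balance) are derived by folding over immutable entries; the names and types below are illustrative, not from the talk.

```python
from typing import NamedTuple, Tuple

class Entry(NamedTuple):        # an immutable fact: never updated in place
    account: str
    amount: int                 # cents; negative for debits

def post(ledger: Tuple[Entry, ...], entry: Entry) -> Tuple[Entry, ...]:
    """Append-only 'write': returns a new ledger, leaving the old one intact."""
    return ledger + (entry,)

def balance(ledger: Tuple[Entry, ...], account: str) -> int:
    """An interpretation derived from the accumulated immutable facts."""
    return sum(e.amount for e in ledger if e.account == account)

# A correction is a new compensating entry, not an erasure of history.
ledger: Tuple[Entry, ...] = ()
ledger = post(ledger, Entry("alice", 1000))
ledger = post(ledger, Entry("alice", -250))
ledger = post(ledger, Entry("alice", 250))   # reverses the mistaken debit
```

Every intermediate ledger value remains valid and readable, which is exactly the property that makes write-once, read-many data easy to share across a cluster.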

Tuesday, February 26, 2013

Guest Lecture: Fatma Özcan (IBM Almaden): Improving Query Processing on Hadoop


Hadoop has become an attractive platform for large-scale data analytics. An increasingly important analytics scenario for Hadoop involves multiple (often ad hoc) grouping and aggregation queries with selection predicates over a slowly changing dataset. These queries are typically expressed via high-level query languages such as Jaql, Pig, and Hive, and are used either directly for business-intelligence applications or to prepare the data for statistical model building and machine learning. Despite Hadoop’s popularity, it still suffers from major performance bottlenecks. In this seminar, I will talk about some techniques, borrowed from parallel databases, to speed up query processing on Hadoop.

The first of these techniques addresses the lack of ability to colocate related data on the same set of nodes in an HDFS cluster. To overcome this bottleneck, I will describe CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. Colocation can be used to improve the efficiency of many operations, including indexing, grouping, aggregation, columnar storage, joins, and sessionization.

Next, I will present the Eagle-Eyed-Elephant (E3) framework for boosting the efficiency of query processing in Hadoop by avoiding accesses of data splits that are irrelevant to the query at hand. Using novel techniques involving inverted indexes over splits, domain segmentation, materialized views, and adaptive caching, E3 avoids accessing irrelevant splits even in the face of evolving workloads and data.
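A minimal sketch of the split-elimination idea, with simple per-split min/max statistics standing in for E3's inverted indexes and domain segmentation (the names and data are illustrative, not E3's actual implementation):

```python
from typing import List, Tuple

def build_stats(splits: List[List[int]]) -> List[Tuple[int, int]]:
    """Compute (min, max) statistics for one column of each split,
    done once over a slowly changing dataset."""
    return [(min(s), max(s)) for s in splits]

def relevant_splits(stats: List[Tuple[int, int]],
                    lo: int, hi: int) -> List[int]:
    """Return indices of splits whose [min, max] range can contain a
    value in [lo, hi]; every other split is skipped entirely."""
    return [i for i, (smin, smax) in enumerate(stats)
            if smax >= lo and smin <= hi]

splits = [[1, 5, 9], [12, 14, 19], [25, 31, 40]]
stats = build_stats(splits)
hits = relevant_splits(stats, 13, 20)   # only the middle split qualifies
```

For a selective predicate this lets a job read one split instead of three; E3 layers richer indexes and adaptive caching on the same principle.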

Thursday, February 28, 2013

Guest Lecture: Vishy Vishwanathan (Purdue): Challenges in Scaling Machine Learning to Big Data

Abstract: We will start with a very innocuous-sounding question: How do you estimate a multinomial distribution given some data? We will then show that this fundamental question underlies many applications of machine learning. We will survey some learning algorithms and challenges in scaling them to massive datasets. The last part of the class will be an interactive thought session on how we can bring together ideas from systems and machine learning to attack this problem.
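The opening question has a textbook answer: the maximum-likelihood estimate of a multinomial is just normalized counts, and the difficulty at scale is that the count table is enormous and heavy-tailed. A small sketch (add-alpha smoothing over the observed support is included as an illustration, not as the lecture's method):

```python
from collections import Counter

def multinomial_mle(observations, alpha=0.0):
    """Estimate P(x) for a multinomial from observed draws.
    alpha=0 gives the plain MLE count/total; alpha>0 applies
    add-alpha smoothing over the observed outcomes."""
    counts = Counter(observations)
    total = sum(counts.values()) + alpha * len(counts)
    return {x: (c + alpha) / total for x, c in counts.items()}

words = ["the", "the", "data", "big", "the", "data"]
p = multinomial_mle(words)
# MLE: P(the) = 3/6, P(data) = 2/6, P(big) = 1/6
```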

Suggested reading:

S. Goldwater, T. L. Griffiths, and M. Johnson, “Producing power-law distributions and damping word frequencies with two-stage language models,” Journal of Machine Learning Research, vol. 12, pp. 2335-2382, 2011.

Tuesday, March 5, 2013

Guest Lecture: Bob Felderman (Google): Warehouse Scale Computing and the Perils and Pitfalls of Optimization

Abstract: Lots of the action in computer system design has migrated to the ends of the scale spectrum. Today, warehouse scale computing and mobile computing get a lot of attention. We'll present WSC from the Google perspective, then dive in to get a better idea on what it takes to create optimal systems.

D. Abts and B. Felderman, “A guided tour of data-center networking,” CACM, vol. 55, pp. 44–51, June 2012.

S. Han, S. Marshall, B.-G. Chun, and S. Ratnasamy, “MegaPipe: A new programming interface for scalable network I/O,” in OSDI’12, 2012.

Thursday, March 7, 2013

Structured Data Processing

N. Watkins, C. Maltzahn, S. A. Brandt, and A. Manzanares, “DataMods: Programmable file system services,” in PDSW’12, (Salt Lake City, UT), November 12 2012.

J. He, J. Bent, A. Torres, G. Grider, G. Gibson, C. Maltzahn, and X.-H. Sun, “Discovering structure in unstructured I/O,” in PDSW’12, (Salt Lake City, UT), November 12 2012.

Tuesday, March 12, 2013

Project presentations

Thursday, March 14, 2013

Project presentations

Friday, March 22, 2013

Firm deadline for class project reports