Tuesday, 1/22/13 Christpher Olston, Staff Research Scientist, Google
(hosted by Phokion Kolaitis and the Database Group)

Bio: Christopher Olston is a staff research scientist at Google, working on infrastructure for "big AI". He previously worked at Yahoo! (principal research scientist) and Carnegie Mellon (assistant professor). He holds computer science degrees from Stanford (2003 Ph.D., M.S.; funded by NSF and Stanford fellowships) and UC Berkeley (B.S. with highest honors).

At Yahoo, Olston co-created Apache Pig, which is used for large-scale data processing by LinkedIn, Netflix, Salesforce, Twitter, Yahoo and others, and is offered by Amazon as a cloud service. Olston gave the 2011 Symposium on Cloud Computing keynote, and won the 2009 SIGMOD best paper award. During his flirtation with academia, Olston taught  undergrad and grad courses at Berkeley, Carnegie Mellon and Stanford, and signed several Ph.D. dissertations.

Abstract: This talk gives an overview of my team's work on large-scale data processing at Yahoo! Research. The talk begins by introducing two data processing systems we helped develop: PIG, a dataflow programming environment and Hadoop-based runtime, and NOVA, a workflow manager for Pig/Hadoop. The bulk of the talk focuses on debugging, and looks at what can be done before, during and after execution of a data processing operation:

  1. Pig's automatic EXAMPLE DATA GENERATOR is used before running a Pig job to get a feel for what it will do, enabling certain kinds of mistakes to be caught early and cheaply. The algorithm behind the example generator performs a combination of sampling and synthesis to  balance several key factors---realism, conciseness and completeness---of the example data it produces.

  1. INSPECTOR GADGET is a framework for creating custom tools that monitor Pig job execution. We implemented a dozen user-requested tools, ranging from data integrity checks to crash cause investigation to performance profiling, each in just a few hundred lines of code.

  1. IBIS is a system that collects metadata about what happened during data processing, for post-hoc analysis. The metadata is collected from multiple sub-systems (e.g. Nova, Pig, Hadoop) that deal with data and processing elements at different granularities (e.g. tables vs. records;  relational operators vs. reduce task attempts) and offer disparate ways of querying it. IBIS integrates this metadata and presents a uniform and powerful query interface to users.

Tuesday, 2/12/13 Dean Hildebrand, Master Inventor, IBM Almaden

Bio: Dr. Hildebrand is a file systems researcher at the IBM Almaden Research Lab and a recognized expert in the field of distributed and parallel file systems. He has authored numerous scientific publications, created over 21 patents, and sat on the program committee of numerous conferences including SC, HPDC, MSST, and MASCOTS.  Dr. Hildebrand pioneered pNFS and the primary author on the first paper to demonstrate the feasibility of providing standard and scalable access to any parallel file system. Dean holds a Ph.D. in Computer Science from the University of Michigan. The title of his dissertation is,"Distributed Access to Parallel File Systems". He also has a B.Sc. (Honours) from the University of British Columbia.

Title: Moving Beyond Storage of the Undead

Abstract: Many of today's data centers store and access data using several different storage systems, each of which has a specialized purpose.  While a large variety of available storage management and optimization options exist, these systems have limited functionality.  Not unlike a zombie that only utilizes part of its brain, these systems is not living up to their full potential. 

As the amount of data grows, the tolerance for using undead data management techniques is shrinking fast.  All of this fits into the emerging focus on Software Defined Storage, which proposes among other things to make storage more intelligent.  Enterprise features such as parallel data access, high-availability, backup, disaster recovery, compression, encryption, tiering, etc, that are considered optional today will become mandatory for many Big Data infrastructures.

In this talk I will discuss some of the ways our research group at IBM Almaden is seeking to make GPFS smarter for managing Big Data.  This includes new features and enhancements to existing features to support cloud architectures, pNFS, analytics, efficient disaster recovery, and more.

Thursday, 2/21/13 Pat Helland, Software Architect, SalesForce.com

Bio: Pat Helland has been working in distributed systems, transaction processing, databases, and similar areas since 1978. For most of the 1980s, he was the chief architect of Tandem Computers' TMF (Transaction Monitoring Facility), which provided distributed transactions for the NonStop System. With the exception of a two-year stint at Amazon, Helland worked at Microsoft from 1994 to 2011 where he was the chief architect for Microsoft Transaction Server, SQL Service Broker, and a number of features within Cosmos, the distributed computation and storage system underlying Bing. Pat recently joined Salesforce.com and is leading a number of new efforts working with very large-scale data.


Talk 1: If You Have Too Much Data, then “Good Enough” Is Good Enough

Classic database systems offer crisp answers for a relatively small amount of data. These systems hold their data in one or a relatively  small number of computers. With a tightly defined schema and transactional consistency, the results returned from queries are crisp  and accurate.

New systems have humongous amounts of data content, change rates, and querying rates and take lots of computers to hold and process. The data quality and meaning are fuzzy. The schema, if present, is likely to vary across the data. The origin of the data may be suspect, and its staleness may vary.

Today's data systems coalesce data from many sources. The Internet, B2B, and enterprise application integration (EAI) combine data from different places. No computer is an island. This large amount of interconnectivity and interdependency has led to a relaxation of many database principles. Let's consider the ways in which today's answers differ from what we used to expect.

Talk 2: Immutability Changes Everything

For a number of decades, I've been saying "Computing Is Like Hubble's Universe, Everything Is Getting Farther Away from Everything Else". It used to be that everything you cared about ran on a single database and the transaction system presented you the abstraction of a singularity; your transaction happened at a single point in space (the database) and a single point in time (it looked like it was before or after all other transactions).

Now, we see a more complicated world. Across the Internet, we put up HTML documents or send SOAP calls and these are not in a transaction. Within a cluster, we typically write files in a file system and then read them later in a big map-reduce job that sucks up read-only files, crunches, and writes files as output. Even inside the emerging many-core systems, we see high-performance computation on shared memory but increasing cost to using semaphores. Indeed, it is clear that "Shared Memory Works Great as Long as You Don't Actually SHARE Memory".

There are emerging solutions which are based on immutable data. It seems we need to look back to our grandparents and how they managed distributed work in the days before telephones. We realize that "Accountants Don't Use Erasers" but rather accumulate immutable knowledge and then offer interpretations of their understanding based on the limited knowledge presented to them. This talk will explore a number of the ways in which our new distributed systems leverage write-once and read-many immutable data.

Tuesday, 2/26/13 Fatma Özcan, Research Staff Member, IBM Research - Almaden
(hosted by Phokion Kolaitis and the Database Group)

Bio: I am a research staff member at the IBM Almaden Research Center , working in the information management department. Currently, I am working on the integration/synergy between the data warehouse and Hadoop systems, exploring ways to leverage the best of both worlds. I am also working on XAP (eXtreme Analytics Platform) project whose goal is to explore and extend Hadoop as an enterprise platform for large-scale enterprise analytics. In particular, I am exploring ways for efficient structured data processing using Jaql. I received my PhD in computer science from the University of Maryland, College Park, working on the IMPACT project, supervised by Prof VS Subrahmanian . I got my BSc and Msc degrees in computer engineering from Middle East Technical University, Ankara, Turkey. Previously, I worked on DB2 pureXML. I was one of the main architects of the XQuery compiler in DB2 pureXML. I worked on both XQuery and SQL/XML and in query rewrite optimizations for XML query languages. My research interests include platforms and infra-structure for large-scale data analysis, XML query semantics and optimization, XML standards, integration of heterogeneous data sources, software agents and query languages.

Title: Improving Query Processing on Hadoop

Abstract: Hadoop has become an attractive platform for large-scale data analytics. An increasingly important analytics scenario for Hadoop involves multiple (often ad hoc) grouping and aggregation queries with selection predicates over a slowly changing dataset. These queries are typically expressed via high-level query languages such as Jaql, Pig, and Hive, and are used either directly for business-intelligence applications or to prepare the data for statistical model building and machine learning. Despite Hadoop’s popularity, it still suffers from major performance bottlenecks. In this seminar, I will talk about some techniques, borrowed from parallel databases, to speed up query processing on Hadoop.

The first of these techniques addresses the lack of ability to colocate related data on the same set of nodes in an HDFS cluster. To overcome this bottleneck, I will describe CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. Colocation can be used to improve the efficiency of many operations, including indexing, grouping, aggregation, columnar storage, joins, and sessionization. Next, I will present the Eagle-Eyed-Elephant (E3) framework for boosting the efficiency of query processing in Hadoop by avoiding accesses of data splits that are irrelevant to the query at hand. Using novel techniques involving inverted indexes over splits, domain segmentation, materialized views, and adaptive caching, E3 avoids accessing irrelevant splits even in the face of evolving workloads and data.

Thursday, 2/28/13 Vishy Vishwanathan, Associate Professor for Statistics and Computer Science, Purdue University

Bio: Vishy Vishwanathan is Associate Professor at Purdue University with joint appointments in the departments of Statistics and Computer Science. Prior to coming to Purdue in fall 2008, he was a principal researcher in the Statistical Machine Learning program of NICTA with an adjunct appointment at the College of Engineering and Computer Science, Australian National University. He received his Ph.D. in machine learning from the Department of Computer Science and Automation, Indian Institute of Science. in the year 2003. His recent research has focused on machine learning with emphasis on graphical models, structured prediction, kernel methods, and convex optimization. He works on problems in pattern recognition, OCR, bio-informatics, text analysis, and optimization.

Title: Challenges in Scaling Machine Learning to Big Data

Abstract: We will start with a very innocuous sounding question: How do you estimate a multinomial distribution given some data? We will then show that this fundamental question underlies many applications of machine learning. We will survey some learning algorithms and challenges in scaling them to massive datasets. The last part of the class will be an interactive thought session on how we can bring together ideas from systems and Machine Learning to attack this problem.

Tuesday, 3/5/13 Bob Felderman, Principal Engineer, Platforms Hardware, Google

Bio: Bob Felderman spent time at both Princeton and UCLA before venturing out into the real world. After a short stint at Information Sciences Institute he helped to found Myricom, which became a leader in cluster computing networking technology. After 7 years there he moved to the Bay Area to apply some HPC ideas to the IP/Ethernet space while working at Packet Design, and later was a founder of Precision I/O. All of that experience eventually led him to Google, where he has spent the past 6+ years working on issues in data center networking and general system design.

Title: Warehouse Scale Computing and the Perils and Pitfalls of Optimization

Abstract: Lots of the action in computer system design has migrated to the ends of the scale spectrum. Today, warehouse scale computing and mobile computing get a lot of attention. We'll present WSC from the Google perspective, then dive in to get a better idea on what it takes to create optimal systems.