The Fast-Forward I/O and Storage Stack

Posted on April 7, 2013 by ivotron

tl;dr: This post gives a high-level introduction to the FastForward (FF) I/O stack, as described in the 2012-Q4 and 2013-Q1 milestones [1]. The milestone documents are organized on a per-component basis, so I thought it would be useful to have a single high-level intro that one can turn to when trying to figure out how a particular piece fits into the whole picture.

The Stack

The proposed exascale architecture is shown in the following diagram:

Figure 1. Exascale I/O Architecture (taken from [2]).

In the targeted use case, scientists write applications on top of non-POSIX interfaces that allow them to transfer storage-related logic down to the system. These applications run on compute nodes (CN) and, during the course of an execution, results are stored temporarily on I/O nodes (ION), which have the capacity to absorb the bursts of I/O produced during peak loads [3]. IONs hold data temporarily (for a period that depends on how often an application creates a checkpoint), and during that period a scientist can execute analysis on the temporary results. The I/O Dispatcher (IOD) executing on the IONs is in charge of reorganizing the random-access writes of the application processes into a sequentially optimized format (i.e. chunks), so that data gets efficiently pushed down to long-term storage (DAOS).

From a layered point of view, the above is generalized in the following diagram:

Figure 2. FastForward I/O Stack (taken from [2]).

The key features to highlight at this point are concurrency, asynchrony, transactions and function shipping. Many scientists will run their simulations concurrently, and they will be able to manipulate data that is "in flight"; that is, they will be able to associate versions with checkpointed data and execute analysis on a particular version (which in turn might produce one or more new versions), even if that version hasn't landed in long-term storage. Since every subsystem has active-storage capabilities, each can execute code, as long as the input data is available to it. In the ideal scenario, deciding which code executes locally on the CNs and which gets transferred to I/O (or DAOS) nodes should be done transparently, without requiring the user to specify anything. That is, the application developer should only provide the basic information (program/versioning logic), and the system should be in charge of deciding what code/data gets executed/stored where (CN, ION or DAOS node) and when.
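
To make the versioning and asynchrony ideas concrete, here is a minimal, self-contained C++ sketch of the pattern just described: a checkpoint is committed under a version number, and analysis can run on an already-committed ("in flight") version while the simulation keeps producing newer ones. None of the types or names below belong to the FastForward stack; they are invented purely for illustration.

```cpp
// Conceptual model only: a tiny in-memory "burst buffer" keyed by version.
#include <future>
#include <iostream>
#include <map>
#include <mutex>
#include <numeric>
#include <vector>

class BurstBuffer {                        // stand-in for the ION-side store
 public:
  void commit(int version, std::vector<double> data) {
    std::lock_guard<std::mutex> lk(m_);
    versions_[version] = std::move(data);  // version becomes visible atomically
  }
  std::vector<double> read(int version) {
    std::lock_guard<std::mutex> lk(m_);
    return versions_.at(version);          // read a committed, in-flight version
  }
 private:
  std::mutex m_;
  std::map<int, std::vector<double>> versions_;
};

int main() {
  BurstBuffer bb;
  bb.commit(1, std::vector<double>(1000, 1.0));    // checkpoint -> version 1

  // Analysis on version 1 proceeds asynchronously...
  auto analysis = std::async(std::launch::async, [&bb] {
    auto v = bb.read(1);
    return std::accumulate(v.begin(), v.end(), 0.0);
  });

  // ...while the simulation computes and commits the next checkpoint.
  bb.commit(2, std::vector<double>(1000, 2.0));    // checkpoint -> version 2

  std::cout << "analysis on version 1: " << analysis.get() << "\n";
}
```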

The Prototype

The above is the ideal Exascale architecture. The FF project will deliver a prototype that implements some parts of it. Next, I describe what I understand to be the components that will get implemented and those that will be left out, in terms of the ideal target use case.

The prototype stack, from a top-down point of view, has the following components:

  1. Arbitrarily Connected Graphs (ACG).
  2. HDF5 extensions.
  3. I/O Dispatcher (IOD).
  4. Distributed Application Object Storage (DAOS).
  5. Versioning Object Storage Device (VOSD).

Applications running on the compute nodes will be written as Python scripts or in C++ using the HDF5 APIs. These applications will execute analysis on graph-based data (ACG). The point is to demonstrate both the HPC and Big Data use cases of the Exascale architecture. The data structures will be stored in HDF5, for which new API extensions will be implemented. These new API calls will expose the transactional, asynchronous and function-shipping semantics of the underlying stack.
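
For reference, the plain (unextended) HDF5 code such an application starts from looks roughly like the sketch below; as I read the milestone documents, the FF extensions add transactional and asynchronous variants of these operations, whose exact signatures are specific to the prototype and are therefore not shown here.

```cpp
// Standard (non-FastForward) HDF5: create a file and write a 2-D dataset.
// Compile with something like: h5c++ write_example.cpp
#include <hdf5.h>
#include <vector>

int main() {
  const hsize_t dims[2] = {4, 6};
  std::vector<double> data(4 * 6, 3.14);

  hid_t file  = H5Fcreate("results.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
  hid_t space = H5Screate_simple(2, dims, NULL);
  hid_t dset  = H5Dcreate2(file, "/temperature", H5T_NATIVE_DOUBLE, space,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

  // Blocking, non-transactional write; the FF extensions are expected to offer
  // variants that additionally take a transaction and a completion handle.
  H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data.data());

  H5Dclose(dset);
  H5Sclose(space);
  H5Fclose(file);
  return 0;
}
```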

As mentioned before, the stack provides non-POSIX, object-based, transactional and asynchronous active storage, meaning that POSIX is supplanted by new object interfaces that reach up to the HDF5 layer (Application I/O in the stack diagram); transactional semantics are present from HDF5 down to the VOSD layer (Storage layer); clients don't need to wait on blocking operations; and analysis can be shipped and executed on I/O or DAOS nodes.

In the following, I give a high-level description of each layer.

ACG

This layer exemplifies how applications make use of the Exascale stack. For FF, support for GraphLab and GraphBuilder, two graph-processing frameworks, will be prototyped. GraphBuilder is a set of MapReduce tasks that extract, normalize, partition and serialize a graph out of unstructured data, and write graph-specific formats into HDFS. These files are later consumed by GraphLab, a vertex-centric, asynchronous execution engine that runs directly on top of HDFS (i.e. not on MapReduce). The following illustrates the architecture of both frameworks:

Figure 3. GraphLab and GraphBuilder stacks (taken from [4]).

In order to make them work on top of the exascale stack, both frameworks have to be modified. After these modifications are implemented, GraphBuilder will be able to write the partitioned graph into (the newly proposed) HDF5 files, which will thus be stored on the I/O nodes (IONs) in a parallel-optimized way. On the GraphLab side, HDF5-awareness will allow the library to perform at high speeds by benefiting from the new features (see next section). In general, both frameworks will be modified so that calls to HDFS-based formats are replaced by the proposed HDF5 ones. This replacement layer is referred to as the HDF Adaptation Layer (HAL), and it is what GraphBuilder and GraphLab will see of the stack [5].
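
As a rough illustration of the vertex-centric model that GraphLab implements, a PageRank-style computation over a toy in-memory edge list could look like the following. This is a generic sketch, not GraphLab's actual API or its asynchronous execution engine.

```cpp
// Generic gather-apply-scatter style iteration (illustrative only).
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
  const int n = 4;
  // Directed edges (src -> dst) of a toy graph.
  std::vector<std::pair<int, int>> edges = {{0,1},{1,2},{2,0},{2,3},{3,0}};

  std::vector<double> rank(n, 1.0 / n), next(n, 0.0);
  std::vector<int> out_degree(n, 0);
  for (auto& e : edges) out_degree[e.first]++;

  const double d = 0.85;                   // damping factor
  for (int iter = 0; iter < 10; ++iter) {
    std::fill(next.begin(), next.end(), (1.0 - d) / n);
    // "Gather/scatter": each edge pushes rank mass from src to dst.
    for (auto& e : edges)
      next[e.second] += d * rank[e.first] / out_degree[e.first];
    rank.swap(next);                       // "apply": adopt the new values
  }
  for (int v = 0; v < n; ++v) std::printf("rank[%d] = %f\n", v, rank[v]);
}
```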

HDF5 extensions

The extensions to HDF5 allow an application to take full advantage of the new exascale features. The additions comprise [6] (Figure 4):

  1. An object-storage API based on HDF5 to support high-level data models. This exposes asynchronous, transactional semantics to the application, as well as end-to-end data integrity. It will also allow the use of pointer data types, the passing of hints down to the storage layer, and support for asynchronous index building, maintenance and querying.
  2. A Virtual Object Layer (VOL) plugin that translates HDF5 API requests from applications into IOD API calls (a minimal sketch of this translation pattern follows the list).
  3. Function shipping from CNs to IONs. This provides the application developer with the capability of sending computation down to the IONs and getting results back.
  4. Analysis shipping from CNs to IONs or DAOS nodes. This is similar to 3, but instead of the result being returned over the network, it gets stored on the nodes and pointers to it are returned.
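
To give a feel for the translation role of the VOL plugin in item 2, the pattern is essentially an adapter that routes HDF5-level requests through an abstract backend, with one concrete backend forwarding them to a lower-level object API. This is purely illustrative; all names are invented and the real VOL interface is considerably richer.

```cpp
// Adapter sketch of the "VOL plugin" idea: HDF5-style requests are translated
// into calls on an IOD-like object API. Every name here is hypothetical.
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical lower-level object API (stands in for the IOD client library).
struct IodLikeClient {
  int create_object(const std::string& name) {
    std::printf("iod: create object '%s'\n", name.c_str());
    return 42;                                     // fake object handle
  }
  void write_object(int handle, const std::vector<char>& bytes) {
    std::printf("iod: write %zu bytes to object %d\n", bytes.size(), handle);
  }
};

// Abstract "virtual object layer": what the HDF5 library would call into.
struct VolBackend {
  virtual ~VolBackend() = default;
  virtual int dataset_create(const std::string& name) = 0;
  virtual void dataset_write(int handle, const std::vector<char>& bytes) = 0;
};

// Plugin that translates those requests into IOD-like calls.
struct IodVolPlugin : VolBackend {
  int dataset_create(const std::string& name) override {
    return iod_.create_object(name);
  }
  void dataset_write(int handle, const std::vector<char>& bytes) override {
    iod_.write_object(handle, bytes);
  }
  IodLikeClient iod_;
};

int main() {
  IodVolPlugin vol;                                // the "registered" plugin
  int dset = vol.dataset_create("/temperature");
  vol.dataset_write(dset, std::vector<char>(64, 0));
}
```
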
Figure 4. The HDF5 stack (taken from [7]).

In terms of the CN-ION communication model, a client/server architecture is implemented [8]: every ION runs an IOFSL (I/O Forwarding Scalability Layer) server [9], and the IOFSL client is integrated into the HDF5 library running on each CN. A client can forward requests to any number of IONs. Every I/O operation issued by HDF5 is shipped to the IOFSL server and executed asynchronously.
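
A self-contained way to picture the difference between function shipping and analysis shipping (again a conceptual model with invented names, not the IOFSL or HDF5 API): a service running near the data executes a shipped operation and either sends the value back, or keeps the result near the data and hands back only a handle to it.

```cpp
// Conceptual model of function shipping vs. analysis shipping. "IonServer"
// stands in for an ION-side service; names and structure are invented.
#include <future>
#include <iostream>
#include <map>
#include <numeric>
#include <string>
#include <vector>

struct IonServer {
  std::vector<double> data = std::vector<double>(1 << 20, 0.5);  // data living near the ION
  std::map<std::string, double> stored_results;                  // results kept server-side

  // Function shipping: run the operation near the data, return the value itself.
  std::future<double> ship_function() {
    return std::async(std::launch::async, [this] {
      return std::accumulate(data.begin(), data.end(), 0.0);
    });
  }

  // Analysis shipping: run the operation, store the result near the data,
  // and return only a handle that the client can dereference later.
  std::future<std::string> ship_analysis(const std::string& result_name) {
    return std::async(std::launch::async, [this, result_name] {
      stored_results[result_name] =
          std::accumulate(data.begin(), data.end(), 0.0);
      return result_name;                                        // "pointer" to the result
    });
  }
};

int main() {
  IonServer ion;
  auto value  = ion.ship_function();             // result comes back over the "network"
  auto handle = ion.ship_analysis("sum_v1");     // only a handle comes back

  std::cout << "shipped function returned " << value.get() << "\n";
  std::cout << "shipped analysis stored under '" << handle.get() << "'\n";
}
```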

IOD

The I/O Dispatcher (IOD) can be described simply as the storage abstraction that the compute nodes interact with. It encapsulates the burst-buffer layer as well as long-term storage. Every I/O node runs an IOD client, and in turn every IOD server runs a DAOS client, which allows it to communicate with the long-term storage subsystem (Figure 5).

Figure 5. The IOD stack (taken from [10]).

IOD exposes the transactional, asynchronous and function-shipping functionality of the stack to applications running on the compute nodes.

DAOS and VOSD

This is the long-term storage layer. In the prototype it consists of Lustre, with extensions that give the system transactional as well as function-shipping capabilities [11].

References

[1] Intel Corporation, “Milestone 2.2 updated detailed project plan,” Dec. 2012. Available at: https://wiki.hpdd.intel.com/download/attachments/12127153/Milestone%202.3%20Solution%20Architecture%20-%20ACG.pdf?version=1&modificationDate=1364836707553&api=v2.

[2] E. Barton, “Eric Barton progress update on the Fast Forward I/O & storage program,” Jan. 2013. Available at: http://www.youtube.com/watch?v=ynyR8Pjq8W4&feature=youtube_gdata_player.

[3] N. Liu, J. Cope, P. Carns, C. Carothers, R. Ross, G. Grider, A. Crume, and C. Maltzahn, “On the role of burst buffers in leadership-class storage systems,” 2012 IEEE 28th symposium on mass storage systems and technologies (MSST), 2012, pp. 1–11.

[4] T.L. Willke, N. Jain, and H. Gu, “GraphBuilder: A scalable graph construction library for Apache™ Hadoop™,” 2012.

[5] P. Arnab, J. Yu, and K. Ambert, “Milestone 2.3 solution architecture - ACG,” Dec. 2012. Available at: https://wiki.hpdd.intel.com/download/attachments/12127153/Milestone%202.3%20Solution%20Architecture%20-%20ACG.pdf?version=1&modificationDate=1364836707553&api=v2.

[6] Q. Koziol and R. Aydt, “Milestone 2.3 solution architecture - HDF,” Dec. 2012. Available at: https://wiki.hpdd.intel.com/download/attachments/12127153/Milestone%202.3%20Solution%20Architecture%20-%20HDF.pdf?version=1&modificationDate=1364837248956&api=v2.

[7] M. Chaarawi, J. Soumagne, R. Aydt, and Q. Koziol, “Milestone 3.1 initial design - HDF5 IOD VOL,” Mar. 2013. Available at: https://wiki.hpdd.intel.com/download/attachments/12127153/Milestone%203.1%20Design%20HDF%20IOD%20VOL.pdf?version=1&modificationDate=1364837670178&api=v2.

[8] Q. Koziol, “Milestone 2.4 function shipping design and framework demonstration,” Dec. 2012. Available at: https://wiki.hpdd.intel.com/download/attachments/12127153/Milestone%202.4%20Function%20Shipping%20Design%20and%20Framework%20Demonstration.pdf?version=1&modificationDate=1364837379790&api=v2.

[9] N. Ali, P. Carns, K. Iskra, D. Kimpe, S. Lang, R. Latham, R. Ross, L. Ward, and P. Sadayappan, “Scalable I/O forwarding framework for high-performance computing systems,” IEEE international conference on cluster computing and workshops, 2009. CLUSTER ’09, 2009, pp. 1–10.

[10] J. Bent, “Milestone 3.1 initial design - IOD,” Mar. 2013. Available at: https://wiki.hpdd.intel.com/download/attachments/12127153/Milestone%203.1%20Design%20-%20IOD.pdf?version=1&modificationDate=1364838267076&api=v2.

[11] J. Lombardi, “Milestone 2.3 solution architecture - Lustre restructuring,” Dec. 2012. Available at: https://wiki.hpdd.intel.com/download/attachments/12127153/Milestone%202.3%20Solution%20Architecture%20-%20Lustre%20Restructuring%202013-01-07.pdf?version=1&modificationDate=1364839103314&api=v2.