Reliability

Data volumes are growing rapidly, and the number of storage devices needed to hold them grows in step. As storage systems scale up, whether locally managed or operated by storage service providers, the need to avoid data loss becomes ever more pressing. Our goal is to build systems that are resilient to failures and that provide this resilience efficiently and flexibly.

Hard Data & SSPiRAL Layouts

There is a growing need to survive the failure of multiple storage devices (or servers) in ever-larger storage systems. Redundant storage schemes are the obvious solution, and such systems commonly employ one of two strategies: a combination of replication and parity applied efficiently across an array of devices, or a failure-recovery scheme based on erasure coding. Computational efficiency matters when implementing redundancy schemes for disks, so parity is particularly appealing because it is easy to compute. Combinations of the two approaches also exist, but parity schemes typically tolerate only a small number of component failures, while erasure codes tend to be expensive to implement.
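To make the "ease of computation" point concrete, the following is a minimal sketch of plain single-parity protection: the parity block is the bytewise XOR of the data blocks, and any one lost block is rebuilt by XORing the survivors with the parity. It illustrates generic parity only, not the SSPiRAL layout itself; the function names `xor_blocks` and `recover_missing` and the sample block contents are hypothetical and chosen purely for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equally sized byte blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical example: three data "devices" protected by one parity block.
data = [b"disk-0..", b"disk-1..", b"disk-2.."]
parity = xor_blocks(data)

# Simulate losing the second device and rebuilding its contents.
rebuilt = recover_missing([data[0], data[2]], parity)
assert rebuilt == data[1]
```

A single parity block of this kind tolerates only one lost block per stripe, which is the limitation noted above; surviving more failures requires additional parity relationships or an erasure code.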

Excellent parity-based erasure codes and layout schemes have been devised, but prior art has focused primarily on surviving a specific number of device failures. As such, prior approaches have offered redundancy schemes that are elegant and symmetric in their layout of data. We diverge from this path and instead seek the most practical use of multiple data storage nodes and the most judicious use of bandwidth on the paths to these nodes. The result is extreme reliability that is easy to implement and maintain while offering excellent performance characteristics: in other words, the best redundancy scheme in the face of practical performance limits. By “practical performance limits,” we mean limits on bandwidth, the number of available devices, and available storage capacity.

SSPiRAL (Survivable Storage using Parity in Redundant Array Layouts) is our redundant data layout scheme based solely on efficient parity computations, offering high reliability and maintainability.

• Jehan-François Pâris, Ahmed Amer, “Using Shared Parity Disks to Improve the Reliability of RAID Arrays,” Proceedings of the IEEE International Performance, Computing and Communications Conference (IPCCC 2009), December 2009.
• Ahmed Amer, Jehan-François Pâris, Darrell D. E. Long and Thomas Schwarz, “Progressive Parity-Based Hardening of Data Stores,” Proceedings of the 27th International Performance of Computers and Communication Conference (IPCCC 2008), Austin, TX, USA: IEEE, December 2008.
• Ahmed Amer, Darrell D. E. Long, Jehan-François Pâris and Thomas Schwarz, “Increased Reliability with SSPiRAL Data Layouts,” Proceedings of the 16th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2008), Baltimore, MD, USA: IEEE, September 2008.
• Ahmed Amer, Jehan-François Pâris, Thomas Schwarz, Vincent Ciotola, and James Larkby-Lahet, “Outshining Mirrors: MTTDL of Fixed-Order SSPiRAL Layouts,” Proceedings of the International Workshop on Storage Network Architecture and Parallel IO (SNAPI 2007), San Diego, CA, USA, September 2007.
• Vincent Ciotola, James Larkby-Lahet, and Ahmed Amer, “SSPiRAL layouts: Practical extreme reliability,” Technical Report TR-07-149, Department of Computer Science, University of Pittsburgh, 2007. (Presented at the USENIX Annual Technical Conference 2007 poster session.)

This work was generously supported by the National Science Foundation under Award #0720578.