Websites: Skyhook Data Management, IRIS-HEP project
Funding: NSF TI-2229773, DOE ASCR DE-NA0003525 (FWP 20-023266): UCSC subcontractor of Sandia National Labs, NSF OAC-1836650, NSF CNS-1764102, NSF CNS-1705021, and CROSS
Overviews: CCGrid'22 paper, COMPSYS'23 paper
Important Links: GitHub repository, Ceph plugin repository, getting started instructions and notebook, code walkthrough video.
The key advantage of the cloud is its elasticity. This is implemented by systems that can expand and shrink resources quickly and by disaggregating services, including compute, networking, and storage. Elasticity is also valuable for on-premise datacenters, where disaggregation allows compute and storage to scale independently. This disaggregation, however, places greater demand on expensive top-of-rack networking resources, since compute and storage nodes end up in different racks and even rows as the installation grows. More network traffic also requires more CPU cycles to be dedicated to sending and receiving data. Therefore, disaggregation, somewhat paradoxically, amplifies the benefit of moving some compute (the compute that involves data management) into the storage and network layers, because data management operations such as filtering can reduce data movement significantly.
Combining data management with storage and networking also creates the opportunity for new services that help avoid dataset copies and thereby save significant storage space. Data management-enabled storage systems can provide views that combine parts of multiple datasets: columns from one table can be combined with columns from a different table without creating copies. For this to work, these storage systems need to maintain sufficient metadata and naming conventions for datasets. This makes them a natural place for maintaining this metadata and serving it to other tools in convenient formats.
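The idea of a view that combines columns from different tables without copying them can be illustrated with a small, self-contained sketch. This is a toy model in plain Python, not Skyhook or Ceph code: the `Table` and `View` classes, the `add`, `column`, and `rows` methods, and the sample data are all hypothetical names chosen for illustration, and the sketch assumes the two tables are row-aligned.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """A toy columnar table: column name -> list of values."""
    name: str
    columns: dict


@dataclass
class View:
    """A named view that references columns of existing tables (hypothetical)."""
    name: str
    # Metadata only: view column name -> (source table, source column name)
    mapping: dict = field(default_factory=dict)

    def add(self, alias, table, column):
        """Register a column of an existing table under a name in this view."""
        if column not in table.columns:
            raise KeyError(f"{table.name} has no column {column!r}")
        self.mapping[alias] = (table, column)
        return self

    def column(self, alias):
        """Return the referenced column itself; no data is copied."""
        table, column = self.mapping[alias]
        return table.columns[column]

    def rows(self):
        """Materialize rows only when a reader asks for them."""
        cols = [self.column(alias) for alias in self.mapping]
        return list(zip(*cols))


# Two independent, row-aligned tables (illustrative data).
events = Table("events", {"event_id": [1, 2, 3], "energy": [10.5, 7.2, 31.0]})
labels = Table("labels", {"event_id": [1, 2, 3], "tag": ["a", "b", "a"]})

# The view stores only names and references, not column data.
view = (
    View("energy_with_tag")
    .add("energy", events, "energy")
    .add("tag", labels, "tag")
)
```

Here `view.column("energy")` returns the very same list object held by `events`, so the view costs only the metadata in `mapping`. A real storage system would keep that metadata persistently and resolve names to stored objects rather than to in-memory lists.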
The Apache Arrow Ceph Plugin
Skyhook Data Management consists of multiple subprojects at different stages of maturity, spanning storage and networking. The most mature subproject is the Apache Arrow Ceph plugin, an extension of the Ceph open source distributed storage system for the scalable storage of tables and for offloading common data management operations on them, including selection, projection, aggregation, and indexing, as well as user-defined functions (see Apache Arrow blog post and GitHub repository). The goal of the Apache Arrow Ceph plugin is to transparently scale out data management operations across many storage servers, leveraging the scale-out and availability properties of Ceph while significantly reducing the CPU cycles and interconnect bandwidth spent on unnecessary data transfers. The SkyhookDM architecture is also designed to transparently optimize for future storage devices of increasing heterogeneity and specialization. All data moved from the Ceph OSDs to the client is in the Apache Arrow format.
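To make the benefit of offloading selection and projection concrete, the following toy model compares how much data crosses the network when filtering happens on the client versus on the storage node. It is a sketch in plain Python and does not use the plugin's actual API: `scan_on_storage`, `cells`, the table layout, and the predicate are hypothetical, and "cells transferred" is only a stand-in for interconnect bandwidth.

```python
def scan_on_storage(rows, columns, predicate):
    """Hypothetical storage-side scan: select rows, then project columns."""
    return [{c: r[c] for c in columns} for r in rows if predicate(r)]


def cells(rows):
    """Count the values that would have to be sent to the client."""
    return sum(len(r) for r in rows)


# Illustrative table held by a storage node: 1000 rows, 3 columns.
table = [
    {"id": i, "energy": float(i % 100), "payload": "x" * 8}
    for i in range(1000)
]

wanted_columns = ["id", "energy"]


def is_high_energy(row):
    return row["energy"] > 90.0


# Without offloading: the whole table is shipped, then filtered on the client.
shipped_without_offload = cells(table)
client_result = [
    {c: r[c] for c in wanted_columns} for r in table if is_high_energy(r)
]

# With offloading: only the selected rows and projected columns are shipped.
offloaded_result = scan_on_storage(table, wanted_columns, is_high_energy)
shipped_with_offload = cells(offloaded_result)

print(shipped_without_offload, shipped_with_offload)  # prints "3000 180"
```

Both paths produce the same result, but in this toy setup the offloaded scan transfers 180 values instead of 3000. The actual savings depend on the selectivity of the predicate and the width of the projection.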