An efficient, versatile, scalable, and portable storage system for scientific data containers

Project Description

Scientific Bigdata sets are becoming too large and complex to fit in RAM, forcing scientific applications to perform a lot of slow disk and network I/O. This growth also makes scientific data more vulnerable to corruptions due to crashes and human errors. This project will use recent results from algorithms, database, and storage research to improve the performance and reliability of standard scientific data formats. This will make scientific research cheaper, faster, more reliable, and more reproducible.

The Hierarchical Data Format (HDF5) standard is a container format for scientific data. It allows scientists to define and store complex data structures inside HDF5 files. Unfortunately, the current standard forces users to store all data objects and their meta-data properties inside one large physical file; this mix hinders meta-data-specific optimizations. The current storage also uses data-structures that scale poorly for large data. Lastly, the current model lacks snapshot support, important for recovery from errors.

A new HDF5 release allows users to create more versatile storage plugins to control storage policies on each object and attribute. This project is developing support for snapshots in HDF5, designing new data structures and algorithms to scale HDF5 data access on modern storage devices to Bigdata. The project is designing several new HDF5 drivers: mapping objects to a Linux file system; storing objects in a database; and accessing data objects on remote Web servers. These improvements are evaluated using large-scale visualization applications with Bigdata, stemming from real-world scientific computations.

We experimented with the Retro Berkeley DB as a storage back-end, a modified version of the Berkeley DB which supports transactions and snapshots. In order to use the Berkeley DB for storing HDF5 metadata and without performing any changes in the HDF5 source code, we created a HDF5 Virtual Object Layer (VOL) plugin that accesses the Berkeley Database for storing and retrieving HDF5 objects.

Publications

Flexible navigation through a high-dimensional parameter space using Berkeley DB snapshots, Dimokritos Stamatakis, Werner Benger, and Liuba Shrira, in Proceedings of 25th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG ’17), May 2017 [link]. Selected to be published at the Journal of WSCG ’17 [pdf].

Members

  • Faculty

  • Liuba Shrira

    Brandeis University

  • Students

  • Dimokritos Stamatakis

    PhD student, Brandeis University

Acknowledgements