The Hierarchical Data Format (HDF): A Foundation for Sustainable Data and Software

Ted Habermann, The HDF Group; Andrew Collette, IMPACT, University of Colorado; Steve Vincena, BAPSF, University of California; Jay Jay Billings, Oak Ridge National Lab; Matt Gerring, Diamond Light Source; Pierre de Buyl, KU Leuven; Konrad Hinsen, Centre de Biophysique Moléculaire, Orléans (France); Mark Könnecke, Paul Scherrer Institut; Werner Benger, Airborne HydroMapping GmbH and Louisiana State University; Filipe RNC Maia, Uppsala University; Suren Byna, Lawrence Berkeley National Laboratory

Abstract

When combined with community-specific conventions and application-specific semantics, the Hierarchical Data Format (HDF) provides a robust and reliable foundation that many scientific communities rely on for interoperability and high-performance storage of metadata and data. This foundation addresses many technical obstacles related to sustainable data and software, critical steps towards sustainable science. Building on successes across the communities that use, or could use, HDF5 requires significant community engagement and development efforts that are outside the scope of traditional software development projects in research and non-profit organizations like The HDF Group. Identifying individuals and organizations that are already doing the right things, and bringing them together to share best practices for funding and executing these activities, is an important step towards our shared goal of sustainable science.

Introduction

Sustainable science depends on intricate relationships between data and the software tools that access and analyze them. Data must be preserved in well-documented, self-describing formats accessible on multiple platforms using many programming languages. In addition, the formats must include the metadata required for using and understanding the data long after they are collected and processed. In the short term, sustainable formats and commercial, open-source, and community tools must support the specific data and analysis needs of multiple scientific communities in order to achieve the usage and support required for sustaining maintenance and development of new capabilities. In practice, this is achieved through the development of conventions that map community science data types to computer science data types in the file. These conventions ensure interoperability by providing consistency in the data layer and isolating science users from the details of the underlying storage organization and format.

The Hierarchical Data Format (HDF5) addresses these needs and forms a foundation for data storage and access in many scientific communities. The flexibility of the format is achieved through simplicity. HDF5 is a container format that includes three fundamental objects: datasets, attributes, and groups. Datasets are multi-dimensional arrays of atomic or compound computer science datatypes. They provide the flexibility required for storage of diverse scientific data objects, while attributes allow annotation of these objects with the metadata required for use and understanding. Groups organize datasets and attributes into associations that define scientific data objects (see the sketch at the end of this section).

Combined with community-specific conventions and application-specific semantics, HDF provides a robust and reliable foundation that scientific communities rely on for interoperability and high-performance storage of metadata and data. The communities described here have overcome many obstacles to sustainable data and software. Finding effective mechanisms for sharing and building on their experiences is an exciting next step towards sustainable science.
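As a concrete illustration of these three object types, the following minimal sketch uses the Python h5py package (one of many HDF5 language bindings) to create and read back a small file; the file, group, dataset, and attribute names are invented for the example.

    import numpy as np
    import h5py

    # Create a file containing the three fundamental HDF5 objects.
    with h5py.File("example.h5", "w") as f:
        # Group: organizes related objects into a scientific data object.
        run = f.create_group("run_001")
        # Dataset: a multi-dimensional array of an atomic datatype.
        temps = run.create_dataset("temperature", data=np.random.rand(100, 3))
        # Attributes: metadata annotating the group and the dataset.
        run.attrs["instrument"] = "example sensor array"
        temps.attrs["units"] = "kelvin"

    # Reopen and read: the file is self-describing, so no external
    # schema is needed to discover shapes, types, or metadata.
    with h5py.File("example.h5", "r") as f:
        dset = f["run_001/temperature"]
        print(dset.shape, dset.dtype, dset.attrs["units"])

Because the data objects and their annotations travel together in one file, any HDF5-aware tool or language binding can recover the same structure without the code that wrote it.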

Scientific Facilities

Many important scientific problems require investments in equipment or facilities that must be amortized across user communities. Scientists from all over the world travel to facilities to run experiments or use observations collected at or by these facilities. Multiple members of diverse science teams must be able to freely share original observations and derived results without worrying about platforms, operating systems, and other computer details. The HDF5 format is well suited to these demands and is being used successfully in many large facilities.

The NASA dust accelerator facility at the University of Colorado's Institute for Modeling Plasma, Atmospheres and Cosmic Dust (impact.colorado.edu) is entirely standardized on HDF5. The data product is an HDF5 container following conventions that represent particle impact events from the accelerator. The "science data type" is a sequence of dust-particle events, each of which has an associated set of time series representing the experimental data, and associated metadata. HDF5 files produced according to this scheme may be read using facility-provided IDL scripts, but users are free to use any HDF5-aware analysis environment (IDL, Matlab, Python, etc.) to explore the files. The use of standard HDF5 attributes and datasets means that "free-form" analysis in these environments is possible, without relying on the facility scripts or special-purpose readers (see the sketch at the end of this section).

The Basic Plasma Science Facility (BAPSF, http://plasma.physics.ucla.edu) is a national user facility (supported by DOE and NSF) dedicated to experiments addressing fundamental questions in many sub-fields of plasma physics. Measurements are made with user-selectable probes in the 20 m long device. Time series from these probes are digitized and stored in HDF5 using facility-specific conventions. Additional data and metadata are stored in the file to document digitizer settings, the run-time state of other programmable devices associated with the experiment, and diagnostic information for the plasma device itself.

The Diamond Light Source (DLS, http://www.diamond.ac.uk) is the United Kingdom's synchrotron, another large scientific facility. DLS uses the NeXus conventions with HDF5 for most of its experimental data, bringing many benefits to users: HDF5 is built into the next generation of faster detectors; data can be visualized and sliced; data larger than memory can be analyzed; users can create and test tools on one slice of data and then process the whole stack; and state or results can be saved to an HDF5 file. In addition, a wide range of tools are used to convert to and from HDF5.

The final example involves modeling an entire facility in HDF5. The Reactor Analyzer, being developed at Oak Ridge (Billings et al., 2014) for reading, writing, and viewing simulation results, uses HDF5 to handle all file processing. The system works by mapping the physical structures of light-water and sodium-cooled fast reactors (reactor vessel, fuel pins, etc.) and the nuclear plant (heat exchangers, pipes, etc.) to simple representative classes. Instances of these classes are then written to HDF5 data types, creating a completely consistent, transferable, hierarchical digital description of the nuclear reactor that was simulated.
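The sketch below illustrates the kind of "free-form" exploration described for the dust accelerator files, again using Python and h5py. The layout (an events group holding one sub-group per dust-particle event, with per-event waveform datasets and attributes) is hypothetical, chosen only to illustrate the pattern, and is not the facility's actual convention.

    import h5py

    # Hypothetical layout: /events/<event_id>/ holds time-series datasets
    # for one dust-particle impact, plus metadata stored as attributes.
    with h5py.File("accelerator_run.h5", "r") as f:
        for event_id, event in f["events"].items():
            waveform = event["charge_waveform"][:]   # time series -> NumPy array
            speed = event.attrs.get("speed_km_s")    # per-event metadata (assumed name)
            print(event_id, waveform.shape, speed)

The same loop works unchanged in any HDF5-aware environment, because the structure is discoverable from the file itself rather than from facility-specific reader code.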

Models and Tools

In other scientific communities, models are fundamental tools that increase understanding. HDF5 supports input and output from models as well as visualization and analysis of results.

For example, in molecular chemistry and physics, molecular coordinates constitute the basic information for simulation software. In addition to spatial coordinates, many additional fields are critical for complete understanding: velocities, masses, charges, etc. The scientific data items are mostly scalars and vectors organized in arrays with elements for specific atoms. Beyond the organization of the storage, an important challenge remains in the interpretation of the data in different contexts (biomolecular systems, simple fluids, etc.). H5MD (http://nongnu.org/h5md/) has recently been proposed as an HDF5-based format for molecular data (de Buyl et al., 2014). It addresses data organization and definition within an HDF5 file (a minimal layout sketch appears at the end of this section) and is already being used as a basis for further specialization, with K. Hinsen proposing an H5MD module for his MOlecular SimulAtion Interchange Conventions (MOSAIC, 2014).

The "F5" model (http://www.fiberbundle.net) utilizes the concept of fiber bundles, gluing differential geometry, topology, and geometric algebra to cast many spatio-temporal data categories into a hierarchy of five semantic levels. This concept fits perfectly into the capabilities of HDF5, providing a concrete implementation of this abstract model. The F5 model has been used to efficiently store data sets from medical imaging, computational fluid dynamics, molecular dynamics, and geosciences. It seamlessly integrates both raster and vector graphics, as well as complex data categories such as curvilinear multiblock and unstructured n-dimensional meshes and arbitrarily attributed fragmented point clouds.

H5hut (http://www-vis.lbl.gov/Research/H5hut/) provides an API for parallel HDF5 encapsulating several data models of particle-based simulations. Particle simulations run on supercomputers with hundreds of thousands of cores in order to study plasma and accelerator physics. These applications simulate billions to trillions of particles and produce tens of gigabytes to hundreds of terabytes of data. High-performance I/O, analysis, and visualization for these data can be challenging, especially for domain scientists. The H5hut library provides an efficient implementation of I/O and data management operations that hides the complexity of parallel HDF5 and is simple to use.

Other frameworks include:
• the Silo framework (https://wci.llnl.gov/codes/silo/index.html): a mesh and field I/O library and scientific database
• many simulation and visualization tools developed at TechX (http://www.txcorp.com)

The ActivePapers project (http://www.activepapers.org/) goes even further by integrating data with the software that produces or analyzes it in a single HDF5 package for publishing and archiving reproducible research. ActivePapers can re-use data and software through a DOI-based reference, which is important both for provenance tracking and for proper attribution of computational work. Because an ActivePaper is an HDF5 file, the data it contains is immediately usable via other HDF5-based software, which greatly reduces the acceptance problems that every new technology faces. In fact, an ActivePaper can be considered an HDF5 file with useful extra features.
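The H5MD layout sketch referenced above, written with h5py. The element layout (parallel step, time, and value datasets under a particles group) follows the general pattern of the published H5MD specification, but required metadata groups are omitted here for brevity; consult the specification for the normative structure.

    import numpy as np
    import h5py

    n_frames, n_atoms = 10, 64
    positions = np.random.rand(n_frames, n_atoms, 3)  # synthetic trajectory

    with h5py.File("trajectory.h5", "w") as f:
        # Time-dependent data are stored as parallel 'step', 'time', and
        # 'value' datasets under an element group such as .../position.
        pos = f.create_group("particles/all/position")
        pos.create_dataset("step", data=np.arange(n_frames))
        pos.create_dataset("time", data=np.arange(n_frames) * 0.005)  # assumed time unit
        pos.create_dataset("value", data=positions)

Because the whole trajectory lives in one array annotated with an explicit time axis, generic HDF5 tools can plot or subsample it without any knowledge of the simulation code that produced it.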

Sustainable Archives

Preserving and providing access to observations, data, and model results for global users over the long term is prohibitively expensive when each dataset requires custom access software. A uniform format brings efficiencies of scale that make sustainable archives possible.

The X-ray Coherent Diffractive Imaging community has created the Coherent X-ray Imaging Data Bank (Maia, 2012), unified around the CXI format. This reliable, high-performance, and easy-to-use format, built on an HDF5 base, has simplified the task of scientists analyzing data from multiple facilities. Input formats from various light sources are converted to CXI on input to the repository (a conversion sketch follows this section), making the data more convenient for analysis and reducing the number of input formats that software developers have to take into account.

The NASA Earth Observing System (http://eospso.gsfc.nasa.gov) selected HDF as the format for all of its satellite data over twenty years ago and has used the HDF-EOS conventions for hundreds of NASA products and millions of data files. EOS has lasted long enough that it has already encountered many real-world sustainability problems: 1) adoption of a standard format and related conventions across a wide variety of diverse science teams with their own requirements and priorities, 2) evolution of formats and conventions across major format versions (HDF4 vs. HDF5), 3) emergence of new programming languages, compilers, computer systems, and architectures, and 4) development of new very high-resolution, multi-channel instruments with unanticipated observation geometries and significant long-term metadata needs.
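A hedged sketch of such an on-ingest conversion in Python with h5py. The group and dataset names below (a cxi_version dataset and an entry_1 group tree) follow the general pattern of the CXI layout but are illustrative only; consult the CXI specification for the normative structure. The function name and the input frames are invented for the example.

    import numpy as np
    import h5py

    def convert_to_cxi(frames, out_path):
        # Write detector frames into a CXI-style HDF5 layout (illustrative).
        with h5py.File(out_path, "w") as f:
            f.create_dataset("cxi_version", data=150)  # format version marker
            # Intermediate groups (entry_1, data_1) are created automatically.
            f.create_dataset("entry_1/data_1/data", data=frames,
                             compression="gzip")

    # Example: four empty 512x512 detector frames from some light source.
    convert_to_cxi(np.zeros((4, 512, 512), dtype="f4"), "converted.cxi")

Converting at ingest means the repository stores a single layout, so downstream analysis software needs one reader regardless of which facility produced the raw data.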

Next Steps?

The ability of communities to store, access, and share modern scientific data and build data analysis and management systems on top of HDF5 is well demonstrated by the examples described here (see http://www.hdfgroup.org/HDF5/users5.html for others). Together these groups form a unique pool of knowledge and experience in sustainable science data and software. They have established that a simple underlying format that supports flexible data organizations and access can address many of the needs of even the most demanding scientific disciplines, and they have taken difficult steps towards sustainable science in their communities.

In addition, these groups have identified critical obstacles to sustainable scientific data and software, well summarized by the NeXus experience: the uptake of NeXus has been slow; at facilities with working data acquisition/analysis pipelines the incentive to move towards NeXus is low, especially as scientific software development is generally underfunded. NeXus is the format of choice whenever new things are being built or substantial upgrades are made. Another drive towards NeXus comes from next-generation x-ray detectors and neutron facilities that generate data sets too large to be handled by the existing schemes; then NeXus and HDF5 become the format of choice. Other obstacles include:
• NeXus is a voluntary effort; thus progress is very slow.
• Lack of understanding of software issues in the scientific community and in management results in insufficient funding, which in turn impedes progress.
• Communities are not organized to negotiate, maintain, and enforce conventions.

It seems clear that social obstacles have overtaken technical obstacles on the road towards sustainable data, software, and science. Understanding and addressing these obstacles is critical to progress.

Positive Deviance (http://www.positivedeviance.org) suggests that in every community there are individuals or groups whose successful behaviors and strategies enable them to find better solutions to problems than their peers. We have identified some of those positive deviants here. The next step is bringing them together to help others understand and follow the paths they have forged.

Applying these ideas across the communities discussed here, and others that are using, or could use, HDF5 successfully, requires significant community engagement and development efforts that are outside the scope of traditional software development projects in research and non-profit organizations like The HDF Group. WSSSPE2 provides an opportunity to bring together leaders with experience in similar multi-disciplinary success stories to share best practices for funding and executing these activities.

References

Billings, J. J., et al. A domain-specific analysis system for examining nuclear reactor simulation data for light-water and sodium-cooled fast reactors. arXiv:1407.2795, 10 Jul 2014.

de Buyl, P., Colberg, P. H., and Höfling, F. H5MD: a structured, efficient, and portable file format for molecular data. arXiv:1308.6382, 2014.

Maia, F. R. N. C. The Coherent X-ray Imaging Data Bank. Nat. Methods 9, 854–855 (2012).

MOSAIC, http://mosaic-data-model.github.io/mosaic-specification/, J. Chem. Inf. Model. 54, 131–137 (2014).