The Hierarchical Data Format (HDF): A Foundation for Sustainable Data and Software

Ted Habermann, The HDF Group; Andrew Collette, IMPACT, University of Colorado; Steve Vincena, BAPSF, University of California; Jay Jay Billings, Oak Ridge National Lab; Matt Gerring, Diamond Light Source; Pierre de Buyl, KU Leuven; Konrad Hinsen, Centre de Biophysique Moléculaire, Orléans (France); Mark Könnecke, Paul Scherrer Institut; Werner Benger, Airborne HydroMapping GmbH and Louisiana State University; Filipe RNC Maia, Uppsala University; Suren Byna, Lawrence Berkeley National Laboratory

Abstract

When combined with community-specific conventions and application-specific semantics, the Hierarchical Data Format (HDF) provides a robust and reliable foundation that many scientific communities rely on for interoperability and high-performance storage of metadata and data. This foundation addresses many technical obstacles related to sustainable data and software, critical steps towards sustainable science. Building on successes across the communities that use, or could use, HDF5 requires significant community engagement and development efforts that are outside the scope of traditional software development projects in research and non-profit organizations like The HDF Group. Identifying individuals and organizations that are already doing the right things, and bringing them together to share best practices for funding and executing these activities, is an important step towards our shared goal of sustainable science.

Introduction

Sustainable science depends on intricate relationships between data and the software tools that access and analyze them. Data must be preserved in well-documented, self-describing formats accessible on multiple platforms using many programming languages. In addition, the formats must include the metadata required for using and understanding the data long after they are collected and processed. In the short term, sustainable formats and commercial, open-source, and community tools must support the specific data and analysis needs of multiple scientific communities in order to achieve the usage and support required for sustaining maintenance and development of new capabilities. In practice, this is achieved through the development of conventions that map community science data types to computer science data types in the file. These conventions ensure interoperability by providing consistency in the data layer and isolating science users from the details of the underlying storage organization and format.

The Hierarchical Data Format (HDF5) addresses these needs and forms a foundation for data storage and access in many scientific communities. The flexibility of the format is achieved through simplicity. HDF5 is a container format that includes three fundamental objects: datasets, attributes, and groups. Datasets are multi-dimensional arrays of atomic or compound computer science datatypes. They provide the flexibility required for storage of diverse scientific data objects, while attributes allow annotation of these objects with the metadata required for use and understanding. Groups organize datasets and attributes into associations that define scientific data objects (see the sketch at the end of this section).

Combined with community-specific conventions and application-specific semantics, HDF provides a robust and reliable foundation that scientific communities rely on for interoperability and high-performance storage of metadata and data. The communities described here have overcome many obstacles to sustainable data and software. Finding effective mechanisms for sharing and building on their experiences is an exciting next step towards sustainable science.
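As a concrete illustration of these three object types, the following minimal sketch uses the Python h5py package (one of many HDF5 language bindings) to create and read back a small file; the file, group, dataset, and attribute names are invented for the example.

    import numpy as np
    import h5py

    # Create a file containing the three fundamental HDF5 objects.
    with h5py.File("example.h5", "w") as f:
        # Group: organizes related objects into a scientific data object.
        run = f.create_group("run_001")
        # Dataset: a multi-dimensional array of an atomic datatype.
        temps = run.create_dataset("temperature", data=np.random.rand(100, 3))
        # Attributes: metadata annotating the group and the dataset.
        run.attrs["instrument"] = "example sensor array"
        temps.attrs["units"] = "kelvin"

    # Reopen and read: the file is self-describing, so no external
    # schema is needed to discover shapes, types, or metadata.
    with h5py.File("example.h5", "r") as f:
        dset = f["run_001/temperature"]
        print(dset.shape, dset.dtype, dset.attrs["units"])

Because the data objects and their annotations travel together in one file, any HDF5-aware tool or language binding can recover the same structure without the code that wrote it.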

Scientific Facilities

Many important scientific problems require investments in equipment or facilities that must be amortized across user communities. Scientists from all over the world travel to facilities to run experiments or use observations collected at or by these facilities. Multiple members of diverse science teams must be able to freely share original observations and derived results without worrying about platforms, operating systems, and other computer details. The HDF5 format is well suited to these demands and is being used successfully in many large facilities.

The NASA dust accelerator facility at the University of Colorado's Institute for Modeling Plasma, Atmospheres and Cosmic Dust (impact.colorado.edu) is entirely standardized on HDF5. The data product is an HDF5 container following conventions that represent particle impact events from the accelerator. The "science data type" is a sequence of dust-particle events, each of which has an associated set of time series representing the experimental data, and associated metadata. HDF5 files produced according to this scheme may be read using facility-provided IDL scripts, but users are free to use any HDF5-aware analysis environment (IDL, Matlab, Python, etc.) to explore the files. The use of standard HDF5 attributes and datasets means that "free-form" analysis in these environments is possible, without relying on the facility scripts or special-purpose readers (see the sketch at the end of this section).

The Basic Plasma Science Facility (BAPSF, http://plasma.physics.ucla.edu) is a national user facility (supported by DOE and NSF) dedicated to experiments addressing fundamental questions in many sub-fields of plasma physics. Measurements are made with user-selectable probes in the 20 m long device. Time series from these probes are digitized and stored in HDF5 using facility-specific conventions. Additional data and metadata are stored in the file to document digitizer settings, the run-time state of other programmable devices associated with the experiment, and diagnostic information for the plasma device itself.

The Diamond Light Source (DLS, http://www.diamond.ac.uk) is the United Kingdom's synchrotron, another large scientific facility. DLS uses the NeXus conventions with HDF5 for most of its experimental data, bringing many benefits to users: HDF5 is built into the next generation of faster detectors; data can be visualized and sliced; data larger than memory can be analyzed; users can create and test tools on one slice of data and then process the whole stack; and state or results can be saved to an HDF5 file. In addition, a wide range of tools are used to convert to and from HDF5.

The final example involves modeling an entire facility in HDF5. The Reactor Analyzer, being developed at Oak Ridge (Billings et al., 2014) for reading, writing, and viewing simulation results, uses HDF5 to handle all file processing. The system works by mapping the physical structures of light-water and sodium-cooled fast reactors (reactor vessel, fuel pins, etc.) and the nuclear plant (heat exchangers, pipes, etc.) to simple representative classes. Instances of these classes are then written to HDF5 data types, creating a completely consistent, transferable, hierarchical digital description of the nuclear reactor that was simulated.
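The sketch below illustrates the kind of "free-form" exploration described for the dust accelerator files, again using Python and h5py. The layout (an events group holding one sub-group per dust-particle event, with per-event waveform datasets and attributes) is hypothetical, chosen only to illustrate the pattern, and is not the facility's actual convention.

    import h5py

    # Hypothetical layout: /events/<event_id>/ holds time-series datasets
    # for one dust-particle impact, plus metadata stored as attributes.
    with h5py.File("accelerator_run.h5", "r") as f:
        for event_id, event in f["events"].items():
            waveform = event["charge_waveform"][:]   # time series -> NumPy array
            speed = event.attrs.get("speed_km_s")    # per-event metadata (assumed name)
            print(event_id, waveform.shape, speed)

The same loop works unchanged in any HDF5-aware environment, because the structure is discoverable from the file itself rather than from facility-specific reader code.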

Models and Tools

In other scientific communities, models are fundamental tools that increase understanding. HDF5 supports input and output from models as well as visualization and analysis of results.

For example, in molecular chemistry and physics, molecular coordinates constitute the basic information for simulation software. In addition to spatial coordinates, many additional fields are critical for complete understanding: velocities, masses, charges, etc. The scientific data items are mostly scalars and vectors organized in arrays with elements for specific atoms. Beyond the organization of the storage, an important challenge remains in the interpretation of the data in different contexts (biomolecular systems, simple fluids, etc.). H5MD (http://nongnu.org/h5md/) has recently been proposed as an HDF5-based format for molecular data (de Buyl et al., 2014). It addresses data organization and definition within an HDF5 file (a minimal layout sketch appears at the end of this section) and is already being used as a basis for further specialization, with K. Hinsen proposing an H5MD module for his MOlecular SimulAtion Interchange Conventions (MOSAIC, 2014).

The "F5" model (http://www.fiberbundle.net) utilizes the concept of fiber bundles, gluing differential geometry, topology, and geometric algebra to cast many spatio-temporal data categories into a hierarchy of five semantic levels. This concept fits perfectly into the capabilities of HDF5, providing a concrete implementation of this abstract model. The F5 model has been used to efficiently store data sets from medical imaging, computational fluid dynamics, molecular dynamics, and geosciences. It seamlessly integrates both raster and vector graphics, as well as complex data categories such as curvilinear multiblock and unstructured n-dimensional meshes and arbitrarily attributed fragmented point clouds.

H5hut (http://www-vis.lbl.gov/Research/H5hut/) provides an API for parallel HDF5 encapsulating several data models of particle-based simulations. Particle simulations run on supercomputers with hundreds of thousands of cores in order to study plasma and accelerator physics. These applications simulate billions to trillions of particles and produce tens of gigabytes to hundreds of terabytes of data. High-performance I/O, analysis, and visualization for these data can be challenging, especially for domain scientists. The H5hut library provides an efficient implementation of I/O and data management operations that hides the complexity of parallel HDF5 and is simple to use.

Other frameworks include:
• the Silo framework (https://wci.llnl.gov/codes/silo/index.html): a mesh and field I/O library and scientific database
• many simulation and visualization tools developed at TechX (http://www.txcorp.com)

The ActivePapers project (http://www.activepapers.org/) goes even further by integrating data with the software that produces or analyzes it in a single HDF5 package for publishing and archiving reproducible research. ActivePapers can re-use data and software through a DOI-based reference, which is important both for provenance tracking and for proper attribution of computational work. Because an ActivePaper is an HDF5 file, the data it contains is immediately usable via other HDF5-based software, which greatly reduces the acceptance problems that every new technology faces. In fact, an ActivePaper can be considered an HDF5 file with useful extra features.
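The H5MD layout sketch referenced above, written with h5py. The element layout (parallel step, time, and value datasets under a particles group) follows the general pattern of the published H5MD specification, but required metadata groups are omitted here for brevity; consult the specification for the normative structure.

    import numpy as np
    import h5py

    n_frames, n_atoms = 10, 64
    positions = np.random.rand(n_frames, n_atoms, 3)  # synthetic trajectory

    with h5py.File("trajectory.h5", "w") as f:
        # Time-dependent data are stored as parallel 'step', 'time', and
        # 'value' datasets under an element group such as .../position.
        pos = f.create_group("particles/all/position")
        pos.create_dataset("step", data=np.arange(n_frames))
        pos.create_dataset("time", data=np.arange(n_frames) * 0.005)  # assumed time unit
        pos.create_dataset("value", data=positions)

Because the whole trajectory lives in one array annotated with an explicit time axis, generic HDF5 tools can plot or subsample it without any knowledge of the simulation code that produced it.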

Sustainable Archives

Preserving and providing access to observations, data, and model results for global users over the long term is prohibitively expensive when each dataset requires custom access software. A uniform format brings efficiencies of scale that make sustainable archives possible.

The X-ray Coherent Diffractive Imaging community has created the Coherent X-ray Imaging Data Bank (Maia, 2012), unified around the CXI format. This reliable, high-performance, and easy-to-use format, built on an HDF5 base, has simplified the task of scientists analyzing data from multiple facilities. Input formats from various light sources are converted to CXI on input to the repository (a conversion sketch follows this section), making the data more convenient for analysis and reducing the number of input formats that software developers have to take into account.

The NASA Earth Observing System (http://eospso.gsfc.nasa.gov) selected HDF as the format for all of its satellite data over twenty years ago and has used the HDF-EOS conventions for hundreds of NASA products and millions of data files. EOS has lasted long enough that it has already encountered many real-world sustainability problems: 1) adoption of a standard format and related conventions across a wide variety of diverse science teams with their own requirements and priorities, 2) evolution of formats and conventions across major format versions (HDF4 vs. HDF5), 3) emergence of new programming languages, compilers, computer systems, and architectures, and 4) development of new very high-resolution, multi-channel instruments with unanticipated observation geometries and significant long-term metadata needs.
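A hedged sketch of such an on-ingest conversion in Python with h5py. The group and dataset names below (a cxi_version dataset and an entry_1 group tree) follow the general pattern of the CXI layout but are illustrative only; consult the CXI specification for the normative structure. The function name and the input frames are invented for the example.

    import numpy as np
    import h5py

    def convert_to_cxi(frames, out_path):
        # Write detector frames into a CXI-style HDF5 layout (illustrative).
        with h5py.File(out_path, "w") as f:
            f.create_dataset("cxi_version", data=150)  # format version marker
            # Intermediate groups (entry_1, data_1) are created automatically.
            f.create_dataset("entry_1/data_1/data", data=frames,
                             compression="gzip")

    # Example: four empty 512x512 detector frames from some light source.
    convert_to_cxi(np.zeros((4, 512, 512), dtype="f4"), "converted.cxi")

Converting at ingest means the repository stores a single layout, so downstream analysis software needs one reader regardless of which facility produced the raw data.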

Next Steps?

The ability of communities to store, access, and share modern scientific data and build data analysis and management systems on top of HDF5 is well demonstrated by the examples described here (see http://www.hdfgroup.org/HDF5/users5.html for others). Together these groups form a unique pool of knowledge and experience in sustainable science data and software. They have established that a simple underlying format that supports flexible data organizations and access can address many of the needs of even the most demanding scientific disciplines, and they have taken difficult steps towards sustainable science in their communities.

In addition, these groups have identified critical obstacles to sustainable scientific data and software, well summarized by the NeXus experience: the uptake of NeXus has been slow; at facilities with working data acquisition/analysis pipelines the incentive to move towards NeXus is low, especially as scientific software development is generally underfunded. NeXus is the format of choice whenever new things are being built or substantial upgrades are made. Another drive towards NeXus comes from next-generation x-ray detectors and neutron facilities that generate data sets too large to be handled by the existing schemes; then NeXus and HDF5 become the format of choice. Other obstacles include:
• NeXus is a voluntary effort; thus progress is very slow.
• Lack of understanding of software issues in the scientific community and in management results in insufficient funding, which in turn impedes progress.
• Communities are not organized to negotiate, maintain, and enforce conventions.

It seems clear that social obstacles have overtaken technical obstacles on the road towards sustainable data, software, and science. Understanding and addressing these obstacles is critical to progress.

Positive Deviance (http://www.positivedeviance.org) suggests that in every community there are individuals or groups whose successful behaviors and strategies enable them to find better solutions to problems than their peers. We have identified some of those positive deviants here. The next step is bringing them together to help others understand and follow the paths they have forged.

Applying these ideas across the communities discussed here, and others that are using, or could use, HDF5 successfully, requires significant community engagement and development efforts that are outside the scope of traditional software development projects in research and non-profit organizations like The HDF Group. WSSSPE2 provides an opportunity to bring together leaders with experience in similar multi-disciplinary success stories to share best practices for funding and executing these activities.

References

Billings, J. J., et al. A domain-specific analysis system for examining nuclear reactor simulation data for light-water and sodium-cooled fast reactors. arXiv:1407.2795, 10 Jul 2014.

de Buyl, P., Colberg, P. H., and Höfling, F. H5MD: a structured, efficient, and portable file format for molecular data. arXiv:1308.6382, 2014.

Maia, F. R. N. C. The Coherent X-ray Imaging Data Bank. Nat. Methods 9, 854–855 (2012).

MOSAIC, http://mosaic-data-model.github.io/mosaic-specification/, J. Chem. Inf. Model. 54, 131–137 (2014).