Hierarchical Data Format 5 (HDF5): Why And How To Use It

Andrea Negri

22 February 2013

Goals of this seminar

- provide an overview of what HDF5 can do
- describe how HDF5 is used and how widespread it is in the scientific community
- stimulate interest in this format
- present some references for COMPLETE documentation (both official and unofficial)
- show some simple examples of writing an HDF5 file

I will NOT talk about:
- the internal structure of the library (B-tree structure, etc.)
- details of advanced HDF5 usage such as parallel I/O, etc.

The necessity of a good data container

Common cases in Astronomy:
- observations: tons of huge ASCII files with tables (portability preferred), slow I/O, no compression, no optimization of disk usage; FITS files do not support storage of very large data sets
- simulations: binary files of very large numerical arrays, often written in parallel (I/O efficiency preferred), portability problems, sometimes no optimization of disk usage (it depends on the user!)

A unified and very versatile container would be great

A single format for different needs: HDF5!

What is HDF5?

Basically, HDF5 = Open file format + Open source software + data model

- A versatile data model that can represent very complex data objects and a wide variety of metadata;
- A completely portable file format with no limit on the number or size of data objects in the collection; the file format is defined by the HDF5 Specification;
- A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces;
- A rich set of integrated performance features that allow for access time and storage space optimizations;
- Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection;
- Completely open format and open-source software.

A little history

- 1987: The graphics task force at NCSA began work on an architecture-independent format and library, HDF.
- 1994: NASA selected HDF as the standard format for the Earth Observing System.
- 1996-1998: DOE tri-labs and NCSA, with additional support from NASA, developed HDF5, initially called “BigHDF”.
- 2005: NASA funded development of netCDF-4, a new version of netCDF that uses the HDF5 file format.
- 2006: The HDF Group, a non-profit corporation, spun off from NCSA and the University of Illinois.

Currently, The HDF Group develops and maintains HDF5.

HDF4 is still maintained by The HDF Group, but it is little used; its most severe limitation is that no file can be bigger than 2 GB. HDF5 is a completely different data model, file format, and library. The library can be downloaded from http://www.hdfgroup.org/HDF5/release/obtain5.html

Features of the file format

- defined by the HDF5 File Format Specification (open)
- designed for high volume (1 TB in a file!) and complex data
- POSIX-like access paths (e.g. /fields/density ...)
- completely self-describing: maximum portability
- mainly binary, random access
- optimized and tunable I/O efficiency (chunked data and other techniques)
- good for HPC
- wide range of native and user-defined datatypes supported
- possibility of compression (standard Gzip, other compression methods are possible)
- very strong support for multidimensional arrays
- parallel I/O, through MPI-2 I/O
- partial reads/writes on arrays are possible
- extensible data
- runs on many platforms, from my netbook to Blue Gene/Q
- file format designed to work well with other technologies
- short timescales for bug fixing
- long-term access to data
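Three of the features listed above (POSIX-like paths, chunked storage, and Gzip compression) can be illustrated with a short Fortran sketch. It is only an illustration under stated assumptions: the file name, the group and dataset names, the array size, the chunk size, and the compression level are all made up for the example, and the HDF5 library is assumed to have been built with zlib support.

```fortran
program chunked_compressed
  use hdf5
  implicit none

  ! All names and sizes below are illustrative assumptions
  character(len=*), parameter :: filename = "features.h5"
  integer, parameter :: rank = 2
  integer(hsize_t), dimension(rank) :: dims  = (/ 1000, 1000 /)
  integer(hsize_t), dimension(rank) :: chunk = (/ 100, 100 /)
  double precision, allocatable :: density(:,:)
  integer(hid_t) :: file_id, group_id, dspace_id, dcpl_id, dset_id
  integer :: error

  allocate(density(dims(1), dims(2)))
  density = 1.0d0

  call h5open_f(error)
  call h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error)

  ! Create the group /fields; the dataset will be reachable as /fields/density
  call h5gcreate_f(file_id, "/fields", group_id, error)

  call h5screate_simple_f(rank, dims, dspace_id, error)

  ! Dataset creation property list: chunked layout plus Gzip (deflate) level 6
  call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, error)
  call h5pset_chunk_f(dcpl_id, rank, chunk, error)
  call h5pset_deflate_f(dcpl_id, 6, error)

  ! The property list is passed through the optional dcpl_id argument
  call h5dcreate_f(group_id, "density", H5T_NATIVE_DOUBLE, dspace_id, &
                   dset_id, error, dcpl_id=dcpl_id)
  call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, density, dims, error)

  ! Release every identifier in reverse order of creation
  call h5dclose_f(dset_id, error)
  call h5pclose_f(dcpl_id, error)
  call h5sclose_f(dspace_id, error)
  call h5gclose_f(group_id, error)
  call h5fclose_f(file_id, error)
  call h5close_f(error)

  deallocate(density)
end program chunked_compressed
```

Compression requires a chunked layout, which is why the chunk size is set before the deflate filter. The best chunk size depends on the access pattern, so the 100 x 100 value here should be read as a placeholder.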

Users

HDF5 can be used alone (as I actually do) but many organizations and groups use it as an efficient base to develop their data models and containers: all the advantages of the HDF5 format tuned for your personal needs!

Who uses HDF5?

Users are spread all around the world!
- used in astrophysics, biology, optics, meteorology, oceanography, medical image processing, bioengineering, crystallography
- to store radio data from LOFAR
- NASA: both HDF5 and EOS for aerospace engineering
- Los Alamos laboratories
- Lucasfilm
- National Oceanographic Data Center (NOAA)
- an incomplete list: http://www.hdfgroup.org/HDF5/users5.html

Explore an HDF5 file: tools from hdfgroup.org

There are extremely useful command-line tools; the most important are:
- h5ls: lists the datasets
- h5dump: lets the user examine the contents of an HDF5 file and dump them to an ASCII file
- h5diff: compares HDF5 files
- h5import: imports ASCII or binary data into HDF5
- h5check: validation tool, helps ensure long-term access
- h5repack: repacks a file, making better use of unused space and changing properties of datasets
- h5perf_serial and h5perf: measure HDF5 serial and parallel performance
- and more...

HDFView: visual tool for browsing and editing HDF4 and HDF5 files.

Conversion tools between HDF5 and HDF4, EOS5, netCDF4, and the GIF format for images.

APIs officially supported by The HDF Group: C and Fortran 90/2003 (both low and high level), Java (high level)

Third-party bindings: GNU Data Language, IDL, MATLAB, Mathematica, Python (h5py, PyTables), CGNS, and others...

Two different philosophies: low-level and high-level APIs
- low-level: arrays of any size and type
- high-level: image, table, packet table, dimension scale

C function: name_command(args)
Fortran subroutine: name_command_f(args, error), where the integer error flag is the last argument
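As a small illustration of this naming convention, the sketch below asks the library for its version number. The Fortran routine carries the _f suffix and returns its status in the trailing integer argument, whereas the C counterpart (shown in a comment) returns the status as the function value. The program name is an arbitrary choice for the example.

```fortran
program api_convention
  use hdf5
  implicit none

  integer :: majnum, minnum, relnum
  integer :: error   ! 0 on success, negative on failure

  ! Initialize the Fortran interface; must precede any other HDF5 call
  call h5open_f(error)
  if (error /= 0) stop "h5open_f failed"

  ! C counterpart: status = H5get_libversion(&majnum, &minnum, &relnum);
  call h5get_libversion_f(majnum, minnum, relnum, error)
  if (error /= 0) stop "h5get_libversion_f failed"

  print *, "HDF5 library version:", majnum, minnum, relnum

  ! Close the Fortran interface
  call h5close_f(error)
end program api_convention
```

Checking the error flag after every call quickly becomes verbose; in practice it is often wrapped in a small helper routine, which is left out here to keep the sketch short.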

HDF5 structure: a quick look

HDF5 structure: a dataset

HDF5 datasets organize and contain the “raw data values”

HDF5 Datatypes describe individual data elements in an HDF5 dataset

Wide range of datatypes supported:
- integer, float, double, unsigned, bitfield, user-defined, any KIND in Fortran 2003
- variable-length types (e.g., strings)
- reference to object; reference to dataset region
- opaque types
- array
- compound

HDF5 Dataspaces describe the logical layout of the elements in an HDF5 dataset
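Because the datatype and the dataspace are stored with the dataset, a reader can recover the layout from the file itself. The following sketch, which uses a hypothetical file name, dataset name, and a 4 x 6 integer array chosen only for illustration, first writes a small dataset and then reopens the file to query its rank and dimensions.

```fortran
program inspect_dataspace
  use hdf5
  implicit none

  ! Hypothetical names and sizes, chosen only for this sketch
  character(len=*), parameter :: filename = "layout.h5"
  character(len=*), parameter :: dsetname = "temperature"
  integer, parameter :: rank = 2
  integer(hsize_t), dimension(rank) :: dims = (/ 4, 6 /)
  integer(hsize_t), dimension(rank) :: dims_found, maxdims
  integer, dimension(4,6) :: values
  integer(hid_t) :: file_id, dspace_id, dset_id
  integer :: rank_found, error

  values = 0
  call h5open_f(error)

  ! Step 1: write a small integer dataset
  call h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error)
  call h5screate_simple_f(rank, dims, dspace_id, error)
  call h5dcreate_f(file_id, dsetname, H5T_NATIVE_INTEGER, dspace_id, &
                   dset_id, error)
  call h5dwrite_f(dset_id, H5T_NATIVE_INTEGER, values, dims, error)
  call h5dclose_f(dset_id, error)
  call h5sclose_f(dspace_id, error)
  call h5fclose_f(file_id, error)

  ! Step 2: reopen the file read-only and ask the dataset for its dataspace
  call h5fopen_f(filename, H5F_ACC_RDONLY_F, file_id, error)
  call h5dopen_f(file_id, dsetname, dset_id, error)
  call h5dget_space_f(dset_id, dspace_id, error)

  call h5sget_simple_extent_ndims_f(dspace_id, rank_found, error)
  ! Note: on success this routine returns the rank in the error argument,
  ! and -1 on failure
  call h5sget_simple_extent_dims_f(dspace_id, dims_found, maxdims, error)

  print *, "rank:", rank_found
  print *, "dims:", dims_found

  call h5sclose_f(dspace_id, error)
  call h5dclose_f(dset_id, error)
  call h5fclose_f(file_id, error)
  call h5close_f(error)
end program inspect_dataspace
```

In a real reader the rank would be queried first and the dimension arrays allocated accordingly; fixed-size arrays are used here only to keep the example compact.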

How to write a simple dataset

    use hdf5

    CALL h5open_f(error)
    CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error)
    ! begin repetitive task
    call h5screate_simple_f(rank, dims, dspace_id, error)
    call h5dcreate_f(file_id, dsetname, H5T_NATIVE_DOUBLE, dspace_id, &
                     dset_id, error)
    call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, dset, dims, error)

    call h5dclose_f(dset_id, error)    ! end access to the dataset
    call h5sclose_f(dspace_id, error)  ! terminate access to the data space
    ! end repetitive task
    CALL h5fclose_f(file_id, error)
    CALL h5close_f(error)

Now let's see some real code.
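A minimal, self-contained sketch of the call sequence above is given here. It is not the code shown in the seminar: the program name, file name, dataset name, and the 4 x 6 array are assumptions made only so that the example compiles and runs on its own.

```fortran
program write_simple_dataset
  use hdf5
  implicit none

  ! Hypothetical names and sizes, chosen only for this sketch
  character(len=*), parameter :: filename = "simple.h5"
  character(len=*), parameter :: dsetname = "dset"
  integer, parameter :: rank = 2
  integer(hsize_t), dimension(rank) :: dims = (/ 4, 6 /)
  double precision, dimension(4,6) :: dset
  integer(hid_t) :: file_id, dspace_id, dset_id
  integer :: i, j, error

  ! Fill the array with some recognizable values
  do j = 1, 6
     do i = 1, 4
        dset(i,j) = dble(i) + 0.1d0*dble(j)
     end do
  end do

  ! Initialize the Fortran interface and create (or overwrite) the file
  call h5open_f(error)
  call h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error)
  if (error /= 0) stop "cannot create file"

  ! Dataspace: logical shape of the array
  call h5screate_simple_f(rank, dims, dspace_id, error)

  ! Dataset: name + datatype + dataspace, then the raw data
  call h5dcreate_f(file_id, dsetname, H5T_NATIVE_DOUBLE, dspace_id, &
                   dset_id, error)
  call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, dset, dims, error)
  if (error /= 0) stop "cannot write dataset"

  call h5dclose_f(dset_id, error)    ! end access to the dataset
  call h5sclose_f(dspace_id, error)  ! terminate access to the data space
  call h5fclose_f(file_id, error)    ! close the file
  call h5close_f(error)              ! close the Fortran interface
end program write_simple_dataset
```

Assuming the HDF5 Fortran compiler wrapper is installed, the program can be built with something like h5fc write_simple_dataset.f90 -o write_simple_dataset, and the resulting file can then be inspected with h5ls simple.h5 or h5dump simple.h5.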

To learn more

- Official documentation and tutorials: www.hdfgroup.org
- Very interesting series of slides: http://www.lofar.org/wiki/doku.php?id=public:hdf5

CINECA will hold a two-day course (16-17 May 2013):

Parallel I/O and management of large scientific data @ CINECA http://events.prace-ri.eu/conferenceDisplay.py?confId=126
