
Scientific File Formats

Daniel L. Wang

SLAC

6 October 2010


1 Introduction

2 Commonalities

3 Examples: FITS, XTC, ROOT I/O, NetCDF, HDF5, Others

4 Final Notes


Introduction: why files?

Files contain much (most?) scientific data

Files last for a long time

Explosion in data → more, bigger files

Figure: Magnetic tape drive http://www.flickr.com/photos/laughingsquid/102689398/


Scientific File Access

Simple, non-transactional

Data access: value lookups, statistics, plotting

Transformations: simple math + complex algorithms

Logging/history: coarse, e.g., “Image is the result of function(Image A, parameter set B)”, not “$X was deducted from Account Y and added to Account Z”

Longevity: >10, 50 or 100+ years


Themes and Commonalities in Formats

Sequential or random-access navigation

Storage efficiency

Self-description (i.e., metadata)

Ordering: Object sequences, grids, images, N-D arrays (+ tables)

Write-append (non-update)

Machine portability (FP format, byte order)

Standards
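To make the machine-portability item concrete, here is a minimal sketch of why a format has to pin down byte order. It uses only Python's standard `struct` module; the helper name `read_f32` is an illustrative assumption, not part of any format's API.

```python
import struct

value = 6.25  # exactly representable as an IEEE 754 32-bit float

big = struct.pack(">f", value)     # big-endian encoding
little = struct.pack("<f", value)  # little-endian encoding

# Same four bytes, opposite order.
assert big == little[::-1]


def read_f32(buf, byteorder):
    """Decode one 32-bit float, given the byte order the writer used."""
    fmt = {"big": ">f", "little": "<f"}[byteorder]
    return struct.unpack(fmt, buf)[0]
```

Decoding with the wrong byte order does not fail; it silently yields a different number, which is why portable formats either fix the byte order or record it in the file.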


FITS

Flexible Image Transport System. Standard astronomical data format (NASA/IAU)

Generic N-D arrays, images, and ASCII or binary tables

Image tile-compression

Not random access

Human-readable header

See also: AipsIO (used by casacore)


Figure: FITS file layout: Primary HDU, Extension HDU, Extension HDU, ...

HDU type        Contents
Primary         ASCII header + n-D array
Image*          ASCII header w/image metadata + n-D array
ASCII Table*    ASCII header w/table metadata + fixed-width row data
Binary Table*   ASCII header w/table metadata + 2-D array

* Extension HDU
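The "ASCII header + n-D array" structure of a primary HDU can be sketched with the standard library alone. This is a minimal illustration, not a replacement for a real FITS library: it relies on the FITS conventions of 80-character header cards, 2880-byte blocks, and big-endian data, and the function names (`card`, `pad`, `make_primary_hdu`, `read_header`) are hypothetical helpers invented for this example.

```python
import struct

BLOCK = 2880  # FITS headers and data are padded to 2880-byte blocks
CARD = 80     # each header record ("card") is 80 ASCII characters


def card(key, value=None):
    """Build one fixed-format header card, e.g. 'NAXIS   =                    2'."""
    text = key if value is None else f"{key:<8}= {value:>20}"
    return text.ljust(CARD).encode("ascii")


def pad(data, fill):
    """Pad a byte string up to the next multiple of BLOCK."""
    return data + fill * (-len(data) % BLOCK)


def make_primary_hdu(pixels, width, height):
    """Assemble a primary HDU holding a 16-bit integer image."""
    header = b"".join([
        card("SIMPLE", "T"),
        card("BITPIX", 16),
        card("NAXIS", 2),
        card("NAXIS1", width),
        card("NAXIS2", height),
        card("END"),
    ])
    data = struct.pack(f">{len(pixels)}h", *pixels)  # big-endian int16
    return pad(header, b" ") + pad(data, b"\0")


def read_header(hdu):
    """Read keyword/value strings card by card until the END card."""
    cards = {}
    for i in range(0, len(hdu), CARD):
        rec = hdu[i:i + CARD].decode("ascii")
        key = rec[:8].strip()
        if key == "END":
            break
        cards[key] = rec[10:].split("/")[0].strip()
    return cards
```

Because the header is plain ASCII, it can be inspected with a pager or text editor, which is what the "human-readable header" property refers to. The reader has to walk the header to find where the data begins, which is one reason access is sequential rather than random.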


XTC

eXtended Tagged Container (HEP, photon science)

Object: Vectors, trajectories, events, detections

Not random-access

No compression

Lightweight, streaming


Figure: XTC stream layout: Datagram, Datagram, ...

Datagram type   Notes
Sequence        Event transitions (e.g., {Begin,End}Run, L1Accept)
Xtc             User data objects
Env

More information: https://confluence.slac.stanford.edu/display/PCDS/XTC+format
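The streaming, non-random-access character of a tagged container can be illustrated with a toy reader. This is emphatically not the real XTC layout (see the link above for that): the 8-byte record header, the tag numbers, and the names `write_record` and `read_records` are all assumptions made up for this sketch.

```python
import io
import struct

# Hypothetical record header: 4-byte type tag + 4-byte payload length,
# both little-endian. The actual XTC header is different.
HEADER = struct.Struct("<II")


def write_record(stream, tag, payload):
    """Append one tagged record to the stream."""
    stream.write(HEADER.pack(tag, len(payload)))
    stream.write(payload)


def read_records(stream):
    """Yield (tag, payload) pairs in file order, one record at a time."""
    while True:
        head = stream.read(HEADER.size)
        if not head:
            return  # clean end of stream
        if len(head) < HEADER.size:
            raise ValueError("truncated record header")
        tag, length = HEADER.unpack(head)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated payload")
        yield tag, payload
```

With no index, reaching the N-th record means reading the N-1 headers before it. In exchange, the writer only ever appends, which keeps it lightweight.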


ROOT I/O

ROOT Object I/O (HEP)

Object serialization

Tree-structured (similar to fs)

Object deletion

Compression (deflate)

Ranged values

See also: LCIO


Figure: From [Brun and Rademakers, 1996]


NetCDF

Network Common Data Form. (Geo)

N-dimensional arrays

Arrays appendable in one dimension (v4: or more)

Named dimensions with explicit coordinates (allow irregular spacing)

Machine portable

Slabbed, sliced, random access

NetCDF4: NetCDF API on HDF5 physical structure

Figure: 2000-2006 JJA wind power, courtesy Scott Capps

Section           Notes
Header            magic, #records, dimension, global attr, variable meta
Non-record data   fixed-size variables, incl. fixed dimensions
Record data       record variables (incl. record dimensions)

From [Rew et al., 1997]
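As a rough sketch of the header and record sections listed above, the following reads the first eight bytes of a classic-format NetCDF file (the `CDF` magic, a version byte, and the big-endian record count) and computes where a given record starts. The function and parameter names are hypothetical; `begin` and `recsize` stand for values a real reader would take from the variable metadata in the header.

```python
import struct


def parse_classic_prefix(buf):
    """Return (version, numrecs) from the start of a classic NetCDF header."""
    if len(buf) < 8 or buf[:3] != b"CDF":
        raise ValueError("not a NetCDF classic file")
    version = buf[3]
    if version not in (1, 2):  # 1 = classic, 2 = 64-bit offset
        raise ValueError("unsupported version byte")
    (numrecs,) = struct.unpack(">I", buf[4:8])  # record count, big-endian
    return version, numrecs


def record_offset(begin, recsize, record_index):
    """Byte offset of one record of a record variable.

    begin   -- offset of the variable's first record (assumed known)
    recsize -- total bytes per record across all record variables
    """
    return begin + record_index * recsize
```

Since every record has the same size, appending along the record dimension only extends the file and bumps the record count, and any record can still be located by arithmetic rather than by scanning.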


HDF5

N-D arrays

Array nesting via pointers+VL datatypes

Compression

Parallel I/O (MPI I/O)

Flexible, chunked data layout (slabs or custom tiles)

Ragged arrays via variable-length datatypes

fs-like, w/symlinks, heaps, freelists

Up to 255-byte offsets
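The idea behind a chunked layout can be shown with plain index arithmetic: given a rectangular selection, work out which chunks have to be read. This is an illustrative sketch only and does not use or mirror the HDF5 library API; `chunks_for_slab` and its parameters are hypothetical names.

```python
from itertools import product


def chunks_for_slab(start, count, chunk_shape):
    """Return grid coordinates of every chunk a rectangular slab touches.

    start, count -- per-dimension origin and extent of the selection
    chunk_shape  -- per-dimension chunk size
    """
    ranges = []
    for s, n, c in zip(start, count, chunk_shape):
        if n <= 0 or c <= 0:
            raise ValueError("extents and chunk sizes must be positive")
        first = s // c            # chunk holding the first selected element
        last = (s + n - 1) // c   # chunk holding the last selected element
        ranges.append(range(first, last + 1))
    return list(product(*ranges))
```

For example, with 8x8 chunks a 10x4 selection starting at (5, 0) touches only chunks (0, 0) and (1, 0), so a reader needs to fetch (and decompress) two chunks rather than the whole array.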


Figure: From [HDF Group, 2010]


Many more formats!

Irregular/non-rectangular grids (e.g., geodesic, radial, etc.) [Wadsley and Shell Internationale Petroleum, 1980], triangular meshes [Górski et al., 2005]

See also Scientific Data Format FAQ [Stern, 1995]

Figure: Hexagonal mesh http://www.flickr.com/photos/danhorst/819469908/

From HEALPix: http://healpix.jpl.nasa.gov


Final notes

Data > formats

Long-lived formats, long-lived software

Figure: Looking for data in files. http://www.flickr.com/photos/mahmood/4616170423/

References

Brun, R. and Rademakers, F. (1996). ROOT Object I/O System.

Górski, K., Hivon, E., Banday, A., Wandelt, B., Hansen, F., Reinecke, M., and Bartelmann, M. (2005). HEALPix: A framework for high-resolution discretization and fast analysis of data distributed on the sphere. The Astrophysical Journal, 622:759.

HDF Group (2010). HDF5 file format specification version 2.0. http://www.hdfgroup.org/HDF5/doc/H5.format.html.

Rew, R., Davis, G., Emmerson, S., and Davies, H. (1997). NetCDF user’s guide for C. Unidata Program Center.

Stern, I. (1995). Scientific Data Format Information FAQ.

Wadsley, W. A. and Shell Internationale Petroleum (1980). Modelling reservoir geometry with non-rectangular coordinate grids. In SPE Annual Technical Conference and Exhibition, Dallas, Texas. American Institute of Mining, Metallurgical, and Petroleum Engineers.
