Numerical container for SOFA v1.0 Page 1

Numerical container for SOFA

Piotr Majdak, Harald Ziegelwanger

In SOFA, it would be advantageous to store all the information in a single file. To be stored, the information must be serialized to a binary stream. There are many ways to serialize information, and a so-called numerical container (NC) can be used to define the format of the binary representation of the data. Note that the NC does not use any information about the meaning of the stored data; it simply creates a stream from the data to be saved.

Since we do not want to spend much time on creating a new numerical container for SOFA, we could use an existing one. To this end, we specify the requirements that seem to play a major role for SOFA:

• Representation of structured data: our data in SOFA will have a structure, which should be represented in the NC.
• Representation of multidimensional arrays.
• Free and open specifications: we want neither to reverse engineer an implementation nor to pay for a specification. The specifications must include a complete definition, and example implementations should be freely available to everybody.
• Actively developed: the NC must not be deprecated; it must be actively developed, not only to include more features but also to allow for bug fixes and adaptations to future platforms.
• Existing implementations, available as application programming interfaces (APIs): this is an important issue because we do not want to implement the specifications ourselves. We need a group working on the NC so that we can focus on SOFA. Thus, the following requirements for the API can be identified:
  ◦ Pre-compiled libraries for as many programming languages as possible: we do not want to recompile the API on every update of the NC.
  ◦ Matlab/Octave APIs are mandatory: since these two programs are the de facto standard in acoustics-related research, the NC must be available in these packages.
  ◦ Free and open-source: all the sources of the API must be available and free to use. Why?
    We must be able to compile and even further develop the APIs in case support for the API is discontinued.
• Widely spread: a large number of institutions relying on an NC reduces the probability of it becoming a deprecated standard.
• Self-describing: the NC must include all information about the stored data and thus be self-describing.
• Network-transparent: the data must be accessible by any computer regardless of how that computer stores characters, bytes, and numbers. Thus, the NC must be available for different operating systems (at least Linux, Mac OS X, and MS Windows).
• Huge-file support: no limitations in the specifications; the implementations must not be limited to the usual file size of 2 GB.
• Partial access: reading only parts of large files should be possible.
• Compression: the NC should allow compression, which is especially important when transferring the data over remote networks.
• Viewer available: a general viewer for the stored data would be nice. The viewer should be able to display the structure of the file and the data in a numeric representation. While the importance of a general viewer will probably vanish as soon as we have a stable version of a SOFA viewer, the general viewer will be very important in the development phase.

We have compared many existing NCs, i.e., file formats for scientific data representation. Formats which are proprietary, not binary, or not designed for structured data have not been considered in the comparison. We have compared so far:

• HDF 5 (): numerical container originally developed at the National Center for Supercomputing Applications and currently supported by the non-profit HDF Group, whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data currently stored in HDF.
• netCDF (Network Common Data Form): intende