
Numerical container for SOFA v1.0

Piotr Majdak, Harald Ziegelwanger

In SOFA, it would be advantageous to store all the information in a single file. To store the information, it must be serialized to a binary stream. There are many ways to serialize information, and a so-called numerical container (NC) can be used to define the format of the binary representation of the data. Note that the NC does not use any information about the meaning of the stored data; it simply creates a stream from the data to be saved.

Since we do not want to spend much time on creating a new numerical container for SOFA, we could use an existing one. To this end, we specify the requirements that seem to play the major role for SOFA:

• Representation of structured data: our data in SOFA will have a structure, which should be represented in the NC.
• Representation of multidimensional arrays.
• Free and open specifications: we neither want to reverse engineer an implementation nor to pay for a specification. The specifications must include a complete definition, and examples of implementation should be available to everybody for free.
• Actively developed: the NC must not be deprecated; it must be actively developed, not only to include more features but also to allow for bug fixing and adaptations to future platforms.
• Existing implementations, available as application programming interfaces (APIs): this is an important issue because we do not want to implement the specifications ourselves. We need a group working on the NC, so we can focus on SOFA. Thus, the following requirements for the API can be identified:
  ◦ Pre-compiled libraries for as many programming languages as possible: we do not want to recompile the API on every update of the NC.
  ◦ Matlab/Octave APIs are mandatory: since these two programs are the de facto standard in acoustics-related research, the NC must be available in these packages.
  ◦ Free and open-source: all the sources of the API must be available and free to use. Why? We must be able to compile and even further develop the APIs in case support for the API is discontinued.
• Widely spread: a large number of institutions relying on an NC reduces the probability of it becoming a deprecated standard.
• Self-describing: the NC must include all information about the stored data, thus be self-describing.
• Network-transparent: the data must be accessible by any computer regardless of how it stores characters, bytes, and numbers. Thus, the NC must be available for different operating systems (at least Linux, Mac OS X, and MS Windows).
• Huge-file support: no limitations in the specifications; the implementations must not be limited to the usual file size of 2 GB.
• Partial access: reading only parts of large files should be possible.
• Compression: the NC should allow compression, which is especially important when transferring the data over networks.
• Viewer available: a general viewer for the stored data would be nice. The viewer should be able to display the structure of the file and the data in a numeric representation. While the importance of a general viewer will probably vanish as soon as we have a stable version of a SOFA viewer, the general viewer will be very important in the development phase.

We have compared many existing NCs, i.e., file formats for scientific data representation. Formats which are proprietary, not binary, or not designed for structured data have not been considered in the comparison. We have compared so far:

• HDF5 (Hierarchical Data Format): numerical container originally developed at the National Center for Supercomputing Applications. It is currently supported by the non-profit HDF Group, whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data currently stored in HDF.
• netCDF (Network Common Data Form): intended for creating, accessing, and sharing multidimensional array-oriented data. We will consider the netCDF-4 format, which relies on HDF5. Standard of the Open Geospatial Consortium.
• CDF (Common Data Format): developed by NASA; it allows storing multidimensional data.
• SDIF (Sound Description Interchange Format): format for the interchange of a variety of sound descriptions. Jointly developed by IRCAM and CNMAT.
• FITS (Flexible Image Transport System): focuses on images, used in astronomy. Strictly backwards compatible (once FITS, always FITS).
• SDF (Simple Data Format): simple, well implemented, but not widespread (one-man product).
• SDXF (Structured Data eXchange Format): standardized by an RFC. Limitation of an entry to 16 MB. Not considered further because of that limitation.

Formats which do not fit our requirements:

• CGNS (corresponds to SOFA requirements): it defines the HDF description of data coming from computational fluid dynamics algorithms and relies on either ADF or HDF5 for saving/reading.
• Different serialization formats, like YAML (used for e-mail, MIME, not a binary format): some of them are standardized but do not provide support for structures.
• Many formats not supporting binary representation, like XML.
• Many proprietary formats, like MAT from Matlab.

The result of the comparison is shown in Table I. For a more detailed comparison between CDF, netCDF, and HDF5, go to this link and scroll down to FAQ #7.

Conclusions

It seems that netCDF provides all the features we would like to have. It is also backed by a large community behind the development and implementation of the NC. The chances that netCDF becomes deprecated soon are very small. Thus, it seems that netCDF should be our choice.

The following resources may be interesting for further reading:

• FAQ and User's Guide
• OPeNDAP: since we want to create a web interface for remote access to the SOFA data, OPeNDAP might be interesting, as the people working on netCDF seem to have already solved this problem there.

Table I: Comparison of different NCs.

| Description | SDF | FITS | SDIF | CDF | netCDF | HDF5 |
|---|---|---|---|---|---|---|
| Host/Developer | G. Fisher, SSL, UC Berkeley | IAU FITS Working Group | IRCAM and CNMAT | NASA | University Corporation for Atmospheric Research | HDF Group |
| Focus | multidim. arrays | images, meta data | sound description | multidim. structures | multidim. structures | multidim. structures |
| Free and open | yes | yes | yes | yes | yes | yes |
| Self-describing | ? | ? | yes | yes | yes | yes |
| Activity | 2007 | 1981-now | 1991-now | 1985-now | 1988-now | 1987-now |
| Standardized | no | yes | no | ISTP, CDHF | OGC | yes |
| Spread | Berkeley, one-man project | astronomy, ESA, ESO, NASA | IRCAM, MIR, Max/MSP community | >100 organizations, physics | climatology, meteorology, oceanography, GIS | used by netCDF, physics, biology, geography |
| API: free and open-source | C++ | C, C++, C#, Fortran, IDL, Java, R, Tcl, LabView, Matlab, Perl, PDL, Python | C, C++, Matlab, Perl, Python, Java, Lisp | C++, Java, Python, Matlab | C, C++, Fortran, IDL, Java, Matlab, R, Perl, Python, Ruby | C, C++, Fortran, Java, Matlab, Perl, Python, R |
| # of listed apps | ? | >30 | ~20 | ? | >80 | >100 |
| API binaries | Windows, Linux, OS X | Windows, OS X, Linux, (others) | OS X | Windows, Linux, OS X, (others) | Windows, OS X, Linux, (others) | Windows, OS X, Linux, (others) |
| Matlab bindings | no | native | Mac OS X | native | native | native |
| Octave bindings | no | ? | ? | ? | yes | no |
| Size limit | > 2 GB | no limits | 2 GB/record | > 4 GB | > 4 GB | no limits |
| Partial access | no | ? | rudimentary | yes | yes | yes |
| Compression | no | ? | no | ? | GZIP | GZIP |
| Viewer (platform) | no | for images | SDIF-Edit (OS X) | AUTOPLOT (all) | NCO (all), HDFVIEW (all) | HDFVIEW (all) |

Note: the "API binaries" row of the table additionally lists Android and Java among the CDF, netCDF, and HDF5 columns; which column each belongs to is not clear.
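
To make the notion of a numerical container from the introduction more concrete, the following is a minimal, purely illustrative sketch in Python. It is not the netCDF or HDF5 format and not part of SOFA. It only shows, for a small two-dimensional array, what three of the requirements could mean in practice: a self-describing header, a fixed byte order (network transparency), and partial access by seeking. All names (`MAGIC`, `write_array`, `read_header`, `read_row`) and the byte layout are assumptions invented for this example.

```python
import io
import struct

MAGIC = b"TOYNC"  # hypothetical file signature, invented for this sketch
HEADER_SIZE = len(MAGIC) + 8  # signature + two unsigned 32-bit integers


def write_array(stream, rows):
    """Serialize a 2-D list of floats as header (signature, shape) + raw data."""
    n_rows, n_cols = len(rows), len(rows[0])
    stream.write(MAGIC)
    # "<" fixes little-endian byte order and standard sizes, so the stream
    # is read identically on every platform (network transparency).
    stream.write(struct.pack("<II", n_rows, n_cols))
    for row in rows:
        stream.write(struct.pack(f"<{n_cols}d", *row))


def read_header(stream):
    """Return (n_rows, n_cols) using only the stream itself (self-describing)."""
    stream.seek(0)
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not a toy container")
    return struct.unpack("<II", stream.read(8))


def read_row(stream, index):
    """Read a single row without touching the rest of the data (partial access)."""
    n_rows, n_cols = read_header(stream)
    if not 0 <= index < n_rows:
        raise IndexError(index)
    # Each value is an 8-byte double, so the row offset can be computed.
    stream.seek(HEADER_SIZE + index * n_cols * 8)
    return list(struct.unpack(f"<{n_cols}d", stream.read(n_cols * 8)))


# Usage: write a 2 x 3 array to an in-memory stream and read parts of it back.
buffer = io.BytesIO()
write_array(buffer, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
shape = read_header(buffer)
second_row = read_row(buffer, 1)
```

A real NC such as netCDF or HDF5 adds what this sketch lacks, for example named variables, attributes, nested structures, compression, and files larger than 2 GB, which is why an existing, maintained NC is preferred over a new one.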