On the Use of Specifications of Binary File
Total Page:16
File Type:pdf, Size:1020Kb
On the Use of Specifications of Binary File Formats for Analysis and Processing of Binary Scientific Data ? Alexei Hmelnov1;2[0000−0002−0125−1130] and Tianjiao Li3[0000−0002−1037−2452] 1 Matrosov Institute for System Dynamics and Control Theory of Siberian Branch of Russian Academy of Sciences, 134 Lermontov st. Irkutsk, Russia [email protected] 2 Institute of Mathematics, Economics and Informatics, Irkutsk State University, Gagarin Blvd. 20, Irkutsk, Russia 3 ITMO University, 49 Kronverksky Pr., St. Petersburg, Russia [email protected] Abstract. The data collected during various kinds of scientific research may be represented both by well known binary file formats and by cus- tom formats specially developed for some unique device. While thorough understanding of the file format may be required for the former case of the well known format, for the latter case of custom formats it is of crit- ical importance. For the custom formats usually only few people know how they are organized, and this expertise can easily be lost. We have developed the language FlexT (Flexible Types) for specification of binary data formats. The language is declarative and designed to be well understood for human readers. Its main elements are the data type declarations, which look very much like the usual type declarations of the imperative programming languages, but are more flexible. The primary purpose of the language FlexT development is to be able to view the binary data and to check that the data conform to the specification, and that the specification conforms to the data samples available. As a result of the tests we can be sure, that the specification is correct. The FlexT specifications doesn`t contain any surplus information besides from that about the file format. They are compact and human readable. We have also developed the algorithm for data reading code generation from the specification. In the report we'll consider some FlexT language details and the expe- rience of its application to specification of some scientific data formats. Keywords: Specifications of binary data formats · Declarative language · Sci- entific data life-cycle. ? The work was carried out with financial support of Russian Foundation for Basic Research, grant No. 18-07-00758-a. Some results were obtained using the facilities of the Centre of collective usage "Integrated information network of Irkutsk scientific educational complex" 1 Introduction In the field of natural sciences the outcomes of some research are usually repre- sented by the articles, which summarize the main results of the analysis of the collected data. But the results of the analysis depend also on the methods cho- sen, their parameters, and some other subjective factors, so the other researchers are interested in the access to the data (both raw original and processed) to be able to check the results, try their own ways of the data analysis, or compare the data to their own ones. The concept of scientific data life-cycle should meet this demand. So, it becomes not enough to just obtain the data, process them and write some articles using the results of the processing. It is also required to share the data with other researchers. These researchers may be not only our con- temporaries but also our descendants, living in a few decades from now. Our descendants, we hope, will have more advanced devices, computers and soft- ware. But they still will not be able to measure the values of physical quantities for previous periods (say, several years ago). So the data, that we have stored for them, may become very valuable. Let us consider the possible ways of the data representation, compare their features, and describe some approaches, that should simplify data sharing. 2 Options for choosing the format of scientific data When deciding on the data representation it is required to select between several groups of options. We may consider the groups as a coordinates in the space of possible file formats. Let us consider the dimensions and their possible values. The first classification is by the level of customization: { well-known file formats with ready to use data processing libraries and soft- ware (JPEG,TIFF,DBF,CSV,SHP,...); { data exchange frameworks with the data format setup/data processing/code generation libraries, software; { "universal" formats with the data format setup/data processing/code gen- eration libraries, software; { custom file formats. So it is required to select between the well-known file formats, which may already have the ready to use libraries for their processing; the data exchange frameworks and "universal" formats, which require some additional information about the particular data structures, but provide a lot of ready to use services for their handling and the custom file formats, which will require to write all the software from scratch. Since the existing file formats may impose some unac- ceptable limitations on the data structures or have too large memory overhead, the choice here is far from being obvious. The "universal" file formats will be addressed in more detail in the next sections. The second classification is by the file format usage: there exist internal and exchange file formats. The internal file formats are designed to maximize the efficiency and ease of processing by a specific software. They may contain some auxiliary data structures (e.g. indices) and be very complex. The internal file formats are usually binary, unless the files are very small. The exchange file for- mats should be rather simple and easily understandable for other programmers, but they may be not very well-suited for everyday work (e.g. data editing). The next option to consider when selecting the data representation is whether to use a text-based or a binary data representation. The binary data formats are much more space- and time-efficient, than the text-based ones. The main disadvantage of the binary data is that they look opaque to the users and it is hard to control their contents with a "naked eye". That's why programmers nevertheless often prefer to use text formats, and among them the XML-based ones are of great popularity in spite of the fact that it becomes impossible for a human being to comprehend the extremely large text files. The language FlexT can make binary data transparent. 3 The "Universal" formats The "Universal" formats allow you to store a wide (but still limited) range of information. They are always accompanied by some software libraries and utilities for handling the files of the format. Some examples: { XML (eXtensible Markup Language) is a text based format family, they may use XML schemata, various utilities and editors, libraries/code generation software to simplify data processing. Various file format projects claim to be the binary version of the XML format, but no proposal has been accepted as a binary XML standard. { Hierarchical Data Format (HDF) [1] is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data. Originally developed at the National Center for Supercomputing Applications, the formats are now supported by the HDF Group. It has libraries for C,C++,Java, MATLAB, Scilab, Octave, Mathematica, IDL, Python, R, Fortran, and Julia. The main goal of its development is to create portable scientific data format. HDF files are self-describing allowing an application to interpret the structure and contents of a file without an external specification. One HDF4 file can hold a mix of related objects which can be accessed as a group or as individual objects. Users can create their own grouping structures called "vgroups". HDF5 simplifies the file structure to include only two major types of object: • datasets, which are multidimensional arrays of a homogeneous type; • groups, which are container structures that can hold datasets and other groups. { Protocol Buffers (by Google) [2] is a method of serializing structured data. The method involves an interface description language that describes the structure of some data and a program that generates source code from that description for writing or reading a stream of bytes that represents the struc- tured data. In contrast to HDF the resulting files are not self-contained and require the structure specification (or, better, the code generated from it) to read the files, corresponding to this specification. "Universal" data formats are often used to represent scientific data. Not to mention XML, which is used everywhere, HDF files, for example, are used for representation of remote sensing data, while the Protocol Buffers are the main way to represent the weights of deep neural networks. The main reason for this is the availability of tools and libraries that simplify the development and processing of "Universal" formats at the cost of following their prescripts. Our approach makes it possible to do nearly the same with an almost arbitrary binary file format. 4 Binary file format specifications in FlexT The main goal of the language FlexT is to provide the instrument, that can help us to explore and understand the contents of the binary files, for which there exists a specification of their format. It can also help to check whether the format specification is correct using the samples of data in this format. FlexT specifications can be used to check, view, process, and document both well-known and custom file formats. Our experience shows that the vast majority of format specifications written in natural language contain errors and ambiguities, which can be detected and fixed by trying to apply various versions of the specification to the sample data in order to find the correct variant of understanding of the format description. The information about a file format may also be obtained from the source code of a program that reads or writes the data.