An Open HDF5-Based Data Format to Represent Quantitative
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/2020.04.26.062976; this version posted April 28, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 1 1 Research Article 2 BD5: an open HDF5-based data format to represent 3 quantitative biological dynamics data 4 Koji Kyoda1,2, Kenneth H. L. Ho1,2, Yukako Tohsato1,2,3, Hiroya Itoga1 and Shuichi 5 Onami1,2,* 6 1Laboratory for Developmental Dynamics, RIKEN Center for Biosystems Dynamics 7 Research, Kobe 650-0047, Japan. 8 2Laboratory for Developmental Dynamics, RIKEN Quantitative Biology Center, Kobe 9 650-0047, Japan. 10 3Department of Information Science and Engineering, Ritsumeikan University, Shiga 11 525-8577, Japan. 12 *To whom correspondence should be addressed 13 14 Short title: BD5 data format for representing quantitative biological dynamics data 15 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.26.062976; this version posted April 28, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 2 16 Abstract 17 BD5 is a new binary data format based on HDF5 (hierarchical data format version 5). It 18 can be used for representing quantitative biological dynamics data obtained from 19 bioimage informatics techniques and mechanobiological simulations. Biological 20 Dynamics Markup Language (BDML) is an XML(Extensible Markup Language)-based 21 open format that is also used to represent such data; however, it becomes difficult to 22 access quantitative data in BDML files when the file size is large because parsing XML- 23 based files requires large computational resources to first read the whole file 24 sequentially into computer memory. BD5 enables fast random (i.e., direct) access to 25 quantitative data on disk without parsing the entire file. Therefore, it allows practical 26 reuse of data for understanding biological mechanisms underlying the dynamics. 27 28 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.26.062976; this version posted April 28, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 3 29 Introduction 30 Recent advances in bioimage informatics and mechanobiological simulation techniques 31 have led to the production of a large amount of quantitative data of spatiotemporal 32 dynamics of biological objects ranging from molecules to organisms [1]. A wide variety 33 of such data can be described in an open unified data format Biological Dynamics 34 Markup Language (BDML), an Extensible Markup Language (XML)-based format [2]. 35 BDML enables efficient development and evaluation of software tools for a wide range 36 of applications. 37 The XML-based BDML format has the advantages of machine/human readability, 38 and extensibility. However, it is often problematic for accessing and retrieving data 39 when the size of the BDML file becomes too large (e.g., our programs cannot load a 40 BDML file over 20 GB on a standard workstation). This problem arises because parsing 41 an XML-based file often requires large computational resources to first read the whole 42 file sequentially into computer memory. In fact, many sets of quantitative data stored in 43 the SSBD:database (Systems Science of Biological Dynamics database) [1] were 44 divided into a series of BDML files for each time point to allow software to read them 45 efficiently. One of the solutions to the above problem is to use another approach such as 46 the eXtensible Data Model and Format [3] or FieldML [4]. In these formats, the data 47 itself is described in HDF5 binary format and meta-information about the data is 48 described in XML format. HDF5 is a hierarchical data format for storing large scientific 49 data sets (http://www.hdfgroup.org/HDF5/). It is widely used for describing various 50 kinds of large-scale biological data [4-9]. bioRxiv preprint doi: https://doi.org/10.1101/2020.04.26.062976; this version posted April 28, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 4 51 Here, we describe the development of BD5 data format, based on HDF5, for 52 representing quantitative biological dynamics data in a manner that enables quick access 53 and retrieval. 54 Materials and Methods 55 Design and implementation 56 Here, we extended BDML to support HDF5-based storage of quantitative biological 57 dynamics data. In contrast to XML documents, HDF5 format can allow random (i.e., 58 direct) access to parts of the file without parsing the entire contents. Therefore, HDF5 is 59 a more efficient file format for accessing and retrieving the contents of the file. 60 We developed the BD5 data format based on HDF5 for representing quantitative 61 data. A BD5 file is organized into two primary structures, datasets and groups. Datasets 62 are array-like objects that store numerical data, whereas groups are hierarchical 63 containers that store datasets and other groups. Detailed information on BD5 is 64 available at http://ssbd.qbic.riken.jp/bdml/. Here, we summarize the BD5 major datasets 65 and groups. BD5 format has one container named data (Fig. 1). It includes 66 ● scaleUnit dataset for the definition of spatial and time scales and units, 67 ● objectDef dataset for the definition of biological objects, 68 ● featureDef dataset for features of interest, 69 ● numbered groups (0, 1, … , n) corresponding to an index number of a 70 time-ordered sequence, 71 ● trackInfo dataset for the information of tracking of one object to another. 72 Each of the numbered groups corresponds to an index of a time-ordered sequence 73 that has object and feature groups. For a fixed time interval, the index will correspond bioRxiv preprint doi: https://doi.org/10.1101/2020.04.26.062976; this version posted April 28, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 5 74 to each sequential time point. For example, if the time interval is 2 minute, group 0 will 75 have t = 0 and group 1 will have t = 1 while tScale is 2 and tUnit is minute (Fig. 1). For 76 irregular time intervals, the index allows a time-ordered sequence to be saved and be 77 read in the correct order. If the first time is 0 minutes, the second time is 2 minutes, 78 while the third time is 7 minutes, then group 0 will have t = 0, group 1 will have t = 2 79 and group 2 will have t = 7. The tUnit is still minute, but the tScale in this case will be 80 1. 81 Each object group has numbered dataset(s) corresponding to the reference number 82 of the biological object(s) predefined under the objectDef dataset. Each row of the 83 numbered object includes an identifier of the object and its spatiotemporal information 84 such as time point and xyz-coordinates (Fig. 2). To represent biological objects such as 85 line and face entities that have an arbitrary number of xyz-coordinates in BD5 format, a 86 tabular dataset is used (Fig. 3). The multiple xyz-coordinates are represented by using a 87 sequential ID (sID) that allows us to connect the xyz-coordinates together to form a line 88 or a face within a biological object. 89 Each feature group has numbered dataset(s) corresponding to the reference 90 number of the object(s) predefined in the objectDef dataset. Each row of the numbered 91 object includes an identifier of the object, an identifier of the feature (fID) predefined in 92 featureDef, and the value of the feature (Fig. 4). This format allows objects that do not 93 possess all the features defined in featureDef to be recorded, because not all the features 94 can necessarily be measured in practical biological experiments. For example, in the 95 experiment in Fig. 4, an object may have information for fID = 1 (name: center-of-mass 96 GFP signal) but not fID = 0 (name: average GFP signal). bioRxiv preprint doi: https://doi.org/10.1101/2020.04.26.062976; this version posted April 28, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 6 97 The trackInfo dataset enables information of the objects to be linked between 98 different time points or time frames (Fig. 2). For example, when a cell at t = 0 divides 99 into two daughter cells at t = 1, it has links from the parent cell to the daughter cells. 100 The trackInfo dataset can be used to represent not only phenomena such as cell division 101 but also those such as cell fusion. 102 To allow the use of BD5 to describe quantitative data, we needed to update the 103 BDML format so that it could be used to describe the corresponding meta-information.