bioRxiv preprint doi: https://doi.org/10.1101/2020.04.26.062976; this version posted April 28, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 1

Research Article

BD5: an open HDF5-based data format to represent quantitative biological dynamics data

Koji Kyoda1,2, Kenneth H. L. Ho1,2, Yukako Tohsato1,2,3, Hiroya Itoga1 and Shuichi Onami1,2,*

1Laboratory for Developmental Dynamics, RIKEN Center for Biosystems Dynamics Research, Kobe 650-0047, Japan.

2Laboratory for Developmental Dynamics, RIKEN Quantitative Biology Center, Kobe 650-0047, Japan.

3Department of Information Science and Engineering, Ritsumeikan University, Shiga 525-8577, Japan.

*To whom correspondence should be addressed

Short title: BD5 data format for representing quantitative biological dynamics data

Abstract

BD5 is a new binary data format based on HDF5 (hierarchical data format version 5). It can be used for representing quantitative biological dynamics data obtained from bioimage informatics techniques and mechanobiological simulations. Biological Dynamics Markup Language (BDML) is an XML (Extensible Markup Language)-based open format that is also used to represent such data; however, it becomes difficult to access quantitative data in BDML files when the file size is large because parsing XML-based files requires large computational resources to first read the whole file sequentially into computer memory. BD5 enables fast random (i.e., direct) access to quantitative data on disk without parsing the entire file. Therefore, it allows practical reuse of data for understanding biological mechanisms underlying the dynamics.

Introduction

Recent advances in bioimage informatics and mechanobiological simulation techniques have led to the production of a large amount of quantitative data on the spatiotemporal dynamics of biological objects ranging from molecules to organisms [1]. A wide variety of such data can be described in an open unified data format, Biological Dynamics Markup Language (BDML), an Extensible Markup Language (XML)-based format [2]. BDML enables efficient development and evaluation of software tools for a wide range of applications.

The XML-based BDML format has the advantages of machine and human readability and of extensibility. However, accessing and retrieving data often becomes problematic when a BDML file is very large (e.g., our programs cannot load a BDML file over 20 GB on a standard workstation). This problem arises because parsing an XML-based file often requires large computational resources to first read the whole file sequentially into computer memory. In fact, many sets of quantitative data stored in the SSBD:database (Systems Science of Biological Dynamics database) [1] were divided into a series of BDML files, one for each time point, to allow software to read them efficiently. One solution to this problem is to use another approach such as the eXtensible Data Model and Format [3] or FieldML [4]. In these formats, the data itself is described in HDF5 binary format and meta-information about the data is described in XML format. HDF5 is a hierarchical data format for storing large scientific data sets (http://www.hdfgroup.org/HDF5/). It is widely used for describing various kinds of large-scale biological data [4-9].

Here, we describe the development of the BD5 data format, based on HDF5, for representing quantitative biological dynamics data in a manner that enables quick access and retrieval.

Materials and Methods

Design and implementation

Here, we extended BDML to support HDF5-based storage of quantitative biological dynamics data. In contrast to XML documents, the HDF5 format allows random (i.e., direct) access to parts of the file without parsing the entire contents. Therefore, HDF5 is a more efficient file format for accessing and retrieving the contents of the file.

We developed the BD5 data format based on HDF5 for representing quantitative data. A BD5 file is organized into two primary structures, datasets and groups. Datasets are array-like objects that store numerical data, whereas groups are hierarchical containers that store datasets and other groups. Detailed information on BD5 is available at http://ssbd.qbic.riken.jp/bdml/. Here, we summarize the BD5 major datasets and groups. BD5 format has one container named data (Fig. 1). It includes

● scaleUnit dataset for the definition of spatial and time scales and units,
● objectDef dataset for the definition of biological objects,
● featureDef dataset for features of interest,
● numbered groups (0, 1, … , n) corresponding to an index number of a time-ordered sequence,
● trackInfo dataset for the information of tracking of one object to another.

Each of the numbered groups corresponds to an index of a time-ordered sequence and has object and feature groups. For a fixed time interval, the index will correspond to each sequential time point. For example, if the time interval is 2 minutes, group 0 will have t = 0 and group 1 will have t = 1, while tScale is 2 and tUnit is minute (Fig. 1). For irregular time intervals, the index allows a time-ordered sequence to be saved and read in the correct order. If the first time is 0 minutes, the second time is 2 minutes, and the third time is 7 minutes, then group 0 will have t = 0, group 1 will have t = 2, and group 2 will have t = 7. The tUnit is still minute, but the tScale in this case will be 1.
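
The two indexing schemes above can be illustrated with a short Python sketch. It assumes that the physical time of a group is obtained by multiplying its stored t value by tScale, which is our reading of the examples rather than a statement taken from the BD5 specification; the function name real_time is hypothetical.

```python
def real_time(t_index, t_scale):
    """Physical time of a group, in units of tUnit (assumed: t * tScale)."""
    return t_index * t_scale


# Fixed interval of 2 minutes: groups store t = 0, 1, 2 and tScale = 2.
regular = [real_time(t, 2.0) for t in (0, 1, 2)]

# Irregular sampling at 0, 2 and 7 minutes: groups store t = 0, 2, 7 and tScale = 1.
irregular = [real_time(t, 1.0) for t in (0, 2, 7)]

print(regular)    # [0.0, 2.0, 4.0]
print(irregular)  # [0.0, 2.0, 7.0]
```

In both cases the order of the numbered groups preserves the order of acquisition, and only tScale changes how t is interpreted.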

Each object group has numbered dataset(s) corresponding to the reference number of the biological object(s) predefined under the objectDef dataset. Each row of a numbered dataset includes an identifier of the object and its spatiotemporal information such as time point and xyz-coordinates (Fig. 2). To represent biological objects such as line and face entities that have an arbitrary number of xyz-coordinates in BD5 format, a tabular dataset is used (Fig. 3). The multiple xyz-coordinates are represented by using a sequential ID (sID) that allows us to connect the xyz-coordinates together to form a line or a face within a biological object.
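
As an illustration of this layout, the following Python sketch writes one object dataset to an in-memory HDF5 file with the h5py library and then reads it back by its path. The path data/1/object/0 follows the hierarchy described above and the column names follow the example in Fig. 2; the column data types and the file name are assumptions made for this sketch and are not taken from the BD5 specification.

```python
import h5py
import numpy as np

# Columns as in the Fig. 2 example; the dtypes are assumptions for this sketch.
row_dtype = np.dtype([
    ("ID", "S6"), ("t", "i4"), ("entity", "S8"),
    ("x", "f8"), ("y", "f8"), ("z", "f8"),
    ("radius", "f8"), ("label", "S4"),
])

# Rows for time index 1, copied from the Fig. 2 example.
rows_t1 = np.array([
    (b"001002", 1, b"sphere", 380.0, 366.0, 16.1, 3.6, b"A"),
    (b"001003", 1, b"sphere", 387.0, 153.0, 19.6, 3.87, b"C"),
    (b"001004", 1, b"sphere", 386.0, 151.0, 13.1, 3.87, b"D"),
], dtype=row_dtype)

# In-memory HDF5 file (nothing is written to disk) standing in for a BD5 file.
with h5py.File("sketch_bd5.h5", "w", driver="core", backing_store=False) as f:
    # Intermediate groups (data, 1, object) are created automatically.
    f.create_dataset("data/1/object/0", data=rows_t1)

    # Direct access: only the dataset for time index 1, object reference 0 is read.
    nuclei = f["data/1/object/0"][:]
    ids = [row["ID"].decode() for row in nuclei]
    centers = [(float(row["x"]), float(row["y"]), float(row["z"])) for row in nuclei]

print(ids)  # ['001002', '001003', '001004']
print(centers)
```

Because the dataset is addressed by its path, the rows for other time points never need to be loaded.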

Each feature group has numbered dataset(s) corresponding to the reference number of the object(s) predefined in the objectDef dataset. Each row of a numbered dataset includes an identifier of the object, an identifier of the feature (fID) predefined in featureDef, and the value of the feature (Fig. 4). This format allows objects that do not possess all the features defined in featureDef to be recorded, because not all the features can necessarily be measured in practical biological experiments. For example, in the experiment in Fig. 4, an object may have information for fID = 1 (name: center-of-mass GFP signal) but not fID = 0 (name: average GFP signal).
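
The row-per-feature layout can be sketched in plain Python as follows. The two rows for object 000002 are taken from the Fig. 4 example; the row for object 000003 is hypothetical and is included only to show an object for which fID = 0 was not measured. The helper name features_by_object is also hypothetical.

```python
from collections import defaultdict

# featureDef as in Fig. 4: fID -> feature name.
feature_def = {0: "average GFP signal", 1: "center of mass GFP signal"}

# Rows of a feature dataset: (object ID, fID, value).
feature_rows = [
    ("000002", 0, 153.0),
    ("000002", 1, 214.0),
    ("000003", 1, 198.0),  # hypothetical row: fID 0 was not measured
]


def features_by_object(rows):
    """Group feature rows into {object ID: {fID: value}}."""
    table = defaultdict(dict)
    for object_id, fid, value in rows:
        table[object_id][fid] = value
    return dict(table)


table = features_by_object(feature_rows)
for object_id, values in table.items():
    for fid, name in feature_def.items():
        print(object_id, name, values.get(fid, "not measured"))
```

An unmeasured feature is simply absent from the dataset, so no placeholder value is needed.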

The trackInfo dataset enables information of the objects to be linked between different time points or time frames (Fig. 2). For example, when a cell at t = 0 divides into two daughter cells at t = 1, the trackInfo dataset holds links from the parent cell to each of the daughter cells. The trackInfo dataset can be used to represent not only phenomena such as cell division but also those such as cell fusion.
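
A minimal Python sketch of how such links might be interpreted is given below. It uses the three (from, to) pairs of the Fig. 2 example and treats a source with more than one target as a division and a target with more than one source as a fusion; this classification rule and the function name classify_links are assumptions for illustration and are not part of the BD5 specification.

```python
from collections import defaultdict

# (from, to) pairs as in the trackInfo example of Fig. 2.
track_info = [
    ("000002", "001002"),
    ("000003", "001003"),
    ("000003", "001004"),
]


def classify_links(links):
    """Return (divisions, fusions) found in a list of (from, to) links."""
    children = defaultdict(list)
    parents = defaultdict(list)
    for source, target in links:
        children[source].append(target)
        parents[target].append(source)
    # One source linked to several targets: division.
    divisions = {s: t for s, t in children.items() if len(t) > 1}
    # Several sources linked to one target: fusion.
    fusions = {t: s for t, s in parents.items() if len(s) > 1}
    return divisions, fusions


divisions, fusions = classify_links(track_info)
print(divisions)  # {'000003': ['001003', '001004']}
print(fusions)    # {}
```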

To allow the use of BD5 to describe quantitative data, we needed to update the BDML format so that it could be used to describe the corresponding meta-information. The latest version of BDML (version 3.0) can handle an external file by using the extFile element (Fig. 5). The bd5File element that we introduced within the extFile element can be used to point to an external BD5 file. In addition, this update allows the designation of multiple contact persons and the use of ORCID (https://orcid.org), a unique persistent digital identifier.

Results

Validation

To evaluate the performance of the BD5 format, we first compared the time for accessing the file between XML- and HDF5-based files (i.e., between pairs of BDML and BD5 files containing equivalent data). We measured the time for accessing coordinate data at a randomly selected time point in the BDML and BD5 files (334 pairs of files) by using a Python-based program (Fig. 6a). The results indicate that the access times of HDF5-based files were consistently shorter than those of the corresponding XML-based files. Therefore, BD5, the new HDF5-based format, enables practical access to quantitative data for further analysis.

File size can be a critical benchmark for a data format because the transfer of large files often fails. Therefore, we next compared disk space requirements between the XML- and HDF5-based files by comparing the sizes of BDML and BD5 files (450 pairs of files) (Fig. 6b). The BD5 format reduced the file size by ~85% compared with the BDML format when the data is large. When the data is small (< 300 MB), the size of the BD5 file is close to, but still less than, that of the corresponding BDML file. Because the size of HDF5-based files for large data is much less than that of the equivalent XML-based files, the BD5 format enables, in theory, fast transfer of large quantitative data to and from computers on the network and on the internet.

In addition, we determined the relationships between access time and file size for BDML and BD5 files (Fig. 7). With BD5, access to the coordinate data was fast even when the file size was large. This fast data access results from random access to the data in the file. With BDML, the access time increased linearly with file size. This result suggests that parsing of XML was the main bottleneck of data access. Quantitative biological dynamics data tends to be large due to the advances in live-cell imaging techniques and imaging equipment. We anticipate that BD5 will play a key role in fast access to such large data sets.

Software tools and usage related to BD5

So that BD5-based tools can be used for data stored in older BDML files, we provide a C++-based software tool named BDML2BD5. By using this tool, BDML files can be converted into BD5 files. To compile the tool, the HDF5 library is required for writing HDF5 data, and CodeSynthesis XSD (http://www.codesynthesis.com/products/xsd/) is required as the compiler that generates the C++ data binding from the BDML schema. All source code and the executable file of BDML2BD5 are available at https://github.com/openssbd/BDML2BD5/.

We also provide a program, bd5lint, for detecting bugs and inconsistencies in BD5 files. The program checks the structure of BD5 files, and checks that the ordered numbered datasets in object and feature groups correspond to the reference numbers of the objects and features predefined in the objectDef and featureDef datasets. It also checks the consistency between the declared dimensions and the actual dimensions used within the datasets. It provides type checking of the data and issues error warnings if the data do not conform to the BD5 specification. The Python source code is available at https://github.com/openssbd/bd5lint/.

We also provide several Python-based programs for data analysis using BD5 files. These programs are available as Jupyter Notebook files at https://github.com/openssbd/BDML-BD5/. An example is a program that counts the number of biological objects in each numbered group of the time-ordered sequence. By using this program, we can obtain the proliferation curve of Caenorhabditis elegans embryogenesis. The program can be modified to obtain similar information for other organisms such as Danio rerio and Drosophila melanogaster.
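
The idea behind such a counting program can be sketched with h5py as follows. This is not the code distributed in the repository above; it is a minimal illustration that assumes the data/<index>/object/<reference> layout described in Materials and Methods, uses a hypothetical function name count_objects, and runs on synthetic placeholder rows (1, 2, 4 and 8 objects) rather than on real embryo data.

```python
import h5py
import numpy as np


def count_objects(bd5, object_ref="0"):
    """Return {time index: number of rows} for one object reference."""
    counts = {}
    for name, group in bd5["data"].items():
        if not name.isdigit():
            continue  # skip scaleUnit, objectDef, featureDef and trackInfo
        # Only the dataset shape is inspected; no rows are read.
        counts[int(name)] = group["object"][object_ref].shape[0]
    return dict(sorted(counts.items()))


# Synthetic in-memory file: 1, 2, 4 and 8 placeholder objects at indices 0 to 3.
with h5py.File("sketch_counts.h5", "w", driver="core", backing_store=False) as f:
    for t, n in enumerate([1, 2, 4, 8]):
        f.create_dataset(f"data/{t}/object/0", data=np.zeros((n, 3)))
    f.create_dataset("data/trackInfo", data=np.zeros((1, 2)))
    curve = count_objects(f)

print(curve)  # {0: 1, 1: 2, 2: 4, 3: 8}
```

Plotting the counts against the time index (scaled by tScale) would give a proliferation curve of the kind described above.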

Discussion

In this study, we developed a new data format, BD5, based on HDF5 for representing quantitative biological dynamics data. Compared with BDML, which is based on XML, the BD5 format has two advantages: (a) fast access to and retrieval of quantitative data because of random access to the HDF5-based file, and (b) fast transfer of files containing large quantitative data because the file size is dramatically reduced. A drawback of BD5 is that its human readability is low compared with that of the BDML format. BD5 files cannot be read in text editors because the files are binary. However, the HDF Group provides a software tool named HDFView that enables the user to open and read all HDF5-based files (https://www.hdfgroup.org/downloads/hdfview/). This tool can compensate for the lack of human readability.

The BD5 format has already been used in the latest version of SSBD:database (http://ssbd.qbic.riken.jp), which is one of the major databases for sharing bioimage data and quantitative biological dynamics data [10]. Over 687 files, which include a wide variety of quantitative biological dynamics data from molecules to cells to organisms, are available. This demonstrates that the BD5 format has high functionality and flexibility for representing quantitative biological dynamics data. SSBD:database also provides a RESTful application programming interface (API), which allows applications to access data and interact with external software tools, through the use of the web service h5serv (https://github.com/HDFGroup/h5serv). This enables SSBD:database to provide a web service for users to access quantitative data stored in BD5 files (http://ssbd.qbic.riken.jp/restfulapi/). Because HDF5 and XML are supported by many software platforms, BD5 is a promising data format for storing quantitative biological dynamics data.

Like BDML, BD5 can represent quantitative biological dynamics data that is associated with, but independent of, microscopy images. Such data has often been represented as regions of interest (ROIs) on the corresponding microscopy images; for example, the ROIs in the OME data model (https://docs.openmicroscopy.org/ome-model/) and segmentation channels in Cell Feature Explorer (https://cfe.allencell.org). However, not all data can be represented as an ROI on a microscopy image. For example, in an automated cell lineage tracing study of Caenorhabditis elegans, each nucleus was represented as a sphere with center and radius, independently of the z-stack images [11]. The flexible representation offered by BD5 (and also BDML) enables us to represent quantitative biological dynamics data obtained not only from bioimage informatics but also from mechanobiological simulation techniques.

Funding

This work was supported in part by the National Bioscience Database Center (NBDC) of the Japan Science and Technology Agency (JST); Core Research for Evolutionary Science and Technology (CREST) Grant Number JPMJCR1511, JST; JSPS KAKENHI Grant Number JP18H05412; the Strategic Programs for R&D (President’s Discretionary Fund) of RIKEN, Japan; and Open Science Platform, RIKEN, Japan.

Acknowledgements

We are grateful to the members of the Onami laboratory, RIKEN Center for Biosystems Dynamics Research, Japan for feedback and discussions.

REFERENCES

1. Tohsato Y, Ho KH, Kyoda K, Onami S. SSBD: a database of quantitative data of spatiotemporal dynamics of biological phenomena. Bioinformatics. 2016;32:3471-3479.

2. Kyoda K, Tohsato Y, Ho KH, Onami S. Biological Dynamics Markup Language (BDML): an open format for representing quantitative biological dynamics data. Bioinformatics. 2015;31:1044-1052.

3. Clarke JA, Mark ER. Enhancements to the eXtensible Data Model and Format (XDMF). Proceedings of the 2007 DoD High Performance Computing Modernization Program Users Group Conference; 2007 Jun; Washington DC, USA. IEEE Computer Society. pp. 322–327.

4. Britten RD, Christie GR, Little C, Miller AK, Bradley C, Wu A, et al. FieldML, a proposed open standard for the Physiome project for mathematical model representation. Med Biol Eng Comput. 2013;51:1191-1207.

5. Baker M. Quantitative data: learning to share. Nature Meth. 2012;9:39-41.

6. Dougherty MT, Folk MJ, Zadok E, Bernstein HJ, Bernstein FC, Eliceiri KW, et al. Unifying biological image formats with HDF5. Commun ACM. 2009;52:42-47.

7. Hoffman MM, Buske OJ, Noble WS. The Genomedata format for storing large-scale functional genomics data. Bioinformatics. 2010;26:1458-1459.

8. Millard BL, Niepel M, Menden MP, Muhlich JL, Sorger PK. Adaptive informatics for multifactorial and high-content biological data. Nature Meth. 2011;8:487-492.

9. Wilhelm M, Kirchner M, Steen JA, Steen H. mz5: space- and time-efficient storage of mass spectrometry data sets. Mol Cell Proteomics. 2012;11:O111.011379.

10. Dance A. Find a home for every imaging data set. Nature. 2020;579:162-163.

11. Bao Z, Murray JI, Boyle T, Ooi SL, Sandel MJ, Waterston RH. Automated cell lineage tracing in Caenorhabditis elegans. Proc Natl Acad Sci USA. 2006;103:2707-2712.

Figure 1. Outline of the BD5 data format. The data group includes scaleUnit, objectDef, featureDef, and trackInfo datasets; each data group is numbered to correspond to the index number of the time-ordered sequence. Each numbered group has spatial information about biological objects and numerical information about features related to the objects. Solid and dashed boxes represent the required and optional elements, respectively.

Figure 2. An example of the description of the spatiotemporal information of biological objects and their tracking information. The dataset name in object group corresponds to the identifier (ID) of the biological object (red). Each row in the dataset must have a unique ID and its spatiotemporal information. A label can optionally be attached for each object. The tracking information including object divisions and fusions can be stored in trackInfo dataset.

Figure 3. An example of the description of the spatiotemporal information based on line (a) and face entities (b). The sequential identifier (sID) represents a set of coordinates that can be connected beginning at the top to describe an entity within one biological object.

Figure 4. An example of the description of the feature information related to biological objects. The example object is a nucleus expressing green fluorescent protein (GFP) at t = 0 in a time series. The dataset name in feature group corresponds to the identifier (ID) of the biological object. Each row in the dataset has object ID, feature fID (blue), and the feature value. In this example, fID is 0 or 1 depending on whether the data is average or center-of-mass GFP signal, respectively. Object ID is 0 if the object is a nucleus. Feature value is the fluorescence intensity expressed in a.u. (arbitrary units).

Figure 5. A skeleton of a BDML version 3.0 file for describing meta-information. This version allows the use of an external file for describing the data itself, designation of multiple contact persons, and the use of ORCID, a unique persistent digital identifier of the research scientist.

Figure 6. Comparison between BD5 and BDML data formats. a) Access times of the BDML and BD5 files. Access time was measured as the time for accessing and displaying xyz-coordinate data at a randomly selected time point stored in the BDML and BD5 files. The time was measured on an Intel Xeon CPU 2.8 GHz processor with 32 GB of main memory. Each dot represents a biological quantitative data set. We used 334 biological quantitative data sets, each of which has coordinate data and is stored in SSBD:database as a single BDML file. BD5 files were generated from the BDML files by using the BDML2BD5 program. b) Size of the BDML and BD5 files. Each dot represents a biological quantitative data set. In this comparison, we used 450 biological quantitative data sets stored in SSBD:database as BDML files. As above, the BD5 files were generated from the BDML files by using the BDML2BD5 program. The dashed line represents the linear regression line for all dots. The data within the small rectangle near the origin in the large graph is plotted on expanded axes in the inset.

Figure 7. Relationship between access time and file size for BDML and BD5 files. The time for accessing and displaying coordinate data at a randomly selected time point is plotted against file size. Each cross represents a BDML file; each dot represents a BD5 file. In the comparison, we used BDML and BD5 files of the 334 biological quantitative data sets described in Fig. 6a.

Figure 1: Kyoda et al. [Diagram of the BD5 hierarchy. The data group contains scaleUnit, objectDef, featureDef, the numbered groups of the time-ordered sequence (0, 1, 2, …), each with object and feature groups, and trackInfo. Example contents:]

scaleUnit
dimension  xScale  yScale  zScale  sUnit       tScale  tUnit
3D+T       0.09    0.09    1.0     micrometer  2.0     minute

objectDef
oID  name
0    nucleus

featureDef
fID  name                       fUnit
0    average GFP signal         a.u.
1    center of mass GFP signal  a.u.

Figure 2: Kyoda et al. [Diagram of nuclei A and B at t = 0 and A, C and D at t = 1, with objectDef (oID 0, name nucleus). Example contents:]

data/0/object/0
ID      t  entity  x    y    z     radius  label
000002  0  sphere  380  366  16.1  3.6     A
000003  0  sphere  387  153  16.6  3.87    B

data/1/object/0
ID      t  entity  x    y    z     radius  label
001002  1  sphere  380  366  16.1  3.6     A
001003  1  sphere  387  153  19.6  3.87    C
001004  1  sphere  386  151  13.1  3.87    D

trackInfo
from    to
000002  001002
000003  001003
000003  001004

Figure 3: Kyoda et al. [Diagrams of a line entity E (a) and a face entity F (b). Example contents:]

a
ID      t  entity  sID  x  y   z  label
000001  0  line    0    2  1   0  E
000001  0  line    0    2  10  0  E
000001  0  line    0    9  13  0  E
000001  0  line    0    8  4   0  E
000001  0  line    0    2  1   0  E

b
ID      t  entity  sID  x  y  z  label
000100  0  face    0    1  1  1  F
000100  0  face    0    2  3  0  F
000100  0  face    0    3  1  2  F
000100  0  face    1    2  3  0  F
000100  0  face    1    4  5  4  F
000100  0  face    1    3  1  2  F
…

Figure 4: Kyoda et al. [Diagram of nucleus A with its GFP signal at t = 0. Example contents:]

featureDef
fID  name                       fUnit
0    average GFP signal         a.u.
1    center of mass GFP signal  a.u.

data/0/object/0
ID      t  entity  x    y    z     radius  label
000002  0  sphere  380  366  16.1  3.6     A

data/0/feature/0
ID      fID  value
000002  0    153
000002  1    214

Figure 5: Kyoda et al. [XML skeleton; the markup was lost in extraction. Recoverable values: contact name Shuichi Onami, ORCID 0000-0002-8255-1724, external BD5 file wt_N2_030116_02_bd5.h5.]

Figure 6: Kyoda et al. [Two scatter plots. a) Horizontal axis: Access time for BDML file (s), 0 to 120; vertical axis: Access time for BD5 file (s), 0 to 120. b) Horizontal axis: BDML file size (bytes), 0 to 3E+10; vertical axis: BD5 file size (bytes), 0 to 3E+10.]

Figure 7: Kyoda et al. [Scatter plot. Horizontal axis: File size (bytes), 0 to 3.E+08; vertical axis: Access time (s), 0 to 120; legend: BDML files, BD5 files.]