SeisIO: a fast, efficient geophysical data architecture for the Julia language

Joshua P. Jones¹*, Kurama Okubo², Tim Clements², and Marine A. Denolle²

¹ 4509 NE Sumner St., Portland, OR, USA

² Department of Earth and Planetary Sciences, Harvard University, MA, USA

* Corresponding author: Joshua P. Jones ([email protected])

Abstract

SeisIO for the Julia language is a new geophysical data framework that combines the intuitive syntax of a high-level language with performance comparable to FORTRAN or C. Benchmark comparisons with recent versions of popular programs for seismic data download and analysis demonstrate significant improvements in file read speed and orders-of-magnitude improvements in memory overhead. Because the Julia language natively supports parallel computing with an intuitive syntax, we benchmark parallel download and processing of multi-week segments of contiguous data from two sets of 10 broadband seismic stations, and find that SeisIO outperforms two popular Python-based tools for data downloads. The current capabilities of SeisIO include file read support for several geophysical data formats; online data access using FDSN web services, IRIS web services, and SeisComP SeedLink; and optimized versions of several common data processing operations. Tutorial notebooks and extensive documentation are available to improve the user experience (UX). As an accessible example of performant scientific computing for the next generation of researchers, SeisIO offers ease of use and rapid learning without sacrificing computational performance.

1 Introduction

The dramatic growth in the volume of collected geophysical data has the potential to lead to tremendous advances in the science (https://ds.iris.edu/data/distribution/). Leveraging the data revolution to gain knowledge that is useful for earthquake science, hydrology, industry, and climate science requires new tools to help Earth scientists extract meaningful information from arbitrarily large data sets. High-performance computing is necessary to manage the scale of these problems; however, this requires specialized training at the undergraduate and graduate levels, and such training is rarely part of undergraduate-level science curricula. On the other hand, open-source computing languages (e.g., Python) and codes (e.g., ObsPy; Beyreuther et al., 2010) have standardized seismic data processing and improved access to seismic data analysis for a new generation of seismologists. However, these tools suffer from slow computation time and inefficient memory allocation at scale. Therefore, the geophysics community needs a computational framework that is simultaneously easy to learn and efficient.

The Julia language combines the syntactic ease of high-level languages like MATLAB and Python with the performance of FORTRAN and C. Developed for fast, efficient numerical computing, Julia version 1.0.0 was released in August 2018, while the first beta version appeared in February 2012 (Bezanson et al., 2017, 2018). The language is known for impressive speed and computational efficiency: while still in beta testing, Julia became the fourth programming language to achieve a petaflop, after FORTRAN, C, and C++ (Regier et al., 2019; Perkel, 2019). Despite its relative youth, Julia supports a growing collection of open-source modules for numerical and scientific computing. Julia wrappers to C, FORTRAN, and Python allow seamless execution of external code, and third-party packages (https://github.com/JuliaInterop) extend interoperability to C++, Java, Mathematica, and MATLAB, including the ability to read .mat files (https://github.com/JuliaIO/MAT.jl).

2 SeisIO

The SeisIO package was created in May 2016 with the goal of rapid, efficient analysis of univariate geophysical data in the Julia language, using comprehensible, uniform syntax, and simple but powerful commands. Its design allows users to read univariate data from arbitrary instruments (e.g., seismic, geodetic, gas flux) into a single structure, including gapped and irregularly-sampled data. In the subsections below, we describe the capabilities of SeisIO, conduct benchmark tests, and introduce tutorials.

2.1 Capabilities

SeisIO includes well-tested read support for many geophysical time-series formats (Table 1). Readers for all formats but ASDF strictly use the Julia language; ASDF uses wrappers to libhdf5, written in C. Current data processing operations include filling time gaps, removing the mean and linear trend, band-pass filtering, instrument response translation and removal (i.e., flattening to DC), resampling, cosine tapering, merging, seismogram differentiation/integration, and time synchronization. Tools for online acquisition support FDSN services (station, event, and dataselect), IRIS time-series requests, SeisComP SeedLink, and the IRIS TauP interface (Crotwell et al., 1999).

SeisIO has been officially listed in the Julia package ecosystem since early 2019. Automated testing with Travis-CI (https://travis-ci.org/) and AppVeyor (https://www.appveyor.com/) supports Linux, Mac OS, and Windows installations. Code coverage estimates of 97-98% on Codecov (https://codecov.io/) and Coveralls (https://coveralls.io/) exceed the 95% coverage threshold typical of enterprise-level commercial software releases, yet both Julia and SeisIO are free.

2.2 Installation

Typical installation of the Julia language, SeisIO, and all dependencies requires three steps:

1. Download and install the Julia language from https://julialang.org/downloads/
   • The Julia install directory will be denoted (juliaroot) hereafter.
   • (juliaroot) is typically a pattern like /home/username/julia-v.v.v/ in Linux, e.g., /home/josh/julia-1.1.0/.
2. Start the Julia command-line interface (CLI) with (juliaroot)/bin/julia
3. Type or copy: using Pkg; Pkg.add("SeisIO"); using SeisIO

Julia installs package dependencies automatically when Pkg.add is invoked. There is no need for dedicated environments or session-specific user settings; however, FFT performance can sometimes be improved by starting Julia in parallel-ready mode with (juliaroot)/bin/julia --procs auto. Total disk space required is typically under 4 GB: 300-400 MB for Julia; 4.2 MB for SeisIO v0.4.1; 300 MB for optional test and benchmark data; and 1-3 GB for a typical set of scientific computing packages. The last space requirement is much lower for non-Windows users who manually link existing libraries and software (e.g., BLAS, Conda, FFTW) to Julia, but this is only recommended for experienced Linux users.

2.3 SeisIO Data Structure

SeisIO is designed around easy, fluid, and fast data access. For example, a complete sequence of commands to download and process channel data can be executed in one function call with keywords:

julia> S = get_data("FDSN", "UW.LON..BH?", src="IRIS", s="2019-01-01", t=3600, detrend=true, rr=true, w=true)

SeisData with 3 channels (2 shown)
    ID: UW.LON..BHE                        UW.LON..BHN                        ...
  NAME: Longmire CREST broad-band          Longmire CREST broad-band          ...
   LOC: 46.7506 N, -121.81 E, 853.0 m      46.7506 N, -121.81 E, 853.0 m      ...
    FS: 40.0                               40.0                               ...
  GAIN: 7.51485e8                          7.51485e8                          ...
  RESP: a0 1.0, f0 1.0, 1z, 1p             a0 1.0, f0 1.0, 1z, 1p             ...
 UNITS: m/s                                m/s                                ...
   SRC: http://service.iris.edu/fdsnws/da  http://service.iris.edu/fdsnws/da  ...
  MISC: 4 entries                          4 entries                          ...
 NOTES: 2 entries                          2 entries                          ...
     T: 2019-01-01T00:00:00.010 (0 gaps)   2019-01-01T00:00:00.010 (0 gaps)   ...
     X: -1.511e+03                         +4.669e+03                         ...
        -1.512e+03                         +4.699e+03                         ...
        ...                                ...
        +1.540e+03                         +7.483e+02                         ...
        (nx = 144000)                      (nx = 144000)                      ...
     C: 0 open, 0 total

This example downloads 3600 seconds of data beginning 2019-01-01 00:00:00 (UTC) using FDSN dataselect with the IRIS DMC server. The keyword “detrend” removes the linear trend after download; “rr” removes (flattens to DC) the instrument response and replaces the .resp field of each channel with an all-pass filter. The keyword “w” writes the download directly to disk before processing. Access to data properties is straightforward and intentionally simple: for example, in all time-series data structures, the field .x holds univariate data.
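
To make the effect of the “detrend” keyword concrete, the following is a minimal sketch of linear trend removal on a plain vector standing in for a channel's .x field. It uses only the Julia standard library; the function name remove_linear_trend and the synthetic series are assumptions made for illustration, and the code does not reproduce SeisIO's own implementation.

```julia
using Statistics

# Hypothetical helper (not part of SeisIO): remove the least-squares linear
# trend from a regularly sampled series x and return a new vector.
function remove_linear_trend(x::AbstractVector{<:Real})
    n = length(x)
    t = collect(1.0:n)                 # sample index used as the time axis
    tm, xm = mean(t), mean(x)
    # slope b and intercept a of the least-squares line x ≈ a + b*t
    b = sum((t .- tm) .* (x .- xm)) / sum((t .- tm) .^ 2)
    a = xm - b * tm
    return x .- (a .+ b .* t)
end

# Synthetic stand-in for one channel's .x field: sinusoid plus offset and drift
fs = 40.0                              # sampling rate in Hz, as in the example above
n  = 4000
x  = [sin(2π * i / fs) + 5.0 + 0.01 * i for i in 1:n]
y  = remove_linear_trend(x)
```

After this step the series has zero mean and no residual linear drift, which is the state a channel is expected to be in before filtering or response removal.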

2.4 Tutorials

A SeisIO tutorial is available from the project GitHub site, with three short, interactive Jupyter notebooks designed to take 5-10 minutes each. A few additional commands in the Julia CLI are required to run interactive notebooks:

    using Pkg
    Pkg.add(["Dates", "IJulia"])
    using IJulia
    cd(dirname(pathof(SeisIO))*"/../tutorial/")
    jupyterlab(dir=pwd())

The three tutorials are:

Part_1-Basic.pynb: introduction to SeisIO
Part_2-Data_Acquisition.pynb: downloading data & reading files
Part_3-Processing.pynb: data processing

Researchers familiar with MATLAB/Octave or Python will find Julia syntax intuitive and may need only the language’s official documentation to begin coding. For those who want more guidance, many Julia-language tutorials can be downloaded from https://julialang.org/learning/.

3 Benchmarking

We conduct a series of benchmark tests on a 64-bit personal computer equipped with an Intel DH67CL motherboard, i7-2600 (3.4 GHz) CPU, and 16 GB Kingston DDR3 RAM, running Julia v1.1.0 on 64-bit Ubuntu Linux 18.04.3 (kernel 5.0.0-29). File read tests (Table 2) use SeisIO v0.4.1 and BenchmarkTools.jl with 100 samples per benchmark and one evaluation per sample. Because Julia uses a JIT compiler, an initial compile run precedes each test. The results shown in Fig. 1 suggest that read time and memory use scale quasi-linearly with file size.
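
The timing protocol described above (one untimed compile run, then 100 samples with one evaluation each) can be sketched with the standard library alone. This is an illustration of the protocol, not the benchmark script used for Table 2, which relies on BenchmarkTools.jl; the helper median_runtime_ms and the stand-in workload are hypothetical.

```julia
using Statistics

# Hypothetical helper: call f() once untimed so that JIT compilation is
# excluded, then time `nsamples` single evaluations and return the median.
function median_runtime_ms(f::Function; nsamples::Int=100)
    f()                                     # warm-up run (triggers compilation)
    times = Vector{Float64}(undef, nsamples)
    for i in 1:nsamples
        times[i] = 1000 * (@elapsed f())    # one evaluation per sample, in ms
    end
    return median(times)
end

# Stand-in workload; a real test would call a file reader here instead
workload() = sum(sqrt, 1:100_000)
t_med = median_runtime_ms(workload)
```

Reporting the median rather than the mean limits the influence of occasional slow samples caused by background system activity.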

3.1 File Reads

We now compare SeisIO read speeds with those of two popular, well-established seismic data packages: ObsPy for Python (Beyreuther et al., 2010; Megies et al., 2011) and SAC (Goldstein et al., 2003; Goldstein and Snoke, 2005). Comparative memory usage is shown in Fig. 2 and median read times for 100-trial test sets are shown in Fig. 3. For these tests, ObsPy v1.1.1 uses a dedicated Python 3.7.3 environment created with Conda 4.7.12; benchmarks use timeit.py and memory-profiler 0.55.0 with child processes included in memory estimates. ASDF files are benchmarked with pyasdf v0.5.1. SAC v106.a is compiled from source on the test machine and benchmarked with perf v5.0.21 and time -v; the median time and memory required to start and exit SAC without executing commands are subtracted from the test values.

We compare programs on every test in Table 2 for which a file reader exists. Comparisons with SAC are limited because SAC only reads two of these formats. ObsPy has no reader for PASSCAL, SUDS, or UW, and the ObsPy ASCII reader is incompatible with GeoCSV variants on time-series pair (tspair, ASCII) data. The ObsPy WIN reader could not read our test files, even though our data were downloaded directly from Hinet and integrity-checked by comparing with output from wintosac (http://wwweic.eri.u-tokyo.ac.jp/cgi-bin/show_man_en?wintosac). Thus, all possible comparisons with our benchmarks are shown in Figs. 2 & 3.

SeisIO uses less memory and reads files more quickly than both SAC and ObsPy; the comparison with SAC is especially noteworthy because of SAC’s low-level coding. With the exception of ASDF read times, which differ by < 4%, performance differences cannot be explained by random variations in system background activity. Fig. 2 suggests that ObsPy has a considerable amount of static memory overhead associated with each file read, which may explain some read time differences (e.g., Fig. 3). The closest read times to SeisIO are obtained with ASDF, for which pyasdf also uses wrappers to libhdf5. The larger of the two mini-SEED benchmarks is also roughly comparable; notably, because the ObsPy mini-SEED reader is a wrapper to libmseed for C (Trabant, 2019), both the ObsPy and SAC comparisons strongly support the claim that well-optimized Julia code can outperform well-optimized C, even with Julia’s high-level syntax, undaunting UX, and JIT compiler.

3.2 Download Throughput

With the data requirements of modern analysis techniques, download throughput is an increasingly important consideration when choosing data acquisition software. We benchmark download throughput using SeisIO and two popular Python tools: ObsPy and ROVER v1.0.4 (developed by IRIS-DMC and available at https://iris-edu.github.io/rover/). ROVER has built-in options for multi-worker SQL requests. We use mpi4py with the NoisePy noise-correlation toolbox (https://github.com/mdenolle/NoisePy; Jiang et al., in prep) to parallelize ObsPy downloads. For SeisIO, we use the SeisDownload.jl module (https://github.com/kura-okubo/SeisDownload.jl, version 1.2.0, last accessed 2019/10/02), developed to leverage Julia’s built-in parallelization function pmap.
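
For readers unfamiliar with pmap, the sketch below shows the general pattern of distributing independent requests over worker processes using Julia's Distributed standard library. The function fake_download and the station codes are hypothetical placeholders that perform no network access; this is not the SeisDownload.jl implementation.

```julia
using Distributed

# Hypothetical stand-in for a per-request download task; a real task would
# fetch and save data for one station and time window.
@everywhere fake_download(req) = (station = req, nbytes = 1_000 * length(req))

requests = ["STA$(i)" for i in 1:10]        # ten hypothetical station codes
results  = pmap(fake_download, requests)    # spread tasks over available workers
total_bytes = sum(r.nbytes for r in results)
```

If Julia is started with worker processes (for example with --procs auto, as noted in Section 2.2), pmap assigns the requests to those workers; with no workers it runs them on the main process, so the same code works in both cases.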

This benchmark test uses publicly-available data from three-component broadband seismograph stations archived at the IRIS DMC and the Northern California Earthquake Data Center (NCEDC). Each test uses 10 stations; download sizes are 7 GB for the TA network and 17 GB for the BP network. For the IRIS-DMC test, we use 8 worker CPUs to match server-side connection limits and the maximum workers available in NoisePy. The request comprises 16 days of continuous data sampled at 40 Hz. For the NCEDC test, we request 3-month segments of seismic data sampled at 20 Hz from stations in the Berkeley Parkfield (BP) High Resolution Seismic Network using SeisIO and ObsPy. Tests are performed using a 32-core Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90 GHz with 64 GB RAM.

The computation time for the tests includes the data request from the remote server and conversion to mini-SEED format. The download efficiency is defined as the total amount of downloaded data divided by the total computation time, in MB/s. No preprocessing (e.g., detrending, tapering, filtering) is applied.

Figure 4 shows the download efficiency. The download efficiency of SeisIO can reach 3.3× that of ObsPy, in agreement with standard microbenchmarks of the Julia language (Bezanson et al., 2017). In the IRIS-DMC benchmark, the scaling of download speed with number of workers follows a power law with an exponent of 1.06 for ROVER, 0.97 for ObsPy, and 0.92 for SeisIO with the TA network (Figure 4a); in the NCEDC benchmark, the scaling exponents are 0.92 for ObsPy and 0.96 for SeisIO (Figure 4b). In larger downloads, where the computation time required to allocate workers is negligible compared to that of the data download itself, we find that the scaling exponent converges to 1.0. Therefore, the Julia language appears well-optimized for parallel computation using only built-in functions (pmap).
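
As a worked illustration of the two quantities used in this section, the sketch below computes download efficiency in MB/s and estimates a power-law scaling exponent as the least-squares slope in log-log space. The timings are synthetic values generated for the example, not measurements from Figure 4, and the assumption that the exponent is fitted in log-log space is ours.

```julia
using Statistics

# Download efficiency as defined in the text: data volume / elapsed time [MB/s]
efficiency(mb, seconds) = mb / seconds

# Least-squares slope of log(efficiency) against log(number of workers);
# for efficiency ∝ workers^p, this slope estimates the exponent p.
function scaling_exponent(workers, eff)
    lx, ly = log.(workers), log.(eff)
    mx, my = mean(lx), mean(ly)
    return sum((lx .- mx) .* (ly .- my)) / sum((lx .- mx) .^ 2)
end

# Synthetic timings (not measured data): 7000 MB with near-ideal scaling
workers = [1, 2, 4, 8]
seconds = 7000.0 ./ (2.0 .* workers .^ 0.95)
eff = efficiency.(7000.0, seconds)
p = scaling_exponent(workers, eff)
```

An exponent of 1.0 corresponds to ideal scaling, in which doubling the number of workers doubles the download efficiency; values below 1.0 indicate overhead that grows with the number of workers.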

3.3 Processing Example: Instrument Response Removal

The removal of an instrument response function is a general processing operation that converts recorded counts or Volts to the approximate physical units of measure, such as ground velocity (m/s), at frequencies from DC to the Nyquist frequency. This is a common preprocessing step in seismic data analysis, e.g., when comparing and/or cross-correlating waveforms recorded by different instruments (e.g., Bensen et al., 2007). We use the computational efficiency of response removal as an example processing operation and perform comparative benchmark tests using ObsPy and SeisIO.

The test data comprise a one-day digital seismogram from channel TA.121A.HHZ, network TA and station name 121A, sampled at 100 Hz. Data are bandpass filtered before removing the instrument response, with a 4-corner cosine taper in ObsPy and a Butterworth filter in SeisIO. To ensure that the test measures a single processing step, the bandpass operation is not timed. We test on a single-core computer with an Intel(R) Core i5 CPU @ 3.4 GHz with 8 GB RAM.

Figure 5a shows computation times for file read and response removal. We conducted 100 trials of each process; mean values are shown, with standard deviations as error bars. The speedup of SeisIO is 1.6× relative to ObsPy for reading data, consistent with the results of test MSEED-1 in Figure 3; the speedup is 6.8× for instrument response removal. Figure 5b shows a graphical comparison of output waveforms, demonstrating the agreement between ObsPy and SeisIO. Although the differences near the edges of each trace are large compared to the middle, the artifacts can be adequately suppressed by cosine tapering before removing the instrument response (Figure 5b top). In this test, the first and last 0.2% of samples in each window are tapered with both ObsPy and SeisIO. The small misfit in amplitude and/or phase arises from differences in filtering strategies.
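
The edge taper mentioned above can be sketched as follows: a cosine ramp is applied to the first and last 0.2% of samples so that the trace goes smoothly to zero at both ends. The function cosine_taper! is a hypothetical helper written for this illustration; it is not the taper routine of SeisIO or ObsPy, whose exact window shapes may differ.

```julia
# Hypothetical helper: apply a cosine taper, in place, to the first and last
# fraction `frac` of the samples in x.
function cosine_taper!(x::AbstractVector{<:AbstractFloat}, frac::Real=0.002)
    n = length(x)
    m = max(1, round(Int, frac * n))            # samples tapered at each end
    for i in 1:m
        w = 0.5 * (1 - cos(π * (i - 1) / m))    # rises from 0 toward 1
        x[i] *= w                               # leading edge
        x[n - i + 1] *= w                       # trailing edge (mirrored)
    end
    return x
end

x = ones(8_640_000)     # one day of samples at 100 Hz, as in the test data
cosine_taper!(x)
```

With one day of 100 Hz data, 0.2% corresponds to 17,280 samples (about 173 s) at each end; samples in the interior of the trace are left unchanged.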

4 Conclusions and Future Directions

The SeisIO data framework is the first of its kind: high-level, easy, performant software that introduces the next generation of geophysics researchers to cutting-edge scientific computing in the Julia language. We have shown that SeisIO’s speed and efficiency can outperform specialized precompiled C-language software. The benefits are lower computing requirements and costs.

The intent of SeisIO is to provide an efficient framework for geophysical data while maintaining comprehensible syntax. Core functionality will expand to additional data formats and acquisition methods based on demand; APIs and guides are available on the project homepage for potential contributors. Analysis programs based on SeisIO are in development, particularly for ambient-noise seismology (Bryan et al., 2019; Clements and Denolle, 2019). A SeisIO variant for GPU computing is in development and support for multiparametric volcano monitoring data is planned. As SeisIO is refined, and its scope expands to include GPU, cloud, and heterogeneous computing, we expect support to increase among seismologists and other geophysics researchers, many of whom find themselves spending valuable research time teaching new students to compile arcane (and sometimes, antique) programs.

Acknowledgments

The authors thank Andy Nowacki (University of Leeds, UK) for discussions on the Julia language; Douglas Neuhauser (University of California Seismological Laboratory, Berkeley, CA, USA) and David Shelly (US Geological Survey, Golden, CO, USA) for discussions on SAC and other data formats, which helped motivate the creation of SeisIO. J. Jones is thankful to Chad Trabant and Robert Casey (Incorporated Research Institutions for Seismology, Seattle, WA, USA) for assistance with IRIS web protocols. M. Denolle and J. Jones thank Ellen Yu and Aparna Bhaskaran (California Institute of Technology, Pasadena, California, USA) for assistance with SCSN FDSN and correspondence. J. Jones extends additional thanks to Wendy McCausland (USGS-VDAP, USA) and Ken Creager (University of Washington, USA) for contributing test data, and R. Carniel (Università di Udine, Italy) for extensive early testing. mini-SEED handling was originally based on rdmseed.m for MATLAB by François Beauducel (Institut de Physique du Globe de Paris, France); SAC routines were originally based on SacIO for Julia by Ben Postlethwaite (https://github.com/bpostlethwaite/SacIO). This research was supported by a grant from the Packard Foundation.

Author Contributions

J. Jones created SeisIO, is the sole developer of the core package, and happily rules with an iron fist over its development and maintenance. T. Clements created the SeisIO notebook tutorial, developed a number of packages based on SeisIO, and created the prototypes of several data processing routines. K. Okubo wrote and conducted the benchmarks of download efficiency and instrument response removal, and has developed a parallel downloader prototype, SeisDownload.jl, as an example of the many SeisIO applications created by M. Denolle’s research group; its functionality is currently being integrated into SeisIO core. M. Denolle contributed to application development, research direction, and manuscript editing, and provides management and financial support for ongoing development.

Data and Resources

Data used in benchmark tests (Table 2) can be found in the SeisIO GitHub repository, with redistribution restrictions as noted below. Benchmarking scripts are available on the SeisIO GitHub page. Data sources in Table 2 use the following key:

1. Contributed by Prof. K. Creager, University of Washington, Seattle, WA, USA ([email protected]).

2. Retrieved with IRIS FDSN dataselect; to duplicate a data request, please contact the corresponding author for exact parameters. Each binary data file has a single data channel; each file name gives the time length and sampling frequency.

3. File is from the IRIS Mt. St. Helens 1980 special data set (IRIS virtual network STHELENS-1980). Original data are available by request from Incorporated Research Institutions for Seismology, Seattle, WA, USA.

4. File data are from the vertical-component channel of station EA3 in Jones et al. (2006). The original recording format was the SLIST variant of Lennartz MarsLite portable stations; the first line of text was manually edited to match SLIST syntax for this test.

5. Redistribution restricted; to request this file please contact Dr. W. McCausland, USGS-VDAP, Vancouver, WA, USA ([email protected]). Data file comprises five minutes of 100 Hz data on 22 channels beginning 2008-10-08T17:01:06.06 (UTC -6).

6. Available upon request from the corresponding author. Event data extracted from Pacific Northwest Seismic Network archives; data are fully described in Jones and Malone (2005).

7. Data from HiNet (NIED, 2019); redistribution prohibited. Request comprises one hour of 100 Hz data beginning 2014-09-27T09:00:00 (UTC+9) from 8 total channels (seismometer + infrasound at stations V.ONTA and V.ONTN). Benchmark uses the NIED channel file.

A standalone repository to reproduce the benchmark tests for download efficiency presented in Section 3.2 is available on GitHub. The required software, computational environment, data sets, and commands to execute the benchmark tests are documented in the repository.

The NoisePy module for ObsPy is part of a separate manuscript, currently in preparation. The repository is private until publication, but code is available upon request from its creator (Dr. C. Jiang, Harvard University, MA, USA, chengxin [email protected]).

Addendum

The SeisIO package presented in this work is the only official Julia package by this name. We recently learned of another, newer package that borrows the name SeisIO, consisting of reflection seismology software for SEGY data, whose code has migrated to another project. This other SeisIO is not part of the Julia registry and is completely unrelated to this work, but can be found on GitHub and via Google search, and packages that depend on it exist in the Julia registry. To minimize potential confusion, please follow the installation instructions in this manuscript or on our GitHub page.

References

Ahern, T., Casey, R., Barnes, ., Benson, R., & Knight, T. (2007). SEED standard for the exchange of earthquake data reference manual format version 2.4. Incorporated Research Institutions for Seismology (IRIS), Seattle.

Bensen, G. D., Ritzwoller, M. H., Barmin, M. P., Levshin, A. L., Lin, F., Moschetti, M. P., Shapiro, N. M., & Yang, Y. (2007). Processing seismic ambient noise data to obtain reliable broad-band surface wave dispersion measurements. Geophysical Journal International, 169(3), 1239-1260.

Beyreuther, M., Barsch, R., Krischer, L., Megies, T., Behr, Y., & Wassermann, J. (2010). ObsPy: A Python Toolbox for Seismology. Seismological Research Letters, 81(3), 530-533. DOI: 10.1785/gssrl.81.3.530

Bezanson, J., Edelman, A., Karpinski, S., & Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM Review, 59(1), 65-98.

Bezanson, J., Chen, J., Chung, B., Karpinski, S., Shah, V. B., Vitek, J., & Zoubritzky, L. (2018). Julia: dynamism and performance reconciled by design. Proceedings of the ACM on Programming Languages, 2(OOPSLA), 120.

Bryan, J. T., Okubo, K., Yuan, C., & Denolle, M. (2019). Improving the resolution of co-seismic velocity change monitoring at active fault zones using the ambient seismic field. Poster Presentation at 2019 SCEC Annual Meeting.

Clements, T., & Denolle, M. (2019, 08). Cactus to Clouds: Processing the SCEDC Open Data Set on AWS. Poster Presentation at 2019 SCEC Annual Meeting.

Crotwell, H. P., Owens, T. J., & Ritsema, J. (1999). The TauP Toolkit: Flexible seismic travel-time and ray-path utilities. Seismological Research Letters, 70, 154-160.

Goldstein, P., & Snoke, A. (2005). “SAC Availability for the IRIS Community”, Incorporated Institutions for Seismology Data Management Center Electronic Newsletter.

Goldstein, P., Dodge, D., Firpo, M., & Minner, L. (2003). SAC2000: Signal processing and analysis tools for seismologists and engineers. Invited contribution to “The IASPEI International Handbook of Earthquake and Engineering Seismology”, edited by W. H. K. Lee, H. Kanamori, P. C. Jennings, and C. Kisslinger, Academic Press, London.

Hagelund, R., & Levin, S. A., eds. (2017). SEG-Y r2.0: SEG-Y revision 2.0 Data Exchange format (PDF). Tulsa, OK: Society of Exploration Geophysicists.

Jones, J. P., Carniel, R., Harris, A. J., & Malone, S. D. (2006). Seismic characteristics of variable convection at Erta ’Ale lava lake, Ethiopia. J. Volcanol. Geotherm. Res., 153(1), 64–79.

Jones, J. P., & Malone, S. D. (2005). Mount Hood earthquake activity: Volcanic or tectonic origins? Bulletin of the Seismological Society of America, 95(3), 818-832.

Krischer, L., Smith, J., Lei, W., Lefebvre, M., Ruan, Y., Sales de Andrade, E., Podhorszki, N., Bozdağ, E., & Tromp, J. (2016). An Adaptable Seismic Data Format. Geophysical Journal International, 207(2), 1003–1011, https://doi.org/10.1093/gji/ggw319.

Megies, T., Beyreuther, M., Barsch, R., Krischer, L., & Wassermann, J. (2011). ObsPy – What can it do for data centers and observatories? Annals of Geophysics, 54(1), 47-58, DOI: 10.4401/ag-4838.

National Research Institute for Earth Science and Disaster Resilience (2019). NIED Hi-net, National Research Institute for Earth Science and Disaster Resilience, doi:10.17598/NIED.0003.

Perkel, J. M. (2019). Julia: come for the syntax, stay for the speed. Nature, 572, 141-142, doi: 10.1038/d41586-019-02310-3.

Regier, J., Fischer, K., Pamnany, K., Noack, A., Revels, J., Lam, M., Howard, S., Giordano, R., Schlegel, D., McAuliffe, J., & Thomas, R. (2019). Cataloging the visible universe through Bayesian inference in Julia at petascale. Journal of Parallel and Distributed Computing, 127, 89-104.

Schorlemmer, D., Euchner, F., Kästli, P., & Saul, J. (2011). QuakeML: status of the XML-based seismological data exchange format. Annals of Geophysics, 54(1), 59-65.

Trabant, C. (2019). libmseed - The miniSEED library. https://github.com/iris-edu/libmseed, last accessed 2019-09-24.

Ward, P. L. (1989). SUDS; seismic unified data system. USGS Open-File Report 89-188, doi:10.3133/ofr89188.

Table 1: Data format support in SeisIO v0.4.1. Columns: “RW” is read/write support (“r” = read, “w” = write); “Cov” is the lesser of % code coverage on CodeCov.io and Coveralls.io. Notes use the key below.

1. coverage reflects only supported blockette/packet types
2. support for Provenance not yet implemented (NYI)
3. supports IEEE-Float and integer data in SEGY rev 0 and rev 1 formats

Format Name                      SeisIO Name    RW   Cov   Notes   Reference
SEED                                                               Ahern et al. (2007)
  Dataless SEED                  dataless       r    96    1
  mini-SEED                      mseed          r    96    1
  SEED resp                      resp           r    96
SAC                                                                e.g. Goldstein et al. (2003)
  SAC data file                  sac            rw   97
  SAC pole-zero file             sacpz          rw   97
OTHER
  Ad Hoc (v1, v2)                ah1, ah2       r    96
  Advanced Seismic Data Format   asdf           rw   100   2       Krischer et al. (2016)
  GeoCSV sample list             geocsv.slist   r    98
  GeoCSV time-sample pair        geocsv         r    98
  QuakeML                        qml            r    100           e.g. Schorlemmer et al. (2011)
  SEG Y (rev 0, rev 1)           segy           r    93    3       Hagelund et al. (2017)
  PASSCAL (SEG Y variant)        passcal        r    96
  Sample List ASCII              slist          r    100
  (SeisIO low-level format)      seisio         rw   100           this work
  FDSN Station XML               sxml           rw   100
  Seismic Unified Data System    suds           r    94    1       Ward (1989)
  UNAVCO Bottle                  bottle         r    100
  University of Washington       uw             r    98
  WIN (32-bit, v1)               win32          r    96            NIED (2019)

Table 2: Benchmark tests. Columns: Test Name is how the test is referenced in this manuscript; File is the name or search pattern in SeisIO/test/SampleFiles/; Format corresponds to column 2 of Table 1; SzF is file size on disk; SzO is object size in memory; Mem is peak memory usage; %Ovh ≡ 100 × (Mem/SzO − 1.0)%; T̃ is median read time in milliseconds for 100 trials. All memory and file size values are in MB. In column Notes, numeric values are data sources (see Data and Resources); lowercase letters denote special benchmark parameters:

a. test uses read_asdf
b. test uses read_data with keywords nx_new=36000, nx_add=1400000
c. test uses read_data with keyword full=true

Test Name      File              Format   SzF    SzO    Mem    %Ovh   T̃       Notes
AH             1day-1hz.ah       ah1      0.33   0.33   0.33   1.11   0.49    1
ASDF           2days-40hz.h5     asdf     21.96  26.37  26.49  0.45   92.74   2,a
GeoCSV-tspair  geo-tspair.csv    geocsv   3.31   0.39   0.44   12.30  204.01  2
MSEED-1        1day-100hz.mseed  mseed    19.09  32.96  32.96  0.01   71.46   2
MSEED-2        SHW.UW.mseed      mseed    1.79   5.35   6.19   15.75  9.33    3,b
PASSCAL        1day-100hz.segy   passcal  32.96  32.96  32.99  0.08   22.30   2,c
SAC            1day-100hz.sac    sac      32.96  32.96  32.97  0.04   13.02   2,c
SLIST          1h-62.5hz.slist   slist    2.44   0.86   0.87   1.85   30.09   4
SUDS           10081701.WVP      suds     1.26   2.53   2.59   2.43   1.36    5
UW             99011116541W      uw       23.15  37.66  40.29  6.98   26.71   6
WIN32          2014092709*.cnt   win32    4.49   10.99  11.25  2.33   22.88   7
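
The memory overhead column can be recomputed from the definition in the Table 2 caption. The short sketch below does this for two rows; because the tabulated Mem and SzO values are rounded to two decimals, the recomputed percentages agree with the table only to within rounding. The function name percent_overhead is our own.

```julia
# Percent memory overhead as defined in the Table 2 caption:
# %Ovh = 100 × (Mem / SzO − 1)
percent_overhead(mem, szo) = 100 * (mem / szo - 1.0)

# Rounded Mem and SzO values (MB) copied from the ASDF and UW rows of Table 2
ovh_asdf = percent_overhead(26.49, 26.37)
ovh_uw   = percent_overhead(40.29, 37.66)
```

For rows where Mem and SzO are nearly equal (e.g., AH or MSEED-1), the two-decimal values in the table are too coarse to reproduce the listed overhead, so such rows are not a reliable check.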

Figure 1: Benchmark tests (Table 2) in Julia v1.1.0 with SeisIO v0.4.1. Left: file read times. Right: peak memory use in SeisIO and file size on disk.

Figure 2: Memory use and overhead for all benchmarks in Table 2 that were testable in at least two of ObsPy, SAC, and SeisIO. (top) Memory usage and file sizes on disk. (bottom) Memory overhead. The y-axis is logarithmic. A missing bar with text label NR indicates no reader.

Figure 3: Read times in milliseconds for all benchmarks in Table 2 that were testable in at least two of ObsPy, SAC, and SeisIO. A missing bar with text label NR indicates no reader. Most read times fall in the range 10-100 ms. SeisIO AH and SUDS benchmarks are labeled with their respective values because the bars themselves are difficult to see. ObsPy SLIST benchmark is labeled with its value because the full bar vastly exceeds the upper bound of the y-axis.

Figure 4: Download efficiency as a function of number of workers from (a) the IRIS-DMC server and (b) the Northern California Earthquake Data Center (NCEDC). The markers indicate individual speed tests. The dashed lines indicate the best-fit line (with logarithmic y-axis scaling) associated with each tool; the slope of each line is a proxy measure of the scaling performance.

Figure 5: Benchmark tests of instrument response removal. (a) Time benchmarks of data read and instrument response removal. Solid bar heights correspond to the mean times of each benchmark; 1σ error bars are shown as thin black lines. (b) Waveforms with their respective instrument responses removed are shown to demonstrate that the methods produce nearly identical output. For ease of visualization, lines are plotted every 20 points.
