SeisIO: a fast, efficient geophysical data architecture for the Julia language

Joshua P. Jones¹∗, Kurama Okubo², Tim Clements², and Marine A. Denolle²

¹ 4509 NE Sumner St., Portland, OR, USA
² Department of Earth and Planetary Sciences, Harvard University, MA, USA
∗ Corresponding author: Joshua P. Jones ([email protected])

Abstract
SeisIO for the Julia language is a new geophysical data framework that combines the intuitive syntax of a high-level language with performance comparable to FORTRAN or C. Benchmark comparisons with recent versions of popular programs for seismic data download and analysis demonstrate significant improvements in file read speed and orders-of-magnitude improvements in memory overhead. Because the Julia language natively supports parallel computing with an intuitive syntax, we benchmark test parallel download and processing of multi-week segments of contiguous data from two sets of 10 broadband seismic stations, and find that SeisIO outperforms two popular Python-based tools for data downloads. The current capabilities of SeisIO include file read support for several geophysical data formats, online data access using FDSN web services, IRIS web services, and SeisComP SeedLink, with optimized versions of several common data processing operations. Tutorial notebooks and extensive documentation are available to improve the user experience (UX). As an accessible example of performant scientific computing for the next generation of researchers, SeisIO offers ease of use and rapid learning without sacrificing computational performance.

1 Introduction
The dramatic growth in the volume of collected geophysical data has the potential to lead to tremendous advances in the science (https://ds.iris.edu/data/distribution/).
Leveraging the data revolution to gain knowledge that is useful for earthquake science, hydrology, industry, and climate science requires new tools to help Earth scientists extract meaningful information from arbitrarily large data sets. High-performance computing is necessary to manage the scale of these problems; however, this requires specialized training at the undergraduate and graduate levels, and such training is rarely offered in undergraduate science curricula. On the other hand, open-source computing languages (e.g., Python) and codes (e.g., ObsPy; Beyreuther et al., 2010) have standardized seismic data processing and improved access to seismic data analysis for a new generation of seismologists. However, these tools suffer from slow computation time and inefficient memory allocation at scale. Therefore, the geophysics community is in need of a computational framework that is simultaneously easy to learn and efficient.

The Julia language combines the syntactic ease of high-level languages like MATLAB and Python with the performance of FORTRAN and C. Developed for fast, efficient numerical computing, Julia version 1.0.0 was released in August 2018; the first beta version appeared in February 2012 (Bezanson et al., 2017, 2018). The language is known for impressive speed and computational efficiency: while still in beta testing, Julia became the fourth programming language to achieve a petaflop, after FORTRAN, C, and C++ (Regier et al., 2018; Perkel, 2019). Despite its relative youth, Julia supports a growing collection of open-source modules for numerical and scientific computing. Julia wrappers to C, FORTRAN, R, and Python allow seamless execution of external code, and third-party packages (https://github.com/JuliaInterop) extend interoperability to C++, Java, Mathematica, and MATLAB, including the ability to read .mat files (https://github.com/JuliaIO/MAT.jl).
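As a brief illustration of the interoperability described above, the sketch below calls the C standard library function strlen directly from Julia using the built-in ccall interface. It is a minimal example, not taken from SeisIO, and it assumes a platform where C library symbols are visible to the Julia process (typical on Linux and macOS).

```julia
# Minimal sketch: call the C function `strlen` from Julia with `ccall`.
# Assumption: libc symbols are visible to the Julia process.
s = "SeisIO"

# ccall(function symbol, return type, tuple of argument types, arguments...)
n = ccall(:strlen, Csize_t, (Cstring,), s)

println("strlen(\"", s, "\") = ", Int(n))   # prints: strlen("SeisIO") = 6
```

No wrapper code or compilation step is needed; the argument and return types are declared inline in the call.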
2 SeisIO
The SeisIO package was created in May 2016 with the goal of rapid, efficient analysis of univariate geophysical data in the Julia language, using comprehensible, uniform syntax, and simple but powerful commands. Its design allows users to read univariate data from arbitrary instruments (e.g., seismic, geodetic, gas flux) into a single structure, including gapped and irregularly-sampled data. In the subsections below, we describe the capabilities of SeisIO, conduct benchmark tests, and introduce tutorials.

2.1 Capabilities
SeisIO includes well-tested read support for many geophysical time-series formats (Table 1). Readers for all formats but ASDF strictly use the Julia language; ASDF uses wrappers to libhdf5, written in C. Current data processing operations include filling time gaps, removing the mean and linear trend, band-pass filtering, instrument response translation and removal (i.e., flattening to DC), resampling, cosine tapering, merging, seismogram differentiation/integration, and time synchronization. Tools for online acquisition support FDSN services (station, event, and dataselect), IRIS time-series requests, SeisComP SeedLink, and the IRIS TauP interface (Crotwell et al., 1999).

SeisIO has been officially listed in the Julia package ecosystem since early 2019. Automated testing with Travis-CI (https://travis-ci.org/) and AppVeyor (https://www.appveyor.com/) supports Linux, macOS, and Windows installations. Code coverage estimates of 97-98% on Codecov (https://codecov.io/) and Coveralls (https://coveralls.io/) exceed the 95% coverage threshold typical of enterprise-level commercial software releases, yet both Julia and SeisIO are free.

2.2 Installation
Typical installation of the Julia language, SeisIO, and all dependencies requires three steps:
1. Download and install the Julia language from https://julialang.org/downloads/
   • The Julia install directory will be denoted (juliaroot) hereafter.
   • (juliaroot) is typically a pattern like /home/username/julia-v.v.v/ in Linux, e.g., /home/josh/julia-1.1.0/.
2. Start the Julia command-line interface (CLI) with (juliaroot)/bin/julia
3. Type or copy: using Pkg; Pkg.add("SeisIO"); using SeisIO

Julia installs package dependencies automatically when Pkg.add is invoked. There is no need for dedicated environments or session-specific user settings; however, FFT performance can sometimes be improved by starting Julia in parallel-ready mode with (juliaroot)/bin/julia --procs auto. Total disk space required is typically under 4 GB: 300-400 MB for Julia; 4.2 MB for SeisIO v0.4.1; 300 MB for optional test and benchmark data; and 1-3 GB for a typical set of scientific computing packages. The last space requirement is much lower for non-Windows users who manually link existing libraries and software (e.g., BLAS, Conda, FFTW) to Julia, but this is only recommended for experienced Linux users.

2.3 SeisIO Data Structure
SeisIO is designed around easy, fluid, and fast data access. For example, a complete sequence of commands to download and process channel data can be executed in one function call with keywords:

    julia> S = get_data("FDSN", "UW.LON..BH?", src="IRIS", s="2019-01-01", t=3600, detrend=true, rr=true, w=true)

    SeisData with 3 channels (2 shown)
        ID: UW.LON..BHE                        UW.LON..BHN                        ...
      NAME: Longmire CREST broad-band          Longmire CREST broad-band          ...
       LOC: 46.7506 N, -121.81 E, 853.0 m      46.7506 N, -121.81 E, 853.0 m      ...
        FS: 40.0                               40.0                               ...
      GAIN: 7.51485e8                          7.51485e8                          ...
      RESP: a0 1.0, f0 1.0, 1z, 1p             a0 1.0, f0 1.0, 1z, 1p             ...
     UNITS: m/s                                m/s                                ...
       SRC: http://service.iris.edu/fdsnws/da  http://service.iris.edu/fdsnws/da  ...
      MISC: 4 entries                          4 entries                          ...
     NOTES: 2 entries                          2 entries                          ...
         T: 2019-01-01T00:00:00.010 (0 gaps)   2019-01-01T00:00:00.010 (0 gaps)   ...
         X: -1.511e+03                         +4.669e+03                         ...
            -1.512e+03                         +4.699e+03                         ...
            ...                                ...                                ...
            +1.540e+03                         +7.483e+02                         ...
            (nx = 144000)                      (nx = 144000)                      ...
         C: 0 open, 0 total

This example downloads 3600 seconds of data beginning 2019-01-01 00:00:00 (UTC) using FDSN dataselect with the IRIS DMC server. The keyword "detrend" removes the linear trend after download; "rr" removes (flattens to DC) the instrument response and replaces the .resp field of each channel with an all-pass filter. The keyword "w" writes the download directly to disk before processing. Access to data properties is straightforward and intentionally simple: for example, in all time-series data structures, the field .x holds univariate data.

2.4 Tutorials
A SeisIO tutorial is available from the project GitHub site, with three short, interactive Jupyter notebooks designed to take 5-10 minutes each. A few additional commands in the Julia CLI are required to run interactive notebooks:

    using Pkg
    Pkg.add(["Dates", "IJulia"])
    using IJulia, SeisIO   # SeisIO must be loaded for pathof(SeisIO) to work
    cd(dirname(pathof(SeisIO))*"/../tutorial/")
    jupyterlab(dir=pwd())

The three tutorials are:
    Part_1-Basic.ipynb: introduction to SeisIO
    Part_2-Data_Acquisition.ipynb: downloading data & reading files
    Part_3-Processing.ipynb: data processing

Researchers familiar with MATLAB/Octave or Python will find Julia syntax intuitive and may need only the language's official documentation to begin coding. In addition, many Julia-language tutorials can be downloaded from https://julialang.org/learning/.

3 Benchmarking
We conduct a series of benchmark tests on a 64-bit personal computer equipped with an Intel DH67CL motherboard, i7-2600 (3.4 GHz) CPU, and 16 GB Kingston DDR3 RAM, running Julia v1.1.0 on 64-bit Ubuntu Linux 18.04.3 (kernel 5.0.0-29).
File read tests (Table 2) use SeisIO v0.4.1 and BenchmarkTools.jl with 100 samples per benchmark and one evaluation per sample.
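The benchmark settings named above (100 samples, one evaluation per sample) can be sketched with the @benchmark macro from BenchmarkTools.jl. This is an illustrative sketch only, not the benchmark code used for Table 2: the temporary file and the read_samples function are hypothetical stand-ins, and no SeisIO reader is called.

```julia
using BenchmarkTools   # third-party package; install with Pkg.add("BenchmarkTools")

# Hypothetical stand-in for a data file: 1e6 Float32 samples at a temporary path.
path = tempname()
write(path, rand(Float32, 1_000_000))

# Hypothetical reader, for illustration only (not a SeisIO function).
function read_samples(p::String)
    x = Vector{Float32}(undef, filesize(p) ÷ sizeof(Float32))
    read!(p, x)          # fill the preallocated vector from the file
    return x
end

# Up to 100 samples per benchmark, one evaluation per sample.
b = @benchmark read_samples($path) samples=100 evals=1

est = minimum(b)         # estimate from the fastest sample
println("samples collected: ", length(b.times))
println("minimum time (ms): ", est.time / 1e6)      # times are stored in ns
println("memory (MB):       ", est.memory / 1024^2) # memory is stored in bytes

rm(path)                 # remove the temporary file
```

Setting evals=1 times each call separately, which suits file reads that take milliseconds; the samples keyword is an upper bound, so fewer samples may be collected if the default time budget runs out first.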