The Astrophysical Journal Supplement Series, 191:247–253, 2010 December. doi:10.1088/0067-0049/191/2/247
© 2010. The American Astronomical Society. All rights reserved. Printed in the U.S.A.

FAST CALCULATION OF THE LOMB–SCARGLE PERIODOGRAM USING GRAPHICS PROCESSING UNITS

R. H. D. Townsend
Department of Astronomy, University of Wisconsin-Madison, Sterling Hall, 475 N. Charter Street, Madison, WI 53706, USA; [email protected]
Received 2010 June 29; accepted 2010 October 19; published 2010 November 23

ABSTRACT

I introduce a new code for fast calculation of the Lomb–Scargle periodogram that leverages the computing power of graphics processing units (GPUs). After establishing a background to the newly emergent field of GPU computing, I discuss the code design and narrate key parts of its source. Benchmarking calculations indicate no significant differences in accuracy compared to an equivalent CPU-based code. However, the differences in performance are pronounced; running on a low-end GPU, the code can match eight CPU cores, and on a high-end GPU it is faster by a factor approaching 30. Applications of the code include analysis of long photometric time series obtained by ongoing satellite missions and upcoming ground-based monitoring facilities, and Monte Carlo simulation of periodogram statistical properties.

Key words: methods: data analysis – methods: numerical – techniques: photometric – stars: oscillations

Online-only material: Supplemental data file (tar.gz)

1. INTRODUCTION

Astronomical time-series observations are often characterized by uneven temporal sampling (e.g., due to transformation to the heliocentric frame) and/or non-uniform coverage (e.g., from day/night cycles or radiation belt passages). This complicates the search for periodic signals, as a fast Fourier transform (FFT) algorithm cannot be employed. A variety of alternatives have been put forward, the most oft-used being the eponymous Lomb–Scargle (L-S) periodogram developed by Lomb (1976) and Scargle (1982). At the time of writing, NASA's Astrophysics Data System (ADS) lists 735 and 1810 publications (respectively) that cite these two papers, highlighting how important the L-S periodogram has proven for the analysis of time series. Recent applications include the search for a link between solar rotation and nuclear decay rates (Sturrock et al. 2010); the study of pulsar timing noise (Lyne et al. 2010); the characterization of quasi-periodic oscillations in blazars (Rani et al. 2010); and the measurement of rotation periods in exoplanet host stars (Simpson et al. 2010).

Unfortunately, a drawback of the L-S periodogram is a computational cost scaling as $\mathcal{O}(N_t^2)$, where $N_t$ is the number of measurements in the time series; this contrasts with the far-more-efficient $\mathcal{O}(N_t \log_2 N_t)$ scaling of the FFT algorithm popularized by Cooley & Tukey (1965). One approach to reducing this cost has been proposed by Press & Rybicki (1989), based on constructing a uniformly sampled approximation to the observations via "extirpolation" and then evaluating its FFT. The present paper introduces a different approach, not through algorithmic development but rather by leveraging the computing power of graphics processing units (GPUs)—the specialized hardware at the heart of the display subsystem in personal computers and workstations. Modern GPUs typically comprise a number of identical programmable processors, and in recent years there has been significant interest in applying these parallel-computing resources to problems across a breadth of scientific disciplines. In the following section, I give a brief history of the newly emergent field of GPU computing; then, Section 3 reviews the formalism defining the L-S periodogram, and Section 4 presents a GPU-based code implementing this formalism. Benchmarking calculations to evaluate the accuracy and performance of the code are presented in Section 5. The findings and future outlook are then discussed in Section 6.

2. BACKGROUND TO GPU COMPUTING

2.1. Pre-2006: Initial Forays

The past decade has seen remarkable increases in the ability of computers to render complex three-dimensional scenes at movie frame rates. These gains have been achieved by progressively shifting the graphics pipeline—the algorithmic sequence of steps that converts a scene description into an image—from the CPU to dedicated hardware within the GPU. To address the inflexibility that can accompany such hardware acceleration, GPU vendors introduced so-called programmable shaders, processing units that apply a simple sequence of transformations to input elements such as image pixels and mesh vertices. NVIDIA Corporation were the first to implement programmable shader functionality, with their GeForce 3 series of GPUs (released 2001 March) offering one vertex shader and four (parallel) pixel shaders. The release in the following year of ATI Corporation's R300 series brought not only an increase in the number of shaders (up to four vertex and eight pixel), but also capabilities such as floating-point arithmetic and looping constructs that laid the foundations for what ultimately would become GPU computing.

Shaders are programmed using a variety of specialized languages, such as the OpenGL Shading Language (GLSL; e.g., Rost 2006) and Microsoft's High-Level Shading Language (HLSL). The designs of these languages are strongly tied to their graphics-related purpose, and thus early attempts at GPU computing using programmable shaders had to map each calculation into a sequence of equivalent graphical operations (see, e.g., Owens et al. 2005, and references therein). In an effort to overcome this awkward aspect, Buck et al. (2004) developed BrookGPU—a compiler and run-time implementation of the Brook stream programming language for GPU platforms. With BrookGPU, the computational resources of shaders are accessed through a stream processing paradigm: a well-defined series of operations (the kernel) are applied to each element in a typically large homogeneous sequence of data (the stream).

2.2. Post-2006: Modern Era

GPU computing entered its modern era in 2006, with the release of NVIDIA's Compute Unified Device Architecture (CUDA)—a framework for defining and managing GPU computations without the need to map them into graphical operations. CUDA-enabled devices (see Appendix A of NVIDIA 2010) are distinguished by their general-purpose unified shaders, which replace the function-specific shaders (pixel, vertex, etc.) present in earlier GPUs. These shaders are programmed using an extension to the C language, which follows the same stream-processing paradigm pioneered by BrookGPU. Since the launch of CUDA, other vendors have been quick to develop their own GPU computing offerings, most notably Advanced Micro Devices with their Stream framework and Microsoft with their DirectCompute interface.

Abstracting away the graphical roots of GPUs has made them accessible to a very broad audience, and GPU-based computations are now being undertaken in fields as diverse as molecular biology, medical imaging, geophysics, fluid dynamics, economics, and cryptography (see Pharr 2005; Nguyen 2007). Within astronomy and astrophysics, recent applications include N-body simulations (Belleman et al. 2008), real-time radio correlation (Wayth et al. 2009), gravitational lensing (Thompson et al. 2010), adaptive-mesh hydrodynamics (Schive et al. 2010), and cosmological reionization (Aubert & Teyssier 2010).

3. THE LOMB–SCARGLE PERIODOGRAM

This section reviews the formalism defining the Lomb–Scargle periodogram. For a time series comprising $N_t$ measurements $X_j \equiv X(t_j)$ sampled at times $t_j$ ($j = 1, \ldots, N_t$), assumed throughout to have been scaled and shifted such that its mean is zero and its variance is unity, the normalized L-S periodogram at frequency $f$ is

$$P_n(f) = \frac{1}{2} \left\{ \frac{\left[ \sum_j X_j \cos \omega (t_j - \tau) \right]^2}{\sum_j \cos^2 \omega (t_j - \tau)} + \frac{\left[ \sum_j X_j \sin \omega (t_j - \tau) \right]^2}{\sum_j \sin^2 \omega (t_j - \tau)} \right\}. \tag{1}$$

Here and throughout, $\omega \equiv 2\pi f$ is the angular frequency and all summations run from $j = 1$ to $j = N_t$. The frequency-dependent time offset $\tau$ is evaluated at each $\omega$ via

$$\tan 2\omega\tau = \frac{\sum_j \sin 2\omega t_j}{\sum_j \cos 2\omega t_j}. \tag{2}$$

As discussed by Schwarzenberg-Czerny (1998), $P_n$ in the case of a pure-Gaussian noise time series is drawn from a beta distribution. For a periodogram comprising $N_f$ frequencies, the false-alarm probability (FAP)—that some observed peak occurs due to chance fluctuations—is

[...]

The periodogram can thus be written schematically as $P_n(f) = G(f; t_j, X_j)$, where $G$ is some function. In the classification scheme introduced by Barsdell et al. (2010), this follows the form of an interact algorithm. Generally speaking, such algorithms are well suited to GPU implementation, since they are able to achieve a high arithmetic intensity. However, a straightforward implementation of Equations (1) and (2) involves two complete runs through the time series to calculate a single $P_n(f)$, which is wasteful of memory bandwidth and requires $N_f(4N_t + 1)$ costly trigonometric function evaluations for the full periodogram. Press et al. (1992) address this inefficiency by calculating the trig functions from recursion relations, but this approach is difficult to map onto stream processing concepts, and moreover becomes inaccurate in the limit of large $N_f$. An alternative strategy, which avoids these difficulties while still offering improved performance, comes from refactoring the equations as

$$P_n(f) = \frac{1}{2} \left[ \frac{(c_\tau\, XC + s_\tau\, XS)^2}{c_\tau^2\, CC + 2 c_\tau s_\tau\, CS + s_\tau^2\, SS} + \frac{(c_\tau\, XS - s_\tau\, XC)^2}{c_\tau^2\, SS - 2 c_\tau s_\tau\, CS + s_\tau^2\, CC} \right] \tag{5}$$

and

$$\tan 2\omega\tau = \frac{2\, CS}{CC - SS}. \tag{6}$$

Here,

$$c_\tau = \cos \omega\tau, \qquad s_\tau = \sin \omega\tau, \tag{7}$$

while the sums,

$$XC = \sum_j X_j \cos \omega t_j, \qquad XS = \sum_j X_j \sin \omega t_j,$$
$$CC = \sum_j \cos^2 \omega t_j, \qquad SS = \sum_j \sin^2 \omega t_j, \qquad CS = \sum_j \cos \omega t_j \sin \omega t_j, \tag{8}$$

can be evaluated in a single run through the time series, giving a total of $N_f(2N_t + 3)$ trig evaluations for the full periodogram—a factor of $\sim$2 improvement.
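To make the cost argument concrete, the direct evaluation of Equations (1) and (2) can be sketched in plain Python. This is an illustrative CPU reference only, not the paper's GPU code; the function name `ls_direct` and the test signal are invented for the example. Note the two runs through the time series per frequency, amounting to roughly $4N_t + 1$ trigonometric calls per frequency.

```python
import math

def ls_direct(t, x, freqs):
    """Normalized Lomb-Scargle periodogram, evaluated directly from
    Equations (1) and (2): two runs through the time series per
    frequency, costing roughly 4*Nt + 1 trig evaluations each."""
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / n
    # Scale and shift to zero mean, unit variance (the paper's assumption)
    X = [(xi - mean) / math.sqrt(var) for xi in x]

    pgram = []
    for f in freqs:
        w = 2.0 * math.pi * f  # angular frequency, omega = 2 pi f
        # Run 1: the frequency-dependent offset tau (Equation 2)
        s2 = sum(math.sin(2.0 * w * tj) for tj in t)
        c2 = sum(math.cos(2.0 * w * tj) for tj in t)
        tau = math.atan2(s2, c2) / (2.0 * w)
        # Run 2: the quadrature sums entering Equation (1)
        xc = sum(Xj * math.cos(w * (tj - tau)) for Xj, tj in zip(X, t))
        xs = sum(Xj * math.sin(w * (tj - tau)) for Xj, tj in zip(X, t))
        cc = sum(math.cos(w * (tj - tau)) ** 2 for tj in t)
        ss = sum(math.sin(w * (tj - tau)) ** 2 for tj in t)
        pgram.append(0.5 * (xc * xc / cc + xs * xs / ss))
    return pgram
```

For an unevenly sampled, noise-free sinusoid, the periodogram peaks at the injected frequency, with $P_n$ at the peak approaching $N_t/2$ under this normalization.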
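The refactored Equations (5)–(8) can likewise be sketched in Python; again a hedged CPU illustration (the helper name `ls_single_pass` is invented), not the paper's GPU implementation. The point is that the five sums $XC$, $XS$, $CC$, $SS$, and $CS$ accumulate in a single run through the time series, after which $\tau$ follows from Equation (6), for a total of $2N_t + 3$ trig calls per frequency.

```python
import math

def ls_single_pass(t, x, freqs):
    """Normalized Lomb-Scargle periodogram via the refactored
    Equations (5)-(8): a single run through the time series per
    frequency, costing 2*Nt + 3 trig evaluations each."""
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / n
    # Scale and shift to zero mean, unit variance (the paper's assumption)
    X = [(xi - mean) / math.sqrt(var) for xi in x]

    pgram = []
    for f in freqs:
        w = 2.0 * math.pi * f
        XC = XS = CC = SS = CS = 0.0
        for Xj, tj in zip(X, t):   # single run: the sums of Equation (8)
            c = math.cos(w * tj)   # 2 trig calls per sample...
            s = math.sin(w * tj)
            XC += Xj * c
            XS += Xj * s
            CC += c * c
            SS += s * s
            CS += c * s
        # ...plus 3 per frequency: Equations (6) and (7), noting that
        # 2*CS = sum sin(2wt) and CC - SS = sum cos(2wt)
        wtau = 0.5 * math.atan2(2.0 * CS, CC - SS)
        ct, st = math.cos(wtau), math.sin(wtau)
        num1 = (ct * XC + st * XS) ** 2
        den1 = ct * ct * CC + 2.0 * ct * st * CS + st * st * SS
        num2 = (ct * XS - st * XC) ** 2
        den2 = ct * ct * SS - 2.0 * ct * st * CS + st * st * CC
        pgram.append(0.5 * (num1 / den1 + num2 / den2))
    return pgram
```

Because each frequency is processed independently with only a handful of running scalars, this form maps naturally onto a stream-processing kernel, with one thread evaluating one frequency; the refactoring is algebraically exact, so it agrees with the direct two-run evaluation of Equations (1) and (2) to floating-point precision.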