arXiv:1007.1658v2 [astro-ph.SR] 19 Oct 2010

Draft version August 7, 2018
Preprint typeset using LaTeX style emulateapj v. 11/10/09

FAST CALCULATION OF THE LOMB-SCARGLE PERIODOGRAM USING GRAPHICS PROCESSING UNITS

R. H. D. Townsend
Department of Astronomy, University of Wisconsin-Madison, Sterling Hall, 475 N. Charter Street, Madison, WI 53706, USA; [email protected]

ABSTRACT
I introduce a new code for fast calculation of the Lomb-Scargle periodogram, that leverages the computing power of graphics processing units (GPUs). After establishing a background to the newly emergent field of GPU computing, I discuss the design of the code and narrate key parts of its source. Benchmarking calculations indicate no significant differences in accuracy compared to an equivalent CPU-based code. However, the differences in performance are pronounced; running on a low-end GPU, the code can match 8 CPU cores, and on a high-end GPU it is faster by a factor approaching thirty. Applications of the code include analysis of long photometric time series obtained by ongoing satellite missions and upcoming ground-based monitoring facilities; and Monte-Carlo simulation of periodogram statistical properties.
Subject headings: methods: data analysis — methods: numerical — techniques: photometric — stars: oscillations

1. INTRODUCTION

Astronomical time-series observations are often characterized by uneven temporal sampling (e.g., due to transformation to the heliocentric frame) and/or non-uniform coverage (e.g., from day/night cycles, or radiation belt passages). This complicates the search for periodic signals, as a fast Fourier transform (FFT) cannot be employed. A variety of alternatives have been put forward, the most oft-used being the eponymous Lomb-Scargle (L-S) periodogram developed by Lomb (1976) and Scargle (1982). At the time of writing, NASA's Astrophysics Data System (ADS) lists 735 and 1,810 publications (respectively) that cite these two papers, highlighting how important the L-S periodogram has proven for the analysis of time series. Recent applications include the search for a link between solar rotation and nuclear decay rates (Sturrock et al. 2010); the study of pulsar timing noise (Lyne et al. 2010); the characterization of quasi-periodic oscillations in blazars (Rani et al. 2010); and the measurement of rotation periods in exoplanet host stars (Simpson et al. 2010).

Unfortunately, a drawback of the L-S periodogram is a computational cost scaling as O(N_t^2), where N_t is the number of measurements in the time series; this contrasts with the far-more-efficient O(N_t log_2 N_t) scaling of the FFT algorithm popularized by Cooley & Tukey (1965). One approach to reducing this cost has been proposed by Press & Rybicki (1989), based on constructing a uniformly sampled approximation to the observations via 'extirpolation' and then evaluating its FFT. The present paper introduces a different approach, achieved not through algorithmic development but rather by leveraging the computing power of specialized hardware — the graphics processing units (GPUs) at the heart of the display sub-system in personal computers and workstations. Modern GPUs typically comprise a number of identical programmable processors, and in recent years there has been significant interest in applying these parallel-computing resources to problems across a breadth of scientific disciplines.

In the following section, I give a brief history of the newly emergent field of GPU computing. Section 3 reviews the formalism defining the L-S periodogram, and Section 4 then presents a GPU-based code implementing this formalism. Benchmarking calculations that evaluate the accuracy and performance of the code are presented in Section 5. The findings and future outlook are then discussed in Section 6.

2. BACKGROUND TO GPU COMPUTING

2.1. Pre-2006: Initial Forays

The past decade has seen remarkable increases in the ability of computers to render complex 3-dimensional scenes at movie frame-rates. These gains have been achieved by progressively shifting the steps of the graphics pipeline — the sequence of algorithmic steps that converts a scene description into an image — from the CPU to dedicated hardware within the GPU. To address the inflexibility that can accompany such hardware acceleration, GPU vendors introduced so-called programmable shaders, processing units that apply a simple sequence of transformations to input elements such as image pixels and mesh vertices. NVIDIA Corporation were the first to implement shader functionality, with their GeForce 3 series of GPUs (released March 2001) offering one vertex and four (parallel) pixel shaders. The following year brought the release of ATI Corporation's R300 series, which not only increased the number of shaders (up to 4 vertex and 8 pixel), but also laid the foundations for what would ultimately become GPU computing, through constructs such as floating-point arithmetic and looping capabilities.

Shaders are programmed using a variety of specialized languages, such as the OpenGL Shading Language (GLSL; e.g., Rost 2006) and Microsoft's High-Level Shading Language (HLSL). The designs of these languages are strongly tied to their graphics-related purpose, and thus early attempts at GPU computing had to map each calculation into a sequence of equivalent graphical operations (see, e.g., Owens et al. 2005, and references therein). In an effort to overcome this awkward aspect, Buck et al. (2004) developed BrookGPU — a compiler and run-time implementation of the Brook stream programming language for GPU platforms. With BrookGPU, the computational resources of shaders are accessed through a stream processing paradigm: a well-defined series of operations (the kernel) are applied to each element in a typically-large homogeneous sequence of data (the stream).
2.2. Post-2006: Modern Era

GPU computing entered its modern era in 2006, with the release of NVIDIA's Compute Unified Device Architecture (CUDA) — a framework for defining and managing GPU computations without the need to map them into graphical operations. CUDA-enabled devices (see Appendix A of NVIDIA 2010) are distinguished by their general-purpose unified shaders, which replace the function-specific shaders (pixel, vertex, etc.) present in earlier GPUs. These shaders are programmed using an extension to the C language, which follows the same stream-processing paradigm pioneered by BrookGPU. Since the launch of CUDA, other vendors have been quick to develop their own GPU computing offerings, most notably Advanced Micro Devices (AMD) with their Stream framework, and Microsoft with their DirectCompute interface.

Abstracting away the graphical roots of GPUs has made them accessible to a very broad audience, and GPU-based computations are now being undertaken in fields as diverse as molecular biology, medical imaging, geophysics, fluid dynamics, economics and cryptography (see Pharr 2005; Nguyen 2007). Within astronomy and astrophysics, recent applications include N-body simulations (Belleman et al. 2008), real-time radio correlation (Wayth et al. 2009), gravitational lensing (Thompson et al. 2010), adaptive-mesh hydrodynamics (Schive et al. 2010) and cosmological reionization (Aubert & Teyssier 2010).

3. THE LOMB-SCARGLE PERIODOGRAM

This section reviews the formalism defining the Lomb-Scargle periodogram. For a time series comprising N_t measurements X_j ≡ X(t_j) sampled at times t_j (j = 1, ..., N_t), assumed throughout to have been scaled and shifted such that its mean is zero and its variance is unity, the normalized L-S periodogram at frequency f is

    P_n(f) = \frac{1}{2} \left\{ \frac{\left[ \sum_j X_j \cos \omega (t_j - \tau) \right]^2}{\sum_j \cos^2 \omega (t_j - \tau)} + \frac{\left[ \sum_j X_j \sin \omega (t_j - \tau) \right]^2}{\sum_j \sin^2 \omega (t_j - \tau)} \right\}.    (1)

Here and throughout, ω ≡ 2πf is the angular frequency and all summations run from j = 1 to j = N_t. The frequency-dependent time offset τ is evaluated at each ω via

    \tan 2\omega\tau = \frac{\sum_j \sin 2\omega t_j}{\sum_j \cos 2\omega t_j}.    (2)

As discussed by Schwarzenberg-Czerny (1998), P_n in the case of a pure Gaussian-noise time series is drawn from a beta distribution. For a periodogram comprising N_f frequencies^1, the false-alarm probability (FAP) — that some observed peak occurs due to chance fluctuations — is

    Q = 1 - \left[ 1 - \left( 1 - \frac{2 P_n}{N_t} \right)^{(N_t - 3)/2} \right]^{N_f}.    (3)

Equations (1) and (2) can be written schematically as

    P_n(f) = \sum_j G[f, (t_j, X_j)],    (4)

where G is some function. In the classification scheme introduced by Barsdell et al. (2010), this follows the form of an interact algorithm. Generally speaking, such algorithms are well-suited to GPU implementation, since they are able to achieve a high arithmetic intensity. However, a straightforward implementation of equations (1) and (2) involves two complete runs through the time series to calculate a single P_n(f), which is wasteful of memory bandwidth and requires N_f (4 N_t + 1) costly trigonometric function evaluations for the full periodogram. Press et al. (1992) address this inefficiency by calculating the trig functions from recursion relations, but this approach is difficult to map onto stream processing concepts, and moreover becomes inaccurate in the limit of large N_f. An alternative strategy, which avoids these difficulties while still offering improved performance, comes from refactoring the equations as

    P_n(f) = \frac{1}{2} \left[ \frac{(c_\tau XC + s_\tau XS)^2}{c_\tau^2 CC + 2 c_\tau s_\tau CS + s_\tau^2 SS} + \frac{(c_\tau XS - s_\tau XC)^2}{c_\tau^2 SS - 2 c_\tau s_\tau CS + s_\tau^2 CC} \right]    (5)

and

    \tan 2\omega\tau = \frac{2\, CS}{CC - SS}.    (6)

Here,

    c_\tau = \cos \omega\tau,  s_\tau = \sin \omega\tau,    (7)

while the sums

    XC = \sum_j X_j \cos \omega t_j,
    XS = \sum_j X_j \sin \omega t_j,
    CC = \sum_j \cos^2 \omega t_j,    (8)
    SS = \sum_j \sin^2 \omega t_j,
    CS = \sum_j \cos \omega t_j \sin \omega t_j,

can be evaluated in a single run through the time series, giving a total of N_f (2 N_t + 3) trig evaluations for the full periodogram — a factor ∼2 improvement.

^1 The issue of 'independent' frequencies is briefly discussed in Section 6.2.
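To make this single-pass structure concrete, the following C sketch evaluates P_n at one frequency from the refactored sums. It is a minimal illustration of equations (5)-(8) only — the function and variable names are mine, not excerpts from the codes introduced in the following sections — and it assumes the time series has already been scaled to zero mean and unit variance.

#include <math.h>

/* Minimal single-frequency evaluation of the refactored periodogram
   (equations 5-8), for a time series t[0..N_t-1], X[0..N_t-1] with
   zero mean and unit variance.  Illustrative sketch only. */
double ls_power(const double *t, const double *X, int N_t, double f)
{
  double omega = 2.0*M_PI*f;
  double XC = 0.0, XS = 0.0, CC = 0.0, CS = 0.0;

  /* A single run through the time series accumulates the sums of
     equation (8); SS follows from CC, since cos^2 + sin^2 = 1 */
  for (int j = 0; j < N_t; j++) {
    double c = cos(omega*t[j]);
    double s = sin(omega*t[j]);
    XC += X[j]*c;
    XS += X[j]*s;
    CC += c*c;
    CS += c*s;
  }
  double SS = (double) N_t - CC;

  /* Time offset tau from equation (6), then P_n from equation (5) */
  double tau_arg = atan2(2.0*CS, CC - SS);
  double ct = cos(0.5*tau_arg);
  double st = sin(0.5*tau_arg);

  return 0.5*((ct*XC + st*XS)*(ct*XC + st*XS)/
              (ct*ct*CC + 2.0*ct*st*CS + st*st*SS) +
              (ct*XS - st*XC)*(ct*XS - st*XC)/
              (ct*ct*SS - 2.0*ct*st*CS + st*st*CC));
}

Note that each pass through the loop uses two trigonometric evaluations per sample, plus three for the τ terms, consistent with the N_f (2 N_t + 3) count quoted above.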

4. culsp: A GPU LOMB-SCARGLE PERIODOGRAM CODE

4.1. Overview

This section introduces culsp, a Lomb-Scargle periodogram code implemented within NVIDIA's CUDA framework. Below, I provide a brief technical overview of CUDA. Section 4.3 then reviews the general design of culsp, and Section 4.4 narrates an abridged version of the kernel source. The full source, which is freely redistributable under the GNU General Public License, is provided in the accompanying on-line materials.

4.2. The CUDA Framework

A CUDA-enabled GPU comprises one or more streaming multiprocessors (SMs), themselves composed of a number of scalar processors (SPs)^2 that are functionally equivalent to processor cores. Together, the SPs allow an SM to support concurrent execution of blocks of up to 512 threads. Each thread applies the same computational kernel to an element of an input stream. Resources at a thread's disposal include its own register space; built-in integer indices uniquely identifying the thread; shared memory accessible by all threads in its parent block; and global memory accessible by all threads in all blocks. Reading or writing shared memory is typically as fast as accessing a register; however, global memory is two orders of magnitude slower.

CUDA programs are written in the C language with extensions that allow computational kernels to be defined and launched, and the differing types of memory to be allocated and accessed. A typical program will transfer input data from CPU memory to GPU memory; launch one or more kernels to process these data; and then copy the results back from GPU to CPU. Executables are created using the nvcc compiler from the CUDA software development kit (SDK).

A CUDA kernel has access to the standard C mathematical functions. In some cases, two versions are available ('library' and 'intrinsic'), offering different trade-offs between precision and speed (see Appendix C of NVIDIA 2010). For the sine and cosine functions, the library versions are accurate to within 2 units of last place, but are very slow because the range-reduction algorithm — required to bring arguments into the (−π/4, π/4) interval — spills temporary variables to global memory. The intrinsic versions do not suffer this performance penalty, as they are hardware-implemented in two special function units (SFUs) attached to each SM. However, they become inaccurate as their arguments depart from the (−π, π) interval. As discussed below, this inaccuracy can be remedied through a very simple range-reduction procedure.

4.3. Code Design

The culsp code is a straightforward CUDA implementation of the L-S periodogram in its refactored form (equations 6-8). A uniform frequency grid is assumed,

    f_i = i \, \Delta f  (i = 1, ..., N_f),    (9)

where the frequency spacing and number of frequencies are determined from

    \Delta f = \frac{1}{F_{over} (t_{N_t} - t_1)}    (10)

and

    N_f = \frac{F_{high} F_{over} N_t}{2},    (11)

respectively. The user-specified parameters F_over and F_high control the oversampling and extent of the periodogram; F_over = 1 gives the characteristic sampling established by the length of the time series, while F_high = 1 gives a maximum frequency equal to the mean Nyquist frequency f_Ny = N_t/[2(t_{N_t} - t_1)].

The input time series is read from disk and pre-processed to have zero mean and unit variance, before being copied to GPU global memory. Then, the computational kernel is launched for N_f threads arranged into blocks^3 of size N_b; each thread handles the periodogram calculation at a single frequency. Once all calculations are complete, the periodogram is copied back to CPU memory, and from there written to disk.

The sums in equation (8) involve the entire time series. To avoid a potential memory-access bottleneck, and to improve accuracy, culsp partitions these sums into chunks equal in size to the thread block size N_b. The time-series data required to evaluate the sums for a given chunk are copied from (slow) global memory into (fast) shared memory, with each thread in a block transferring a single (t_j, X_j) pair. Then, all threads in the block enjoy fast access to these data when evaluating their respective per-chunk sums.
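The host-side sequence just described (frequency-grid set-up, memory transfers, kernel launch, and retrieval of the results) can be sketched as follows. This is an illustrative outline under stated assumptions, not the actual culsp driver: the driver function name is mine, error checking is omitted, and N_t and N_f are assumed to be integer multiples of the block size.

#include <cuda_runtime.h>

#define BLOCK_SIZE 256

/* Kernel of Figure 1 (defined elsewhere) */
__global__ void culsp_kernel(float *d_t, float *d_X, float *d_P,
                             float df, int N_t);

/* Illustrative host-side driver: t[] and X[] hold the pre-whitened time
   series; P[] receives the periodogram at N_f frequencies */
void run_culsp(const float *t, const float *X, int N_t,
               float F_over, float F_high, float *P)
{
  /* Frequency spacing and number of frequencies (equations 10 and 11) */
  float df = 1.f/(F_over*(t[N_t-1] - t[0]));
  int N_f = (int) (0.5f*F_high*F_over*N_t);

  /* Allocate GPU global memory and copy the time series over */
  float *d_t, *d_X, *d_P;
  cudaMalloc((void **) &d_t, N_t*sizeof(float));
  cudaMalloc((void **) &d_X, N_t*sizeof(float));
  cudaMalloc((void **) &d_P, N_f*sizeof(float));
  cudaMemcpy(d_t, t, N_t*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_X, X, N_t*sizeof(float), cudaMemcpyHostToDevice);

  /* One thread per frequency, arranged into blocks of size N_b */
  culsp_kernel<<<N_f/BLOCK_SIZE, BLOCK_SIZE>>>(d_t, d_X, d_P, df, N_t);

  /* Copy the periodogram back to CPU memory */
  cudaMemcpy(P, d_P, N_f*sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(d_t);
  cudaFree(d_X);
  cudaFree(d_P);
}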
4.4. Kernel Source

Figure 1 lists abridged source for the culsp computational kernel. This is based on the full version supplied in the on-line materials, but special-case code (handling situations where N_t is not an integer multiple of N_b) has been removed to facilitate the discussion.

The kernel accepts five arguments (lines 2-3 of the listing). The first three are array pointers giving the global-memory addresses of the time series (d_t and d_X) and the output periodogram (d_P). The remaining two give the frequency spacing of the periodogram (df) and the number of points in the time series (N_t). The former is used on line 11 to evaluate the frequency from the thread and block indices; the macro BLOCK_SIZE is expanded by the pre-processor to the thread block size N_b.

Lines 27-70 construct the sums of equation (8), following the chunk partitioning approach described above (note, however, that the SS sum is not calculated explicitly, but reconstructed from CC on line 72). Lines 31-36 are responsible for copying the time-series data for a chunk from global memory to shared memory; the __syncthreads() instructions force synchronization across the whole thread block, to avoid potential race conditions. The inner loop (lines 41-58) then evaluates the per-chunk sums; the #pragma unroll directive on line 40 instructs the compiler to completely unroll this loop, conferring a significant performance increase.

^2 Eight, for the GPUs considered in the present work.
^3 Set to 256 throughout the present work; tests indicate that larger or smaller values give a slightly reduced performance.

 1  __global__ void
 2  culsp_kernel(float *d_t, float *d_X, float *d_P,
 3               float df, int N_t)
 4  {
 5
 6    __shared__ float s_t[BLOCK_SIZE];
 7    __shared__ float s_X[BLOCK_SIZE];
 8
 9    // Calculate the frequency
10
11    float f = (blockIdx.x*BLOCK_SIZE+threadIdx.x+1)*df;
12
13    // Calculate the various sums
14
15    float XC = 0.f;
16    float XS = 0.f;
17    float CC = 0.f;
18    float CS = 0.f;
19
20    float XC_chunk = 0.f;
21    float XS_chunk = 0.f;
22    float CC_chunk = 0.f;
23    float CS_chunk = 0.f;
24
25    int j;
26
27    for(j = 0; j < N_t; j += BLOCK_SIZE) {
28
29      // Load the chunk into shared memory
30
31      __syncthreads();
32
33      s_t[threadIdx.x] = d_t[j+threadIdx.x];
34      s_X[threadIdx.x] = d_X[j+threadIdx.x];
35
36      __syncthreads();
37
38      // Update the sums
39
40      #pragma unroll
41      for(int k = 0; k < BLOCK_SIZE; k++) {
42
43        // Range reduction
44
45        float ft = f*s_t[k];
46        ft -= rintf(ft);
47
48        float c;
49        float s;
50
51        __sincosf(TWOPI*ft, &s, &c);
52
53        XC_chunk += s_X[k]*c;
54        XS_chunk += s_X[k]*s;
55        CC_chunk += c*c;
56        CS_chunk += c*s;
57
58      }
59
60      XC += XC_chunk;
61      XS += XS_chunk;
62      CC += CC_chunk;
63      CS += CS_chunk;
64
65      XC_chunk = 0.f;
66      XS_chunk = 0.f;
67      CC_chunk = 0.f;
68      CS_chunk = 0.f;
69
70    }
71
72    float SS = (float) N_t - CC;
73
74    // Calculate the tau terms
75
76    float ct;
77    float st;
78
79    __sincosf(0.5f*atan2(2.f*CS, CC-SS), &st, &ct);
80
81    // Calculate P
82
83    d_P[blockIdx.x*BLOCK_SIZE+threadIdx.x] =
84      0.5f*((ct*XC + st*XS)*(ct*XC + st*XS)/
85            (ct*ct*CC + 2*ct*st*CS + st*st*SS) +
86            (ct*XS - st*XC)*(ct*XS - st*XC)/
87            (ct*ct*SS - 2*ct*st*CS + st*st*CC));
88
89    // Finish
90
91  }
92

Fig. 1.— Abridged source for the culsp computation kernel.

The sine and cosine terms in the sums are evaluated simultaneously with a call to CUDA's intrinsic __sincosf() function (line 51). To maintain accuracy, a simple range reduction is applied to the phase ft by subtracting the nearest integer [as calculated using rintf(); line 46]. This brings the argument of __sincosf() into the interval (−π, π), where its maximum absolute error is 2^−21.41 for sine and 2^−21.19 for cosine (see Table C-3 of NVIDIA 2010).

5. BENCHMARKING CALCULATIONS

5.1. Test Configurations

This section compares the accuracy and performance of culsp against an equivalent CPU-based code. The test platform is a Dell Precision 490 workstation, containing two Intel 2.33 GHz Xeon E5345 quad-core processors and 8 GB of RAM. The workstation also hosts a pair of NVIDIA GPUs: a Tesla C1060 populating the single PCI Express (PCIe) ×16 slot, and a GeForce 8400 GS in the single legacy PCI slot. These devices are broadly representative of the opposite ends of the GPU market. The 8400 GS is an entry-level product based on the older G80 hardware architecture (the first to support CUDA), and contains only a single SM. The C1060 is built on the newer GT200 architecture (released 2008/2009), and with 30 SMs represents one of the most powerful GPUs in NVIDIA's portfolio. The technical specifications of each GPU are summarized in Table 1.

TABLE 1
Specifications for the two GPUs used in the benchmarking.

GPU               SMs   SPs   Clock (GHz)   Memory (MB)
GeForce 8400 GS     1     8   1.4           512
Tesla C1060        30   240   1.3           4096

The CPU code used for comparison is lsp, a straightforward port of culsp to ISO C99 with a few modifications for performance and language compliance. The sine and cosine terms are calculated via separate calls to the sinf() and cosf() functions, since there is no sincosf() function in standard C99. The argument reduction step uses an integer cast instead of rintf(); this allows the compiler to vectorize the inner loops, greatly improving performance while having a negligible impact on results. Finally, the outer loop over frequency is trivially parallelized using an OpenMP directive, so that all available CPU cores can be utilized. Source for lsp is provided in the accompanying on-line materials.
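As a schematic of the structure just described — separate sinf()/cosf() calls, an integer-cast argument reduction, and an OpenMP-parallelized outer loop over frequency — the following C sketch may be helpful. It is illustrative only; the function and variable names are mine, and it is not a verbatim excerpt from lsp.

#include <math.h>

#define TWOPI 6.2831853f

/* Illustrative CPU periodogram loop in the style of lsp (not a verbatim
   excerpt): t[] and X[] hold the pre-whitened time series; P[] receives
   the periodogram at N_f frequencies spaced by df */
void periodogram_cpu(const float *t, const float *X, int N_t,
                     float df, int N_f, float *P)
{
  #pragma omp parallel for
  for (int i = 0; i < N_f; i++) {

    float f = (i+1)*df;
    float XC = 0.f, XS = 0.f, CC = 0.f, CS = 0.f;

    for (int j = 0; j < N_t; j++) {

      /* Argument reduction via an integer cast (vectorizable, unlike rintf) */
      float ft = f*t[j];
      ft -= (int) ft;

      float c = cosf(TWOPI*ft);
      float s = sinf(TWOPI*ft);

      XC += X[j]*c;
      XS += X[j]*s;
      CC += c*c;
      CS += c*s;
    }

    float SS = (float) N_t - CC;

    float tau = 0.5f*atan2f(2.f*CS, CC-SS);
    float ct = cosf(tau);
    float st = sinf(tau);

    P[i] = 0.5f*((ct*XC + st*XS)*(ct*XC + st*XS)/
                 (ct*ct*CC + 2.f*ct*st*CS + st*st*SS) +
                 (ct*XS - st*XC)*(ct*XS - st*XC)/
                 (ct*ct*SS - 2.f*ct*st*CS + st*st*CC));
  }
}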
The Precision 490 workstation runs 64-bit Gentoo Linux. GPU executables are created with the 3.1 release of the CUDA SDK, which relies on GNU gcc 4.4 as the host-side compiler. CPU executables are created with Intel's icc 11.1 compiler, using the -O3 and -xHost optimization flags.

5.2. Accuracy

As the validation dataset for comparing the accuracy of culsp and lsp, I use the 150-day photometric time series of the β Cephei pulsator V1449 Aql (HD 180642) obtained by the CoRoT mission (Belkacem et al. 2009). The observations comprise 382,003 flux measurements (after removal of points flagged as bad), sampled unevenly (in the heliocentric frame) with an average separation of 32 s.

Figure 2 plots the periodogram of V1449 Aql evaluated using lsp, over a frequency interval spanning the star's dominant 0.18 d pulsation mode (see Waelkens et al. 1998). Also shown in the figure is the absolute deviation |P_n^culsp − P_n^lsp| of the corresponding periodogram evaluated using culsp (running on either GPU — the results are identical). The figure confirms that, at least over this particular frequency interval, the two codes are in good agreement with one another; the relative error is on the order of 10^−6.

To explore accuracy over the full frequency range, Fig. 3 shows a scatter plot of P_n^lsp against P_n^culsp. Very few of the N_f = 1,528,064 points in this plot depart to any significant degree from the diagonal line P_n^lsp = P_n^culsp. Those that do are clustered in the P_n ≪ 1 corner of the plot, and are therefore associated with the noise in the light curve rather than any periodic signal. Moreover, the maximum absolute difference in the periodogram FAPs (equation 3) across all frequencies is 4.1 × 10^−5, which is negligible.

Fig. 2.— Part of the L-S periodogram for V1449 Aql, evaluated using the lsp code (thick curve). The thin curve shows the absolute deviation |P_n^culsp − P_n^lsp| of the corresponding periodogram evaluated using the culsp code. The strong peak corresponds to the star's dominant 0.18-d pulsation mode. [Figure: log P_n versus f (d^−1) over the interval 5.4-5.6 d^−1.]

Fig. 3.— A scatter plot of the L-S periodogram for V1449 Aql, evaluated using the lsp (abscissa) and culsp (ordinate) codes. [Figure: log P_n^culsp versus log P_n^lsp.]

Fig. 4.— Mean calculation times ⟨t_calc⟩ for the L-S periodogram, evaluated using the culsp (triangles) and lsp (circles) codes. The dashed line, with slope d log⟨t_calc⟩/d log N_t = 2, indicates the asymptotic scaling of the periodogram algorithm. [Figure: log⟨t_calc⟩ versus log N_t, for the GeForce 8400 GS, Tesla C1060, CPU (1 thread) and CPU (8 threads) configurations.]

5.3. Performance

Code performance is measured by averaging the V1449 Aql periodogram calculation time t_calc over five executions. These timings exclude the overheads incurred by disk input/output and from rectifying light curves to zero mean and unit variance. Table 2 lists the mean ⟨t_calc⟩ and associated standard deviation σ(t_calc) for culsp running on both GPUs, and for lsp running with a single OpenMP thread (equivalent to a purely serial CPU implementation), and with 8 OpenMP threads (one per workstation core).

TABLE 2
Periodogram calculation times.

Code    Platform           ⟨t_calc⟩ (s)   σ(t_calc) (s)
culsp   GeForce 8400 GS    570            0.0093
culsp   Tesla C1060        20.3           0.00024
lsp     CPU (1 thread)     4570           14
lsp     CPU (8 threads)    566            6.9

With just one thread, lsp is significantly outperformed by culsp on either GPU. Scaling up to 8 threads shortens the calculation time by a factor ∼8, indicating near-ideal parallelization; nevertheless, the two CPUs working together only just manage to beat the GeForce 8400 GS, and are still a factor ∼28 slower than the Tesla C1060. Perhaps surprisingly, the latter ratio is greater than suggested by comparing the theoretical peak floating-point performance of the two platforms — 74.6 GFLOPS (billions of floating-point operations per second) for all 8 CPU cores, versus 936 GFLOPS for the C1060. This clearly warrants further investigation.

Profiling with the GNU gprof tool indicates that the major bottleneck in lsp, accounting for 80% of ⟨t_calc⟩, is the __svml_sincosf4() function from Intel's Short Vector Math Library. This function evaluates four sine/cosine pairs at once by leveraging the SSE2 instructions of modern x86-architecture CPUs. Microbenchmarking reveals that a __svml_sincosf4() call costs ∼45.6 clock cycles, or ∼11.4 cycles per sine/cosine pair. In contrast, thanks to its two special function units, a GPU SM can evaluate a sine/cosine pair in a single cycle (see Appendix G.3.1 of NVIDIA 2010). Scaling these values by the appropriate clock frequencies and processor counts, the sine/cosine throughput on all 8 CPU cores is 1.6 billion operations per second (GOPS), whereas on the 30 SMs of the C1060 it is 39 GOPS, around 23 times faster. Of course, it should be recalled that the GPU __sincosf() function operates at a somewhat-reduced precision (see Section 4.4). In principle, the CPU throughput could be improved by developing a similar reduced-precision function to replace __svml_sincosf4(). However, it seems unlikely that a software routine could ever approach the throughput of the dedicated hardware in the SFUs.

The disparity between sine/cosine throughput accounts for most of the factor ∼28 performance difference between culsp and lsp, noted above. The remainder comes from the ability of an SM to execute instructions simultaneously on its SFUs and scalar processors. That is, the sine/cosine evaluations can be undertaken in parallel with the other arithmetic operations involved in the periodogram calculation.

Looking now at the memory performance of culsp, NVIDIA's cudaprof profiling tool indicates that almost all reads from global memory are coalesced, and that no bank conflicts arise when reading from shared memory. Thus, the GPU memory access patterns can be considered close to optimal. The combined time spent copying data from CPU to GPU and vice versa is ∼6 ms on the C1060, and ∼29 ms on the 8400 GS; while these values clearly reflect the bandwidth difference between the PCIe and PCI slots hosting the GPUs, neither makes any appreciable contribution to the execution times listed in Table 2.

To round off the present discussion, I explore how culsp and lsp perform with different-sized datasets. The analysis in Section 3 indicates a periodogram workload scaling as O(N_f N_t). With the number of frequencies following N_f ∝ N_t (equation 11), t_calc should therefore scale proportionally with N_t^2 — as in fact already claimed in the Introduction. To test this expectation, Fig. 4 shows a log-log plot of ⟨t_calc⟩ as a function of N_t, for the same configurations as in Table 2. The light curve for a given N_t is generated from the full V1449 Aql light curve by uniform down-sampling.

In the limit of large N_t, all curves asymptote toward a slope d log⟨t_calc⟩/d log N_t = 2, confirming the hypothesized N_t^2 scaling. At smaller N_t, departures from this scaling arise from computational overheads that are not directly associated with the actual periodogram calculation. These are most clearly seen in the lsp curve for 8 threads, which approaches a constant log⟨t_calc⟩ ≈ −1.5 independent of N_t — perhaps due to memory cache contention between the different threads.

6. DISCUSSION

6.1. Cost/Benefit Analysis

To establish a practical context for the results of the preceding sections, I briefly examine the price vs. performance of the CPU and GPU platforms. At the time of writing, the manufacturer's bulk (1,000-unit) pricing for a pair of Xeon E5345 CPUs is 2 × $455, while a Tesla C1060 retails for around $1,300 and a GeForce 8400 GS for around $50. First considering the C1060, the 50% greater cost of this device (relative to the CPUs) brings almost a factor thirty reduction in periodogram calculation time — an impressive degree of leveraging. However, its hefty price tag together with demanding infrastructure requirements (dedicated PCIe power connectors, supplying up to 200 W), means that it may not be the ideal GPU choice in all situations.

The 8400 GS offers a similar return-on-investment at a much-more affordable price: almost the same performance as the two CPUs at one-twentieth of the cost. This heralds the possibility of budget GPU computing, where a low-end desktop computer is paired with an entry-level GPU, to give performance exceeding high-end, multi-core workstations for a price tag of just a few hundred dollars. Indeed, many desktop computers today, or even laptops, are already well equipped to serve in this capacity.

6.2. Applications

An immediate application of culsp is analysis of the photometric time series obtained by ongoing satellite missions such as MOST (Walker et al. 2003), CoRoT (Auvergne et al. 2009), and Kepler (Koch et al. 2010). These datasets are typically very large (N_t ≳ 10^5), leading to a significant per-star cost for calculating a periodogram. When this cost is multiplied by the number of targets being monitored (in the case of Kepler, again ≳ 10^5), the overall computational burden becomes very steep. Looking into the near future, similar issues will be faced with ground-based time-domain facilities such as Pan-STARRS (Kaiser et al. 2002) and the Large Synoptic Survey Telescope (LSST Science Collaborations and LSST Project 2009). It is hoped that culsp, or an extension to other related periodograms (see below), will help resolve these challenges.

An additional application of culsp is in the interpretation of periodograms. Equation (3) presumes that the P_n at each frequency in the periodogram is independent of the others. This is not necessarily the case, and the exponent in the equation should formally be replaced by some number N_f,ind representing the number of independent frequencies. Horne & Baliunas (1986) pioneered the use of simulations to estimate N_f,ind empirically, and similar Monte-Carlo techniques have since been applied to explore the statistical properties of the L-S periodogram in detail (see Frescura et al. 2008, and references therein). The bottleneck in these simulations is the many periodogram evaluations, making them strong candidates for GPU acceleration.

6.3. Future Work

Recognizing the drawbacks of being wedded to one particular hardware/software vendor, a strategically important future project will be to port culsp to OpenCL (Open Computing Language) — a recently developed standard for programming devices such as multi-core CPUs and GPUs in a platform-neutral manner (see, e.g., Stone et al. 2010). There is also considerable scope for applying the lessons learned herein to other spectral analysis techniques. Shrager (2001) and Zechmeister & Kürster (2009) generalize the L-S periodogram to allow for the fact that the time-series mean is typically not known a priori, but instead estimated from the data themselves. The expressions derived by these authors involve sums having very similar forms to equation (8); thus, it should prove trivial to develop GPU implementations of the generalized periodograms. The multi-harmonic periodogram of Schwarzenberg-Czerny (1996) and the SigSpec method of Reegen (2007) also appear promising candidates for implementation on GPUs, although algorithmically they are rather more complex.

Looking at the bigger picture, while the astronomical theory and modeling communities have been quick to recognize the usefulness of GPUs (see Section 1), progress has been more gradual in the observational community; radio correlation is the only significant application to date (Wayth et al. 2009). It is my hope that the present paper will help illustrate the powerful data-analysis capabilities of GPUs, and demonstrate strategies for using these devices effectively.

I thank Dr. Gordon Freeman for the initial inspiration to explore this line of research, and the anonymous referee for many helpful suggestions that improved the manuscript. I moreover acknowledge support from NSF Advanced Technology and Instrumentation grant AST-0904607. The Tesla C1060 GPU used in this study was donated by NVIDIA through their Professor Partnership Program, and I have made extensive use of NASA's Astrophysics Data System bibliographic services.

REFERENCES

Aubert D., Teyssier R., 2010, ApJ
Auvergne M., et al., 2009, A&A, 506, 411
Barsdell B. R., Barnes D. G., Fluke C. J., 2010, MNRAS, p. 1200
Belkacem K., Samadi R., Goupil M., Lefèvre L., Baudin F., Deheuvels S., Dupret M., Appourchaux T., Scuflaire R., Auvergne M., Catala C., Michel E., Miglio A., Montalban J., Thoul A., Talon S., Baglin A., Noels A., 2009, Science, 324, 1540
Belleman R. G., Bédorf J., Portegies Zwart S. F., 2008, New Astronomy, 13, 103
Buck I., Foley T., Horn D., Sugerman J., Fatahalian K., Houston M., Hanrahan P., 2004, ACM Transactions on Graphics, 23, 777
Cooley J. W., Tukey J. W., 1965, Mathematics of Computation, 19, 297
Frescura F. A. M., Engelbrecht C. A., Frank B. S., 2008, MNRAS, 388, 1693
Horne J. H., Baliunas S. L., 1986, ApJ, 302, 757
Kaiser N., et al., 2002, in Tyson J. A., Wolff S., eds, SPIE Conf. Ser. 4836, p. 154
Koch D. G., et al., 2010, ApJ, 713, L79
Lomb N. R., 1976, Ap&SS, 39, 447
LSST Science Collaborations and LSST Project, 2009, The LSST Science Book, Version 2.0
Lyne A., Hobbs G., Kramer M., Stairs I., Stappers B., 2010, Science, 329, 408
Nguyen H., 2007, GPU Gems 3. Addison-Wesley Professional
NVIDIA, 2010, NVIDIA CUDA Programming Guide 3.1
Owens J. D., Luebke D., Govindaraju N., Harris M., Krüger J., Lefohn A. E., Purcell T. J., 2005, in Eurographics 2005: State of the Art Reports, p. 21
Pharr M., 2005, GPU Gems 2. Addison-Wesley Professional
Press W. H., Rybicki G. B., 1989, ApJ, 338, 277
Press W. H., Teukolsky S. A., Vetterling W. T., Flannery B. P., 1992, Numerical Recipes in Fortran, 2nd edn. Cambridge University Press, Cambridge
Rani B., Gupta A. C., Joshi U. C., Ganesh S., Wiita P. J., 2010, ApJ, 719, L153
Reegen P., 2007, A&A, 467, 1353
Rost R. J., 2006, OpenGL Shading Language, 2nd edn. Addison-Wesley Professional
Scargle J. D., 1982, ApJ, 263, 835
Schive H., Tsai Y., Chiueh T., 2010, ApJS, 186, 457
Schwarzenberg-Czerny A., 1996, ApJ, 460, 107
Schwarzenberg-Czerny A., 1998, MNRAS, 301, 831
Shrager R. I., 2001, Ap&SS, 277, 519
Simpson E. K., Baliunas S. L., Henry G. W., Watson C. A., 2010, MNRAS, p. 1209
Stone J. E., Gohara D., Shi G., 2010, Computing in Science and Engineering, 12, 66
Sturrock P. A., Buncher J. B., Fischbach E., Gruenwald J. T., Javorsek II D., Jenkins J. H., Lee R. H., Mattes J. J., Newport J. R., 2010
Thompson A. C., Fluke C. J., Barnes D. G., Barsdell B. R., 2010, New Astronomy, 15, 16
Waelkens C., Aerts C., Kestens E., Grenon M., Eyer L., 1998, A&A, 330, 215
Walker G., Matthews J., Kuschnig R., Johnson R., Rucinski S., Pazder J., Burley G., Walker A., Skaret K., Zee R., Grocott S., Carroll K., Sinclair P., Sturgeon D., Harron J., 2003, PASP, 115, 1023
Wayth R. B., Greenhill L. J., Briggs F. H., 2009, PASP, 121, 857
Zechmeister M., Kürster M., 2009, A&A, 496, 577