arXiv:1007.1658v2 [astro-ph.SR] 19 Oct 2010

Draft version August 7, 2018
Preprint typeset using LaTeX style emulateapj v. 11/10/09

FAST CALCULATION OF THE LOMB-SCARGLE PERIODOGRAM USING GRAPHICS PROCESSING UNITS

R. H. D. Townsend
Department of Astronomy, University of Wisconsin-Madison, Sterling Hall, 475 N. Charter Street, Madison, WI 53706, USA; [email protected]

ABSTRACT
I introduce a new code for fast calculation of the Lomb-Scargle periodogram, that leverages the computing power of graphics processing units (GPUs). After establishing a background to the newly emergent field of GPU computing, I discuss the design of the code and narrate key parts of its source. Benchmarking calculations indicate no significant differences in accuracy compared to an equivalent CPU-based code. However, the differences in performance are pronounced; running on a low-end GPU, the code can match 8 CPU cores, and on a high-end GPU it is faster by a factor approaching thirty. Applications of the code include analysis of long photometric time series obtained by ongoing satellite missions and upcoming ground-based monitoring facilities; and Monte-Carlo simulation of periodogram statistical properties.
Subject headings: methods: data analysis — methods: numerical — techniques: photometric — stars: oscillations

1. INTRODUCTION

Astronomical time-series observations are often characterized by uneven temporal sampling (e.g., due to transformation to the heliocentric frame) and/or non-uniform coverage (e.g., from day/night cycles, or radiation belt passages). This complicates the search for periodic signals, as a fast Fourier transform (FFT) cannot be employed. A variety of alternatives have been put forward, the most oft-used being the eponymous Lomb-Scargle (L-S) periodogram developed by Lomb (1976) and Scargle (1982). At the time of writing, NASA's Astrophysics Data System (ADS) lists 735 and 1,810 publications (respectively) that cite these two papers, highlighting how important the L-S periodogram has proven for the analysis of time series. Recent applications include the search for a link between solar rotation and nuclear decay rates (Sturrock et al. 2010); the study of pulsar timing noise (Lyne et al. 2010); the characterization of quasi-periodic oscillations in blazars (Rani et al. 2010); and the measurement of rotation periods in exoplanet host stars (Simpson et al. 2010).

Unfortunately, a drawback of the L-S periodogram is a computational cost scaling as O(N_t^2), where N_t is the number of measurements in the time series; this contrasts with the far-more-efficient O(N_t log_2 N_t) scaling of the FFT algorithm popularized by Cooley & Tukey (1965). One approach to reducing this cost has been proposed by Press & Rybicki (1989), based on constructing a uniformly sampled approximation to the observations via 'extirpolation' and then evaluating its FFT. The present paper introduces a different approach, achieved not through algorithmic development but rather by leveraging the computing power of specialized hardware — the graphics processing units (GPUs) at the heart of the display sub-system in personal computers and workstations. Modern GPUs typically comprise a number of identical programmable processors, and in recent years there has been significant interest in applying these parallel-computing resources to problems across a breadth of scientific disciplines.

In the following section, I give a brief history of the newly emergent field of GPU computing. Section 3 reviews the formalism defining the L-S periodogram, and Section 4 then presents a GPU-based code implementing this formalism. Benchmarking calculations that evaluate the accuracy and performance of the code are presented in Section 5. The findings and future outlook are then discussed in Section 6.

2. BACKGROUND TO GPU COMPUTING

2.1. Pre-2006: Initial Forays

The past decade has seen remarkable increases in the ability of computers to render complex 3-dimensional scenes at movie frame-rates. These gains have been achieved by progressively shifting the steps of the graphics pipeline — the sequence of algorithmic steps that converts a scene description into an image — from the CPU to dedicated hardware within the GPU. To address the inflexibility that can accompany such hardware acceleration, GPU vendors introduced so-called programmable shaders, processing units that apply a simple sequence of transformations to input elements such as image pixels and mesh vertices. NVIDIA Corporation were the first to implement shader functionality, with their GeForce 3 series of GPUs (released March 2001) offering one vertex and four (parallel) pixel shaders. The following year brought the release of ATI Corporation's R300 series, which not only increased the number of shaders (up to 4 vertex and 8 pixel), but also laid the foundations for what would ultimately become GPU computing, through constructs such as floating-point arithmetic and looping capabilities.

Shaders are programmed using a variety of specialized languages, such as the OpenGL Shading Language (GLSL; e.g., Rost 2006) and Microsoft's High-Level Shading Language (HLSL). The designs of these languages are strongly tied to their graphics-related purpose, and thus early attempts at GPU computing had to map each calculation into a sequence of equivalent graphical operations (see, e.g., Owens et al. 2005, and references therein). In an effort to overcome this awkward aspect, Buck et al. (2004) developed BrookGPU — a compiler and run-time implementation of the Brook stream programming language for GPU platforms. With BrookGPU, the computational resources of shaders are accessed through a stream processing paradigm: a well-defined series of operations (the kernel) are applied to each element in a typically-large homogeneous sequence of data (the stream).
2.2. Post-2006: Modern Era

GPU computing entered its modern era in 2006, with the release of NVIDIA's Compute Unified Device Architecture (CUDA) — a framework for defining and managing GPU computations without the need to map them into graphical operations. CUDA-enabled devices (see Appendix A of NVIDIA 2010) are distinguished by their general-purpose unified shaders, which replace the function-specific shaders (pixel, vertex, etc.) present in earlier GPUs. These shaders are programmed using an extension to the C language, which follows the same stream-processing paradigm pioneered by BrookGPU. Since the launch of CUDA, other vendors have been quick to develop their own GPU computing offerings, most notably Advanced Micro Devices (AMD) with their Stream framework, and Microsoft with their DirectCompute interface.

Abstracting away the graphical roots of GPUs has made them accessible to a very broad audience, and GPU-based computations are now being undertaken in fields as diverse as molecular biology, medical imaging, geophysics, fluid dynamics, economics and cryptography (see Pharr 2005; Nguyen 2007). Within astronomy and astrophysics, recent applications include N-body simulations (Belleman et al. 2008), real-time radio correlation (Wayth et al. 2009), gravitational lensing (Thompson et al. 2010), adaptive-mesh hydrodynamics (Schive et al. 2010) and cosmological reionization (Aubert & Teyssier 2010).

3. THE LOMB-SCARGLE PERIODOGRAM

This section reviews the formalism defining the Lomb-Scargle periodogram. For a time series comprising N_t measurements X_j ≡ X(t_j) sampled at times t_j (j = 1, ..., N_t), assumed throughout to have been scaled and shifted such that its mean is zero and its variance is unity, the normalized L-S periodogram at frequency f is

    P_n(f) = \frac{1}{2} \left\{ \frac{\left[ \sum_j X_j \cos \omega (t_j - \tau) \right]^2}{\sum_j \cos^2 \omega (t_j - \tau)} + \frac{\left[ \sum_j X_j \sin \omega (t_j - \tau) \right]^2}{\sum_j \sin^2 \omega (t_j - \tau)} \right\}.    (1)

Here and throughout, ω ≡ 2πf is the angular frequency and all summations run from j = 1 to j = N_t. The frequency-dependent time offset τ is evaluated at each ω via

    \tan 2\omega\tau = \frac{\sum_j \sin 2\omega t_j}{\sum_j \cos 2\omega t_j}.    (2)

As discussed by Schwarzenberg-Czerny (1998), P_n in the case of a pure Gaussian-noise time series is drawn from a beta distribution. For a periodogram comprising N_f frequencies^1, the false-alarm probability (FAP) — that some observed peak occurs due to chance fluctuations — is

    Q = 1 - \left[ 1 - \left( 1 - \frac{2 P_n}{N_t} \right)^{(N_t - 3)/2} \right]^{N_f}.    (3)

Equations (1) and (2) can be written schematically as

    P_n(f) = \sum_j G[f, (t_j, X_j)],    (4)

where G is some function. In the classification scheme introduced by Barsdell et al. (2010), this follows the form of an interact algorithm. Generally speaking, such algorithms are well-suited to GPU implementation, since they are able to achieve a high arithmetic intensity. However, a straightforward implementation of equations (1) and (2) involves two complete runs through the time series to calculate a single P_n(f), which is wasteful of memory bandwidth and requires N_f (4 N_t + 1) costly trigonometric function evaluations for the full periodogram. Press et al. (1992) address this inefficiency by calculating the trig functions from recursion relations, but this approach is difficult to map onto stream processing concepts, and moreover becomes inaccurate in the limit of large N_f. An alternative strategy, which avoids these difficulties while still offering improved performance, comes from refactoring the equations as

    P_n(f) = \frac{1}{2} \left[ \frac{(c_\tau XC + s_\tau XS)^2}{c_\tau^2 CC + 2 c_\tau s_\tau CS + s_\tau^2 SS} + \frac{(c_\tau XS - s_\tau XC)^2}{c_\tau^2 SS - 2 c_\tau s_\tau CS + s_\tau^2 CC} \right]    (5)

and

    \tan 2\omega\tau = \frac{2\, CS}{CC - SS}.    (6)

Here,

    c_\tau = \cos \omega\tau,  s_\tau = \sin \omega\tau,    (7)

while the sums

    XC = \sum_j X_j \cos \omega t_j,
    XS = \sum_j X_j \sin \omega t_j,
    CC = \sum_j \cos^2 \omega t_j,    (8)
    SS = \sum_j \sin^2 \omega t_j,
    CS = \sum_j \cos \omega t_j \sin \omega t_j,

can be evaluated in a single run through the time series, giving a total of N_f (2 N_t + 3) trig evaluations for the full periodogram — a factor ∼2 improvement.

^1 The issue of 'independent' frequencies is briefly discussed in Section 6.2.
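To make this single-pass structure concrete, the following C sketch evaluates P_n at one frequency from the refactored sums. It is a minimal illustration of equations (5)-(8) only — the function and variable names are mine, not excerpts from the codes introduced in the following sections — and it assumes the time series has already been scaled to zero mean and unit variance.

#include <math.h>

/* Minimal single-frequency evaluation of the refactored periodogram
   (equations 5-8), for a time series t[0..N_t-1], X[0..N_t-1] with
   zero mean and unit variance.  Illustrative sketch only. */
double ls_power(const double *t, const double *X, int N_t, double f)
{
  double omega = 2.0*M_PI*f;
  double XC = 0.0, XS = 0.0, CC = 0.0, CS = 0.0;

  /* A single run through the time series accumulates the sums of
     equation (8); SS follows from CC, since cos^2 + sin^2 = 1 */
  for (int j = 0; j < N_t; j++) {
    double c = cos(omega*t[j]);
    double s = sin(omega*t[j]);
    XC += X[j]*c;
    XS += X[j]*s;
    CC += c*c;
    CS += c*s;
  }
  double SS = (double) N_t - CC;

  /* Time offset tau from equation (6), then P_n from equation (5) */
  double tau_arg = atan2(2.0*CS, CC - SS);
  double ct = cos(0.5*tau_arg);
  double st = sin(0.5*tau_arg);

  return 0.5*((ct*XC + st*XS)*(ct*XC + st*XS)/
              (ct*ct*CC + 2.0*ct*st*CS + st*st*SS) +
              (ct*XS - st*XC)*(ct*XS - st*XC)/
              (ct*ct*SS - 2.0*ct*st*CS + st*st*CC));
}

Note that each pass through the loop uses two trigonometric evaluations per sample, plus three for the τ terms, consistent with the N_f (2 N_t + 3) count quoted above.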

4. culsp: A GPU LOMB-SCARGLE PERIODOGRAM CODE

4.1. Overview

This section introduces culsp, a Lomb-Scargle periodogram code implemented within NVIDIA's CUDA framework. Below, I provide a brief technical overview of CUDA. Section 4.3 then reviews the general design of culsp, and Section 4.4 narrates an abridged version of the kernel source. The full source, which is freely redistributable under the GNU General Public License, is provided in the accompanying on-line materials.

4.2. The CUDA Framework

A CUDA-enabled GPU comprises one or more streaming multiprocessors (SMs), themselves composed of a number of scalar processors (SPs)^2 that are functionally equivalent to processor cores. Together, the SPs allow an SM to support concurrent execution of blocks of up to 512 threads. Each thread applies the same computational kernel to an element of an input stream. Resources at a thread's disposal include its own register space; built-in integer indices uniquely identifying the thread; shared memory accessible by all threads in its parent block; and global memory accessible by all threads in all blocks. Reading or writing shared memory is typically as fast as accessing a register; however, global memory is two orders of magnitude slower.

CUDA programs are written in the C language with extensions that allow computational kernels to be defined and launched, and the differing types of memory to be allocated and accessed. A typical program will transfer input data from CPU memory to GPU memory; launch one or more kernels to process these data; and then copy the results back from GPU to CPU. Executables are created using the nvcc compiler from the CUDA software development kit (SDK).

A CUDA kernel has access to the standard C mathematical functions. In some cases, two versions are available ('library' and 'intrinsic'), offering different trade-offs between precision and speed (see Appendix C of NVIDIA 2010). For the sine and cosine functions, the library versions are accurate to within 2 units of last place, but are very slow because the range-reduction algorithm — required to bring arguments into the (−π/4, π/4) interval — spills temporary variables to global memory. The intrinsic versions do not suffer this performance penalty, as they are hardware-implemented in two special function units (SFUs) attached to each SM. However, they become inaccurate as their arguments depart from the (−π, π) interval. As discussed below, this inaccuracy can be remedied through a very simple range-reduction procedure.

4.3. Code Design

The culsp code is a straightforward CUDA implementation of the L-S periodogram in its refactored form (equations 6-8). A uniform frequency grid is assumed,

    f_i = i \, \Delta f  (i = 1, ..., N_f),    (9)

where the frequency spacing and number of frequencies are determined from

    \Delta f = \frac{1}{F_{over} (t_{N_t} - t_1)}    (10)

and

    N_f = \frac{F_{high} F_{over} N_t}{2},    (11)

respectively. The user-specified parameters F_over and F_high control the oversampling and extent of the periodogram; F_over = 1 gives the characteristic sampling established by the length of the time series, while F_high = 1 gives a maximum frequency equal to the mean Nyquist frequency f_Ny = N_t/[2(t_{N_t} - t_1)].

The input time series is read from disk and pre-processed to have zero mean and unit variance, before being copied to GPU global memory. Then, the computational kernel is launched for N_f threads arranged into blocks^3 of size N_b; each thread handles the periodogram calculation at a single frequency. Once all calculations are complete, the periodogram is copied back to CPU memory, and from there written to disk.

The sums in equation (8) involve the entire time series. To avoid a potential memory-access bottleneck, and to improve accuracy, culsp partitions these sums into chunks equal in size to the thread block size N_b. The time-series data required to evaluate the sums for a given chunk are copied from (slow) global memory into (fast) shared memory, with each thread in a block transferring a single (t_j, X_j) pair. Then, all threads in the block enjoy fast access to these data when evaluating their respective per-chunk sums.
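The host-side sequence just described (frequency-grid set-up, memory transfers, kernel launch, and retrieval of the results) can be sketched as follows. This is an illustrative outline under stated assumptions, not the actual culsp driver: the driver function name is mine, error checking is omitted, and N_t and N_f are assumed to be integer multiples of the block size.

#include <cuda_runtime.h>

#define BLOCK_SIZE 256

/* Kernel of Figure 1 (defined elsewhere) */
__global__ void culsp_kernel(float *d_t, float *d_X, float *d_P,
                             float df, int N_t);

/* Illustrative host-side driver: t[] and X[] hold the pre-whitened time
   series; P[] receives the periodogram at N_f frequencies */
void run_culsp(const float *t, const float *X, int N_t,
               float F_over, float F_high, float *P)
{
  /* Frequency spacing and number of frequencies (equations 10 and 11) */
  float df = 1.f/(F_over*(t[N_t-1] - t[0]));
  int N_f = (int) (0.5f*F_high*F_over*N_t);

  /* Allocate GPU global memory and copy the time series over */
  float *d_t, *d_X, *d_P;
  cudaMalloc((void **) &d_t, N_t*sizeof(float));
  cudaMalloc((void **) &d_X, N_t*sizeof(float));
  cudaMalloc((void **) &d_P, N_f*sizeof(float));
  cudaMemcpy(d_t, t, N_t*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_X, X, N_t*sizeof(float), cudaMemcpyHostToDevice);

  /* One thread per frequency, arranged into blocks of size N_b */
  culsp_kernel<<<N_f/BLOCK_SIZE, BLOCK_SIZE>>>(d_t, d_X, d_P, df, N_t);

  /* Copy the periodogram back to CPU memory */
  cudaMemcpy(P, d_P, N_f*sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(d_t);
  cudaFree(d_X);
  cudaFree(d_P);
}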
4.4. Kernel Source

Figure 1 lists abridged source for the culsp computational kernel. This is based on the full version supplied in the on-line materials, but special-case code (handling situations where N_t is not an integer multiple of N_b) has been removed to facilitate the discussion.

The kernel accepts five arguments (lines 2-3 of the listing). The first three are array pointers giving the global-memory addresses of the time series (d_t and d_X) and the output periodogram (d_P). The remaining two give the frequency spacing of the periodogram (df) and the number of points in the time series (N_t). The former is used on line 11 to evaluate the frequency from the thread and block indices; the macro BLOCK_SIZE is expanded by the pre-processor to the thread block size N_b.

Lines 27-70 construct the sums of equation (8), following the chunk partitioning approach described above (note, however, that the SS sum is not calculated explicitly, but reconstructed from CC on line 72). Lines 31-36 are responsible for copying the time-series data for a chunk from global memory to shared memory; the __syncthreads() instructions force synchronization across the whole thread block, to avoid potential race conditions. The inner loop (lines 41-58) then evaluates the per-chunk sums; the #pragma unroll directive on line 40 instructs the compiler to completely unroll this loop, conferring a significant performance increase.

^2 Eight, for the GPUs considered in the present work.
^3 Set to 256 throughout the present work; tests indicate that larger or smaller values give a slightly reduced performance.

 1  __global__ void
 2  culsp_kernel(float *d_t, float *d_X, float *d_P,
 3               float df, int N_t)
 4  {
 5
 6    __shared__ float s_t[BLOCK_SIZE];
 7    __shared__ float s_X[BLOCK_SIZE];
 8
 9    // Calculate the frequency
10
11    float f = (blockIdx.x*BLOCK_SIZE+threadIdx.x+1)*df;
12
13    // Calculate the various sums
14
15    float XC = 0.f;
16    float XS = 0.f;
17    float CC = 0.f;
18    float CS = 0.f;
19
20    float XC_chunk = 0.f;
21    float XS_chunk = 0.f;
22    float CC_chunk = 0.f;
23    float CS_chunk = 0.f;
24
25    int j;
26
27    for(j = 0; j < N_t; j += BLOCK_SIZE) {
28
29      // Load the chunk into shared memory
30
31      __syncthreads();
32
33      s_t[threadIdx.x] = d_t[j+threadIdx.x];
34      s_X[threadIdx.x] = d_X[j+threadIdx.x];
35
36      __syncthreads();
37
38      // Update the sums
39
40      #pragma unroll
41      for(int k = 0; k < BLOCK_SIZE; k++) {
42
43        // Range reduction
44
45        float ft = f*s_t[k];
46        ft -= rintf(ft);
47
48        float c;
49        float s;
50
51        __sincosf(TWOPI*ft, &s, &c);
52
53        XC_chunk += s_X[k]*c;
54        XS_chunk += s_X[k]*s;
55        CC_chunk += c*c;
56        CS_chunk += c*s;
57
58      }
59
60      XC += XC_chunk;
61      XS += XS_chunk;
62      CC += CC_chunk;
63      CS += CS_chunk;
64
65      XC_chunk = 0.f;
66      XS_chunk = 0.f;
67      CC_chunk = 0.f;
68      CS_chunk = 0.f;
69
70    }
71
72    float SS = (float) N_t - CC;
73
74    // Calculate the tau terms
75
76    float ct;
77    float st;
78
79    __sincosf(0.5f*atan2(2.f*CS, CC-SS), &st, &ct);
80
81    // Calculate P
82
83    d_P[blockIdx.x*BLOCK_SIZE+threadIdx.x] =
84      0.5f*((ct*XC + st*XS)*(ct*XC + st*XS)/
85            (ct*ct*CC + 2*ct*st*CS + st*st*SS) +
86            (ct*XS - st*XC)*(ct*XS - st*XC)/
87            (ct*ct*SS - 2*ct*st*CS + st*st*CC));
88
89    // Finish
90
91  }
92

Fig. 1.— Abridged source for the culsp computation kernel.

The sine and cosine terms in the sums are evaluated simultaneously with a call to CUDA's intrinsic __sincosf() function (line 51). To maintain accuracy, a simple range reduction is applied to the phase ft by subtracting the nearest integer [as calculated using rintf(); line 46]. This brings the argument of __sincosf() into the interval (−π, π), where its maximum absolute error is 2^−21.41 for sine and 2^−21.19 for cosine (see Table C-3 of NVIDIA 2010).

5. BENCHMARKING CALCULATIONS

5.1. Test Configurations

This section compares the accuracy and performance of culsp against an equivalent CPU-based code. The test platform is a Dell Precision 490 workstation, containing two Intel 2.33 GHz Xeon E5345 quad-core processors and 8 GB of RAM. The workstation also hosts a pair of NVIDIA GPUs: a Tesla C1060 populating the single PCI Express (PCIe) ×16 slot, and a GeForce 8400 GS in the single legacy PCI slot. These devices are broadly representative of the opposite ends of the GPU market. The 8400 GS is an entry-level product based on the older G80 hardware architecture (the first to support CUDA), and contains only a single SM. The C1060 is built on the newer GT200 architecture (released 2008/2009), and with 30 SMs represents one of the most powerful GPUs in NVIDIA's portfolio. The technical specifications of each GPU are summarized in Table 1.

TABLE 1
Specifications for the two GPUs used in the benchmarking.

GPU               SMs   SPs   Clock (GHz)   Memory (MB)
GeForce 8400 GS     1     8   1.4           512
Tesla C1060        30   240   1.3           4096

The CPU code used for comparison is lsp, a straightforward port of culsp to ISO C99 with a few modifications for performance and language compliance. The sine and cosine terms are calculated via separate calls to the sinf() and cosf() functions, since there is no sincosf() function in standard C99. The argument reduction step uses an integer cast instead of rintf(); this allows the compiler to vectorize the inner loops, greatly improving performance while having a negligible impact on results. Finally, the outer loop over frequency is trivially parallelized using an OpenMP directive, so that all available CPU cores can be utilized. Source for lsp is provided in the accompanying on-line materials.
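As a schematic of the structure just described — separate sinf()/cosf() calls, an integer-cast argument reduction, and an OpenMP-parallelized outer loop over frequency — the following C sketch may be helpful. It is illustrative only; the function and variable names are mine, and it is not a verbatim excerpt from lsp.

#include <math.h>

#define TWOPI 6.2831853f

/* Illustrative CPU periodogram loop in the style of lsp (not a verbatim
   excerpt): t[] and X[] hold the pre-whitened time series; P[] receives
   the periodogram at N_f frequencies spaced by df */
void periodogram_cpu(const float *t, const float *X, int N_t,
                     float df, int N_f, float *P)
{
  #pragma omp parallel for
  for (int i = 0; i < N_f; i++) {

    float f = (i+1)*df;
    float XC = 0.f, XS = 0.f, CC = 0.f, CS = 0.f;

    for (int j = 0; j < N_t; j++) {

      /* Argument reduction via an integer cast (vectorizable, unlike rintf) */
      float ft = f*t[j];
      ft -= (int) ft;

      float c = cosf(TWOPI*ft);
      float s = sinf(TWOPI*ft);

      XC += X[j]*c;
      XS += X[j]*s;
      CC += c*c;
      CS += c*s;
    }

    float SS = (float) N_t - CC;

    float tau = 0.5f*atan2f(2.f*CS, CC-SS);
    float ct = cosf(tau);
    float st = sinf(tau);

    P[i] = 0.5f*((ct*XC + st*XS)*(ct*XC + st*XS)/
                 (ct*ct*CC + 2.f*ct*st*CS + st*st*SS) +
                 (ct*XS - st*XC)*(ct*XS - st*XC)/
                 (ct*ct*SS - 2.f*ct*st*CS + st*st*CC));
  }
}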
The Precision 490 workstation runs 64-bit Gentoo Linux. GPU executables are created with the 3.1 release of the CUDA SDK, which relies on GNU gcc 4.4 as the host-side compiler. CPU executables are created with Intel's icc 11.1 compiler, using the -O3 and -xHost optimization flags.

5.2. Accuracy

As the validation dataset for comparing the accuracy of culsp and lsp, I use the 150-day photometric time series of the β Cephei pulsator V1449 Aql (HD 180642) obtained by the CoRoT mission (Belkacem et al. 2009). The observations comprise 382,003 flux measurements (after removal of points flagged as bad), sampled unevenly (in the heliocentric frame) with an average separation of 32 s.

Figure 2 plots the periodogram of V1449 Aql evaluated using lsp, over a frequency interval spanning the star's dominant 0.18 d pulsation mode (see Waelkens et al. 1998). Also shown in the figure is the absolute deviation |P_n^culsp − P_n^lsp| of the corresponding periodogram evaluated using culsp (running on either GPU — the results are identical). The figure confirms that, at least over this particular frequency interval, the two codes are in good agreement with one another; the relative error is on the order of 10^−6.

To explore accuracy over the full frequency range, Fig. 3 shows a scatter plot of P_n^lsp against P_n^culsp. Very few of the N_f = 1,528,064 points in this plot depart to any significant degree from the diagonal line P_n^lsp = P_n^culsp. Those that do are clustered in the P_n ≪ 1 corner of the plot, and are therefore associated with the noise in the light curve rather than any periodic signal. Moreover, the maximum absolute difference in the periodogram FAPs (equation 3) across all frequencies is 4.1 × 10^−5, which is negligible.

Fig. 2.— Part of the L-S periodogram for V1449 Aql, evaluated using the lsp code (thick curve). The thin curve shows the absolute deviation |P_n^culsp − P_n^lsp| of the corresponding periodogram evaluated using the culsp code. The strong peak corresponds to the star's dominant 0.18-d pulsation mode. [Figure: log P_n versus f (d^−1) over the interval 5.4-5.6 d^−1.]

Fig. 3.— A scatter plot of the L-S periodogram for V1449 Aql, evaluated using the lsp (abscissa) and culsp (ordinate) codes. [Figure: log P_n^culsp versus log P_n^lsp.]

Fig. 4.— Mean calculation times ⟨t_calc⟩ for the L-S periodogram, evaluated using the culsp (triangles) and lsp (circles) codes. The dashed line, with slope d log⟨t_calc⟩/d log N_t = 2, indicates the asymptotic scaling of the periodogram algorithm. [Figure: log⟨t_calc⟩ versus log N_t, for the GeForce 8400 GS, Tesla C1060, CPU (1 thread) and CPU (8 threads) configurations.]

5.3. Performance

Code performance is measured by averaging the V1449 Aql periodogram calculation time t_calc over five executions. These timings exclude the overheads incurred by disk input/output and from rectifying light curves to zero mean and unit variance. Table 2 lists the mean ⟨t_calc⟩ and associated standard deviation σ(t_calc) for culsp running on both GPUs, and for lsp running with a single OpenMP thread (equivalent to a purely serial CPU implementation), and with 8 OpenMP threads (one per workstation core).

TABLE 2
Periodogram calculation times.

Code    Platform           ⟨t_calc⟩ (s)   σ(t_calc) (s)
culsp   GeForce 8400 GS    570            0.0093
culsp   Tesla C1060        20.3           0.00024
lsp     CPU (1 thread)     4570           14
lsp     CPU (8 threads)    566            6.9

With just one thread, lsp is significantly outperformed by culsp on either GPU. Scaling up to 8 threads shortens the calculation time by a factor ∼8, indicating near-ideal parallelization; nevertheless, the two CPUs working together only just manage to beat the GeForce 8400 GS, and are still a factor ∼28 slower than the Tesla C1060. Perhaps surprisingly, the latter ratio is greater than suggested by comparing the theoretical peak floating-point performance of the two platforms — 74.6 GFLOPS (billions of floating-point operations per second) for all 8 CPU cores, versus 936 GFLOPS for the C1060. This clearly warrants further investigation.

Profiling with the GNU gprof tool indicates that the major bottleneck in lsp, accounting for 80% of ⟨t_calc⟩, is the __svml_sincosf4() function from Intel's Short Vector Math Library. This function evaluates four sine/cosine pairs at once by leveraging the SSE2 instructions of modern x86-architecture CPUs. Microbenchmarking reveals that a __svml_sincosf4() call costs ∼45.6 clock cycles, or ∼11.4 cycles per sine/cosine pair. In contrast, thanks to its two special function units, a GPU SM can evaluate a sine/cosine pair in a single cycle (see Appendix G.3.1 of NVIDIA 2010). Scaling these values by the appropriate clock frequencies and processor counts, the sine/cosine throughput on all 8 CPU cores is 1.6 billion operations per second (GOPS), whereas on the 30 SMs of the C1060 it is 39 GOPS, around 23 times faster. Of course, it should be recalled that the GPU __sincosf() function operates at a somewhat-reduced precision (see Section 4.4). In principle, the CPU throughput could be improved by developing a similar reduced-precision function to replace __svml_sincosf4(). However, it seems unlikely that a software routine could ever approach the throughput of the dedicated hardware in the SFUs.

The disparity between sine/cosine throughput accounts for most of the factor ∼28 performance difference between culsp and lsp, noted above. The remainder comes from the ability of an SM to execute instructions simultaneously on its SFUs and scalar processors. That is, the sine/cosine evaluations can be undertaken in parallel with the other arithmetic operations involved in the periodogram calculation.

Looking now at the memory performance of culsp, NVIDIA's cudaprof profiling tool indicates that almost all reads from global memory are coalesced, and that no bank conflicts arise when reading from shared memory. Thus, the GPU memory access patterns can be considered close to optimal. The combined time spent copying data from CPU to GPU and vice versa is ∼6 ms on the C1060, and ∼29 ms on the 8400 GS; while these values clearly reflect the bandwidth difference between the PCIe and PCI slots hosting the GPUs, neither makes any appreciable contribution to the execution times listed in Table 2.

To round off the present discussion, I explore how culsp and lsp perform with different-sized datasets. The analysis in Section 3 indicates a periodogram workload scaling as O(N_f N_t). With the number of frequencies following N_f ∝ N_t (equation 11), t_calc should therefore scale proportionally with N_t^2 — as in fact already claimed in the Introduction. To test this expectation, Fig. 4 shows a log-log plot of ⟨t_calc⟩ as a function of N_t, for the same configurations as in Table 2. The light curve for a given N_t is generated from the full V1449 Aql light curve by uniform down-sampling.

In the limit of large N_t, all curves asymptote toward a slope d log⟨t_calc⟩/d log N_t = 2, confirming the hypothesized N_t^2 scaling. At smaller N_t, departures from this scaling arise from computational overheads that are not directly associated with the actual periodogram calculation. These are most clearly seen in the lsp curve for 8 threads, which approaches a constant log⟨t_calc⟩ ≈ −1.5 independent of N_t — perhaps due to memory cache contention between the different threads.

6. DISCUSSION

6.1. Cost/Benefit Analysis

To establish a practical context for the results of the preceding sections, I briefly examine the price vs. performance of the CPU and GPU platforms. At the time of writing, the manufacturer's bulk (1,000-unit) pricing for a pair of Xeon E5345 CPUs is 2 × $455, while a Tesla C1060 retails for around $1,300 and a GeForce 8400 GS for around $50. First considering the C1060, the 50% greater cost of this device (relative to the CPUs) brings almost a factor thirty reduction in periodogram calculation time — an impressive degree of leveraging. However, its hefty price tag together with demanding infrastructure requirements (dedicated PCIe power connectors, supplying up to 200 W), means that it may not be the ideal GPU choice in all situations.

The 8400 GS offers a similar return-on-investment at a much-more affordable price: almost the same performance as the two CPUs at one-twentieth of the cost. This heralds the possibility of budget GPU computing, where a low-end desktop computer is paired with an entry-level GPU, to give performance exceeding high-end, multi-core workstations for a price tag of just a few hundred dollars. Indeed, many desktop computers today, or even laptops, are already well equipped to serve in this capacity.

6.2. Applications

An immediate application of culsp is analysis of the photometric time series obtained by ongoing satellite missions such as MOST (Walker et al. 2003), CoRoT (Auvergne et al. 2009), and Kepler (Koch et al. 2010). These datasets are typically very large (N_t ≳ 10^5), leading to a significant per-star cost for calculating a periodogram. When this cost is multiplied by the number of targets being monitored (in the case of Kepler, again ≳ 10^5), the overall computational burden becomes very steep. Looking into the near future, similar issues will be faced with ground-based time-domain facilities such as Pan-STARRS (Kaiser et al. 2002) and the Large Synoptic Survey Telescope (LSST Science Collaborations and LSST Project 2009). It is hoped that culsp, or an extension to other related periodograms (see below), will help resolve these challenges.

An additional application of culsp is in the interpretation of periodograms. Equation (3) presumes that the P_n at each frequency in the periodogram is independent of the others. This is not necessarily the case, and the exponent in the equation should formally be replaced by some number N_f,ind representing the number of independent frequencies. Horne & Baliunas (1986) pioneered the use of simulations to estimate N_f,ind empirically, and similar Monte-Carlo techniques have since been applied to explore the statistical properties of the L-S periodogram in detail (see Frescura et al. 2008, and references therein). The bottleneck in these simulations is the many periodogram evaluations, making them strong candidates for GPU acceleration.

6.3. Future Work

Recognizing the drawbacks of being wedded to one particular hardware/software vendor, a strategically important future project will be to port culsp to OpenCL (Open Computing Language) — a recently developed standard for programming devices such as multi-core CPUs and GPUs in a platform-neutral manner (see, e.g., Stone et al. 2010). There is also considerable scope for applying the lessons learned herein to other spectral analysis techniques. Shrager (2001) and Zechmeister & Kürster (2009) generalize the L-S periodogram to allow for the fact that the time-series mean is typically not known a priori, but instead estimated from the data themselves. The expressions derived by these authors involve sums having very similar forms to equation (8); thus, it should prove trivial to develop GPU implementations of the generalized periodograms. The multi-harmonic periodogram of Schwarzenberg-Czerny (1996) and the SigSpec method of Reegen (2007) also appear promising candidates for implementation on GPUs, although algorithmically they are rather more complex.

Looking at the bigger picture, while the astronomical theory and modeling communities have been quick to recognize the usefulness of GPUs (see Section 1), progress has been more gradual in the observational community; radio correlation is the only significant application to date (Wayth et al. 2009). It is my hope that the present paper will help illustrate the powerful data-analysis capabilities of GPUs, and demonstrate strategies for using these devices effectively.

I thank Dr. Gordon Freeman for the initial inspiration to explore this line of research, and the anonymous referee for many helpful suggestions that improved the manuscript. I moreover acknowledge support from NSF Advanced Technology and Instrumentation grant AST-0904607. The Tesla C1060 GPU used in this study was donated by NVIDIA through their Professor Partnership Program, and I have made extensive use of NASA's Astrophysics Data System bibliographic services.

REFERENCES

Aubert D., Teyssier R., 2010, ApJ
Auvergne M., et al., 2009, A&A, 506, 411
Barsdell B. R., Barnes D. G., Fluke C. J., 2010, MNRAS, p. 1200
Belkacem K., Samadi R., Goupil M., Lefèvre L., Baudin F., Deheuvels S., Dupret M., Appourchaux T., Scuflaire R., Auvergne M., Catala C., Michel E., Miglio A., Montalban J., Thoul A., Talon S., Baglin A., Noels A., 2009, Science, 324, 1540
Belleman R. G., Bédorf J., Portegies Zwart S. F., 2008, New Astronomy, 13, 103
Buck I., Foley T., Horn D., Sugerman J., Fatahalian K., Houston M., Hanrahan P., 2004, ACM Transactions on Graphics, 23, 777
Cooley J. W., Tukey J. W., 1965, Mathematics of Computation, 19, 297
Frescura F. A. M., Engelbrecht C. A., Frank B. S., 2008, MNRAS, 388, 1693
Horne J. H., Baliunas S. L., 1986, ApJ, 302, 757
Kaiser N., et al., 2002, in Tyson J. A., Wolff S., eds, SPIE Conf. Ser. 4836, p. 154
Koch D. G., et al., 2010, ApJ, 713, L79
Lomb N. R., 1976, Ap&SS, 39, 447
LSST Science Collaborations and LSST Project, 2009, The LSST Science Book, Version 2.0
Lyne A., Hobbs G., Kramer M., Stairs I., Stappers B., 2010, Science, 329, 408
Nguyen H., 2007, GPU Gems 3. Addison-Wesley Professional
NVIDIA, 2010, NVIDIA CUDA Programming Guide 3.1
Owens J. D., Luebke D., Govindaraju N., Harris M., Krüger J., Lefohn A. E., Purcell T. J., 2005, in Eurographics 2005: State of the Art Reports, p. 21
Pharr M., 2005, GPU Gems 2. Addison-Wesley Professional
Press W. H., Rybicki G. B., 1989, ApJ, 338, 277
Press W. H., Teukolsky S. A., Vetterling W. T., Flannery B. P., 1992, Numerical Recipes in Fortran, 2nd edn. Cambridge University Press, Cambridge
Rani B., Gupta A. C., Joshi U. C., Ganesh S., Wiita P. J., 2010, ApJ, 719, L153
Reegen P., 2007, A&A, 467, 1353
Rost R. J., 2006, OpenGL Shading Language, 2nd edn. Addison-Wesley Professional
Scargle J. D., 1982, ApJ, 263, 835
Schive H., Tsai Y., Chiueh T., 2010, ApJS, 186, 457
Schwarzenberg-Czerny A., 1996, ApJ, 460, 107
Schwarzenberg-Czerny A., 1998, MNRAS, 301, 831
Shrager R. I., 2001, Ap&SS, 277, 519
Simpson E. K., Baliunas S. L., Henry G. W., Watson C. A., 2010, MNRAS, p. 1209
Stone J. E., Gohara D., Shi G., 2010, Computing in Science and Engineering, 12, 66
Sturrock P. A., Buncher J. B., Fischbach E., Gruenwald J. T., Javorsek II D., Jenkins J. H., Lee R. H., Mattes J. J., Newport J. R., 2010
Thompson A. C., Fluke C. J., Barnes D. G., Barsdell B. R., 2010, New Astronomy, 15, 16
Waelkens C., Aerts C., Kestens E., Grenon M., Eyer L., 1998, A&A, 330, 215
Walker G., Matthews J., Kuschnig R., Johnson R., Rucinski S., Pazder J., Burley G., Walker A., Skaret K., Zee R., Grocott S., Carroll K., Sinclair P., Sturgeon D., Harron J., 2003, PASP, 115, 1023
Wayth R. B., Greenhill L. J., Briggs F. H., 2009, PASP, 121, 857
Zechmeister M., Kürster M., 2009, A&A, 496, 577