A PARALLEL MODEL FOR THE

HETEROGENEOUS COMPUTATION OF

RADIO ASTRONOMY SIGNAL CORRELATION

by

Christopher John Harris

B.Sc.(Hons)

This thesis is presented for the degree of

Doctor of Philosophy

at the School of Physics

July 2009

© 2009 Christopher J. Harris

Abstract

The computational requirements of scientific research are constantly growing. In the field of radio astronomy, observations have evolved from using single telescopes to interferometer arrays of many telescopes, and arrays of massive scale are currently under development. These interferometers use signal and image processing to produce data that is useful to radio astronomy, and the amount of processing required scales quadratically with the size of the array. Traditional computational approaches will be unable to meet this demand in the near future.

This thesis explores the use of heterogeneous parallel processing to meet the computational demands of radio astronomy. In heterogeneous computing, multiple hardware architectures are used for processing. In this work, the Graphics Processing Unit (GPU) is used as a co-processor along with the Central Processing Unit (CPU) for the computation of signal processing algorithms. Specifically, the suitability of the GPU to accelerate the correlator algorithms used in radio astronomy is investigated.

This work first implemented an FX correlator on the GPU, achieving a performance increase of one to two orders of magnitude over a serial CPU approach. The FX correlator algorithm combines pairs of telescope signals in the Fourier domain. Given N telescope signals from the interferometer array, N² conjugate multiplications must be calculated in the algorithm. For extremely large arrays (N >> 30), this is a huge computational requirement. Testing shows that the GPU correlator produces results equivalent to those of a software correlator implemented on the CPU; however, the algorithm itself is adapted in order to take advantage of the processing power of the GPU. This research also examined how correlator parameters, in particular the number of telescope signals and the Fast Fourier Transform (FFT) length, affected the results.

The conjugate multiply and accumulate (CMAC) stage of the correlator requires computation that increases quadratically with the number of telescope signals in the interferometer array. Because the other stages of the correlator scale linearly, this stage becomes the bottleneck for large radio telescope arrays. This work investigates a number of potential parallel approaches in order to determine which performs best, and shows that two of those approaches are superior.

An important consideration in the design of radio telescope infrastructure is the ongoing power usage of compute systems. The increasing processing requirements are making the cost of electricity an important budgeting concern, and power-efficient compute architectures are now desired for scientific research. This work investigated the power usage of both the GPU and the CPU. The power consumed per unit of processing by the parallel correlator implementation is shown to be lower than that of the serial implementation by up to a factor of 30.

Finally, the addition of a parallel polyphase filter front end demonstrates the adaptability of the GPU correlator implementation. This filter is commonly used in signal processing to efficiently reduce the effect of spectral leakage in the Fourier transform. However, it comes at the cost of additional processing and memory access within the algorithm. The GPU implementation supports 1, 2, 4 and 8 filter taps, and a filter length of 128. A tap is the number of signal phases combined in the decimation part of the polyphase filter. This research shows that the increase in processing time for the filter stage is a quarter of a direct scaling with the additional computation as the number of taps increases.

Acknowledgements

I acknowledge the efforts of my supervisors Karen Haines, Lister Staveley-Smith and David Blair. I have appreciated Karen's friendship, motivation, and excellent advice since I first walked into her office at the beginning of my honours year. I thank her for consistently getting me to a plethora of conferences directly related to my work around the world, introducing me to the leading people in my field, and always providing me with the most cutting-edge hardware. Lister's knowledge and experience of the radio astronomy field has been of great benefit to my research. I have valued his insight into the possible directions for my work. I am grateful to David for offering his supervision when I began in the Physics department. David's drive and passion for research is an inspiration to his students.

In the field of radio astronomy, I thank all who have assisted my learning over the past four years. This includes Chris Phillips for his time spent testing live streaming of VLBI data onto the GPU; Steve Tingay and Adam Deller for introducing me to software correlation; Katherine Blundell and Ben Mort for hosting me at Oxford; Peter Quinn for his advice; Frank Briggs for his Fortran correlator code, which served as a standard for my own implementations, both serial and parallel; and Randall Wayth for his insightful discussions.


In the field of graphics hardware, I thank Mark Harris of NVIDIA and Justin Hensley of AMD for sharing their knowledge of the graphics products of their respective companies. I am grateful to Ed Buckingham for hosting me on a day trip to visit the GPU compute group at AMD.

I thank all the staff at the Western Australian Supercomputer Program. Jason, Akos, and Khanh have been stalwart in maintaining a high standard of computing support. Paul's insight in visualisation has been of great help. I appreciate the work Rosie and Anna-Lee have put in organising conference visits.

I appreciate the support and patience of my family and friends. In particular, I thank my parents for encouraging my interests in science and technology. Finally, I thank my teachers and mentors across all fields of endeavour.

Contents

Abstract
Acknowledgements
List of Figures
List of Variables
1 Introduction
2 Background
2.1 Radio Astronomy
2.1.1 Radio Spectrometry
2.1.2 Radio Interferometry
2.1.3 Aperture Synthesis
2.2 Signal Correlation
2.2.1 Digital FX Correlator
2.3 Polyphase Filter Techniques
2.4 Parallel Processing
2.5 Programmable Graphics
2.5.1 Development of the Programmable GPU
2.5.2 Compute Unified Device Architecture (CUDA)
3 Literature Review
3.1 GPU Programming Languages
3.2 Fast Fourier Transform
3.3 GPUs in Astronomy and Astrophysics
4 Model
4.1 CMAC Stage Optimisation
4.2 Memory Management
4.3 Polyphase Filter
5 Testing
5.1 Preliminary Testing
5.2 CMAC Stage Testing
5.3 GPU Correlator Results
5.4 Polyphase Filter Testing
6 Discussion
6.1 Preliminary Analysis
6.2 Optimisation Analysis
6.3 GPU FX Correlator Analysis
6.4 Power and Cost Analysis
6.5 Adaptability Analysis
7 Conclusion
7.1 Thesis Summary
7.2 Future Research
References
A Code
A.1 Unpack Stage Kernel
A.2 CMAC Stage Kernels
A.3 Polyphase Filter Kernel

List of Figures

2.1 Jansky aerial system
2.2 Parkes radio telescope
2.3 Angular resolution of a telescope
2.4 Resolution of point sources
2.5 A two-element interferometer
2.6 Correlator architectures
2.7 Serial FX correlator algorithm
2.8 Fourier transform spectral leakage
(a) Aligned frequency signal
(b) Non-aligned frequency signal
2.9 Fourier transform spectral response
2.10 Polyphase Fourier transform response
2.11 Rectangular polyphase filter
2.12 Polyphase sinc filter
(a) Aligned frequency signal
(b) Non-aligned frequency signal
2.13 Algorithmic structure tree
2.14 Flynn's taxonomy
2.15 Parallel performance
(a) Amdahl's law
(b) Gustafson's law
2.16 Programmable rendering pipeline
2.17 Development of the GPU pipeline
2.18 A model of the graphics processing unit
2.19 Evolution of the GPU
(a) Processor performance
(b) Memory bandwidth
2.20 Thread topology
2.21 Memory locations available to CUDA threads
2.22 GPU-enabled system architecture
4.1 GPU FX correlator pipeline
4.2 Parallelism of the approaches
(a) Frequency parallel approach
(b) Stream parallel approach
(c) Group parallel approach
(d) Pair parallel approach
4.3 GPU FX correlator data flow
5.1 Test data
5.2 Bandwidth testing
5.3 PCI-express data transfer rates
5.4 Fast Fourier transform testing
5.5 GPU fast Fourier transform
5.6 CMAC stage testing
5.7 CMAC stage results for a varying number of signals
(a) High L = 1024, varying N
(b) Low L = 128, varying N
5.8 CMAC stage results for different transform lengths
(a) High N = 64, varying L
(b) Low N = 4, varying L
5.9 GPU correlator testing
5.10 Test output
5.11 Overview of stream bandwidth
5.12 The variation of stream bandwidth with N
5.13 The variation of stream bandwidth with L
5.14 Total data throughput
5.15 Correlator FLOPS
5.16 Performance per watt
(a) L = 128 Fourier transforms
(b) L = 1024 Fourier transforms
5.17 Polyphase filter testing
5.18 Polyphase filter performance
(a) Performance by stream
(b) Performance by tap

List of Variables

Units are given in square brackets. Variables with no units are dimensionless.

a        accumulation index within A
A        size of accumulation
b        longest array baseline [m]
b[k]     polyphase filter output [V]
c        speed of light [m/s]
C[ν]     complex visibility [W s]
d        difference in path length [m]
D        telescope dish diameter [m]
δ(ω)     Dirac delta function acting on ω
E(t)     induced electromagnetic potential [V]
ε        computational complexity per stream element
f        frequency index within F
f(t)     a time-varying function
G        size of stream groups
j        iterator index
k        discrete time [s]
K        number of stream groups
l        length index within L
L        length of fast Fourier transform (FFT)
L_C      length of complex-to-complex FFT
L_R      length of real-to-complex FFT
λ        wavelength [m]
n        index of telescope signal
m        index of second telescope signal
µ        substitution for 2π(ν − ν′)/L
N        number of signals from the telescope array
N        number of units of execution in a parallel application
ν        discrete frequency [Hz]
ν′       a particular discrete frequency [Hz]
ω        continuous frequency [Hz]
p        tap index within P
p        parallel instruction proportion by Amdahl's law
p′       parallel instruction proportion by Gustafson's law
p_auto   number of autocorrelations
p_total  total number of correlations
p_cross  number of crosscorrelations
φ        angle of signal [rad]
Φ(ν)     power spectral density [W s]
r        performance factor in Amdahl's law
r′       performance factor in Gustafson's law
r[k]     quantised, discrete-time sampling of E(t) [V]
ρ[l]     polyphase filter weight
s        serial instruction proportion by Amdahl's law
s′       serial instruction proportion by Gustafson's law
S(ω)     continuous signal spectra [V s]
S(ν)     discrete signal spectra [V s]
t        continuous time [s]
T        number of taps in the polyphase filter
τ        sampling period [s]
θ        angular resolution [rad]
u_S      unpack scaling factor
u_B      unpack bias compensation
U        unpack operator
x(t)     time-varying telescope signal [V]
x[k]     unpacked signal stream [V]
y(τ)     spatially correlated signals [V²]

Chapter 1

Introduction

Computation is now essential to scientific endeavours. Where previously science analysed models, or simple systems of models, now large complex systems are of interest. This is true across a broad variety of fields. In the field of climate modelling, simulations once dealt with a localised area and single aspects such as air pressure or ocean currents. Now, they strive for a complex system accounting for all such aspects on a global scale [11]. In the field of neural networking, computation has progressed from simulations of single neurons, to simple systems of neurons [37], and is now progressing to complex systems on the scale of mammalian brains [54]. In the field of radio astronomy, observations have evolved from using single telescopes, to interferometer arrays consisting of many telescopes, and are now looking to develop an array of massive scale [38]. These emerging scientific endeavours exceed the capabilities of traditional compute methods.

The growth of single-core central processing unit (CPU) technology has become limited by a power wall [53]. The power wall limits serial CPU processing because higher processing speeds would require additional power consumption, which increases the heat generation beyond what can physically be dissipated. As a result, software can no longer rely on improving processor speeds to improve performance. To overcome these problems, parallel architectures have been developed. These architectures use many processing cores to obtain more processing performance than a single core. However, serial code must be rewritten in parallel to take advantage of these performance gains.

There is a wide range of parallel architectures currently available, and the number of processing cores varies. Serial single-core CPUs have been replaced by multicore CPUs, which contain several processing cores. Aside from the multicore CPU, other parallel architectures exist in the form of a coprocessor. A coprocessor is a second processor that assists the primary processor in order to improve computational performance. The Cell is one such architecture, containing one main core called a Power Processing Element (PPE) and eight assisting cores called Synergistic Processing Elements (SPE) [81]. The GPU is another example, located at the other end of the parallel architecture spectrum with hundreds of processing cores.

There are two significant advantages to hardware-accelerated parallel approaches. Firstly, the additional processing performance will facilitate the growth to petascale compute systems that will be required for future science. The improvements in computational performance span orders of magnitude over legacy serial architectures, effectively enabling the real-time implementation of algorithms that would otherwise be unattainable due to their computational cost. Secondly, the parallel approaches are more cost effective because they use processor resources more efficiently, resulting in both lower initial purchasing costs and lower ongoing power requirements for a desired level of performance. Low power consumption is also relevant in minimising the environmental impact of research. Both of these advantages are essential to the development of affordable large-scale compute systems that will ultimately be required for future science.

However, scientific algorithms must be parallelised in order to realise the performance of these new architectures. This research explores how heterogeneous computing can be applied to radio astronomy signal correlators. Heterogeneous computing is the strategy of deploying multiple types of processing elements within a single workflow, allowing each to perform the tasks to which it is best suited [94]. This work utilises GPU computing techniques, in which the CPU and GPU are used together to perform real-time signal correlation on a significantly larger scale than that achievable by the CPU alone.

To observe the radio universe, signals from multiple telescopes can be correlated to form an interferometer array. Data collected from the telescopes is used to obtain images with an angular resolution, the ability to resolve distant objects, greater than would be achievable with a single dish. In order to achieve superior images, additional array elements are required to increase the collecting area and to provide more unique viewpoints on the sky. However, increasing the size of the array also increases the amount of computation necessary to correlate the signals. Given the size of next generation telescope interferometers such as the Square Kilometre Array, which will consist of hundreds of telescopes, this computation is on a massive exaflop scale [18].

Of the correlator stages, the conjugate multiply and accumulate (CMAC) stage is a computational bottleneck. It scales quadratically with the number of elements in a radio telescope array. Thus, it is this part of the algorithm that is immediately of interest for optimisation. This work demonstrates that of several potential parallel approaches, two are superior depending on the size of the telescope array: one is suited to small arrays, and a second to large arrays. For further performance improvement, the acceleration of the other stages of the algorithm is also addressed.

Using the GPU as a co-processor, a digital radio astronomy signal correlator is developed and tested. The model defines both the transfer of data between the CPU host and the GPU device, as well as the parallelisation of the correlator stages on the GPU. This research demonstrates that this approach increases computational performance by one to two orders of magnitude compared to a serial CPU approach.

Additional features are required to allow the GPU correlator to be used in a wider range of radio astronomy applications. In order to investigate the reduction of systematic noise known as spectral leakage or ringing [10], a polyphase filter is implemented on the GPU. Testing examined the filter performance on the GPU for a variety of filter lengths. It shall be shown that the additional operations for the filter implementation are partially hidden by existing memory latency in the GPU correlator. The addition of this feature to the parallel correlator demonstrates the adaptability of the model.

Through the implementation and analysis of the GPU digital radio astronomy signal correlator, this work will show that GPU computing is well suited to accelerating digital radio astronomy correlation and related polyphase filtering techniques. This is significant because the next generation of radio telescopes, such as the Square Kilometre Array, will require correlator performance on a massive scale. The factor of one hundred performance improvement shown in this work makes significant progress toward meeting this scale.

Following is Chapter 2, which covers a background to the field of radio astronomy, with an emphasis on the digital correlation algorithm. It also provides a background to the use of the GPU for scientific computation. This is followed by a literature review in Chapter 3, which discusses research relevant to that presented in this work. In Chapter 4, my heterogeneous computing model for the implementation of correlation algorithms on the GPU is presented. This model was implemented and tested, and the results are presented in Chapter 5. An analysis of these results follows in Chapter 6. Chapter 7 concludes this work, and presents potential areas for further research. The accompanying appendices contain code samples from my implementation.

Chapter 2

Background

The work presented in this thesis is multidisciplinary. It involves concepts from several different fields of research, including radio astronomy, signal processing, parallel computing, and computer graphics. To make this research accessible to a broad audience, this chapter introduces the key concepts of these fields. This begins with an introduction to radio astronomy.

2.1 Radio Astronomy

In 1931, while working to reduce radio frequency interference in telecommunications, Karl Jansky discovered radio signals emanating from beyond the Earth [43]. These signals originated from our galaxy, the Milky Way, and this observation by Jansky heralded a new field of astronomical research called radio astronomy. Up until Jansky's discovery, astronomy had been limited for the most part to observation through the optical window, a range of frequencies in the visible electromagnetic spectrum that can penetrate the Earth's atmosphere from space. A second window through the atmosphere exists in the radio frequencies of the electromagnetic spectrum. This radio window is bounded by absorption due to water vapour and oxygen molecules at higher frequencies around 1.5 THz, and by absorption and reflection due to the ionosphere at lower frequencies near 15 MHz [86]. The boundaries of the radio window vary with time, geographical location, and the sensitivity of the observing radio telescope.

The observation of radio waves from the universe beyond the Earth is of great benefit to astronomy. This is because the radio window offers a different view of the universe. Electromagnetic radiation in the radio frequencies is generated by different physical processes, and interacts with matter differently to that of the optical frequencies [10].

This view of the radio universe has led to the discovery of a number of new physical phenomena. The cosmic microwave background radiation was discovered in 1964 by Penzias and Wilson. This radiation is almost uniform in every direction, and is not associated with any particular celestial body [79]. The radiation, and the pattern of variations it contains, supports the big bang model for the creation of the universe. Another radio phenomenon, discovered by Bell Burnell and Hewish in 1967, is the pulsar. These rotating neutron stars emit beams of radiation in the radio spectrum, in a manner similar to that of a lighthouse [41]. Only radiation corresponding to when the pulsar beam is pointed toward the observer is received, resulting in a regular pattern of pulses. Radio galaxies are another phenomenon revealed by observation in the radio spectrum [50]. These observations have not only revealed additional structural information regarding known optical galaxies, but have also enabled the detection of galaxies that have no optical counterpart.

These discoveries have shown that radio astronomy is a significant branch of astronomy, which studies the universe using radio techniques [97]. This is typically achieved through the use of either a radio aerial or a radio telescope. The Jansky aerial system, shown in Figure 2.1, is an example of a radio aerial. Electromagnetic radio waves passing through the aerial induce an electric potential that is measured by receiver equipment. The Parkes telescope, shown in Figure 2.2, is an example of a single dish radio telescope. The dish reflects incoming electromagnetic radio waves to a receiving aerial located at the focus. This increases the amount of electromagnetic flux passing through the aerial, and thus amplifies the observed signal. As a result, the telescope can detect weaker signals than that of the aerial alone. The science of radio spectrometry measures and quantifies these radio signals, as discussed in the following section.

Figure 2.1: Jansky aerial system. Shown is an example of a radio aerial. Observed electromagnetic radio waves induce an electric potential in the aerial that is measured by receiver equipment. This particular aerial was used by Jansky in his discovery of the first radio signals emanating from beyond the Earth. Image courtesy of NRAO/AUI [73, 74].

Figure 2.2: Parkes radio telescope. Shown is an example of a radio telescope dish. The dish reflects incoming electromagnetic radio waves to a receiving element located at the focus. This 64 m telescope is located in New South Wales, Australia. Image courtesy of CSIRO [19, 20].

2.1.1 Radio Spectrometry

A radio spectrometer is designed to measure the power spectral density of a radio signal [86]. Power spectral density, denoted Φ(ω), is the distribution of power over the frequency, ω, of a signal detected by the antenna and receiver. The spectral energy distribution of a radio source is of interest in radio astronomy because it provides insights into the physical processes that produce the radio emissions in the source. It also allows more sophisticated imaging techniques.

The initial approach to radio spectrometry was the swept frequency receiver, also called a radio spectrograph [108]. In operation, this analog receiver is repeatedly tuned across the same wide band of frequencies by a mechanical or electronic device that generally produces several frequency sweeps per second. The speed of these sweeps must be fast enough to reveal variation in the intensity of the signal between sweeps [49]. Recording across an entire band of frequencies produces the power spectral density [7]. The advantage of the swept frequency receiver is that it provides a very high frequency resolution due to the sweeping mechanism producing output for a continuum of frequencies. However, in order to achieve this, the antenna must have a response that is nearly constant over this frequency range [96]. The main drawback to this approach is that such scanning prevents significant integration that would increase the signal to noise ratio. Additionally, only a single frequency is observed at any given time, and thus signal features in the unobserved frequencies are lost.

The solution to these drawbacks is the multichannel receiver. In this approach the entire band is analysed at the same time [10]. This is achieved by having multiple instances of the previous approach, each of which monitors a set frequency channel in the observation bandwidth [65]. Thus the multichannel receiver has a lower frequency resolution than the swept frequency receiver. However, it has a much higher sensitivity, as it does not require wide band aerials with even amplification over the entire frequency range [108].

The next evolution of radio spectrometry was the autocorrelation spectrometer. Consider the incoming time-varying electromagnetic signal E(t). The spectra of this signal, S(ω), can be obtained via the Fourier transform:

S(\omega) = \int_{-\infty}^{\infty} E(t) \, e^{-i 2\pi \omega t} \, dt \qquad (2.1)

This spectrum is then squared to obtain the power spectral density of the signal, Φ(ω):

\Phi(\omega) = S(\omega) \, S^{*}(\omega) \qquad (2.2)

This approach surpasses the limitations of the previous approaches. In theory, it obtains the frequency continuum results of the swept frequency receiver, except the entire spectrum is obtained simultaneously by the autocorrelation spectrometer.

However, in practice this approach can only be implemented by applying a discrete Fourier transform of length L to a digital sampling of the signal, E[k]:

S[\nu] = \sum_{k=0}^{L-1} E[k] \, e^{-i 2\pi \nu k / L} \qquad (2.3)

The power spectral density, Φ[ν], is obtained for the discrete range of frequencies, ν, using the equation:

\Phi[\nu] = S[\nu] \, S^{*}[\nu] \qquad (2.4)

Like all digital signal processing techniques, this approach is limited by the rate and precision of the digital sampling of the signals. Assuming the required sampling rates and precision can be physically achieved, the limitation for real-time observations becomes the rate at which the digital hardware can process the signals. All of these methods are also limited by the specification of the telescope used for the observation. This can be improved by the use of multiple telescopes, which is next discussed.

2.1.2 Radio Interferometry

Radio interferometry is the use of multiple telescopes to make observations with a superior angular resolution compared to those made by a single telescope. The angular resolution of a receiving instrument in radio astronomy is a measure of its ability to separate two neighbouring sources, such as those shown in Figure 2.3. This is desirable because superior angular resolutions reveal more detail of the observed source. When observing radio waves of wavelength λ with a single dish of diameter D, the angular resolution, θ, of the telescope is given by the Rayleigh Criterion [86]:

\theta = \frac{\lambda}{D} \qquad (2.5)

As seen in Figure 2.4, two sources significantly closer than this distance are indistinguishable from a single source. Given the longer wavelengths of radio waves, the size of a single dish required to resolve the detail can surpass what can realistically be constructed. The current largest single-aperture radio telescope is Arecibo, located on the island of Puerto Rico. It has a diameter of 305 metres [48], forming a limit on the obtainable angular resolution with a single telescope.

Radio interferometry can be applied in order to address this restriction [60]. Specifically, using two or more telescopes in an array with a maximum separation b, referred to as a baseline, results in the resolving power

\theta = \frac{\lambda}{b} \qquad (2.6)

in the direction of the baseline projected onto the source. This resolving power can be extended by rotating the baseline, or by acquiring baselines of a different direction via additional telescopes. Thus two or more smaller telescopes with a baseline, b, can be equivalent in resolving power to a large single dish with a diameter D equal to that baseline distance. The collecting area of the array is equal to the sum of the collecting areas of the component telescopes.

Shown in Figure 2.5 is a simple single-baseline interferometer. The incoming signal travelling at the speed of light, c, is measured by both receivers. However, due to the angle of the signal, φ, there is a difference in path length, d, between the two receivers given by d = b sin(φ). This path difference results in a time lag, t_lag, defined by t_lag = d/c. This, along with any other lag in the signal processing system caused by cabling distances, is removed by the delay compensation shown in Figure 2.5.

A local oscillator signal is mixed with the signal to perform a frequency conversion of the observed bandwidth. In radio astronomy this frequency conversion is used to lower the frequency at which further processing occurs. This is because it is technically easier to process signals at a set lower frequency than to design specialised equipment for any given observation frequency range [87]. The phase shifter adjusts the phase of the local oscillator to account for any delay between the two incoming signals. The resulting stream is then fed into a correlator for further processing.

These principles were employed in the first radio interferometer, constructed by Ryle and Vonberg with results published in 1946 [90]. The interferometer was designed to observe the sun at 175 MHz (1.7 m wavelength) and consisted of two aerial systems with a horizontal separation of several wavelengths. This allowed discrimination between the galactic background radiation and the signal from the sun. The signals from the two aerials were combined, and the sun produced an oscillating signal as the minima and maxima of the interferometer moved across the sun during the day.

To reduce the amount of noise, an early improvement to the radio interferometer was the addition of phase switching by Ryle in 1952 [89]. This allowed a weak point source to be recorded independently from an extended source of greater intensity. It was also used to improve the accuracy in the determination of the source position, as well as the measurement of angular diameter and polarisation of weak sources. The system added a phase switcher to one of the antenna cables. When activated, the phase switcher created an additional delay in the signal of that antenna, corresponding to half a wavelength in the observed signal. Rapidly alternating the phase switcher on and off caused a square wave component to be created in the signal. An amplifier that responds to this square wave signal, as well as a phase sensitive rectifier, was used to convert the square wave to a direct current [97]. In this way, the noise generated in the receiver and interference from background radiation and extended sources near the observed point source was greatly reduced [96].

In order to achieve finer angular resolution, radio telescopes situated thousands of kilometres from each other were used for interferometry. This was referred to as Very Long Baseline Interferometry (VLBI) [63]. The distance between the telescopes posed additional challenges over smaller arrays, as the exact location and timing of each of the observations at the telescopes is needed to combine the signals and form an image. This was resolved using accurate timing instruments and reference sources in the sky.

While these interferometry techniques provide superior angular resolution, they also introduce the complication of having multiple telescope signals. To produce two-dimensional images of the observed radio source, the one-dimensional signals from the telescopes of an interferometer must be combined. This process is called aperture synthesis, and is next discussed.


Figure 2.3: Angular resolution of a telescope. Shown are two sources with an angular separation of ψ. When observed at a wavelength of λ, the angular resolution of a telescope with dish diameter D is θ = λ/D. When the angular resolution is finer than the angular separation of the sources, θ < ψ, the telescope can resolve the two objects as separate sources.

[Figure 2.4 plot: relative amplitude against angular separation for resolved and unresolved point sources.]

Figure 2.4: Resolution of point sources. Shown is a series of two point sources, sequentially moved closer together. As the angular distance, ψ, between any two sources decreases, they appear to overlap. Should two sources be sufficiently close such that ψ < θ, they are indistinguishable from a single source.


Figure 2.5: A two-element interferometer. Shown is a simple interferometry system [87]. Two receivers are used in order to increase the angular resolution of the observation. The signal frequency is lowered via mixing with a local oscillator signal of the correct phase. The signals are offset by a lag created by the physical path difference, d, between the two signals. There could potentially exist additional electronic delays, such as those arising from differing cable lengths. These delays are removed via the delay compensation shown. A correlator is then used to combine the signals.

2.1.3 Aperture Synthesis

Aperture synthesis is the process used by radio interferometer arrays to obtain two-dimensional images. The correlator forms the first stage of processing. In this stage, signals from the individual telescopes are unpacked, transformed to the frequency domain, and then conjugate multiplied to form complex visibilities for each non-redundant pairing of signals [87]. The complex visibilities are a correlation of the two signals in each pair, expressed in the frequency domain. The resulting complex visibilities are next calibrated.

Calibration removes the effects of instrumental and atmospheric factors in the measurements [99]. Correct calibration will result in an isolated point source producing an ideal point spread function in the image space. This is the radio equivalent of focusing an optical telescope [10]. Calibration values can be precalculated and then repeatedly applied to a series of complex visibilities. The resulting output is called a calibrated visibility.

The next stage of the image synthesis pipeline is to convert the calibrated visibilities to the spatial domain. The calibrated visibilities are first converted into a two-dimensional grid. This process is referred to as gridding, and typically involves the use of interpolation functions. Gridding is necessary because a two-dimensional grid is required for the inverse Fast Fourier Transform (FFT) that is subsequently used to convert to the spatial domain. The resulting transformed image still contains artifacts produced by the synthesis algorithm, and is thus referred to as a dirty image.

Following aperture synthesis, the artifacts in the dirty image are reduced via deconvolution techniques. The two most common of these are the CLEAN algorithm [42] and the maximum entropy method (MEM) [2]. The CLEAN method iteratively removes point sources and their associated side lobe noise, and then returns the sources to the image without the subtracted noise. The maximum entropy method constructs a function to define the lack of information in an observation, and then selects the outcome that corresponds to the maximum of this function. The output of these noise reduction techniques is referred to as a clean image. Of the stages used in aperture synthesis, the focus of this research is digital signal correlation, which is next discussed.

2.2 Signal Correlation

A correlator combines the N signals of a radio telescope array to produce complex visibility spectra. These spectra are produced for each of the p_cross baseline pairs, given by

p_{cross} = \frac{N(N-1)}{2} \qquad (2.7)

as well as each of the p_auto = N autocorrelation pairs, in which signals are correlated with themselves. Thus the total number of output spectra, p_total, is given by

p_{total} = p_{cross} + p_{auto} \qquad (2.8)
          = \frac{N(N-1)}{2} + N \qquad (2.9)
          = \frac{N(N+1)}{2} \qquad (2.10)
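To make the scaling concrete, the short host-side sketch below evaluates Equations 2.7 to 2.10 for a few array sizes. It is written in C++ (compatible with the CUDA host code used later in this work), and the helper names are illustrative rather than taken from the thesis implementation.

#include <cstdio>

// Number of correlation products for an N-element array (Equations 2.7-2.10).
// Helper names are illustrative only; they do not appear in the thesis code.
static unsigned long crossPairs(unsigned long N) { return N * (N - 1) / 2; }
static unsigned long autoPairs(unsigned long N)  { return N; }
static unsigned long totalPairs(unsigned long N) { return N * (N + 1) / 2; }

int main()
{
    for (unsigned long N : {4UL, 30UL, 512UL}) {
        std::printf("N = %4lu: cross = %8lu, auto = %4lu, total = %8lu\n",
                    N, crossPairs(N), autoPairs(N), totalPairs(N));
    }
    return 0;
}

For N = 512 this gives 131,328 output spectra per accumulation, illustrating the quadratic growth that motivates the parallel CMAC approaches examined later.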

There are two processes required to obtain the p-th complex visibility spectrum, C_p(ω), from the N input signals, x(t): the correlation of each pair of signals, denoted X, and the conversion to the frequency domain, denoted F. As shown in Figure 2.6, there are two types of correlator: the XF correlator and the FX correlator. The letters indicate the order of the correlation and the frequency transform.

The XF correlator first calculates the temporal correlation function, y_p(τ), using

y_p(\tau) = \int_{-\infty}^{\infty} x_m(t) \, x_n^{*}(t - \tau) \, dt \qquad (2.11)

for the p ∈ [0, p_total − 1] corresponding to the pairing of signals m, n ∈ [0, N − 1]. The result is then transformed to the frequency domain to obtain the complex visibilities with

C_p(\omega) = \int_{-\infty}^{\infty} y_p(\tau) \, e^{-i 2\pi \omega \tau} \, d\tau \qquad (2.12)

This order of processing is seen in the lower left path in Figure 2.6. This type of correlator is also called a lag correlator, which involves multiplication of the two signals at a series of incremental offsets referred to as lags.

In contrast, the FX correlator first transforms the signals into the frequency domain to obtain the signal spectra using

S_n(\omega) = \int_{-\infty}^{\infty} x_n(t) \, e^{-i 2\pi \omega t} \, dt \qquad (2.13)

then performs the correlation via conjugate multiplication in the frequency domain [13, 104, 107, 12] using

C_p(\omega) = S_m(\omega) \, S_n^{*}(\omega) \qquad (2.14)

This order of processing is seen in the upper right path in Figure 2.6.

Note that because a multiplication in the frequency domain is equivalent to a convolution in the spatial domain, these two methods produce equivalent results. However, the FX correlator requires fewer operations than the XF correlator when operating on more than a couple of streams [87]. This is because the XF correlator requires N² FFTs compared to the N needed by the FX correlator. Due to the increasing trend in radio telescope array size, and the challenging scale of computation required for larger arrays, the FX correlator is the focus of this research and is next detailed.


Figure 2.6: Correlator architectures. The role of the correlator is to convert the input time series (upper left) into the output visibility data (lower right). Shown are two approaches: the XF (or lag) correlator and the FX correlator. The XF correlator first performs a correlation in the spatial domain for each pair of input data streams, and then performs a fast Fourier transform (FFT) for each pair. The FX correlator performs an FFT for each data stream, and then a conjugate multiplication in the frequency domain for each pair of streams. The two approaches are mathematically equivalent. This diagram is similar to that of an autocorrelation [86], except there are two separate data streams undergoing correlation.

2.2.1 Digital FX Correlator

Prior to reaching the FX correlator, each of the N radio frequency signals, x(t), from the telescope receivers is digitally sampled and quantised to an integer bit representation, r_n[k]. This is represented by

r_n[k] = \hat{x}_n(k \, \Delta\tau) \qquad (2.15)

for k ∈ Z, where x̂_n(t) is the quantisation of x(t). The sampling period is defined by Δτ, the amount of time between consecutive samples. The reciprocal of the sampling period gives the sampling frequency. The sampling frequency must be at least the Nyquist rate, which is double the highest frequency component present in the signal. A more thorough analysis of the effects of discrete sampling and quantisation is outside the scope of this background, and the reader is referred to Oppenheim and Schafer for further details [75].

The digital FX correlator processes the digitised signals in the three stages shown in Figure 2.7: the unpack, the frequency transform, and the conjugate multiply and accumulate (CMAC) stages. The first stage unpacks the digitally sampled signals, r_n[k], to a floating point representation, x_n[k]. The nature of the unpacking operation is dependent on the data packing scheme used by the interferometer receiver hardware. If the signals are organised so that samples from all streams are contiguous for a given time, the data must be shuffled such that the samples for each stream are contiguous. This latter organisation is required for the fast Fourier transform, and the shuffle operation is referred to as corner turning. This work assumes corner turning is not required, and only unpacks signals that are already stream-contiguous in this stage.
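As a simple illustration of this stage, the CUDA kernel sketch below converts packed samples to floating point using an unpack scaling factor and bias compensation (u_S and u_B from the List of Variables). It is a minimal sketch that assumes one signed 8-bit sample per byte and stream-contiguous input; the kernel name and packing scheme are illustrative and not the implementation given in Appendix A.1.

#include <cuda_runtime.h>

// Minimal unpack-stage sketch: convert packed 8-bit samples to floats.
// Assumes one signed 8-bit sample per byte and stream-contiguous input;
// the actual packing scheme is receiver-dependent (see Appendix A.1).
__global__ void unpackKernel(const signed char* __restrict__ packed,
                             float* __restrict__ unpacked,
                             int numSamples, float uS, float uB)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numSamples) {
        // x[k] = uS * r[k] + uB, applied independently to every sample
        unpacked[i] = uS * static_cast<float>(packed[i]) + uB;
    }
}

// Host-side launch (error checking omitted):
//   int threads = 256;
//   int blocks  = (numSamples + threads - 1) / threads;
//   unpackKernel<<<blocks, threads>>>(d_packed, d_unpacked, numSamples, uS, uB);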

The second stage transforms the floating point signals to the frequency domain, typically utilising a discrete Fourier transform, given by

S_{a,n}[\nu] = \sum_{l=0}^{L-1} x_n[l] \, e^{-i 2\pi \nu l / L} \qquad (2.16)

This produces the output spectra S_{a,n}[ν] over a discrete range of frequencies ν, for the a-th spectrum in the n-th telescope stream. For the F desired frequency channels in the complex visibilities output by the FX correlator, an FFT of length L is used.

This can be either a real-to-complex FFT of length L_R = 2F − 2, or a complex-to-complex FFT of length L_C = F, depending on whether the telescope data is real or complex.
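The frequency transform maps naturally onto a batched GPU FFT. The sketch below uses the cuFFT library and assumes complex single-precision input with transform length L_C = F; it is illustrative host code only, not the configuration used for the results in Chapter 5.

#include <cufft.h>

// Frequency-transform stage sketch: batched 1-D complex-to-complex FFTs.
// d_signals holds N streams of fftLength complex samples each, already
// unpacked and stream-contiguous; error checking is omitted for brevity.
void transformStage(cufftComplex* d_signals, cufftComplex* d_spectra,
                    int fftLength, int numStreams)
{
    cufftHandle plan;
    cufftPlan1d(&plan, fftLength, CUFFT_C2C, numStreams);   // one batched plan for all streams
    cufftExecC2C(plan, d_signals, d_spectra, CUFFT_FORWARD);
    cufftDestroy(plan);
}

// For real-valued telescope data, a real-to-complex plan of length L_R = 2F - 2
// (CUFFT_R2C with cufftExecR2C) produces the same F output channels per transform.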

The third stage of the FX correlator is the conjugate multiply and accumulate (CMAC) stage. Each m-n pair of frequency spectra is conjugate multiplied, and then accumulated across A transforms using

C_{m,n}[\nu] = \sum_{a=0}^{A-1} S_{a,m}[\nu] \, S_{a,n}^{*}[\nu] \qquad (2.17)

The size of the accumulation, A, is dependent on the interferometer specification. The integration used in an accumulation reduces the effects of noise as the length of the accumulation increases. This occurs because the frequency components of the observed source are assumed to remain constant over the period of accumulation, while the noise is a random process. However, accumulations must be sufficiently short that effects due to the rotation of the Earth are negligible. Furthermore, variations in the observed signal that are significantly shorter than the accumulation length are lost. This accumulation produces the complex visibilities C_{m,n}[ν] for the p_total pairs of signals that were defined previously in Equation 2.8. For the p_auto autocorrelation pairs, the complex visibility is an accumulation of the signal power spectral density. Once an accumulation is complete, the results are output and the next accumulation begins.
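One straightforward way to express Equation 2.17 on the GPU is to assign one thread per frequency channel of each baseline pair, as in the CUDA sketch below. This is a minimal illustration only; the kernel and variable names are mine, and the optimised frequency-, stream-, group- and pair-parallel mappings compared in Chapter 4 distribute the same work differently.

#include <cuComplex.h>

// Minimal CMAC-stage sketch: one thread per (baseline pair, frequency channel).
// spectra : A transforms x N streams x L channels of cuFloatComplex values
// vis     : pTotal pairs x L channels of accumulated complex visibilities
// pairM/N : lookup tables mapping pair index p to stream indices (m, n)
__global__ void cmacKernel(const cuFloatComplex* __restrict__ spectra,
                           cuFloatComplex* __restrict__ vis,
                           const int* __restrict__ pairM,
                           const int* __restrict__ pairN,
                           int numStreams, int fftLength, int accumLength)
{
    int chan = blockIdx.x * blockDim.x + threadIdx.x;  // frequency channel nu
    int pair = blockIdx.y;                             // baseline pair p
    if (chan >= fftLength) return;

    int m = pairM[pair];
    int n = pairN[pair];
    cuFloatComplex sum = make_cuFloatComplex(0.0f, 0.0f);

    // C_{m,n}[nu] = sum over a of S_{a,m}[nu] * conj(S_{a,n}[nu])
    for (int a = 0; a < accumLength; ++a) {
        const cuFloatComplex* S = spectra + (size_t)a * numStreams * fftLength;
        cuFloatComplex sm = S[m * fftLength + chan];
        cuFloatComplex sn = S[n * fftLength + chan];
        sum = cuCaddf(sum, cuCmulf(sm, cuConjf(sn)));
    }
    vis[pair * fftLength + chan] = sum;
}

A launch over a grid of (ceil(L / threads), p_total) blocks covers every channel of every pair.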

Because the FX approach is reliant on the Fourier transform, the algorithm contains spectral leakage, which is also referred to as ringing [10]. This phenomenon is due to frequencies that are not directly aligned with the output spectral channels appearing to some extent in every other spectral channel. The FFT of an aligned and a non-aligned single-frequency complex sinusoid signal is shown in Figure 2.8.

To demonstrate the cause of this effect, consider a complex sinusoid time series x_{ν′}[k] of frequency ν′:

x_{\nu'}[k] = e^{i 2\pi k \nu' / L} \qquad (2.18)

Substituting this into the discrete Fourier transform results in:

X_{\nu'}[\nu] = \sum_{k=0}^{L-1} x_{\nu'}[k] \, e^{-i 2\pi \nu k / L}
             = \sum_{k=0}^{L-1} e^{i 2\pi k \nu' / L} \, e^{-i 2\pi \nu k / L}
             = \sum_{k=0}^{L-1} e^{i 2\pi k (\nu' - \nu) / L}
             = \sum_{k=0}^{L-1} e^{-i k \mu}, \qquad \mu = 2\pi(\nu - \nu')/L

This general equation can be evaluated to the following [52]:

X_{\nu'}[\nu] = e^{-i\mu(L-1)/2} \, \frac{\sin(\mu L / 2)}{\sin(\mu / 2)}
             = e^{-i\left(\pi(\nu-\nu') - \pi(\nu-\nu')/L\right)} \, \frac{\sin\!\left(\pi(\nu - \nu')\right)}{\sin\!\left(\pi(\nu - \nu')/L\right)}

Thus, the ringing is caused by the discrete spectral channel output sampling the continuous frequency response, X_{ν′}[ν], shown in Figure 2.9. In particular, it is the sampling of the side lobes present on each side of the main lobe that causes the leakage. The frequency response of the FFT is caused by using a finite time series to represent a continuous signal. Weighting functions, such as Hamming and Hanning windows, are commonly used to efficiently reduce the leakage. A polyphase filterbank is another technique used in radio astronomy to reduce the leakage effect, and is used in combination with weighting functions. The polyphase filterbank is next detailed.
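Before turning to the polyphase filterbank, the host-side sketch below gives a quick numerical check of the leakage expression above, evaluating the DFT of a complex sinusoid for an aligned and a non-aligned frequency. The values L = 16, ν′ = 4 and ν′ = 4.5 simply mirror Figure 2.8 and are otherwise arbitrary.

#include <cstdio>
#include <cmath>
#include <complex>

// Evaluate |X_{nu'}[nu]| / L for a length-L complex sinusoid of frequency nuPrime,
// by direct summation of the DFT; an aligned tone gives 1.0 in a single channel.
static double channelResponse(double nuPrime, int nu, int L)
{
    std::complex<double> sum(0.0, 0.0);
    for (int k = 0; k < L; ++k)
        sum += std::polar(1.0, 2.0 * M_PI * k * (nuPrime - nu) / L);
    return std::abs(sum) / L;
}

int main()
{
    const int L = 16;
    for (double nuPrime : {4.0, 4.5}) {   // aligned, then non-aligned
        std::printf("nu' = %.1f:", nuPrime);
        for (int nu = 0; nu < L; ++nu)
            std::printf(" %.3f", channelResponse(nuPrime, nu, L));
        std::printf("\n");
    }
    return 0;
}

The aligned tone responds only in channel 4, while the non-aligned tone produces a non-zero response in every channel; this is the leakage that the techniques of the next section reduce.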

[Figure 2.7 flowchart: Initialise; read digital samples; Stage 1 Unpack; Stage 2 Fast Fourier transform; Stage 3 CMAC; when an accumulation is complete, write the accumulated complex visibilities; repeat while there is more data to process; Finalise.]

Figure 2.7: Serial FX correlator algorithm. For each of the N signal streams, a time series of data corresponding to the fast Fourier transform (FFT) length L is processed during each pass of the algorithm. Each correlation pass consists of three stages: unpack, frequency transform, and the CMAC. After sufficient passes complete an accumulation, a complex visibility for each antenna pair is produced and the subsequent accumulation begins.

[Figure 2.8 panels: (a) aligned frequency signal (freq = 4); (b) non-aligned frequency signal (freq = 4.5). Both panels plot amplitude against frequency channel.]

Figure 2.8: Fourier transform spectral leakage. Shown in the first graph is the Fourier transform of a complex sinusoid with a frequency aligned to one of the spectral channels. Shown in the second graph is the Fourier transform of another complex sinusoid with a frequency directly between two of the spectral channels. When the frequency is aligned, the other spectral channels have no response. However, for a non-aligned frequency there is a response in every other spectral channel. This response is called spectral leakage, and is also referred to as ringing [10].


Figure 2.9: Fourier transform spectral response. Shown in the graph is the spectral response for a Fourier transform of a complex sinusoid of a single frequency. The response is zero at values that differ from the driving frequency by a multiple of the spectral channel width. Between these zeros there are side lobes, which are the cause of spectral leakage for non-aligned frequencies.

2.3 Polyphase Filter Techniques

The use of a Fourier transform that weights all lags evenly leads to a result that is the true autocorrelation multiplied by a rectangular weighting function, whose Fourier transform corresponds to a sinc function. This produces side lobes on either side of the main peak that are responsible for the spectral leakage. The reduction of such lobes is addressed using weighting functions and polyphase filter approaches.

A polyphase filter bank (PFB) efficiently implements a bank of evenly spaced, digital Finite Impulse Response (FIR) filters. This approach effectively improves the filter response of each channel in the Fourier transform [78, 85]. Consider a digitally sampled and unpacked time series x[n], with F frequency channels in the desired FFT output spectra, S[ν]. The polyphase filter then consists of T FIR filters, or taps, each of which is the same length as the FFT.

An FIR filter of length L is represented by a series of L weights: ρ_0, …, ρ_l, …, ρ_{L−1}. In a traditional filter approach, the filter would be applied to the time series and the spectra obtained via an FFT:

S[\nu] = \sum_{n=0}^{L-1} \rho[n] \, x[n] \, e^{-2\pi i \nu n / L} \qquad (2.19)

As described by Bunton [9], the polyphase filter reduces the number of transform values calculated by a factor of the number of taps, T. Instead of ν ∈ [0, L − 1], output is calculated for ν′ ∈ [0, L/T − 1] using

S[\nu'] = \sum_{n=0}^{L-1} \rho[n] \, x[n] \, e^{-2\pi i \nu' n T / L} \qquad (2.20)

This is rearranged using M = L/T to reduce redundant operations

S[\nu'] = \sum_{m=0}^{T-1} \sum_{n=0}^{M-1} \rho[n + mM] \, x[n + mM] \, e^{-2\pi i \nu' (n + mM) T / L} \qquad (2.21)
        = \sum_{m=0}^{T-1} \sum_{n=0}^{M-1} \rho[n + mM] \, x[n + mM] \, e^{-2\pi i \nu' n T / L} \, e^{-2\pi i \nu' m} \qquad (2.22)
        = \sum_{m=0}^{T-1} \sum_{n=0}^{M-1} \rho[n + mM] \, x[n + mM] \, e^{-2\pi i \nu' n / M} \qquad (2.23)
        = \sum_{n=0}^{M-1} \left[ \sum_{m=0}^{T-1} \rho[n + mM] \, x[n + mM] \right] e^{-2\pi i \nu' n / M} \qquad (2.24)
        = \sum_{n=0}^{M-1} b[n] \, e^{-2\pi i \nu' n / M} \qquad (2.25)

Thus a smaller FFT of length M is used along with a filter b[n] defined as

b[n] = \sum_{m=0}^{T-1} \rho[n + mM] \, x[n + mM] \qquad (2.26)
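A minimal CUDA sketch of this filter stage is given below: each thread forms one b[n] for one stream by summing its T weighted samples (Equation 2.26), after which a length-M FFT replaces the length-L transform. The names are illustrative, the weights ρ are assumed to be precomputed on the host (for example, the windowed sinc discussed next), and this is not the kernel of Appendix A.3.

// Minimal polyphase (decimation) filter sketch implementing Equation 2.26.
// x   : L = T*M unpacked input samples per stream
// rho : L precomputed filter weights
// b   : M filtered outputs per stream, to be passed to a length-M FFT
__global__ void polyphaseFilterKernel(const float* __restrict__ x,
                                      const float* __restrict__ rho,
                                      float* __restrict__ b,
                                      int M, int T)
{
    int n      = blockIdx.x * blockDim.x + threadIdx.x;  // output index in [0, M)
    int stream = blockIdx.y;                             // telescope stream index
    if (n >= M) return;

    const float* xs = x + (size_t)stream * M * T;        // this stream's input block
    float sum = 0.0f;

    // b[n] = sum over taps m of rho[n + m*M] * x[n + m*M]
    for (int m = 0; m < T; ++m)
        sum += rho[n + m * M] * xs[n + m * M];

    b[(size_t)stream * M + n] = sum;
}

Each output reads T input samples, which is the additional memory access referred to in the abstract; the measurements in Chapter 5 indicate that much of this cost is hidden behind existing memory latency.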

The choice of weighting function for the polyphase filter is crucial. While using a rectangular weighting function reduces the side lobes, as shown in Figure 2.10, this occurs because the non-aligned frequency components in the Fourier transform are entirely removed, as shown in Figure 2.11. For radio astronomy, this loss of spectral information is undesirable. A Hanning-windowed sinc filter, defined as

\rho[l] = \mathrm{sinc}\!\left((L/2 - l)/F\right)\left(\tfrac{1}{2} - \tfrac{1}{2}\cos(2\pi l / L)\right) \qquad (2.27)

where

\mathrm{sinc}(x) = \begin{cases} \sin(\pi x)/(\pi x), & x \neq 0 \\ 1, & x = 0 \end{cases} \qquad (2.28)

is used to reduce ringing while retaining spectral features. Figure 2.12 shows that the ringing is significantly reduced, while the spectral response for non-aligned frequencies is retained when using a polyphase sinc filter.

The FX correlation algorithm predominantly consists of floating point calculations, which are suitable for data parallel processing. Thus, it is an ideal candidate for a GPU computing approach. This would utilise the parallelism of the GPU to obtain processing performance, while maintaining some of the flexibility of software correlation techniques traditionally applied to the CPU. To leverage the processing power of the GPU, the algorithm must be implemented in a parallel manner. Thus, while still being mathematically identical, it fits optimally within the GPU computing paradigm.


Figure 2.10: Polyphase Fourier transform response. Shown in the graph is the spectral response of a polyphase filter and subsequent FFT applied to the same single-frequency sinusoidal signal used in Figure 2.9. The responses of polyphase filters with taps T = 1 and T = 2 are shown. As the number of taps increases, the response in the side lobes decreases significantly.

[Figure 2.11 plot: effective filter weight against relative frequency for T = 1, 2, 4 and 8 taps.]

Figure 2.11: Rectangular polyphase filter. This graph shows the effect of a unit rectangular polyphase filter with T taps. The domain of the graph is a single Fourier transform frequency channel, where the origin represents an aligned frequency, and ±0.5 represents half the distance to the subsequent frequency channel. Thus as T increases, the polyphase filter shown selects closer to the aligned frequencies, and significantly reduces the non-aligned frequencies. This graph is periodic, and repeats for all frequency channels.

[Figure 2.12 panels: (a) aligned frequency signal (freq = 4); (b) non-aligned frequency signal (freq = 4.5). Both panels plot amplitude against frequency channel.]

Figure 2.12: Polyphase sinc filter. Shown in the first graph is the Fourier transform of a filtered complex sinusoid with a frequency aligned to one of the spectral channels. The other spectral channels have no response. Shown in the second graph is the Fourier transform of a filtered complex sinusoid with a frequency directly between two of the spectral channels. There is only a response in the two spectral channels adjacent to the signal.

2.4 Parallel Processing

Parallel processing utilises multiple computational entities to solve a single problem [84]. In contrast, serial processing is limited to a single computational entity.

Following the conventions of Massingill, Mattson and Sanders [55], I will use the term unit of execution (UE) to refer to a single computational entity. Computer programs must make use of multiple UEs in order to utilise parallel hardware architectures.

Parallel architectures are classified as either homogeneous or heterogeneous. Homogeneous computing architectures use identical processing cores. An example of a homogeneous parallel architecture is a multicore CPU. Homogeneous systems are typically easier to program, given their single processor type. In contrast, heterogeneous computing architectures use more than one type of processing core. An example of a heterogeneous parallel architecture is a machine that utilises both a CPU and a GPU. The advantage of heterogeneous systems is that the different processors can be used for the algorithms that they are best suited to. The resulting performance of the heterogeneous architectures can be worth the additional programming investment.

While parallel hardware has existed for decades, software has yet to adapt to such architectures. In the past such adaptation was not necessary, due to the steady improvement of single-core processing performance. However, the advent of the power wall has restricted growth in the processing speed of serial architectures [53]. The power wall limits the speed of serial processor technology, because increasing power consumption beyond this wall causes the processors to overheat. In turn, hardware architectures have adopted parallelism in order to continue performance growth. Legacy serial codes must be adapted to use multiple UEs to take advantage of these new parallel architectures.

Parallelisation of legacy serial code is achieved via several design stages. Firstly, the application must be decomposed into a collection of tasks that can be processed by UEs. Those tasks that can be executed concurrently are identified, along with dependencies that prevent the parallel execution of some tasks. The tasks and their dependencies are classified with an algorithmic structure tree. Shown in Figure 2.13, the algorithmic structure tree organises different parallel algorithm approaches, or patterns. Similar patterns are grouped into three categories that form the three main branches of the tree: organisation by task, organisation by data decomposition, and organisation by data flow [56].

The organisation by task branch contains patterns in which the focus of the parallelism is on the tasks performed by a program. Patterns of this branch require tasks within an application that can be processed independently of one another. If these tasks are all identical and independent, the pattern is referred to as embarrassingly parallel. Another pattern in this branch is divide and conquer. The divide and conquer pattern first splits a problem into smaller concurrent subtasks, and merges the results of these subtasks to solve the original problem. If the tasks are instead dissimilar, the structure is classified as task parallelism. The task parallelism pattern uses a single UE for each identified task.

The organisation by data decomposition branch of the algorithmic structure tree contains patterns in which data parallelism is optimal. This type of pattern requires the UEs to access the data in an independent manner. If the data access of the UEs is geometrically structured, and the UEs communicate only with close neighbours, the structure is classified as geometric decomposition. This case is similar to the embarrassingly parallel case; however, some data must be shared between UEs. If instead the data is inherently recursive, such as that of a tree or graph data structure, the algorithmic structure is classified as a recursive data pattern. In this case, the level of parallelism possible can vary as the data structure is traversed.

The organisation by data flow branch of the algorithmic structure tree contains patterns in which the focus of the parallelism is the flow of the data through the program. In these algorithms, the data flows are processed by a series of tasks. UEs requiring one chunk of data may be computed in parallel with other UEs requiring other data chunks. If all of the data flows through the same sequence of tasks, this structure is classified as a pipeline. Each stage of the pipeline is executed concurrently as the data chunks make their way through. If instead the tasks are processed in an irregular manner, the structure is classified as event based. In this pattern, UEs generate tasks that are processed by other UEs. Where these tasks are independent, concurrent processing can occur.

Once an algorithm has been classified the actual implementation is developed on a particular parallel hardware architecture. The choice of hardware architecture depends on the pattern of parallelism in the algorithm. This is because different architectures are better suited to different types of parallelism. For example, task parallel algorithms typically suit the multicore architecture, while data parallel al- gorithms typically suit the GPU architecture. The architectures are classified to enable a match between parallel pattern and architecture.

Computing architectures are classified using Flynn’s taxonomy [30], as shown in Figure 2.14. There are four main classifications. Single Instruction Single Data (SISD) architectures are serial, performing a single instruction on a single data element at a time. Single Instruction Multiple Data (SIMD) architectures perform the same instruction concurrently on multiple data elements at a time, and suit data parallel algorithms. Multiple Instruction Single Data (MISD) architectures perform multiple instructions on the same data element in a pipeline, and suit pipeline algorithms. Multiple Instruction Multiple Data (MIMD) architectures are capable of performing multiple instructions concurrently on multiple data elements.

There exists a sub-category of MIMD, known as single program multiple data (SPMD) [46]. In this scheme, each UE executes the same program, but is not necessarily at the same stage of the program as the others. With scheduling capabilities, this allows memory latency to be hidden by the processing of other UEs. That is, while a UE waits for a memory access to be processed by the memory subsystem, it is suspended by the scheduler and a different UE is executed to make use of the otherwise idle processor. If a UE has a sufficiently high ratio of computation to data transfer, or arithmetic intensity [21, 82], the memory latency may be completely concealed. The GPU is an example of the SPMD architecture, and is suited to a variety of the patterns in the algorithmic structure tree, including the embarrassingly parallel, geometric decomposition, and pipeline patterns.

The performance improvement obtained from parallel methodology varies, and depends on both the algorithmic structure of an application and the computer architecture. Amdahl’s law [4] is used to estimate the maximum improvement in performance. Specifically, for the proportion of the algorithm that is parallelised, p, between N UEs, and the remaining serial proportion, s = 1 − p, the performance will improve by a factor r estimated using

r = \frac{1}{s + \frac{1 - s}{N}} \qquad (2.29)

Thus, the performance improvement of a parallel program is limited by the time needed for the sequential fraction of the program. As shown in Figure 2.15(a), as the number of processors N becomes sufficiently large (N >> 1 − s) the value of r becomes constant. The corresponding value of the plateau is r = 1/s.
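As a concrete illustration, the short C function below (a hypothetical helper, not part of the correlator code) evaluates Equation 2.29; for a serial fraction of s = 0.1 the predicted speedup approaches the plateau of r = 1/s = 10 as N grows.

    #include <stdio.h>

    /* Estimate the speedup predicted by Amdahl's law (Equation 2.29)
     * for a serial fraction s and N units of execution. */
    static double amdahl_speedup(double s, int N)
    {
        return 1.0 / (s + (1.0 - s) / N);
    }

    int main(void)
    {
        /* For s = 0.1 the speedup plateaus near r = 1/s = 10. */
        for (int N = 1; N <= 1024; N *= 4)
            printf("N = %4d  r = %.2f\n", N, amdahl_speedup(0.1, N));
        return 0;
    }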

However, as noted by Gustafson [36], the serial proportion typically decreases with increasing parallelism. This is because the absolute size of the serial part of a program does not necessarily increase as the program becomes more parallel. Thus as the program becomes more parallel, the serial part of the program as a proportion decreases. Gustafson’s law [36] uses a scheme in which proportions are calculated based on the code run by a single UE, and thus the serial proportion, s′, is constant with increasing parallelism for real world problems. As shown in Figure 2.15(b), this results in additional UEs continuing to improve performance, estimated by

s' = \frac{Ns}{(N - 1)s + 1} \qquad (2.30)

p' = \frac{p}{(1 - N)p + N} \qquad (2.31)

r' = s' + p'N \qquad (2.32)

In the above equations, performance continues to increase for all values of N. Substituting Equations 2.30 and 2.31 into Equation 2.32,

r' = s' + p'N \qquad (2.33)
   = \frac{Ns}{(N - 1)s + 1} + \frac{pN}{(1 - N)p + N} \qquad (2.34)
   = \frac{Ns}{(N - 1)s + 1} + \frac{(1 - s)N}{(1 - N)(1 - s) + N} \qquad (2.35)
   = \frac{N}{Ns + 1 - s} \qquad (2.36)
   = \frac{1}{s + \frac{1 - s}{N}} \qquad (2.37)
   = r \qquad (2.38)

yields an equation for r' identical to that of Equation 2.29. Thus for any given problem, Amdahl’s and Gustafson’s laws provide identical performance predictions.

Figure 2.13: Algorithmic structure tree. The algorithmic structure tree organises different parallel algorithm approaches, or patterns. Similar patterns are grouped into three categories that form the three main branches of the tree: decomposition by task, decomposition by data, and decomposition by data flow [56]. The shading indicates the patterns that are most suited to GPU processing. It should be noted that the task parallelism in GPU processing occurs in the distribution of work between the CPU and GPU, while the other patterns are suited to the parallel processing of work by the GPU.

Figure 2.14: Flynn’s taxonomy. Parallel computing architectures are classified using Flynn’s taxonomy [30]. There are four main classifications. Single Instruction Single Data (SISD) architectures are serial, performing a single instruction on a single data element at a time. Single Instruction Multiple Data (SIMD) architectures perform the same instruction concurrently on multiple data elements at a time, and suit data parallel algorithms. Multiple Instruction Single Data (MISD) architectures perform multiple instructions on the same data element in a pipeline, and suit pipeline algorithms. Multiple Instruction Multiple Data (MIMD) architectures are capable of performing multiple instructions concurrently on multiple data elements. Single program multiple data (SPMD) is a subcategory of MIMD, in which each process executes the same program of instructions, but is not necessarily at the same stage of the program as the other processes.

Figure 2.15: Parallel performance. Amdahl’s law shows that there is a limit to the effectiveness of parallel programming [4]. As shown in Figure 2.15(a), for a constant serial code portion s, the improvement in performance plateaus as the number of UEs increases. However, for real world problems the serial portion typically decreases with increasing parallelism. This was formalised in Gustafson’s law [36]. In this scheme, the portions are calculated based on a single UE, and thus the serial portion, s′, is constant with increasing parallelism. This results in Equation 2.32, in which additional UEs continue to improve performance as shown in Figure 2.15(b).

2.5 Programmable Graphics

Prior to the introduction of the graphics processing unit (GPU), video output on the personal computer was transmitted to the screen via a video graphics array (VGA) controller. The image was generated, or rendered, entirely by the CPU, and the VGA controller functioned as an interface between the CPU and the computer screen [29]. Rendering is a multi-stage process that produces a two-dimensional screen image of a virtual three-dimensional space, as shown in Figure 2.16.

The pipeline begins with a 3D scene of shapes defined by vertices. Vertices are vectors in the three dimensional space of the scene that locate the corners of these shapes. These vertices are then transformed by the vertex processor, from their position in the 3D space of the scene into the corresponding 2D position on the screen. The transformed vertices then undergo primitive assembly to obtain the shapes they represent, called primitives. The next stage, rasterisation, uses the 2D screen coordinates of the primitives to determine which of the screen pixels are within the shape. These pixels, along with additional data interpolated from the vertices, are output from the rasterisation unit and are referred to as fragments. In the final stage of the pipeline the fragment processor determines the final colour of the pixels. For each fragment from the rasteriser, it uses the contained interpolated data to sample textures. Textures are typically two dimensional images, which are overlaid onto the primitives by this process. The pixels are output to the framebuffer, from where they may be displayed to the screen or saved back to the texture memory for subsequent reuse.

The CPU is an excellent general purpose processor, with real estate split between floating point, logic, and cache units. However, the desire to bring realtime video output to photorealistic quality necessitated the creation of a processor dedicated to rendering, the GPU. Because the GPU devotes the majority of its real estate to floating point units [102], it could obtain superior performance for graphics rendering. Over the history of its development, the GPU gradually took over the various stages of graphical computation from the CPU, as next discussed.

Figure 2.16: Programmable rendering pipeline. Shown are the stages required to convert 3D virtual spaces to 2D images. In this example, a square in three dimensional space defined by its four corner vertices is rendered. The vertices are first transformed by the vertex processor into the 2D space of the screen, and assembled into a square primitive. The primitive is converted into fragments, which are the pixels that fall within the primitive along with additional information. The fragments are then textured by the fragment processor to produce a final 2D image that is either displayed to the screen or stored in texture memory.

2.5.1 Development of the Programmable GPU

Three dimensional (3D) computer graphics rendering began with the rendering pipeline, shown in Figure 2.17. This pipeline was originally executed in software on the CPU, or by specialised computer systems dedicated to this purpose [61]. However, both of these approaches were unable to meet a growing demand for consumer graphics. The specialised computer systems were too expensive for this market. While the CPU could produce quality graphics, it was computationally outmatched by the introduction of consumer graphics hardware, called the Graphics Processing Unit (GPU).

GPUs were introduced in the 1990s with the release of graphics cards such as NVIDIA’s RIVA TNT card, ATI’s Rage 128 card, and cards from 3dfx. As shown in Figure 2.17, these first GPUs performed rasterisation and fragment shading with a fixed function capacity. A fixed function pipeline stage performs standard rendering operations, and customisation is limited to altering the values of parameters. Table 2.1 summarises the release dates and associated computational power of the TNT and subsequent NVIDIA graphics hardware.

In 1999 the functionality of the GPU was increased to support the entire rendering pipeline with fixed function capability. The NVIDIA Corporation released the GeForce256 graphics card, and this was soon followed by a comparable card from ATI. The addition of the initial pipeline stages reduced the computational pressure on the central processing unit. At this time, graphics cards were highly configurable but not programmable. Thus, the cards were limited in their functionality to that of the standard rendering pipeline.

This limitation was removed with the next generation of GPUs, released in 2001.

These included NVIDIA’s GeForce3 and later GeForce4, as well as ATI’s Radeon 8500 card. These early programmable GPUs contained a programmable vertex processor, which allowed user-written programs to be executed on a per-vertex basis in the rendering pipeline. The programmable vertex processor was located at the beginning of the pipeline, prior to the primitive assembly, as can be seen in Figure 2.18. While the inclusion of the vertex processor made general purpose programming on graphical hardware possible, due to its location early in the pipeline the results of the program were lost in the remaining processing.

In 2003, NVIDIA’s GeForceFX and ATI’s Radeon 9700 graphics cards were released, and introduced a programmable fragment processor that allowed fragment programs to be run on a per-fragment basis in the rendering pipeline. The programmable fragment processor was located at the end of the pipeline after the rasterisation, as can be seen in Figure 2.18. Its proximity to the end of the pipeline made the fragment processor preferred for general purpose computing, since the results of the program could be read directly from the framebuffer. The term GPGPU was coined for the programming of the GPU for general purpose algorithms using graphics programming interfaces.

The GeForceFX was the first graphics card that supported NVIDIA’s Cg language [29]. Standing for “C for Graphics”, Cg is a C-like high level programming language for NVIDIA’s programmable graphics hardware. The release of a high-level programming language was an important step forward for programmable graphics, comparable to the development of high level languages for the CPU. At this stage of development, for the first time the GPU could be programmed for general purpose applications with a high level graphics programming language.

NVIDIA released its GeForce6 card in September 2004. The card had significant performance improvements, in particular the introduction of 32 bit floating point calculations. This GPU also supported the new PCI Express bus architecture, which was not only twice as fast as the previous AGP 8x bus, but also allowed the use of several graphics cards in parallel. These two developments significantly increased the advantages of the graphics hardware for general purpose processing.

The GeForce 8800 was released in late 2006. Combined with the release of NVIDIA’s CUDA programming API [69, 72], this card revolutionised the use of GPUs for general purpose computing. The underlying architecture of the GeForce 8800 broke the tradition of using the pipelined hardware implementations of previous generations. Instead it used a unified architecture, in which one parallel device was used for each stage of the graphics pipeline via scheduling. At the same time, the CUDA programming API allowed the programmer to use the GPU as purely a computational device, by avoiding the graphical paradigms inherent in the languages used previously in GPGPU techniques. This new non-graphical GPU programming method is referred to as GPU computing. It made the GPU accessible to a broader range of programmers, because a computer graphics background was no longer required to use the GPU.

In 2007, NVIDIA released the Tesla series. These systems, ranging from a single card to a desk side server, were based on the GeForce 8800 architecture. However, unlike the GeForce 8800, the Tesla had no video output port and significantly more on-board memory. These cards were thus designed for the general GPU computing market, in particular applications limited by the on-board memory of existing cards. The GeForce 9800 GX2, released in 2008, contained two GPUs and surpassed the milestone of a teraflop of GPU processing power on a single card.

The growth of graphics hardware performance, shown in Figure 2.19, is another incentive to explore its general purpose applications. The CPU speeds over the past two decades have doubled every eighteen months in accordance with Moore’s law [62]. This law defines the speed as the density of transistors at a given die size. In contrast, GPU speeds have doubled every six months, which NVIDIA refers to as “Moore’s law cubed” [68]. This is due to the design of the GPU, which can be improved not only in a similar manner to the CPU, but also in its degree of parallelism. This enables GPUs to use additional transistors for computation, achieving higher arithmetic density for a given die size [77]. With this current higher rate of growth, the advantages gained through the utilisation of graphics hardware can only increase.

Figure 2.17: Development of the GPU pipeline. The rendering pipeline shown at the top of this figure was first implemented on the CPU. Over time, parts of the pipeline were implemented on the GPU with fixed functionality. A fixed function pipeline stage performs standard rendering operations, and customisation is limited to altering the values of parameters. Over time, these fixed function stages became fully programmable.

Year  Product Name      Process    Transistors  Fill Rate  FLOPS
1998  RIVA TNT          0.25 µm    7 M          50 M*      -
1999  GeForce 256       0.22 µm    23 M         120 M*     -
2001  GeForce3          0.15 µm    57 M         800 M      -
2003  GeForce FX        0.13 µm    125 M        2000 M     -
2004  GeForce6          0.13 µm    220 M        6400 M     -
2006  GeForce 8800 GTS  0.09 µm    511 M        10.3 B     345.6 G
2006  GeForce 8800 GTX  0.09 µm    681 M        13.8 B     518 G
2008  GeForce 9800 GX2  0.065 µm   1508 M       19.2 B     1.152 T

Table 2.1: A history of NVIDIA graphics hardware. A table showing the development of the processing power of NVIDIA graphics hardware [29]. The number of transistors is measured in millions and is representative of the complexity of the hardware. The process is the minimum feature size, or die size, of the circuit on the silicon chip, measured in micrometres. The antialiasing fill rate is measured in million pixels per second and represents the speed of pixel computations. Values marked with a * are aliased fill rates due to a lack of hardware support for antialiased rendering. The polygon rate is measured in million polygons per second, and is a measure of the GPU’s ability to draw polygons. The GPU’s peak theoretical floating point operations per second (FLOPS) is also listed for the GPU computing products. The emphasised GeForce 8800 GTS is the GPU that was used for this research.

Figure 2.18: A model of the graphics processing unit. This diagram is adapted from NVIDIA’s Cg Manual [29]. It shows the location of the fragment and vertex processors in the rendering pipeline. The proximity of the fragment processor to the end of the pipeline made it easier to capture the output. As a result, the fragment processor was preferred for general purpose computing compared to the vertex processor.

Figure 2.19: Evolution of the GPU. Shown is the historical growth of the graphics and central processing units. The GPU has consistently increased its floating point performance and memory bandwidth at a significantly faster rate than the CPU. This has been achieved through the use of a massively parallel computing architecture.

2.5.2 Compute Unified Device Architecture (CUDA)

The Compute Unified Device Architecture (CUDA) is a parallel programming model and software environment [72]. CUDA has dedicated libraries for the Fast Fourier Transform (FFT) [71] and Basic Linear Algebra Subprograms (BLAS) [70]. As seen in Figure 2.22, there are two main parts that make up a CUDA enabled machine: a CPU host, and one or more devices, each corresponding to a graphics card. The main process runs on the host machine, and coordinates the execution of lightweight parallel programs on the device, called kernels.

Each kernel consists of a number of UEs, which are called threads. As shown in Figure 2.20, the threads in CUDA are organised into a distinct topology [72]. In this topology, threads are grouped into a block. Each thread is indexed within its parent block, and this indexing can be in up to three dimensions. The blocks are then grouped into a grid. The blocks each have an index within the grid, which currently can be in up to two dimensions. Future versions of CUDA will support three dimensional grids.

A grid of threads may use up to the entire GPU processing resources. The grid is broken into blocks to control the division of threads between the GPU multiprocessors. Threads within a block are guaranteed to be on the same multiprocessor and thus can communicate using shared memory. Threads in different blocks may not be on the same multiprocessor, and cannot use shared memory to communicate between blocks.

Kernels can be classified based on their data access patterns. There are four classifications relevant to this work: map, gather, scatter and reduction. In a map kernel, each input value is independently processed to produce a corresponding output value. The input and output values are arranged in the same order in memory. In contrast, gather kernels read inputs in a non-ordered fashion and scatter kernels write output in a non-ordered fashion. Reduction kernels have more than one input for each output. The outputs are still produced independently, which results in a reduction in the size of the data.
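As a minimal sketch of these ideas (not taken from the correlator code in Appendix A), the following CUDA kernel follows the map pattern: each thread computes one output element from the corresponding input element, and the host launches a one-dimensional topology of blocks and threads sized to cover the array. The kernel name, the scaling operation and the launch parameters are illustrative assumptions only.

    // Map pattern: one thread produces one output from one input.
    __global__ void scale_map(const float *in, float *out, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            out[i] = factor * in[i];                     // independent, ordered access
    }

    // Host-side launch: choose a block size and enough blocks to cover n elements.
    void launch_scale_map(const float *d_in, float *d_out, float factor, int n)
    {
        dim3 block(128);                                 // threads per block
        dim3 grid((n + block.x - 1) / block.x);          // blocks per grid
        scale_map<<<grid, block>>>(d_in, d_out, factor, n);
    }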

In a typical CUDA program, data is first acquired and placed into the host memory as depicted in Figure 2.22. The data acquisition may take many forms, such as being generated by the host program, reading from a storage medium, or via a network interface. The data is then transferred to the GPU for subsequent processing via the PCI-express bus. The host process will then instantiate a kernel on the device. The kernel reads data from the device memory, performs computations, and writes results to device memory. Multiple kernels may be executed, and the results remain resident in device memory between kernels. The final results are transferred to the host machine by the host process for subsequent output.
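The host-side staging described above follows a standard allocate, copy, execute, copy-back sequence. A minimal sketch is given below; error checking is omitted and the scale_map kernel from the previous example stands in for real processing.

    #include <cuda_runtime.h>

    // Kernel defined in the previous sketch.
    __global__ void scale_map(const float *in, float *out, float factor, int n);

    void process_on_device(const float *h_in, float *h_out, int n)
    {
        size_t bytes = n * sizeof(float);
        float *d_in, *d_out;

        cudaMalloc(&d_in, bytes);                               // allocate device memory
        cudaMalloc(&d_out, bytes);
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);  // host -> device

        dim3 block(128), grid((n + 127) / 128);
        scale_map<<<grid, block>>>(d_in, d_out, 2.0f, n);       // kernel execution

        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // device -> host
        cudaFree(d_in);
        cudaFree(d_out);
    }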

There are two modes available for transferring data between the host and device in CUDA. The default mode transfers data to and from pageable memory on the host.

Pageable memory may be swapped out of the host memory into virtual memory located on a storage device on the host, such as a hard disk drive. Data in virtual memory must be read back from the storage device with significant latency in order to be used. There is also a page-locked mode, in which page-locked memory is allocated for transfer. Page-locked memory is never swapped out to virtual memory.

The advantage of the page-locked mode is that the underlying system copies directly out of the page-locked memory. In contrast to this direct copy, there is an extra copy involved with the normal pageable memory. However, the size of the allocatable page-locked memory is relatively small, and for large data transfers the pageable mode is superior.
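The two transfer modes differ only in how the host buffer is allocated. A brief sketch of the page-locked variant is shown below, using the CUDA runtime call cudaMallocHost; the buffer and its contents are illustrative assumptions.

    #include <cuda_runtime.h>

    void pinned_transfer_example(float *d_buf, size_t n)
    {
        float *h_pinned;
        size_t bytes = n * sizeof(float);

        cudaMallocHost(&h_pinned, bytes);   // page-locked (pinned) host allocation
        // ... fill h_pinned with samples ...
        cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice); // direct copy, no staging
        cudaFreeHost(h_pinned);             // release the pinned memory
    }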

On the device, CUDA utilises a number of memory spaces, as shown in Figure 2.21. There are two main types of memory spaces. The first are located on the device processor. These memory spaces include the registers and shared memory. The registers are used for holding small amounts of data to be operated on directly by the kernel code. The shared memory is used as a programmer-manageable address space for communication between threads located in the same block. As they are part of the GPU itself, these memories are limited in size but have fast access speeds.
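To illustrate intra-block communication through shared memory, the sketch below stages a tile of input in the on-chip shared memory before each thread reads its neighbour's value. The kernel, its purpose and the fixed block size are illustrative assumptions, not code from this thesis.

    // Each block stages a tile of the input in shared memory, then each
    // thread combines its own element with its right-hand neighbour.
    // Assumes a block size of 128 threads.
    __global__ void neighbour_sum(const float *in, float *out, int n)
    {
        __shared__ float tile[128];                        // one tile per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < n)
            tile[threadIdx.x] = in[i];
        __syncthreads();                                   // all loads complete before reads

        if (i < n) {
            float right = (threadIdx.x + 1 < blockDim.x && i + 1 < n)
                              ? tile[threadIdx.x + 1] : 0.0f;
            out[i] = tile[threadIdx.x] + right;
        }
    }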

In contrast the device memory is located next to the GPU on the device. It has a larger capacity, but also a higher latency, compared to memories located on the device processor. Device memory is capable of high parallel bandwidth.

However, memory access must be ordered according to specific coalescing rules to realise these rates. The device memory has several designated memory areas with separate purposes: global, constant, local, and texture memory. These memory areas and their interaction with the thread topology is shown in Figure 2.21.

Global memory serves as a staging area for input data obtained by the host machine and for results ready for transfer back to the host machine. Data transfers between the host and device are handled by the host machine. While running, GPU kernels can read from and write to the global memory. However, a kernel may not read from the same memory allocation to which it writes within a single kernel execution. Data can remain resident in global memory between kernel executions should it be required for later computation on the device. For the hardware investigated in this research, coalesced memory access requires consecutive threads in a warp to access consecutive memory addresses, aligned to the total size of the memory accessed. A warp is a group of 32 threads that are processed in a SIMD manner by the GPU. More extensive details of coalesced memory access on the GPU can be found in the CUDA Programming Guide [72].
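The difference between coalesced and uncoalesced access is purely a matter of indexing. In the sketch below (an illustrative example, not correlator code), the first kernel lets consecutive threads in a warp touch consecutive addresses, while the second strides through memory and therefore breaks coalescing on the hardware described above.

    // Coalesced: thread i of a warp reads element i, so a warp touches
    // a contiguous, aligned segment of global memory.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Uncoalesced: consecutive threads read addresses 'stride' apart,
    // forcing the hardware to issue many separate memory transactions.
    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i] = in[i * stride];
    }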

Constant memory is used to pass values common to all threads to the CUDA kernel. In this way, a kernel’s behaviour can be altered depending on the value of the constant at runtime. The values of constants are set by the host machine, and are used in the computation of kernels subsequently executed on the GPU.
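A constant is declared once in device code and set from the host before the kernel runs, as in the brief sketch below; the symbol name and its use are illustrative assumptions.

    // A parameter common to every thread, held in constant memory.
    __constant__ float c_scale;

    __global__ void apply_scale(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = c_scale * in[i];   // every thread reads the same cached constant
    }

    void set_scale(float scale)
    {
        // Host side: copy the value into the device constant before launching kernels.
        cudaMemcpyToSymbol(c_scale, &scale, sizeof(float));
    }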

Local memory is an overflow for the registers of a GPU. Being located in the device memory rather than on the GPU itself, it has a much higher latency than that of the registers. Ideally, a kernel will not require more registers than are present, and local memory will not be required. Because if one thread overflows to local memory then all of them will, local memory access conforms to the constraints for parallel device memory access.

Texture memory is a feature still present from the GPU’s graphical heritage. It supports hardware accelerated memory sampling, such as various interpolation functions. For example, consider a value that is the interpolation of several data elements stored in device memory. Using global memory access, all of these data elements must be transferred, and then the interpolation calculated on the GPU itself. If texture memory is used, the interpolation is calculated by the texturing hardware and only the single value is transferred to the GPU. This reduces both the computational load on the GPU and the amount of data that is transferred.
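As a rough sketch of the texture-reference interface of that era (since deprecated in later CUDA releases), device data can be bound to a one-dimensional texture and then sampled inside a kernel with tex1Dfetch; the names below are illustrative assumptions.

    // Legacy CUDA texture reference bound to a 1D array of floats.
    texture<float, 1, cudaReadModeElementType> tex_samples;

    __global__ void read_through_texture(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(tex_samples, i);   // fetch via the texture cache
    }

    void bind_samples(const float *d_samples, size_t n)
    {
        // Host side: bind the device buffer to the texture reference.
        cudaBindTexture(NULL, tex_samples, d_samples, n * sizeof(float));
    }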

While these memory spaces do not have the low latencies of those located on the device processor itself, the latency can be hidden by processing. As discussed in Section 2.4, the GPU is a SPMD machine, consisting of several SIMD multiprocessors. These process the blocks of the topology in groups of threads called warps. Should a warp be waiting on a global memory fetch, the thread scheduler can suspend that warp, and begin the processing of another warp to amortise the latency. As long as a kernel has sufficient arithmetic intensity, the ratio of computational operations to memory access operations, the global memory latency can be completely hidden.

The development of CUDA has made programming the GPU for general purpose applications accessible to a wider range of programmers. This has been made possible by pioneering research that stretched the capabilities of the GPU, when it was used only for the graphical rendering for which it was designed. This research is reviewed in the next chapter.

Figure 2.20: Thread topology. Each kernel consists of a number of UEs, which are called threads. The threads in CUDA are organised into the topology shown [72]. In this topology, the threads are grouped into a block. Each thread is indexed within its parent block, and this indexing can be in up to three dimensions. The blocks are then grouped into a grid. The blocks each have an index within the grid, which currently can be in up to two dimensions. Future versions of CUDA will support three dimensional grids. In terms of the underlying hardware, the grid of threads may use up to the entire GPU processing resources. The grid is broken into blocks so as to mirror the hardware dividing the threads between the GPU multiprocessors. Threads within a block are guaranteed to be on the same multiprocessor and thus can communicate using shared memory, which is described later in this section. Threads in different blocks may not be on the same multiprocessor, and cannot use shared memory to communicate between blocks.

Figure 2.21: Memory locations available to CUDA threads. Shown are the various memories available for access from within a CUDA thread. Memories shaded yellow are located in the GPU multiprocessor and have a low access latency. Memories shaded green are located in the GPU device memory and have a higher latency. The host memory, not shown here but in Figure 2.22, is inaccessible from within a thread, and data must be explicitly transferred into the GPU device memory before kernel execution begins. Kernel results must be explicitly transferred back to the host.

Figure 2.22: GPU-enabled system architecture. Shown are the various processing, chipset and memory systems of a host connected to a GPU device. Code is executed on the host processor. The host processor first copies device programs to the device. It then controls the staging of data in the host memory to the device memory via the chipset. It then signals the device processor to commence processing of data, and is then free to work on other tasks. Once the device kernel is complete, the host processor then manages the retrieval of the data from the device memory to host memory.

Chapter 3

Literature Review

The next generation radio telescope, called the Square Kilometre Array (SKA), will be far larger than the interferometer arrays of today [38]. As the number of receiving elements in an array increases, the computational resources required to process the data scale quadratically. The SKA will require a massive level of computation compared to current arrays [17, 51, 25]. The traditional correlator algorithms required for this computation are well researched [13, 104, 107, 12].

However, the processing traditionally has taken place on application specific integrated circuits (ASIC) or more recently on field programmable gate array (FPGA) architectures [22]. Beowulf clusters consisting of commodity CPU processors have been used recently, in applications such as very long baseline interferometry (VLBI) [23].

In preparation for the SKA, prototype pathfinder arrays are currently being developed, such as the Murchison Widefield Array (MWA) [26]. These prototypes provide the opportunity to consider alternate computing architectures, and to assess potential gains that could be achieved through their use.


GPU acceleration of scientific algorithms is a field that has undergone rapid development, from initial GPGPU research to GPU computing today. The GPU has been shown to be a powerful co-processor in a variety of application areas, including general mathematics [47], image processing [16, 83], signal processing [44, 105], physical simulation [39], and cryptography [14]. To date, GPU computing has featured in several mainstream graphics publications [28, 82, 67], in which the use of the GPU for non-graphical computation is presented. In addition, there exist several surveys of the GPU computing field [77, 76, 35]. These surveys highlight both the initial pioneering GPU research as well as subsequent advances that are significant to the entire GPU computing research area.

The potential power consumption of the hardware required to perform SKA- scale correlation is sufficiently large to affect the choice of different designs being considered [38]. The GPU architecture has been shown to be power efficient in comparison to CPU architectures. While the GPU itself typically requires more power than a CPU, when the corresponding processing capabilities are taken into account the GPU consumes fewer watts per flop [59]. For a stand-alone system, the addition of GPUs to a CPU cluster has been shown to produce higher speedups but with smaller additional power consumption than upgrading to a CPU cluster of comparable computational performance [101]. For computing clusters, expansion via GPUs has been shown to result in double the processing speed for a 20% power increase [33].

This chapter reviews research that is significant to this thesis. Firstly, the development of GPU computing programming languages is discussed. Subsequently, the development of a GPU implementation of the FFT is presented. This is followed by a survey of GPU research in the field of astronomy and astrophysics.

3.1 GPU Programming Languages

As the vertex and pixel units of the GPU became programmable in the first few years of this decade, a new field of research called general purpose GPU (GPGPU) programming developed [77]. Rather than using the graphics hardware for rendering, this field saw the use of the GPU’s parallel processing power applied to the acceleration of general purpose computing algorithms. GPGPU programming was achieved by reinterpreting the rendering pipeline concepts into those of general computing, using graphics application programming interfaces (API). The two most commonly used were OpenGL [106] and DirectX [24]. The host program ran on the CPU and interacted with the GPU using these APIs. It compiled and transferred small programs called shaders to the GPU for processing. The shaders were written in specialised languages, such as Cg [29] and the OpenGL shading language [88].

A paradigm shift in GPGPU programming came in 2003, when a language called Brook for GPUs emerged from the Stanford University Graphics Lab [8]. It adapted aspects of the Brook programming language, designed as an extension of C with support for streams, to the GPU. BrookGPU made use of features such as inline shader programs [57] to abstract the underlying graphics architecture, and instead presented to the programmer a parallel compute oriented language. It had several backends for both GPUs and CPUs of the time, which included the NVIDIA NV30 driver, the Microsoft DirectX9 driver, the OpenGL ARB driver, multithreaded CPUs, and standard CPUs. This broad range gave BrookGPU excellent portability.

In BrookGPU, streams were treated as variables, and accessed from the host by transferring data to and from arrays. Kernels and reductions were written as functions that could be called from the program, and compiled into a fragment program when the rest of the code was compiled. Parameters were passed to the kernel as variables. When compiled, programs were created for all backends. During runtime, the backend was selected by setting an environment variable.

Another strength of BrookGPU was that it completely abstracted the GPU. The programmer wrote stream programs that ran on any programmable graphics hardware without needing to learn the languages for each manufacturer. Its integrated support of GPU emulation on the CPU also enabled rudimentary time comparisons during development. The function approach taken by BrookGPU did have its disadvantages. The stream was copied to the GPU prior to the kernel being processed, and copied back to the CPU after processing. If the next kernel required the output of the previous one, the stream was copied back to the GPU again. Thus, a bottleneck was created between the CPU and GPU in programs with multiple kernels. Consequently, BrookGPU was limited to applications with individual kernels that were complex enough that the speed of calculation compensated for the stream transfer time.

Following the development of BrookGPU, several other GPU computing languages emerged. These include solutions from the graphics hardware vendors, such as AMD’s Brook+ [3] and NVIDIA’s CUDA API [72]. Other third parties also developed solutions, such as Rapidmind [58]. BrookGPU, along with these subsequent frameworks, removed the requirement of graphical computing knowledge for general purpose GPU computing. For clarity, techniques that used graphics APIs are referred to as GPGPU, while the new non-graphical techniques are referred to as GPU computing. Because a graphical background was no longer needed in GPU computing, it opened the field to a larger portion of the research community, including radio astronomy.

3.2 Fast Fourier Transform

The development of efficient FFT algorithms on the GPU is of particular interest to this work, because the FFT is required in the second computational stage of the correlation algorithm. For radio interferometry involving only a small number of streams, this is the most computationally intensive stage of the correlation algorithm. The FFT is a fundamental transform required by signal and image processing. For this reason parallel implementations of the FFT existed prior to the use of the GPU for computational processing [80], and the GPU implementation of the FFT has been well researched by the GPU computing field [64, 34, 71].

The Discrete Fourier Transform (DFT) enables the spectral analysis of discrete-time signals [75]. However, a DFT of length N has a computational complexity of O(N^2). In 1965, Cooley and Tukey derived the FFT [15], which obtained the result of the DFT with a computational complexity of O(N log(N)). This significant reduction in complexity led to the mainstream adoption of the FFT and associated digital techniques for signal processing.

The FFT consists of a number of intermediate stages, and the sizes of these stages are referred to as radices. Different combinations of radices can be used to obtain the same result. However, a particular combination of radices may suit the memory caching of a particular hardware architecture. For this reason, in 1997 Frigo and Johnson developed the FFTW library [31]. This library conducts planning, in which FFT performance for a variety of radix combinations is measured. The best combination is then used to provide the best performance on any given architecture during FFT execution. The best combinations of radices, called plans, can be stored to avoid the need to recalculate them during subsequent program operation.
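As an illustration of the planning interface (a minimal host-side sketch using FFTW's single-precision API, not code from this thesis), a plan is measured once and can then be reused for every subsequent transform of the same length.

    #include <fftw3.h>

    void fftw_planning_example(int L)
    {
        fftwf_complex *in  = (fftwf_complex *) fftwf_malloc(sizeof(fftwf_complex) * L);
        fftwf_complex *out = (fftwf_complex *) fftwf_malloc(sizeof(fftwf_complex) * L);

        /* FFTW_MEASURE times several radix decompositions and keeps the fastest. */
        fftwf_plan plan = fftwf_plan_dft_1d(L, in, out, FFTW_FORWARD, FFTW_MEASURE);

        /* The plan can now be executed repeatedly without re-planning. */
        fftwf_execute(plan);

        fftwf_destroy_plan(plan);
        fftwf_free(in);
        fftwf_free(out);
    }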

The FFT was shown to be achievable on the parallel architecture of the GPU by Moreland and Angel [64]. Their implementation consisted of a two-dimensional FFT, and included both forward and reverse transforms. This utilised an approach that alternated between FFT stages and tangle stages. The tangle stages were used to allow for efficient packing of the FFT data. The resulting performance of this approach was slower than a comparative CPU implementation by a factor of six. However, this research was significant as the first implementation of the FFT on a GPU.

Following this early implementation, improved performance was obtained by using algorithms more suited to the GPU architecture. Key to the development of these algorithms was a better understanding of the GPU memory, such as the model developed by Govindaraju, Larsen, Gray and Manocha [34]. Their model describes the underlying mechanisms used to access data and perform computation on the GPU. Using this knowledge, they implemented a one-dimensional FFT algorithm that used cache-efficient memory access patterns. This implementation achieved performance results that were three times faster than dual-core CPU implementations.

Modern GPU computing languages now include FFT libraries [72, 71], which are utilised by the implementations presented later in this work. These libraries are developed by the GPU vendors for their hardware, and the knowledge of their own architectures results in these libraries being highly optimised. Timings of the CUDA FFT library are presented in Chapter 5.

3.3 GPUs in Astronomy and Astrophysics

In the field of astrophysics, GPUs have been used to simulate the behaviour of a large number of astronomical bodies. This has traditionally been achieved by the use of custom processing hardware, called GRAPE. Zwart, Belleman and Geldof used the Cg shader language and GPGPU techniques to perform these simulations [109]. They subsequently updated their work with the advent of GPU computing, using the CUDA language [5]. They note that a large advantage of the GPU is that it can hold significantly more particles in memory than the GRAPE.

GPUs have been shown to accelerate high performance computing clusters. Fan, Qiu, Kaufman and Yoakum-Stover have developed a 30 node GPU cluster [27]. They then simulated the dispersion of airborne contaminants in the Times Square area of New York City. This resulted in performance increases of a factor of 4.6 compared to a 30 node single-core CPU implementation. Schive, Chien, Wong, Tsaia, and Chiueh have subsequently developed a 16 node dual-card GPU cluster [92] for astrophysics. They have performed the n-body simulations described previously, and have shown this system to be capable of simulating up to 320 million particles. It outperforms the custom hardware GRAPE-6A by a factor of two, at a superior performance-per-dollar ratio.

As well as research in the field of astrophysics, there is some research relevant to radio astronomy signal correlation. For example, while not directly related to astronomy, research into the transfer of data between the host machine and the GPU device is required for correlation on the GPU device. As such, it is important to ensure that the bandwidth between the two is used optimally. Research has shown that data transfer should occur in large batches, rather than in smaller frequent amounts [40]. This can affect the achieved throughput by up to a factor of four.

The conjugate multiply and accumulate (CMAC) stage is the most data intensive stage of the correlation algorithm. The underlying sum-product machine operation had been shown on the GPU to be 270 times faster [95]. Research by Schaaf and Overeem has previously implemented the CMAC stage on the GPU [91]. However, it was constrained due to its GPGPU implementation using a legacy graphics API, which resulted in heavy global memory access.

Additionally, because only the CMAC stage was implemented, the data transfer between the host and device consisted of unpacked floating point values. As the implementation used a graphical API, this data transfer was not able to occur concurrently with kernel execution. The development of GPU computing APIs that are not dependent on graphical paradigms allows greater flexibility in the implementation of the CMAC stage on the GPU.

Synthesis imaging consists of more than the correlation, and there has been some research relevant to the post-correlation processing. In particular, the gridding of data in preparation for a two dimensional reverse Fourier transform has been shown to be possible for a variety of gridding methods. Schiwietz, Chang, Speier, and Westermann demonstrate the gridding of data for magnetic resonance image (MRI) reconstruction [98]. They show a two order of magnitude performance increase using the GPU. Schomberg and Timmer also demonstrate the parallel gridding of data for X-ray computed tomography (CT) [93]. As well as the one dimensional FFTs required for correlation, two dimensional inverse FFTs for deconvolution are available in the CUDA FFT library [71]. Wayth and Dale have implemented the post-correlation realtime system for the Murchison Widefield Array (MWA) prototype on the GPU [103].

The current state of GPU computing research, as it pertains to radio signal processing, has been presented. This has revealed the GPU to be a high performance computing architecture with promise for power efficient processing. The related research reviewed in this chapter indicates that radio signal correlation is a potential application area for GPU computing. This work thus continues with a detailed discussion of a GPU FX correlation model, presented in the next chapter.

Chapter 4

Model

My model for a heterogeneous parallel FX correlator is now presented. This model uses the Murchison Widefield Array (MWA) prototype [26] as a basis. The MWA prototype uses the serial FX approach, shown previously in Figure 2.7. To develop the GPU model, the parallel pattern methodology presented previously in Section 2.4 was applied to the serial design. The model uses these patterns to utilise both the heterogeneous nature of the GPU-enabled host, as well as the data parallelism of the GPU device.

The GPU computing architecture is inherently heterogeneous, in that both the host processors (CPUs) and the device processors (GPUs) are available for processing. For this reason task parallelism is utilised by the model. Shown in Figure 4.1 is the heterogeneous parallel model, in which tasks are split between the host and device. In operation, the host first acquires radio signal data for processing. The host next transfers this data to the device, and triggers the processing of the three correlation stages on the device. While the device is processing these stages, the host is free to work on other tasks, such as acquiring the next batch of data. Results are retrieved from the device by the host. Upon completion of all data processing, the host frees allocated memory on both the host and device.

For the GPU to process tasks, data must first be replicated in the device memory space. This requires the allocation of sufficient device memory to store the data. The data must then be transferred onto the GPU device from the host memory prior to GPU processing. Once a batch of processing is complete, the results are transferred from the GPU device memory to the host memory. These memory transfers occur via the PCI-express bus. Figure 4.1 shows these additional steps in the GPU FX correlation algorithm. It is noted that between these stages the intermediary results are stored in the GPU device memory, avoiding unnecessary data transfer.

Figure 4.1 shows the three correlation stages implemented with GPU kernels on the device: the unpack stage, the Fourier transform stage, and the CMAC stage.

These kernels parallelise the serial correlator stages using embarrassingly parallel and geometric decomposition patterns of parallelism. The application of the patterns to each of the stages is next detailed. For reference, the kernel codes can be found in Appendix A.

The unpack kernel of the FX correlator is an ideal candidate for processing on the GPU, because it has an embarrassingly parallel pattern. The characteristic of this pattern is that it consists of many identical and independent tasks that can be computed in parallel. Each unpack task reads in a single data value and outputs a single corresponding unpacked value. The input value is an 8 bit integer datatype that is unpacked into a 32 bit floating point datatype for subsequent processing. For the t-th packed value in the n-th stream, r_n[t], the unpacked value, x_n[t], is calculated using the equation

x_n[t] = u_S r_n[t] + u_B \qquad (4.1)

The input value r_n[t] is first converted to a 32 bit floating point value and multiplied by a scaling factor u_S. Where the units of the signal are unimportant, or if such considerations are accounted for post-correlation, this scaling factor may be omitted. If a bias is present in the signal due to the packing scheme of the sampling hardware, it is removed with u_B. The computation required to calculate Equation 4.1 has a complexity of O(N). Thus, the amount of processing scales linearly with the number of streams N. This is the least computationally intensive kernel.
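A simplified sketch of such an unpack kernel is given below (the production kernel is listed in Appendix A; the names, the layout of the packed buffer and the grid-stride loop here are illustrative assumptions). Each thread converts packed 8 bit samples to 32 bit floats, applying the scale u_S and bias u_B of Equation 4.1.

    // Unpack 8 bit samples into 32 bit floats: x[i] = uS * r[i] + uB.
    // A grid-stride loop lets a fixed thread topology cover any buffer length.
    __global__ void unpack_kernel(const signed char *r, float *x,
                                  float uS, float uB, int total)
    {
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < total; i += stride)
            x[i] = uS * (float)r[i] + uB;
    }

    // Example launch matching the topology described below: 3 blocks of 128
    // threads per multiprocessor, for a 12 multiprocessor GPU.
    // unpack_kernel<<<12 * 3, 128>>>(d_packed, d_unpacked, uS, uB, total);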

As well as the kernel to be executed by each thread on the GPU, the thread topology must also be defined. Topologies in the CUDA API were discussed in Section 2.5.2. They define the distribution of threads across the GPU multiprocessors. For the unpack kernel, there are two considerations affecting the choice of topology.

The primary consideration is that enough threads must be used on the GPU device to ensure it is at full thread capacity. There should not be significantly more threads than required to satisfy this condition, to avoid overheads related to thread instantiation. The topology used for the unpack stage was three blocks, with 128 threads per block, for each of the twelve multiprocessors on the GPU. Once the unpack kernel processing is complete, it is then followed by the Fourier transform stage.

The Fourier transform kernel is a more challenging kernel to implement using the GPU. This is because of the highly interleaved pattern of memory accesses that occur during the transform. A single interleaved access is referred to as a butterfly.

However, as the Fourier transform is fundamental to the fields of signal and image processing, a significant amount of research into its optimal implementation on the GPU has occurred as discussed in Section 3.2. Resulting from that work, the CUDA language has a high performance FFT library called CUFFT.

The CUFFT library provides an FFTW style approach [31], in which a planning stage is executed prior to any actual transforms. This planning stage runs a series of test FFTs on the GPU device, with a variety of different radices to determine the most optimal for the particular hardware configuration. As this planning stage occurs once during initialisation, there is no additional overhead once the algorithm begins to process data.

Consider the a-th transform of length L in the n-th telescope stream. CUFFT takes the unpacked data, x_{a,n}[t], in the time domain, t, as input and outputs the spectra, S_{a,n}[ν], in the frequency domain, ν, such that

S_{a,n}[\nu] = \sum_{t=0}^{L-1} x_{a,n}[t] \, e^{-i 2\pi \nu t / L} \qquad (4.2)

CUFFT implements this efficiently using a parallel FFT algorithm. It processes the entire buffer in a single library call through the use of batching. The larger the number of transforms in a batch, the better the performance of the library. Thus, larger buffers are preferred for this stage.
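A minimal sketch of the batched CUFFT interface of that era is shown below (plan creation at initialisation, repeated execution per buffer). The complex-to-complex transform, variable names and omission of error handling are illustrative assumptions, not the configuration used in this work.

    #include <cufft.h>

    // Created once at initialisation: one plan covering a whole buffer of
    // 'batch' transforms, each of length L.
    cufftHandle make_batched_plan(int L, int batch)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, L, CUFFT_C2C, batch);   // complex-to-complex, batched
        return plan;
    }

    // Executed for every buffer: transform all spectra in a single call.
    void transform_buffer(cufftHandle plan, cufftComplex *d_unpacked,
                          cufftComplex *d_spectra)
    {
        cufftExecC2C(plan, d_unpacked, d_spectra, CUFFT_FORWARD);
    }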

For N data streams and a transform of length L, the computational complexity of the Fourier transform stage is O(N L log_2(L)). Extremely long transform lengths are not typically used in FX correlation. For a given L, this stage also scales linearly with N. The number of floating point operations used in this stage depends on the particular radices selected in planning. Research in the field uses a standard 5L log_2(L) for the transform [31]. Thus, there are 5 log_2(L) floating point operations per stream element. The GPU FX correlator algorithm then continues with the CMAC stage.

The CMAC kernel is a reduction kernel that takes the FFT output spectra S as input. For each m-n stream pair, it conjugate multiplies and accumulates a total of A spectra pairs to produce the complex visibilities, C, using the equation

C_{m,n}[\nu] = \sum_{a=0}^{A-1} S_{a,m}[\nu] \, S^{*}_{a,n}[\nu] \qquad (4.3)

Should an accumulation span the spectra buffer, the complex visibility buffer is used to store intermediary results. Mathematically, this is expressed as

C'_{m,n}[\nu] = C_{m,n}[\nu] + \sum_{j=p}^{q} S_{j,m}[\nu] \, S^{*}_{j,n}[\nu] \qquad (4.4)

for a range delimited by the memory indices p and q in the spectra buffer. C denotes the previous accumulation subtotal, and C' denotes the new accumulation result. The complex visibility buffer must be reset between accumulations.
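To make the operation concrete, a simplified CMAC kernel is sketched below in which one thread produces the accumulated visibility for a single stream pair and frequency channel. The actual kernels, including the alternative thread decompositions examined in Section 4.1, are listed in Appendix A; the spectra layout assumed here (stream-major, one complex value per frequency channel) and the per-pair launch are illustrative assumptions.

    #include <cuComplex.h>

    // One thread per frequency channel of the m-n baseline: accumulate A
    // conjugate products (Equation 4.3). Spectra layout assumed:
    // S[((a * N) + n) * F + f] for spectrum a, stream n, frequency channel f.
    __global__ void cmac_kernel(const cuComplex *S, cuComplex *C,
                                int N, int F, int A, int m, int n)
    {
        int f = blockIdx.x * blockDim.x + threadIdx.x;   // frequency channel
        if (f >= F)
            return;

        cuComplex acc = make_cuComplex(0.0f, 0.0f);
        for (int a = 0; a < A; ++a) {
            cuComplex sm = S[((a * N) + m) * F + f];
            cuComplex sn = S[((a * N) + n) * F + f];
            acc = cuCaddf(acc, cuCmulf(sm, cuConjf(sn)));  // S_m * conj(S_n)
        }

        // Index of the m-n baseline in the visibility buffer (one entry per pair).
        int baseline = m * N + n;
        C[baseline * F + f] = cuCaddf(C[baseline * F + f], acc);
    }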

The computational complexity of the CMAC stage is O(N^2). Specifically, 3N + 3 floating point operations must be performed per stream operation. This complexity scales quadratically with the number of data streams, while the complexity of the previous two stages scales linearly. Therefore, the CMAC stage becomes a bottleneck as the number of streams increases, and is the most important consideration for optimisation. As such, optimisation of the CMAC stage is explored in greater detail in the next section.

[Figure 4.1 flowchart: Initialise; read digital samples; transfer data to GPU; Kernel 1 (unpack); Kernel 2 (fast Fourier transform); Kernel 3 (CMAC); once the accumulation is complete, retrieve results from the GPU and write the accumulated complex visibilities; repeat while there is more data to process; finalise.]

Figure 4.1: GPU FX correlator pipeline. Shown is a simplified diagram of the GPU correlator algorithm flow. Each pass of the algorithm processes significantly more data than the serial version in order for optimal parallelisation. Two data transfers outlined in bold have been added, in which data is transferred to the device and results are retrieved from the device. Intermediate results remain in the device global memory between the kernels, and are not transferred to the host machine. Operations processed by the host are coloured yellow, while the kernels that execute on the device are coloured green.

4.1 CMAC Stage Optimisation

The advent of GPU computing and the associated high level languages has exposed the full potential of the graphics hardware, and removed the burden of a graphics rendering application programming interface (API). However, implementing optimal algorithms on the GPU remains non-trivial. Algorithms that are both optimal and generalised are difficult to program on the GPU. The cause is the thread and memory architecture of the GPU.

In the pursuit of extreme parallelism, a large number of simultaneously executing threads has been made possible by restricting the resources available to each thread. Consideration must be given to ensure that the individual threads remain light enough to run on the GPU in terms of the required hardware resources; in particular the memory requirements and access patterns. The thread topology must also be managed to ensure enough threads are running on the hardware to utilise its full capabilities. Controlling these factors for a number of algorithm variables is challenging.

Since the CMAC stage scales quadratically with the number of telescope data streams, it is the most computationally expensive stage of the FX correlator algorithm for large telescope arrays. Because of this, optimal performance of this stage in the GPU implementation is crucial to the overall performance of the algorithm. The resources required by the CMAC stage vary with the number of telescope streams, N, and the length of the FFT spectra, L. Several different approaches have been explored to investigate the most optimal implementation of the CMAC stage, which are next discussed.

This work has investigated several different levels of parallelism in order to determine the best solution for a given set of correlation parameters. The different methods are presented in increasing levels of parallelism: a single thread CPU approach, a frequency parallel approach (1xNxN), a stream parallel approach (1x1xN), a group parallel approach (1xGxG), and finally a pair parallel approach (1x1x1). The abbreviated form I use for these methods refers to the number of pair results calculated by a single thread, or by G threads in the case of 1xGxG. The three values refer to the number of frequency channels, first streams, and second streams respectively. N refers to the number of telescope streams, and G refers to the number in a subset of those streams. Figures 4.2(a), 4.2(b), 4.2(c) and 4.2(d) show the differences between the CUDA approaches. Each individual block in these figures represents the result for the fth frequency channel for one pair of streams m and n. The results calculated by each thread, or by G threads for 1xGxG, are spaced apart in the figure from those calculated by other threads to illustrate the parallelism of these approaches.

The different approaches used the block and grid dimensions of the CUDA topology that maximised their performance. For the majority of the approaches, a block consists of 64 threads, corresponding to 2 warps of 32 threads. As detailed previously in Section 2.5.2, a warp is a group of 32 threads that are processed on the GPU in a SIMD paradigm. Each consecutive thread in the block reads adjacent frequency channels to ensure coalesced global memory access to the complex transform data. The exception is the 1xGxG approach, which utilises a two-dimensional block topology of 32x4, consisting of four warps of 32 threads. The four warps sample the same 32 frequencies for G = 4 adjacent streams in a staggered coalescent manner. That is, each individual warp is coalesced; however, consecutive warps access non-consecutive memory. This allows the kernel to acquire data for four different streams while still maintaining coalesced memory access. For all approaches, the first dimension of a grid ensures enough threads for all frequency channels. The second dimension then allows for sufficient threads for the parallelism of the approach.

The serial approach uses a single thread to compute all of the FN(N + 1)/2 cross spectra frequency values, where N is the number of data streams and F is the number of frequency channels in a complex visibility. The single thread calculates the results for every frequency of every non-redundant stream pair serially, and this method serves as the base case for comparison purposes. In my implementation, the CPU thread was processed by one core of a dual core CPU, while the other core was used by the underlying operating system.

In the frequency parallel approach (1xNxN), each of the F threads computes N(N + 1)/2 cross spectra frequency values. In Figure 4.2(a), the result of each ith-jth pair of telescope streams for each fth frequency is represented by a single cube. The cubes are separated into slices corresponding to the results of all pairings of streams for a single frequency. In the 1xNxN approach, a single thread calculates the results for one frequency of all pairings of the N streams, which corresponds to one slice in the figure. Note that results for redundant pairs are not calculated.

In the stream parallel approach (1x1xN), each of the NF threads computes N − n cross spectra frequency values. In Figure 4.2(b), the result cubes are separated into columns corresponding to the results of all pairings for a single stream at a single frequency. In the 1x1xN approach, a single thread calculates the results for one frequency of one stream's N pairs, which corresponds to one column in the figure. Results for redundant pairs are not calculated in this approach either.

In the group parallel approach (1xGxG), the N streams are split into K groups of size G, and K²GF/2 threads each compute G cross spectra output frequencies. In Figure 4.2(c), the result cubes are separated into square groups. In the 1xGxG approach, G threads calculate the results for one frequency of the pairings between two groups of G streams, which corresponds to a single group in the figure. Groups composed entirely of redundant pairs are not processed, and groups composed partially of redundant pairs discard those results. The extra groups are included for efficient indexing, and the extra pairs within groups are an unavoidable result of the SIMD (Single Instruction Multiple Data) nature of blocks in CUDA [32]; they become a negligible overhead for sufficiently large K. The group size in the diagram, G = 4, matches the size used in my testing.

In the pair parallel approach (1x1x1), N²F threads compute one cross spectra output frequency each. This is the method with the largest degree of parallelism investigated. In Figure 4.2(d), each result cube is separated from every other result cube. In the 1x1x1 approach, each thread calculates the result for one frequency of one stream paired with one other stream. Threads for redundant pairs are launched but perform no processing. These extra threads are included for efficient indexing.
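To make this mapping concrete, the following is a minimal sketch of a 1x1x1-style CMAC kernel. It is illustrative only: the stream-major spectra layout, the buffer names, and the launch configuration mentioned afterwards are assumptions, not the exact kernel used in this work.

// Minimal sketch of a pair parallel (1x1x1) CMAC kernel.
// Assumed layout (illustrative only): spectra[(a*N + m)*F + f] holds
// frequency channel f of spectrum a for stream m, and
// vis[(m*N + n)*F + f] accumulates the visibility of the pair (m, n).
// The visibility buffer is assumed to be reset before each accumulation,
// matching Equation 4.4.
__global__ void cmac_1x1x1(const float2* spectra, float2* vis,
                           int N, int F, int A)
{
    int f    = blockIdx.x * blockDim.x + threadIdx.x;  // frequency channel
    int pair = blockIdx.y;                             // linear (m, n) pair index
    int m    = pair / N;
    int n    = pair % N;
    if (f >= F || n < m) return;   // threads for redundant pairs are launched but idle

    float2 acc = vis[pair * F + f];                    // previous accumulation subtotal
    for (int a = 0; a < A; ++a) {
        float2 sm = spectra[(a * N + m) * F + f];
        float2 sn = spectra[(a * N + n) * F + f];
        // conjugate multiply S_m * conj(S_n), then accumulate
        acc.x += sm.x * sn.x + sm.y * sn.y;
        acc.y += sm.y * sn.x - sm.x * sn.y;
    }
    vis[pair * F + f] = acc;
}

A launch along the lines of cmac_1x1x1<<<dim3((F + 63) / 64, N * N), 64>>>(dev_spectra, dev_vis, N, F, A) would then provide one thread per frequency channel per stream pair, with 64-thread blocks reading adjacent channels so that global memory accesses coalesce.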

For all of these approaches, it was crucial to obtain global memory coalescence. However, for a length L transform, the real to complex CUFFT library routine produces spectra consisting of L/2 + 1 complex values. Since the value of L used for radio astronomy is typically a power of two, the output spectrum size is not. The extra complex value adds to the offset of each subsequent spectrum. Thus, memory access in the CMAC approaches is increasingly offset by one complex value. Since a complex floating point value is 8 bytes in size, and coalesced global memory access requires alignment to a minimum of 32 bytes, this offset prevents optimal memory coalescence in the CMAC approaches.

(a) Frequency parallel approach (1xNxN); (b) Stream parallel approach (1x1xN); (c) Group parallel approach (1xGxG); (d) Pair parallel approach (1x1x1)

Figure 4.2: Parallelism of the approaches. In each of these four diagrams, each block represents the result for a single frequency channel f of a single pair of streams m, n. The results are grouped together with other results calculated by the same thread. The three dimensions in the abbreviations for these approaches refer to the number of frequency channels, m streams and n streams computed by a single thread respectively. Non-shaded blocks indicate where redundant threads have been instantiated with no instructions in order to simplify indexing.

For this reason, the complex to complex transform CUFFT library routine was used. For a length L transform, this routine produces L complex values. When used on a real signal, the extra values replicate the L/2 + 1 output values. This data padding results in spectra that are aligned for coalesced global memory access. Extremely small values of L, such as L < 16, would require alternative approaches. However, such small values of L are not typically used in radio astronomy correlation.

For a complex to complex transform, the unpack algorithm must convert the real input data to complex values, by padding each floating point unpacked data value with an additional floating point value set to zero. These modifications increase the size of the required device memory for the unpacked signal data and transformed spectra. The use of host and device memory in the GPU FX correlator model is now discussed.
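As an illustration of this zero-padding step, a minimal unpack kernel could take the form below; the signed 8 bit sample encoding and the buffer names are assumptions for this sketch rather than the exact format used by the instrument.

// Sketch: unpack signed 8 bit samples into complex floats (real part = sample,
// imaginary part = 0), so that complex-to-complex CUFFT spectra stay aligned
// for coalesced global memory access. Names and encoding are illustrative.
__global__ void unpack_to_complex(const signed char* packed, float2* unpacked,
                                  int nsamples)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nsamples) return;
    unpacked[i] = make_float2((float)packed[i], 0.0f);  // zero-pad the imaginary part
}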

4.2 Memory Management

Correct management of the host and device memory is critical to achieving optimal performance. Kernel execution cannot commence until all the required data has completed the host to device memory transfer. In a similar manner, device to host transfers must complete before results can be accessed by the host. These two additional transfer stages are shown in Figure 4.1. The communication of data between host and device memory occurs via the PCI-express bus.

The PCI-express 1.1 bus currently supports a maximum transfer rate of 4 gigabytes per second in each direction. This is sufficient for my model. As shown in Figure 4.1, data is transferred to the GPU once, and results are retrieved from the GPU once. There are no additional transfers required between the three computational stages; such transfers would have a significant detrimental effect on the performance of the algorithm. Page-locked memory spaces, introduced in Section 2.5.2, were not used, based on the results of preliminary testing. Future versions of CUDA will provide multiple independent processing pipelines that allow data transfer to occur simultaneously with kernel execution, effectively hiding the necessary communication between the host and device. The resulting effect on correlator performance is left for future research to explore.

The size of the memory buffers can potentially limit the parallelism of the kernels. Because memory buffers cannot be refreshed during kernel execution, it is important that large data buffers are used. Larger amounts of data available to a kernel increase the scope for parallelism. The memory buffer size is set during host and device memory allocation. The buffer size is also relevant to data transfer, as the GPU is more efficient when transferring larger amounts of data [40]. Thus, entire buffers of data should be copied between device and host in a single transfer, rather than in a series of smaller transfers.

As the incoming data streams are of arbitrary length, the GPU operates on a buffer that is a portion of the entire data stream. The size of the buffer for these portions is therefore dependent on the capabilities of the GPU hardware. However, the size of an accumulation is dependent on the desired science outcomes of an observation and the specifications of the particular telescope array. Arising from this are two algorithmic features not present in a simple CPU implementation: the accumulation is not necessarily aligned with the buffer boundaries, and the accumulation may span buffer boundaries. Consequently, the CMAC kernel uses the result buffer to hold intermediary accumulation values while the spectra buffer is refreshed. This enables accumulations that span consecutive spectra buffers of data.

The relationship between the GPU global memory and the GPU is similar in some respects to that between the RAM and the CPU. It serves as a large data staging area for the lower latency shared and register memory on the GPU itself, as RAM does for the CPU cache. The interface between global memory and an individual GPU multiprocessor is a parallel memory interface, which is accelerated only for specific coalesced memory access patterns [72]. Hence the ordering of data chosen in an algorithm has significant performance effects during GPU memory operations.

For the GPU FX correlator algorithm, it is beneficial for the input data streams to be grouped by stream rather than by time. That is, the sequential samplings of any given data stream are contiguous in memory. Should the data instead have the values corresponding to a given time from all streams contiguous in memory, corner turning must be applied to shuffle the data into the correct ordering. The model presented here assumed that corner turning is not required, and the implementation of this operation is left to future research.

Shown in Figure 4.3 is the data flow of my GPU correlator. During computation, a series of memory buffers in device memory is used to store the packed signals, R; unpacked signals, X; spectra, S; and complex visibilities, C. These buffers are accessed by the GPU via the GPU memory bus in each stage of the algorithm.

In order to transfer data to and from the device, the initial and final buffers are allocated in both the host and device memory. Data is transferred between these buffers explicitly during program execution, as detailed previously in Figure 4.1.
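A hedged sketch of this buffer arrangement is given below; the sizes, names, and the assumption of one bulk copy per input buffer are illustrative rather than a description of the exact implementation.

#include <cuda_runtime.h>

// Sketch: allocate the four device buffers of Figure 4.3 once, and stage an
// entire packed input buffer to the device in a single transfer. Sizes and
// names are placeholders.
void allocate_device_buffers(size_t packedBytes, size_t nSamples, size_t nVis,
                             signed char** dR, float2** dX, float2** dS, float2** dC)
{
    cudaMalloc((void**)dR, packedBytes);                // packed signals, R
    cudaMalloc((void**)dX, nSamples * sizeof(float2));  // unpacked signals, X
    cudaMalloc((void**)dS, nSamples * sizeof(float2));  // spectra, S
    cudaMalloc((void**)dC, nVis * sizeof(float2));      // complex visibilities, C
}

void stage_packed_input(signed char* dR, const signed char* hR, size_t packedBytes)
{
    // One large host-to-device copy per input buffer, rather than many small
    // ones, since transfer bandwidth improves with packet size.
    cudaMemcpy(dR, hR, packedBytes, cudaMemcpyHostToDevice);
}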

Figure 4.3: GPU FX correlator data flow. During computation, a series of memory buffers in GPU global memory is used to store the packed signals, R; unpacked signals, X; spectra, S; and complex visibilities, C. These variables were introduced in Section 2.2. The buffers are accessed by the GPU, through its shared memory and registers, via the GPU memory bus in each stage of the algorithm (unpack, transform, accumulate). In order to transfer data to and from the device, the first (R) and last (C) buffers also exist in the host's RAM, alongside the digitised array signal data and the complex visibility results. Data is transferred between these host and device memory buffers explicitly during program execution. This transfer occurs via the memory bus, chipset, and PCI-express bus, as shown in Figure 2.22.

4.3 Polyphase Filter

The GPU polyphase filter builds on the approach taken for the unpacking kernel of the vanilla GPU correlator. The stage launches sufficient threads to make use of the GPU compute resources. Each thread processes a portion of the data in the packed buffer, R, and outputs to the unpacked buffer, X. The threads access the input data in a coalesced manner, as discussed for the unpack stage in previous sections.

The main algorithmic addition is a circular buffer that is kept in shared memory.

This buffer stores multiple consecutive reads, up to the number of taps required for the calculation, for each thread in a block. The minimum size of this buffer is the number of threads in a warp multiplied by the number of taps. The maximum size of this buffer is the shared memory available on each multiprocessor. Ideally, the buffer should be small enough that multiple blocks may be run on the same multiprocessor.

In operation, the polyphase kernel first fills the buffer with input data. Each thread then unpacks each tap value in the buffer, multiplies it by a preprocessed filter function also located in shared memory, sums the resulting values, and outputs to the unpacked buffer. The next tap is then read into the circular buffer, overwriting the first tap, and the process continues until all the data has been processed. The input buffer is enlarged to include enough data from the subsequent input buffer for the final results to be computed. This requires copying a negligibly small additional amount of data in each data copy from the CPU to the GPU.

The unpacking operations in this scheme occur multiple times on the same data element, once for each tap. This approach has been taken because it allows the circular buffer to contain packed data, and thus to have a much smaller size. This allows the GPU multiprocessor to process multiple blocks concurrently, improving performance. The additional unpacking operations should be hidden by memory latency, with no loss in performance.
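A simplified sketch of such a polyphase front end is shown below. It omits the shared-memory circular buffer and filter staging described above, unpacking the packed samples directly for every tap; the buffer names, the signed 8 bit packing, and the windowing convention are assumptions for illustration.

// Simplified polyphase front-end sketch for an FFT of length L with T taps.
// Each output element i is a weighted sum of T packed samples spaced L apart:
//   X[i] = sum over t of filt[(i mod L) + t*L] * R[i + t*L]
// The packed buffer is assumed to extend (T - 1) * L samples beyond nout,
// as described in the text. Names and layout are illustrative only.
__global__ void polyphase_unpack(const signed char* packed, const float* filt,
                                 float2* unpacked, int nout, int L, int T)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nout) return;

    int phase = i % L;                 // position within the transform window
    float acc = 0.0f;
    for (int t = 0; t < T; ++t) {
        // unpack the sample for this tap and apply the filter weight
        acc += filt[phase + t * L] * (float)packed[i + t * L];
    }
    unpacked[i] = make_float2(acc, 0.0f);   // zero-pad for the complex FFT input
}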

The unpack stage uses a kernel to launch enough threads to make use of all the GPU's available compute resources. These threads process the data in the packed buffer, R, and output to the unpacked buffer, X. For optimal processing speeds, the global memory in which these buffers reside must be accessed in a coalesced manner. That is, sequential threads within the same single instruction multiple data (SIMD) warp must access corresponding sequential memory addresses [72]. For interleaved array data, wherein the samples from all signals for a given time are adjacent in memory, the data must be shuffled to a non-interleaved form for the FFT, wherein consecutive timesamples for a given signal are adjacent. In order to both read and write global memory in a coalesced manner, the data must be shuffled in shared memory. This work has assumed non-interleaved data, and has not investigated the interleaved case.

Chapter 5

Testing

A GPU FX correlator was successfully implemented and tested, using the heterogeneous parallel model presented in Chapter 4. The testing was used to investigate the GPU FX correlator implementation, using a single core CPU implementation for comparison. The purpose of this comparison is to determine the suitability of heterogeneous parallel architectures for radio astronomy signal correlation. This suitability consists of a number of criteria. Most importantly, the GPU correlator implementation must produce correct results with a sufficient performance increase over the serial implementation to warrant the additional parallel programming overhead. As power consumption has become a significant factor in computing, the power usage of the correlator implementations is investigated. Finally, the adaptability of heterogeneous parallel architectures is also considered. For this, a polyphase filter was added to the GPU FX correlator implementation. The ability to add new algorithmic features to a correlator widens the scope of its potential scientific applications. I address all of these criteria with the test results presented in this chapter, which are summarised in Table 5.1.


The test data consisted of four digital signal streams recorded from prototype antenna tiles in the Mileura Widefield Array (MWA) low frequency demonstrator [26].

A short sample of the data used for testing is shown in Figure 5.1. The signals were sampled at a rate of 16 MHz, which corresponds to one sample every 62.5 nanoseconds. Each data sample had a precision of 8 bits. Four streams of data, totalling one gigabyte, were collected for testing. For tests that required more than four signal streams, the original four signals were replicated. The performance of the algorithm is not dependent on the values of the input data.

I investigated ranges of the two most significant correlation parameters: the length of the frequency transform, L, and the number of data streams, N. These parameters have a significant impact on the thread resources, the thread load on the GPU, and the overall memory usage, all of which could affect the operating speed. For the transform length parameter L, testing values varied by powers of two from L = 128 to L = 2048, since these are considered to be lengths for which a GPU correlator is most likely to be used. For the number of data streams N, the values tested varied in powers of two from N = 1 up to N = 128. A lower limit of N = 4 was chosen for some of the optimisation techniques, which become somewhat trivialised for N = 1, 2. The upper limit was chosen because multiple-GPU approaches must be considered past this degree of processing; such approaches would split the streams between cards using the smaller stream counts presented here.

The test system hardware consisted of a Tyan Thunder K8WE (S2895) motherboard with a Dual Core AMD Opteron Processor 265 CPU. The power consumption of this CPU is rated at 95 W. As this work has not addressed multicore CPU approaches, only a single core was utilised in the testing. The GPU used was the NVIDIA GeForce 8800 GTS with 320 MB of memory. This GPU has 96 streaming processors (SPs) with a clock speed of 1200 MHz. With each SP capable of three floating point operations per clock, the maximum theoretical performance of the 8800 GTS is 3 × 96 × 1200 × 10⁶ = 345.6 GFLOPS. The maximum bandwidth to the onboard memory of this GPU is 64 GB/s. The power consumption of the 8800 GTS is rated at 135 W. For RAM, the system had 2 GB of DDR400 memory. A Seagate Barracuda ST3250620AS 250 GB SATA2 hard disk drive was used for storage. This system utilised a PCI-Express bus architecture for communication between the host and GPU device.

The test system ran the Ubuntu Feisty Linux v7.04 operating system, using version 2.6.20-16 of the Linux kernel. The GPU was accessed via the NVIDIA Linux Display Driver x86 version 100.14.11. The libraries required for testing included: libc 6, libgcc 4.1.2, libcuda 1, libcufft 1, and libfftw 3. The FFTW library was used for Fourier transform processing by the CPU correlator. The CUDA and CUFFT libraries supported CUDA compute capability 1.0. All timing tests utilised the ftime routine of the sys/timeb.h system header. Timing tests were run ten times and averaged to produce results, with outliers due to the operating system removed. Aside from these rare outliers, the obtained timing results were identical due to the 100 ms granularity of the system timer. Multiple iterations of tests were used to increase the test runtimes to at least 10 seconds each, to ensure several significant figures of accuracy.
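For reference, the timing pattern described here can be reproduced along the following lines; the use of cudaThreadSynchronize to fence outstanding device work before the timer is read is an assumption about the harness, not a quotation of the test code.

#include <sys/timeb.h>
#include <cuda_runtime.h>

// Sketch: wall-clock timing of many iterations of a stage with ftime().
double time_stage_ms(int iterations)
{
    struct timeb t0, t1;
    ftime(&t0);
    for (int i = 0; i < iterations; ++i) {
        // launch the kernel(s) under test here
    }
    cudaThreadSynchronize();   // wait for all outstanding device work (CUDA 1.x API)
    ftime(&t1);
    return (t1.time - t0.time) * 1000.0 + (t1.millitm - t0.millitm);
}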

Section            Test                                                  Figure
Preliminary        PCI-express data transfer rates                       5.3
                   GPU fast Fourier transform                            5.5
CMAC Stage         CMAC stage results for a varying number of signals    5.7
                   CMAC stage results for different transform lengths    5.8
GPU Correlator     Test output                                           5.10
                   Overview of stream bandwidth                          5.11
                   The variation of stream bandwidth with N              5.12
                   The variation of stream bandwidth with L              5.13
                   Total data throughput                                 5.14
                   Correlator FLOPS                                      5.15, 5.16
Polyphase Filter   Polyphase filter performance                          5.18

Table 5.1: Testing summary. This table summarises the testing results presented in this chapter. The preliminary testing provided insight into the computational ability of the test system. The CMAC stage testing investigated how correlation parameters affected the different potential approaches for the CMAC kernel. Testing then examined the overall GPU correlator to determine the performance of the implementation. Finally, the addition of the polyphase filter and its associated performance was tested to explore the adaptability of the GPU implementation.

Figure 5.1: Test data. Shown is a short sample of the data used for testing (sample bit value against timesample index). This data was sampled from signals observed by prototype antenna tiles in the Mileura Widefield Array (MWA) low frequency demonstrator [26]. The signals were sampled at a rate of 16 MHz, which corresponds to one sample every 62.5 nanoseconds.

5.1 Preliminary Testing

This section presents the results of preliminary tests performed prior to implementing the full correlator algorithm. These results were required to obtain insight into the computational ability of the test system. First examined are the available computational resources of the GPU device. Following this, performance details regarding the host-device bandwidth and the CUDA fast Fourier transform library, CUFFT, are examined.

Device resources are an important consideration in obtaining optimum GPU compute performance. Each resource examined is finite, and thus approaches that exceed a given resource will not execute. Furthermore, some resources are shared between threads. Thus the amount of a resource that each thread requires determines the maximum number of threads that may execute concurrently. This directly affects the performance of the system if there are too few threads for computation.

The CUDA SDK was used to determine the computational capabilities of the GPU device in the test system. It contained 12 multiprocessors, and had a compute capability of 1.0. The thread topology could have blocks with maximum dimensions of 512 by 512 by 64, with a maximum of 512 threads per block. The topology supported up to two-dimensional grids with dimensions not exceeding 65,535 by 65,535.

The CUDA SDK was used to probe each type of memory on the GPU device. Each multiprocessor on the GPU device contained 4096 32-bit registers and 16,384 bytes of shared memory. The GPU device contained 288,210,994 bytes of global memory and 65,536 bytes of constant memory. However, the amount of allocatable global memory could vary depending on the memory required for rendering the desktop user interface. As this value can change arbitrarily, no formal testing of it was performed. All testing ensured a sufficient buffer of memory for the user interface was maintained, and there was minimal user interface usage during testing. This phenomenon could be avoided with the use of a dedicated compute GPU in addition to that used for graphical rendering.

I next examined the rate of data transfer across the PCI-express bus between the host and device memories, which occurs in the data transfer stage of the GPU model shown in Figure 5.2. Since the signal data is one-dimensional, concerns common to two-dimensional data transfers, such as byte alignment and padding, are not considered. Measurements were taken for the page-locked mode [72] as well as for normal transfers. I considered one CUDA API call to instantiate a host to device transfer to correspond to one transfer. The bandwidth for the two modes was measured for a variety of packet sizes, and the results are shown in Figure 5.3.

The performance of the CUFFT library was tested for comparison with the leading single core CPU fast Fourier transform software, FFTW [31]. These tests were performed because the FFT is required for the second stage of the FX correlation algorithm, as shown in Figure 5.4. I investigated two modes of GPU operation as well as the serial FFTW implementation on the CPU. In the first GPU mode, the library directly transformed values resident in the GPU's global memory. In the second, values resident in host memory were transferred to the GPU, transformed by the library, and then transferred back to host memory. The GPU FX correlator model uses the FFT to process data already resident on the GPU device; however, the latter mode was included in this testing to demonstrate the costs incurred by transferring unpacked floating point data to the device, and non-accumulated results back to the host. These modes were tested for transforms of length L = 128 to L = 2²², and the performance results are shown in Figure 5.5.
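The first mode corresponds to the batched, in-place use of CUFFT sketched below; the plan-per-call structure and names are illustrative, since in practice a plan would be created once during initialisation and reused.

#include <cufft.h>

// Sketch: batched complex-to-complex forward FFT on data already resident in
// device global memory, as in the "GPU CUFFT" mode. Error checking omitted.
void fft_stage(cufftComplex* devData, int L, int batch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, L, CUFFT_C2C, batch);              // one plan for many transforms
    cufftExecC2C(plan, devData, devData, CUFFT_FORWARD);  // in-place forward transform
    cufftDestroy(plan);
}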

Figure 5.2: Bandwidth testing. Shown is the GPU correlator algorithm flow of Figure 4.1, with the host to device data transfer stage highlighted. The bandwidth testing was performed to determine the maximum transfer rate achievable between the host and device. Due to the accumulation in the third computational stage of the algorithm, there is significantly less data to be transferred from the device to the host later in the algorithm. For this reason, only results of testing for the highlighted stage are presented.

Figure 5.3: PCI-express data transfer rates. This graph shows the achievable rate of data transfer across the PCI-express bus between the host and device memories (transfer bandwidth in bytes per second against the size of the transfer packets in bytes), for both normal and page-locked transfers. One CUDA API call to instantiate a host to device transfer is considered to correspond to one transfer. Measurements for the page-locked mode [72] are included as well as the normal transfers, to verify that page locking is not suitable for the data streaming used in the correlator algorithm.

Figure 5.4: Fast Fourier transform testing. Shown is the GPU correlator algorithm flow of Figure 4.1, with the FFT kernel highlighted. The FFT kernel testing was performed to determine the performance of the CUFFT 1 library on the GPU device, as well as the FFTW 3 library on the host system. Testing of the CUFFT library included tests both with and without data transfer. Testing without data transfer is representative of the performance of this kernel in the GPU implementation. The tests that included data transfer used unpacked floating point input and non-accumulated output, which is not representative of the GPU implementation; they are included to demonstrate the loss of performance incurred by communications that can be reduced by the other two GPU kernel stages.

Figure 5.5: GPU fast Fourier transform. This graph compares the performance of the CUFFT library (with and without host-device transfer) to the leading single core CPU software fast Fourier transform implementation, FFTW [31], measured in complex values transformed per second against transform length. Testing of the CUFFT library investigated two modes of operation. In the first, the library directly transformed values resident in the GPU's global memory. In the second, values resident in host memory were transferred to the GPU, transformed by the library, and then transferred back to host memory.

5.2 CMAC Stage Testing

I next examined how correlation parameters affected the different potential approaches for the CMAC kernel, highlighted in Figure 5.6. The correlation parameters tested included the length of the FFT, L, and the number of telescope signal streams, N. Of the approaches presented in the previous chapter, testing considered the serial, 1x1xN, 1xGxG and 1x1x1 approaches. The 1xNxN approach is unable to be implemented on current NVIDIA hardware for more than a limited number of input data streams, because the number of registers required by the kernel scales quadratically and quickly exceeds the number available on the GPU.

I first investigated how the transform length parameter affected computational speed. Testing values varied by powers of two from L = 128 to L = 2048, since these are considered to be lengths for which a GPU correlator is most likely to be used. Two sets of these results have been graphed. Figure 5.8(a) shows how the performance of the various approaches varies with transform length for N = 64 streams. To examine the effect of a low thread configuration, Figure 5.8(b) shows how the performance varies for N = 4 streams.

I next investigated how the number of streams in a correlation affected performance. The values tested varied in powers of two from N = 4 streams up to N = 128 streams. The lower limit was chosen as the techniques become somewhat trivialised for N = 1, 2, and the upper limit was chosen as multiple-GPU approaches must be considered past this degree of processing. Such approaches would split the streams between cards using the smaller stream counts presented here. Figure 5.7(a) shows the effect of varying the number of streams on processing performance for transform length L = 1024. To examine the effect of a low thread configuration, Figure 5.7(b) shows the performance for transform length L = 128.

Figure 5.6: CMAC stage testing. Shown is the GPU correlator algorithm flow of Figure 4.1, with the CMAC kernel highlighted. I will now examine a number of different potential implementations of this stage, in order to determine which performs the fastest for a given set of correlation parameters.

(a) High L = 1024, varying N; (b) Low L = 128, varying N

Figure 5.7: CMAC stage results for a varying number of signals. Shown are the rates achieved (bandwidth per signal against the number of data streams) for the CPU approach and for the 1x1xN, 1xGxG and 1x1x1 GPU approaches. The number of signals, N, varied from 4 to 128. Each of the approaches was tested on real to complex (real) and complex to complex (comp) transform data. The bandwidth is half the number of samples per stream per second that the correlator can compute in real time, assuming real input signals in accordance with Nyquist's theorem.

(a) High N = 64, varying L; (b) Low N = 4, varying L

Figure 5.8: CMAC stage results for different transform lengths. Shown are the rates achieved (bandwidth per signal against transform length) for the CPU approach and for the 1x1xN, 1xGxG and 1x1x1 GPU approaches. The transform length, L, varied from 128 to 2048. Each of the approaches was tested on real to complex (real) and complex to complex (comp) transform data. The bandwidth is half the number of samples per stream per second that the correlator can compute in real time, assuming real input signals in accordance with Nyquist's theorem.

5.3 GPU Correlator Results

Results of testing for the entire GPU correlator, shown in Figure 5.9, are now presented. I first examined correctness, to ensure the produced output was valid. Correctness tests ran the correlator implementations using the 1 gigabyte of MWA signal data as input. Shown in Figure 5.10 is an autocorrelation of one of the signals produced by the GPU FX correlator. This output matched the standard output supplied with the MWA tile data.

A direct comparison of the serial and parallel correlator output revealed slight differences. Forty values taken from the real channels of the first autocorrelation spectrum were compared; these values are listed in Table 5.2. An average relative error of 0.0000131, and a largest relative error of 0.0000477, was observed. These variations were assumed to be due to the FFT radices used by the two implementations being different. Both the FFTW and CUFFT libraries automatically select the optimal FFT radix for the hardware, and it is unlikely that the same set of radices will be optimal for both the CPU and the GPU.
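The tabulated relative errors are consistent with computing, for each channel value,

\[ \epsilon = \frac{\lvert C_{\mathrm{CPU}} - C_{\mathrm{GPU}} \rvert}{\lvert C_{\mathrm{CPU}} \rvert} \]

so that, for example, the first row of Table 5.2 gives (304975446016 − 304972431360)/304972431360 ≈ 0.000010.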

Shown in Figure 5.11 are the correlation parameter ranges that were varied for the testing: the length of the Fourier transform, L, and the number of streams, N. As discussed previously in this chapter, the transform length range was chosen to include those typically used in current radio correlators, 128 ≤ L ≤ 1024, incremented in powers of two. Figure 5.12 shows how the real time bandwidth per stream varies with N, and Figure 5.13 shows how it varies with L. The number of streams tested covered the range 1 ≤ N ≤ 128, starting from a single stream and incrementing to the maximum value of N = 128. This upper value was chosen because the total global memory of the card began to reduce the allowable transform length. Finally, the total throughput of the correlation is shown in Figure 5.14.

In addition to the correlator throughput, the number of floating point operations per second (FLOPS) performed by the implementation was calculated. This was achieved by multiplying the measured throughput by the number of floating point operations per stream sample, producing the results shown in Figure 5.15.

The number of floating point operations per stream sample for each stage was derived in the following manner. In the first stage, each sample must be unpacked. For the 8 bit samples used in the testing, this requires two floating point operations, a multiply and an add; thus the first stage contributes 2 FLOP per sample. For a polyphase filter with T taps, 2T − 1 additional operations are required per element [100]. The card actually performs 4T − 3 additional operations, since shared memory resources are scarce and sharing packed data allows higher thread occupancy, and thus better performance, despite the extra computation. However, the smaller value is used for the purpose of standard comparison.

In the second stage, a fast Fourier transform is applied to a series of L samples, where L is the size of the transform. This requires a number of floating point operations that depends on the transform length, specifically 5L log₂ L [31]. This count is only precise for radix-2 Cooley-Tukey algorithms, but it is the standard used for comparison between approaches. Thus there are 5 log₂ L FLOP per stream element in this stage.

In the final stage, the conjugate multiply and accumulate requires 6 floating point operations per channel per pair. For N streams there are N(N + 1)/2 pairs, and the FFT produces the same number of channels as input samples. Thus there are 3N + 3 FLOP per stream element in this stage. The total is therefore 2 + 5 log₂ L + 3N + 3 = 3N + 5 log₂ L + 5 floating point operations per sample, with a further 2T − 1 operations per element when the polyphase filter stage is included.
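As a worked example of this operation count, using parameter values from the testing (without the polyphase filter):

\[ 3N + 5\log_2 L + 5 \;=\; 3(64) + 5\log_2(1024) + 5 \;=\; 192 + 50 + 5 \;=\; 247 \]

floating point operations per stream sample for N = 64 streams and a transform length of L = 1024.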

The amount of power used by the entire computer under several different operating loads was measured, in order to determine the FLOP per watt efficiency of the implementations. Results were taken for both the CPU and GPU correlation, as well as when the machine was idle. This was repeated with and without the X11 graphical interface. It should be noted that the CPU measurements include an idle GPU in their power consumption, as the test machine motherboard would not support booting without a graphics card; the power rating of the CPU alone is 95 watts. These power results are shown in Table 5.3.

The performance results were divided by the number of watts required, to produce the graph shown in Figure 5.16. Also plotted are the maximum possible power efficiencies, which assume that the power supply, motherboard, and all other peripherals except for the GPU or CPU draw zero power. The power ratings supplied by the manufacturers were used in this calculation.

Figure 5.9: GPU correlator testing. Shown is the GPU correlator algorithm flow of Figure 4.1, with the entire algorithm highlighted, as the correctness and performance testing now presented are representative of the entire algorithm. Operations processed by the host are coloured yellow, while the kernels that execute on the device are coloured green.

Figure 5.10: Test output. Shown is an autocorrelation of one of the signals, produced by my GPU FX correlator (relative power per channel in dB against frequency channel). This output matched that of the CPU FX correlator I also implemented, although there was a slight variation in the results attributable to the different radices used by the FFT libraries.

CPU results             GPU results             Relative error
304972431360.000000     304975446016.000000     0.000010
37821464.000000         37822180.000000         0.000019
37308172.000000         37309220.000000         0.000028
37475216.000000         37475148.000000         0.000002
37237392.000000         37238140.000000         0.000020
37065256.000000         37065048.000000         0.000006
37616392.000000         37616380.000000         0.000000
37669348.000000         37670068.000000         0.000019
38732304.000000         38731776.000000         0.000014
37922300.000000         37923044.000000         0.000020
38540296.000000         38540232.000000         0.000002
39222000.000000         39222508.000000         0.000013
39706708.000000         39706900.000000         0.000005
40351944.000000         40352472.000000         0.000013
40454904.000000         40455124.000000         0.000005
40564384.000000         40565260.000000         0.000022
40726024.000000         40726428.000000         0.000010
41891344.000000         41891848.000000         0.000012
42424596.000000         42424804.000000         0.000005
43427080.000000         43427108.000000         0.000001
43503580.000000         43503688.000000         0.000002
44464684.000000         44464852.000000         0.000004
45469620.000000         45470324.000000         0.000015
46780332.000000         46780484.000000         0.000003
46963360.000000         46964528.000000         0.000025
48391492.000000         48392124.000000         0.000013
50482428.000000         50484836.000000         0.000048
51228304.000000         51229236.000000         0.000018
53108032.000000         53108508.000000         0.000009
54400584.000000         54398692.000000         0.000035
56066080.000000         56066444.000000         0.000006
58752968.000000         58751976.000000         0.000017
59759816.000000         59760360.000000         0.000009
61693108.000000         61692876.000000         0.000004
64569492.000000         64570884.000000         0.000022
66858544.000000         66859796.000000         0.000019
69568000.000000         69569272.000000         0.000018
73219928.000000         73221752.000000         0.000025
77314432.000000         77313896.000000         0.000007
79746600.000000         79746688.000000         0.000001

Table 5.2: Accuracy test data. Shown are the real values for the first forty frequency channels of the first autocorrelation for the first stream. No normalisation or calibration has been applied. The relative error is listed in the third column. A mean of 0.0000131 and a maximum of 0.0000477 for the relative error were obtained directly from the output values using floating point precision arithmetic.

Figure 5.11: Overview of stream bandwidth. Shown is an overview of the real time bandwidth per stream for the range of correlator parameters tested, the length of the Fourier transform and the number of data streams, for both the CPU and GPU implementations. Refer to the cross-sections shown in Figures 5.12 and 5.13 for a clearer comparison of results.

Figure 5.12: The variation of stream bandwidth with N. Shown is the real time bandwidth per stream as the number of data streams, N, varies, for the CPU and GPU implementations. Results are plotted for the minimum and maximum transform lengths tested, L = 128 and L = 1024. The bandwidth calculation assumes sampling at the Nyquist rate.

Figure 5.13: The variation of stream bandwidth with L. Shown is the real time bandwidth per stream as the FFT length, L, varies, for the CPU and GPU implementations. Results are plotted for N = 16. Although the magnitudes of the curves differ for other values of N, the trends are similar to those present in the CPU and GPU lines respectively.

Figure 5.14: Total data throughput. Shown is the total data throughput in samples per second for the CPU and GPU implementations. Results are shown as they vary with the number of streams for two transform lengths: L = 128 and L = 1024.

Figure 5.15: Correlator FLOPS. Shown is the rate of floating point operations per second (FLOPS) achieved by the CPU and GPU correlators, as it varies with the number of data streams for L = 128 and L = 1024. For N streams and a length L fast Fourier transform, each stream element requires 3N + 5 log₂ L + 5 floating point operations in the correlation pipeline.

Correlation   X11   Min Watts   Max Watts   Av Watts
None          Yes   178         182         180
CPU           Yes   190         194         192
GPU           Yes   185         198         191.5
None          No    183         186         184.5
CPU           No    195         198         196.5
GPU           No    186         198         192

Table 5.3: Observed power usage. Shown is the amount of power used by the entire computer under several different operating loads. Results were taken for both the CPU and GPU correlation, as well as when the machine was idle. This was repeated with and without the X11 graphical interface. It should be noted that the CPU power usage includes an idle GPU, as the test machine motherboard would not support booting without a graphics card. The power rating of the CPU is 95 watts, and the GPU is rated at 135 watts; these figures are used to produce the upper limits in Figure 5.16.

(a) L = 128 Fourier transforms; (b) L = 1024 Fourier transforms

Figure 5.16: Performance per watt. Shown is the total correlation throughput divided by the power consumption, for the CPU and GPU implementations. The throughput is measured in samples per second, and the power is measured in watts. The measured power usage for the system includes the power supply, motherboard, hard disk drive, optical drive, and peripherals, excluding the display. Also shown are ideal values calculated using the peak usage taken from the power ratings of the CPU and GPU respectively.

5.4 Polyphase Filter Testing

I next investigated the addition of a polyphase filter to the unpack stage of the GPU FX correlation model. The modified correlator algorithm is shown in Figure 5.17. This was achieved using kernel code to implement the polyphase filter, b[n], defined in Equation 2.26. For testing, an implementation corresponding to a subsequent FFT of length L = 128 was developed. A more generalised GPU polyphase filter is left for future research.

For this implementation, testing examined how the rate of execution varied with both the number of taps in the filter and the number of streams in the buffer. The number of taps varied across powers of two from T = 1 to T = 8. The number of streams varied across powers of two from N = 1 to N = 128. For the T = 1 case, the polyphase filter is equivalent in performance to the unpack stage it replaces; while it does contain an additional filter multiplication for each data value, this is hidden by the latency of the memory fetch for that value.

The polyphase filter kernel scales linearly with the number of streams, as opposed to the quadratic scaling of the CMAC stage. For this reason, it takes significantly less time than the CMAC stage, and overall performance tests would not reveal how the polyphase filter is affected by N and T. For this reason, the performance tests consider only the polyphase filter kernel corresponding to b[n], and not the complete correlation implementation. The rate at which the GPU polyphase filter processed the data for various numbers of streams is shown in Figure 5.18(a), and for various numbers of taps in Figure 5.18(b).

Figure 5.17: Polyphase filter testing. Shown is the GPU correlator algorithm flow, in which the highlighted polyphase filter kernel replaces the unpack kernel as Kernel 1. The resulting change in performance for a variety of correlation parameters is now examined.

(a) Performance by stream; (b) Performance by tap

Figure 5.18: Polyphase filter performance. Shown is the total stream throughput for the GPU polyphase filter, in samples per stream per second. In the first figure, performance is measured for a varying number of taps in the polyphase filter; the amount of computation in the filter kernel scales with the number of taps, but the performance of the filter is not fully affected by the additional computation, because the computation is hidden by global memory latency. The second graph shows the effect of varying the number of streams on the kernel; the performance per stream is inversely proportional to the number of streams, which corresponds to the workload required for a given number of streams.

Chapter 6

Discussion

The results presented in the previous chapter are now discussed. I will examine the performance gains, computational precision, memory bandwidth, power consumption, and ease of programming of the GPU implementation. This will demonstrate the suitability of the graphics processing unit to accelerate the signal correlation algorithms used in radio interferometry. Furthermore, I will illustrate that significant progress toward satisfying the processing requirements of the next generation of scientific computation can be achieved by parallel processing architectures.

This chapter will first discuss the preliminary investigation of the GPU. Subsequently, the effect of the correlation parameters on the choice of kernel for the CMAC stage is explored. This is followed by a discussion of the overall GPU correlator implementation. An analysis of power usage by the implementation is then presented. The chapter closes with a discussion of the relative ease with which the GPU algorithm can be modified.


6.1 Preliminary Analysis

Research presented in the literature review revealed potential GPU computing bottlenecks [40, 64]. These bottlenecks could significantly impact the performance of a GPU implementation. Preliminary testing was thus carried out, before development of the GPU FX correlator began, to ensure that performance was not significantly impacted by these bottlenecks. Two areas were investigated: the transfer of data between the host and GPU device, and the GPU FFT library implementation.

The transfer of data between the host and GPU device has been shown to become limited depending on the size of the packets used for transfer [40]. In CUDA, the size of the packets refers to the number of bytes specified in a single cudaMemcpy routine call. For this reason, a range of packet sizes was tested to determine the corresponding performance. The preliminary testing also examined the difference between pageable and pinned data transfer modes to determine the optimal method of copying data to and from the GPU device. The results of these tests were shown in Figure 5.3. These results showed that the pageable memory transfers provide superior performance, and that the packet size must be larger than approximately eight megabytes. These settings were used in the GPU FX correlator implementation to minimise the performance impact of data transfer between the host and device.

Research that has used an implementation of the FFT on the GPU architecture has seen mixed performance results [64, 34, 71]. As the FFT forms the second computational stage of the FX correlation algorithm, it was important to determine the performance of the CUDA FFT library. The results of the FFT testing revealed that the overhead created by the transfer of unpacked floating point values before and after the transform reduced the performance of the card. However, the GPU FX correlator model does not transfer 32 bit unpacked floating point values. Instead it transfers data to the GPU device in an 8 bit packed form, reducing the effect of the data transfer to the GPU device by a factor of four. Furthermore, the accumulation in the CMAC stage reduces the results to a negligible fraction of the original data size. Thus the effect of data transfer from the GPU device is reduced.

6.2 Optimisation Analysis

The optimisation analysis investigated how the correlation parameters of FFT length and the number of telescope streams affected the best GPU CMAC approach. This is important because the CMAC stage requires the greatest amount of computation for a non-trivial number of telescope streams, and the trend in radio astronomy interferometry is for an increasingly large number of telescopes in interferometer arrays [38]. For the cross multiplication and accumulation stage of an FX correla- tor, testing demonstrated the performance of the GPU. In the test results, shown previously in Figure 5.7(b), 5.7(a), 5.8(b), and 5.8(a), the GPU was tens to hun- 130 CHAPTER 6: Discussion dreds of times faster than the serial implementation. While there is potential for optimisation in the CPU implementation, the achievable performance gains would not be significant when compared to the orders of magnitude performance increase required to reach that of the GPU.

The GPU correlator by Van Der Schaaf and Overeem [95], reviewed earlier in Chapter 3, saw performance improve by just under a factor of five when compared to a serial implementation. In contrast, the CMAC stage presented in Chapter 4 increased this improvement to over a factor of a hundred. There are two reasons for this large leap in computational power. Firstly, the GPU has grown in power much faster than the CPU in the intervening three years. This is shown in Figure 2.19(a). Secondly, the advent of GPU computing discussed in Section 3.1 has allowed a greater flexibility in algorithm design. This has resulted in a significant reduction in number of global memory accesses and kernel executions required.

Testing revealed two main factors that determine the best CMAC kernel for a given set of correlation parameters. The first is the coalescence of the global memory access on the GPU. The effect of this can be seen in the superior performance of the complex to complex data implementations over those of the real to complex data. This effect also leads to the 1xGxG approach, while carrying out some redundant processing unlike the 1x1x1 and 1x1xN approaches, being significantly faster due to more efficient memory access. The use of the shared memory as a cache to share data between threads in a group results in a speed up proportional to the reduced global memory access, as seen in the majority of the correlation parameter space. However, this approach became inefficient for low FFT lengths and numbers of telescope streams due to a lack of GPU processing threads.

The GPU has the capability of actively processing hundreds of threads of exe- 131 cution simultaneously. Furthermore, it has the capability of scheduling threads. If a group of threads stall from waiting on the latency involved in a global memory access, while the execution of another group of thread proceeds. Thus a kernel may be working on thousands of threads at any given time. However, as these threads are processed in parallel, the GPU will take a similar amount of time to process one thread as it would to process it’s maximum thread capacity. In this way, failing to parallelise an algorithm to keep the GPU at maximum capacity will result in a loss of performance, which I refer to as thread deficiency.

Thread deficiency is the second factor in determining the best approach for a given set of correlation parameters. This can be seen for smaller transform lengths and numbers of streams in Figure 5.7(b) and 5.8(b). The 1x1xN and 1xGxG meth- ods begin to lose performance, whereas the 1x1x1 remains unaffected. This is due to the finer parallelisation of this approach has more threads than the others. If there are not enough threads to fill the GPU, it is taking the same amount of time to pro- cess less work and thus performance drops. The thread deficiency in an approach occurs approximately when the total number of threads for an approach drops be- low the maximum thread occupancy of the GPU for that approach. This is distinct from theoretical thread occupancy for a GPU, as other hardware restrictions for re- sources such as shared memory and registers may cause the actual maximum thread occupancy for an approach to be less than the theoretical maximum occupancy for the GPU.

These hardware specifications vary from one GPU to another. Thus there is not a defined boundary of correlation parameters where one approach GPU CMAC approach surpasses the other in performance. For this reason, it is recommended that the performance of the approaches be measured for the desired correlation parameters on the GPU hardware that is to be used in order to select the best 132 CHAPTER 6: Discussion approach. Such a measurement can be performed during the initialisation stage of the GPU correlator. Because this measurement only needs to be performed once, it will not reduce the performance of the GPU correlator.

Utilising the GPU to optimally process the cross multiplication and accumulation stage of a correlation algorithm is non-trivial. However, the gains that can be achieved both over a traditional CPU approach, and through the correct choice of optimised approaches make this worthwhile. This work has investigated several possible implementations to obtain a significant gain in overall GPU algorithm performance.

6.3 GPU FX Correlator Analysis

The results of the preliminary testing and CMAC stage testing were used to develop the GPU FX correlator, to investigate the performance of the GPU architecture for FX correlation. An important consideration is whether the GPU implementation produces correct results. The CUDA programming language used in the testing follows the IEEE-754 standard for single-precision binary floating-point arithmetic [1].

Some of the more advanced features of this standard are not supported. However, the supported subset is more than sufficient for the calculations presented in this work.

The results of the correctness tests shown previously in Figure 5.10 match those supplied with the test data. A comparison of the correctness tests from the CPU and GPU implementations revealed an average relative difference between results of 0.000013. These differences arise because the order of floating point operations in the two implementations may differ. In particular, the implementations select the most optimal set of FFT radices for each hardware architecture. While mathematically these radices are interchangeable, minutely different results are obtained when floating point arithmetic is used. In terms of a real world implementation, noise both from the environment and the receiving equipment [6] would be far more significant. It should be noted that both the CPU and GPU are providing 32 bit floating point approximations to the correct result, and neither should be considered the absolute truth. More accurate results could be obtained by using double precision floating point calculations at the cost of performance. Double precision is available on both the CPU and modern GPU architectures.
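For concreteness, a comparison of this kind can be expressed as a small host-side helper. The sketch below is an assumption about the averaging used (a per-component relative difference that skips near-zero reference values); the exact formula is not specified in the text.

#include <math.h>
#include <stddef.h>

// Average relative difference between two result buffers of n float values
// (hypothetical helper, not from the thesis code).
double mean_relative_difference(const float *cpu, const float *gpu, size_t n)
{
    double sum = 0.0;
    size_t counted = 0;
    for (size_t i = 0; i < n; i++)
    {
        double ref = fabs((double)cpu[i]);
        if (ref > 1e-12)  // avoid dividing by (near) zero reference values
        {
            sum += fabs((double)cpu[i] - (double)gpu[i]) / ref;
            counted++;
        }
    }
    return counted ? sum / counted : 0.0;
}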

The GPU correlator consistently outperforms the CPU version. As seen in Figure 5.12, the GPU performance advantage varies between a factor of 3 for the correlations that least suit the GPU, to a factor of 70 for those parameters most optimal for the GPU. For the majority of the typical correlation parameter space, the GPU correlator performs faster by over an order of magnitude. However, the performance of the GPU implementation is reduced for correlation parameters with low FFT lengths and small numbers of telescope streams. This is due to thread deficiency, since the low correlation parameters result in a lower number of threads for the CMAC stage.

The computational load is most demanding for large transform lengths and numbers of streams. For this reason, the GPU optimisation concentrated on such parameter values. There are parallelisation approaches that could be applied to increase the number of threads for low correlation parameters, and improve the GPU performance. Currently, the GPU correlator uses one thread for each frequency channel and each pair in the CMAC stage for cases where thread deficiency may be encountered. It may be possible to increase the number of threads by replacing each such thread with multiple threads that each accumulate a separate part of the time series.

This would add the overhead of an additional step to add these subtotals together; however, this cost would most likely be more than offset by the resulting performance boost. A sketch of this idea is given below. Investigation of improving the correlator performance in the thread deficient regime is left for future research.
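The following kernel is a hedged sketch of that idea, not an implementation from this work. It assumes the data layout used by the Appendix A kernels (stream-major spectra of 2*lo2 complex values), uses blockIdx.z to select one of P partial sums per baseline and channel, and leaves the final summation of the P subtotals to a second, inexpensive reduction step.

// Hypothetical split-accumulation kernel to combat thread deficiency.
__global__ void accumulate_split(float2 *partial, const float2 *in,
                                 int lo2, int n, int tT, int P)
{
    int idx  = blockIdx.x * blockDim.x + threadIdx.x;  // frequency channel
    int part = blockIdx.z;                             // partial sum index, 0..P-1
    int ni   = blockIdx.y / n;
    int nj   = blockIdx.y % n;
    if (ni > nj || idx >= lo2) return;

    int chunk = (tT + P - 1) / P;        // spectra handled by this partial sum
    int t0 = part * chunk;
    int tN = min(t0 + chunk, tT);

    float2 sum = make_float2(0.0f, 0.0f);
    for (int t = t0; t < tN; t++)
    {
        float2 chj = in[nj * (2 * lo2) * tT + t * (2 * lo2) + idx];
        float2 chi = in[ni * (2 * lo2) * tT + t * (2 * lo2) + idx];
        sum.x += chj.x * chi.x + chj.y * chi.y;
        sum.y += chj.y * chi.x - chj.x * chi.y;
    }
    // One slot per (baseline, channel, partial); a follow-up reduction kernel
    // (or the host) adds the P subtotals into the final visibility.
    int base = ((nj * (nj + 1)) / 2 + ni) * lo2 + idx;
    partial[part * (n * (n + 1) / 2) * lo2 + base] = sum;
}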

A close examination of Figures 5.12 and 5.13 will reveal that the CPU correlator prefers the smaller transform lengths (L), whereas the GPU prefers the longer transform lengths. For the CPU, the net computational complexity per stream element ε is given by

    ε ∈ O[N + log2(L)]                                        (6.1)

for T timesamples in each of N streams and an FFT length L. Thus for a given number of streams and total number of timesamples, the complexity will scale with log(L). This increased complexity accounts for the slower CPU performance as L increases.

However, for the GPU case the complexity is identical yet the results are the reverse. This is due to the fact that longer transform lengths give the GPU correlator more scope for parallelism, particularly in the CMAC stage of the algorithm.

This can be seen in the GPU L = 128 result in Figure 5.12, where the performance drops significantly for low numbers of streams. The GPU L = 1024 fares better in this regime, as the higher transform length results in more active threads during the CMAC stage. It should be noted that at lengths higher than those typically used in radio astronomy correlation, once there are sufficient threads for the GPU to perform optimally, subsequent increases in transform length result in a similar performance decline to that seen in the CPU. The GPU is by no means immune to the limits of computational complexity, but rather has thread deficiency as an additional consideration that skews the most optimal configuration higher than it would otherwise be.

Aside from the effects of thread deficiency, the GPU correlator is bound by the bandwidth to the GPU memory. This is demonstrated by the 1xGxG method achieving twice the performance of the other methods in Figures 5.7 and 5.8, because the shared memory techniques in the 1xGxG method reduce the memory access by a factor of two. It should be noted that the GPU has the highest memory bandwidth of the currently available commodity computing devices. The 2006 model GeForce 8800 GTS used in this research has a memory bandwidth of 64 GB/s. Due to the use of the CUFFT library, an exact count of the memory operations in the GPU correlator is not possible. However, assuming the minimum required access, the total global memory access per 1 byte sample would be approximately 26 + 4N bytes. Choosing N = 128 to avoid the effects of thread deficiency, a 64 GB/s memory bandwidth should result in a data rate of 118 megasamples per second. This is consistent with Figure 5.14, with the relevant data point corresponding to 105 megasamples per second. It is expected that the rising trend for the GPU in Figure 5.15, due to diminishing thread deficiency, will level out once this saturation point is reached.
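For reference, the arithmetic behind this bandwidth-limited estimate, under the stated assumption of minimum memory traffic, is simply:

\[
\frac{64\times10^{9}\ \mathrm{B/s}}{(26 + 4\cdot 128)\ \mathrm{B/sample}}
  \approx 1.19\times10^{8}\ \mathrm{samples/s}
  \approx 118\ \mathrm{Msamples/s}.
\]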

An estimate of the proportion of the GPU computational resources used by these approaches can be obtained. The GPU used in this research has a maximum theoretical performance of 345.6 GFLOPs. Using 3N + 5 log2(L) + 5, and selecting N = 128 and L = 1024 to avoid thread deficiency effects, each stream element requires 439 floating point operations. For the measured performance of 105 megasamples per second from Figure 5.14, this corresponds to 46.1 GFLOPs. It should be noted that this value only includes operations directly applied to the data. Additional necessary operations, such as for memory addressing, have not been included as they can vary between implementations. However, this result does show that if memory operations could be reduced, the performance of the GPU implementation could increase by up to a maximum theoretical factor of 7.5.
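Written out, the arithmetic behind these figures is:

\[
3(128) + 5\log_{2}(1024) + 5 = 439\ \text{flop per stream element},
\]
\[
439 \times 1.05\times10^{8}\ \mathrm{samples/s} \approx 46.1\ \mathrm{GFLOPs},
\qquad \frac{345.6}{46.1} \approx 7.5 .
\]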

Testing did not examine numbers of streams beyond N = 128. This is because the GPU global memory would begin to impose a restriction on the range of transform lengths tested. This could be overcome by using a multiple GPU approach, in which each GPU correlates a portion of the stream pairs. As the number of streams increases, such a solution would already be required in order to obtain real time bandwidths.

It is also clear that the PCI-express bus is not a bottleneck, as shown by Figure 5.14. The graph shows the total rate of input data that the correlators can process in realtime. The CPU and the GPU are bound by their computational ability rather than the bus bandwidth through which the input data can reach the device. For a correlation algorithm, the input bandwidth dominates and the output bandwidth is negligible in comparison due to the data reduction effect of accumulation. The GPU correlator is almost saturating the maximum SATA2 data rate. However, realtime GPU correlator data acquisition will not occur via physical hard disks, as the highest capacity disks would be processed in a matter of minutes. Instead, the host machine would receive the data streamed over a higher bandwidth network connection.

With the current results, the GPU processing power would have to grow by an order of magnitude to saturate the current PCI-express architecture. In the meantime, future bus architectures such as PCI-express 2 and 3 will continue to increase the bandwidth between host and device. Many of the compute versus bandwidth concerns have already been addressed for the CPU, and suggested solutions suit the parallel nature of the GPU [45]. It is possible that the CPU and GPU architectures will merge in future hardware designs, removing the need for data transfer between the two over PCI-express.

The data used in the testing consisted of real 8 bit samples. While unpacking data of differing bit precision should have no significant impact on the performance of the GPU correlator, there may be a slight performance decrease for higher bit precisions due to the associated additional data transfer between the host machine and the GPU. Conversely, a lower bit precision should result in slightly higher performance. The data streams themselves were not interleaved, and thus consecutive timesamples of a given stream were contiguous in memory. For interleaved samples, where the samples for all streams that correspond to a given time are contiguous in memory, the samples would need to be deinterleaved prior to the Fast Fourier transform. The best way to address this is left for future research.
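As a starting point, one straightforward (though not necessarily optimal) option is a dedicated deinterleave kernel. The sketch below is an assumption, not part of the thesis correlator: it presumes the interleaved layout stores sample t of stream s at in[t*n + s], and simply rewrites the data into the stream-major order expected by the Appendix A unpack stage; a shared-memory staging pass would likely be needed for fully coalesced access.

// Hypothetical deinterleave kernel (not part of the thesis correlator).
__global__ void deinterleave(unsigned char *out, const unsigned char *in,
                             int n, int size)
{
    int t = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;  // timesample index
    int s = blockIdx.y;                                     // stream index
    if (t < size)
        out[s * size + t] = in[t * n + s];                  // stream-major output
}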

6.4 Power and Cost Analysis

The exponential growth in the computational performance of processors detailed in Section 2.5.1 and shown in Figure 2.19(a) has come at the cost of a similar growth in their power consumption. The power consumption of processing systems has become a significant budgeting concern for the next generation of radio telescope arrays. For this reason the energy usage of both the parallel and serial FX correlator models was explored. This was achieved by measuring the power consumption of the GPU FX correlator. The power consumption of the serial CPU implementation was estimated using the power specification provided by the manufacturer. This value does not include inefficiencies of the power supply, and additional power consumption by the motherboard and other internal components of the test system.

The results of this exploration were presented in Section 5.3. The direct measurement of the net power usage of the GPU FX correlator initially showed that it was higher than the power rating of the serial CPU implementation. The performance results for the correlator implementations were then taken into consideration. In terms of performance per watt, the results of Figure 5.16 show the parallel implementation to be superior. Thus the GPU is significantly more power efficient than the CPU for this application. Indeed, even if the CPU were running in a system with a perfect power supply, a motherboard and peripherals with zero power requirements, and no graphics card, it would still be less power efficient than the real world GPU.

There is also a trend present in Figure 5.16. The power advantage of the GPU scales with the size of the array. As the array gets larger the power efficiency improves. This is caused by the superior performance of the CMAC stage kernel for larger numbers of telescope streams. Thus for the truly large scale instruments required for future radio astronomy science, a GPU-accelerated correlator should provide a higher power efficiency than a correlator based on CPUs alone.

Providing a detailed analysis of the relative cost of the CPU and GPU implementations is problematic. A comparison of performance per dollar, based on the purchasing price of the equipment, would not necessarily be representative. This is because the cost of the hardware varies dramatically over time. It should be noted that such a comparison should take into account the cost of the entirety of the two systems, and not just compare the CPU and GPU components separately.

6.5 Adaptability Analysis

The real advantage of software correlators is their ability to be adapted easily to new algorithms for different interferometer configurations and science outcomes. The polyphase filter stage described in Section 2.3 was added to the GPU correlator implementation to show that it retains this adaptability. As detailed in Section 4.3, the polyphase filter stage was added to the unpack stage of the correlation algorithm.

In order to obtain the desired performance, the approach was critically analysed in a manner similar to that presented for the CMAC stage. The hardware specification was considered to ensure sufficient threads to realise the parallelism of the GPU while not exceeding the available compute resources. These resources include the number of registers, the available shared memory, and the thread capacity of the GPU multiprocessor. While finding optimal solutions that fit within these constraints is certainly a time consuming process for a novice to the GPU computing paradigm, with experience this becomes a more expedient process.

The testing of this filter stage was presented in Section 5.4. The results shown in Figures 5.18(a) and 5.18(b) reveal that not only was this stage successfully implemented, but that the resulting increase in processing time was less than the additional computation required for the filter. This is possible due to the SPMD memory latency hiding of the GPU architecture, described in Section 2.4. To summarise, the additional computation occurred during memory latency already present in the original algorithm. This has the implication that some additional features can be added to the algorithm with little performance impact. Some of these features are discussed as potential future research in the next chapter.

Chapter 7

Conclusion

This chapter summarises the work presented in this thesis. Beginning with the concept that parallel computing architectures can be used to meet the processing demands of science, this research has revealed significant results. This includes a data parallel model of a FX radio signal correlator using a GPU computing approach. The model has shown that the techniques presented in this work can yield a system with output matching that of traditional serial approaches, with performance gains measured in orders of magnitude. At the same time, these results have demonstrated that this performance is obtainable at a lower power cost per FLOP than the serial approach and still maintains a degree of adaptability to new algorithmic features. This chapter summarises the individual contributions of this thesis, and then concludes with future considerations for extending this work.


7.1 Thesis Summary

I first conducted preliminary testing, to address potential bottlenecks in the GPU compute paradigm revealed by my background research. The purpose of this testing was to ensure that these bottlenecks were not a significant obstacle before committing to further development on the GPU. The preliminary tests first investigated factors affecting the transfer of data between the host and GPU device. The results of the tests indicated that pageable memory transfers with a minimum size of eight megabytes resulted in the most optimal host-device bandwidth. Preliminary testing then investigated the performance of the CUDA fast Fourier transform library, CUFFT. Results showed that CUFFT was roughly ten times faster than an FFTW CPU implementation, but that data transfer of 32 bit floating point values reduced this performance considerably. Since the correlator data consists of a packed 8 bit integer format, it is one quarter the size of an equivalent 32 bit floating point representation. I concluded that the transfer of data in its existing 8 bit integer form and subsequent unpacking to floating point on the GPU would mitigate the performance drop caused by data transfer to the GPU device. Furthermore, the accumulation that occurs in the CMAC stage of the FX correlation algorithm would reduce the data transfer from the GPU device to a negligible amount.

I then developed several potential parallel approaches for the CMAC stage kernel. The purpose of these approaches was to investigate two main correlation parameters: the length of the FFT, and the number of telescope data streams. The approaches varied from memory efficient models that reused memory fetches from the GPU device memory, to extremely parallel approaches that contained a larger number of threads. These approaches were tested for the ranges of correlation parameters commonly used in radio astronomy, to determine the best approach for a given set of parameters. The results of my testing showed that for FFT lengths larger than 512, or numbers of telescope streams larger than 16, the memory efficient model was superior. However, for small FFT lengths and small numbers of telescope streams, the approach which contained more threads was more appropriate.

Taking the best CMAC stage kernels, I then implemented the entire GPU FX correlation algorithm. The purpose of this implementation was to determine the suitability of the GPU architecture to radio astronomy correlation. The GPU implementation was tested for correctness and performance. My results showed that the GPU implementation produced correct results, and performed up to a hundred times faster than a comparative serial CPU implementation. The performance trends of the previous CMAC stage testing were evident in the full FX correlation implementation results. This is due to the CMAC stage being the most computationally intensive stage in the algorithm. From the performance results, I concluded that the GPU architecture was indeed suited to radio astronomy correlation.

However, the power usage of the GPU was a concern, since the power consumption of computing facilities has become a significant budget consideration. For this reason, I measured the power consumption of the GPU FX correlator. The power consumption of the serial CPU implementation was estimated using the power specification provided by the manufacturer. This value does not include inefficiencies of the power supply, and additional power consumption by the motherboard and other internal components of the test system. While the GPU correlator did use more power than the CPU rating, I also considered the relative performance output of each implementation. In terms of performance per watt the GPU implementation was superior by up to a factor of 30.

Finally, I also modified the GPU implementation with the addition of a polyphase filter stage. The purpose of adding this stage was to investigate how easily GPU algorithms could be modified. In order to achieve desirable performance, I applied the GPU programming techniques developed while investigating the best CMAC stage approach. My understanding of the GPU computing paradigm was critical. The resulting polyphase filter implementation was then tested. Since the polyphase filter included additional computation, I expected the performance of the GPU implementation to drop accordingly. However, the implementation performance was better than expected. I concluded that some of the additional computation was hidden by the memory latency hiding mechanisms of the GPU hardware.

7.2 Future Research

This research has thoroughly investigated the parallel implementation of a FX correlator on the GPU architecture. However, there are many related areas yet to be explored. This section lists some of these areas. These include additional features of the correlator itself, and the rest of the aperture synthesis pipeline. The scaling of this work to cluster computing, and to alternative hardware architectures, is also discussed.

The FX correlation algorithms, both CPU and GPU, represent a simple correlation benchmark framework. Additional features for specific correlation array configurations, such as delay compensation, corner turning, and fringe rotation, are not implemented. The omission of these features was due to time constraints, and there is no barrier to their implementation on the GPU. Should their computation fall within global memory latency, it is possible that there will be little additional overhead in the GPU correlator.

Other frequency filters could also be examined. The vanilla FFT approach contains inherent leakage of non-aligned frequencies into other spectral bins, degrading the signal to noise ratio [52]. This is traditionally addressed in radio astronomy by the polyphase filter approach also presented in Chapter 4. It is possible, given the low arithmetic intensity of the FFT, that an alternative approach that traditionally has a higher computation cost may be viable on the GPU architecture.

The aperture synthesis pipeline, as introduced in Section 2.1.3, consists of several sequential parts that convert the one dimensional radio signals collected by the telescopes into two dimensional images of the radio source. The parallel FX correlator demonstrated in this work forms the first of these parts. As reviewed in Chapter 3, Wayth and Dale have implemented a parallel version of the latter stages of aperture synthesis [103]. Subsequent work could focus on parts of the pipeline not yet addressed, such as the image deconvolution techniques introduced in Section 2.1.3.

Although the parallel implementation of the correlator is significantly faster than a serial approach, it is still only able to process a finite amount of data in realtime. In order to deal with the scale of data foreshadowed in Section 3, a multitude of GPU devices would be required. Consequently, the correlation implementation must be parallelised across multiple GPU devices.

A possible approach would be to copy the techniques used by radio spectrometry hardware. The multichannel receiver introduced in Section 2.1.1 splits the incoming frequencies into bands. This approach could also be used in the case of a GPU correlator cluster. In this scheme, each GPU device correlates a band of the overall bandwidth.

Another potential approach would be to parallelise by data streams. In this scheme, each GPU device would process a group of baseline pairs. A drawback of this approach is that the unpacking and Fourier transform stages of the correlator pipeline, shown in Figure 4.1, would need to be processed multiple times for some streams. Although the computational complexity of these stages is less than that of the CMAC stage, they are by no means negligible.

While CUDA is an excellent parallel language for implementing scientific algorithms on the GPU, it limits the resulting program to vendor specific hardware.

The GPU computing field is rapidly maturing, and approaching standardisation. OpenCL (Open Computing Language) is an open royalty-free standard for general purpose parallel programming across CPUs, GPUs, and other processors, giving software developers portable and efficient access to the power of these heterogeneous processing platforms [66]. The implementation of a parallel correlator in such a language would increase its accessibility for the radio astronomy community.

References

[1] IEEE standard for binary floating-point arithmetic. ANSI/IEEE Std 754-1985, 1985. Technical Report.

[2] J. G. Ables. Maximum Entropy Spectral Analysis. Astronomy and Astrophysics Supplement, 15:383–+, June 1974.

[3] AMD. AMD stream computing: Software stack. 2007. Internet, http://ati.amd.com/technology/streamcomputing/resources.html, accessed 03/12/2008.

[4] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. Readings in computer architecture, pages 79–81, 2000.

[5] R. G. Belleman, J. Bedorf, and S. Portegies Zwart. High Performance Direct Gravitational N-body Simulations on Graphics Processing Units – II: An implementation in CUDA. ArXiv e-prints, 707, July 2007.

[6] F. H. Briggs, J. F. Bell, and M. J. Kesteven. Removing Radio Interference from Contaminated Astronomical Spectra Using an Independent Reference Signal and Closure Relations. 120:3351–3361, December 2000. arXiv:astro-ph/0006222.

[7] R. H. Brown and A. C. B. Lovell. The exploration of space by radio. Chapman and Hall Ltd, 1957.

[8] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. ACM Trans. Graph., 23(3):777–786, 2004.

[9] John Bunton. Multi-resolution fx correlator. ALMA memo 447, Feb 2003.

[10] B. F. Burke and F. Graham-Smith. An Introduction to Radio Astronomy. Cambridge University Press, 1997.


[11] I-Liang Chern and Ian T. Foster. Parallel implementation of a control volume method for solving pdes on the sphere. In Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, pages 301–306, Philadelphia, PA, USA, 1992. Society for Industrial and Applied Mathematics.

[12] Y. Chikada, M. Ishiguro, H. Hirabayashi, M. Morimoto, K. I. Morita, K. Miyazawa, K. Nagane, K. Murata, A. Tojo, S. Inoue, T. Kanzawa, and H. Iwashita. A Digital FFT Spectro-Correlator for Radio Astronomy. In J. A. Roberts, editor, Indirect Imaging. Measurement and Processing for Indirect Imaging, page 387, 1984.

[13] S. Chikada, Y.; Ishiguro, M.; Hirabayashi, H.; Morimoto, M.; Morita, K.; Kan- zawa, T.; Iwashita, H.; Nakazima, K.; Ishikawa, S.; Takahashi, T.; Handa, K.; Kasuga, T.; Okumura, S.; Miyazawa, T.; Nakazuru, T.; Miura, K.; Nagasawa. A 6 320-MHz 1024-channel FFT cross-spectrum analyzer for radio astronomy. Proceedings of the IEEE, 75(9):1203–1210, September 1987.

[14] D. Cook, J. Ioannidis, A. Keromytis, and J. Luck. Cryptographics: Secret key cryptography using graphics cards, 2005.

[15] James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex fourier series. Mathematics of Computation, 19(90):297–301, 1965.

[16] Greg Coombe, Mark J. Harris, and Anselmo Lastra. Radiosity on graphics hardware. In Graphics Interface, pages 161–168, 2004.

[17] T. J. Cornwell. Ska and Evla Computing Costs for Wide Field Imaging. Experimental Astronomy, 17:329–343, June 2004.

[18] T.J. Cornwell and Ger van Diepen. Scaling mount exaflop: from the pathfinders to the square kilometre array. 2008.

[19] CSIRO. The csiro parkes radio telescope. 2007. Internet, http://www.scienceimage.csiro.au/index.cfm?event=site.image.detail&id=4030, accessed 24/12/2008.

[20] CSIRO. Science image : Pricing and licences. 2007. Internet, http://www.scienceimage.csiro.au/index.cfm?event=site.pricing, accessed 24/12/2008. Permission to use images free of charge obtained via email.

[21] W. J. Dally, P. Hanrahan, M. Erez, T. J. Knight, F. Labonte, J.-H. Ahn, N. Jayasena, U. J. Kapasi, A. Das, J. Gummaraju, and I. Buck. Merrimac: Supercomputing with streams. In SC'03, Phoenix, Arizona, November 2003.

[22] A. Deller, S. Tingay, M. Bailes, and C. West. Distributed FX software correlation for eVLBI. In Proceedings of the 8th European VLBI Network Symposium, 2006.

[23] Adam T. Deller, S. J. Tingay, M. Bailes, and C. West. DiFX: A software correlator for very long baseline interferometry using multi-processor computing environments. 2007. astro-ph/0702141.

[24] Kelly Dempski. Real-time Rendering Tricks and Techniques in DirectX. Thomson Course Technology, 2002.

[25] S. W. Ellingson and W. Cazemier. Efficient multibeam synthesis with interference nulling for large arrays. IEEE Transactions on Antennas and Propagation, 51:503–511, March 2003.

[26] Bowman J. D. et al. Field Deployment of Prototype Antenna Tiles for the Mileura Widefield Array Low Frequency Demonstrator. 133:1505–1518, April 2007. arXiv:astro-ph/0611751.

[27] Zhe Fan, Feng Qiu, Arie Kaufman, and Suzanne Yoakum-Stover. GPU cluster for high performance computing. In SC '04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, page 47, Washington, DC, USA, 2004. IEEE Computer Society.

[28] Randima Fernando, editor. GPU Gems: Programming Techniques, Tips, and Tricks for Real-Time Graphics. Addison-Wesley, 2004.

[29] Randima Fernando and Mark J. Kilgard. The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2003.

[30] M.J. Flynn. Very high-speed computing systems. Proceedings of the IEEE, 54(12):1901–1909, Dec. 1966.

[31] M. Frigo and S. G. Johnson. The fastest fourier transform in the west. Technical report, Cambridge, MA, USA, 1997.

[32] Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. Dynamic warp formation and scheduling for efficient gpu control flow. In MICRO '07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 407–420, Washington, DC, USA, 2007. IEEE Computer Society.

[33] Dominik Göddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Sven H. M. Buijssen, Matthias Grajewski, and Stefan Turek. Exploring weak scalability for fem calculations on a gpu-enhanced cluster. Parallel Comput., 33(10-11):685–699, 2007.

[34] Naga K. Govindaraju, Scott Larsen, Jim Gray, and Dinesh Manocha. A memory model for scientific algorithms on graphics processors. Technical report, UNC, 2006.

[35] GPGPU. General-purpose computation using graphics hardware. 2008. Internet, http://www.gpgpu.org/, accessed 16/12/2008.

[36] John L. Gustafson. Reevaluating amdahl’s law. Commun. ACM, 31(5):532– 533, 1988.

[37] K.G. Haines, J.A. Moya, and T.P. Caudell. Modeling nonsynaptic communication between neurons in the lamina ganglionaris of musca domestica. Neural Networks, 1999. IJCNN '99. International Joint Conference on, 1:131–136 vol.1, 1999.

[38] P. J. Hall. The Square Kilometre Array: An Engineering Perspective. The Square Kilometre Array: An Engineering Perspective, Edited by Peter J. Hall. 2005 V, 430 p. 1-4020-3797-X. Berlin: Springer, 2005., 2005.

[39] Mark J. Harris, Greg Coombe, Thorsten Scheuermann, and Anselmo Lastra. Physically-based visual simulation on graphics hardware. SIGGRAPH Eurographics Workshop on Graphics Hardware, 2002.

[40] Owen Harrison and John Waldron. Optimising data movement rates for parallel processing applications on graphics processors. In Parallel and Distributed Computing and Networks, 2007.

[41] A. Hewish, S. J. Bell, J. D. Pilkington, P. F. Scott, and R. A. Collins. Observation of a Rapidly Pulsating Radio Source. Nature, 217:709–+, February 1968.

[42] J. A. Högbom. Aperture Synthesis with a Non-Regular Distribution of Interferometer Baselines. Astronomy and Astrophysics Supplement, 15:417, June 1974.

[43] K. G. Jansky. Directional Studies of Atmospherics at High Frequencies. In N. Kassim, M. Perez, W. Junor, and P. Henning, editors, Astronomical Society of the Pacific Conference Series, volume 345 of Astronomical Society of the Pacific Conference Series, pages 3–15, December 2005.

[44] Marcin Jedrzejewski and Krzysztof Marasek. Computation of room acoustics using programmable video hardware. In International Conference on Computer Vision and Graphics, September 2004.

[45] Eric E. Johnson. Graffiti on the memory wall. SIGARCH Comput. Archit. News, 23(4):7–8, 1995.

[46] Arvind Krishnamurthy and Katherine A. Yelick. Optimizing parallel spmd programs. In LCPC '94: Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing, pages 331–345, London, UK, 1995. Springer-Verlag.

[47] Jens Krueger and Ruediger Westermann. Linear algebra operators for GPU implementation of numerical algorithms. ACM Transactions on Graphics (TOG), 22(3):908–916, 2003.

[48] S. R. Kulkarni, S. B. Anderson, T. A. Prince, and A. Wolszczan. Old pulsars in the low-density globular clusters M13 and M53. Nature, 349:47–49, January 1991.

[49] Mukul R. Kundu. Solar Radio Astronomy. John Wiley & Sons Inc, November 1965.

[50] S. J. Lilly. Discovery of a radio galaxy at a redshift of 3.395. Astrophysics Journal, 333:161–167, October 1988.

[51] Colin J. Lonsdale, Sheperd S. Doeleman, and Divya Oberoi. Efficient imaging strategies for next-generation radio arrays. The Square Kilometre Array: An Engineering Perspective, pages 345–362, January 2005.

[52] Richard G. Lyons. Understanding Digital Signal Processing (2nd Edition). Prentice Hall PTR, Upper Saddle River, NJ, USA, 2004.

[53] John Markoff. Intel's big shift after hitting technical wall. The New York Times, 2004.

[54] H. Markram. The blue brain project. Nature Reviews Neuroscience, 7(2):153–160, 2006.

[55] Berna L. Massingill, Timothy G. Mattson, and Beverly A. Sanders. Patterns for parallel application programs. In Proceedings of the Sixth Pattern Languages of Programs Workshop, 1999.

[56] Berna L. Massingill, Timothy G. Mattson, and Beverly A. Sanders. Reengineering for parallelism: an entry point into plpp for legacy applications: Research articles. Concurrency and Computation: Practice and Experience, 19(4):503–529, 2007.

[57] Michael D. McCool, Zheng Qin, and Tiberiu S. Popa. Shader metaprogramming. In HWWS '02: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 57–68, Aire-la-Ville, Switzerland, 2002. Eurographics Association.

[58] Michael D. McCool, Kevin Wadleigh, Brent Henderson, and Hsin-Ying Lin. Performance evaluation of gpus using the rapidmind development platform. In SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 181, New York, NY, USA, 2006. ACM.

[59] J. Michalakes and M. Vachharajani. Gpu acceleration of numerical weather prediction. Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1–7, April 2008.

[60] A. A. Michelson. On the Application of Interference Methods to Astronomical Measurements. Proceedings of the National Academy of Science, 6:474–475, August 1920.

[61] John S. Montrym, Daniel R. Baum, David L. Dignam, and Christopher J. Migdal. Infinitereality: a real-time graphics system. In SIGGRAPH, 1997.

[62] G. E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8):114–117, 1965.

[63] J. M. Moran. Thirty Years of VLBI: Early Days, Successes, and Future. In J. A. Zensus, G. B. Taylor, and J. M. Wrobel, editors, IAU Colloq. 164: Radio Emission from Galactic and Extragalactic Compact Sources, volume 144 of Astronomical Society of the Pacific Conference Series, 1998.

[64] Kenneth Moreland and Edward Angel. The FFT on a GPU. Graphics Hardware, 2003.

[65] S. R. Mosier and J. Fainberg. A new high-speed solar spectrograph for meter and decameter wavelengths. Solar Physics, 40:501–509, February 1975.

[66] Aaftab Munshi. The OpenCL specification. Technical report, 2008.

[67] Hubert Nguyen, editor. GPU Gems 3. Addison-Wesley, 2007.

[68] NVIDIA. New nvidia GPU breaks one billion pixels per second barrier. Press Release, Internet, 2000. http://www.nvidia.com/.

[69] NVIDIA. Nvidia unveils cuda - the gpu computing revolution begins, November 2006. NVIDIA Press Release.

[70] NVIDIA. CUDA CUBLAS Library 1.0. June 2007.

[71] NVIDIA. CUDA CUFFT Library 1.0. June 2007.

[72] NVIDIA. CUDA Programming Guide 1.0. June 2007.

[73] National Radio Astronomy Observatory. Jansky antenna. 2008. Internet, http://images.nrao.edu/Historical/Telescopes/107, accessed 15/12/2008.

[74] National Radio Astronomy Observatory. Nrao image use policy. 2008. Internet, http://images.nrao.edu/image use.shtml, accessed 15/12/2008.

[75] Alan V. Oppenheim, Ronald W. Schafer, and John R. Buck. Discrete-Time Signal Processing (2nd Edition). Prentice Hall, February 1999.

[76] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. Gpu computing. Proceedings of the IEEE, 96(5):879–899, May 2008.

[77] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Timothy J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1):80–113, 2007.

[78] Aaron Parsons, Donald Backer, Chen Chang, Daniel Chapman, Henry Chen, Patrick Crescini, Christina de Jesus, Chris Dick, Pierre Droz, David MacMahon, Kirsten Meder, Jeff Mock, Vinayak Nagpal, Borivoje Nikolic, Arash Parsa, Brian Richards, Andrew Siemion, John Wawrzynek, Dan Werthimer, and Melvyn Wright. Petaop/second fpga signal processing for seti and radio astronomy. Signals, Systems and Computers, 2006. ACSSC '06. Fortieth Asilomar Conference on, pages 2031–2035, Oct.-Nov. 2006.

[79] R. B. Partridge. 3K: The Cosmic Microwave Background Radiation. Cambridge University Press, September 1995.

[80] Marshall C. Pease. An adaptation of the fast fourier transform for parallel processing. Journal of the ACM, 15(2):252–264, 1968.

[81] D.C. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Harvey, P.M. Harvey, H.P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Pham, J. Pille, S. Posluszny, M. Riley, D.L. Stasiak, M. Suzuoki, O. Takahashi, J. Warnock, S. Weitzel, D. Wendel, and K. Yazawa. Overview of the architecture, circuit design, and physical implementation of a first-generation cell processor. IEEE Journal of Solid-State Circuits, 41:179–196, 2006.

[82] Matt Pharr, editor. GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley, 2005.

[83] Timothy J. Purcell, Ian Buck, William R. Mark, and Pat Hanrahan. Ray tracing on programmable graphics hardware. ACM Transactions on Graphics, 21(3):703–712, July 2002.

[84] Michael J. Quinn. Parallel Computing. McGraw-Hill Inc., 1994.

[85] Lawrence R. Rabiner. Multirate Digital Signal Processing. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1996.

[86] K. Rohlfs, T. L. Wilson, and S. Hüttemeister. Tools of Radio Astronomy. Springer, 2009.

[87] J. D. Romney. Cross Correlators, volume 180 of Astronomical Society of the Pacific Conference Series. 1999.

[88] Randi J. Rost. OpenGL(R) Shading Language (2nd Edition). Addison-Wesley Professional, 2005.

[89] M. Ryle. A new radio interferometer and its application to the observation of weak radio stars. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 211(1106):351–375, 1952.

[90] M. Ryle and D. D. Vonberg. Solar Radiation on 175 Mc./s. Nature, 158:339–340, September 1946.

[91] Kjeld Schaaf and Ruud Overeem. Cots correlator platform. Experimental Astronomy, 17(1-3):287–297, June 2004.

[92] Hsi-Yu Schive, Chia-Hung Chien, Shing-Kwong Wong, Yu-Chih Tsai, and Tzihong Chiueh. Graphic-card cluster for astrophysics (graCCA) performance tests. ArXiv e-prints, July 2007.

[93] H. Schomberg and J. Timmer. The gridding method for image reconstruction by fourier transformation. Medical Imaging, IEEE Transactions on, 14(3):596–607, Sep 1995.

[94] Amar Shan. Heterogeneous processing: a strategy for augmenting moore’s law. Linux Journal, Jan 2006.

[95] Mark Silberstein, Assaf Schuster, Dan Geiger, Anjul Patney, and John D. Owens. Efficient computation of sum-products on gpus through software-managed cache. In ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, pages 309–318, New York, NY, USA, 2008. ACM.

[96] A. G. Smith. Radio exploration of the sun. Van Nostrand Momentum Books, Princeton: Van Nostrand, 1967, 1967.

[97] J. L. Steinburg and J. Lequeux. Radio Astronomy. McGraw-Hill Book Company, Inc., 1963.

[98] R. Westermann T. Schiwietz, T. Chang, P. Speier. MR image reconstruction using the GPU. In Proceedings of SPIE Medical Imaging 2006, San Diego, CA, February 2006. SPIE.

[99] A. R. Thompson, J. M. Moran, and G. W. Swenson, Jr. Interferometry and Synthesis in Radio Astronomy, 2nd Edition. Wiley, April 2001.

[100] Jack Tomlinson. Computation of flops requirements for a wideband spectrum analyzer. Texas Memory Systems, Inc, May 2004.

[101] P. Trancoso and M. Charalambous. Exploring graphics processor performance for general purpose applications. Digital System Design, 2005. Proceedings. 8th Euromicro Conference on, pages 306–313, Aug.-3 Sept. 2005.

[102] Suresh Venkatasubramanian. The graphics card as a stream computer. In SIGMOD-DIMACS Workshop on Management and Processing of Data Streams, 2003.

[103] R. Wayth, K. Dale, L. J. Greenhill, D. A. Mitchell, S. Ord, and H. Pfister. Data Processing Using GPUs for The MWA. In Bulletin of the American Astronomical Society, volume 38 of Bulletin of the American Astronomical Society, pages 744–+, December 2007.

[104] S. Weinreb, A. H. Barrett, M. L. Meeks, and J. C. Henry. Radio Observations of OH in the Interstellar Medium. Nature, 200:829–+, November 1963.

[105] Sean Whalen. Audio and the graphics processing unit. In IEEE Vis 2004 GPGPU Tutorial, March 2004.

[106] Mason Woo, Jackie Neider, Tom Davis, and Dave Shreiner. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 1.2. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[107] J. L. Yen. The Role of Fast Fourier Transform Computers in Astronomy. Astronomy and Astrophysics Supplement, 15:483, June 1974.

[108] V. V. Zheleznyakov. Radio Emission of the Sun and Planets. Pergamon Press, 1970.

[109] Simon Portegies Zwart, Robert Belleman, and Peter Geldof. High performance direct gravitational n-body simulations on graphics processing units, 2007.

Appendix A

Code

The following sections contain the GPU kernels and accompanying wrapper functions for the various stages of the correlation algorithm, as well as for the polyphase filter. Additional host code pertaining to the initialisation, memory management, host-device memory transfer, I/O, and algorithm control has been omitted for the sake of brevity.

A.1 Unpack Stage Kernel

/**
 * These routines take N telescope streams consisting of
 * packed 8 bit samples and unpack them to a floating point
 * representation. For input, they take a pointer to a
 * GPU-resident buffer that contains the packed data, grouped
 * by stream. They output to a GPU-resident buffer specified
 * by a second pointer.
 *
 * The following variables are used:
 *  out  - a pointer to the output buffer
 *  in   - a pointer to the input buffer
 *  size - the total number of samples in the input buffer
 *
 * The following routines are available:
 *  up : the unpack routine
 */

// GPU Kernel for unpack operation
// (called from the wrapper below)
__global__ void unpack(float2 *ubuff, uchar1 *pbuff, int s)
{
    const int index = __mul24(blockIdx.x,blockDim.x)+threadIdx.x;
    const int inc = __mul24(gridDim.x,blockDim.x);
    for (int pos = index; pos < s; pos += inc)
    {
        uchar1 word_c = pbuff[pos];
        float2 word_f;
        word_f.x = 1.0*word_c.x-128.0;
        word_f.y = 0.0;
        ubuff[pos] = word_f;
    }
}

// Kernel wrapper routine for unpack operation
// (the input pointer is the packed 8 bit buffer; its type is given
// here as uchar1* to match the kernel signature)
void up(float2 *out, uchar1 *in, int size)
{
    dim3 grid;
    dim3 block;
    grid.x = 3*12;
    grid.y = 1;
    grid.z = 1;
    block.x = 128;
    block.y = 1;
    block.z = 1;
    unpack<<<grid,block>>>(out,in,size);
}

A.2 CMAC Stage Kernels

/**
 * These routines take N telescope streams consisting of a
 * timeseries of S spectra with L frequency channels, and
 * conjugate multiply and accumulate the signals to produce
 * N(N+1)/2 output spectra. For input, they take a pointer
 * to a GPU-resident buffer that contains the spectra,
 * grouped by stream. They output to a GPU-resident buffer
 * specified by a second pointer.
 *
 * The following variables are used:
 *  out - a pointer to the output buffer
 *  in  - a pointer to the input buffer
 *  l   - the length of the fourier transform used to produce
 *        the complex spectra
 *  n   - the number of telescope signals that are present
 *        in the input buffer
 *  t0  - the spectra index to begin accumulation for this
 *        kernel call
 *  tN  - the spectra index to stop accumulating for this
 *        kernel call
 *  tT  - the total number of spectra per accumulation,
 *        possibly spanning multiple calls.
 *
 * The following routines are available:
 *  a_1x1   : the 1x1x1 approach
 *  a_1xN   : the 1x1xN approach
 *  a_1xG_4 : the 1xGxG approach, for G=4
 */

// GPU Kernel for 1x1x1 accumulation
// (called from the wrapper below)
__global__ void accumulate_1x1(float2 *out, float2 *in,
                               int lo2, int n, int t0, int tN, int tT)
{
    int ni = blockIdx.y/n;
    int nj = blockIdx.y%n;
    if (ni<=nj)
    {
        int idx = __mul24(blockIdx.x,blockDim.x)+threadIdx.x;
        float2 l_sum = make_float2(0.0,0.0);
        // loop bounds reconstructed (lost in extraction): accumulate
        // spectra t0..tN-1, stepping one spectrum (lo2*2 samples) at a time
        for (int pos=t0*(lo2*2)+idx; pos<tN*(lo2*2); pos+=(lo2*2))
        {
            float2 chj = in[nj*(lo2*2)*tT+pos];
            float2 chi = in[ni*(lo2*2)*tT+pos];
            l_sum.x += chj.x*chi.x + chj.y*chi.y;
            l_sum.y += chj.y*chi.x - chj.x*chi.y;
        }
        int pos = (((nj*(nj+1))/2)+ni)*lo2+idx;
        float2 g_sum = out[pos];
        g_sum.x += l_sum.x;
        g_sum.y += l_sum.y;
        out[pos] = g_sum;
    }
}

// Kernel wrapper routine for 1x1x1 accumulation
void a_1x1(float2 *out, float2 *in,
           int l, int n, int t0, int tN, int tT)
{
    dim3 grid;
    dim3 block;
    grid.x = l/128;
    grid.y = n*n;
    grid.z = 1;
    block.x = 64;
    block.y = 1;
    block.z = 1;
    accumulate_1x1<<<grid,block>>>(out,in,l/2,n,t0,tN,tT);
}

// GPU Kernel for 1xGxG (G=4) accumulation
// (called from the wrapper below)
__global__ void accumulate_1xG_4(float2 *out, float2 *in,
                                 int lo2, int no4, int t0, int tN, int tT)
{
    int mj = blockIdx.y/no4;
    int mi = blockIdx.y%no4;
    if (mj<=mi)
    {
        int lx = __mul24(blockIdx.x,blockDim.x)+threadIdx.x;
        int nj = threadIdx.y;
        int xx = threadIdx.x;
        float2 l_sum0 = make_float2(0.0,0.0);
        float2 l_sum1 = make_float2(0.0,0.0);
        float2 l_sum2 = make_float2(0.0,0.0);
        float2 l_sum3 = make_float2(0.0,0.0);
        __shared__ float2 x_ni[4][32];
        for (int tx=t0; tx<tN; tx++)
        {
            // global loads reconstructed (the original lines were lost in
            // extraction): each thread fetches its own stream sample and
            // stages one sample of the other group in shared memory
            float2 x_nj = in[(4*mj+nj)*(lo2*2)*tT + tx*(lo2*2) + lx];
            x_ni[nj][xx] = in[(4*mi+nj)*(lo2*2)*tT + tx*(lo2*2) + lx];
            __syncthreads();
            l_sum0.x += x_nj.x*x_ni[0][xx].x + x_nj.y*x_ni[0][xx].y;
            l_sum0.y += x_nj.y*x_ni[0][xx].x - x_nj.x*x_ni[0][xx].y;
            l_sum1.x += x_nj.x*x_ni[1][xx].x + x_nj.y*x_ni[1][xx].y;
            l_sum1.y += x_nj.y*x_ni[1][xx].x - x_nj.x*x_ni[1][xx].y;
            l_sum2.x += x_nj.x*x_ni[2][xx].x + x_nj.y*x_ni[2][xx].y;
            l_sum2.y += x_nj.y*x_ni[2][xx].x - x_nj.x*x_ni[2][xx].y;
            l_sum3.x += x_nj.x*x_ni[3][xx].x + x_nj.y*x_ni[3][xx].y;
            l_sum3.y += x_nj.y*x_ni[3][xx].x - x_nj.x*x_ni[3][xx].y;
            __syncthreads();
        }
        int xj = 4*mj+nj;
        int xi;
        int pos;
        float2 g_sum;
        xi = 4*mi+0;
        if (xj<=xi)
        {
            pos = ((xi*(xi+1))/2+(xj))*lo2+lx;
            g_sum = out[pos];
            g_sum.x += l_sum0.x;
            g_sum.y += l_sum0.y;
            out[pos] = g_sum;
        }
        xi = 4*mi+1;
        if (xj<=xi)
        {
            pos = ((xi*(xi+1))/2+(xj))*lo2+lx;
            g_sum = out[pos];
            g_sum.x += l_sum1.x;
            g_sum.y += l_sum1.y;
            out[pos] = g_sum;
        }
        xi = 4*mi+2;
        if (xj<=xi)
        {
            pos = ((xi*(xi+1))/2+(xj))*lo2+lx;
            g_sum = out[pos];
            g_sum.x += l_sum2.x;
            g_sum.y += l_sum2.y;
            out[pos] = g_sum;
        }
        xi = 4*mi+3;
        if (xj<=xi)
        {
            pos = ((xi*(xi+1))/2+(xj))*lo2+lx;
            g_sum = out[pos];
            g_sum.x += l_sum3.x;
            g_sum.y += l_sum3.y;
            out[pos] = g_sum;
        }
    }
}

// Kernel wrapper routine for 1xGxG (G=4) accumulation
void a_1xG_4(float2 *out, float2 *in,
             int l, int n, int t0, int tN, int tT)
{
    dim3 grid;
    dim3 block;
    grid.x = l/64;
    grid.y = (n/4)*(n/4);
    grid.z = 1;
    block.x = 32;
    block.y = 4;
    block.z = 1;
    accumulate_1xG_4<<<grid,block>>>(out,in,l/2,n/4,t0,tN,tT);
}

// GPU Kernel for 1x1xN accumulation
// (called from the wrapper below)
__global__ void accumulate_1xN(float2 *out, float2 *in,
                               int lo2, int n, int t0, int tN, int tT)
{
    int idx = __mul24(blockIdx.x,blockDim.x)+threadIdx.x;
    int nj = blockIdx.y;
    float2 l_sum;
    for (int ni=0; ni<=nj; ni++)
    {
        l_sum = make_float2(0.0,0.0);
        // loop bounds reconstructed as in accumulate_1x1 above
        for (int pos=t0*(lo2*2)+idx; pos<tN*(lo2*2); pos+=(lo2*2))
        {
            float2 chj = in[nj*(lo2*2)*tT+pos];
            float2 chi = in[ni*(lo2*2)*tT+pos];
            l_sum.x += chj.x*chi.x + chj.y*chi.y;
            l_sum.y += chj.y*chi.x - chj.x*chi.y;
        }
        int pos = (((nj*(nj+1))/2)+ni)*lo2+idx;
        float2 g_sum = out[pos];
        g_sum.x += l_sum.x;
        g_sum.y += l_sum.y;
        out[pos] = g_sum;
    }
}

// Kernel wrapper routine for 1x1xN accumulation
void a_1xN(float2 *out, float2 *in,
           int l, int n, int t0, int tN, int tT)
{
    dim3 grid;
    dim3 block;
    grid.x = l/128;
    grid.y = n;
    grid.z = 1;
    block.x = 64;
    block.y = 1;
    block.z = 1;
    accumulate_1xN<<<grid,block>>>(out,in,l/2,n,t0,tN,tT);
}

A.3 Polyphase Filter Kernel

/**
 * These routines take N telescope streams consisting of
 * packed 8 bit samples, unpack them to a floating point
 * representation, and then pass them through a polyphase
 * filter (unpacking occurs partway through the filter).
 * For input, they take a pointer to a GPU-resident buffer
 * that contains the packed data, grouped by stream. They
 * output to a GPU-resident buffer specified by a second
 * pointer.
 *
 * The following variables are used:
 *  out  - a pointer to the output buffer
 *  in   - a pointer to the input buffer
 *  size - the total number of samples in the input buffer
 *  taps - the number of taps in the polyphase filter
 *  n    - the number of streams present in the input buffer
 *  l    - the transform length (number of samples per filter tap)
 *
 * The following routines are available:
 *  u_poly : the polyphase filter routine
 */

// GPU Kernel for the unpack and polyphase filter operation
// (called from the wrapper below)
__global__ void upoly(float2 *out, uchar1 *in, int size, int taps)
{
    int x = threadIdx.x;
    int y = blockIdx.x;
    int nx = blockIdx.y;
    int n = gridDim.y;
    int l = blockDim.x;
    int pS = (size/(gridDim.x*n));
    int poff = (taps-1)*l;
    int p0 = (nx*(gridDim.x*pS+poff))+y*pS+x;
    int pN = p0-x+pS;
    int w = taps*l;
    int nxt = taps - 1; // circular buffer index
    __shared__ unsigned char s_in[128*8];
    __shared__ float s_f[128*8];
    // build the windowed-sinc filter coefficients in shared memory
    // (pi is assumed to be a constant defined in the omitted host code)
    for (int t=0; t<taps; t++)
    {
        int loc = l*t+x;
        s_f[loc]=(0.5-0.5*cos(loc*2*pi/w))*(sin((w/2-loc)*pi/l)/(pi*l));
    }
    // load initial buffer, bar the last tap
    // (loop bounds and body reconstructed; the original lines were
    // lost in extraction)
    for (int p=0; p<poff; p+=l)
    {
        s_in[p+x] = in[p0+p].x;
    }
    // load, calculate, write loop
    int tab = (taps-1)*l;
    for (int p=p0; p<pN; p+=l)
    {
        // load next buffer
        s_in[l*nxt+x] = in[p+tab].x;
        __syncthreads();
        nxt = (nxt+1)&(taps-1);
        // multiply each value by filter and sum across taps
        float sum = 0.0;
        for (int t=0; t<taps; t++)
        {
            // circular-buffer read index reconstructed (oldest to newest)
            int loc_v = l*((nxt+t)&(taps-1))+x;
            int loc_f = l*t+x;
            float val = s_in[loc_v]*1.0-127.0;
            sum += s_f[loc_f]*val;
        }
        // write filtered sum to output memory
        int po = p - (nx*(taps-1)*l);
        out[po] = make_float2(sum,0.0);
    }
}

// Kernel wrapper routine for the polyphase filter
void u_poly(float2 *out, char *in, int size, int taps, int n, int l)
{
    dim3 grid;
    dim3 block;
    // grid.x = (available 8-byte words in shared memory * multiprocessors)
    //          divided by required resources
    grid.x = 2048*64/(n*l*taps);
    grid.y = n;
    grid.z = 1;
    block.x = l;
    block.y = 1;
    block.z = 1;
    upoly<<<grid,block>>>(out, (uchar1*)in, size, taps);
}