Linköping University | Department of Computer and Information Science
Bachelor's thesis, 16 ECTS credits | Computer Engineering
Spring term 2020 | LIU-IDA/LITH-EX-G-20/033-SE

Parallelizing Digital Signal Processing for GPU

Hannes Ekstam Ljusegren, Hannes Jonsson

Supervisor: George Osipov
Examiner: Peter Jonsson

External Supervisors: Fredrik Bjurefors & Daniel Fransson

Linköpings universitet, SE-581 83 Linköping, 013-28 10 00, www.liu.se


Copyright The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Hannes Ekstam Ljusegren
© Hannes Jonsson

Abstract

Because of the increasing importance of signal processing in today's society, there is a need to easily experiment with new ways to process signals. Usually, fast digital signal processing is done with special-purpose hardware that is difficult to develop for. GPUs pose an alternative for fast digital signal processing. The work in this thesis is an analysis and implementation of a GPU version of a digital signal processing chain provided by Saab. Through an iterative process of development and testing, a final implementation was achieved. Two benchmarks, each comprising 4.2 M test samples, were made to compare the CPU implementation with the GPU implementation. The benchmarks were run on three different platforms: a desktop computer, an NVIDIA Jetson AGX Xavier and an NVIDIA Jetson TX2. The results show that the parallelized version can reach a throughput several orders of magnitude higher than that of the CPU implementation.


Table of Contents

Abstract
1. Introduction
   1.1 Background
   1.2 Problem Statement
2. Theory
   2.1 Digital Processing Chain
      2.1.1 Electromagnetic Waves
      2.1.2 ADC Sampling
      2.1.3 Fourier Transform
      2.1.4 Window Function
      2.1.5 Window Overlap
      2.1.6 WOLA
      2.1.7 Batching
      2.1.8 Pulse Detection
   2.2 Hardware
      2.2.1 Central Processing Unit
      2.2.2 Graphics Processing Unit
      2.2.3 NVIDIA Jetson AGX Xavier
      2.2.4 Air-T NVIDIA Jetson TX2
   2.3 Software
      2.3.1 Compute Unified Device Architecture
      2.3.2 cuFFT
      2.3.3 NVIDIA profiler
   2.4 Related Works
3. Method
   3.1 Parallelizing DP-chain
   3.2 Benchmarks
   3.3 Hardware Setup
   3.4 Software Setup
4. Results
   4.1 Parallelizing DP-chain
   4.2 Benchmarks
5. Discussion
   5.1 Batch Size
   5.2 Latency Sacrifice
   5.3 Pulse Activity Dependence
   5.4 Throughput Deviation
   5.5 Jetson Optimizations
   5.6 Power Mode Divergences
   5.7 Criticism of Methodology
   5.8 Credibility of Sources
   5.9 Ethics
6. Conclusion
References


1. Introduction

1.1 Background

Signal processing is an important part of many applications today, such as audio manipulation, image processing, data compression, radar and many more. In many of these cases the processing of signal data needs to be carried out very fast. For instance, when using radar in an airplane or a naval vessel it is of vital importance to quickly identify a recorded signal. Consequently, the usual solution of using a central processing unit (CPU) to carry out the computations is generally not enough to reach the desired performance requirements [1]. Therefore, the usage of special-purpose hardware is common when there is a need for time-critical signal processing.

Traditionally, signal processing has mainly been computed on highly specialized digital signal processors and field-programmable gate arrays (FPGA) due to their low energy consumption combined with high computational power. A drawback of these technologies is their high development cost, caused by the difficulty of, and technical knowledge required for, parallel programming on them [1]. Graphics processing units (GPU) provide opportunities to lower the development cost [1] and could possibly be used as a replacement for signal processing computations.

GPUs are rapidly improving in performance and efficiency. With their highly parallel structure, they make for excellent general-purpose units, especially for large independent data sets such as those seen in computer graphics and image processing. Modern GPU manufacturers also allow connecting several GPUs to work in unison, further increasing performance. Furthermore, GPUs are relatively simple to work with, using libraries such as CUDA and OpenCL. GPUs are also relatively cheap, both in terms of price and development cost, compared to DSPs and FPGAs. [2]

GPUs and FPGAs have in common that they compute efficiently in parallel. Despite that, FPGAs still outperform GPUs at some specialized tasks in terms of processing speed, and they generally require less power; they are thus widely used in many areas.

Saab works with special-purpose hardware to perform signal processing efficiently. However, because of the higher development costs associated with special-purpose hardware, they are looking for alternatives that are easier to work with, in order to speed up development and make it possible to test new functionality. Previously, a software-based implementation has been under development which would ease prototyping of new solutions. The first part of the thesis work was to revise certain parts of the software-based implementation to enable GPU utilization. The second part was to compare the current software implementation with a parallelized GPU approach on a number of platforms.

1.2 Problem Statement

The signal processing chain has three steps: an analog-to-digital converter, a channelizer and a pulse detector. The digital processing chain (DP-chain) consists of only the channelizer and the pulse detector. The analog-to-digital converter receives an electromagnetic wave from an antenna and converts it to digital samples. The samples are fed into the channelizer, which consists of a window function and a fast Fourier transform (FFT). The channelizer produces bins which represent how well the samples correlate to a certain frequency band. The final step is a pulse detector that takes the information in the frequency bins from the FFT and attempts to detect pulses. These steps are explained in further detail in chapter 2.

The study contains the following research questions:

1. What parts of the DP-chain are parallelizable on GPU?

2. How does the parallelized DP-chain perform in comparison to the CPU implementation?

The platforms used include two embedded computer modules and a desktop computer. The two embedded computer modules were the NVIDIA Jetson AGX Xavier and an Air-T platform with an NVIDIA Jetson TX2. The desktop had an NVIDIA RTX 2080 Super GPU and an Intel Core i9-9900 CPU. The chain was also measured using different power modes on the Jetson platforms.

2. Theory

To develop a parallel digital processing chain, some theory needs to be understood. This chapter goes through related works and explains the theory behind the digital processing chain.

2.1 Digital Processing Chain

In this part, the entire DP-chain is discussed in detail, including the underlying theory. The DP-chain comprises the channelizer and the pulse detector of the signal processing chain. Figure 1 shows the signal processing chain.


Figure 1. Visualization of the signal processing chain. Each step shows its input and output. The digital signal processing chain contains two major parts, the channelizer and the pulse detector. The channelizer has been split up into its smaller parts: the window function and the FFT.

2.1.1 Electromagnetic Waves

The phenomenon of an electromagnetic wave arises because a changing magnetic field induces a changing electric field, which in turn induces a changing magnetic field, and so on. This process continues and the wave propagates through space at a constant speed. Electromagnetic waves, unlike mechanical waves, do not need a medium to travel through. The speed at which an electromagnetic wave travels is the speed of light, approximately 3 · 10^8 m/s. Electromagnetic waves are commonly used to transfer information, by modulation, but they can also be used to locate objects at a distance by sending an electromagnetic pulse in a certain direction and measuring the time it takes for the pulse to bounce off the object and return. In this report, the term pulse will refer to artificially made electromagnetic waves.

2.1.2 ADC Sampling

An analog-to-digital converter (ADC) is an electronic component that takes a continuous analog signal and converts it to a digital signal. The digital signal samples are binary numbers, with a certain bit precision, that are proportional to the analog signal. An ADC has a limit on how quickly it can produce digital signal samples, the sample rate, which is measured in Hz. An antenna will induce a voltage from electromagnetic waves passing by it, which can then be converted to digital signal samples by an ADC. The number of samples that can be processed per second through the DP-chain directly influences what sample rate can be used by the ADC in a real-time environment.

2.1.3 Fourier Transform

The Fourier transform breaks up a function of time into a function of frequency. The transformed function tells how periodic the original function is at certain frequencies. The definition of the Fourier transform is:

F(\alpha) = \int_{-\infty}^{\infty} f(t) e^{-i 2\pi t \alpha} \, dt    (1)

where f denotes the function of time and F denotes the transformed function of frequency α. Both the function of time and the function of frequency output complex numbers. For the transformed function, the magnitude of the complex number tells how well the original function correlates to the given frequency, while the angle of the complex number tells what phase it corresponds best with.

Since the Fourier transform is a continuous analytical transformation, it needs to be discretized to be used in digital processing. This is called the discrete Fourier transform (DFT) and it works by taking equally spaced samples of the continuous function. The definition of the DFT is:

F_k = \sum_{n=0}^{N-1} f_n e^{-i 2\pi k n / N}    (2)

where f_n denotes the nth sample from the function of time and F_k denotes the discrete transformed function of the frequency bin indexed by k. Each of the frequency bins, F_k, consists of a band of frequencies. The frequency band per bin is determined by how many samples (referred to as N in the definition) were used in the discrete transform. The DFT can easily be implemented in software based on the definition, but it would require O(N^2) operations. There is a more widely used algorithm to compute the DFT called the fast Fourier transform (FFT), first presented by James W. Cooley and John W. Tukey [4]. The FFT algorithm is a divide-and-conquer strategy to compute the DFT; it reduces the complexity to O(N log_2(N)) and thus performs better than the naive DFT as N grows.
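To make the definition concrete, the following is a minimal C++ sketch of the naive O(N^2) DFT, computed straight from equation (2). It is illustrative only; the implementations in this thesis use FFT libraries instead (FFTW on the CPU, cuFFT on the GPU):

    #include <cmath>
    #include <complex>
    #include <vector>

    // Naive DFT straight from definition (2): two nested loops, O(N^2) operations.
    std::vector<std::complex<float>> naive_dft(const std::vector<std::complex<float>>& f) {
        const std::size_t N = f.size();
        std::vector<std::complex<float>> F(N);
        for (std::size_t k = 0; k < N; ++k) {        // one output frequency bin per k
            std::complex<float> sum(0.0f, 0.0f);
            for (std::size_t n = 0; n < N; ++n) {    // sum over all N input samples
                const float angle = -2.0f * 3.14159265f * float(k) * float(n) / float(N);
                sum += f[n] * std::complex<float>(std::cos(angle), std::sin(angle));
            }
            F[k] = sum;
        }
        return F;
    }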

2.1.4 Window Function

When trying to detect pulses, there is a need to analyze only a slice of the signal at a time. This slice is moved through time and is referred to as a window. By analyzing only part of the signal, it is possible to catch the majority of the periodic nature of a pulse in the window, so that this periodicity is identifiable when a Fourier transform is performed.

To pick out a window of a signal, a mathematical function called the window function is applied to the signal. The window function is zero-valued outside of the window interval and is multiplied by the signal, which causes the resulting signal to be zero outside the interval of the window. There are many types of window functions, and they all affect the result of a Fourier transform in different ways. Two of the major properties that the window function influences in the Fourier transform are the amount of spectral leakage and the resolution. Spectral leakage is the amount of non-zero values the Fourier transform produces outside of a signal's actual frequency. With a lot of spectral leakage, a signal A with a certain frequency and high amplitude may obscure a signal B with a different frequency and lower amplitude. With bad enough spectral leakage, this obfuscation may happen even if the frequency of A and the frequency of B are relatively far apart. The resolution is how well a specific frequency is captured by the Fourier transform. With a bad resolution, two frequencies that are very close to each other may not be differentiable. [11]

This whole windowing process can be discretized just like the Fourier transform. By taking discrete samples of the window function and multiplying them with the samples taken for the DFT, it is possible to perform the computations digitally. Figure 2 shows the processing of samples using windows.

Figure 2. Visualization of the discrete windowing process. Each line of the samples represents a 32-bit floating-point number. The window size in this case is 10 samples. The window function will therefore contain 10 floating-point numbers that are multiplied by the samples in the window. WN is the Nth position of the window in time. In this case, the window moves its entire size after each digital signal processing step.
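To illustrate the discrete windowing step, the sketch below multiplies one window of samples by precomputed window coefficients. The Hann window is used purely as an example; the thesis does not state which window function the DP-chain uses:

    #include <cmath>
    #include <vector>

    // Example window: Hann coefficients (one of many possible window functions).
    std::vector<float> make_hann(std::size_t size) {
        std::vector<float> w(size);
        for (std::size_t n = 0; n < size; ++n)
            w[n] = 0.5f - 0.5f * std::cos(2.0f * 3.14159265f * n / (size - 1));
        return w;
    }

    // Weight the samples at the current window position by the window coefficients.
    void apply_window(const float* samples, const std::vector<float>& w, float* out) {
        for (std::size_t n = 0; n < w.size(); ++n)
            out[n] = samples[n] * w[n];
    }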

2.1.5 Window Overlap

Because the window function usually approaches zero at the edges, information is lost when sliding the window whole window sizes through time. To counteract this, an overlap between windows is used. In other words, instead of sliding the window by whole window sizes, it is advanced by a smaller value. This value, describing how much the window moves for each processing step, is called the step size.

2.1.6 WOLA

By increasing the size of a window by an integer multiple of the DFT input size, a summation of multiple DFT-input-sized segments can be used as input to a single DFT. This has the effect of widening the frequency band of the output and also reducing spectral leakage. [14]

This technique will be referred to as WOLA (weight, overlap, add) in this report.
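A minimal CPU-side sketch of the WOLA step follows, under the assumption that the window is `folds` times larger than the DFT input (the benchmarks in chapter 3 use 5 · 64 = 320 samples). Names are illustrative, not taken from the thesis code:

    #include <vector>

    // WOLA: weight a window of folds * dft_size samples, then overlap-add the
    // folds segments into a single dft_size-sample DFT input.
    void wola(const float* samples, const std::vector<float>& window,
              std::size_t dft_size, std::size_t folds, float* dft_input) {
        for (std::size_t i = 0; i < dft_size; ++i)
            dft_input[i] = 0.0f;
        for (std::size_t f = 0; f < folds; ++f)
            for (std::size_t i = 0; i < dft_size; ++i)
                dft_input[i] += samples[f * dft_size + i] * window[f * dft_size + i];
    }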

2.1.7 Batching

Batching is the process of taking multiple windows and processing them in parallel. Because of the window overlap, the number of samples processed in each batch is the product of the step size and the number of windows processed per batch:

samples per batch = step size · batch size (3)

The batch size is the number of windows processed in a single batch. A visualization of why the samples-per-batch formula is reasonable is given in Figure 3.


Figure 3. Similar to Figure 2, but with window overlapping and batching. W1-W5 represent the 5 windows that will be processed in one batch. The step size shows how far the window moves before the next samples are processed. The reason why the last samples in W5 (the ones that are not overlapped by W4) are not considered processed is that the next batch will start on those samples; they are therefore not finished being processed.

2.1.8 Pulse Detection

In this report, pulse detection consists of finding the frequencies whose signal strength is above a certain threshold for a period of time. This is accomplished by iterating over each frequency bin from the output of the FFT, computing the magnitude of that bin and then checking if the value is above a constant threshold. This process is then repeated for each FFT output as the window moves through time. A pulse is said to be detected when it has continuously been above the constant threshold for a certain amount of time. In this report, this constant threshold will be referred to as the detection threshold.
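As an illustration of the detection logic for a single frequency bin, a serial C++ sketch follows; the state structure and names are hypothetical, not taken from the Saab implementation:

    #include <complex>

    // Per-bin detection state: how many consecutive FFT outputs have been
    // above the detection threshold so far.
    struct BinState { int run = 0; };

    // Called once per FFT output for one frequency bin. Returns true exactly
    // when the bin has stayed above the threshold for min_run consecutive
    // outputs, i.e. when a pulse is considered detected.
    bool update_bin(BinState& s, std::complex<float> bin,
                    float detection_threshold, int min_run) {
        if (std::abs(bin) > detection_threshold)
            return ++s.run == min_run;
        s.run = 0;
        return false;
    }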

2.2 Hardware

2.2.1 Central Processing Unit

The CPU is one of the main components of a computer. Its main function is to run programs by executing the set of instructions specified by the program. The CPU is optimized for flexibility and for running a wide range of programs serially. Consequently, it typically performs comparatively poorly on highly concurrent tasks. [7]

2.2.2 Graphics Processing Unit

Historically, the GPU was designed to process graphical workloads at superior speeds compared to the CPU. Nowadays, the GPU's computational capabilities have been expanded to more general computations that are not only graphics related.

A GPU generally contains more computing cores than a CPU. As a result, the GPU can reach a much higher throughput than a CPU. However, the high throughput usually comes at the cost of higher latency compared to a CPU. Thus, a GPU performs well in applications with a large amount of independent calculations. [7]


The architecture of a GPU varies by generation. On a high level, an NVIDIA GPU consists of a number of streaming multiprocessors (SM). Each SM in turn consists of CUDA cores, a dedicated set of registers and a shared memory. [7]

2.2.3 NVIDIA Jetson AGX Xavier

The NVIDIA Jetson AGX Xavier system has an 8-core NVIDIA Carmel ARM CPU. The AGX also has an integrated 512-core NVIDIA Volta GPU @ 1.37 GHz. The memory is 16 GB of 256-bit LPDDR4x @ 2133 MHz, allowing a data transfer rate of 136.5 GB/s. The CPU and the GPU share memory space, eliminating copy time between the components. For powering the system, there are three modes: 10 W, 15 W and 30 W, allowing for dynamic use of power depending on computational needs. [5]

2.2.4 Air-T NVIDIA Jetson TX2

The Air-T is a system designed with radio transmission and reception in mind. It has several components, including a CPU, an FPGA and an NVIDIA Jetson TX2 embedded on the board. In this study, only the TX2 will be used for measurements.

The Jetson TX2 module has two CPUs, a dual-core NVIDIA Denver CPU and a quad-core ARM Cortex-A57 CPU. The TX2 has a 256-core NVIDIA Pascal GPU. For memory, the TX2 has 8 GB of LPDDR4 @ 1866 MHz with a data transfer rate of 59.7 GB/s. This memory is shared between the CPU and the GPU. Like the Jetson Xavier, the TX2 has different power modes: a 7.5 W mode and a 15 W mode. [6]

2.3 Software

2.3.1 Compute Unified Device Architecture

Compute Unified Device Architecture (CUDA) is an API by NVIDIA for parallel computing on GPUs. The API works by extending the C++ programming language with a set of extensions and a runtime library. [8]

In CUDA, a GPU function call starts a kernel. A kernel is executed in parallel by a number of threads. The programmer organizes these threads into groups called thread blocks, which all execute the same kernel. Each thread in a thread block executes one instance of the kernel. The minimum unit of scheduled work is 32 threads, called a warp. The thread blocks are grouped in a grid. An illustration of the structure can be found in Figure 4. [8] [9]


Figure 4. The figure illustrates the thread structure. A thread has private memory. Several threads build a thread block. Several thread blocks build a grid. Every grid runs a kernel. There may be multiple grids.
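To make the kernel, thread-block and grid terminology concrete, a minimal CUDA sketch is given below (illustrative only, not code from the thesis implementation):

    #include <cuda_runtime.h>

    // Minimal kernel: every thread scales one element of a device array.
    __global__ void scale(float* data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            data[i] *= factor;
    }

    void launch_scale(float* d_data, float factor, int n) {
        int threads_per_block = 256;  // a multiple of the warp size (32)
        int blocks = (n + threads_per_block - 1) / threads_per_block;
        scale<<<blocks, threads_per_block>>>(d_data, factor, n);  // grid of blocks
    }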

2.3.2 cuFFT

CUDA Fast Fourier Transform (cuFFT) is NVIDIA's API for performing FFTs on GPUs. The API comes with two libraries, the cuFFT library and the cuFFTW library. The cuFFT library lets users perform FFTs on GPUs through ready-made, optimized routines. The cuFFTW library is mainly a porting tool for the commonly used CPU-based FFT library, the Fastest Fourier Transform in the West (FFTW). [10]
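As an illustration of the cuFFT API, the sketch below plans and executes a batched complex-to-complex FFT on device memory. The parameter values are examples and error checking is omitted:

    #include <cufft.h>

    // Run a batched forward FFT: 'batch' transforms of 'fft_size' points each.
    // d_in and d_out are device buffers holding fft_size * batch complex values.
    void run_batched_fft(cufftComplex* d_in, cufftComplex* d_out,
                         int fft_size, int batch) {
        cufftHandle plan;
        cufftPlan1d(&plan, fft_size, CUFFT_C2C, batch);  // one plan for the whole batch
        cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);  // forward transform
        cufftDestroy(plan);
    }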

2.3.3 NVIDIA profiler

Nvprof is a tool commonly used for profiling GPU applications. It allows a user to gather CUDA-related information such as kernel execution times, memory transfers and more. Nvprof is run from the command line with optional flags and an application. [8]

2.4 Related Works

F. García-Rial et al. [12] present a solution for performing real-time image processing for a 3-D THz radar, for the purpose of identifying hidden person-borne threats. Their solution contains common signal processing algorithms, such as a window function, an FFT and peak detection, and it achieved a refresh rate of more than 8 FPS for 6000 pixels. The window function was a custom-made CUDA kernel and the cuFFT library was used for the FFT. While many parts of the solution are similar to what is needed for pulse detection in this report, the application is different considering it is image-based.

Yaniv Rubinpur and Sivan Toledo [15] implemented and compared two versions of the signal processing for a wildlife tracking system, ATLAS: one where the signal processing parts (including, but not limited to, an FIR filter, a Fourier transform and an overlap-add) are done by a CPU, and one where they are done by a GPU. The comparison was done on several different setups, including a GeForce TITAN Xp, a Jetson AGX Xavier and a Jetson TX2. The authors found that the GPU implementation resulted in more than 50 times better performance, and about 5 times less power used, compared to the CPU implementation. While relatively similar to this report in both the platforms used and the steps performed, the main process is different. However, the authors' results would suggest a big performance increase in this report when comparing the CPU implementation with the GPU implementation.

P. Monsurro et al. [3] discuss bottlenecks when using GPUs as general-purpose computational units. Two major factors are the latency of sending data from the CPU to the GPU through the PCI Express interface, and the computational power. The current industry standard as of 2020 is PCIe 3.0 x16 with a bandwidth of 16 GB/s. This might be limiting on the desktop; however, both the NVIDIA Jetson AGX Xavier and the NVIDIA Jetson TX2 use an integrated GPU with a shared memory space that can receive data with a bandwidth of about 137 GB/s [5] and 60 GB/s [6] respectively, and they will consequently not be as affected by this latency. However, the computational power might be throttling the less powerful Jetson cards more than the desktop setup.

3. Method

This chapter will discuss the methodology used to achieve the goal of parallelizing the digital signal processing application, and also how the benchmarking was done to analyze the performance.

3.1 Parallelizing DP-chain

The parallelization in the study was conducted through an iterative process of evaluation, implementation and optimization.

The first step in parallelizing the DP-chain was to evaluate which sections of the DP-chain were parallelizable. This was done through discussions, through understanding the different sections of the DP-chain, and by reviewing the CPU implementation. The CPU implementation was provided by Saab and was used as the reference for test results. Once the evaluation had been conducted, an implementation was written. The implementation was then tested, and optimizations were made based on the results. This process was repeated over several iterations.

To verify that the entire GPU DP-chain worked as expected, its results were compared to the CPU implementation after every iteration. The comparisons were based on the complex numbers in the frequency bins, as well as the number of pulses found, when the GPU and CPU implementations were given the same input samples.

3.2 Benchmarks

For the test data, a sample size of roughly 4.2 M 32-bit floating-point samples was chosen. The DP-chain was evaluated using an expected best-case test and an expected worst-case test. These tests will be referred to as best-case and worst-case. The best-case test was set up to not have any pulses. This was considered best-case since having no pulses in the samples should require the least amount of work. The worst-case was set up to have both long pulses, stretching most of the runtime, as well as short and highly repetitive pulses throughout the runtime. This was considered the worst case because the highly repetitive pulses require many frequent calculations while the long pulses require heavy serial computation, thus decreasing thread efficiency on the GPU. Therefore, a combination of long pulses and short frequent pulses is expected to be difficult. The samples tested were exactly the same on all platforms.

The benchmark was measured by using the standard C++ Chrono library with the high_resolution_clock class to time every individual step of the chain. The throughput of each step, measured in samples per second, was then calculated through

throughput = (batch size · step size) / elapsed time    (4)

where batch size is the number of windows in a batch, step size is the step size used, and elapsed time is the amount of time spent in a particular function. The throughput is measured for every batch. Once all the batches have been processed, the average is taken as the final measurement.
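A minimal sketch of how one step of the chain can be timed with high_resolution_clock and converted to throughput according to equation (4); the names are illustrative, not the thesis code:

    #include <chrono>
    #include <cstddef>

    // Runs one step of the chain and returns its throughput in samples per
    // second, following equation (4).
    template <typename Step>
    double measure_throughput(Step&& step, std::size_t batch_size, std::size_t step_size) {
        auto t0 = std::chrono::high_resolution_clock::now();
        step();  // e.g. the window function, the FFT or the detector for one batch
        auto t1 = std::chrono::high_resolution_clock::now();
        double elapsed = std::chrono::duration<double>(t1 - t0).count();  // seconds
        return double(batch_size * step_size) / elapsed;
    }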

For the final benchmarks, the test was run with 17 different batch sizes, ranging from 2^0 to 2^16, on all platforms. A DFT size of 64 with a WOLA of 5 summations was used, meaning a window of 320 samples. The step size was 32. Each platform ran both the best-case test and the worst-case test once. The Jetson platforms also ran the two tests on each power mode and with jetson_clocks, a script that maximizes clock frequencies and power usage.

3.3 Hardware Setup

The benchmarks were performed on three different platforms: a desktop computer, an NVIDIA Jetson AGX Xavier and an NVIDIA Jetson TX2.

The desktop computer contained an Intel Core i9-9900 CPU, an NVIDIA RTX 2080 Super GPU with a PCIe 3.0 x16 connection, and 64 GB of RAM. The CPU's thermal design power (TDP) is 65 W.

The NVIDIA Jetson AGX Xavier is specified in section 2.2.3.

The NVIDIA Jetson TX2 is specified in section 2.2.4.

3.4 Software Setup

For the desktop computer, Linux CentOS was used as operating system, CUDA 10.2 for GPU programming and GCC 7.3.0 as C++ compiler.

For the NVIDIA Jetson AGX Xavier, Linux Ubuntu was used as operating system, CUDA 10.2 for GPU programming and GCC 7.4.0 as C++ compiler.


For the NVIDIA Jetson TX2, Linux Ubuntu was used as operating system, CUDA 10.0 for GPU programming and GCC 7.4.0 as C++ compiler.

Only the desktop setup was used to benchmark the CPU version of the DP-chain, while the GPU version was benchmarked on all three platforms. The GPU version of the DP-chain uses the same code on all three platforms, but it was compiled separately on each platform, which could create slight variations in the binaries.

4. Results

This chapter contains the results of the process defined in the methodology. It starts by presenting the parallelized algorithm that resulted from previous studies and iterative development. Then, the benchmarks of the final implementation are presented.

4.1 Parallelizing DP-chain

The CPU implementation of the digital signal processing is serial and deals with one window at a time. It gathers enough samples to fill a window, then applies the window function together with WOLA. It then runs the FFT on the preprocessed samples, using the FFTW library. After this, the pulse detector runs on the frequency bins, keeping track of frequencies with a magnitude above the constant threshold and writing out pulses if found. This concludes the processing of one window; the process then repeats by gathering new samples. The process terminates when there are no more samples to process, and it then prints out all the detected pulses.

Once the CPU implementation was understood, an evaluation based on previous studies and the implementation yielded a first version of the GPU-based digital signal processing chain. Based on F. García-Rial et al. [12] and the CUDA API documentation [8], the preprocessing (window function and WOLA) and the FFT could trivially be parallelized on the GPU. For the preprocessing, each GPU thread computes the product of the decimated window with the window function and puts the result into a GPU memory buffer. The FFT was performed using the cuFFT library. In the pulse detection step, one thread is started for each frequency bin of the FFT output. Each of these threads computes the magnitude of the complex number and checks if it is higher than the detection threshold. The pulse detection threads have a state data structure to keep track of pulses as the digital signal processing is repeated through time. This concludes the processing of one window. Similar to the CPU implementation, the process repeats by gathering new samples for the next window and is done when there are no more samples to process.

This first implementation did not reach full GPU resource usage with the desired window size. A better throughput could be achieved by also processing whole windows in parallel. The next implementation did exactly this; the number of windows processed in parallel is called the batch size. A higher batch size means waiting for the samples required to fill the batch, which causes a higher latency from the time a sample is gathered to the time it is finished being processed.
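The following CUDA sketch shows what such a batched preprocessing step (window function plus WOLA) could look like, with one thread per DFT-input element across the whole batch. It is a hedged reconstruction under the assumptions above, not the actual thesis code:

    // One thread per (window, element) pair in the batch. Consecutive windows
    // start step_size samples apart (the window overlap from section 2.1.5).
    // samples must hold (batch_size - 1) * step_size + folds * dft_size values.
    __global__ void preprocess_batch(const float* samples, const float* window,
                                     float* dft_input, int dft_size, int folds,
                                     int step_size, int batch_size) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= dft_size * batch_size) return;
        int w = idx / dft_size;                       // which window in the batch
        int i = idx % dft_size;                       // which DFT-input element
        const float* win = samples + w * step_size;   // start of this window
        float acc = 0.0f;
        for (int f = 0; f < folds; ++f)               // weight, overlap, add
            acc += win[f * dft_size + i] * window[f * dft_size + i];
        dft_input[w * dft_size + i] = acc;
    }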

In the final iteration, a new detector was written, because the detector turned out to be a large bottleneck. This was caused by the fact that, for large batch sizes, the detector serially analyzes each FFT output in the batch even though the frequency bins are processed in parallel. To get more parallel utilization, an early-out algorithm was implemented. This algorithm starts one thread per FFT output in the whole batch. Each thread then looks at the previous signal strength and the current signal strength to determine if its position is the start of a pulse. If it is, the thread scans through the batch until it reaches the end of the pulse. This shortens the amount of linear analysis in the detector to only the length of a pulse, whereas the previous detector always went linearly through the whole batch. This provided a considerable speedup when there was low pulse activity in the sample data, as will be shown in 4.2.
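A sketch of the early-out idea is shown below: one thread per (time, bin) position in the batch, where a thread only scans forward if its position is the start of a pulse. Handling of pulses that cross batch boundaries, and the recording of pulse descriptors, are omitted; the names are illustrative, not the thesis code:

    // mags holds the bin magnitudes of the whole batch, laid out as [time][bin].
    __global__ void early_out_detect(const float* mags, int num_bins, int batch_size,
                                     float threshold, int min_run, int* pulse_count) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= num_bins * batch_size) return;
        int t = idx / num_bins;
        int bin = idx % num_bins;
        bool above = mags[t * num_bins + bin] > threshold;
        bool prev = (t > 0) && (mags[(t - 1) * num_bins + bin] > threshold);
        if (!above || prev) return;     // early out: this is not the start of a pulse
        int len = 0;                    // scan only as far as the pulse lasts
        while (t + len < batch_size && mags[(t + len) * num_bins + bin] > threshold)
            ++len;
        if (len >= min_run)
            atomicAdd(pulse_count, 1);  // record that a pulse was detected
    }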

4.2 Benchmarks

The following results show the last iteration of testing when running the test samples through the entire DP-chain. For the desktop throughput data (Figure 5), the best-case test doubled in performance for every doubling of the batch size until a batch size of around 2048, where it started to plateau. The best-case test peaked at roughly 2 billion samples per second. For the worst-case, the GPU peaked at a batch size of 4096 with about 200 M samples per second, and started to plateau at a batch size of around 256. The CPU consistently processed 4.6 M samples per second (batch size is not applicable) and serves as a reference.

Figure 5. The figure shows a log-scaled throughput chart when running samples through the entire DP-chain on the desktop setup. The vertical axis is scaled as log10 and the horizontal axis as log2. The GPU best-case shows the number of samples processed per second during the best-case test. The GPU worst-case shows the number of samples processed per second during the worst-case test. The CPU shows the number of samples processed with the CPU implementation and is not affected by the signal size or batch size.

For the AGX throughput data (Figure 6), the test cases behaved similarly to the desktop throughput data. The best-case test doubled in performance for all power modes until a batch size of about 2048, where it starts to plateau. When measuring the AGX without a power limit, it reaches a peak at about 1 B samples per second. The performance drops about 40% from no limit to the 30 W mode, and a similar relation holds between the other power modes. For the worst-case test, the AGX peaks at a batch size of about 1024 at about 40-90 M samples processed per second, depending on which power mode is used. The performance then declines before starting to increase again as the batch size increases.

Figure 6. The figure shows a log-scaled throughput chart when running samples through the entire DP-chain on the NVIDIA Jetson AGX Xavier setup. The vertical axis is scaled as log10 and the horizontal axis as log2. The graph shows the performance of the different power modes in both test cases. The AGX best-case and the AGX worst-case represent the AGX without limited power consumption. The CPU shows the number of samples processed with the CPU implementation and is not affected by signal size or batch size.

The TX2 shows similar patterns in the throughput data (Figure 7) as the AGX. In the best-case test, performance doubles when the batch size is doubled, up to a batch size of 1024, for all power modes, before plateauing for bigger batch sizes. For the worst-case test, performance doubles up to a batch size of 1024. For larger sizes, it dips in performance before settling around 50-80 M samples per second.

Furthermore, as shown in Figures 8 and 9, the time taken for each section of the DP-chain is roughly 5-10 microseconds for batch sizes up to 256, except for the detector in the worst-case test. In the best-case test, larger batch sizes show the memory copy as the prominent bottleneck, taking approximately 70% of the total time. After a batch size of 2048, the time starts to double with each doubling of the batch size, meaning the time taken increases linearly beyond this point. In the worst-case test, however, larger batch sizes show the detector as the prominent bottleneck, taking approximately 92% of the total time. The time taken by the different sections on the Jetson platforms is distributed in a similar way as on the desktop for both of the test cases.

Figure 7. The figure shows a log-scaled throughput chart when running samples through the entire DP-chain on the NVIDIA Jetson TX2 setup. The vertical axis is scaled as log10 and the horizontal axis as log2. The graph shows the performance of the different power modes in both test cases. The CPU shows the number of samples processed with the CPU implementation and is not affected by signal size or batch size.


Figure 8. The figure shows the time it took for a batch to be processed through individual sectors of the DP-chain during the best-case. The Y-axis shows the time in microseconds and the X-axis the batch size used on a log2 scale.

Figure 9. The figure shows the time it took for a batch to be processed through individual sectors of the DP-chain during the worst-case. The Y-axis shows the time in microseconds and the X-axis the batch size used on a log2 scale.

5. Discussion

This chapter will discuss noteworthy aspects of the results.

5.1 Batch Size

The most prominent factor for gaining throughput on the GPU was the batch size. As can be seen in Figures 5-7, the throughput got consistently higher with the batch size until tapering off. This is expected behavior, since a larger batch size utilizes the GPU's threads more efficiently up to a certain point. This point represents near-maximum concurrent resource usage on the GPU.

At low batch sizes, the throughput of the GPU is quite low, even lower than the CPU's. This shows that when the parallel power of the GPU is not used, it is outperformed by the serial performance of the CPU. The minimum batch size needed to perform better than the CPU differs between platforms, but all platforms reached higher throughputs at some point. The desktop reached a higher throughput at a batch size of 4, for both the expected best and worst case. All configurations of the AGX matched or exceeded the throughput of the CPU at a batch size of 32, for both the expected best and worst case. All configurations of the TX2 matched or exceeded the throughput of the CPU at a batch size of 128, for both the expected best and worst case.

5.2 Latency Sacrifice

In a real-time scenario, batching requires waiting for more samples to be gathered. This increases the latency, in the sense that it takes longer from receiving an input to getting an output. This is of course only a problem in cases where real-time signal processing with low latency is required. For non-real-time processing, a higher batch size will almost always result in faster processing of samples, depending on the size of the data that needs to be processed.

5.3 Pulse Activity Dependence

A common pattern in the throughput results is the impact of the worst-case data. The worst-case data can reduce the throughput by as much as an order of magnitude compared to the best-case data. The reason for this can be seen in Figures 8 and 9. Figure 8 shows that the largest bottleneck is the memory transfer from host to device, while all the other parts of the DP-chain are by comparison not as time-consuming. On the other hand, Figure 9 shows that the pulse detector is a major bottleneck even when compared to the memory transfer. This can be explained by the fact that the detector is the only part of the DP-chain whose performance depends on what is in the sample data. Therefore, a lot of pulse activity, especially large pulses, causes the detector to become a major bottleneck. Figures 8 and 9 only show the extreme expected cases of worst and best data, so depending on the application and pulse activity, the average throughput may vary between these cases.

5.4 Throughput Deviation

When running the GPU tests, a large deviation was found in the performance from test to test, usually in one or two of the batch sizes. This deviation shows itself in, e.g., Figure 9 and explains why the detector's time is not completely linear. The reason for this deviation is difficult to explain without scheduling the GPU manually, but it could for instance be that the monitor was rendered during that exact time, or that the GPU overheated and had to throttle for a while.

5.5 Jetson Optimizations

Since the DP-chain was ported directly from the desktop to the Jetson systems without code modification, there may be some hardware-specific optimizations that could increase performance. One such optimization is to utilize the fact that the integrated GPU shares memory with the CPU, meaning that the memory transfers are not needed. Considering that there is only one major memory transfer in the final DP-chain algorithm, this should not affect the results presented here in a significant way, but it shows that there are possibilities for platform-specific optimizations.

5.6 Power Mode Divergences

The following data is taken from the best-case test.

In Figure 6, the power modes' different throughputs are presented for the AGX platform. The AGX Xavier, when limited to the 30 W power mode, has a maximum throughput of about 750 M samples per second, which is 25 M samples/s per watt. The 15 W power mode has a max throughput of about 410 M per second, equaling roughly 27 M samples/s per watt. The 10 W power mode has a max throughput of about 230 M, equaling roughly 23 M samples/s per watt. For the AGX Xavier in no-limit mode, a reference power lies around ~50 W [13] with a max throughput of about 1.1 B, equaling roughly 22 M samples/s per watt.

In Figure 7, the power modes' different throughputs are presented for the TX2 platform. The TX2 platform, when running in the 15 W power mode, has a max performance of 260 M samples per second, equaling 17.3 M samples/s per watt. The 7.5 W power mode has a max performance of 214 M samples per second, equaling 28.5 M samples/s per watt. For the TX2 in no-limit mode, a reference power lies around ~40 W, with a max throughput of about 309 M, reaching about 7.7 M samples/s per watt.

As seen through this analysis (Table 1), the AGX seems to lower its capabilities proportionally to the power supplied. The discrepancy seen in this case could be attributed to the throughput deviation of the test results. For the TX2, however, there seems to be only a small loss of capabilities for a relatively big decrease in power supplied. The CPU in Table 1 uses its TDP of 65 W as the reference power consumption and its max throughput of 4.6 M.

Platform          Samples/s per watt    Max samples/s
AGX (no limit)    22 M                  1129 M
AGX 30 W          25 M                  754 M
AGX 15 W          27 M                  417 M
AGX 10 W          23 M                  231 M
TX2 (no limit)    7.7 M                 309 M
TX2 15 W          17.3 M                258 M
TX2 7.5 W         28.5 M                214 M
CPU*              70 K                  4.6 M

Table 1. The table shows comparisons of the different platforms and power modes in terms of sample throughput. AGX (no limit) and TX2 (no limit) represent the no-limit mode of the respective platforms. Max samples/s is the highest throughput reached during the benchmarks. *The CPU uses its TDP (65 W) as reference power.

5.7 Criticism of Methodology

The fact that the detector is such a large bottleneck is a problem that could probably be solved. Given more time, the detector could have been rewritten to utilize some of the reduction methods that are commonly performed on the GPU. Thus, with some more study of how to solve reduction problems on the GPU, perhaps a more efficient solution could have been made.

The benchmark results presented are each from a single run of the 4.2 M test samples. This introduces some uncertainty in the results. A more accurate method would have been to run the benchmark multiple times for each batch size and present the average of those runs instead. It would also have been appropriate to include standard deviations in that case.

Another criticism of the method is the lack of a real-life test. The current tests use an expected best-case and an expected worst-case test, which show an accurate range of the capabilities of the parallelized program. However, they do not show a real-life application, i.e., a sample set based on real data. It is therefore impossible to say whether the program would run closer to the best-case test or the worst-case test in terms of performance.

5.8 Credibility of Sources

Some of the sources are from peer-reviewed journals or conferences, making them trustworthy. Other sources, such as NVIDIA's own documentation, are deemed trustworthy since the content concerns NVIDIA's own products and programming environment. One of the sources comes from an NVIDIA employee in a forum. Because it is an NVIDIA employee, in the NVIDIA forums, making the claim, it is deemed trustworthy. One of the sources is taken from arXiv, an open-access site for scientific papers. While not every article undergoes full peer review, the papers are approved by moderators; it is thus deemed reliable. The final questionable source is an article from "The Collaboration for Astronomy Signal Processing and Electronics Research", CASPER for short. This collaboration consists of multiple university institutions and is thus deemed reliable.

5.9 Ethics

As this thesis has mainly contributed by investigating whether a part of the signal processing chain could be parallelized on a GPU, followed by actually parallelizing it, there are not many ethical questions to discuss. However, as seen in Table 1, the CPU has quite poor performance per watt compared to the Jetson cards. Using the algorithm in this report would thus save power.

6. Conclusion

Every part of the digital signal processing chain can be parallelized, although each part shows a different amount of improvement. The parallelized digital signal processing chain is only beneficial in terms of throughput if a lot of samples are batched at a time. The amount of batching required to show improvement over the non-parallelized implementation varies depending on the platform, but on all platforms tested, the parallelized version shows improvement after a certain batch size. Using a large batch size causes a larger latency from input to output, which may affect real-time applications that need fast response times. Therefore, a tradeoff between latency and throughput may be important to consider.

The parts of the digital signal processing chain showed different amounts of improvement, which often reflects how well that part of the parallelized algorithm manages to use all available GPU threads. All parts show very significant improvements except for the pulse detector, which in the worst case can become quite linear. Therefore, future work to improve the pulse detector might be possible with an algorithm that better utilizes the concurrency of the GPU. This may be possible with common reduction algorithms.

Another future work opportunity is to make platform-specific optimizations on the Jetson hardware. For example, a reduction of memory transfers may be possible since these embedded systems have an integrated GPU that shares memory with the CPU. Additionally, other platform-specific optimizations would be interesting to identify.

References

1. A. HajiRassouliha, A. J. Taberner, M. P. Nash, P. M. F. Nielsen, "Suitability of recent hardware accelerators (DSPs, FPGAs and GPUs) for computer vision and image processing algorithms", Signal Processing: Image Communication, vol. 68, pp. 101-119, Oct. 2018.
2. R. S. Perdana, B. Sitohang and A. B. Suksmono, "A survey of graphics processing unit (GPU) utilization for radar signal and data processing system", 2017 6th International Conference on Electrical Engineering and Informatics (ICEEI), Langkawi, 2017, pp. 1-6, doi: 10.1109/ICEEI.2017.8312430.
3. P. Monsurro, A. Trifiletti, F. Lannutti, "Implementing radar algorithms on CUDA hardware", 2014 MIXDES, pp. 455-458, 19-21 June 2014.
4. James W. Cooley, John W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series", Mathematics of Computation, vol. 19, no. 90, pp. 287-301, April 1965.
5. NVIDIA Jetson Xavier System-on-Module Data Sheet (retrieved 2020-02-10). Available at: https://developer.nvidia.com/embedded/downloads#?search=Jetson%20AGX%20Xavier&tx=$product,jetson_agx_xavier
6. Jetson TX2 Series Module Data Sheet (retrieved 2020-05-05). Available at: https://developer.nvidia.com/embedded/downloads#?search=Data%20Sheet
7. William Stallings, "Computer Organization and Architecture: Designing for Performance", 10th edition, Pearson Education.
8. CUDA programming API documentation. Available at: https://docs.nvidia.com/cuda/
9. NVIDIA Fermi GPU Architecture whitepaper. Available at: https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
10. NVIDIA cuFFT library API documentation. Available at: https://docs.nvidia.com/cuda/cufft/index.html

11. Gerhard Heinzel, Albrecht Rüdiger and Roland Schilling, "Spectrum and spectral density estimation by the Discrete Fourier transform (DFT), including a comprehensive list of window functions and some new flat-top windows", Internal Report, Max-Planck-Institut für Gravitationsphysik, Hannover, 2002.
12. F. García-Rial, L. Úbeda-Medina and J. Grajal, "Real-time GPU-based image processing for a 3-D THz radar", IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 10, pp. 2953-2964, Oct. 2017.
13. AGX watt design reference (retrieved 2020-05-26). Available at: https://forums.developer.nvidia.com/t/xavier-power-consumption-information-solved/64595/5
14. "The Polyphase Filter Bank Technique". Available at: https://casper.ssl.berkeley.edu/wiki/The_Polyphase_Filter_Bank_Technique
15. Yaniv Rubinpur and Sivan Toledo, "High-Performance GPU and CPU Signal Processing for a Reverse-GPS Wildlife Tracking System", arXiv preprint arXiv:2005.10445, 2020.
