Linköping University | Department of Computer and Information Science
Bachelor's thesis, 16 ECTS credits | Computer Engineering
Spring term 2020 | LIU-IDA/LITH-EX-G-20/033-SE

Parallelizing Digital Signal Processing for GPU

Hannes Ekstam Ljusegren, Hannes Jonsson

Supervisor: George Osipov
Examiner: Peter Jonsson

External Supervisors: Fredrik Bjurefors & Daniel Fransson

Linköpings universitet, SE-581 83 Linköping, 013-28 10 00, www.liu.se


Copyright The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Hannes Ekstam Ljusegren
© Hannes Jonsson

Abstract

Because of the increasing importance of signal processing in today's society, there is a need to easily experiment with new ways to process signals. Usually, fast digital signal processing is done with special-purpose hardware that is difficult to develop for. GPUs pose an alternative for fast digital signal processing. The work in this thesis is an analysis and implementation of a GPU version of a digital signal processing chain provided by Saab. Through an iterative process of development and testing, a final implementation was achieved. Two benchmarks, each comprising 4.2 M test samples, were made to compare the CPU implementation with the GPU implementation. The benchmarks were run on three different platforms: a desktop computer, an NVIDIA Jetson AGX Xavier and an NVIDIA Jetson TX2. The results show that the parallelized version can reach a throughput several orders of magnitude higher than that of the CPU implementation.


Table of Contents

Abstract
1. Introduction
   1.1 Background
   1.2 Problem Statement
2. Theory
   2.1 Digital Processing Chain
      2.1.1 Electromagnetic Waves
      2.1.2 ADC Sampling
      2.1.3 Fourier Transform
      2.1.4 Window Function
      2.1.5 Window Overlap
      2.1.6 WOLA
      2.1.7 Batching
      2.1.8 Pulse Detection
   2.2 Hardware
      2.2.1 Central Processing Unit
      2.2.2 Graphics Processing Unit
      2.2.3 NVIDIA Jetson AGX Xavier
      2.2.4 Air-T NVIDIA Jetson TX2
   2.3 Software
      2.3.1 Compute Unified Device Architecture
      2.3.2 cuFFT
      2.3.3 NVIDIA profiler
   2.4 Related Works
3. Method
   3.1 Parallelizing DP-chain
   3.2 Benchmarks
   3.3 Hardware Setup
   3.4 Software Setup
4. Results
   4.1 Parallelizing DP-chain
   4.2 Benchmarks
5. Discussion
   5.1 Batch Size
   5.2 Latency Sacrifice
   5.3 Pulse Activity Dependence
   5.4 Throughput Deviation
   5.5 Jetson Optimizations
   5.6 Power Mode Divergences
   5.7 Criticism of Methodology
   5.8 Credibility of Sources
   5.9 Ethics
6. Conclusion
References


1. Introduction

1.1 Background

Signal processing is an important part of many applications today, such as audio manipulation, image processing, data compression, radar and many more. In many of these cases the processing of signal data needs to be carried out very fast. For instance, when using radar in an airplane or a naval vessel it is of vital importance to quickly identify a recorded signal. Consequently, the usual solution of using a central processing unit (CPU) to carry out the computations is generally not enough to reach the desired performance requirements [1]. Therefore, the usage of special-purpose hardware is common when there is a need for time-critical signal processing.

Traditionally, signal processing has mainly been computed on highly specialized digital signal processors and field-programmable gate arrays (FPGA) due to their low energy consumption combined with high computational power. A drawback of these technologies is their high development cost, caused by the difficulty of, and technical knowledge required for, parallel programming on them [1]. Graphics processing units (GPU) provide opportunities to lower the development cost [1] and could possibly be used as a replacement for signal processing computations.

GPUs are rapidly improving in performance and efficiency. With their highly parallel structure, they make for excellent general-purpose units, especially for large independent data sets such as those seen in computer graphics and image processing. Modern GPU manufacturers also allow connecting several GPUs to work in unison, further increasing performance. Furthermore, GPUs are relatively simple to work with, using libraries such as CUDA and OpenCL. GPUs are also relatively cheap, both in terms of price and development cost, compared to DSPs and FPGAs. [2]

GPUs and FPGAs have in common that they compute efficiently in parallel. Despite that, FPGAs still outperform GPUs at some specialized tasks in terms of processing speed, and they generally require less power; they are thus widely used in many areas.

Saab works with special-purpose hardware to perform signal processing efficiently. However, because of the higher development costs associated with special-purpose hardware, they are looking for alternatives that are easier to work with, in order to speed up development and make it possible to test new functionality. Previously, a software-based implementation has been under development which would ease prototyping of new solutions. The first part of the thesis work was to revise certain parts of the software-based implementation to enable GPU utilization. The second part was to compare the current software implementation with a parallelized GPU approach on a number of platforms.

1.2 Problem Statement

The signal processing chain has three steps: an analog-to-digital converter, a channelizer and a pulse detector. The digital processing chain (DP-chain) consists of only the channelizer and the pulse detector. The analog-to-digital converter receives an electromagnetic wave from an antenna and converts it to digital samples. The samples are fed into the channelizer, which consists of a window function and a fast Fourier transform (FFT). The channelizer produces bins which represent how well the samples correlate to a certain frequency band. The final step is a pulse detector that takes the information in the frequency bins from the FFT and attempts to detect pulses. These steps are explained in further detail in chapter 2.

The study contains the following research questions:

1. What parts of the DP-chain are parallelizable on GPU?

2. How does the parallelized DP-chain perform in comparison to the CPU implementation?

The platforms used include two embedded computer modules and a desktop computer. The two embedded computer modules were the NVIDIA Jetson AGX Xavier and an Air-T platform with an NVIDIA Jetson TX2. The desktop had an NVIDIA RTX 2080 Super GPU and an Intel Core i9-9900 CPU. The chain was also measured using different power modes on the Jetson platforms.

2. Theory

To develop a parallel digital processing chain, some theory needs to be understood. This chapter goes through related works and explains the theory behind the digital processing chain.

2.1 Digital Processing Chain

In this part, the entire DP-chain is discussed in detail, including the underlying theory. The DP-chain comprises the channelizer and the pulse detector of the signal processing chain. Figure 1 shows the signal processing chain.


Figure 1. Visualization of the signal processing chain. Each step shows its input and output. The digital signal processing chain contains two major parts, the channelizer and the pulse detector. The channelizer has been split up into its smaller parts: the window function and the FFT.

2.1.1 Electromagnetic Waves

The phenomenon of an electromagnetic wave arises because a changing magnetic field induces a changing electric field, which in turn induces a changing magnetic field, and so on. This process continues and the wave propagates through space at a constant speed. Electromagnetic waves, unlike mechanical waves, do not need a medium to travel through. The speed at which an electromagnetic wave travels is the speed of light, approximately 3 · 10^8 m/s. Electromagnetic waves are commonly used to transfer information, by modulation, but they can also be used to locate objects at a distance by sending an electromagnetic pulse in a certain direction and measuring the time it takes for the pulse to bounce off the object and return. In this report, the term pulse will refer to artificially made electromagnetic waves.

2.1.2 ADC Sampling

An analog-to-digital converter (ADC) is an electronic component that takes a continuous analog signal and converts it to a digital signal. The digital signal samples are binary numbers, with a certain bit precision, that are proportional to the analog signal. An ADC has a limit on how quickly it can produce digital signal samples, the sample rate, which is measured in Hz. An antenna will induce a voltage from electromagnetic waves passing by it, which can then be converted to digital signal samples by an ADC. The number of samples that can be processed per second through the DP-chain directly influences what sample rate can be used by the ADC in a real-time environment.

2.1.3 Fourier Transform

The Fourier transform breaks up a function of time into a function of frequency. The transformed function tells how periodic the original function is at certain frequencies. The definition of the Fourier transform is:

F(\alpha) = \int_{-\infty}^{\infty} f(t) e^{-i 2\pi t \alpha} \, dt    (1)

where f denotes the function of time and F denotes the transformed function of frequency α. Both the function of time and the function of frequency output complex numbers. For the transformed function, the magnitude of the complex number tells how well the original function correlates to the given frequency, while the angle of the complex number tells what phase it corresponds best with.

Since the Fourier transform is a continuous analytical transformation, it needs to be discretized to be used in digital processing. This is called the discrete Fourier transform (DFT) and it works by taking equally spaced samples of the continuous function. The definition of the DFT is:

F_k = \sum_{n=0}^{N-1} f_n e^{-i 2\pi k n / N}    (2)

where f_n denotes the nth sample from the function of time and F_k denotes the discrete transformed function of the frequency bin indexed by k. Each of the frequency bins, F_k, consists of a band of frequencies. The frequency band per bin is determined by how many samples (referred to as N in the definition) were used in the discrete transform. The DFT can easily be implemented in software based on the definition, but it would require O(N^2) operations. There is a more widely used algorithm to compute the DFT called the fast Fourier transform (FFT), first presented by James W. Cooley and John W. Tukey [4]. The FFT algorithm is a divide-and-conquer strategy to compute the DFT; it reduces the complexity to O(N log_2(N)) and thus performs better than the naive DFT as N grows.
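To make the definition concrete, the following is a minimal C++ sketch of the naive O(N^2) DFT, computed straight from equation (2). It is illustrative only; the implementations in this thesis use FFT libraries instead (FFTW on the CPU, cuFFT on the GPU):

    #include <cmath>
    #include <complex>
    #include <vector>

    // Naive DFT straight from definition (2): two nested loops, O(N^2) operations.
    std::vector<std::complex<float>> naive_dft(const std::vector<std::complex<float>>& f) {
        const std::size_t N = f.size();
        std::vector<std::complex<float>> F(N);
        for (std::size_t k = 0; k < N; ++k) {        // one output frequency bin per k
            std::complex<float> sum(0.0f, 0.0f);
            for (std::size_t n = 0; n < N; ++n) {    // sum over all N input samples
                const float angle = -2.0f * 3.14159265f * float(k) * float(n) / float(N);
                sum += f[n] * std::complex<float>(std::cos(angle), std::sin(angle));
            }
            F[k] = sum;
        }
        return F;
    }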

2.1.4 Window Function

When trying to detect pulses, there is a need to analyze only a slice of the signal at a time. This slice is moved through time and is referred to as a window. By analyzing only part of the signal, it is possible to catch the majority of the periodic nature of a pulse in the window, so that this periodicity is identifiable when a Fourier transform is performed.

To pick out a window of a signal, a mathematical function called the window function is applied to the signal. The window function is zero-valued outside of the window interval and is multiplied by the signal, which causes the resulting signal to be zero outside the interval of the window. There are many types of window functions, and they all affect the result of a Fourier transform in different ways. Two of the major properties that the window function influences in the Fourier transform are the amount of spectral leakage and the resolution. Spectral leakage is the amount of non-zero values the Fourier transform produces outside of a signal's actual frequency. With a lot of spectral leakage, a signal A with a certain frequency and high amplitude may obscure a signal B with a different frequency and lower amplitude. With bad enough spectral leakage, this obfuscation may happen even if the frequency of A and the frequency of B are relatively far apart. The resolution is how well a specific frequency is captured by the Fourier transform. With a bad resolution, two frequencies that are very close to each other may not be differentiable. [11]

This whole windowing process can be discretized just like the Fourier transform. By taking discrete samples of the window function and multiplying them with the samples taken for the DFT, it is possible to perform the computations digitally. Figure 2 shows the processing of samples using windows.

Figure 2. Visualization of the discrete windowing process. Each line of the samples represents a 32-bit floating-point number. The window size in this case is 10 samples. The window function will therefore contain 10 floating-point numbers that are multiplied by the samples in the window. WN is the Nth position of the window in time. In this case, the window moves its entire size after each digital signal processing step.
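To illustrate the discrete windowing step, the sketch below multiplies one window of samples by precomputed window coefficients. The Hann window is used purely as an example; the thesis does not state which window function the DP-chain uses:

    #include <cmath>
    #include <vector>

    // Example window: Hann coefficients (one of many possible window functions).
    std::vector<float> make_hann(std::size_t size) {
        std::vector<float> w(size);
        for (std::size_t n = 0; n < size; ++n)
            w[n] = 0.5f - 0.5f * std::cos(2.0f * 3.14159265f * n / (size - 1));
        return w;
    }

    // Weight the samples at the current window position by the window coefficients.
    void apply_window(const float* samples, const std::vector<float>& w, float* out) {
        for (std::size_t n = 0; n < w.size(); ++n)
            out[n] = samples[n] * w[n];
    }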

2.1.5 Window Overlap

Because the window function usually approaches zero at the edges, information is lost when sliding the window whole window sizes through time. To counteract this, an overlap between windows is used. In other words, instead of sliding the window by whole window sizes, it is advanced by a smaller value. This value, describing how much the window moves for each processing step, is called the step size.

2.1.6 WOLA

By increasing the size of a window by an integer multiple of the DFT input size, a summation of multiple DFT-input-sized segments can be used as input to a single DFT. This has the effect of widening the frequency band of the output and also reducing spectral leakage. [14]

This technique will be referred to as WOLA (weight, overlap, add) in this report.
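A minimal CPU-side sketch of the WOLA step follows, under the assumption that the window is `folds` times larger than the DFT input (the benchmarks in chapter 3 use 5 · 64 = 320 samples). Names are illustrative, not taken from the thesis code:

    #include <vector>

    // WOLA: weight a window of folds * dft_size samples, then overlap-add the
    // folds segments into a single dft_size-sample DFT input.
    void wola(const float* samples, const std::vector<float>& window,
              std::size_t dft_size, std::size_t folds, float* dft_input) {
        for (std::size_t i = 0; i < dft_size; ++i)
            dft_input[i] = 0.0f;
        for (std::size_t f = 0; f < folds; ++f)
            for (std::size_t i = 0; i < dft_size; ++i)
                dft_input[i] += samples[f * dft_size + i] * window[f * dft_size + i];
    }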

2.1.7 Batching

Batching is the process of taking multiple windows and processing them in parallel. Because of the window overlap, the number of samples processed in each batch is the product of the step size and the number of windows processed per batch:

samples per batch = step size · batch size (3)

The batch size is the number of windows processed in a single batch. A visualization of why the samples-per-batch formula is reasonable is given in Figure 3.


Figure 3. Similar to Figure 2, but with window overlapping and batching. W1-W5 represent the 5 windows that will be processed in one batch. The step size shows how far the window moves before the next samples are processed. The reason why the last samples in W5 (the ones that are not overlapped by W4) are not considered processed is that the next batch will start on those samples; they are therefore not finished being processed.

2.1.8 Pulse Detection

In this report, pulse detection consists of finding the frequencies whose signal strength is above a certain threshold for a period of time. This is accomplished by iterating over each frequency bin from the output of the FFT, computing the magnitude of that bin and then checking if the value is above a constant threshold. This process is then repeated for each FFT output as the window moves through time. A pulse is said to be detected when it has continuously been above the constant threshold for a certain amount of time. In this report, this constant threshold will be referred to as the detection threshold.
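As an illustration of the detection logic for a single frequency bin, a serial C++ sketch follows; the state structure and names are hypothetical, not taken from the Saab implementation:

    #include <complex>

    // Per-bin detection state: how many consecutive FFT outputs have been
    // above the detection threshold so far.
    struct BinState { int run = 0; };

    // Called once per FFT output for one frequency bin. Returns true exactly
    // when the bin has stayed above the threshold for min_run consecutive
    // outputs, i.e. when a pulse is considered detected.
    bool update_bin(BinState& s, std::complex<float> bin,
                    float detection_threshold, int min_run) {
        if (std::abs(bin) > detection_threshold)
            return ++s.run == min_run;
        s.run = 0;
        return false;
    }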

2.2 Hardware

2.2.1 Central Processing Unit

The CPU is one of the main components of a computer. Its main function is to run programs by executing the set of instructions specified by the program. The CPU is optimized for flexibility and for running a wide range of programs serially. Consequently, it typically performs comparatively poorly on highly concurrent tasks. [7]

2.2.2 Graphics Processing Unit

Historically, the GPU was designed to process graphical workloads at superior speeds compared to the CPU. Nowadays, the GPU's computational capabilities have been expanded to more general computations that are not only graphics related.

A GPU generally contains more computing cores than a CPU. As a result, the GPU can reach a much higher throughput than a CPU. However, the high throughput usually comes at the cost of higher latency compared to a CPU. Thus, a GPU performs well in applications with a large amount of independent calculations. [7]


The architecture of a GPU varies by generation. On a high level, an NVIDIA GPU consists of a number of streaming multiprocessors (SM). Each SM in turn consists of CUDA cores, a dedicated set of registers and a shared memory. [7]

2.2.3 NVIDIA Jetson AGX Xavier

The NVIDIA Jetson AGX Xavier system has an 8-core NVIDIA Carmel ARM CPU. The AGX also has an integrated 512-core NVIDIA Volta GPU @ 1.37 GHz. The memory is 16 GB of 256-bit LPDDR4x @ 2133 MHz, allowing a data transfer rate of 136.5 GB/s. The CPU and the GPU share memory space, eliminating copy time between the components. For powering the system, there are three modes: 10 W, 15 W and 30 W, allowing for dynamic use of power depending on computational needs. [5]

2.2.4 Air-T NVIDIA Jetson TX2

The Air-T is a system designed with radio transmission and reception in mind. It has several components, including a CPU, an FPGA and an NVIDIA Jetson TX2 embedded on the board. In this study, only the TX2 will be used for measurements.

The Jetson TX2 module has two CPUs, a dual-core NVIDIA Denver CPU and a quad-core ARM Cortex-A57 CPU. The TX2 has a 256-core NVIDIA Pascal GPU. For memory, the TX2 has 8 GB of LPDDR4 @ 1866 MHz with a data transfer rate of 59.7 GB/s. This memory is shared between the CPU and the GPU. Like the Jetson Xavier, the TX2 has different power modes: a 7.5 W mode and a 15 W mode. [6]

2.3 Software

2.3.1 Compute Unified Device Architecture

Compute Unified Device Architecture (CUDA) is an API by NVIDIA for parallel computing on GPUs. The API works by extending the C++ programming language with a set of extensions and a runtime library. [8]

In CUDA, a GPU function call starts a kernel. A kernel is executed in parallel by a number of threads. The programmer organizes these threads into groups called thread blocks, which all execute the same kernel. Each thread in a thread block executes one instance of the kernel. The minimum unit of scheduled work is 32 threads, called a warp. The thread blocks are grouped in a grid. An illustration of the structure can be found in Figure 4. [8] [9]


Figure 4. The figure illustrates the thread structure. A thread has private memory. Several threads build a thread block. Several thread blocks build a grid. Every grid runs a kernel. There may be multiple grids.
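To make the kernel, thread-block and grid terminology concrete, a minimal CUDA sketch is given below (illustrative only, not code from the thesis implementation):

    #include <cuda_runtime.h>

    // Minimal kernel: every thread scales one element of a device array.
    __global__ void scale(float* data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            data[i] *= factor;
    }

    void launch_scale(float* d_data, float factor, int n) {
        int threads_per_block = 256;  // a multiple of the warp size (32)
        int blocks = (n + threads_per_block - 1) / threads_per_block;
        scale<<<blocks, threads_per_block>>>(d_data, factor, n);  // grid of blocks
    }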

2.3.2 cuFFT

CUDA Fast Fourier Transform (cuFFT) is NVIDIA's API for performing FFTs on GPUs. The API comes with two libraries, the cuFFT library and the cuFFTW library. The cuFFT library lets users perform FFTs on GPUs through ready-made, optimized routines. The cuFFTW library is mainly a porting tool for the commonly used CPU-based FFT library, the Fastest Fourier Transform in the West (FFTW). [10]
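As an illustration of the cuFFT API, the sketch below plans and executes a batched complex-to-complex FFT on device memory. The parameter values are examples and error checking is omitted:

    #include <cufft.h>

    // Run a batched forward FFT: 'batch' transforms of 'fft_size' points each.
    // d_in and d_out are device buffers holding fft_size * batch complex values.
    void run_batched_fft(cufftComplex* d_in, cufftComplex* d_out,
                         int fft_size, int batch) {
        cufftHandle plan;
        cufftPlan1d(&plan, fft_size, CUFFT_C2C, batch);  // one plan for the whole batch
        cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);  // forward transform
        cufftDestroy(plan);
    }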

2.3.3 NVIDIA profiler

Nvprof is a tool commonly used for profiling GPU applications. It allows a user to gather CUDA-related information such as kernel execution times, memory transfers and more. Nvprof is run from the command line with optional flags and an application. [8]

2.4 Related Works

F. García-Rial et al. [12] present a solution for performing real-time image processing for a 3-D THz radar, for the purpose of identifying hidden person-borne threats. Their solution contains common signal processing algorithms, such as a window function, an FFT and peak detection, and it achieved a refresh rate of more than 8 FPS for 6000 pixels. The window function was a custom-made CUDA kernel and the cuFFT library was used for the FFT. While many parts of the solution are similar to what is needed for pulse detection in this report, the application is different considering it is image-based.

Yaniv Rubinpur and Sivan Toledo [15] implemented and compared two versions of the signal processing for a wildlife tracking system, ATLAS: one where the signal processing parts (including, but not limited to, an FIR filter, a Fourier transform and an overlap-add) are done by a CPU, and one where they are done by a GPU. The comparison was done on several different setups, including a GeForce TITAN Xp, a Jetson AGX Xavier and a Jetson TX2. The authors found that the GPU implementation resulted in more than 50 times better performance, and about 5 times less power used, compared to the CPU implementation. While relatively similar to this report in both the platforms used and the steps performed, the main process is different. However, the authors' results would suggest a big performance increase in this report when comparing the CPU implementation with the GPU implementation.

P. Monsurro et al. [3] discuss bottlenecks when using GPUs as general-purpose computational units. Two major factors are the latency of sending data from the CPU to the GPU through the PCI Express interface, and the computational power. The current industry standard as of 2020 is PCIe 3.0 x16 with a bandwidth of 16 GB/s. This might be limiting on the desktop; however, both the NVIDIA Jetson AGX Xavier and the NVIDIA Jetson TX2 use an integrated GPU with a shared memory space that can receive data with a bandwidth of about 137 GB/s [5] and 60 GB/s [6] respectively, and they will consequently not be as affected by this latency. However, the computational power might be throttling the less powerful Jetson cards more than the desktop setup.

3. Method

This chapter will discuss the methodology used to achieve the goal of parallelizing the digital signal processing application, and also how the benchmarking was done to analyze the performance.

3.1 Parallelizing DP-chain

The parallelization in the study was conducted through an iterative process of evaluation, implementation and optimization.

The first step in parallelizing the DP-chain was to evaluate which sections of the DP-chain were parallelizable. This was done through discussions, through understanding the different sections of the DP-chain, and by reviewing the CPU implementation. The CPU implementation was provided by Saab and was used as the reference for test results. Once the evaluation had been conducted, an implementation was written. The implementation was then tested, and optimizations were made based on the results. This process was repeated over several iterations.

To verify that the entire GPU DP-chain worked as expected, its results were compared to the CPU implementation after every iteration. The comparisons were based on the complex numbers in the frequency bins, as well as the number of pulses found, when the GPU and CPU implementations were given the same input samples.

3.2 Benchmarks

For the test data, a sample size of roughly 4.2 M 32-bit floating-point samples was chosen. The DP-chain was evaluated using an expected best-case test and an expected worst-case test. These tests will be referred to as best-case and worst-case. The best-case test was set up to not have any pulses. This was considered best-case since having no pulses in the samples should require the least amount of work. The worst-case was set up to have both long pulses, stretching most of the runtime, as well as short and highly repetitive pulses throughout the runtime. This was considered the worst case because the highly repetitive pulses require many frequent calculations while the long pulses require heavy serial computation, thus decreasing thread efficiency on the GPU. Therefore, a combination of long pulses and short frequent pulses is expected to be difficult. The samples tested were exactly the same on all platforms.

The benchmark was measured by using the standard C++ Chrono library with the high_resolution_clock class to time every individual step of the chain. The throughput of each step, measured in samples per second, was then calculated through

throughput = (batch size · step size) / elapsed time    (4)

where batch size is the number of windows in a batch, step size is the step size used, and elapsed time is the amount of time spent in a particular function. The throughput is measured for every batch. Once all the batches have been processed, the average is taken as the final measurement.
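A minimal sketch of how one step of the chain can be timed with high_resolution_clock and converted to throughput according to equation (4); the names are illustrative, not the thesis code:

    #include <chrono>
    #include <cstddef>

    // Runs one step of the chain and returns its throughput in samples per
    // second, following equation (4).
    template <typename Step>
    double measure_throughput(Step&& step, std::size_t batch_size, std::size_t step_size) {
        auto t0 = std::chrono::high_resolution_clock::now();
        step();  // e.g. the window function, the FFT or the detector for one batch
        auto t1 = std::chrono::high_resolution_clock::now();
        double elapsed = std::chrono::duration<double>(t1 - t0).count();  // seconds
        return double(batch_size * step_size) / elapsed;
    }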

For the final benchmarks, the test was run with 17 different batch sizes, ranging from 2^0 to 2^16, on all platforms. A DFT size of 64 with a WOLA of 5 summations was used, meaning a window of 320 samples. The step size was 32. Each platform ran both the best-case test and the worst-case test once. The Jetson platforms also ran the two tests on each power mode and with jetson_clocks, a script that maximizes clock frequencies and power usage.

3.3 Hardware Setup

The benchmarks were performed on three different platforms: a desktop computer, an NVIDIA Jetson AGX Xavier and an NVIDIA Jetson TX2.

The desktop computer contained an Intel Core i9-9900 CPU, an NVIDIA RTX 2080 Super GPU with a PCIe 3.0 x16 connection, and 64 GB of RAM. The CPU's thermal design power (TDP) is 65 W.

The NVIDIA Jetson AGX Xavier is specified in section 2.2.3.

The NVIDIA Jetson TX2 is specified in section 2.2.4.

3.4 Software Setup

For the desktop computer, Linux CentOS was used as operating system, CUDA 10.2 for GPU programming and GCC 7.3.0 as C++ compiler.

For the NVIDIA Jetson AGX Xavier, Linux Ubuntu was used as operating system, CUDA 10.2 for GPU programming and GCC 7.4.0 as C++ compiler.


For the NVIDIA Jetson TX2, Linux Ubuntu was used as operating system, CUDA 10.0 for GPU programming and GCC 7.4.0 as C++ compiler.

Only the desktop setup was used to benchmark the CPU version of the DP-chain, while the GPU version was benchmarked on all three platforms. The GPU version of the DP-chain uses the same code on all three platforms, but it was compiled separately on each platform, which could create slight variations in the binaries.

4. Results

This chapter contains the results of the process defined in the methodology. It starts by presenting the parallelized algorithm that resulted from previous studies and iterative development. Then, the benchmarks of the final implementation are presented.

4.1 Parallelizing DP-chain

The CPU implementation of the digital signal processing is serial and deals with one window at a time. It gathers enough samples to fill a window, then applies the window function together with WOLA. It then runs the FFT on the preprocessed samples, using the FFTW library. After this, the pulse detector runs on the frequency bins, keeping track of frequencies with a magnitude above the constant threshold and writing out pulses if found. This concludes the processing of one window; the process then repeats by gathering new samples. The process terminates when there are no more samples to process, and it then prints out all the detected pulses.

Once the CPU implementation was understood, an evaluation based on previous studies and the implementation yielded a first version of the GPU-based digital signal processing chain. Based on F. García-Rial et al. [12] and the CUDA API documentation [8], the preprocessing (window function and WOLA) and the FFT could trivially be parallelized on the GPU. For the preprocessing, each GPU thread computes the product of the decimated window with the window function and puts the result into a GPU memory buffer. The FFT was performed using the cuFFT library. In the pulse detection step, one thread is started for each frequency bin of the FFT output. Each of these threads computes the magnitude of the complex number and checks if it is higher than the detection threshold. The pulse detection threads have a state data structure to keep track of pulses as the digital signal processing is repeated through time. This concludes the processing of one window. Similar to the CPU implementation, the process repeats by gathering new samples for the next window and is done when there are no more samples to process.

This first implementation did not reach full GPU resource usage with the desired window size. A better throughput could be achieved by also processing whole windows in parallel. The next implementation did exactly this; the number of windows processed in parallel is called the batch size. A higher batch size means waiting for the samples required to fill the batch, which causes a higher latency from the time a sample is gathered to the time it is finished being processed.
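The following CUDA sketch shows what such a batched preprocessing step (window function plus WOLA) could look like, with one thread per DFT-input element across the whole batch. It is a hedged reconstruction under the assumptions above, not the actual thesis code:

    // One thread per (window, element) pair in the batch. Consecutive windows
    // start step_size samples apart (the window overlap from section 2.1.5).
    // samples must hold (batch_size - 1) * step_size + folds * dft_size values.
    __global__ void preprocess_batch(const float* samples, const float* window,
                                     float* dft_input, int dft_size, int folds,
                                     int step_size, int batch_size) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= dft_size * batch_size) return;
        int w = idx / dft_size;                       // which window in the batch
        int i = idx % dft_size;                       // which DFT-input element
        const float* win = samples + w * step_size;   // start of this window
        float acc = 0.0f;
        for (int f = 0; f < folds; ++f)               // weight, overlap, add
            acc += win[f * dft_size + i] * window[f * dft_size + i];
        dft_input[w * dft_size + i] = acc;
    }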

In the final iteration, a new detector was written, because the detector turned out to be a large bottleneck. This was caused by the fact that, for large batch sizes, the detector serially analyzes each FFT output in the batch even though the frequency bins are processed in parallel. To get more parallel utilization, an early-out algorithm was implemented. This algorithm starts one thread per FFT output in the whole batch. Each thread then looks at the previous signal strength and the current signal strength to determine if its position is the start of a pulse. If it is, the thread scans through the batch until it reaches the end of the pulse. This shortens the amount of linear analysis in the detector to only the length of a pulse, whereas the previous detector always went linearly through the whole batch. This provided a considerable speedup when there was low pulse activity in the sample data, as will be shown in 4.2.
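A sketch of the early-out idea is shown below: one thread per (time, bin) position in the batch, where a thread only scans forward if its position is the start of a pulse. Handling of pulses that cross batch boundaries, and the recording of pulse descriptors, are omitted; the names are illustrative, not the thesis code:

    // mags holds the bin magnitudes of the whole batch, laid out as [time][bin].
    __global__ void early_out_detect(const float* mags, int num_bins, int batch_size,
                                     float threshold, int min_run, int* pulse_count) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= num_bins * batch_size) return;
        int t = idx / num_bins;
        int bin = idx % num_bins;
        bool above = mags[t * num_bins + bin] > threshold;
        bool prev = (t > 0) && (mags[(t - 1) * num_bins + bin] > threshold);
        if (!above || prev) return;     // early out: this is not the start of a pulse
        int len = 0;                    // scan only as far as the pulse lasts
        while (t + len < batch_size && mags[(t + len) * num_bins + bin] > threshold)
            ++len;
        if (len >= min_run)
            atomicAdd(pulse_count, 1);  // record that a pulse was detected
    }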

4.2 Benchmarks

The following results show the last iteration of testing when running the test samples through the entire DP-chain. For the desktop throughput data (Figure 5), the best-case test doubled in performance for every doubling of the batch size until a batch size of around 2048, where it started to plateau. The best-case test peaked at roughly 2 billion samples per second. For the worst-case, the GPU peaked at a batch size of 4096 with about 200 M samples per second, and started to plateau at a batch size of around 256. The CPU consistently processed 4.6 M samples per second (batch size is not applicable) and serves as a reference.

Figure 5. The figure shows a log-scaled throughput chart when running samples through the entire DP-chain on the desktop setup. The vertical axis is scaled as log10 and the horizontal axis as log2. The GPU best-case shows the number of samples processed per second during the best-case test. The GPU worst-case shows the number of samples processed per second during the worst-case test. The CPU shows the number of samples processed with the CPU implementation and is not affected by the signal size or batch size.

For the AGX throughput data (Figure 6), the test cases behaved similarly to the desktop throughput data. The best-case test doubled in performance for all power modes until a batch size of about 2048, where it starts to plateau. When measuring the AGX without a power limit, it reaches a peak at about 1 B samples per second. The performance drops about 40% from no limit to the 30 W mode, and a similar relation holds between the other power modes. For the worst-case test, the AGX peaks at a batch size of about 1024 at about 40-90 M samples processed per second, depending on which power mode is used. The performance then declines before starting to increase again as the batch size increases.

Figure 6. The figure shows a log-scaled throughput chart when running samples through the entire DP-chain on the NVIDIA Jetson AGX Xavier setup. The vertical axis is scaled as log10 and the horizontal axis as log2. The graph shows the performance of the different power modes in both test cases. The AGX best-case and the AGX worst-case represent the AGX without limited power consumption. The CPU shows the number of samples processed with the CPU implementation and is not affected by signal size or batch size.

The TX2 shows similar patterns in the throughput data (Figure 7) as the AGX. In the best-case test, performance doubles when the batch size is doubled, up to a batch size of 1024, for all power modes, before plateauing for bigger batch sizes. For the worst-case test, performance doubles up to a batch size of 1024. For larger sizes, it dips in performance before settling around 50-80 M samples per second.

Furthermore, as shown in Figures 8 and 9, the time taken for each section of the DP-chain is roughly 5-10 microseconds for batch sizes up to 256, except for the detector in the worst-case test. In the best-case test, larger batch sizes show the memory copy as the prominent bottleneck, taking approximately 70% of the total time. After a batch size of 2048, the time starts to double with each doubling of the batch size, meaning the time taken increases linearly beyond this point. In the worst-case test, however, larger batch sizes show the detector as the prominent bottleneck, taking approximately 92% of the total time. The time taken by the different sections on the Jetson platforms is distributed in a similar way as on the desktop for both of the test cases.

Figure 7. The figure shows a log-scaled throughput chart when running samples through the entire DP-chain on the NVIDIA Jetson TX2 setup. The vertical axis is scaled as log10 and the horizontal axis as log2. The graph shows the performance of the different power modes in both test cases. The CPU shows the number of samples processed with the CPU implementation and is not affected by signal size or batch size.


Figure 8. The figure shows the time it took for a batch to be processed through individual sectors of the DP-chain during the best-case. The Y-axis shows the time in microseconds and the X-axis the batch size used on a log2 scale.

Figure 9. The figure shows the time it took for a batch to be processed through individual sectors of the DP-chain during the worst-case. The Y-axis shows the time in microseconds and the X-axis the batch size used on a log2 scale.

5. Discussion

This chapter will discuss noteworthy aspects of the results.

5.1 Batch Size

The most prominent factor for gaining throughput on the GPU was the batch size. As can be seen in Figures 5-7, the throughput got consistently higher with the batch size until tapering off. This is expected behavior, since a larger batch size utilizes the GPU's threads more efficiently up to a certain point. This point represents near-maximum concurrent resource usage on the GPU.

At low batch sizes, the throughput of the GPU is quite low, even lower than the CPU's. This shows that when the parallel power of the GPU is not used, it is outperformed by the serial performance of the CPU. The minimum batch size needed to perform better than the CPU differs between platforms, but all platforms reached higher throughputs at some point. The desktop reached a higher throughput at a batch size of 4, for both the expected best and worst case. All configurations of the AGX matched or exceeded the throughput of the CPU at a batch size of 32, for both the expected best and worst case. All configurations of the TX2 matched or exceeded the throughput of the CPU at a batch size of 128, for both the expected best and worst case.

5.2 Latency Sacrifice

In a real-time scenario, batching requires waiting for more samples to be gathered. This increases the latency, in the sense that it takes longer from receiving an input to getting an output. This is of course only a problem in cases where real-time signal processing with low latency is required. For non-real-time processing, a higher batch size will almost always result in faster processing of samples, depending on the size of the data that needs to be processed.

5.3 Pulse Activity Dependence

A common pattern in the throughput results is the impact of the worst-case data. The worst-case data can reduce the throughput by as much as an order of magnitude compared to the best-case data. The reason for this can be seen in Figures 8 and 9. Figure 8 shows that the largest bottleneck is the memory transfer from host to device, while all the other parts of the DP-chain are by comparison not as time-consuming. On the other hand, Figure 9 shows that the pulse detector is a major bottleneck even when compared to the memory transfer. This can be explained by the fact that the detector is the only part of the DP-chain whose performance depends on what is in the sample data. Therefore, a lot of pulse activity, especially large pulses, causes the detector to become a major bottleneck. Figures 8 and 9 only show the extreme expected cases of worst and best data, so depending on the application and pulse activity, the average throughput may vary between these cases.

5.4 Throughput Deviation

When running the GPU tests, a large deviation was found in the performance from test to test, usually in one or two of the batch sizes. This deviation shows itself in, e.g., Figure 9 and explains why the detector's time is not completely linear. The reason for this deviation is difficult to explain without scheduling the GPU manually, but it could for instance be that the monitor was rendered during that exact time, or that the GPU overheated and had to throttle for a while.

5.5 Jetson Optimizations

Since the DP-chain was ported directly from the desktop to the Jetson systems without code modification, there may be some hardware-specific optimizations that could increase performance. One such optimization is to utilize the fact that the integrated GPU shares memory with the CPU, meaning that the memory transfers are not needed. Considering that there is only one major memory transfer in the final DP-chain algorithm, this should not affect the results presented here in a significant way, but it shows that there are possibilities for platform-specific optimizations.

5.6 Power Mode Divergences

The following data is taken from the best-case test.

In Figure 6, the power modes' different throughputs are presented for the AGX platform. The AGX Xavier, when limited to the 30 W power mode, has a maximum throughput of about 750 M samples per second, which is 25 M samples/s per watt. The 15 W power mode has a max throughput of about 410 M per second, equaling roughly 27 M samples/s per watt. The 10 W power mode has a max throughput of about 230 M, equaling roughly 23 M samples/s per watt. For the AGX Xavier in no-limit mode, a reference power lies around ~50 W [13] with a max throughput of about 1.1 B, equaling roughly 22 M samples/s per watt.

In Figure 7, the power modes' different throughputs are presented for the TX2 platform. The TX2 platform, when running in the 15 W power mode, has a max performance of 260 M samples per second, equaling 17.3 M samples/s per watt. The 7.5 W power mode has a max performance of 214 M samples per second, equaling 28.5 M samples/s per watt. For the TX2 in no-limit mode, a reference power lies around ~40 W, with a max throughput of about 309 M, reaching about 7.7 M samples/s per watt.

As seen through this analysis (Table 1), the AGX seems to lower its capabilities proportionally to the power supplied. The discrepancy seen in this case could be attributed to the throughput deviation of the test results. For the TX2, however, there seems to be only a small loss of capabilities for a relatively big decrease in power supplied. The CPU in Table 1 uses its TDP of 65 W as the reference power consumption and its max throughput of 4.6 M.

Platform          Samples/s per watt    Max samples/s
AGX (no limit)    22 M                  1129 M
AGX 30 W          25 M                  754 M
AGX 15 W          27 M                  417 M
AGX 10 W          23 M                  231 M
TX2 (no limit)    7.7 M                 309 M
TX2 15 W          17.3 M                258 M
TX2 7.5 W         28.5 M                214 M
CPU*              70 K                  4.6 M

Table 1. The table shows comparisons of the different platforms and power modes in terms of sample throughput. AGX (no limit) and TX2 (no limit) represent the no-limit mode of the respective platforms. Max samples/s is the highest throughput reached during the benchmarks. *The CPU uses its TDP (65 W) as reference power.

5.7 Criticism of Methodology

The fact that the detector is such a large bottleneck is a problem that could probably be solved. Given more time, the detector could have been rewritten to utilize some of the reduction methods that are commonly performed on the GPU. Thus, with some more study of how to solve reduction problems on the GPU, perhaps a more efficient solution could have been made.

The benchmark results presented are each from a single run of the 4.2 M test samples. This introduces some uncertainty in the results. A more accurate method would have been to run the benchmark multiple times for each batch size and present the average of those runs instead. It would also have been appropriate to include standard deviations in that case.

Another criticism of the method is the lack of a real-life test. The current tests use an expected best-case and an expected worst-case test, which show an accurate range of the capabilities of the parallelized program. However, they do not show a real-life application, i.e., a sample set based on real data. It is therefore impossible to say whether the program would run closer to the best-case test or the worst-case test in terms of performance.

5.8 Credibility of Sources

Some of the sources are from peer-reviewed journals or conferences, making them trustworthy. Other sources, such as NVIDIA's own documentation, are deemed trustworthy since the content concerns NVIDIA's own products and programming environment. One of the sources comes from an NVIDIA employee in a forum. Because it is an NVIDIA employee, in the NVIDIA forums, making the claim, it is deemed trustworthy. One of the sources is taken from arXiv, an open-access site for scientific papers. While not every article undergoes full peer review, the papers are approved by moderators; it is thus deemed reliable. The final questionable source is an article from "The Collaboration for Astronomy Signal Processing and Electronics Research", CASPER for short. This collaboration consists of multiple university institutions and is thus deemed reliable.

5.9 Ethics

As this thesis has mainly contributed by investigating whether a part of the signal processing chain could be parallelized on a GPU, followed by actually parallelizing it, there are not many ethical questions to discuss. However, as seen in Table 1, the CPU has quite poor performance per watt compared to the Jetson cards. Using the algorithm in this report would thus save power.

6. Conclusion

Every part of the digital signal processing chain can be parallelized, although each part shows a different amount of improvement. The parallelized digital signal processing chain is only beneficial in terms of throughput if a lot of samples are batched at a time. The amount of batching required to show improvement over the non-parallelized implementation varies depending on the platform, but on all platforms tested, the parallelized version shows improvement after a certain batch size. Using a large batch size causes a larger latency from input to output, which may affect real-time applications that need fast response times. Therefore, a tradeoff between latency and throughput may be important to consider.

The parts of the digital signal processing chain showed different amounts of improvement, which often reflects how well that part of the parallelized algorithm manages to use all available GPU threads. All parts show very significant improvements except for the pulse detector, which in the worst case can become quite linear. Therefore, future work to improve the pulse detector might be possible with an algorithm that better utilizes the concurrency of the GPU. This may be possible with common reduction algorithms.

Another future work opportunity is to make platform-specific optimizations on the Jetson hardware. For example, a reduction of memory transfers may be possible since these embedded systems have an integrated GPU that shares memory with the CPU. Additionally, other platform-specific optimizations would be interesting to identify.

References

1. A. HajiRassouliha, A. J. Taberner, M. P. Nash, P. M. F. Nielsen, "Suitability of recent hardware accelerators (DSPs, FPGAs and GPUs) for computer vision and image processing algorithms", Signal Processing: Image Communication, vol. 68, pp. 101-119, Oct. 2018.
2. R. S. Perdana, B. Sitohang and A. B. Suksmono, "A survey of graphics processing unit (GPU) utilization for radar signal and data processing system", 2017 6th International Conference on Electrical Engineering and Informatics (ICEEI), Langkawi, 2017, pp. 1-6, doi: 10.1109/ICEEI.2017.8312430.
3. P. Monsurro, A. Trifiletti, F. Lannutti, "Implementing radar algorithms on CUDA hardware", 2014 MIXDES, pp. 455-458, 19-21 June 2014.
4. James W. Cooley, John W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series", Mathematics of Computation, vol. 19, no. 90, pp. 287-301, April 1965.
5. NVIDIA Jetson Xavier System-on-Module Data Sheet (retrieved 2020-02-10). Available at: https://developer.nvidia.com/embedded/downloads#?search=Jetson%20AGX%20Xavier&tx=$product,jetson_agx_xavier
6. Jetson TX2 Series Module Data Sheet (retrieved 2020-05-05). Available at: https://developer.nvidia.com/embedded/downloads#?search=Data%20Sheet
7. William Stallings, "Computer Organization and Architecture: Designing for Performance", 10th edition, Pearson Education.
8. CUDA programming API documentation. Available at: https://docs.nvidia.com/cuda/
9. NVIDIA Fermi GPU Architecture whitepaper. Available at: https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
10. NVIDIA cuFFT library API documentation. Available at: https://docs.nvidia.com/cuda/cufft/index.html

11. Gerhard Heinzel, Albrecht Rüdiger and Roland Schilling, "Spectrum and spectral density estimation by the Discrete Fourier transform (DFT), including a comprehensive list of window functions and some new flat-top windows", Internal Report, Max-Planck-Institut für Gravitationsphysik, Hannover, 2002.
12. F. García-Rial, L. Úbeda-Medina and J. Grajal, "Real-time GPU-based image processing for a 3-D THz radar", IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 10, pp. 2953-2964, Oct. 2017.
13. AGX watt design reference (retrieved 2020-05-26). Available at: https://forums.developer.nvidia.com/t/xavier-power-consumption-information-solved/64595/5
14. "The Polyphase Filter Bank Technique". Available at: https://casper.ssl.berkeley.edu/wiki/The_Polyphase_Filter_Bank_Technique
15. Yaniv Rubinpur and Sivan Toledo, "High-Performance GPU and CPU Signal Processing for a Reverse-GPS Wildlife Tracking System", arXiv preprint arXiv:2005.10445, 2020.
