CPU and GPU Co-processing for Sound

Aleksander Gjermundsen

Master of Science in Computer Science
Submission date: July 2010
Supervisor: Anne Cathrine Elster, IDI
Co-supervisor: Thorvald Natvig, IDI

Norwegian University of Science and Technology Department of Computer and Information Science

Problem Description

GPUs are becoming very attractive devices for supporting HPC applications. This project will look at the GPU as an accelerator that can off-load the CPU for real-time and/or computationally intensive tasks.

Preprocessing of human speech will be used as a case study for how to process an application on either the CPU or the GPU or both and the potential for dynamically allocating computational tasks between the two at run-time.

Potential parallelism will include utilizing OpenCL, a new framework for enabling improved performance on both CPUs and GPUs.

Assignment given: 17 February 2010
Supervisor: Anne Cathrine Elster, IDI

Abstract

When using voice communications, one of the problematic phenomena that can occur is participants hearing an echo of their own voice. Acoustic echo cancellation (AEC) is used to remove this echo, but it can be computationally demanding.

The recent OpenCL standard allows high-level programs to be run on multi-core CPUs as well as on Graphics Processing Units (GPUs) and custom accelerators. This opens up new possibilities for offloading computations, which is especially important for real-time applications. Although many algorithms for image and video processing have been studied on the GPU, audio processing algorithms have not been similarly well researched. This may be because these algorithms are not viewed as being as computationally heavy, and thus as suitable for GPU offloading, as, for instance, dense linear algebra.

This thesis studies the AEC filter from the open-source library Speex for speech compression and audio preprocessing. We translate the original code into an optimized OpenCL program that can run on both CPUs and GPUs. Since the overhead of the OpenCL vendor implementations dominates running times, our results show that the existing reference implementation is faster for single-channel input/output, due to its simplicity and low computational intensity. However, by increasing the number of channels processed by the filter and the length of the echo tail, a speed-up of up to 5 for CPU+GPU over CPU only was achieved.

Although these cases may not be the most common, the techniques developed in this thesis are expected to be of increasing importance as GPUs and CPUs become more integrated, especially on embedded devices. This makes latencies less of an issue and hence the value of our results stronger. An outline for future work in this area is thus also included.

Acknowledgements

This master thesis was written at the High Performance Computing group at the Department of Computer and Information Science, at the Norwegian University of Science and Technology.

I extend my gratitude to several people for making this project possible. First of all, I would like to thank my advisor on this project, Dr. Anne C. Elster, for being of invaluable assistance throughout the project. Through her widespread network of contacts, she has obtained industry collaboration and sponsoring for the HPC lab, without which this project would not have been possible. In particular, I would like to thank NVIDIA and AMD for donating cutting-edge hardware to the lab. I would also like to thank my co-advisor on the project, Thorvald Natvig, for coming up with the initial idea and for invaluable technical insight, and HPC-lab Post Doc Dr. John P. Ryan for detailed feedback on this thesis. Last, but not least, I would like to thank all my fellow students at the HPC lab: Holger Ludvigsen, Ahmed Adnan Aqrawi, Øystein Eklund Krog, Andreas Dreyer Hysing and Runar Heggelien Refsnæs for being very helpful throughout the semester, both with technical topics and motivation.

Aleksander Gjermundsen, Trondheim, Norway, July 2010

Table of Contents

Abstract i

Acknowledgements iii

Table of Contents ix

List of Figures xii

List of Tables xiii

1 Introduction 1

1.1 Motivations...... 2

1.1.1 Offloading Computations on PCs...... 2

1.1.2 Offloading Computations on Embedded Devices...... 3

1.1.3 Massively Parallel Audio...... 3

1.2 Goals and Contributions...... 4

1.3 Outline...... 4

2 Parallel Computing 7

2.1 Parallel Computing with GPUs...... 7

2.1.1 Introduction to Parallel Computing...... 8

2.1.2 General Purpose GPU (GPGPU) Computing...... 8

2.2 OpenCL...... 11

2.2.1 The OpenCL Standard...... 12

2.2.2 The OpenCL Language...... 13

2.2.3 Alternatives to OpenCL...... 14

2.2.4 OpenCL on Embedded Devices...... 15

2.2.5 Concepts of Kernel Execution...... 16

2.2.6 Memory Hierarchy...... 18

2.3 Load Balancing...... 21

2.3.1 History of Load Balancing...... 21

2.3.2 Traditional Load Balancing Techniques...... 22

2.3.3 Load Balancing on GPUs...... 23

2.3.4 Auto Tunable GPU Algorithms...... 24

3 Sound Preprocessing with Speex 27

3.1 Fundamentals of Digital Audio...... 27

3.2 Audio Processing on GPU...... 29

3.3 Acoustic Echo Cancellation...... 29

3.4 Speex...... 30

3.5 Preprocessing in Speex...... 32

3.6 Adaptive Filters...... 33

3.6.1 MDF...... 34

3.6.2 AUMDF...... 37

3.7 Fast Fourier Transform...... 37

4 Optimizing Speex Echo Cancellation for OpenCL 41

4.1 Integration with Speex...... 42

4.2 The Testing Application...... 42

4.3 Timing and Verification...... 46

4.4 Integrating the FFT on GPU...... 47

4.4.1 FFT Data Preparation...... 48

4.4.2 FFT Data Finalization...... 49

4.5 Parallelization of MDF...... 50

4.5.1 Calculate Frames in Parallel...... 50

4.5.2 Calculate I/O Channels and Echo Tail in Parallel...... 51

4.5.3 Identify Independent Sections...... 51

4.6 OpenCL Kernel Functions...... 51

4.6.1 MDF Kernel Functions...... 51

4.6.2 FFT Kernel Functions...... 56

4.7 Executing on Different Platforms...... 57

4.8 Preliminary Performance Findings and Tuning...... 58

4.9 Increasing the Computational Load...... 59

4.9.1 Using MDF Instead of AUMDF...... 60

4.9.2 Longer Echo Tails...... 60

4.9.3 Longer Input Frames...... 61

4.9.4 Multiple Input/Output Channels...... 62

4.9.5 Processing Several Filesections in Parallel...... 64

4.10 Implementing Co-processing...... 65

4.10.1 OpenCL on CPU...... 65

4.10.2 A Simple Load-balancer...... 65

4.10.3 Task Parallelism...... 68

5 Benchmarking and Results 71

5.1 System Specifications...... 71

5.1.1 Machine A...... 72

5.1.2 Machine B...... 72

5.2 Latency of GPGPU Implementations...... 73

5.3 FFT Performance...... 76

5.4 Test Data and Accuracy...... 78

5.5 Performance Impact of Echo Tail Length...... 79

5.5.1 Frames of 128 Samples...... 79

5.5.2 Frames of 512 Samples...... 81

5.6 Performance Impact of Multiple Channels...... 83

5.6.1 Scalability of Input Channels...... 83

5.6.2 Scalability of Output Channels...... 84

5.6.3 Scalability of Both Input and Output Channels...... 85

5.6.4 A Reasonable Multi-channel Use Case...... 86

5.7 Profiling Kernel Functions...... 87

5.8 Performance With Load Balancing and CPU Load...... 90

5.9 Summary of Results...... 91

5.9.1 Performance Is Often Bound by Latency...... 91

5.9.2 Performance Scales Well With Demanding Parameters.... 91

5.9.3 OpenCL CPU Performance Is Lacking...... 92

6 Conclusions and Future Work 95

6.1 Conclusions...... 95

6.2 Future Work...... 97

6.2.1 Further Optimize the OpenCL Implementation...... 97

6.2.2 Investigate Massively Parallel Audio Processing...... 97

6.2.3 Explore Other Algorithms for Echo Cancellation...... 98

6.2.4 Integrate the Code Better Into the Library...... 98

6.2.5 Investigate Converting Other Parts of Speex to OpenCL... 98

6.2.6 Utilize the GPU Version in an Application...... 98

6.2.7 Running on an Embedded Device...... 99

6.3 Concluding Remarks...... 99

Bibliography 104

Appendices 105

A Benchmark Time Measurements 107

B Source Code 111

B.1 Test Program...... 111

B.2 OpenCL Implementation (Device Code/Kernel Functions)...... 117

B.3 OpenCL Implementation (FFT Device Code/Kernel Functions)... 134

List of Figures

2.1 Theoretical computational power on CPUs vs GPUs...... 10

2.2 GPGPU computing platform and languages...... 11

2.3 Overview of the area distribution on CPU and GPU chips...... 17

2.4 OpenCL NDRange of work-groups...... 18

2.5 OpenCL memory hierarchy...... 19

3.1 Waveform visualization...... 28

3.2 Block diagram of echo cancellation system...... 34

3.3 Block diagram of an MDF with an 8-block echo tail...... 36

4.1 Flow diagram of test application...... 45

4.2 Flow diagram of the OpenCL MDF implementation...... 52

4.3 Scalability of MDF filter with increasing frame size...... 62

4.4 Scalability of MDF filter with increasing input or output channels.. 63

4.5 Scalability of MDF filter with increasing input and output channels. 64

4.6 Simple CPU/GPU load balancer program flow...... 67

4.7 Concurrent vs. serial kernel execution...... 69

5.1 Latency of GPGPU platforms...... 75

5.2 Performance of FFT libraries on GPUs...... 78

5.3 Execution time with increasing echo tail (128 sample frames)..... 80

5.4 Execution time with increasing echo tail...... 82

5.5 Execution time with increasing number of input channels...... 83

5.6 Execution time with increasing number of output channels...... 84

5.7 Execution time with increasing number of input and output channels. 85

5.8 Execution time with two input channels and multiple output channels. 86

5.9 Kernel functions, GTX280/C2050, 128 sample frames, single channel. 88

5.10 Kernel functions, Geforce GTX280, 512 sample frames, 4 I/O channels.. 89

List of Tables

5.1 Specification for benchmarking machine A...... 72

5.2 Specification for benchmarking machine B...... 73

5.3 Latency of GPGPU platforms...... 74

5.4 Execution time of FFT libraries on GPU...... 77

Chapter 1

Introduction

Signal processing is a wide field with many applications, each with a variety of possible implementations. Currently, it is often applied to multimedia data: images, video or audio. Each of these types of data has its own specific computational demands, but some of the methods are common among them. Some discrete transforms, for instance, can be run with satisfactory performance on the simple processors found in embedded devices such as mobile phones, for audio compression or filtering. To run the same transform on high-resolution video signals in real time, the increased data throughput may require a multi-core CPU in a modern PC to achieve the same effect in software.

In recent years, Graphics Processing Units (GPUs) have become a quite common processing platform for solving numerous computing tasks. Although GPUs are traditionally used for decoding/encoding video, visualizations and games, audio processing on GPUs is a signal processing area that has not been very widely researched, even though many of the algorithms used lend themselves to highly parallel execution in the same way as image and video processing. Discrete transforms used in audio processing, like the Discrete Fourier Transform (DFT), have already been thoroughly researched on GPUs for other applications[1].

In previous years, developing programs for GPUs often involved very low-level and vendor-specific code. But as General-Purpose computing on GPUs (GPGPU) became more mainstream, a number of high-level and vendor-independent frameworks appeared. One such framework is OpenCL[2], which allows the same program to be run on GPUs from different vendors, dedicated accelerators and even multi-core CPUs with vector instructions. This opens up new possibilities for offloading computations between devices on demand, which in turn can lead to increased performance and energy efficiency.


The specific focus within audio processing in this thesis is on acoustic echo cancellation, which is often used in systems for real-time voice communication. It is the process of removing the remote user's audio that is picked up by the recording device in the room of the local user, which would otherwise result in the remote user hearing an echo of their own transmission. This typically occurs when a loudspeaker is used to play the sound received from the remote end, so that the sound is played into the room and is not isolated from the microphone.

1.1 Motivations

In the case study in this thesis, we develop an OpenCL implementation of an acoustic echo cancellation filter for GPUs, together with load-balancing functionality to offload computations from the CPU. Our implementation is based on functionality in the open-source library Speex[3], which is used for speech compression and general audio preprocessing on standard CPUs. The following are some of the motivations for implementing this in OpenCL.

1.1.1 Offloading Computations on PCs

Performing pre-processing on a high-quality audio signal can put a strain on the CPU during real-time audio communications. Even on modern PCs with very powerful multi-core CPUs, performing extensive filtering of audio while other CPU-intensive applications are running can lead to a substantial decrease in the computing power available on the system. Typical examples are video conferencing, games and other virtual worlds where voice communication is very common: the audio equipment used to record the sound can be of variable quality, and a large amount of the CPU's processing power is often required for other tasks. Audio preprocessing is then needed to extract as clear a signal as possible of the voice of the person speaking; this can involve methods like noise removal, acoustic echo cancellation and only transmitting sound when someone is actually speaking.

To be able to do all this filtering with high accuracy without affecting overall system performance, the computations can be offloaded to a co-processor. One such co-processor, which has become fairly common in PCs in recent years, is the GPU. It is a massively parallel device that often has spare capacity, and under normal desktop usage it often runs completely idle. The parallel nature of the GPU should also make it possible to perform the audio processing even faster than on the CPU, but this is not an absolute requirement if offloading can free up processing time for other tasks.


1.1.2 Offloading Computations on Embedded Devices

Embedded devices, especially mobile phones, are rapidly becoming equipped with more processing power than what was common in PCs a few years ago. GPUs powerful enough for 3D gaming are now being built into mobile phones, a development seen earlier for PCs that in the long term enabled GPUs to be used as co-processors in addition to being used for visualization. While becoming increasingly powerful, embedded devices are also very power-constrained: they usually run on battery power and have limited cooling, which constrains the speed their processors can run at.

To achieve the performance needed by some applications, they have to utilize all the processing power inside the device, including the GPU. Other applications might run more power-efficiently on a parallel architecture such as the GPU. In general, offloading computations to the GPU can be essential for future embedded applications. When performing audio filtering on an embedded device for voice-over-IP applications, offloading the main processor, or achieving better audio quality by utilizing both processors, is an appealing prospect. OpenCL is a technology that can be used for this purpose in the near future, and it is the focus of our implementation in this thesis. An advantage of GPU computing on embedded devices is that the CPU and GPU share main memory, in contrast to (most1) PCs, where the graphics card has separate memory connected over a relatively slow bus. This means that there is little to no overhead in performing operations on the GPU instead of the CPU.

1.1.3 Massively Parallel Audio

Other applications might involve massively parallel processing of separate audio streams, such as in larger telephony systems. Such a system can be configured to perform audio processing for handsets with limited built-in computing power. The audio quality might have to be restrained if the audio from all the endpoints is filtered on a traditional CPU, or multiple CPUs might be required to handle the load. Co-processing with architectures even more parallel than modern CPUs might significantly decrease the cost of such a system, or make it easier to expand the capabilities of an existing system.

1NVIDIA ION shares memory between CPU and GPU with an integrated GPU for PCs (retrieved 20.July 2010): http://www.nvidia.com/object/sff_ion.html


1.2 Goals and Contributions

The main goal of this thesis is to explore the possibilities for audio processing on GPUs so that GPUs can be used to offload processing in co-operation with the CPU. A case study is conducted based on the existing echo cancelling filter of the open-source Speex library for compression of voice data. An OpenCL version is developed, implemented and tested on both AMD Cypress and NVIDIA Fermi GPUs. Our results are also validated against the Speex CPU version. Using OpenCL, it is possible to switch between running the same filter code on either the CPU or the GPU.

The performance of the OpenCL CPU and GPU versions will be evaluated, and strategies for load-balancing between the CPU and GPU will be discussed and evaluated in the implementation where they are relevant in practice.

1.3 Outline

The rest of this thesis is structured as follows:

Chapter 2 (Parallel Computing) contains background information on parallel computing and GPGPU. The chapter begins with an introduction to why GPUs are important to parallel computing, describes the new OpenCL standard and gives an overview of the issues associated with load balancing of applications.

Chapter 3 (Sound Preprocessing with Speex) contains background information on sound preprocessing in general, and more specifically on the methods investigated in our case study. It starts by reviewing previous work on audio processing on GPUs, gives an introduction to the Speex library, continues on to describe details of the MDF algorithm for echo cancellation and finally gives an introduction to the FFT.

Chapter 4 (Optimizing Speex Echo Cancellation for OpenCL) describes the steps performed to adapt the sound processing library to support GPGPU. First, the testing application with its timing and verification functionality is described. The adaptations made to the third-party FFT library to work with the filter are then documented. The methods used to parallelize the filter are listed, and the resulting kernel functions described. After the implementation was finished, some initial performance testing was done; the findings from this, and how they affected further development, are reported. The chapter finishes with documentation of how co-processing can be utilized with the implementation.

Chapter 5 (Benchmarking and Results) contains the results of performance testing

and information about the hardware platforms and methods used for benchmarking. The benchmarks that were run on the finished OpenCL version of the filter, both on GPU and CPU, primarily tested the scalability of the filter with respect to echo tail length and the number of input and output channels. To create context for the application performance, the latency of the OpenCL implementations used is tested, as well as the performance of the FFT library used against a well-known reference implementation. The run times of the individual kernel functions are analyzed, and the chapter finishes with a discussion of the results and the impact they had on co-processing.

Chapter 6 (Conclusions and Future Work) summarizes the findings in earlier chapters and draws some conclusions about the case study. It also contains a section describing future possibilities for both the heterogeneous computing technology in general and the sound processing library specifically.


Chapter 2

Parallel Computing

This chapter contains technical background information on GPU and heterogeneous computing relevant to this thesis. Section 2.1 describes the issues facing high-performance computing today and why GPUs are a possible solution to some of them. This is further explored in Section 2.2, which goes into detail on how programs are written for GPUs and some technical considerations. Finally, Section 2.3 contains an overview of the concept of load balancing, how it has been applied in the past and how it applies to a heterogeneous system composed of CPUs and GPUs.

2.1 Parallel Computing with GPUs

For many years, one of the main ways computing performance increased was through higher clock speeds in processors, leading to faster execution of a single process or thread. For a considerable period of time, this development did not hit any apparent limits. Large multi-processor systems were obsolete after a few years because single processors with hugely increased clock speeds could outperform them. But the clock speed and single-process performance of processors have reached a plateau in recent years. This can be traced back to the laws of physics, with increased heat production and demands on electric power. This sets practical limits on the maximum clock speed that is possible to achieve, and has resulted in a major paradigm shift in the way processors are built.


2.1.1 Introduction to Parallel Computing

As the limit of single-processor performance was approached, the industry increasingly looked to parallel systems to achieve better performance than what is possible on a frequency-constrained single processor. Multi-core CPUs are now common in most new PCs (2-6 cores at the time of writing). By putting more processor cores on a single chip, performance is theoretically multiplied by the number of cores. Several lower-frequency, and hence cooler, low-power cores are more easily housed inside a computer than a single extremely hot, power-consuming one. The trend in HPC is so-called computing clusters made up of several commodity servers or PCs connected by a network. The most powerful computing systems in the world today, with thousands of processors, are clusters. These have taken over much of the role of large proprietary mainframe computers. Both these trends call for parallel programs that use all of the available processors effectively. However, doing so with two or more processors is difficult in many cases because of how access to the data must be managed. The data needs to be shared between the processors either through a system bus or, even worse, a low-bandwidth, high-latency network. Slowly, we are seeing more effective use of the available computing power in such systems, and more recently even for non-scientific applications[4].

2.1.2 General Purpose GPU (GPGPU) Computing

A component that is becoming more commonplace in modern PCs is the powerful graphics processor. Originally, the graphics card in a PC was only tasked with outputting a 2D image onto a screen, but in the 90s, specialized graphics accelerators came to market that accelerated 3D graphics for games. This made it possible to include realistic graphics in games in ways that were never before possible, and such accelerators have also become an important part of modern gaming consoles (such as the Playstation and Xbox). A professional market also existed for these devices, where 3D graphics were used for design and construction (CAD/CAM), but it was the consumer cards that sparked the rapid developments in this area.

Shader programs

A graphics card traditionally includes a lot of fixed-function, specialized hardware for rendering a 3D image at real-time frame rates (25-30+ fps). Such an image is generated from vertices and polygons that the programmer outputs from an application. The rendering of simple dots, lines and single-colored polygons is not enough to produce realistic-looking graphics on its own. Over the years, more effects have been added into the hardware. Eventually, graphics programmers

needed to be able to create their own effects that were relevant to their application, and so a level of programmability was introduced into the graphics cards with the addition of so-called shader programs. The term GPU became commonplace, describing the new co-processor supplementary to the CPU. Later, three types of programmable shaders were standardized:

• Vertex shader - Performs operations on single vertices, like changing their color for special effects.

• Pixel/fragment shader - Carries out operations on single screen pixels, instead of points in 3D space.

• Geometry shader - Can change the geometry of the scene itself.

Although these shaders allow much programmability of the GPU, they are still very much tied to a graphics pipeline, in that all operations are performed on graphics primitives. Nevertheless, efficient programs have been written using only shaders (GLSL/HLSL) and extensions (like NVIDIA Cg). For instance, a Lattice Boltzmann method fluid simulation[5] was created in this way on a cluster of PCs with Geforce 4 GPUs.

Expanding the programmability beyond shaders

Newer generations of GPUs (for example NVIDIA's Geforce 8-series and AMD/ATI's Radeon 2000-series and newer) incorporate a feature called "unified shaders", which was part of the DirectX 10 standard and is supported by GPUs that implement this widely used API. Unified shaders mean that all three types of shader programs mentioned above share the same set of operations. Previously, some functionality was only available to certain shader types. This led to the development of more general computing elements by the graphics card manufacturers to support these unified operations, and so there was no longer any point in having specialized hardware for each type of shader.

With support for unified shaders added, a framework for more general computation on the GPU was achievable. This was made possible by allowing a fourth type of program to run on the processing elements in addition to the three types of shaders: compute kernels. These are programs, often compiled from high-level languages, that run on the same processing elements as the shader programs, with up to several thousands of processing elements on a single GPU, and offer the developer massive parallelism. At the time of writing, the highest-end single consumer GPU from NVIDIA was the Geforce GTX480 with 480 shader cores, and the Radeon 5870

from AMD with 1600 shader cores (the performance, functionality and grouping of these are slightly different, and they cannot be compared directly in terms of the number of processors).

1Information about unified shaders in DirectX 10: http://msdn.microsoft.com/en-us/library/ee418282%28VS.85%29.aspx

Figure 2.1: Development of theoretical computational power on CPUs vs (NVIDIA) GPUs in the last decade, from [6] with permission from NVIDIA.

This large number of processing elements combines into a huge amount of processing power on a single chip when one considers the compute capability offered by all the cores (see Figure 2.1), especially compared to a modern CPU, which has 2-4 cores as mentioned earlier. Programs that map well to the architecture can experience a massive speedup. But although this looks brilliant in theory, some limitations arise when writing programs for the GPU, which will be discussed in detail later in this chapter. Because the GPU functionality was originally made for drawing graphics, each processing element does not offer the general functionality of a CPU core, and attention has to be paid to the way in which many cores access memory. Some applications are inherently better suited than others for execution on GPUs; [7] lists the following as characteristics of applications that are well suited:

• Computational requirements are large. GPUs perform a massive amount of computations in real-time, and are most efficient if given a substantial amount of work to do at once. Preferably, the application should have a high ratio of numerical operations compared to memory accesses, but this can be helped by favorable access patterns. Heavy branching from complicated logic that


makes threads diverge can also significantly reduce performance.

• Parallelism is substantial. Due to the massive parallel design of GPUs, the application must be very parallelizable and so the computational domain of the application should map well onto the architecture of the GPU, i.e. a large number of threads is often needed to mask memory access time.

• Throughput is more important than latency. Since real-time graphics operate on a millisecond timescale, GPUs are optimized for throughput on a scale that is perceivable by the human eye. On the time scale of microprocessors this is fairly long, which has led to long pipelines that are optimized for throughput over latency.

2.2 OpenCL

The obvious step after developing multi-purpose hardware and making it possible to run more general programs on GPUs is to ease the burden of creating the software. Using graphics-based languages or programming in an assembly language is tedious and makes development of more complex programs prohibitively difficult. To alleviate this problem, GPU manufacturers created compilers and translators for more general programming languages, to ease development for programmers not familiar with the inner workings of the GPU and to allow for more rapid development and reuse of code. The first widely used framework for writing high-level programs for GPGPU was Compute Unified Device Architecture (CUDA) from NVIDIA. Originally, it was only possible to write programs in C or FORTRAN, but the platform has also been extended to other languages. All the high-level languages are compiled down to a common bytecode, PTX files, which is then compiled into binary code that is executed on the GPU (see Figure 2.2). This removes most of the dependencies on the graphics pipeline, and allows more scientific applications to be run. The main framework in focus in this thesis for GPGPU is OpenCL.

Figure 2.2: GPGPU computing platform and languages (GPGPU applications written in C for CUDA, Brook, DirectCompute, GLSL/HLSL or OpenCL are compiled to an intermediate representation on the GPGPU platform, which then executes on the GPU), based on figure in [6].


Not only is OpenCL supported as one of the languages and APIs that can be used to construct applications on the NVIDIA CUDA platform, but it is also a new vendor- and platform-independent standard that can increase the propagation of GPGPU into traditional areas of computing.

2.2.1 The OpenCL Standard

OpenCL is an "open standard for parallel programming of heterogeneous systems"2. The OpenCL framework is said to comprise the following components (from [2]):

• OpenCL Platform Layer: The platform layer allows the host program to discover OpenCL devices and their capabilities and create contexts (execution environments).

• OpenCL Runtime: The runtime allows the host program to manipulate con- texts once they have been created.

• OpenCL Compiler: The OpenCL compiler creates program executables that contain OpenCL kernels. The OpenCL C programming language imple- mented by the compiler supports a subset of the ISO C99 language with extensions for parallelism.
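As a concrete illustration of the platform layer and runtime described above, the following is a minimal host-side sketch in C that discovers a GPU device and creates a context and command queue for it. The API calls are standard OpenCL 1.x; the wrapping function, and the omission of error handling, are hypothetical choices for illustration and are not taken from the Speex implementation.

    /* Minimal sketch: find a platform and a GPU device, then create a
     * context (execution environment) and a command queue for it.
     * Error handling is omitted for brevity. */
    #include <CL/cl.h>

    cl_context create_gpu_context(cl_device_id *device_out,
                                  cl_command_queue *queue_out)
    {
        cl_platform_id platform;
        cl_device_id device;
        cl_int err;

        /* Platform layer: enumerate platforms and their devices. */
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        /* Runtime: create the context and a command queue on the device. */
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        *queue_out = clCreateCommandQueue(ctx, device, 0, &err);

        *device_out = device;
        return ctx;
    }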

At the time of writing, the first stable version of the OpenCL standard, 1.0[8], had been out for over a year, and the second stable version (1.1[2]) had just been released. The standard is managed by the Khronos Group, which is a consortium funded by industry members. Besides the OpenCL standard, it also manages the widely used OpenGL standard for writing graphics applications. Contributors to the standard include leading hardware vendors such as AMD/ATI, Apple, ARM, IBM, Intel and NVIDIA. The broad participation of the industry when developing the standard ensured cooperation on implementing it on a wide variety of platforms. The known publicly available implementations (with stable releases) at the time of writing were the following:

• CPU/GPU implementation from Apple included in their operating system MAC OSX. GPU acceleration available both on NVIDIA and AMD GPUs.

• CPU/GPU implementation from AMD, supporting both their own GPUs and newer multi-core X86 CPUs (including those from Intel) with SSE3 vector operation support. Provides a multi-platform SDK and drivers for Windows and Linux.

• GPU implementation from NVIDIA, accelerated on their own GPUs. Provides a multi-platform SDK and drivers for Windows, Linux and MAC OSX.

• CPU/accelerator implementation from IBM, providing support for acceleration on their Cell processor. Provides an SDK for Linux.

2OpenCL home page: http://www.khronos.org/opencl/

2.2.2 The OpenCL Language

The OpenCL specification contains an API and a set of extensions to the C programming language that allow special C functions (declared as kernels) to be executed on the computing device.

An OpenCL program typically consists of two parts: one part executes on the host (CPU) and accesses the OpenCL runtime and platform layers, and the other part (the kernel) executes on the computing device (GPU/multi-core CPU/accelerator). The host code specifies how to configure kernel execution (how many threads should be executed, and how they relate to each other) and manages memory (the GPU and CPU do not typically share memory, so data often needs to be explicitly copied between their address spaces). The kernel itself is regular C99 code with some restrictions[2]:

• Many of the standard library functions are not available, e.g. the standard C I/O functions.

• There is no stack available, meaning that there are no real function calls (functions will be automatically inlined) and no recursion, although extensions to the standard may provide this.

• All kernel invocations must return void.

• Pointer use is somewhat restricted, e.g. pointers to pointers are not legal, no function pointers are permitted and no dynamic allocation of memory is possible.

• Static variables are not available; data must be persisted explicitly between kernel invocations.

• Built-in functions must be used to retrieve information about the platform, for synchronization primitives and for performing some mathematical operations efficiently on the GPU.

• Support for debugging kernel code while it is running on a GPU is often very limited.
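As an illustration of what such a kernel looks like, the following is a minimal, hypothetical OpenCL kernel (an element-wise multiply-add, not taken from Speex) showing the address-space qualifiers, the mandatory void return type and the built-in work-item functions:

    /* Hypothetical example kernel: y[i] = a * x[i] + y[i] for each work-item.
     * __kernel marks the function as a device entry point; it must return void.
     * __global marks buffers residing in the device's global memory. */
    __kernel void scale_add(const float a,
                            __global const float *x,
                            __global float *y,
                            const unsigned int n)
    {
        /* Built-in function: each work-item finds its own index in the NDRange. */
        size_t i = get_global_id(0);

        /* Guard against work-items beyond the data size (the NDRange is often
         * rounded up to a multiple of the work-group size). */
        if (i < n)
            y[i] = a * x[i] + y[i];
    }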

The exact details depend upon the implementation in question. The platform is modular, with each implementation free to implement extensions to utilize available features (e.g. double-precision floating point support). In addition, for the kernel to run effectively, some considerations need to be made, in short:

• Great attention needs to be paid to the way memory is accessed (see Subsection 2.2.6).

• The program will be inherently multithreaded, and must behave accordingly (see Subsection 2.2.5).

• Branching where some work-items/threads diverge can waste a lot of resources, as work-items execute the same code on multiple processing elements at a time. Branches have to be serialized on the processing elements.

• The GPU has support for a lot of specific types of operations (primarily simple single-precision floating-point arithmetic); other operations should be avoided as much as possible.

The extra considerations needed to make kernels execute with high performance can be alleviated by hardware designed to address specific issues. This is an ongoing process, and for NVIDIA each new generation of capabilities is called the "compute capability" of a certain GPU, from version 1.0, which was implemented with the Geforce 8-series of graphics cards, to the present version 2.0 in the Geforce 400-series and the newest Tesla cards. Until now, many of the improvements have come in the form of increased flexibility when accessing memory while preserving high performance.

HPC applications are becoming an increasingly important factor when new GPUs are designed, as evidenced by the Fermi architecture from NVIDIA [9], and to a certain degree the Cypress architecture from AMD [10].

2.2.3 Alternatives to OpenCL

Due to the special requirements in this thesis, OpenCL was chosen as the target platform. Before going into detail on how OpenCL programs are executed, some alternatives will be mentioned. The C for CUDA language seems to be the most widespread at the time of writing, but it is mostly specific to NVIDIA GPUs, and has less vendor support for executing on other types of devices (CPUs3 and accelerators). The equivalent to CUDA from ATI is called Stream and did not gain the same foothold as CUDA before OpenCL support emerged, but it offered many of the same features and supported a C-like language called Brook4, which was developed at Stanford University with an open-source implementation available. Although BrookGPU in itself is not GPU vendor specific, Stream is specific to AMD graphics

processors, in the same way as CUDA is for NVIDIA.

3MCUDA is a community/research project to effectively compile CUDA code to CPUs: http://impact.crhc.illinois.edu/mcuda.php
4BrookGPU project: http://graphics.stanford.edu/projects/brookgpu/

Additional languages and APIs exist, but they mostly have minority developer communities behind them (FORTRAN, Java and C# on CUDA), have limited platform support (DirectCompute) or are made to solve a limited set of problems. Other, perhaps more high-level, standards such as HMPP5 may prevail in the future; programming massively parallel processors like GPUs is an ever-changing field, where one of the most important goals is to make them accessible to new applications and new developers.

2.2.4 OpenCL on Embedded Devices

GPGPU has to date been a concept tied to traditional servers and PCs. As embedded systems get more advanced over time, they become a new domain where utilizing GPUs for diverse tasks has potential. Many mobile phones at the time of writing contain a simple GPU integrated with the CPU (sharing the same memory), which is mainly used for rendering the graphical user interface and games. As they get more powerful and versatile, it is anticipated that they will be used in ways similar to how it has been done on PCs. Popular mobile GPUs include the ARM Mali, the POWERVR SGX from Imagination Technologies and the NVIDIA Tegra, but hardware support from these manufacturers alone does not make it possible to use such features on these devices. At the time of writing, these devices were mostly programmable through the use of shader languages such as those in the OpenGL ES graphics standard. There had been much discussion of the possibility of using OpenCL on such devices, and some beta-level software was available to a restricted set of developers, but wide support for writing OpenCL programs on embedded devices was not present at the time of writing.

The OpenCL 1.1 specification [2] defines a specialized "embedded profile" for developing OpenCL applications on embedded devices. The embedded profile is a subset of the complete specification, adapted specifically to embedded devices. In summary, the following is the set of additional restrictions relative to the "full profile"[2]:

• 64-bit integers are not supported.

• Only optional support for 3D images (textures), and restrictions on interpolation of 2D and 3D images.

• Differences in default rounding modes for single-precision floating point operations.

• The division and square root operations on floating point numbers can have more relaxed rounding demands.

• Differences in conversions between single- and half-precision floating point numbers.

• Less rigid rounding from integers to floating point numbers.

• Custom device info parameters. These are metrics about the computing device that can be retrieved at run-time.

5CAPS enterprise, makers of HMPP for hybrid computing: http://www.caps-entreprise.com/

These are fairly acceptable restrictions to the specification, and they make it possible to port OpenCL "full-profile" (desktop) programs to the embedded profile with relative ease.
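For example, a host program can detect at run time which profile a device implements by querying CL_DEVICE_PROFILE, which the specification defines to return either "FULL_PROFILE" or "EMBEDDED_PROFILE". The helper function below is a hypothetical sketch:

    #include <string.h>
    #include <CL/cl.h>

    /* Returns non-zero if the device only implements the embedded profile. */
    int device_is_embedded_profile(cl_device_id device)
    {
        char profile[64];
        clGetDeviceInfo(device, CL_DEVICE_PROFILE, sizeof(profile), profile, NULL);
        return strcmp(profile, "EMBEDDED_PROFILE") == 0;
    }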

In the context of this thesis, it would be very interesting to port the case study involving audio processing to an embedded device. The Speex library (see Section 3.4) is already made to run on embedded devices. It supports the DSPs (Digital Signal Processors) on Analog Devices' Blackfin microprocessors, has optimized procedures for the ARM4 microprocessor, support for the Symbian operating system, and support for CPUs without floating point capability. Running Speex on an embedded device has some obvious applications, as it could be used for many kinds of voice communication, primarily on mobile phones. It might be beneficial to offload much of the computational load on such a device to the GPU or other supported OpenCL devices. If such tasks were offloaded from the CPU, they might be performed faster and more energy-efficiently, enabling better audio quality as well as freeing the CPU for other tasks. In this thesis, embedded OpenCL implementations are treated as a purely theoretical subject because, as mentioned, no comprehensive development platforms were available to the public at the time of writing.

2.2.5 Concepts of Kernel Execution

As mentioned, GPUs contain a lot of computation cores. The physical difference in surface area distribution of the different components on a CPU chip compared to a GPU chip is depicted in Figure 2.3. The GPU dedicates significantly more space to the smaller and simpler computational units (the ALUs), instead of the fewer but more functional units of the CPU. CPUs also have to support the considerable legacy of the X86 architecture and its flexible general computing nature, which increases the amount of control logic needed. In addition, the GPU eliminates/reduces the global cache, and instead distributes some fast memory among groups of computational cores. The larger number of computational cores increases the demands on parallelization of the application that will execute on them, which is a burden the developer, and preferably to a certain degree the compiler, is responsible for.

The most basic unit that executes on a GPU is termed a work-item in OpenCL (CUDA: thread), which is the same concept often used for parallel applications on CPUs. Each work-item/thread can have its own variables and in theory its own


separate control flow, and is an instance of a kernel function. But, as will become apparent later in the section, it is not as simple as a 1:1 mapping between running threads and independent cores (as there is on most CPUs). In OpenCL, a group of work-items is called exactly that, a work-group (CUDA: block), which again can be divided in 1, 2 or 3 dimensions, depending on what is appropriate for the problem at hand.

Figure 2.3: Overview of the area distribution on CPU and GPU chips, from [6] with permission from NVIDIA.

Typically, these work-groups can contain several hundred work-items in each dimension defined. The work-groups are assigned to a unit on the GPU that is called a compute unit in OpenCL (NVIDIA/CUDA: streaming multiprocessor). Each of these compute units is independent of the others, and schedules work-items (in groups of 32, called warps in CUDA) to run on the ALU units that it contains. They are built around the so-called SIMT principle, which stands for Single Instruction, Multiple Threads. This is similar to the concept of SIMD (Single Instruction, Multiple Data), in that multiple computational units execute the same operation (instruction) at the same time, each working on a different piece of data. In contrast to SIMD, SIMT work-items/threads can branch and so can include some control logic, but even though this is possible, branching is serialized, so each work-item is not completely independent. If even one of the work-items running on a compute unit diverges from the others, only the work-items following the branch can continue executing in parallel; the rest have to wait and then execute their own branch later. All this means that the programmer must pay close attention to the way that work is divided between the work-groups/compute units, and avoid branching behavior in work-items as much as possible.
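As a small, hypothetical illustration of divergence (not taken from Speex), the kernel below contains a data-dependent branch where even and odd work-items take different paths; on a compute unit the two paths are executed one after the other, with part of the work-items idle during each:

    __kernel void divergent_example(__global float *data)
    {
        size_t i = get_global_id(0);
        if (i % 2 == 0)              /* even work-items take this branch ...          */
            data[i] = data[i] * 2.0f;
        else                         /* ... odd work-items wait, then take this one  */
            data[i] = data[i] + 1.0f;
    }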

A group of work-groups is in turn collected in what is called an NDRange (CUDA: grid), which consists of a 1-, 2- or 3-dimensional index space of work-groups. The dimensions of the NDRange, in the same way as for work-groups, can be determined by the problem at hand. In practice, the maximum size of the NDRange is much larger than that of the work-groups, since it is not bound to the physical resources of the

multiprocessors in the way a work-group is. GPUs from the same generation often have the same basic compute units, but with different numbers of them. On the GPUs in the NVIDIA Geforce 200-series with 240 shader cores (30 SMs of 8 SPs each), more than 30 000 threads can be scheduled at once[11].

An overview of the grid/block/thread hierarchy can be found in Figure 2.4.


Figure 2.4: OpenCL NDRange of work-groups (CUDA: grid of blocks), based on figure in [2].
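On the host side, the NDRange and work-group dimensions are the sizes passed to clEnqueueNDRangeKernel when a kernel is launched. The sketch below is hypothetical: the sizes are arbitrary, and the command queue and kernel are assumed to have been created already.

    #include <CL/cl.h>

    /* Launch 'kernel' over a 1024x1024 two-dimensional NDRange divided into
     * 16x16 work-groups; the global sizes must be multiples of the local sizes. */
    cl_int launch_2d(cl_command_queue queue, cl_kernel kernel)
    {
        size_t global_size[2] = { 1024, 1024 };  /* Gx, Gy: total work-items          */
        size_t local_size[2]  = { 16, 16 };      /* Sx, Sy: work-items per work-group */

        return clEnqueueNDRangeKernel(queue, kernel,
                                      2,             /* work_dim                 */
                                      NULL,          /* no global offset         */
                                      global_size, local_size,
                                      0, NULL, NULL  /* no event wait list       */);
    }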

2.2.6 Memory Hierarchy

Just like a CPU, GPUs have access to various types of memory that vary in speed and capacity. To utilize the memory effectively across hundreds of cores, and to deal with the long latencies involved in accessing the large DRAM memory (termed "global memory" from now on), other types of faster, but smaller-capacity, memory are also available to the GPU (see Figure 2.5). One also needs to remember that before data in global memory can be operated on by the GPU, it must first be copied from the main CPU memory. To date, this copy operation must be carried out over the PCI-Express bus, which has a maximum bandwidth of only 8 GB/s in each direction. As mentioned earlier, this limits the use of GPUs for some problems; they are best applied to problems where a significant amount of calculation is involved.

6Press release of PCIe 2.0 (retrieved 20. July 2010): http://www.pcisig.com/news_room/PCIe2_0_Spec_Release_FINAL2.pdf


Figure 2.5: OpenCL memory hierarchy, based on figure in [2].

Primarily, there are three types of memory available to the GPU. The following listing is based on information in [11], with specifications mainly for NVIDIA GPUs, but the principles apply to ATI GPUs as well:

• Global memory is accessible by all work-items, and the large majority of the total memory resources is of this type. It is large (6 GB on the current high-end NVIDIA graphics cards), has reasonably high bandwidth, but can have several hundred clock cycles of access latency. The bandwidth for sequential accesses is in fact much higher7 than that of the DRAM memory in most PCs and

servers8, but despite the higher bandwidth, a lot more cores need to access the same memory. The access latency can become a large problem if accesses are not organized effectively. Reads and writes from/to global memory by a number of work-items should be coalesced (combined into larger operations) by reading/writing as much as possible, in specific patterns, depending on the GPU generation (see the NVIDIA OpenCL Programming Guide [6] for more details). Additionally, some other categories of memory are ultimately stored in global memory, but used for special purposes:

– Private memory (called local memory in CUDA) is a small piece of memory available to each thread. Currently this is 512 KB per work-item on the Fermi architecture GPUs[6]. In practice, private memory usually acts as slow spillover storage for faster registers, so it should be avoided as much as possible. Newer NVIDIA GPUs[9] have implemented caching of private memory, which can help performance if this is a problem in the application.

– Constant memory, which is 64 KB (all current NVIDIA GPUs[6]), is shared among all work-items and is typically used for storing constants (hence the name). This is cached on the compute units, and can be much faster than accessing regular global memory when a value is read multiple times (same as registers on cache hits). The values can not be altered by kernels; they must be predefined or copied from the host before the kernel is launched.

– Image objects (texture memory in CUDA) are based on a method to bind a section of global memory as a cached texture. This is cached by each compute unit, similar to constant memory, and can lead to faster lookups, especially for 2-dimensional data. A heritage from visualization is that values in texture memory are automatically interpolated without additional computational cost, which can be very useful for some applications. Textures can only be written to by compute kernels under certain conditions, and have generally been used as read-only data by the GPU.

• Local memory (called shared memory in CUDA terminology) is available for each work-group and is local to each compute unit (32 KB per compute unit on NVIDIA Fermi[6] and ATI Cypress[10] GPUs). If a work-group has high re-use of data, temporarily storing the values here can significantly reduce access latency, as it is roughly 100 times faster[11] than global memory (a small kernel sketch follows this list). Another automatic use of local memory is for "function" arguments passed to the kernel. Some considerations must be taken with read patterns from local memory, to avoid bank conflicts. Bank conflicts arise because local memory is organized into memory banks that can be accessed simultaneously if all the threads request separate addresses in a linear fashion, but problems arise when the access pattern prohibits all the banks from being accessed at once (further details can be found in [6] and depend on the specific GPU architecture).

• Registers local to each compute unit represent the fastest data storage, but the amount available to each work-group is limited, so if many registers are used, it can limit the number of concurrent work-items that can run on each compute unit. If the kernel uses too many registers, they spill over into private memory (physically located in global memory), which considerably slows the program down. The maximum number of registers available to each compute unit is 32 768 32-bit registers on NVIDIA Fermi-based GPUs[6].

7The specifications for the Tesla C2050 list the memory bandwidth as 144 GB/s. Source (retrieved 20. July 2010): http://www.nvidia.com/object/product_tesla_C2050_C2070_us.html
8The specifications for the Intel Core i7-980 Extreme Edition list it as having a 25.6 GB/s maximum memory bandwidth. Source (retrieved 20. July 2010): http://ark.intel.com/Product.aspx?id=47932&processor=i7-980X&spec-codes=SLBUZ
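The hypothetical kernel below (not part of Speex) sketches the local-memory re-use pattern referred to in the list above: each work-group stages its block of the input in local memory and reduces it there, so global memory is read only once per element and written only once per work-group. The work-group size is assumed to be a power of two.

    __kernel void block_sum(__global const float *in,
                            __global float *partial_sums,
                            __local float *scratch)
    {
        size_t lid   = get_local_id(0);
        size_t lsize = get_local_size(0);

        /* One read per work-item from global into fast local memory
         * (consecutive work-items read consecutive addresses, so the
         * accesses can be coalesced). */
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Tree reduction inside the work-group, re-using data from local memory. */
        for (size_t stride = lsize / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        /* Work-item 0 writes the work-group's partial result back. */
        if (lid == 0)
            partial_sums[get_group_id(0)] = scratch[0];
    }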

2.3 Load Balancing

This section discusses load balancing and how it might be applied to GPUs and to sound processing applications such as our implementation.

2.3.1 History of Load Balancing

Load balancing in general is an age-old problem in computer science. Traditionally it has been used to distribute the work-load of large tasks among a number of systems; for example, it is commonly employed to serve large web sites or various web services. This was the most common use case in a world where every computer had only a single processor with a single core that needed to do all the processing. With many similar CPUs installed side by side in newer systems, as well as other computing devices like GPUs, load balancing also becomes an issue internally on a single machine. Load balancing is also closely related to scheduling, which is a much-explored problem of assigning batches of work to a finite number of resources, e.g. scheduling processes for execution on a single CPU. For applications running on the CPU, the operating system can schedule processes for execution on different cores at the same time, but it can not automatically assign a process to several CPUs/cores at the same time to speed up execution; the program has to do so explicitly itself. If one wishes to make use of all the available resources, the application has to implement some form of load balancing itself to distribute work to where the resources are available at the time.


2.3.2 Traditional Load Balancing Techniques

Static load balancing is the concept of assigning work to processors before any tasks start executing. This is a challenging theoretical problem that is categorized as NP-complete. The following are listed as some of the major practical static load-balancing techniques in [12]:

• Round robin algorithms - distribute work in sequential order of processes, coming back to the first when all the processes have been given a task.

• Randomized algorithms - select processes at random to take tasks.

• Recursive bisection - recursively divides the problem into subproblems of equal computational effort while minimizing message-passing.

• Simulated annealing - an optimization technique that avoids the locally best solution (end time) in favor of the best one globally.

• Genetic algorithm - an optimization technique based on evolution in nature.

These algorithms are useful if all the parameters of the system are predetermined before execution starts, but problems can occur if, for instance, other tasks share the processors or the interconnect is not completely isolated (which is often not the case for Ethernet networks). The idea of "off-loading" tasks from processors is also not congruent with static load balancing, because we do not usually know the future work-load of the processors ahead of time.

Dynamic load balancing, as the name suggests, takes the current overall state of the processors into account. This adds some additional overhead during execution, but allows for much greater flexibility for real-world environments. In [12], dynamic load balancing is divided into two categories:

• Centralized - Tasks are distributed from a central location, with a master process that controls several slave processes. The tasks can be arranged in a prioritized queue that can grow dynamically during execution. Termination at the end of execution can be simple with one central work-pool: the master process simply signals all the slave processes.

• Decentralized - Tasks are passed between arbitrary processes, which finally report to a single process when they are completed. This alleviates the dependency on a single process within the work-pool; it may be better to have completely independent worker/slave processes or smaller groups of processes that share a work pool. A problem with decentralized load balancing is that determining when to terminate can become difficult, but several algorithms exist to solve this.


An example of a centralized load-balancing approach for modern processors is Grand Central Dispatch[13] (GCD), which is a technology created to effectively utilize multi-core CPUs between several applications. The library implementation has been released as the open source project libdispatch9 and is primarily available in Apple Mac OS X at the time of writing. It acts as an abstraction layer for the application developer and groups tasks into blocks, which are organized in queues. The queues from the applications are managed centrally by GCD, which is a system-level component that can manage work depending on the existing queues and the overall status of the system. The blocks are removed from the queues by GCD and assigned to threads for execution.

2.3.3 Load Balancing on GPUs

Not very much research exists on load balancing between the CPU and GPU within a single heterogeneous system. More common is the task of load balancing between several fully utilized GPUs[14], or across a cluster of heterogeneous computers with both CPUs and GPUs. In the case study in this thesis, work should either be executed as fast as possible or offloaded to the CPU or GPU if computing resources are being used for other purposes. In this regard, a dynamic load balancing scheme is required. It is also implied that it must be centralized: with the GPU technology currently available, the CPU still needs to be strongly involved when offloading work to the GPU.
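A minimal sketch of the kind of centralized, dynamic policy this implies is shown below. Because the GPU's own load cannot easily be queried (see the next paragraphs), the decision is based only on a host-side estimate of CPU utilization; the threshold, the utilization argument and the enum are hypothetical and are not the exact policy used in this thesis.

    enum device { DEVICE_CPU, DEVICE_GPU };

    /* Pick a target device for the next batch of filter work, given an
     * estimate of current CPU utilization in the range 0.0 - 1.0. */
    enum device choose_device(double cpu_utilization)
    {
        const double threshold = 0.5;  /* hypothetical cut-off */

        /* If the CPU is already busy with other work, offload to the GPU;
         * otherwise the lower-latency CPU path is preferred. */
        return (cpu_utilization > threshold) ? DEVICE_GPU : DEVICE_CPU;
    }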

A large challenge when considering dynamic load balancing involving a GPU is that there is currently no practical way of gathering information about the work-load of the GPU at a given point in time. In fact, to date it has not been possible for several programs with separate processes to share the same GPU in a controlled way. NVIDIA aimed to improve this area with their Fermi architecture[9], which makes it possible for multiple kernels to run on the same GPU to utilize it fully. However, this hardware had not been available long enough at the time of writing for this to be incorporated into the case study. Problems with sharing a GPU still exist even with these improvements, for instance when another application needs to use the GPU for rendering graphics at the same time, and how this combines with GPGPU applications.

These limitations lead to simplified algorithms for load balancing on the GPU, where some strong assumptions need to be made about the overall state of the system. GPGPU platforms must still evolve further before resources on the GPU can be shared and utilized by multiple applications throughout the system in the same way as on the CPU.

A system similar to Grand Central Dispatch (see Section 2.3.2) could be a possible

9libdispatch web page: http://libdispatch.macosforge.org/

solution on GPUs, and some of the same principles of execution queues are found in OpenCL. Although the principles are appealing, some problems arise when transferring them to GPUs, first among which are memory accesses/locks and fine-grained thread control, which are more easily achieved with general CPU architectures. Tight control of thread execution, atomic operations and synchronization across all threads are not well supported operations on GPUs today, or incur a large performance penalty. GPU hardware designers have shown a tendency to cater more to the needs of GPGPU in newer generations of their architectures, and this could make such general load balancing realistically achievable in the future.

2.3.4 Auto Tunable GPU Algorithms

Because of the differences between the computing devices that can be utilized by OpenCL applications, auto tunable algorithms become even more important than they were on more traditional CPU-based hardware. Auto tuning means that a program optimizes itself for the available hardware resources with a given set of capabilities. For an OpenCL application in production this becomes very important, because there are significant differences between the capabilities of GPUs, other accelerators and multi-core CPUs. Modern GPUs come in a number of different configurations when it comes to the number of available compute units and their grouping. Some differences exist within the same generation of chips from the same manufacturer because of price/power differentiation of their products. Between generations, major changes can occur in how the computational units are organized, for instance between the NVIDIA GT200 (“Tesla”) and GF100 (“Fermi”) architectures[9].

Even larger differences are found between GPU manufacturers. At the time of writing, the competing products from AMD and NVIDIA had fairly different architectures and specifications, even though their performance is comparable (depending on the application). Where the ATI Radeon 5870 has 1600 so-called processing elements grouped in fives as stream cores, which are again grouped in sixteens as SIMD engines, the NVIDIA Geforce GTX 480 has 480 “CUDA cores” in groups of 32 as streaming multiprocessors. A program written for one architecture does not necessarily achieve the same level of performance on the other without custom tuning. Currently, the compiler optimizations available in the GPU platforms are limited, so algorithms and libraries can gain sizeable performance increases[15][1] by having the ability to auto tune and adapt to the environment in which they are run.

A few attempts have been made at creating auto tunable algorithms on GPUs. Nukada and Matsuoka[1] have created an auto tunable FFT library for GPUs

10ATI Radeon 5870 specs (retrieved 25.June 2010): http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5870/Pages/ati-radeon-hd-5870-specifications.aspx
11NVIDIA GTX 480 specs (retrieved 25.June 2010): http://www.nvidia.com/docs/IO/90025/GTX_480-470_Web_Datasheet.Final3.pdf

with NVIDIA CUDA. It auto-generates an optimized kernel function and tries different combinations of thread and block sizes for a given FFT size and batch size at compile time (for the kernel function). With a small time penalty during compilation, they achieve up to 5.2x-8.0x better performance than the NVIDIA CUFFT library (CUDA ver. 2.1). This functionality would be a valuable addition to a production version of the program in our case study. For more information about the FFT library used in the case study, see Section 3.7. It shares some elements with the library from [1], as the kernel code is dynamically generated, but it does not implement the dynamic thread/block optimization.
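A minimal sketch of the same “compile and try” idea expressed with OpenCL (not the CUDA library from [1]) is shown below: the same kernel is launched with a few candidate work-group sizes, timed with OpenCL profiling events, and the fastest configuration is kept. The candidate sizes, and the assumption that the command queue was created with CL_QUEUE_PROFILING_ENABLE, are illustrative choices.

/* Hedged sketch: pick the fastest work-group size for a kernel by timed
 * trial runs. `kernel` and `global` (total work-items) are set up elsewhere. */
#include <CL/cl.h>
#include <stdio.h>

size_t pick_local_size(cl_command_queue queue, cl_kernel kernel, size_t global)
{
    size_t candidates[] = {32, 64, 128, 256};
    size_t best = candidates[0];
    cl_ulong best_ns = (cl_ulong)-1;

    for (int i = 0; i < 4; i++) {
        size_t local = candidates[i];
        if (global % local != 0)
            continue;                       /* global size must be divisible */
        cl_event ev;
        if (clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                                   0, NULL, &ev) != CL_SUCCESS)
            continue;
        clWaitForEvents(1, &ev);
        cl_ulong start, end;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof start, &start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof end, &end, NULL);
        clReleaseEvent(ev);
        if (end - start < best_ns) {        /* keep the fastest configuration */
            best_ns = end - start;
            best = local;
        }
    }
    printf("best local size: %zu (%lu ns)\n", best, (unsigned long)best_ns);
    return best;
}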

Ideally, one would want the compiler to optimize the code for any given architecture to almost the same degree as an expert programmer would be able to. Such a compiler has not yet been created; to create one, a model of the performance for the different possible parameters would be needed. This was explored by Baghsorkhi et al.[16] by translating CUDA programs into so-called work flow graphs. Their model predicted execution times fairly accurately for a matrix multiplication kernel, an FFT kernel and prefix sum kernels. An accurate model could potentially remove the need for “compile and try” auto tuning as in [1], and could lead to better automatic compiler optimizations.

In the current state of auto tuning GPU algorithms, however, they are fairly complex to implement for non-trivial applications. Although attractive, it is not very feasible to apply auto tuning to prototype implementations without adding significant additional development time. However, as auto tuning library functions and algorithms become available, they will be used to speed up segments of compute-intensive programs.


Chapter 3

Sound Preprocessing with Speex

The area of sound processing investigated in particular in this thesis, with regard to parallelization on GPUs, is the preprocessing of human speech as done in the Speex speech codec. This chapter contains background information on the case study and on sound processing in general. It begins with Section 3.1, which introduces a few fundamental digital audio concepts. Section 3.2 gives an overview of earlier work on audio processing on GPUs. Section 3.3 introduces the problem area of the case study: acoustic echo cancellation. Section 3.4 and Section 3.5 introduce the Speex library and its preprocessing components. Section 3.6 gives an introduction to adaptive filters and a specific implementation (MDF). Section 3.7 introduces the FFT and its implementation on GPUs.

3.1 Fundamentals of Digital Audio

To introduce a few digital audio concepts that will be used later in this chapter, a few fundamental principles are first described.

The sound used in the context of this thesis is always recorded by a microphone connected to the computing system in some way. The sound waves hitting the microphone membrane represent a continuous signal, but this has to be discretized into a finite series of numbers before it can be represented and processed by a computer. This conversion is termed sampling, and the end product is a series of samples, which is simply a series of numbers. For an example of how a series of

samples looks when plotted over time1, see Figure 3.1. The top plot shows the samples over a very short period of time, with the small dots representing individual samples; the bottom plot is zoomed out to a larger time interval where the overall characteristics of the signal become visible.

The format of the numbers used to store samples can vary, but usually they are signed or unsigned integers stored with 8 to 24 bits of precision. Since the samples represent a discrete version of the continuous sound waves, the more often samples are stored per time-unit, the more closely they resemble the sampled signal. The frequency with which these samples are stored is termed the sampling rate, and is usually in the range of 8 kHz (telephones) to 192 kHz (professional recordings and high-end music/movie formats). A common reference, the Compact Disc format, has a 44.1 kHz sampling frequency and stores the samples as 16-bit signed integers; audio stored with these parameters is often called “CD-quality”.


Figure 3.1: Waveform visualization, showing normalized floating point sample values over time.

Since computer hardware and software are generally most effective when given more than single values at a time to work on, samples are often grouped into frames. As will be discussed later in this thesis, the most appropriate frame size is usually determined by the hardware architecture at hand, the operations to be performed on the frame and the required filter latency. After processing of the samples is done, there are often even tighter constraints on the frame size due to network standards and the audio hardware used to play back the stream of samples.
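As a small illustration of these concepts (not code from Speex), the sketch below converts interleaved 16-bit samples to normalized floating point values and groups them into frames of an assumed size of 128 samples.

/* Illustrative sketch: normalize 16-bit samples and group them into frames. */
#include <stdint.h>
#include <stddef.h>

#define FRAME_SIZE 128   /* samples per frame, an assumed value */

/* Fills `frame` with FRAME_SIZE normalized samples taken from `pcm`,
 * starting at frame index `j`. */
static void get_frame(const int16_t *pcm, size_t j, float *frame)
{
    for (size_t i = 0; i < FRAME_SIZE; i++)
        frame[i] = pcm[j * FRAME_SIZE + i] / 32768.0f;  /* map to [-1, 1) */
}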

1Plot is based on the “Visualizing Sound” demo in the MathWorks MATLAB software.


3.2 Audio Processing on GPU

Performing general computations on GPUs is in itself a fairly new avenue of research, even more so audio processing on GPUs. Very little has been done on using the GPU for such processing, but some examples do exist. Investigations into the feasibility of using GPUs for processing sound have often led to mixed degrees of success[17]. Often, GPU sound processing involves utilizing the GPU for 3D sound processing in games/visualization (which already use the GPU)[17][18][19][20], or for heavier processing like speech recognition[21].

Cowan and Kapralos[4] investigated using the GPU for creating spatial sound for games and virtual environments by means of convolution. This is aimed at giving users an impression of the space they are moving in through an immersive sound experience. A large hall would, for instance, sound very different from a small office, first and foremost in terms of reverberation time, but also due to other factors. Their implementation was developed using a shader language and achieved large speedups for larger data sets (larger numbers of audio samples). Jedrzejewski and Marasek[20] used ray tracing to calculate the detailed paths sound travels in a virtual environment; the methods resemble those used in ray tracing for visualization, which can be parallelized on GPUs[22]. The increased performance from using shader programs on GPUs allowed for more intricate scenarios in real time. Tsingos et al.[23] used an algorithm similar to level-of-detail to group sound sources in a virtual environment to avoid excessive computational load. They achieved comparatively higher performance when many sound sources are involved by offloading pre-mixing to the GPU using shader programs.

Trebien and Oliveira[18] investigated the implementation of recursive 1D filters and their application to re-synthesizing sound effects when objects hit different materials such as wood, glass, plastic etc. They achieved an effective offloading of the CPU and a 2-4 times speedup over their CPU version.

In general, previous research has utilized shader programs, partly because they were the most viable option at the time, and partly because they integrate easily into real-time rendering of virtual environments. Few have investigated audio processing on modern GPGPU frameworks, using CUDA or OpenCL for development. These frameworks bring many advantages in ease of implementation and control, but some tradeoffs are necessary (see Section 5.2).

3.3 Acoustic Echo Cancellation

The focus area of this thesis within audio processing is acoustic echo cancellation. Before the details of how the echo is cancelled are described in the next chapters, the problem of acoustic echo needs to be defined and understood. Acoustic echo describes a problem that occurs in an (at least) two-way audio communication system where the recording device (microphone) can pick up the incoming sound sent from the other party through the playback device (loudspeaker). The sound from the speaker travels through a room that has some unknown acoustic properties into the microphone and gets sent back to the original sender. Network transmission delays further increase the delay of the sound, in addition to the time it takes the sound to travel from speaker to microphone, and the reverberation in the room where the sound is played. A person at the other end will then hear an echo of their own transmission, affected by the room at the other end, possibly even mixed with a new transmission.

This problem is not prevalent with a telephone handset, with the speaker at one end and the microphone at the other, or with a headset with isolated earpieces that do not leak sound into the room or the microphone. But if a speaker-phone is used, where the sound is played into the room, or some types of hands-free equipment are used, the acoustic echo is much more audible. A common use case in recent years is also to use a full computer for VoIP communication, often in a more casual manner than a traditional telephone, where a microphone is mounted by the computer and sound is played through speakers, maybe even an external sound system.

Acoustic echo cancellation is the act of “subtracting” the echo of the sound played through the speakers from the sound picked up by the microphone. The challenge lies in correctly predicting the effect the speaker, microphone and room acoustics have had on the sound before it was picked up by the microphone.

3.4 Speex

The acoustic echo canceller we have chosen to build our work on in this thesis is part of a library called Speex2. It is an open-source, patent-free audio compression format designed for speech. The project aims to lower the barrier of entry for voice applications by providing a free alternative to existing, often expensive, proprietary speech codecs. Having a free software speech codec opens possibilities for many new applications without having to pay licensing fees for the underlying technology used to transfer speech as files or as a stream over a network. The Speex project is part of the non-profit Xiph.Org Foundation, which also maintains the relatively popular audio codecs FLAC (Free Lossless Audio Codec) and Vorbis, as well as the Theora video codec and the Ogg media container format.

Some of the features of the Speex codec itself are [3]:

2Speex website, and source of definition: http://www.speex.org/


• Narrowband (8 kHz), wideband (16 kHz), and ultra-wideband (32 kHz) com- pression in the same bitstream.

• Intensity stereo encoding for encoding a stereo signal without the need to transfer a full dual-mono signal.

• Packet loss concealment for transferring audio over an unreliable network (such as the Internet) with as few interruptions as possible.

• Variable bitrate operation (VBR) for better utilizing the available bandwidth by varying the bitrate depending on the content of the audio stream.

• Discontinuous Transmission (DTX) for stopping transmission when there is nothing to transmit (in conjunction with VAD (see list below) and VBR).

• Fixed-point port, for running on embedded devices without floating point support.

In addition, the codec includes a preprocessing module, with, among others, the following features[3]:

• Acoustic Echo Canceller (AEC) used to avoid echo occurring on the remote end when audio playback is captured by the microphone. More details about this module can be found in later sections.

• Noise suppression for reducing the amount of background noise in the signal.

• Automatic Gain Control (AGC) for adjusting recording volume to a reference level, regardless of specific setups with differing input volume.

• Voice Activity Detection (VAD) for distinguishing between background noise and speech.

• Adaptive Jitter Buffer for making sure that the audio stream arrives in time and in the correct order when transmitted over a network, mainly by buffering.

• Resampler for converting audio between different sample rates.

Some of the software that currently makes use of Speex3:

• Adobe Flash: A general framework for animations, multimedia content and video streaming often used on the Web that has Speex as one of its supported audio codecs4.

3Speex application list (retrieved 15.July 2010): http://www.speex.org/software/
4Flash Player 10 Datasheet (retrieved 15.July 2010): http://www.adobe.com/products/flashplayer/pdfs/Flash_Player_10_Datasheet.pdf


• Asterisk: An open-source PBX (Private Branch Exchange), or telephony system. It is typically used to route calls and provide services within a larger organization, for which commercial systems have typically been used. Interoperability between different telephone technologies (i.e. PSTN/ISDN vs. VoIP) is one of its strengths, and Speex adds to the possible audio codecs that can be used.

• Ekiga: A videoconferencing application.

• Google Talk: A combined instant messaging and VoIP application where Speex is one of the available codecs5.

• LinPhone: A graphical VoIP client.

• Mumble: A voice chat application for gamers, which uses much of the avail- able Speex functionality to improve audio quality. This software project was started by the co-advisor for this thesis, Thorvald Natvig.

• Wengo: A VoIP service with an open source client that uses Speex.

• A large number of games: Speex is an attractive alternative for game creators to use for in-game voice-over and voice communication, that often are integral components of modern games.

Since Speex is open source software, most of the applications/libraries that use it are also free/open-source software.

3.5 Preprocessing in Speex

Preprocessing of sound when using Speex, is primarily done for two reasons:

• Compressibility: With a voice codec, it is very important to remove unnecessary noise (and unwanted sounds) from the signal before it is compressed, to dedicate as much as possible of the available bitrate to transmitting the voice itself, and not noise. This ultimately leads to better voice clarity.

• Quality: The other obvious reason for doing preprocessing is improved audio quality for the end-users. Most of the time, one is only interested in hearing the voice of the person one is communicating with as clearly as possible, even through low-bandwidth network transmissions and in difficult acoustic environments.

5Google Talk Speex support (retrieved 15.July 2010): http://googletalk.blogspot.com/ 2006/10/speex-support.html


The amount of preprocessing that should be applied to the signal will always depend on the most common use case for the application and on how much processing power is available to it. The preprocessing filters are designed to be flexible, so that they execute in real time on everything from small embedded processors to modern x86 processors with SSE instructions.

3.6 Adaptive Filters

A method used for acoustic echo cancellation is to use adaptive filters. An adaptive filter is a self-adjusting algorithm that tries to approximate a real system, for instance room acoustics, based on an input signal over time. These filters are attractive for echo cancellation because modeling all possible room acoustics, speakers and microphones in a physically correct way in a single algorithm would be an insurmountable task. Instead, some metric of the quality of the result from the filter can be used to adapt it as well as possible to the physical system.

The challenge with adaptive filters is to make them accurate and, at the same time, flexible enough to allow for changes in the system. When doing acoustic echo cancellation, the system might change when, for instance, the microphone is moved around the room, or people or objects are moved around inside the room, altering the acoustics slightly.

The problem to be solved is depicted in Figure 3.2. An adaptive echo cancellation filter takes the remote signal (x(n)) and earlier error values from the output. The signal produced by the filter is then subtracted from the signal coming from the microphone. The echo is produced by the sound output at the local side being picked up by the microphone (y(n)), together with the new speech that is to be transmitted (v(n)). The echo is affected by the room acoustics, which form an unknown system that the filter tries to adapt to. Notice that the echo is only removed from the output signal, so the result is only improved for the remote end.

The algorithm used by Speex for echo cancellation is based on a Multidelay Block Frequency domain adaptive Filter (MDF) design[25], and more specifically the Alternatively Updated MDF (AUMDF) variant (see Section 3.6.2). The implementation in Speex is pragmatically constructed with MDF as a base, with some modifications that improve the performance in practice. It is one of the better acoustic echo cancellers available as open source without patent conflicts; many echo cancellation algorithms are hidden within embedded products and proprietary systems/software. Earlier work by Herikstad[26] investigated performing solely the Fast Fourier Transforms (FFTs) of this filter on GPUs using CUDA and CUFFT. Since then, the available implementation has improved somewhat, with support for multiple microphones and speakers, but most of the general algorithm remains the same.


[Figure: block diagram showing the loudspeaker input from the remote end x(n), the unknown system (room) h(n) and the adaptive filter estimate, the echo y(n) picked up by the microphone together with local double-talk and noise v(n), and the error/output e(n) sent to the remote end.]

Figure 3.2: Block diagram of echo cancellation system. Referenced from [24].

It is important to note that the acoustic echo canceller component of Speex is independent of the voice codec, and can be used for any sort of audio signal. The filter itself takes raw sample data as input and output, which means it can be easily combined with other software packages.

3.6.1 MDF

MDF[25] is a variation of the Least Mean Squares (LMS) filter, where an adaptive filter is used to approximate an unknown desired filter. It approximates the filter by finding filter coefficients that minimize the mean square of the error signal (the difference between the desired and actual signal, see Figure 3.2). Least mean squared error is a common statistical error quantity used in signal processing.
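For intuition, a minimal time-domain LMS update is sketched below. This is only the conceptual building block; the Speex MDF filter operates on blocks in the frequency domain as described next, so the names and structure here are illustrative.

/* Conceptual time-domain LMS step (not the Speex MDF implementation):
 * w is the adaptive filter, x the far-end sample history, d the microphone
 * sample. Returns the error e = d - y, the echo-cancelled output sample. */
static float lms_step(float *w, const float *x, float d, int taps, float mu)
{
    float y = 0.0f;
    for (int i = 0; i < taps; i++)          /* filter output y = w . x */
        y += w[i] * x[i];
    float e = d - y;                        /* error against the desired signal */
    for (int i = 0; i < taps; i++)          /* gradient-descent weight update */
        w[i] += mu * e * x[i];
    return e;
}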

MDF is a frequency domain implementation of an LMS filter where blocks of a fixed size, which can be smaller than the filter length, are processed in the frequency domain. Other methods exist, such as FLMS, for computing an LMS filter in the frequency domain, but FLMS is inflexible with respect to the size of the blocks of data that need to be transformed into the frequency domain using FFTs (see Section 3.7). If the MDF block size equals the filter length, MDF becomes identical to FLMS. With smaller blocks, it is possible to make better use of embedded hardware for FFTs, and thus avoid the performance problems and inaccuracies that can occur with larger FFTs; with smaller blocks the latency of the filter can also be reduced, so that it adapts faster to changes.

For the rest of this thesis, the total length of all the blocks in the MDF will be termed the echo tail. This tail consists of a number of blocks (frames) back in time, often of a size that fits well with high-performance FFT implementations. The length of the echo tail determines how far back in time the filter can cancel echo; how long it needs to be depends on the acoustics of the room in question. In addition to room reverberation, audio equipment and computers also add delay. The sample rate also plays into the length of the echo tail, as a higher sample rate will need a longer tail to cover the same period of time, and will naturally be more computationally demanding.

The following is an overview of the flow of input frames through the filter (see Figure 3.3, which is enumerated according to this list):

1. The current and last frames are taken as input and combined using an overlap-save or overlap-add technique[27]. This allows the block filter to better approximate a continuous filter.
2. The overlapped frame is transformed into the frequency domain using a 1-dimensional real-to-complex FFT. The size of the FFT is twice the size of the input frames.
3. The transformed frame is added to the stored blocks (the “echo tail”). The blocks are stored in a FIFO buffer; the oldest one is removed.
4. All the blocks are multiplied with corresponding stored weights.
5. All the multiplied values of the blocks and weights are added together into one block.
6. The block is transformed back to the time domain using a 1-dimensional complex-to-real FFT (the inverse of the transform above).
7. Half the points in the transformed block are output as the filtered frame.
8. After the final frame is calculated, the value is subtracted from the initial frame to find the error value.
9. The error frame is padded with N/2 zeroes and transformed into the frequency domain using a real-to-complex FFT.
10. Each weight is updated by adding to it the calculated least-mean-square of the transformed error frame.

The major addition to the plain LMS algorithm in the Speex implementation is double-talk detection[24]. A special way of updating the learning rate of the filter is used, based on the current amount of noise and double-talk in the signal.


Figure 3.3: Block diagram of an MDF with an 8-block echo tail. Designed after illustrations in [25] and [26].


Double-talk occurs when the person in front of the microphone is talking at the same time as speech is played through the speakers from the far end. Double-talk can confuse an adaptive filter if not handled specifically; ideally, the cancellation of the speech picked up from the speaker should not be affected by the person talking. In the Speex implementation, a continuous learning rate variable is used that can be adjusted to compensate for disturbances that should not affect the filter. The learning rate is given as:

\hat{\mu}_{opt}(k, \ell) = \min\left( \hat{\eta}(\ell)\,\frac{\left|\hat{Y}(k, \ell)\right|^2}{\left|\hat{E}(k, \ell)\right|^2},\ \mu_{max} \right) \qquad (3.1)

Here, k is the frequency index and ℓ is the frame index. Ŷ(k, ℓ) is the echo signal transformed into the frequency domain, and Ê(k, ℓ) is the frequency domain version of the error signal. η̂(ℓ) is a frequency-independent leakage coefficient, and µmax is a design parameter for the maximum adaption rate that prevents the filter from becoming unstable. When double-talk is present in the microphone signal, the filter should adapt at a much slower rate than normal. Double-talk is differentiated from a change in the echo path (which also leads to large errors) by a leakage estimate: if the echo path changes, there is a large correlation between the power spectra of the error and the estimated echo, but not during double-talk.
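As an illustration of how Equation 3.1 could be evaluated per frequency bin, the sketch below assumes the power spectra |Ŷ(k, ℓ)|² and |Ê(k, ℓ)|² are already available in two arrays; the buffer layout and names do not correspond to the Speex internals.

/* Illustrative per-bin learning rate following Equation 3.1.
 * Yf and Ef hold the power spectra of the estimated echo and the error. */
static void learning_rate(const float *Yf, const float *Ef, float *mu,
                          int bins, float eta, float mu_max)
{
    for (int k = 0; k < bins; k++) {
        float r = eta * Yf[k] / (Ef[k] + 1e-10f);  /* small term avoids /0 */
        mu[k] = r < mu_max ? r : mu_max;           /* clamp to mu_max */
    }
}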

3.6.2 AUMDF

The particular MDF implementation in Speex is a variant called Alternatively Updated MDF (AUMDF). With this algorithm, only one of the weights is updated for each frame that is processed by the filter, in contrast to updating a separate weight for each frame of the echo tail as with regular MDF. Without AUMDF, both an FFT and an IFFT would need to be run for each weight that makes up the echo tail and for each input and output channel. This can be very computationally demanding, especially for longer echo tails. If, for instance, a frame size of 512 samples is used with an echo tail of 20 (combined) frames and two output channels, this corresponds to 40 FFTs and 40 IFFTs of length 1024. In this case, however, AUMDF reduces this to 2 FFTs and 2 IFFTs, but with a trade-off in accuracy.

3.7 Fast Fourier Transform

In many audio processing applications it is more efficient to operate on sound in the frequency domain. This can often lead to favorable performance over alternative methods, but depends on a fast transformation of the input data, which is naturally recorded in the time domain. The transformation used in general is called the


Discrete Fourier Transform (DFT), but what we are interested in is a fast algorithm for signal processing, and these are termed Fast Fourier Transforms (FFTs). The two terms are often interchanged, but FFT will mostly be used in the rest of this text (although DFT is the correct term when discussing the mathematical transformation itself). Cooley and Tukey[28] popularized the first FFT algorithm in the era of electronic computers, which could perform a DFT in N log(N) operations.

The FFT is a very important part of Speex in general, including the echo filtering procedure. Speex has an abstraction layer that makes it possible to use several different implementations, and at the time of writing the following were supported: the FFTW (Fastest Fourier Transform in the West) library, the KISS (Keep It Simple, Stupid) FFT library, the FFT implementation from MKL (Intel Math Kernel Library), as well as a custom implementation from the Ogg project (smallft) and an implementation with support for fixed-point architectures.

FFTW is a highly optimized FFT library for CPUs, and its integration with Speex was used as a starting point when developing the OpenCL implementation, because it is well documented and easily available, being free software. It utilizes so-called “plans” that dynamically optimize the operation based on the parameters of the transform. The two operations used by Speex are one-dimensional DFTs, for purely real-to-complex and purely complex-to-real transforms. The fact that one side of the transform is purely real opens up further performance benefits (the manual states a factor of two improvement in speed and memory usage).
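The sketch below shows the FFTW plan workflow for the two transforms mentioned (single-precision, one-dimensional real-to-complex and complex-to-real). The transform length and planner flag are illustrative choices; Speex hides these calls behind its own FFT wrapper functions.

/* Sketch of the FFTW "plan" workflow for the two transforms Speex needs. */
#include <fftw3.h>

int main(void)
{
    const int n = 256;                      /* transform length (assumed) */
    float         *in  = fftwf_malloc(sizeof(float) * n);
    fftwf_complex *out = fftwf_malloc(sizeof(fftwf_complex) * (n / 2 + 1));

    /* A plan is created once and reused for every frame. */
    fftwf_plan fwd = fftwf_plan_dft_r2c_1d(n, in, out, FFTW_MEASURE);
    fftwf_plan inv = fftwf_plan_dft_c2r_1d(n, out, in, FFTW_MEASURE);

    /* ... fill `in` with one frame of samples ... */
    fftwf_execute(fwd);   /* time domain -> frequency domain */
    fftwf_execute(inv);   /* back again (unnormalized: scale by 1/n, see below) */

    fftwf_destroy_plan(fwd);
    fftwf_destroy_plan(inv);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}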

The DFT transform is defined by the FFTW library as [29]:

Y_k = \sum_{j=0}^{n-1} X_j\, e^{-2\pi j k \sqrt{-1}/n} \qquad (3.2)

Here n is the number of real numbers in X. The real-to-complex transform only outputs n/2 + 1 complex numbers; the rest are redundant because of symmetry.

The inverse (complex to real) transform is defined as:

Y_k = \sum_{j=0}^{n-1} X_j\, e^{2\pi j k \sqrt{-1}/n} \qquad (3.3)

This results in n real numbers. FFTW calculates so-called unnormalized transforms, so to get the original real numbers back after both a forward and an inverse transform, one needs to multiply the input to the forward transform by 1/n. This is performed by the Speex wrapper functions for FFTW.

While FFTW is a well-written library, in this project it must be possible to calculate the FFT purely on the GPU. CUFFT is a closed-source (but included in the CUDA SDK) FFT library from NVIDIA that computes transforms on the GPU, and has an API that closely resembles that of FFTW. It would be an excellent candidate for replacing FFTW in Speex, and exactly this point was explored by Herikstad[26]. It has also been used as a building block in numerical simulations, for instance by Stantchev et al.[30], who used it in computational kernels for simulating plasma turbulence and compared it to a highly optimized CPU implementation. CUFFT is also commonly used as a benchmark for GPU FFT implementations[1]. It is highly beneficial to store both the input and the output of the FFT in GPU memory if most of the sound preprocessing algorithm is to be implemented on the GPU. This avoids data transfers over the PCI-E bus, which is slow compared to global memory on the GPU, and the extra latency such transfers would introduce.

CUFFT has this functionality, but for this project it is not a feasible option, because it is only available on the CUDA platform from NVIDIA and requires an NVIDIA GPU. This project investigates using OpenCL to create a more general implementation, and using CUFFT would significantly restrict the number of platforms that could utilize the library. This is important both to support traditional GPUs from other vendors, and for possible use on embedded devices with accelerator chips or smaller GPUs, or even just multi-core CPUs, with the same codebase.

Several FFT alternatives were considered for the OpenCL implementation:

• Create a custom FFT from scratch: Creating a custom FFT implementation in OpenCL from scratch could result in a small, simple implementation that only includes the transformations necessary for Speex. But implementing efficient FFTs on GPUs, as explored earlier, is not a trivial problem. It would also limit the benefits from future library developments, as can occur in, for instance, CUFFT.

• Use an FFT implementation from the available OpenCL SDKs: Both the released OpenCL SDKs from AMD and NVIDIA include examples for FFTs, but none of them are particularly well suited for the specific transform required by this work. It might also pose trouble (among others, with license/copyright) to decouple the examples from the SDK of which they are a part and include them in Speex. They may also include platform-specific optimizations by the GPU vendors.

• Use another third-party implementation: Apple has released an open-source, simple FFT library for OpenCL6 that should work across platforms and is actively maintained. This makes it possible to utilize a versatile library with the possibility of adapting it to our specific needs. A few shortcomings of the library include the lack of (official) OpenCL CPU support, the lack of specialized real-to-complex and complex-to-real transforms with improved performance (which CUFFT also lacks) and the use of some Mac-specific libraries (but only in the example application).

6Apple OpenCL FFT library (retrieved 20.July 2010): http://developer.apple.com/mac/library/samplecode/OpenCL_FFT/

The library chosen was the Apple FFT library, with a few modifications as described in Section 4.4. It is based on the work in [31] and [32], so it can be assumed to be fairly well optimized.

Chapter 4

Optimizing Speex Echo Cancellation for OpenCL

This case study reimplements the MDF echo filtering component of the Speex library using OpenCL. The goal was to create a function completely compatible with the existing one for our testing parameters. The existing implementation is written in pure C for use both on PCs and on embedded devices without floating point support. Its data structures are stored in a large C struct, which is replicated to be compatible with our implementation. This makes it possible to exchange the existing filter with the OpenCL implementation without much difficulty. Several development platforms were used to make sure that the OpenCL code was widely compatible (small differences exist between implementations). These were the AMD Stream platform (CPU/GPU) and the NVIDIA CUDA 3.0 SDK, both on Linux-based systems.

This chapter starts with Section 4.1, which describes how the new implementation is integrated with Speex. Further, Section 4.2 introduces the application created to test the OpenCL implementation, and Section 4.3 describes in more detail how timing and verification were done. Section 4.4 details how the FFT was implemented on the GPU and how it was adapted to the data format expected by the Speex library. Section 4.5 lists some of the steps taken to parallelize the echo cancellation filter. Documentation of the kernel functions themselves (in the OpenCL implementation) can be found in Section 4.6. After the initial implementation was created, a few experiences affected further development. Section 4.7 describes some of the challenges faced developing the OpenCL code for different hardware platforms, while Section 4.8 examines some of the performance limitations found during development, and a few of the efforts to best utilize the computing power of the GPU are listed in Section 4.9. Lastly, Section 4.10 investigates how co-processing can be applied to

the filter, and a simplistic solution is presented for current systems.

4.1 Integration with Speex

The OpenCL implementation was made to fit into the Speex framework as far as possible. The echo filter code mainly consists of the following components:

• A filter state: This is a very large C structure with all the buffers (26 in total) used by the filter. It contains both temporary values and values that persist between input frames.

• Initialization/cleanup: Mainly allocates and frees the buffers in the state struct, as well as setting the default values and constants.

• The main filtering function: This is called for each input frame and contains all the logic of the filter.

• Subroutines: Common subroutines that are called by the main filtering function.

The filter state is reused exactly as it is, to be able to run the filters with the exact same input data and to compare intermediate results. The initialization and cleanup functions are completely replaced by calls to the OpenCL runtime, and device buffers are created corresponding to those that already exist in the filter state. The main filtering function is replaced with the OpenCL implementation, which uses the same prototype definition and is compatible with the existing implementation. Some of the filter logic is copied from the library version, but most is rewritten from scratch. Some of the subroutines are replicated, but many were combined with other operations (see Section 4.8).
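The sketch below illustrates this coexistence. The original prototype is taken from the public Speex header; giving the OpenCL variant the exact same signature is the assumption this work relies on (the function name follows Section 4.6).

/* Sketch: the two filters behind one prototype (OpenCL signature assumed). */
#include <speex/speex_echo.h>

void speex_echo_cancellation_opencl(SpeexEchoState *st, const spx_int16_t *rec,
                                    const spx_int16_t *play, spx_int16_t *out);

/* Call either the reference C filter or the OpenCL one for a frame. */
static void run_filter(int use_opencl, SpeexEchoState *st,
                       const spx_int16_t *rec, const spx_int16_t *play,
                       spx_int16_t *out)
{
    if (use_opencl)
        speex_echo_cancellation_opencl(st, rec, play, out);
    else
        speex_echo_cancellation(st, rec, play, out);
}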

4.2 The Testing Application

Included in the Speex library is a testing application for the echo cancellation functionality, simply called “testecho”. This application was used as the basis for testing both the original library function and the OpenCL version. It reads two files from disk: a file with the reference (without echo) audio, and a file with echoed audio. Both input files are raw samples (no container format) with a configurable sample rate. Frames from both files are passed to the echo cancellation filter, which in turn returns a single output frame that is written to an output file. Parameters that can be adjusted include the frame size (how large frames are read and processed at a time, by default 128 samples) and the echo tail length (how far back data is stored in the filter, by default 1024 samples). The original application does not support multi-channel input/output (i.e. the use of several microphones and/or speakers).
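A stripped-down version of this flow, using only the public Speex echo canceller API and omitting the preprocessor and the OpenCL-specific steps described below, might look as follows (file names and parameter values are illustrative, and error handling is omitted):

/* Minimal testecho-style loop (sketch, not the actual test program). */
#include <speex/speex_echo.h>
#include <stdio.h>

#define FRAME 128          /* samples per frame */
#define TAIL  1024         /* echo tail length in samples */

int main(void)
{
    FILE *mic  = fopen("mic.raw", "rb");     /* echoed microphone signal */
    FILE *spk  = fopen("speaker.raw", "rb"); /* reference (far-end) signal */
    FILE *outf = fopen("out.raw", "wb");
    spx_int16_t rec[FRAME], play[FRAME], out[FRAME];

    SpeexEchoState *st = speex_echo_state_init(FRAME, TAIL);

    while (fread(rec, sizeof(spx_int16_t), FRAME, mic) == FRAME &&
           fread(play, sizeof(spx_int16_t), FRAME, spk) == FRAME) {
        speex_echo_cancellation(st, rec, play, out);   /* one frame at a time */
        fwrite(out, sizeof(spx_int16_t), FRAME, outf);
    }

    speex_echo_state_destroy(st);
    fclose(mic); fclose(spk); fclose(outf);
    return 0;
}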

The modifications made to accommodate the OpenCL version are fairly simple:

• An additional initialization function was added to initialize the OpenCL environment and allocate the needed buffers.

• Calls to the initialization and cleanup functions of the timing and verification data (see Section 4.3).

• A new function to copy the complete echo cancellation state, in order to be able to call both the library and the OpenCL version with the same input data. The filter changes several of the buffers and values in the state by design, not only the output buffer.

• A call to the OpenCL version of the speex_echo_cancellation() function, that processes each frame.

• The application was adapted for different input parameters to create a more realistic test scenario with a higher sample rate, larger frame sizes and multiple channels (see Chapter 5 for more details).

The testing application consists of these main steps:

1. Initialization phase

(a) Open files for the speaker signal, microphone signal and the output. Optionally, these files can store interleaved data for several channels.

(b) Allocate input/output buffers, used as temporary storage when reading/writing files. The size of these buffers must be a multiple of the chosen frame size and the number of channels. Separate buffers are used for the OpenCL version, in order to always provide it with the same data.

(c) Initialize the echo state, which is a large struct used for storing all data used by the echo canceller. This contains, among other things, the echo tail, temporary buffers and configuration parameters. A separate version is created for the OpenCL version using the same format as the original program.

(d) Initialize the Speex preprocessor state. Similar to the echo state, this holds information for the preprocessor, which among other things removes noise before the output is written to file. This functionality was not altered in any way in this case study.


(e) Initialize OpenCL and allocate the device buffers required (corresponding to those in the echo state). This step sets up the OpenCL environment, retrieves a reference to the compatible hardware device(s) on the system and compiles the OpenCL kernel code at runtime (see the sketch after this list).

(f) Initialize the timing system by allocating any memory required.

2. Main loop

(a) Read data from the input files (microphone and speaker signals) into the I/O buffers.

(b) Initialize verification and timing checkpoints by allocating the memory needed and setting initial values.

(c) If both the original CPU filter and the OpenCL version are run side-by-side for debugging, some additional steps are taken to ensure that they have the same input data. First, a complete copy is made in main/CPU memory of the echo state of both the original and OpenCL filters (as the echo state is also part of the input data). In addition, the I/O buffers are copied, so that both filters are passed the same data. This can be skipped if only one is run (production conditions).

(d) The main echo cancellation function is called.

(e) Cleanup functions for the verification and timing checkpoints reset the data before the next iteration.

(f) Run the output through the rest of the Speex preprocessor.

(g) Write the output frame(s) to file.

3. Cleanup

(a) Print timing statistics (average iteration time etc.) and clean up system memory used for timing.

(b) Clean up the echo state for the original filter state and the copy. This is done using original functionality in Speex.

(c) Clean up the Speex preprocessor state.

(d) Close all opened files.

(e) Free the I/O buffers and their copies.

(f) Free allocated OpenCL device memory and associated variables.
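The following sketch corresponds to step 1(e): a minimal OpenCL host-side setup that finds a device, creates a context and command queue, and compiles kernel source at runtime. Error handling is stripped and the kernel source string is a placeholder for the actual filter kernels.

/* Sketch of OpenCL host initialization for step 1(e) (illustrative only). */
#include <CL/cl.h>

static const char *src = "__kernel void noop(void) { }";  /* placeholder */

int init_opencl(cl_context *ctx, cl_command_queue *queue, cl_program *prog)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    *ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    *queue = clCreateCommandQueue(*ctx, device, 0, &err);

    /* The kernel code is compiled at runtime for the device that was found. */
    *prog = clCreateProgramWithSource(*ctx, 1, &src, NULL, &err);
    err = clBuildProgram(*prog, 1, &device, NULL, NULL, NULL);
    return err == CL_SUCCESS ? 0 : -1;
}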

This process is illustrated in Figure 4.1. Grey boxes are timing/verification functions that are disabled for benchmarking. Orange boxes are functionality implemented using OpenCL. The white boxes represent the original functionality of the test program. Functions that are horizontally grouped do similar tasks. On several occasions, the OpenCL functionality is performed in tandem with the original functionality. The program as shown is only run for debugging purposes, since most computations are replicated. A production version would either perform load

Figure 4.1: Flow diagram of the test application.

balancing between the original and OpenCL versions (see Section 4.10 for a simple load balancer), or run one of the versions exclusively. The logic for load balancing is not shown.

4.3 Timing and Verification

Simple frameworks were created to ease debugging and to time the different implementations.

Verification

To avoid writing custom testing code for all stages of the echo filter, a simple framework was created for verification of the results. This involves storing snapshots of selected buffers from the echo state at certain points in time, called “checkpoints”. These are numbered single values or buffers, stored with one of the datatypes float, integer or short. Reference values are generated during the run of the original library implementation. When the OpenCL version is run with the same input data, it calls functions that check the data against the reference values. If the difference between the values is larger than the specified margin of error, both the reference and the erroneous data are printed to the console for investigation. The data can be stored either in device global memory or in main memory; in the case where it is stored on the GPU, it is transferred to a temporary buffer on the CPU. The verification can easily be disabled for benchmarking, as it adds a significant amount of overhead to the total computations.
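A hypothetical sketch of the comparison performed at each checkpoint is shown below; the function name and reporting format are illustrative, not the actual framework code.

/* Illustrative checkpoint comparison with a tolerance (margin of error). */
#include <math.h>
#include <stdio.h>

static int check_checkpoint(int id, const float *ref, const float *test,
                            int n, float tol)
{
    int errors = 0;
    for (int i = 0; i < n; i++) {
        if (fabsf(ref[i] - test[i]) > tol) {
            printf("checkpoint %d, element %d: ref=%g got=%g\n",
                   id, i, ref[i], test[i]);
            errors++;
        }
    }
    return errors;   /* number of elements outside the allowed margin */
}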

Timing

Timing the program is performed in much the same manner as verification. Each implementation calls a function that stores the elapsed time in an array as it progresses, together with a string describing that particular point. At the end of the filter, the values are printed to the console in a list, relative to the first value. Since the only operating system used for benchmarking was Linux, the sys/time.h header was used with its gettimeofday() function, which theoretically has down to microsecond precision, but depends on the hardware clocks available on the system. This timing framework was used only to measure the overall progress of the filter, including overhead. GPU timers were not used to measure kernel execution time, as the NVIDIA Compute Profiler was used when optimizing the kernels individually.
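The timing itself boils down to differences between gettimeofday() readings, as in the following sketch (helper name and usage are illustrative):

/* Elapsed wall-clock time in microseconds since `start`, via gettimeofday(). */
#include <sys/time.h>

static long elapsed_us(const struct timeval *start)
{
    struct timeval now;
    gettimeofday(&now, NULL);
    return (now.tv_sec - start->tv_sec) * 1000000L +
           (now.tv_usec - start->tv_usec);
}

/* Usage:
 *   struct timeval t0; gettimeofday(&t0, NULL);
 *   ... run one filter iteration ...
 *   long us = elapsed_us(&t0);
 */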


4.4 Integrating the FFT on GPU

As described in Section 3.7, the Apple OpenCL FFT library was chosen as the best available candidate meeting the requirements of the case study. Some adaptions were required because of the different data formats used by Speex and the FFT library. The differences between FFT libraries are usually handled in Speex through the use of FFT wrapper functions with generic prototypes, which are then used throughout the framework. Adding a GPU FFT library to this wrapper was explored by Herikstad[26], but transfers of data to and from device (GPU) memory then need to be included unless the whole framework is implemented on the GPU. For this case study, separate FFT functions are created instead, so that FFTs are executed using device memory only in the OpenCL implementation of the echo canceller. The process of implementing the FFTs went through several iterations before the final solution was reached:

1. Firstly, the library test program was adapted to run on Linux, as it was dependent upon Mac OS X specific libraries for timing and validating the results. FFTW was used to validate the results, and the standard Unix timing API was used for timing. Several unnecessary test cases were removed, and a test case with a 1-dimensional DFT with real data (similar to the one used by Speex) was implemented. Functions to convert to/from the Speex FFT format were implemented and tested against FFTW by looking at the existing FFT wrappers.

2. The kernel functions for converting the input/output data were added to the echo cancellation source files. This made each transformation a three-stage process (with different sets for inverse and forward transforms):

(a) Prepare - The input data is copied (with offset if needed) into a buffer with the correct format for the OpenCL FFT.
(b) Execute - The FFT is executed using the external library functions.
(c) Finalize - The output data is copied (again, with offset) into the output buffer with the correct format for Speex.

As discussed in Section 4.8, this leads to very high overhead and overall poor performance because of the complex program structure, so a new solution was needed. The OpenCL FFT library also has some initialization overhead for each program run, as it dynamically generates kernel functions depending on the device it is executing on.

3. The final solution involved taking the source code for the kernel generated by the OpenCL FFT library, and including it among the rest of the kernel functions for the echo filter and integrating it with the prepare and finalize kernels to reduce the whole process into a single kernel call. It still copies the data to separate input/output buffers used by the FFT to be able to change


the data format without ruining the original content. But the combined kernel proved to have much higher performance and to properly support batch runs with several transforms in parallel. A problem with this approach is that the FFT size is hard coded in the implementation for specific hardware. But it would not necessarily create performance problems for such small sizes (256/1024 samples) when moving to other devices. A larger problem would be if different frame sizes were to be used for filtering, but it should not be a large challenge to adapt to the sizes needed, based on the source code output of the OpenCL FFT library for those sizes.

As mentioned, the data formats in Speex, FFTW and the OpenCL FFT library have some differences. The buffers in Speex are mostly made to be as small as possible to allow for embedded devices with little memory available, whereas the FFT libraries are often made to be as flexible as possible with regard to possible input. Another issue is that the OpenCL FFT library only has complex-to-complex transforms available, which are slightly different from transforms involving purely real numbers. The complex-to-complex transforms can be used with real input data with some minor changes to the data[33], but one does not get the performance benefits of the specialized implementations that are available in FFTW.

4.4.1 FFT Data Preparation

The data preparation step for the forward FFT, taking into account both the different data formats and the use of a complex-to-complex DFT, is described in Algorithm 1. input is an array of n real numbers in the time domain (n is also the FFT size) and x is the array of complex numbers passed to the FFT. Note that all arrays are zero-indexed.

It is a very simple algorithm where each real number is mapped to the corresponding real part of a complex number in the time domain. So for an FFT of length 256, one would have 256 real numbers as input and 256 complex numbers as output of the preparation stage.

Algorithm 1 Preparation step for OpenCL forward FFT.
m ← 1/n
for i = 0 to n do
    x[i].real ← input[i] × m
    x[i].imag ← 0
end for
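In the OpenCL implementation this preparation step maps naturally to one work-item per sample. The kernel below is an illustrative sketch of Algorithm 1, not the actual kernel from the implementation (which combines it with other operations, see Section 4.6):

/* Illustrative OpenCL kernel for Algorithm 1: one work-item per sample. */
__kernel void fft_prepare_forward(__global const float *input,
                                  __global float2 *x,
                                  const int n)
{
    int i = get_global_id(0);
    if (i < n) {
        float m = 1.0f / (float)n;
        x[i] = (float2)(input[i] * m, 0.0f);   /* real part scaled, imag = 0 */
    }
}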


Preparing data for an inverse FFT is slightly trickier, since we have n/2+1 complex numbers in the frequency domain as input, packed as an interleaved float array. This interleaving means that the real component of the first number is stored in X[0], the imaginary component is stored as X[1], the real component of the second number is stored as X[2] and so on.

The inverse FFT is still executed with the same length as the forward FFT, even though the input is n/2 + 1 complex numbers. As described in [33], to execute an inverse complex-to-real FFT, one needs to mirror the numbers around a middle element. For instance, for an FFT size of 256, the 129th element should be the same as the 127th (with the imaginary part negated), the 130th should be the same as the 126th, and so on. This mirroring causes the output data after the inverse FFT to be 256 complex numbers in the time domain with only real components. This is all summarized in Algorithm 2. input is the input array of n/2 + 1 complex numbers stored interleaved (offset by -1 after the first element, so the last element is the imaginary part of the last complex number) and X is the array of complex numbers passed to the inverse FFT.

Algorithm 2 Preparation step for OpenCL inverse FFT.
X[0].real ← input[0]
X[0].imag ← 0
X[128].real ← input[255]
X[128].imag ← 0
for i = 1 to n/2 − 1 do
    X[i].real ← input[2 × i − 1]
    X[i].imag ← input[2 × i]
end for
for i = n − 1 down to n/2 + 1 do
    X[i].real ← X[n − i].real
    X[i].imag ← −X[n − i].imag
end for

4.4.2 FFT Data Finalization

The finalization functions convert the output data from the OpenCL FFT library back to the format that Speex expects. The forward FFT finalization is listed in Algorithm 3. X is the result of the FFT (n/2 + 1 complex numbers) and output is the array of interleaved complex numbers passed back to Speex.

The algorithm itself is simple, but note that the imaginary parts of the first and last numbers are truncated.


Algorithm 3 Finalization step for OpenCL forward FFT.
output[0] ← X[0]
for i = 1 to n/2 do
    output[2 × i − 1] ← X[i].x
    output[2 × i] ← X[i].y
end for

As mentioned earlier, the result of the inverse FFT, when the input is mirrored correctly, consists of purely real numbers. The finalization algorithm listed in Algorithm 4 therefore simply extracts the n real parts of the complex numbers and stores them in the output array. x is the result of the inverse FFT.

Algorithm 4 Finalization step for OpenCL inverse FFT.
for i = 0 to n do
    output[i] ← x[i].x
end for

4.5 Parallelization of MDF

OpenCL programs must be written to be inherently parallel in order to fully utilize the available hardware. But the original MDF implementation in Speex is written to be completely sequential, except for any possible optimizations done by the compiler, although such optimizations are not utilized in current software.

4.5.1 Calculate Frames in Parallel

The filter loops through all the samples in the input frame or combined window (current and previous frame). In most cases, these loops are trivially parallelizable, as the samples can be calculated independently. Typical frame sizes allow for an elegant mapping to the computational units on the GPU, with parallelism in the order of hundreds, and can be executed in a few operations. Although this exploits the computational units in an efficient way, it does not add enough computational load by itself to mask memory accesses and allow for good throughput. Optimally, these operations are good candidates to combine with other operations to keep the number of kernel launches to a minimum (see Section 4.8), but certain of these operations have dependencies that limit them to being executed on a single thread, so execution on a CPU is preferred for them.
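The sketch below shows the general pattern for such trivially parallel loops: one work-item per sample of the frame. The kernel and its arguments are illustrative; the actual filter kernels combine several such operations (see Section 4.6).

/* Illustrative OpenCL kernel: a per-sample loop becomes one work-item per sample. */
__kernel void scale_frame(__global float *frame,
                          const float gain,
                          const int frame_size)
{
    int i = get_global_id(0);      /* one work-item per sample */
    if (i < frame_size)
        frame[i] *= gain;
}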


4.5.2 Calculate I/O Channels and Echo Tail in Parallel

In addition to the frames themselves, the outer loops often iterate over a number of input or output channels. These can be parallelized in larger blocks, adding an additional level of parallelism. The FFTs are also mostly executed once for each channel, which translates into larger batches. But as is the case with frames, some operations have dependencies that make them impossible to run in parallel. The echo tail is in the same class as the channels, and this is most dominant when updating the weights themselves (see Section 4.9 for more information about how this can further affect performance).

4.5.3 Identify Independent Sections

The overall architecture of the echo filter is, for the most part, a pipeline. This requires that the operations are executed in order and with some form of global synchronization. The operations that can be executed out of order can benefit overall performance if they are scheduled where they fit best, preferably in parallel with other operations.

4.6 OpenCL Kernel Functions

Using the techniques listed in Section 4.5, the final implementation was created as an alternative version of the speex_echo_cancellation() function with sub-routines, called speex_echo_cancellation_opencl(). It can coexist in the library together with the original function, which facilitates testing and validation.

The following is a list of all the kernel functions that make up the OpenCL implementation (see also Figure 4.2 for an overview), sorted by first appearance in the code, but grouped into two groups: MDF and FFT kernels. The FFT kernels are not purely FFT related, but contain some integrated functionality to reduce the number of kernel calls required. Unless otherwise noted, combined kernels are made task-parallel across the combined tasks whenever no data dependency is present. Some descriptions are based on documentation from the Speex source code.

Figure 4.2: Flow diagram of the OpenCL MDF implementation. [Flow chart of speex_echo_cancellation_opencl(), showing the kernels listed below in execution order.]

4.6.1 MDF Kernel Functions

clMDF_prepare_shift_preemph_freq_data
The first kernel in the filter that is run on the device. It applies pre-emphasis to the input data (after it has been converted from integers to floating point numbers on the host), shifts the main time domain buffer (the echo tail) to make room for new data (the far end signal) and inserts the far end signal after applying pre-emphasis. Applying pre-emphasis to the input is only parallelized using a single work-group of the device (because synchronization is needed), and has to loop through a number of samples depending on the frame size and the number of input channels. Shifting the main time domain buffer is parallelized for both samples1 and output channels using several work-groups. The shifting of the frequency domain buffer is parallelized on samples and echo tail frames, but has to loop through speakers for synchronization.

clMDF_inner_prod_power_spec_spectral_mul
Combines an inner product calculation and the first part of a "spectral multiply accumulate" operation. More details can be found in the descriptions of the corresponding dedicated kernels. The spectral multiply accumulate operation requires an additional reduction before the finished result can be used. The inner product is parallelized by samples and needs to loop through output channels. The spectral multiply accumulate operation is parallelized by samples, echo tail frames, input channels and output channels.

clMDF_compute_foreground_filter
A simple kernel that calculates the difference between the error buffer and the input buffer for a frame, and stores it in a separate slot in the error buffer. It is parallelized by samples and input channels.

clMDF_inner_product_reduce
This is the general-purpose inner product kernel, although it is often combined with other operations. It is used when a combination is not practical or possible, and is employed as a template for the combined operations. It multiplies the real and imaginary parts of the interleaved array of complex numbers, and all the products are then added together into a single scalar value; this reduction is done in local memory using a common reduction pattern from the NVIDIA and ATI SDKs (basically a binary tree pattern). The kernel is parallelized on samples and can also run in several work-groups if batch processing is needed. The partial results are then added together on the host; this is usually a very small number of values, so the overhead of the memory read operation will dominate.

clMDF_adjust_prop_phase1_reduce
Performs one of the calculations needed when resolving the new adaptation rate of the filter. It calculates the square of each element of an array and adds them all together into a single value. Again, the reduction is done in local memory. This kernel is parallelized on samples, input channels, output channels and echo tail frames. The input to this kernel is the output of clMDF_weighted_spectral_mul_conj from the previous frame.

1 To clarify: when a kernel function is said to be parallelized by samples, each thread/work-item is given a single sample to process, from a frame or a window (2 frames).

clMDF_weighted_spectral_mul_conj
"Computes weighted cross-power spectrum of a half-complex (packed) vector with conjugate". At its core, this function performs an operation on the echo tail in the frequency domain and the error transformed into the frequency domain. There are two complex numbers, X_i = a + bi (from the echo tail) and E_i = c + di (from the error), and the function calculates S = (a × c + b × d) + (−b × c + a × d)i. The function is only run if the input signal is not saturated (does not consist of extremal values); with a saturated signal the whole filter would be ineffective. It is parallelized on samples, input channels, output channels and echo tail frames.

clMDF_spectral_mul_accum
This is a special-case kernel that calculates cross-power spectra for a batch of frames that are to be added together, in addition to setting three unrelated buffers to zero to avoid the overhead of separate kernel launches. The spectral multiply accumulate operation is essentially a multiplication and addition of complex numbers on signals transformed to the frequency domain. For two complex numbers X_i = a + bi and Y_i = c + di, the function calculates S_i = (a × c − b × d) + (b × c + a × d)i, and the results are added together for all frames in the echo tail into a single frame. The main calculations are parallelized on samples and input channels, but a loop through echo tail frames is required because they are accumulated in global memory. Clearing the three buffers is only parallelized for samples (since they are the size of a single frame).

clMDF_combined_response_diff_inner_prod
A combined kernel consisting of two inner products (see clMDF_inner_product_reduce) and a calculation of a difference in response. The response difference is stored in the buffer for the error in the time domain, and is calculated as the difference between the input and the result of clMDF_spectral_mul_accum_long between the echo tail and the weights, transformed into the time domain. All operations are parallelized on samples, and the response difference must be executed in the same work-group as one of the inner products because of a data dependency.

clMDF_update_foreground_filter
This kernel applies a smooth transition to the error in the time domain when updating the foreground filter, multiplying with a pre-calculated cosine-based smoothing function so as not to introduce blocking artifacts. It is parallelized on samples and input channels. The update itself is performed by an earlier call to clMDF_copy_float_array.

clMDF_update_background_filter
The corresponding function for the background filter consists of a number of copy operations between temporary buffers with time domain data and a subtraction from the input. It is parallelized on samples and input channels.
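A minimal OpenCL C sketch of the core arithmetic in the weighted cross-power spectrum described above for clMDF_weighted_spectral_mul_conj; the weighting and the accumulation over channels and tail frames performed by the real kernel are omitted, and the buffer names are illustrative.

    /* Sketch only: per-bin conj(X) * E, one work-item per frequency bin. */
    __kernel void cross_power_conj(__global const float *X,  /* echo tail, interleaved re/im */
                                   __global const float *E,  /* error, interleaved re/im */
                                   __global float *S)        /* result, interleaved re/im */
    {
        int i = get_global_id(0);
        float a = X[2*i], b = X[2*i + 1];
        float c = E[2*i], d = E[2*i + 1];
        S[2*i]     =  a*c + b*d;              /* real part */
        S[2*i + 1] = -b*c + a*d;              /* imaginary part */
    }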

clMDF_error_signal_combined
A combined kernel of three inner products in addition to the calculation of the error signal; it also checks for saturation in the microphone signal. Since the calculation of the error signal has a high degree of data dependency, it is performed on the CPU, with only some initial preparation of the data being performed on the device to minimize memory transfer. The operations that remain on the device are parallelized on samples and input channels.

clMDF_power_spectrum_inner_product
This kernel combines an inner product with two computations of the power spectrum, for the echo and for the filter response. If we have a complex number X_i = a + bi from a signal transformed into the frequency domain, the power spectrum is calculated as P_i = a² + b². These results are then accumulated across input channels into a single frame. The function is parallelized on samples only; for the power spectrum, each work-item then loops through the input channels to avoid race conditions with the additions.

clMDF_set_array_to_float_value
A simple utility function that sets the specified number of values (with an optional offset) in a float array to a specified value. It is only used when the filter detects a large problem and has to reset itself, which does not happen regularly. It was a useful tool during development, but was eventually combined with other operations in most cases to avoid additional overhead for small data sizes. The function is parallelized on array elements.

clMDF_smooth_far_end_compute_filtered_spectra
This kernel combines three tasks: smoothing the far-end energy estimate over time, computing the filtered spectra and computing some cross-correlations. Smoothing the far-end energy estimate involves multiplication with constants and the addition of the power spectrum of the echo tail. Computing the filtered spectra and the two cross-correlations is done in the same operation, as there are data dependencies between the two. The cross-correlations must be reduced down to single values, and this is done using shared memory in a similar manner to the inner products. All the operations are parallelized by samples and are only run on a single frame.

clMDF_learning_rate_calc
This kernel includes some of the calculations done when computing the new learning rate for the filter. The operations that depend on device buffers are done on the device itself to avoid overhead. The result depends on whether the filter has adapted or not. The function is parallelized on samples, and is only run for a single frame.

clMDF_adapted_copy
A simple kernel that stores the difference between the raw input and output in a buffer for use in the next iteration, but only if the filter has adapted. It is separated into its own kernel because it depends on results computed on the host, since some control logic is not transferred to the device. It is parallelized by samples.
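The power spectrum accumulation described above for clMDF_power_spectrum_inner_product can be sketched as follows; one work-item owns a frequency bin and loops over the channels to avoid races on the accumulation. The buffer layout and names are assumptions made for the example.

    /* Sketch only: accumulate P_i = a^2 + b^2 over all input channels. */
    __kernel void power_spectrum_accum(__global const float *Y,  /* channels * bins complex numbers, interleaved */
                                       __global float *power,    /* one frame of accumulated power values */
                                       const int bins,
                                       const int channels)
    {
        int i = get_global_id(0);              /* one work-item per frequency bin */
        if (i >= bins)
            return;
        float acc = 0.0f;
        for (int c = 0; c < channels; c++) {
            float a = Y[2 * (c * bins + i)];
            float b = Y[2 * (c * bins + i) + 1];
            acc += a*a + b*b;
        }
        power[i] = acc;
    }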

4.6.2 FFT Kernel Functions

clFFT_inner_prod_combined
Combines an inner product calculation with a forward FFT transform. More details can be found in the descriptions of clMDF_inner_product_reduce and clFFT_custom. This is a special-case combined kernel that calculates the inner product of the main frequency domain buffer with itself after performing the transformation from the main time domain buffer. This FFT and inner product are only run with a batch size of 1, since they are just run once for each filtered frame.

clFFT_custom
This is the general-purpose FFT/IFFT kernel based on the code output by the Apple OpenCL FFT library. It combines the computational kernels themselves with preparation and finalization steps (see Section 4.4) to convert into, or out of, the Speex/FFTW data format. A separate set of temporary input/output buffers is maintained while the application is running, used as scratch storage for the FFT. This also allows some flexibility, in that it is possible to set an offset into the data to be transformed. In our case study, we used hard-coded FFT kernels for the sizes of 256 and 1024 elements (allowing frame sizes of 128 and 512), and these use 64 and 128 work-items per work-group respectively. Some local memory is used for each work-group, as well as registers to store temporary results. At these small sizes (below 2048), the FFT itself can be computed on the latest GPU hardware without using any global memory for temporary storage, by doing the whole transform in a single work-group. The transformation can also be done in batches where data is stored in consecutive blocks that are the same size as the FFT. This greatly benefits performance over running a single FFT several times, as it better utilizes massively parallel hardware by adding more work-groups; for instance, data from several input channels can be processed in a single batch.
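Since each transform is handled by one work-group, batching simply scales the global work size. The host-side sketch below illustrates this; the kernel handle, argument order and helper name are assumptions for the example, not the actual library interface.

    #include <CL/cl.h>

    /* Sketch only: enqueue a batched FFT where each work-group transforms one
     * consecutive block of data.  work_items_per_fft is e.g. 64 for N=256 or
     * 128 for N=1024, matching the hard-coded kernels described above. */
    static cl_int enqueue_fft_batch(cl_command_queue queue, cl_kernel fft_kernel,
                                    cl_mem in_buf, cl_mem out_buf,
                                    size_t work_items_per_fft, size_t batch)
    {
        size_t local_size  = work_items_per_fft;         /* one work-group per transform */
        size_t global_size = work_items_per_fft * batch; /* batch work-groups in total   */

        clSetKernelArg(fft_kernel, 0, sizeof(cl_mem), &in_buf);
        clSetKernelArg(fft_kernel, 1, sizeof(cl_mem), &out_buf);
        return clEnqueueNDRangeKernel(queue, fft_kernel, 1, NULL,
                                      &global_size, &local_size, 0, NULL, NULL);
    }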

clFFT_update_weights
This function combines and parallelizes the process of updating the weights for the complete echo tail, in contrast to a single frame with AUMDF (see Section 4.9.1). As the FFTs are run in a single work-group for each frame they handle, each isolated frame is handled by one work-group. First, an inverse FFT is performed to transform the weights into the time domain; next, the first half of the weights is set to 0; finally, the weights are transformed back into the frequency domain. The FFT core functionality is the same as in the clFFT_custom kernel repeated twice, only with some additional offsets for batch processing. The kernel is parallelized by samples (the thread to sample ratio varies with the frame/FFT size), echo tail frames, output channels and input channels. This makes the kernel highly parallel, and potentially the function with the most scope for speedup compared with a single-threaded CPU implementation. Because of problems experienced when running this kernel on ATI GPU hardware (see Section 5.1 for hardware specifications), the kernel had to be split in two due to a lack of registers/memory available for each thread. The IFFT is then performed in the first kernel, along with the setting of values, and the FFT is performed in a second kernel. This had a negative impact on performance, but the original unified kernel is easily enabled on platforms with sufficient hardware support to run it.

clFFT_convert_error_to_frequency_domain
This kernel combines converting both the error and the filter response to the frequency domain, as well as setting the first half of the values in the filter response to zero. This is done by using the same FFT core as in clFFT_custom, but letting some of the work-groups handle each transform. In the majority of cases, adding additional work-groups had little impact on performance for GPUs, since several computational units were typically idling. The kernel is parallelized on samples in the same manner as the other FFT kernels, but is also parallelized on input channels, running them in a batch.

4.7 Executing on Different Platforms

As briefly mentioned earlier, the platforms we focused on for OpenCL were the combined CPU and GPU implementation from AMD (ATI Stream) and the GPU implementation from NVIDIA. In principle, the same kernel functions should be able to execute on all the platforms with correct results. However, the devices can still impose limitations on work-group sizes, memory sizes and so on. The CPU implementation is the least restricted in this regard, but at the same time it has the least scope for parallelism. Large parts of the program need to be serialized for execution, even though it is run on multiple CPU cores, because a CPU has far less inherent parallelism than GPUs, even with the availability of vector instructions. This can both expose errors that only occur when the code is executed serially instead of in parallel, and hide errors that only occur in parallel. Testing the code frequently on both CPU and GPU was therefore important.

What was found to be the largest problem between the implementations was error reporting and recovery from errors. The kernel functions were developed mainly on NVIDIA GPUs, and new faults in the code were found when it was moved to the ATI platform. Some of these were obvious errors, such as array indices out of bounds, which did not show up as fatal faults on the NVIDIA implementation. This indicates less rigid memory protection in the NVIDIA implementation on the particular platform used. Errors related to barriers and synchronization were also found; in some instances, branches that included barriers could completely lock up a host system with an ATI GPU, leading to complete deadlocks of the GPU, including its screen output. In some cases additional synchronization was also required to get correct results. Ultimately, most of these flaws were rooted in human error, but additional work was nonetheless needed to make the code work on all platforms. Support for error reporting was also fairly poor in some cases; this can partly be attributed to the fact that OpenCL is a recent technology and most of the developer kits available were first or second generation production releases, rather than to a deficiency of the standard.

Initially, when we started the case study, the ATI Stream SDK was only available in version 2.0, which supported OpenCL 1.0 but lacked a number of features. Fortunately, a new version (2.1) was released before benchmarking was started. The new release includes many new features that make the platform roughly comparable with the NVIDIA CUDA SDK. These included byte addressable memory (to be able to use datatypes smaller than 4 bytes without bit-shifting), support for 2D and 3D textures, double precision support, support for AMD-specific vector operations and OpenGL interoperability. Recently, vendors have also started cooperating on a common standard, the installable client driver (ICD2), for switching between OpenCL implementations on a single system. For instance, this allows the AMD CPU implementation to be installed side-by-side with the NVIDIA GPU implementation. However, having GPUs from different vendors installed in the same system can still be problematic depending on the operating system; additionally, the display drivers are often not made to co-exist with those of other vendors.

4.8 Preliminary Performance Findings and Tun- ing

In this section, some preliminary performance findings that were used to optimize the code during development are discussed. Detailed benchmarking and discussion can be found in Chapter 5, while optimization of the FFT has already been discussed in Section 4.4. Most of the optimizations have already been mentioned. Especially when using small echo tails, the overhead incurred simply by executing on a GPU was significant. Initially, most loops and sub-routines had their own kernels, but this strategy quickly incurred too much overhead from repeated kernel calls. So, in addition to basic optimizations of memory access patterns for the GPU, the most important optimization was to subsume kernel calls into larger kernels and to avoid calling kernels inside loops. As not all kernels were run with the same ND range/grid size, certain work-groups (blocks) tended to branch off from execution and end earlier than others.

2 OpenCL extension registry entry for cl_khr_icd (version 7, dated March 2, 2010): http://www.khronos.org/registry/cl/extensions/khr/cl_khr_icd.txt


While not optimal for very large data sets, this can save significant time when overhead is the dominant consumer of time. In some kernels, it was also possible to simply give parallel work-groups their own task (of approximately the same length), which resulted in effective task parallelization through the masking of execution time.

Some kernels were also found to be very poor candidates for parallelization, but these often operate on a fairly small amount of data, which gives an acceptable running time given the overhead that is always incurred on kernel launches. These were mostly loops with data dependencies, for instance where the array value in the current iteration depends on the previous iteration. One example from the original Speex echo cancellation source can be found in Listing 4.1 (fixed-point code in this snippet is removed, and some types are resolved to their primitive types for clarity). Writing values to the out array itself can be parallelized, but out[i] depends on vout, which depends on mem[0], which depends on mem[0] from the previous iteration. In this particular instance, the function is always executed on the CPU, since it is the very first function in the filter, which means the calculation can be done on the CPU before the data is copied to GPU memory. Similar cases exist where doing so would incur a lot of extra overhead, so they must still be performed inefficiently on the GPU since all the data is located there anyway.

    static inline void filter_dc_notch16(const int *in, float radius, float *out,
                                         int len, float *mem, int stride)
    {
       int i;
       float den2;
       den2 = radius*radius + .7*(1-radius)*(1-radius);
       for (i=0;i<len;i++)
       {
          float vin  = in[i*stride];
          float vout = mem[0] + vin;
          mem[0] = mem[1] + 2*(-vin + radius*vout);
          mem[1] = vin - den2*vout;
          out[i] = radius*vout;
       }
    }

Listing 4.1: The DC notch filter (filter_dc_notch16) from the original Speex echo canceller source.

4.9 Increasing the Computational Load

Since overhead is such a large consideration with GPU implementations, some options were evaluated for increasing computation load and thereby decreasing the effect of implementation overhead.


4.9.1 Using MDF Instead of AUMDF

As described in Section 3.6.2, the AUMDF [24] algorithm originally used in Speex only updates a single weight in the echo tail for each input frame. This updating operation includes some copy and subtraction operations, but, more importantly, both an inverse and a forward FFT. On a CPU, the optimization of only performing this for a single weight saves significant iterative work for a small decrease in the accuracy of the filter, but on the GPU these operations can easily be parallelized for the complete tail, so they do not incur as large a performance penalty. The filter was therefore modified so that it can easily operate as a traditional MDF filter, and this was done for both the CPU and GPU versions.

4.9.2 Longer Echo Tails

As described in Section 3.6.1, the echo cancelling filter stores an echo tail of earlier frames for comparison against the new frame in order to detect echoes. The length of this tail determines the maximum length of the echo that can be cancelled. A long echo tail makes it possible to cancel echo in larger rooms with longer reverberation times and a long distance between microphones and speakers, and to compensate for delays in the audio and computer equipment. However, a longer echo tail than needed does have an impact on the adaptation rate to new conditions.

The echo tail length can have a significant impact on performance wherever the filter needs to loop through the whole tail. This happens when new data is added to the echo tail, when computing the filter itself, adjusting the adaptation rate, computing the weight gradient, computing the response difference and when copying between background and foreground filters. All these operations take longer as the length of the echo tail increases. In addition, updating the weights in the echo tail themselves takes longer when using a plain MDF algorithm instead of AUMDF.

The default echo tail length in the Speex echo canceller test program is 8 frames of 128 samples, or 1024 samples. At an 8000 Hz sampling rate, this corresponds to 128 ms. We experimented with setting this tail to up to 256 frames (approx. 4 seconds) to increase the computation time. Another argument for testing longer echo tails is that when using higher sampling rates, longer tails, in terms of number of samples, need to be used to compensate for each sample representing a smaller time period. Speex has an ultra-wideband codec with a default 32 kHz sampling rate, which would require four times longer echo tails to cover the same time period (be usable in the same room). Higher sample rates are often used for video conferencing and higher-end phone conferences. The Mumble VoIP application uses an echo tail of 4800 samples (10 frames) with a sample rate of 48 kHz by default.


4.9.3 Longer Input Frames

The frame size determines how many samples from the input data are loaded for each run of the filter. This is closely tied to the length of the FFT used. For file encoding on PCs, longer frames (within reasonable limits) lead to shorter processing time for the whole file. This could potentially increase performance on GPU, since smaller 1D FFTs can not fully utilize the resources on modern GPUs. There are three main problems with increasing this size:

• Using a longer FFT is plausible on desktop CPUs and GPUs, but might not scale well on embedded hardware. An embedded GPU typically has fewer computational cores than desktop GPUs. Embedded platforms might also have fixed-size specialized hardware for doing FFTs, or a CPU that can only calculate a limited-size FFT in reasonable time in software.

• The most common use case for the Speex codec is VoIP, operating over a slow network such as a limited Internet connection. In such conditions, smaller frame sizes are desired to reduce the effect of network latency and to increase fault-tolerance if frames are lost.

• Longer frames lead to a more coarse-grained echo filter that adapts more slowly, since frames are fewer and farther between.

The test program uses a frame size of 128 samples by default, which leads to an FFT size of 256 samples (two frames are always combined). Because of the design of the FFT implementation used, power-of-two frame sizes (half the FFT size) are preferred. The default frame size for the Mumble VoIP application is 480 samples in combination with a 48 kHz sampling rate.

A simple benchmark was conducted with different frame sizes (above 64 samples) using the reference implementation from Speex; the results can be found in Figure 4.3. The test was conducted on an Intel quad-core CPU (see Section 5.1 for detailed specifications of machine A). AUMDF was enabled to make the effect of the echo tail length as small as possible. An initial tail length of 20 480 samples was used, rounded up when the tail was not a multiple of the frame size. Only single input and output channels were used, and the measurements are the average over 1024 frames.

As is evident from the graph, the default frame size of 128 samples is not optimal for this CPU, but few performance gains are found for frame sizes above 768. GPUs may scale up to even larger frame sizes before the performance per sample stagnates.


[Plot: average time to filter one sample (seconds, scale 10⁻⁶) against frame size (0-2048 samples).]

Figure 4.3: Scalability of MDF filter with increasing frame size.

Another consideration must also be taken into account if the filter is to be used in a real-time system with large frame sizes. The frame size widely used on Ethernet networks in general is 1518 bytes [34]. Larger frames (9 kB) are supported [35], but are not in widespread use other than on very high-bandwidth local networks (gigabit Ethernet and above), where they can increase the maximum throughput. Because a speech codec is not intended to be a high-bandwidth application, and because of the continuity problems that can occur when using such large frames for audio transfer, we chose not to focus on frame sizes larger than the 1518 byte standard.

Each frame in Speex consists of a group of samples stored as the short integer datatype in C. This means that each sample is stored using 16 bits, or 2 bytes. The largest multiple of 64 samples that could ideally fit into a 1518 byte packet is 704 samples. At this point, the CPU performance has already plateaued. To keep the implementation simple, a frame size of 512 samples, which leads to an FFT size of 1024, was further experimented with on the GPU in addition to the default 128.

4.9.4 Multiple Input/Output Channels

Professional deployments of voice communication systems often include several microphones to better remove negative artifacts in the sound from the room, and to cover all the participants. Even some newer mobile devices are constructed with multiple microphones to better reduce the effect of noise. Several speakers are also common, first and foremost for stereo sound, but surround sound may also become commonplace in the future. For the input data, the number of microphones multiplies the amount of data (the size of the buffer); for the output data, the number of speakers does. Depending on the mix of functions used, this can also multiply the amount of work that needs to be done.


To investigate the extent to which the number of channels impacts performance, a simple benchmark was conducted. The execution time for one frame to be processed by the echo filter (averaged over 1024 frames) was measured. The default frame size (128 samples) and tail length (1024 samples) were used, and AUMDF was disabled. The same PC as in Section 4.9.3 was used.

[Plot: time to filter one 128-sample frame (seconds, scale 10⁻³) against the number of channels (1-16), with separate curves for input (microphones) and output (speakers).]

Figure 4.4: Scalability of the MDF filter with increasing input or output channels. Shows the effect of increasing either the number of input channels or the number of output channels.

First, the number of input and output channels was increased separately, as seen in Figure 4.4. This resulted in fairly linear scaling both for the number of input channels and for the number of output channels, with input channels always having the larger impact. In Figure 4.5, the number of input and output channels was set to the same value, and this shows a superlinear growth in computation time.

At first, the echo canceller was implemented with only mono (single-channel) capabilities on the GPU for simplicity, but support for multiple input and output channels was later implemented to increase the computational load.


[Plot: time to filter one 128-sample frame (seconds) against the number of channels (1-16), with the number of input and output channels increased together.]

Figure 4.5: Scalability of the MDF filter with increasing input and output channels. The number of input and output channels are set to the same value.

4.9.5 Processing Several File Sections in Parallel

In the application used for testing the case study, a file with raw sample data is taken as input frame by frame, and processed into raw sample data written to another file. In theory, it would be possible to look at different sections of the file in parallel to better utilize the GPU. This would mean that the filter would need to re-adapt for each starting point, or some assumptions would need to be made after the first section(s). The total time for processing the file would be much lower, but some loss in audio quality would result.

In this thesis, this possibility is precluded because it should be possible to use the filter in a real-time system, which is the most common use for Speex. The file encoding is only used to simulate an incoming stream of data from a network, and not to encode files as such. If a fast file encoder were desired, this could be an option to examine.


4.10 Implementing Co-processing

The reason for investigating the use of OpenCL in this case study, apart from portability between GPUs, was that it facilitates co-processing in a much more seamless manner than earlier GPGPU frameworks. A simple co-processing scheme was implemented, while other possibilities are left as further work.

4.10.1 OpenCL on CPU

Earlier GPGPU languages and platforms have often focused purely on execution on the GPU, or at best had a simplified emulator layer that could execute the GPU code on a CPU. With OpenCL, CPU implementations have been created that fully utilize modern multi-core processors and use vector instructions (such as SSE) where appropriate. The CPU can be used as just another computing device by the OpenCL implementation, often with relaxed limitations (with regard to the maximum number of threads and local memory) compared to GPUs. Because of the fewer hardware restrictions in their architecture, CPUs also tend to be more useful as a debugging tool. AMD released such an implementation as part of their Stream SDK. NVIDIA has no CPU implementation yet, but the two implementations can be used side-by-side. Writing programs for stream processing, such as on GPUs, is rather different from traditional CPU programming, so having the possibility of running them on a CPU can be of major benefit when debugging. The OpenCL kernel programs are inherently multi-threaded, so they have the potential to be significantly faster than a single-threaded CPU implementation for certain applications.

For co-processing, this sort of portability is ideal, as very little specific code has to be written for the CPU version after the GPU version is created, and computing can be interleaved between devices on a kernel-function basis. But as we find in Chapter 5, in our case the performance of the OpenCL version on the CPU is significantly worse than the existing library. CPU usage when running the GPU implementation is also currently fairly high. These performance problems hampered further work on the subject in the case study. A simple load-balancer that can be used for co-processing was implemented and is described in the next section, but in most cases it does not free up much CPU time in practice.

4.10.2 A Simple Load-balancer

As mentioned in Section 2.3, dynamic load balancing between CPUs and GPUs at run-time is not a well-explored subject. One of the reasons for this is that the tools for doing so have until recently been lacking. With OpenCL, it is possible

to actually execute the same program using both types of processors for the first time. However, the toolchains for utilizing all the available computing power in a system have not yet matured enough for this to be easily implemented.

One of the most significant problems with using the GPU to offload common tasks from different applications in the system is the lack of support for running multiple applications (besides graphical rendering) on the same GPU at the same time. Traditionally, the GPU has been completely occupied by a game or other real-time 3D visualization, and perhaps a few general computations running shader programs or GPGPU programs. These applications are typically modal, in that they occupy all of the user's attention at a point in time. But this is not a common use case for sound processing and many other general computational tasks.

At the time of writing, it was possible for several processes to send work to the GPU, but there is no real mechanism for these to effectively co-exist with regard to the performance each would require. If other applications are using the GPU for GPGPU tasks or perhaps rendering, there is no way to assess how much capacity (both processing power and memory) is available on the GPU at a given point in time, as is possible with CPUs in most modern systems. The ability to run several kernel functions at once was made possible by NVIDIA with the release of their Fermi architecture [9] (also see Section 4.10.3). This should be the first step towards more advanced multi-tasking capabilities on GPUs, but this technology was not available when we started developing our implementation, and there was still no practical way to determine overall load levels on a GPU.

A true load-balancing system would consider the total load level on both the CPU and the GPU to be able to realistically select the appropriate device for execution. Since it is only feasible to have control over the load level on the CPU, a simple implementation was created within this limitation. In some applications, one can assume that the GPU is available all the time, but that would not be the case in future systems where more and more applications offload calculations to the GPU, or for instance with graphics-intensive 3D games.

The load balancing system implemented is outlined in Figure 4.6. In our implementation, it switches between the reference implementation from Speex on the CPU and the OpenCL version on the GPU. This was done because of the limited performance of the OpenCL implementation on the CPU (see benchmarks in Section 5.5), but it could just as well have been used between the CPU and GPU OpenCL versions. In that case, it could benefit greatly from using mapped device memory between the host and device when executing on the CPU.

The natural granularity at which to perform load-balancing in the test application using the echo cancellation filter was the processing of a single frame, which is the unit of the main loop. The loop was outfitted with some extra logic that checks

the system CPU load every second and evaluates whether the load is high enough (the threshold would be determined by the requirements of the system/application) to switch to GPU execution. If the CPU implementation manages to utilize all the CPU cores (the current reference does not fully occupy a multi-core CPU), another method of determining whether to switch to GPU execution might be needed, for instance the number of other applications currently running, or an explicit interface that can be called by other CPU-demanding applications. The decision might also be influenced by power demands, so that execution happens in the least power-consuming fashion possible given the performance requirements of the application; this is especially relevant for embedded systems. When the value falls under a certain threshold, CPU execution can be resumed.

[Flow chart: the main echo canceller loop checks the system CPU load at most once a second, decides whether to switch device, synchronizes state/buffers if needed, and then processes the frame on either the CPU or the GPU before reading the file input and writing the file output.]

Figure 4.6: Simple CPU/GPU load balancer program flow for echo cancellation, as implemented.

A small function was made to read the system CPU load from the operating system3. This was done on a GNU/Linux system and the functionality is operating system specific, so a new version would have to be created for use on other operating systems (it should be easily adaptable to other UNIX-based systems).

3 CPU load function based on the following script (retrieved 10 July 2010): http://colby.id.au/node/39

It opens the system file /proc/stat, which contains several statistics about resource utilization. The lines of concern here are those beginning with "cpu". On a multi-core system there can be several such lines, starting with "cpu0", "cpu1", "cpu2" and so on, but the line of most interest is the one that starts with just "cpu", which is a summary of all the following ones. It is the only line needed, unless a more detailed analysis based on individual cores is required. Following "cpu" are integers that represent the amount of time (in USER_HZ units, typically hundredths of a second) that the system has been in (since boot)4: user mode, user mode with low priority (nice), system mode, I/O wait, IRQ (hardirq) and softirq (lower priority than hardirq). By summing these numbers to get the total utilization and comparing with the previously obtained values, a single CPU load percentage can be derived. There is no point in querying this file several times a second, so it is only polled by the main loop when at least a second has passed since the last check.
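A minimal C sketch of such a load reader is given below, under the assumption that the busy and total counters from the previous poll are kept between calls; the function name and the exact set of summed fields are illustrative, not the actual implementation.

    #include <stdio.h>

    /* Sketch only: derive a CPU load percentage from the summary "cpu" line of
     * /proc/stat on Linux.  The first call returns the load since boot; later
     * calls return the load since the previous call. */
    static double cpu_load_percent(unsigned long long *prev_busy,
                                   unsigned long long *prev_total)
    {
        unsigned long long user = 0, nice = 0, sys = 0, idle = 0,
                           iowait = 0, irq = 0, softirq = 0;
        FILE *f = fopen("/proc/stat", "r");
        if (!f)
            return -1.0;
        fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
               &user, &nice, &sys, &idle, &iowait, &irq, &softirq);
        fclose(f);

        unsigned long long busy  = user + nice + sys + irq + softirq;
        unsigned long long total = busy + idle + iowait;
        if (total == *prev_total)
            return 0.0;                              /* no time has passed */
        double load = 100.0 * (double)(busy - *prev_busy)
                            / (double)(total - *prev_total);
        *prev_busy  = busy;
        *prev_total = total;
        return load;
    }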

When it has been decided to switch device, memory needs to be synchronized in order to be able to continue where the other device left off. This is a time-consuming operation, so there should be a conservative threshold for switching between devices, requiring a high or low CPU load to be observed over time, and switching at all during execution could be problematic in a real-time application. Some inertia is already built into the system by polling /proc/stat only once a second, but this should be combined with an additional heuristic in a production environment. It must be noted that the synchronization time is largely a hardware/GPGPU platform implementation limitation (see latency measurements in Section 5.2) and will improve over time.
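To make the frame-granularity scheme outlined in Figure 4.6 concrete, the C sketch below shows one possible shape of the main loop. The helper functions, the 80% threshold and the once-a-second polling interval are all assumptions made for the example, not the actual implementation.

    #include <time.h>

    /* Hypothetical helpers assumed to exist elsewhere in the test application. */
    enum device { ON_CPU, ON_GPU };
    int    read_input_frame(void);
    void   write_output_frame(void);
    void   process_frame_cpu(void);
    void   process_frame_gpu(void);
    double system_cpu_load(void);                       /* e.g. from /proc/stat, see above */
    void   synchronize_state(enum device from, enum device to);

    void echo_cancel_loop(void)
    {
        enum device current = ON_CPU;
        time_t last_check = 0;

        while (read_input_frame()) {
            time_t now = time(NULL);
            if (now - last_check >= 1) {                /* poll the load at most once a second */
                last_check = now;
                enum device wanted = (system_cpu_load() > 80.0) ? ON_GPU : ON_CPU;
                if (wanted != current) {
                    synchronize_state(current, wanted); /* copy filter state/buffers across */
                    current = wanted;
                }
            }
            if (current == ON_CPU)
                process_frame_cpu();
            else
                process_frame_gpu();
            write_output_frame();
        }
    }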

4.10.3 Task Parallelism

A natural extension to running sound processing on the GPU would be massively parallel sound processing that can better utilize the GPU. Since the amount of data processed in this case study is not enough to saturate all the cores of the GPU all the time, much of its capacity is left unused. A major use case for such functionality would be in conjunction with the Asterisk PBX software (which can already use the Speex codec). Asterisk can be used as the hub of all telephony services in an organization, everything from regular calls and conference calls to automated voice services and voicemail. If it processes a number of calls from "dumb" handsets, for instance simpler mobile phones with limited processing power, it could run the filtering for all the calls at once.

Another (while not so common) use case would be for encoding several audio files in parallel. Two main approaches could be used to achieve this with OpenCL.

4 Source for information about /proc/stat is the RHEL 4 reference guide (retrieved 10 July 2010): http://www.redhat.com/docs/manuals/enterprise/RHEL-4-Manual/en-US/Reference_Guide/s2-proc-stat.html


1. Rewriting all the kernel functions to accept several audio streams as input/output. This would mean that all the streams would have to be synchronized when being run through the echo filter, as well as large modifications to the existing code base. It could also be problematic for real-time systems such as those used for VoIP, since all the audio streams would have to be synchronized, and switching the number of simultaneous streams while the program is running would be a challenge.

2. Running the audio streams independently of each other, in their own thread/process on the CPU. This would require a solution for "multi-tasking" on the GPU, which is not supported, or is still in its infancy, at the time of writing. Although each kernel function is instantiated as many threads, running several kernels at once has not traditionally been possible. As already mentioned, NVIDIA is introducing concurrent kernel execution in their Fermi architecture devices [9] (see Figure 4.7). It requires new types of barriers that do not affect all threads on the GPU, so that kernels have explicit control of their own execution. OpenCL organizes all kernel calls in a queue, and it would be possible for the implementation to execute several kernels in parallel between barriers, but no such behaviour is supported in current implementations. The main method for task parallelism using OpenCL is to use multiple command queues (see the sketch below), but the hardware and software support for this is limited at the time of writing. Compute kernels can then be queued on command queues completely independently of each other, and the queues can reside on different threads on the host. This approach to task parallelism is not a part of our implementation, but can be added at a later point in time on platforms with proper hardware support, without major restructuring.
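A minimal sketch of the multiple-command-queue approach mentioned in item 2; whether the two kernels actually overlap depends on hardware and driver support (for instance concurrent kernels on Fermi), and the function and variable names are illustrative.

    #include <CL/cl.h>

    /* Sketch only: two independent in-order command queues on the same device,
     * one per audio stream.  Kernels are enqueued independently and may run
     * concurrently if the platform supports it. */
    void run_two_streams(cl_context ctx, cl_device_id dev,
                         cl_kernel kernel_a, cl_kernel kernel_b,
                         size_t global_size)
    {
        cl_int err;
        cl_command_queue q0 = clCreateCommandQueue(ctx, dev, 0, &err);
        cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);

        clEnqueueNDRangeKernel(q0, kernel_a, 1, NULL, &global_size, NULL, 0, NULL, NULL);
        clEnqueueNDRangeKernel(q1, kernel_b, 1, NULL, &global_size, NULL, 0, NULL, NULL);

        clFinish(q0);                      /* wait for both streams to complete */
        clFinish(q1);
        clReleaseCommandQueue(q0);
        clReleaseCommandQueue(q1);
    }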


Figure 4.7: Concurrent vs. serial kernel execution. Designed after illustration in [9].

As the OpenCL implementations were not mature enough to implement the second approach when we started our implementation, and the first alternative requires a large amount of additional work, task parallelism is considered future work.

Chapter 5

Benchmarking and Results

In this chapter, the performance of the resulting application from the case study is investigated, along with some related metrics. Section 5.1 describes the hardware and software of the systems used for benchmarking in later sections. Section 5.2 looks at the latency of certain operations on several GPGPU implementations and its impact on the performance of the case study. Section 5.3 contains some performance testing of the FFT used, compared against a well-established GPU FFT library (CUFFT). Section 5.4 describes the testing input used and contains some notes on accuracy.

Benchmarking of the case study itself starts in Section 5.5 with testing the impact of echo tail length on performance, and continues in Section 5.6 where scalability with multiple channels is tested. Section 5.7 contains an analysis of which kernel functions dominate running times. Section 5.8 looks at the CPU load and considers how this affects load balancing the application. Lastly, Section 5.9 summarizes the findings and discusses the performance issues faced.

5.1 System Specifications

Two systems were used for benchmarking in order to gauge the cross-platform performance and portability of OpenCL, since the ability to execute on GPUs from different vendors and on multi-core CPUs is its big advantage over earlier frameworks. The two machines were each configured with GPUs from one of the two major vendors at the time of writing. Unless specified otherwise, machine A was used for all CPU tests involving the reference implementation from Speex, while machine B was used for all OpenCL CPU testing. FFTW 3.2.1 was used as the FFT library when testing the reference implementation, as it was found to generally have better overall performance than "smallft", which is the built-in default FFT used by Speex.

Hardware
  CPU                           Intel Core 2 Quad Q9550
  CPU clock speed               2.83 GHz
  Memory size                   4 GB
  Graphics card #1              NVIDIA GeForce 280 GTX
  Graphics card #1 memory       1 GB RAM
  Graphics card #2              C2050
  Graphics card #2 memory       3 GB RAM
Software
  OS                            Ubuntu 9.10
  Kernel version                2.6.31-14
  Kernel CPU architecture       x86_64
  NVIDIA graphics driver ver.   256.35
  NVIDIA CUDA toolkit/SDK ver.  3.1
  GCC version                   4.4

Table 5.1: Specification for benchmarking machine A.

5.1.1 Machine A

This machine had an Intel quad-core CPU and was used for benchmarking the GPU-only OpenCL implementation from NVIDIA. It contained two different NVIDIA GPUs: one of the previous generation with CUDA compute capability 1.3 (GTX280), and one of the current generation with CUDA compute capability 2.0 [9] (C2050). Detailed specifications can be found in Table 5.1.

5.1.2 Machine B

This machine had an AMD quad-core CPU and was used for benchmarking the OpenCL implementation from AMD/ATI. The hardware is from the same generation as machine A, and has fairly similar end-user performance even though it is based on hardware from a different vendor. It contained a single ATI Radeon 5870 GPU, which was the fastest single-GPU graphics card from ATI at the time of writing. Detailed specifications can be found in Table 5.2.


Hardware
  CPU                                AMD Phenom II X4 965
  CPU clock speed                    3.40 GHz
  Memory size                        4 GB
  Graphics card #1                   ATI Radeon 5870
  Graphics card #1 memory            1 GB RAM
Software
  OS                                 Ubuntu 10.04
  Kernel version                     2.6.32-22
  Kernel CPU architecture            x86_64
  ATI Catalyst graphics driver ver.  10.6
  ATI Stream SDK ver.                2.1
  GCC version                        4.4

Table 5.2: Specification for benchmarking machine B.

5.2 Latency of GPGPU Implementations

As mentioned in Section 4.8, latency/overhead when calling OpenCL kernels became an issue while implementing the case study. To quantify this problem, a benchmark was created to test both OpenCL and the until now de facto GPGPU platform, NVIDIA C for CUDA. The test was run on three different GPUs (two NVIDIA, one ATI) and one CPU, where both the CUDA and OpenCL implementations were tested on the same NVIDIA GPUs. The tests were:

1. Launch of a trivial kernel. This is an empty kernel with no arguments, launched in only a single thread.

2. Launch using a more realistic kernel. This was based on an actual kernel function used in the case study, but with the body of the code removed. It has 14 arguments in the OpenCL version (12 in the CUDA version because of the way shared memory is allocated). These were six pointers to floating point buffers, one pointer to a buffer of short integers, one floating point number, four unsigned integers and two pointers to a local/shared memory buffer. The kernel was launched in 1024 threads.

3. Memory write of a single floating point value to a buffer in global memory.

4. Memory write of 1024 floating point values to a buffer in global memory.

5. Memory write of 8096 floating point values to a buffer in global memory.

6. Memory read of a single floating point value from a buffer in global memory.

7. Memory read of 1024 floating point values from a buffer in global memory.


8. Memory read of 8096 floating point values from a buffer in global memory.

The three different sizes of memory operations were included to show that the smaller operations are dominated by overhead and to get an idea of at what size the bandwidth has any effect. All tests were run for 10 000 iterations with a cudaThreadSynchronize() or clFinish() synchronization call at the end of each iteration. The average latency was calculated at the end of the program run, and an average over three program runs was used. The tests were run on the machines with the respective GPUs; the CPU test was run on machine B (the AMD-based CPU). The results are listed in Table 5.3 and shown in Figure 5.1.

                     C2050    GTX280   C2050     GTX280    ATI 5870  CPU
                     (CUDA)   (CUDA)   (OpenCL)  (OpenCL)  (OpenCL)  (OpenCL)
  Simple kernel      7µs      10µs     16µs      65µs      31µs      17µs
  Realistic kernel   19µs     23µs     24µs      75µs      127µs     63µs
  Mem write 4B       5µs      8µs      38µs      44µs      298µs     15µs
  Mem write 4KB      6µs      10µs     38µs      51µs      287µs     18µs
  Mem write 32KB     19µs     27µs     72µs      71µs      310µs     21µs
  Mem read 4B        7µs      9µs      43µs      46µs      53µs      19µs
  Mem read 4KB       8µs      11µs     39µs      55µs      73µs      19µs
  Mem read 32KB      23µs     25µs     75µs      70µs      208µs     21µs

Table 5.3: Latency of GPGPU platforms. Averages are rounded to the nearest microsecond.
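The measurement loop itself has the simple structure sketched below: enqueue the operation, synchronize, and average over many iterations. The function name, the use of gettimeofday() and the parameters are illustrative assumptions, not the actual benchmark code.

    #include <CL/cl.h>
    #include <sys/time.h>

    /* Sketch only: average the launch latency of a kernel by timing many
     * enqueue + clFinish() iterations and dividing by the iteration count. */
    double average_launch_latency(cl_command_queue queue, cl_kernel kernel,
                                  size_t global_size, int iterations)
    {
        struct timeval start, end;
        int i;

        gettimeofday(&start, NULL);
        for (i = 0; i < iterations; i++) {
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                   &global_size, NULL, 0, NULL, NULL);
            clFinish(queue);               /* synchronize after every iteration */
        }
        gettimeofday(&end, NULL);

        double elapsed = (end.tv_sec - start.tv_sec)
                       + (end.tv_usec - start.tv_usec) * 1e-6;
        return elapsed / iterations;       /* seconds per launch */
    }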

The data shows a large difference in latency even on current platforms, and this will certainly impact any application where latency is an issue, such as sound processing.

The older and more mature C for CUDA language has, in general, lower latency on NVIDIA hardware. This could be caused both by differences in the internal implementation and by the fact that C for CUDA is created specifically for the CUDA platform with more direct access, whereas OpenCL is another abstraction layer on top of this platform. Most operations display up to several times longer latency with OpenCL than with CUDA, but kernel launches on the GF100 (Fermi) architecture card show promise for better performance. Even more than on the CUDA platform, there is a marked difference in kernel launch latency between the GT200 architecture card (GTX280) and the GF100 (C2050) card. The GF100 architecture includes the possibility to launch several kernels in parallel (using C for CUDA), and it seems that the changes made to the architecture benefit latency in general.


[Bar chart: latency (microseconds, 0-350) of the eight operations (trivial kernel, realistic kernel, memory writes of 4 B/4 KB/32 KB and memory reads of 4 B/4 KB/32 KB) for C2050 (CUDA), GTX280 (CUDA), C2050 (OpenCL), GTX280 (OpenCL), Radeon 5870 (OpenCL) and CPU (OpenCL).]

Figure 5.1: Latency of GPGPU platforms. Two workstations were used for the benchmarking (see Section 5.1).

When looking at the ATI 5870 GPU with OpenCL, we see some interesting numbers. The simple kernel launch is positioned between the GT200 and GF100 cards from NVIDIA with an acceptable latency, but still substantially higher than with C for CUDA on the NVIDIA cards. The realistic kernel is almost twice as slow as the slowest NVIDIA card, which shows that scaling seems to be worse for a large number of threads and kernel arguments. Memory reads on the ATI GPU are in general slower than on the NVIDIA cards, with the largest transfer being more than twice as slow. But the latency for memory reads seems to be reasonable, as shown by the smallest (4B) read, which is only slightly slower. The most pertinent problem here is memory writes, which seem to have around 300µs latency, compared to

around 40µs on the NVIDIA cards for OpenCL. The performance might even out at larger sizes, but this can still be a large challenge depending on the application. These results show that even a single small memory write operation to transfer the input data into GPU memory might take longer than it would take the original CPU implementation in the case study to completely process the frame with computationally light parameters.

The ATI CPU implementation displays much lower latency in general than their GPU implementation. Kernel launches with few arguments are on par with the C2050 using OpenCL, but larger kernel launches with a large number of threads are slower and closer to the GTX280; this can possibly be attributed to the overhead of starting many work-items/threads on the CPU. Memory operations are faster than on all the other cards/implementations, except for the NVIDIA cards using CUDA for certain operations. Even though this makes the CPU fairly reasonable when it comes to latency compared to the GPUs, it is still a substantial amount of overhead for simply executing programs on the CPU. This means that executing every single operation on the CPU using kernel functions is not feasible if good performance is to be achieved; kernels should only be launched for larger operations that can benefit from more threads and vectorization, and where the overhead can be amortized.

As discussed earlier in Section 4.8, latency as measured here became a large problem in trying to get any performance gains on the GPU with the OpenCL implementation. Smaller operations should in general be grouped together into larger kernel functions, but the kernel functions must be split when global synchronization is needed, and developing large monolithic kernels can also become unmanageable. A platform with lower latency would benefit both performance and ease of development significantly. All the implementations used here lack maturity, and there are a number of variables that could impact the results: the version of the respective SDKs used, the display driver version, the operating system and so on. But the results do show that there is a large variation in latency between GPGPU platforms, and not all are yet mature enough to effectively do real-time sound processing. The benefit obtained from the cross-platform nature of OpenCL does seem to come at a cost in latency, and as these toolkits are proprietary, it is up to the vendors to improve the performance.

5.3 FFT Performance

The FFT is perhaps the single most computationally demanding component of the program. To verify that the FFT we were using in our implementation reached an acceptable performance level on GPUs, we performed a simplistic benchmark against the CUFFT library from NVIDIA. This library is implemented in CUDA as part of the CUDA SDK, so the best available NVIDIA GPU was used for benchmarking, namely the Tesla C2050 in machine A as listed in Section 5.1. We compared this against the kernel function from the Apple OpenCL FFT library that was used in the echo canceller implementation. Preparation and finalization code was commented out for the benchmark, as no corresponding functionality exists in the CUFFT library kernels. Although higher-performing FFT implementations exist for GPUs, CUFFT works well as a standard benchmark with good performance. It is also a production library that is flexible, well tested and with a standardized API based on the popular FFTW library.

The FFT sizes tested were the two used in benchmarking the echo filter, N=256 and N=1024, which correspond to input frame sizes of 128 and 512 for the filter. It is important to note that the OpenCL FFT kernel used in the N=1024 case was based on a modified series of radices compared to the original library version, so as to retain compatibility with ATI GPUs; some minor performance loss may have occurred because of this. The results were obtained by executing the FFT and synchronizing the GPU threads with cudaThreadSynchronize() or clFinish() after 1024 iterations with varying batch sizes, and calculating the average time for each. The maximum batch size in the echo cancellation filter is calculated using the following expression:

Nmax = M × C × K (5.1)

Here, M is the number of frames in the echo tail (usually 10-20), C is the number of input channels and K is the number of output channels (for instance, M = 20, C = 2 and K = 5 give Nmax = 200). The larger batch sizes are therefore mostly relevant for long echo tails combined with a large system with many input and output channels.

Batch size      1    32   64   96   128  160  192  224  256
CUFFT N=256     15   15   16   17   17   19   19   21   22
CUFFT N=1024    16   20   25   31   37   44   51   58   62
OpenCL N=256    22   22   23   25   27   29   30   33   35
OpenCL N=1024   25   29   36   41   49   54   59   64   70

Table 5.4: Average execution time of FFT libraries on GPUs (all values are in microseconds per complete batch).

As we can see in Figure 5.2 and Table 5.4, the OpenCL FFTs generally perform slower than their CUFFT counterparts. We also see a constant performance gap both for N=256 and N=1024 as the batch size is increased, but the performance is comparable and scales in a similar manner, which is what we wanted to investigate. For N=256, the batch size has little effect on the execution time, which shows that excess processing capacity is still available at this size. But for N=1024 the execution time increases significantly and fairly linearly with the batch size.
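The timing methodology described above can be sketched as follows. This is a minimal, hypothetical OpenCL-side sketch: the command queue is assumed to be set up elsewhere, and enqueue_fft_batch() merely stands in for the library call that enqueues one batched transform (it is not the Apple FFT library's real API).

#include <CL/cl.h>
#include <sys/time.h>

#define ITERATIONS 1024

/* Hypothetical wrapper around the FFT library's kernel launch for one batch. */
void enqueue_fft_batch(cl_command_queue queue, int batch_size);

/* Returns the average time per complete batch, in microseconds. */
static double time_fft_batches(cl_command_queue queue, int batch_size)
{
    struct timeval start, stop;
    int i;

    gettimeofday(&start, NULL);
    for (i = 0; i < ITERATIONS; i++)
        enqueue_fft_batch(queue, batch_size);  /* non-blocking enqueue */
    clFinish(queue);                           /* wait for all batches to finish */
    gettimeofday(&stop, NULL);

    return ((stop.tv_sec - start.tv_sec) * 1e6 +
            (stop.tv_usec - start.tv_usec)) / ITERATIONS;
}

The CUFFT measurements follow the same pattern, with cudaThreadSynchronize() taking the place of clFinish().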

Some of the performance difference between the CUFFT and OpenCL implementations might be due to library optimizations, but it might also to some degree be attributed to the difference between the CUDA and OpenCL frameworks, especially on such small data sets. The latency of launching the kernel and the subsequent synchronization adds significant overhead on this time-scale. We can indeed see that the execution times for N=256 using both CUFFT and OpenCL are in line with the latency numbers found for launching an empty kernel in Section 5.2. The fact that the execution time does not increase greatly for this size means that it is essentially bound by latency for realistic batch sizes (it is of course usually combined with other calculations as well).


Figure 5.2: Performance of FFT libraries on GPUs (average execution time per batch in microseconds, plotted against batch size for CUFFT and the OpenCL FFT at N=256 and N=1024).

5.4 Test Data and Accuracy

To test the effectiveness of the filter in a realistic manner, a recording done with real equipment and carried out in a real environment is required. In order to effectively cancel out the echo using the filter, the input data must be recorded with high-quality audio hardware, mainly to ensure clock synchronization between input and output. Signals with synthetic echoes (for instance created with audio editor applications) will confuse the adaptive filter and severely reduce performance, because the filter is constructed to mimic a real environment with natural variations. For testing, a signal was recorded with a music signal in the background and a person speaking at the same time. The music clip was then used as the signal sent to the speakers (far-end).

The filter will then ideally remove all the music and only leave the person's voice in the output audio. In a real scenario, the music would be exchanged with another person talking.

To test that the filter gives fairly accurate results, a function was created that generates a list of scalar values representing how close the OpenCL implementation output is to the reference. The scalars are the output of a utility function (get_residual_error()) that calculates the root square difference between the array outputs of the existing speex_echo_get_residual() function in the reference implementation, called with the two echo states (OpenCL and reference). The speex_echo_get_residual() function outputs the spectrum of the estimated echo for the given echo state.
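A minimal sketch of such a check is shown below. It assumes a floating-point build of Speex, where the residual spectra can be handled as float arrays, and it takes the "root square difference" to be the Euclidean distance between the two spectra; the actual get_residual_error() used in our code is not reproduced here.

#include <math.h>
#include "speex/speex_echo.h"

/* Sketch: compare the estimated residual echo spectra of two echo states
 * (reference and OpenCL).  res_ref and res_ocl are caller-provided buffers
 * of at least n elements. */
static float get_residual_error(SpeexEchoState *st_ref, SpeexEchoState *st_ocl,
                                float *res_ref, float *res_ocl, int n)
{
    float sum = 0.0f;
    int i;

    speex_echo_get_residual(st_ref, res_ref, n);  /* spectrum of estimated echo */
    speex_echo_get_residual(st_ocl, res_ocl, n);

    for (i = 0; i < n; i++) {
        float d = res_ref[i] - res_ocl[i];
        sum += d * d;
    }
    return sqrtf(sum);  /* root square difference between the two spectra */
}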

5.5 Performance Impact of Echo Tail Length

The first benchmark run was used to test the impact of different echo tail lengths on execution time. The echo tail length impacts several of the more computationally demanding sections of the filter, first among which is the filter update itself. We are testing with AUMDF turned off here (see Section 4.9 for more background), which means that one FFT and one IFFT are added to the batch in the weight update kernel function for each frame the echo tail is increased by (for a single channel filter). Longer echo tails should benefit parallel architectures; a length of 10-20 frames is often a realistic maximum in practice (depending on the combination of sample rate and frame size), but we double that limit to emphasize the scalability. As mentioned earlier, the OpenCL version of the filter was made to support two frame sizes, 128 and 512 samples, and both are tested here for all the available OpenCL platforms, as well as the reference implementation on CPU. All results in this section are presented in plots; tables with the results can be found in Appendix A.

5.5.1 Frames of 128 Samples

The results of the benchmark with frames of 128 samples and echo tail lengths between 2 (256 samples) and 40 (5120 samples) can be found in Figure 5.3. It is immediately obvious that the reference implementation has significantly faster run times than the OpenCL filter on any of the available implementations and hardware. In some ways, the GPU results found here reflect the results in Section 5.2, as the overhead seems to completely dominate the execution time. On the shortest tails, the ATI Radeon 5870 GPU is the slowest, with execution times varying between 5000-6000 microseconds regardless of tail length, and the execution time does not appear to increase significantly for larger sizes, so it would seem to have no problem with the throughput. It should be noted that we found the running times of the kernel functions of this implementation to vary a lot compared to the other implementations, with a standard deviation of up to around 2000 microseconds, but this is also a consequence of the larger average values.


Figure 5.3: Average execution time of a single frame with increasing echo tail lengths (128 sample frames).

The CPU implementation of OpenCL in the ATI SDK starts out with about the same latency as the NVIDIA GPU implementations, but its run times increase linearly at a steep rate as the tail lengths are increased, and it takes longer than the ATI Radeon from tail lengths of around 26 frames and upwards. This indicates that our OpenCL implementation does not scale well with so few computational cores available; it has not been specifically optimized for this purpose. More information on OpenCL CPU performance hurdles in the current implementation can be found in Section 5.9.

The two NVIDIA GPUs have the shortest execution times of the OpenCL implementations, with the newer Tesla C2050 card showing the best GPU performance. The latency differences between these two cards found in Section 5.2 did not make much difference here, but this may among other things be caused by the mix of kernel calls and memory transfers, which combined reduce the difference in practice. As the tail lengths are scaled up, the architectural differences become visible: the Geforce GTX280 experiences a fairly linear increase in execution time, almost parallel to the reference implementation, while the run times when using the C2050 barely increase at all at longer tail lengths, which shows that it has a lot of spare capacity.

In general, we see that the OpenCL implementation is not likely to be practical for such small frame sizes with realistic echo tail lengths. This might not pose a problem on modern systems, as this frame size is most likely to be used for lower-quality signals with fairly low sample rates, for instance telephones sampling at 8kHz. The best case performance here is the C2050 with a 40 frame tail, which is still over 9 times slower than the reference. The data sizes involved are too small to warrant execution on a device where, as we have found, the latencies of a single or a few memory transfers or empty kernel launches are longer than the time the reference implementation takes to completely process a frame.

5.5.2 Frames of 512 Samples

An avenue for increasing the size of the data sets with only single input and output channels is to increase the frame size, i.e. the number of samples to be processed at once. With larger frame sizes, the useful echo tails also increase in terms of the number of samples: the minimum 2 frame tail is now 1024 samples and the maximum 40 frame tail measures 20480 samples. As the latency dominated with 128 sample frames, this levels the playing field to a certain degree. Longer frames are often used in practice in modern systems with enough bandwidth available, where higher sample rates such as 48kHz are common. Each sample now represents a shorter time interval, and hence more samples are needed in the echo tail to cover the same time interval as with the lower sample rate (for example, a 10 frame tail of 512-sample frames covers 5120 samples, or about 107ms at 48kHz).

Execution times with increasing tail lengths for frames of 512 samples are plotted in Figure 5.4. We still see that the OpenCL implementations are in general slower than the reference implementation, but the gap is now smaller. The complete plot of the ATI CPU implementation is not included, as it scaled in a way similar to the shorter frames and is significantly slower than any of the other implementations on larger sizes (see the table in Appendix A for a complete listing).

Because of the poor scalability of our OpenCL implementation on CPU, this configuration was excluded from most of the other benchmarks in this chapter. The ATI GPU implementation is still troubled by the high latency already at the shortest echo tails, but it scales fairly well to longer tails, and it scales better than the reference implementation, although it never closes the gap at reasonable echo tail lengths. Note that the weight update kernel function had to be split in two on the Radeon 5870 with this frame size because of the lack of memory available to each thread, so performance was slightly affected by this change (but was still parallelized into the same number of threads).

The GTX280 again shows fairly good performance, and the running times increase at a slower rate than for the reference implementation as the echo tail grows; even for the longest tails tested here, however, it is still 2.4x as slow as the reference. The C2050 again has the lowest latency, and its performance is barely affected by the tail length. At 10-20 frame long tails, it still performs significantly slower than the reference, but it reduces the difference as the tail length is increased further. A 40 frame echo tail is already longer than what is commonly used in practice (it represents 0.43 seconds at a 48kHz sample rate), so we focused on other factors that can be scaled to utilize the available parallelism.


Figure 5.4: Average execution time of a single frame with increasing echo tail lengths (512 sample frames).

At these frame sizes, the best GPU implementation performance becomes more comparable with the reference, which might make offloading computations a reasonable prospect. However, the GPUs should display better performance to make up for the extra power consumption compared to the CPU implementation (for the hardware tested here).

This frame size was used for the rest of the benchmarking tests (unless otherwise specified) to increase the numerical intensity to a reasonable level on the GPUs.


5.6 Performance Impact of Multiple Channels

In addition to increasing the echo tail length, the other parameters that can significantly increase the execution time of the echo cancellation program are the numbers of input and output channels used. As it was ascertained in Section 5.5 that the GPUs in some cases had little problem with throughput, increasing the number of audio channels is a way to benefit from this idle capacity. As in the last section, tables containing the results can be found in Appendix A, in addition to the plots presented here.

5.6.1 Scalability of Input Channels


Figure 5.5: Average execution time of a single frame with increasing number of input channels (512 sample frames).

The result of the benchmark for processing 512 sample frames with a 10 frame echo tail and 1-16 input channels can be found in Figure 5.5. As noted earlier, the ATI Radeon 5870 displays long latency already using a single channel, but the execution time remains at a fairly constant level (slowly increasing). Even though it scales well, it does not overtake the CPU implementation in terms of performance within the 16 channel limit set here. Both the NVIDIA GPUs perform better than the

reference from 9 (for the Tesla C2050) and 14 (for the Geforce GTX280) channels respectively. All the implementations, except for the ATI GPU, display fairly linear scaling properties within these parameters, with the reference having the greatest degradation of performance.

Even though the GPUs show good scaling properties as the number of input channels is increased, a large number of input channels for the filter function is rarely used in practice. Either the input channels are mixed together before being sent to the filter, or separate filter instances are used. Another reason for benchmarking with multiple input channels is that it is somewhat analogous to having task parallelism with multiple filters at the same time, as mentioned in Section 4.10.3 (but difficult in practice with OpenCL on some platforms at the time of writing); this shows potential for use in larger systems such as a PBX.

5.6.2 Scalability of Output Channels


Figure 5.6: Average execution time of a single frame with increasing number of output channels (512 sample frames).

As an alternative to increasing the number of input channels, the number of output channels can also be adjusted. Results from a benchmark similar to the one for input channels can be found in Figure 5.6; the same parameters are used, but only a single input channel is present.


Some of the same characteristics can be found as with the test for a variable number of input channels, but the reference implementation running times do not increase as sharply compared to the GPUs. This is the first benchmark in which we see a significant change in the ATI GPU running time as the parameter is scaled, although it is still hindered by overhead.

The Tesla C2050 generally has performance similar to the reference above 14 output channels, with the reference only being about 5% faster at 16 channels. The Geforce GTX280 never converges to the level of the reference implementation, although the difference is smaller at higher numbers of channels; it is 1.24x slower using 16 channels.

5.6.3 Scalability of Both Input and Output Channels


Figure 5.7: Average execution time of a single frame with increasing number of both input and output channels (512 sample frames).

To really test the scalability of the GPU implementations, a benchmark was also performed where both the number of input channels and the number of output channels were increased at the same time. As shown in Section 4.9, this increases the execution time on the CPU rapidly, since the amount of work grows with the product of the number of input and output channels. The results can be found in Figure 5.7.


Although not the most practical benchmark, it does show that if the number of channels is increased enough, all the GPU implementations perform better than the reference. The NVIDIA GPUs overtake the performance of the reference already at three to four input and output channels, while the ATI GPU overtakes it at six channels. At 12 channels, the Tesla is 5.3x faster than the reference. Perhaps of particular interest here is that the performance of the Geforce GTX280 comes close to the level of the ATI Radeon 5870 at the higher numbers of channels. This shows that the ATI GPU has good performance when the overhead is not the dominant factor in the total execution time.

It is worth noting here that the time represented by a 512 sample frame at a sample rate of 48kHz is about 10667 microseconds; the reference implementation passes this threshold after seven channels. Above this limit, the filter could not possibly be used in a real-time system. The C2050 never exceeds the limit in this benchmark, with its longest running time being 6122 microseconds at 12 channels. It should also be noted that this is an absolute upper bound, and the real-time window will in practice be shorter.

5.6.4 A Reasonable Multi-channel Use Case


Figure 5.8: Average execution time of a single frame with two input channels and an increasing number of output channels (512 sample frames).


A more practical, but still fairly computationally intensive, scenario was set up to test a more reasonable mix of input and output channels. The number of input channels was simply set to two, and the number of output channels was increased similarly to what was done in Section 5.6.2. The results can be found in Figure 5.8.

The Tesla C2050 GPU overtakes the performance of the reference at five output channels and more, which makes for a very practical application. Five speakers is not unreasonable in modern sound systems, as surround setups with four to five speakers or more are becoming widespread. At 16 output channels, it performs about 50% quicker than the reference (4831µs vs. 6997µs), but this is of limited practical use. The GTX280 needs 11 channels before it produces better performance than the reference. The Radeon 5870 never overtakes the reference, but displays an improved performance ratio compared to the reference at a large number of output channels; it is 33% slower at 16 output channels. The performance of the two latter GPUs is not very useful in practice, but could be used to offload the CPU under certain circumstances.

5.7 Profiling Kernel Functions

To examine the execution time of functions on the GPU, one can either time them explicitly inside the application or use a profiler. The NVIDIA Compute Profiler can be used to retrieve data about the running time of CUDA/OpenCL kernel functions on NVIDIA GPUs, together with a host of other performance metrics to help developers optimize the device code. A similar solution was also available from ATI, but not for the operating system and development environment used here.
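For reference, timing a kernel explicitly inside the application can be done with OpenCL's built-in event profiling, as in the sketch below. Only standard OpenCL calls are used; it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE and that the kernel and work sizes are set up elsewhere.

#include <stdio.h>
#include <CL/cl.h>

/* Sketch: measure the device-side execution time of one kernel launch. */
static void time_kernel(cl_command_queue queue, cl_kernel kernel,
                        size_t global_size, size_t local_size)
{
    cl_event evt;
    cl_ulong start = 0, end = 0;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(evt);

    /* The timestamps are in nanoseconds on the device clock. */
    printf("kernel time: %.1f microseconds\n", (end - start) / 1000.0);
}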

Two use cases were run with the profiler: first a set of very computationally light parameters (128 sample frames, 1280 sample echo tail, single channel), and then a more computationally demanding set (512 sample frames, 10240 sample echo tail, 4 input and 4 output channels), to gain insight into which kernels dominate in each case. These are just a small sampling of the possible parameter configurations, but they are useful for getting an overview and for showing the difference between the two devices used here, the Geforce GTX280 and the Tesla C2050.

For the result of the profiler run with the less computationally demanding example, see Figure 5.9. In general, we see that the C2050 has a more evenly distributed running time, with the longest running kernel (clFFT_custom) taking up 11% of the total time, whereas the GTX280 spends 19% of its GPU time on the clMDF_adjust_prop_phase1_reduce kernel, which is significantly faster on the C2050. This shows that the new Fermi architecture helps hide memory access time for kernels such as clMDF_adjust_prop_phase1_reduce, which is not particularly computationally intensive, so that the kernels involving FFTs, which perform more computations, dominate instead. With this configuration the running time inside the kernels will to a certain degree be amortized by overhead, so it might not be worth optimizing kernels to gain a few percent at smaller sizes.


Figure 5.9: Kernel function execution time (percentage of total GPU time per kernel) with NVIDIA Geforce GTX280 (top) and Tesla C2050 (bottom), 128 sample frames, single input/output channel. Notice the different scales on the X-axis.

The more computationally demanding profiler run can be seen in Figure 5.10, and shows more similarity between the two devices as the frame size and number of channels are scaled up. The weight update kernel, which predictably is the most computationally demanding function in the filter, dominates the running time on both. The speed of the FFT implementation now becomes very important (see Section 4.6 for more information on this kernel).



Figure 5.10: Kernel function execution time (percentage of total GPU time per kernel) with NVIDIA Geforce GTX280 (top) and Tesla C2050 (bottom), 512 sample frames, four input/output channels.

It is also interesting to see the memory copy operations listed as the two bottom bars in the profiles, which now take up a much smaller fraction of the overall running time.


5.8 Performance With Load Balancing and CPU Load

A requirement for actually offloading computations to a GPU, or another co-processor, is that it actually decreases the work that needs to be carried out by the CPU. To get an idea of the actual CPU usage of the different OpenCL implementations tested, a simple scenario was set up to test the filter while running a script that measures the system CPU load (the same measurement the load balancer is based on, see Section 4.10.2). The original Speex reference implementation was tested, together with the ATI CPU implementation, the ATI GPU implementation and the NVIDIA GPU implementation. These were tested using the corresponding systems listed in Section 5.1; both systems were quad-core systems, and 100% CPU usage means that all the cores are occupied. The test scenario used 128 sample frames, with a 40 frame long echo tail, 2 input channels and 2 output channels, which should be enough to generate a representative load on the system.
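The load measurement itself can be sketched as below. This is a simplified illustration of what such a measurement can look like on Linux, where the aggregate "cpu" line of /proc/stat is sampled and the non-idle fraction between two samples is reported as a percentage; the actual get_system_CPU_load() used by the test program (see Appendix B) is not reproduced here.

#include <stdio.h>

static long prev_total = 0, prev_idle = 0;

/* Sketch: total CPU load in percent, derived from two samples of /proc/stat. */
static int get_system_CPU_load(void)
{
    long user, nice, sys, idle, total, d_total, d_idle;
    FILE *fp = fopen("/proc/stat", "r");

    if (!fp)
        return -1;
    if (fscanf(fp, "cpu %ld %ld %ld %ld", &user, &nice, &sys, &idle) != 4) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    total = user + nice + sys + idle;
    d_total = total - prev_total;
    d_idle = idle - prev_idle;
    prev_total = total;
    prev_idle = idle;

    /* Load is the non-idle fraction of the CPU time elapsed between samples. */
    return d_total > 0 ? (int)((100 * (d_total - d_idle)) / d_total) : 0;
}

The load balancer described in Section 4.10.2 can then compare this value against a threshold and set req_dev to either OPENCL or CPU_REFERENCE (see the test program in Appendix B) before the next frame is processed.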

As expected, the reference implementation uses 25% of the total CPU time, as it is not written to utilize multiple threads, but fully utilizes a single core. Both the NVIDIA and ATI GPU implementations averaged slightly below this, but still mostly above 20%. For NVIDIA, the Tesla C2050 generated a higher CPU load than the Geforce GTX280 (10-15%), but performed the task faster. This can be attributed to the test application simply having some CPU-demanding components, most noticeably file I/O. With longer execution times on the GPU, the CPU can be idle for longer periods of time while waiting for the GPU to finish. The Tesla C2050 had a CPU load similar to the ATI Radeon 5870, despite offering nearly three times the performance in this scenario.

The relatively high CPU load on the ATI GPU could be caused by the higher overhead of the implementation found in Section 5.2, but also by the fact that it depends on the X Window System to run on Linux systems, which NVIDIA does not. Extra CPU load is added by the rendering of the graphical console itself and certain background tasks that are common with such a system. The ATI CPU implementation is multithreaded and results in a much larger load on the CPU, with utilization between 80-85% for this application. While this shows that most of the CPU is used, it still leaves 10-15% idle capacity, which is not ideal. Other OpenCL applications can push the CPU usage with this implementation to close to 100%, but this shows that more complex applications without long-running kernel functions can struggle to reach such high usage. As there is often some overhead in spawning a large number of threads on the CPU, longer-running kernels will probably benefit more from being run with OpenCL compared to a traditional single-threaded implementation. As noted earlier, our OpenCL kernel functions have not been optimized with CPU execution in mind, so there might be scope for improvement here.


5.9 Summary of Results

After benchmarking the OpenCL implementation, some conclusions can be drawn.

5.9.1 Performance Is Often Bound by Latency

As we have seen, the OpenCL version of the filter is only usable in certain computationally intensive configurations in its current form. All the different kernel functions and the several transfer operations that are called during a normal program run collectively have too much overhead to be usable at small frame sizes like 128 samples with single channel input/output. Several methods could be used to reduce this, first among which is to combine kernel functions even further and to reduce the number of kernel arguments to only the absolutely essential data. But this can be tedious work; unintended consequences for performance can be hard to keep track of with the different possible combinations of kernels, and the task is not made easier by the goal that the code should be optimized for different hardware platforms while retaining compatibility. Auto-tunable methods as discussed in Section 2.3.4 would be ideal, but as mentioned it can be challenging to apply them to a larger application composed of a variety of different operations.

More operations could also be executed faster on the host (CPU), but there is a delicate balance between execution time and the time it takes to transfer the necessary memory back and forth to the device. Part of the problem here is that the filter uses a lot of different buffers in memory, which can be hard to keep track of. The current implementation is fairly naive in its handling of the data structure, since it keeps all the data available on the device. By analyzing the life-cycle of the buffers, only transferring what is absolutely necessary, and reusing or combining buffers where possible, some overhead could be removed. The reference implementation is written to function even on embedded devices with a limited amount of memory available, in contrast to most devices used by OpenCL today, which have a fairly large amount of DRAM available to them. This could help increase performance if allocating more memory can be used to reduce the number of transfers and kernel arguments, and thus the overhead.

5.9.2 Performance Scales Well With Demanding Parameters

When scaling the filter to large frames, long tails and several channels, the GPUs show better scaling properties than the reference implementation once they overcome the initial latency threshold of about 2ms on the NVIDIA devices. When scaling above this threshold, the GPU performs the tasks faster than the CPU, and a few compute-intensive kernel functions dominate the running time.

The problem is that this threshold is already at the limit of the running times obtained with practical parameters for the filter. A good traditional multi-core CPU implementation might also close the gap to the GPU performance at realistic parameters, since we only found single-digit speedups. But this does not change the fact that the performance is good enough for offloading processing to the device, which might free the CPU for other tasks.

At demanding filter parameters, a few of the compute kernels that dominate running times are completely dependent upon the FFT implementation. In these cases, significant performance gains might be possible by using an even more optimized FFT implementation. We discovered some performance difference between the library used and the CUFFT library in Section 5.3, so there should be some performance gains available by tweaking the existing library version or switching to another library altogether. Memory access patterns to global memory also become more important at these sizes, so a more detailed analysis might reveal further possibilities for optimization. But again, there is often a fine balance between kernel code that executes with as little latency as possible on "simpler" parameters and code that scales well when it needs to process more data.

There is a point to be made that switching GPGPU framework to CUDA could, at the time of writing, have gained some performance due to the lower latency found in Section 5.3, but this would limit hardware compatibility and co-processing potential. It remains to be seen whether better implementations can close this gap in the future, or whether other vendors can catch up with other solutions. A performance difference between CUDA and OpenCL was also found by Zhang et al. [36] for CT reconstruction and image recognition. They found a 9 times speedup with their CUDA implementation over OpenCL (which was still much faster than their CPU implementation), but they were using early versions of the vendor implementations.

5.9.3 OpenCL CPU Performance Is Lacking

The load-balancing possibilities of OpenCL are one of the more appealing features of the technology, but we found CPU performance with OpenCL to be very much lacking in our benchmarks. This is for the most part due to the FFT library not being optimized for CPU execution and to no CPU-specific optimizations being added to the echo cancellation code, but we also found significant overhead from executing through OpenCL, compared to almost no overhead with a traditional CPU program. This made the CPU implementation of OpenCL mainly a debugging tool during development, but some steps can be taken to improve on this:

• Memory should be shared in a more rational way between host and device when the device is a CPU; mapping the memory can lead to large performance benefits in such cases (see the sketch after this list).


• Another FFT implementation for OpenCL should be used that is optimized for CPUs (if available), or the FFT could simply be performed by existing multi-threaded FFT libraries for CPUs, although this increases code complexity. The kernels that include FFTs dominated running times when more demanding parameters were used, even more so than on the GPU. If an OpenCL FFT with similar or better performance than traditional single-threaded CPU libraries is not available, it contradicts the reason for running the filter with OpenCL on the CPU, as the FFT is the most attractive operation to parallelize.

• CPU optimizations should be written into the computational kernels and the host code to utilize the CPU better. A very large number of work-items/threads might not always be the answer on CPUs, and the restrictions on memory performance are different. Optimally, all this should be combined with auto-tunable methods.
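As an illustration of the first point above, OpenCL allows a buffer to be allocated in host-accessible memory and mapped instead of copied, which on a CPU device can avoid redundant copies altogether. The sketch below uses only standard OpenCL calls; the context and command queue are assumed to exist, and error checking is omitted.

#include <CL/cl.h>

/* Sketch: zero-copy style buffer use, suited to CPU (or integrated) devices.
 * CL_MEM_ALLOC_HOST_PTR asks the runtime for host-visible memory, and mapping
 * then gives the host a pointer into that allocation instead of forcing a copy. */
static cl_mem create_and_fill_frame(cl_context ctx, cl_command_queue queue,
                                    size_t bytes)
{
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, NULL, NULL);
    size_t i;

    float *ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                             0, bytes, 0, NULL, NULL, NULL);
    for (i = 0; i < bytes / sizeof(float); i++)
        ptr[i] = 0.0f;                /* stand-in for writing real input data */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);

    /* Kernels using buf can now be enqueued without an explicit host-to-device copy. */
    return buf;
}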


Chapter 6

Conclusions and Future Work

In this thesis, we investigated co-processing between CPUs and GPUs for audio processing, more specifically acoustic echo cancellation, which is used to avoid sending the received audio back to the initial sender as an echo. Since GPUs are massively parallel devices that often have spare capacity, off-loading computations to them can have great benefits for applications that need available CPU time for other purposes, for instance for decoding video in video conferencing or calculating the next move of an artificial opponent in a computer game. On embedded devices, off-loading computations to the GPU may be crucial to achieve the performance needed or to conserve power. Execution on GPUs also opens up the possibility of massively parallel sound processing, such as in large telephony systems.

The echo filter implemented in this thesis is based on the original C code from the Speex library. Our optimized programs, developed with the new OpenCL standard, can be executed both on ATI/NVIDIA GPUs and on multi-core CPUs. Tests were created to validate each step of the program against the original reference version, so they can be run side-by-side using the same input data for verification. The Apple OpenCL FFT library was used as a basis for the optimized transform written for both CPUs and GPUs in OpenCL.

6.1 Conclusions

Benchmarks of our two implementations (one CPU-based and one with GPU co-processing, both implemented in OpenCL) showed that the more computationally demanding configurations of the echo cancellation filter favored execution on GPUs. When more input and output channels are added, the execution time of the reference (Speex) implementation increases steeply, and this allowed the massive parallelism of GPUs to be utilized to increase performance. We found speedups of up to 5.3x when using 12 input and 12 output channels, and comparable performance for a more realistic configuration of 2 input channels and 5 or more output channels. What made this increase in performance possible was that FFTs over large data sizes dominated the execution time when scaling the parameters, and the FFT parallelizes well on GPUs.

However, on less demanding input configurations, our GPU implementation suffered from the overhead of the OpenCL vendor implementations used. For processing single channel audio data with smaller frame sizes, the overhead completely dominated the running time on the GPU. This meant that the performance was not comparable to the reference CPU version, which was for instance 77x faster than the fastest OpenCL implementation on frames of 128 samples run with a single channel and an echo tail of 10 frames.

Large differences were found in the overhead of both kernel execution and memory transfers between different GPGPU frameworks and vendor implementations of OpenCL, as well as between different generations of GPU hardware. For instance, we found the fastest GPU tested with CUDA to be almost 60 times faster at a single byte memory write to the device than the slowest OpenCL implementation from another vendor. The large variations found indicate that OpenCL vendor implementations still have a way to go before they reach maturity. Differences were also found between the available FFT libraries for CUDA and OpenCL, with the GPU-vendor supported CUFFT library coming out on top.

Overhead from the OpenCL implementation severely impacts the performance of the echo cancellation filter, as it depends on a long series of kernel launches because of the large number of varying operations performed. Memory transfers can also be limiting, as some tasks are best executed on the CPU. This was shown in the single-channel performance, where the fastest OpenCL implementation had a minimum running time of about 2ms, and it did not increase significantly from this level until multiple channels were added. The overhead incurred by the implementation can be reduced by further combining kernel functions, but this could affect performance scalability. In general, GPUs are not very well suited to sub-millisecond tasks such as this, as stated by Owens et al. [7]. This problem can be alleviated by new architectures where the CPU and GPU have access to the same main memory, eliminating memory transfers and reducing overall overhead. Such architectures are already common in embedded devices and certain PC platforms.

However, with multiple channels, the GPU performance is already sufficient for offloading tasks from the CPU. A prototype load-balancer was created to offload the CPU by executing the filter on the GPU if the overall system CPU load becomes too high. Because of the lacking performance of the OpenCL CPU implementation (mainly due to an FFT library optimized only for GPUs), the load-balancer switches between the reference implementation on the CPU and the OpenCL version on the GPU. In computationally demanding configurations, this can free available CPU time for other applications, but executing on the GPU itself occupies the majority of a single CPU core, so this is most useful in multi-core systems. Offloading the CPU can be very important in embedded devices, which increasingly contain advanced GPUs and will have GPGPU frameworks available to them in the future.

Overall, we found it feasible to implement audio processing on GPUs, but a detailed analysis of the application at hand should be performed to determine whether it can generate enough computational load to utilize the massive parallelism inherent in GPUs and overcome the entry barrier represented by implementation overhead. This can improve as the frameworks and hardware for GPGPU mature, making co-processing even more attractive.

6.2 Future Work

Many possibilities exist for further work on the subject of this thesis; a few are listed in this section.

6.2.1 Further Optimize the OpenCL Implementation

Although some tuning was done on the OpenCL implementation, most of the time was still spent simply creating the first implementation and making sure it executes correctly on different platforms. Several avenues exist for further optimizing the existing codebase. One possibility is to experiment with the ordering and division of tasks between kernels; simply reducing the number of kernel functions could improve performance by reducing overhead. A pragmatic approach could also be taken to further mix execution on the CPU and the GPU, but the challenge is then to minimize memory transfers while taking advantage of the operations that are quicker on the CPU.

6.2.2 Investigate Massively Parallel Audio Processing

A way to further increase the workload on the GPU beyond using multiple channels is to run several independent audio streams in parallel, as would be the case for instance with the Asterisk PBX. A scheme for how this could be implemented was proposed in Section 4.10.3. Such an approach would be possible with both OpenCL and CUDA at the time of writing, but it has only become available with the newest generation of hardware.

6.2.3 Explore Other Algorithms for Echo Cancellation

Other methods exist for echo cancellation than the particular one used in Speex (AUMDF). One alternative, described in [37], uses the Discrete Cosine Transform, the Extended Lapped Transform and the Lapped Orthogonal Transform in an LMS filter. It would be interesting to perform a comparison of the accuracy and performance of such a filter on parallel architectures. The basic transforms have already been shown to perform well on GPUs [38]. In general, signal processing algorithms often follow a pipeline structure, but this does not necessarily translate well to GPU execution. Custom algorithms that utilize massively parallel architectures would be ideal for this task, but this requires more research into signal processing.

6.2.4 Integrate the Code Better Into the Library

While the current code is built to integrate into the existing Speex library, it is by no means production ready in the sense that it could be directly incorporated without any changes. It needs to be better integrated with the build system, stabilized and tested on more platforms to become a portable part of the library. It might be interesting to adapt it to a larger pattern for implementing pieces of the library in OpenCL.

6.2.5 Investigate Converting Other Parts of Speex to OpenCL

The echo cancellation filter is only a small part of Speex; other functionality could possibly be converted to OpenCL for running on GPUs/accelerators. Maybe the most obvious component would be the decoding of the audio stream itself, but that would be a fairly challenging task. Other preprocessing filters such as noise cancellation and voice activity detection might also be possible candidates. It might also be possible to add new filters that are considered too arithmetically intensive for the CPU.

6.2.6 Utilize the GPU Version in an Application

Currently, the OpenCL version of the echo filter is only tested using the “testecho” program that comes with Speex. It would be beneficial to have a complete application that could demonstrate processing on different hardware. One could imagine a pure demo application to demonstrate the concepts, or for instance a custom version of the Mumble application. Such an application should exploit the echo filter with parameters that are beneficial for offloading the CPU.

6.2.7 Running on an Embedded Device

As embedded devices are becoming more powerful, they are also equipped with more powerful GPUs and accelerator chips. The platform-agnostic approach of OpenCL would benefit these platforms greatly, especially since they often have less powerful CPUs and need to take advantage of all their hardware to achieve high performance. Executing a program where it is most appropriate with regard to power consumption is also very important on these devices. The Speex codec itself should be a prime candidate for software packages that could be of great use on embedded devices, especially mobile phones, which are increasingly merging with VoIP and other online services such as video conferencing.

6.3 Concluding Remarks

As shown by this work, off-loading computations to accelerators such as GPUs can offer attractive, energy-efficient ways of increasing system performance, especially for real-time applications where CPU load is a critical issue. The optimization techniques developed in this thesis are therefore expected to be of increasing importance as GPUs and CPUs become more and more integrated, especially on embedded devices. This makes latencies less of an issue and hence the value of our results stronger. This is especially true for applications such as audio processing on just a few channels, which has less computational load per memory access compared to traditional GPGPU applications such as dense linear algebra.


Bibliography

[1] A. Nukada and S. Matsuoka. Auto-tuning 3-D FFT library for CUDA GPUs. In SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1–10, New York, NY, USA, 2009. ACM.

[2] A. Munchi. OpenCL Specification, Version 1.1, Revision 33. Khronos Group, June 2010. http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf.

[3] J. Valin. The Speex Codec Manual, Version 1.2 Beta 3. Xiph.org Foundation, December 2007. http://www.speex.org/docs/manual/speex-manual.pdf.

[4] B. Cowan and B. Kapralos. Spatial sound for video games and virtual environments utilizing real-time GPU-based convolution. In Future Play '08: Proceedings of the 2008 Conference on Future Play, pages 166–172, New York, NY, USA, 2008. ACM.

[5] Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover. GPU cluster for high performance computing. In SC '04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, page 47, Washington, DC, USA, 2004. IEEE Computer Society.

[6] NVIDIA OpenCL Programming Guide for the CUDA Architecture, Version 3.1. Guide, May 2010.

[7] J.D. Owens, M. Houston, D. Luebke, S. Green, J.E. Stone, and J.C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, May 2008.

[8] A. Munchi. OpenCL Specification, Version 1.0, Revision 48. Khronos Group, September 2009. http://www.khronos.org/registry/cl/specs/opencl-1.0.pdf.

[9] NVIDIA's next generation CUDA compute architecture: Fermi. White paper, September 2009. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.

[10] AMD: Unleashing the power of parallel compute - with commodity ATI Radeon 5800 series GPU. Course notes, OpenCL Parallel Programming for Computing and Graphics, SIGGRAPH Asia 2009, December 2009. http://sa09.idav.ucdavis.edu/docs/SA09_AMD_IHV.pdf.

[11] NVIDIA OpenCL best practices guide - CUDA toolkit 3.1. Guide, April 2010. http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_OpenCL_BestPracticesGuide.pdf.

[12] B. Wilkinson and M. Allen. Parallel Programming - Techniques and Applications Using Networked Workstations and Parallel Computers. Pearson Prentice-Hall, second edition, 2005. International edition.

[13] Apple Grand Central Dispatch. Technology brief, 2009. http://images.apple.com/macosx/technology/docs/GrandCentral_TB_brief_20090903.pdf.

[14] D. Cederman and P. Tsigas. On dynamic load balancing on graphics processors. In GH '08: Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pages 57–64, Aire-la-Ville, Switzerland, 2008. Eurographics Association.

[15] S. Kamil, C. Cy, L. Oliker, J. Shalf, and S. Williams. An auto-tuning framework for parallel multicore stencil computations. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–12, 2010.

[16] S.S. Baghsorkhi, M. Delahaye, S.J. Patel, W.D. Gropp, and W.W. Hwu. An adaptive performance modeling tool for GPU architectures. In PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 105–114, New York, NY, USA, 2010. ACM.

[17] E. Gallo and N. Tsingos. Efficient 3D audio processing on the GPU. In Proceedings of the ACM Workshop on General Purpose Computing on Graphics Processors. ACM, August 2004.

[18] F. Trebien and M.M. Oliveira. Realistic real-time sound re-synthesis and processing for interactive virtual worlds. Vis. Comput., 25(5-7):469–477, 2009.

[19] Q. Zhang, L. Ye, and Z. Pan. Physically-based sound synthesis on GPUs. In Entertainment Computing - ICEC 2005, volume 3711, pages 328–333. Springer Berlin / Heidelberg, 2005.

[20] M. Jedrzejewski and K. Marasek. Computation of room acoustics using programmable video hardware. In K. Wojciechowski, B. Smolka, H. Palus, R.S. Kozera, W. Skarbek, and L. Noakes, editors, Computer Vision and Graphics. Springer Netherlands, 2006.


[21] P.R. Dixon, T. Oonishi, and S. Furui. Harnessing graphics processors for the fast computation of acoustic likelihoods in speech recognition. Comput. Speech Lang., 23(4):510–526, 2009.

[22] H. Ludvigsen and A.C. Elster. Real-time ray tracing using Nvidia OptiX. In Eurographics Short Papers, pages 65–68. The Eurographics Association, 2010.

[23] N. Tsingos, E. Gallo, and G. Drettakis. Breaking the 64 spatialized sources barrier. Gamasutra Audio Resource Guide 2003, May 2003.

[24] J. Valin. On adjusting the learning rate in frequency domain echo cancellation with double-talk. IEEE Transactions on Audio, Speech and Language Processing, 15(3):1030–1034, March 2007.

[25] J. Soo and K.K. Pang. Multidelay block frequency domain adaptive filter. IEEE Transactions on Acoustics, Speech and Signal Processing, 38(2):373–376, February 1990.

[26] Å. Herikstad. GPU sound processing, December 2008. Specialization Project TDT4590, Complex Computer Systems, NTNU.

[27] A. Daher, E.-H. Baghious, G. Burel, and E. Radoi. Overlap-save and overlap-add filters: Optimal design and comparison. Signal Processing, IEEE Transactions on, 58(6):3066–3075, June 2010.

[28] J. M. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19:297, 1965.

[29] M. Frigo and S.G. Johnson. FFTW for Version 3.2.2. Massachusetts Institute of Technology, July 2009. http://www.fftw.org/fftw3.pdf.

[30] G. Stantchev, D. Juba, W. Dorland, and A. Varshney. Using graphics processors for high-performance computation and visualization of plasma turbulence. Computing in Science & Engineering, 11(2):52–59, March-April 2009.

[31] V. Volkov and B. Kazian. Fitting FFT onto the G80 architecture. University of California, Berkeley, May 2008.

[32] N.K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli. High performance discrete Fourier transforms on graphics processors. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–12, Piscataway, NJ, USA, 2008. IEEE Press.

[33] S.W. Smith. The Scientist & Engineer's Guide to Digital Signal Processing. California Technical Pub., first edition, 1997. Chapter 12 retrieved 25 May at http://www.dspguide.com/CH12.PDF.


[34] IEEE, editor. 802.3-2008 IEEE Standard for Information Technology – Telecommunications and Information Exchange Between Systems – Specific Requirements Part 3: CSMA/CD Access Method and Physical Layer Specifications. New York, 2008.

[35] J. Silvestre, V. Sempere, and T. Albero. Impact of the use of large frame sizes in fieldbuses for multimedia applications. In Emerging Technologies and Factory Automation, 2005. ETFA 2005. 10th IEEE Conference on, volume 1, pages 8 pp.–440, 2005.

[36] W. Zhang, L. Zhang, S. Sun, Y. Xing, Y. Wang, and J. Zheng. A preliminary study of OpenCL for accelerating CT reconstruction and image recognition. In Nuclear Science Symposium Conference Record (NSS/MIC), 2009 IEEE, pages 4059–4063, October 2009.

[37] H. S. Malvar. Signal Processing with Lapped Transforms. Artech House, first edition, 1992.

[38] A.A. Aqrawi. Effects of compression on data-intensive algorithms. Master's thesis, Norwegian University of Science and Technology, 2010.

Appendices


Appendix A

Benchmark Time Measurements

Since a few of the tables with time measurements from Chapter 5 were rather large, the complete measurement data has been collected in this appendix as a reference.

Performance Impact of Echo Tail Length

Shorthand table-headings:

• 5870 - ATI Radeon 5870 GPU with OpenCL.

• OCLCPU - AMD Phenom II X4 965 CPU with the ATI OpenCL implemen- tation for CPU.

• C2050 - NVIDIA Tesla C2050 GPU with OpenCL.

• GTX280 - NVIDIA Geforce GTX280 GPU with OpenCL.

• Speex - Intel Core 2 Quad Q9550 with the reference CPU implementation in the Speex library.

Frames of 128 Samples Average execution time of a single frame with increasing echo tail lengths (128 sample frames). All values are in microseconds.


Echo tail length   5870   OCLCPU   C2050   GTX280   Speex
2 frames           5400   2391     2062    2114     35
4 frames           5714   2440     2068    2141     39
6 frames           5269   2794     2062    2166     54
8 frames           5227   3015     2064    2190     63
10 frames          5694   3381     2070    2216     74
12 frames          5428   3658     2074    2239     81
14 frames          5868   4045     2075    2263     92
16 frames          5648   4246     2084    2291     102
18 frames          5621   4583     2093    2316     110
20 frames          5348   4729     2096    2339     122
22 frames          5665   5038     2101    2361     130
24 frames          5631   5122     2097    2388     144
26 frames          5622   5604     2104    2414     148
28 frames          5528   5759     2103    2441     159
30 frames          5803   5733     2103    2461     167
32 frames          5645   5909     2105    2487     177
34 frames          5607   6302     2105    2512     188
36 frames          5768   6466     2114    2541     201
38 frames          5695   6753     2116    2562     209
40 frames          5654   6944     2116    2588     223

Frames of 512 Samples Average execution time of a single frame with increasing echo tail lengths (512 sample frames). All values are in microseconds.

Echo tail length   5870   OCLCPU   C2050   GTX280   Speex
2 frames           6317   5422     2151    2274     118
4 frames           6302   5966     2167    2310     168
6 frames           6300   7432     2157    2335     208
8 frames           6299   8588     2159    2369     251
10 frames          6291   9842     2163    2401     304
12 frames          6311   10467    2175    2444     339
14 frames          6485   11768    2185    2497     388
16 frames          6482   12435    2191    2511     429
18 frames          6432   14060    2197    2545     456
20 frames          6552   14763    2201    2583     511
22 frames          6617   15994    2210    2630     557
24 frames          6661   16661    2216    2670     613
26 frames          6645   17912    2218    2700     646
28 frames          6701   18650    2238    2731     668
30 frames          6712   23575    2248    2778     731
32 frames          6786   24603    2253    2866     778
34 frames          6788   25838    2257    2856     851
36 frames          6891   26889    2266    2929     873
38 frames          6786   28320    2269    2960     912
40 frames          6899   29119    2270    2965     947


Performance Impact of Multiple Channels

Scalability of Input Channels Average execution time of a single frame with increasing number of input channels (512 sample frames). All values are in microseconds.

Input channels   5870   C2050   GTX280   Speex
1 channel        6600   2168    2410     295
2 channels       6449   2221    2519     564
3 channels       6987   2287    2642     832
4 channels       7059   2331    2782     1117
5 channels       7434   2383    2876     1384
6 channels       7223   2465    2993     1673
7 channels       7226   2548    3078     1943
8 channels       7373   2601    3189     2248
9 channels       7541   2643    3296     2474
10 channels      7408   2685    3460     2810
11 channels      7488   2737    3567     3037
12 channels      7449   2779    3652     3280
13 channels      7268   2839    3790     3578
14 channels      7341   2911    3894     3918
15 channels      7335   2938    4020     4109
16 channels      7514   3002    4186     4386

Scalability of Output Channels Average execution time of a single frame with increasing number of output channels (512 sample frames). All values are in microseconds.

Output channels   5870   C2050   GTX280   Speex
 1 channel        6368    2156     2395     293
 2 channels       6587    2205     2588     529
 3 channels       6786    2250     2801     799
 4 channels       6980    2295     3011    1040
 5 channels       7337    2346     3219    1219
 6 channels       7545    2461     3445    1399
 7 channels       7482    2536     3544    1693
 8 channels       7400    2672     3752    1843
 9 channels       7557    2749     3943    2122
10 channels       7637    2868     4154    2373
11 channels       7884    3034     4361    2594
12 channels       8188    3675     4561    2820
13 channels       8219    3872     4770    3077
14 channels       8576    4084     4976    3789
15 channels       8688    4299     5197    4044
16 channels       9248    4529     5451    4305


Scalability of Both Input and Output Channels Average execution time of a single frame with increasing number of both input and output channels (512 sample frames). All values are in microseconds.

Input/output channels   5870   C2050   GTX280   Speex
 1 channel              6248    2176     2397     294
 2 channels             6812    2283     2741    1011
 3 channels             7304    2457     3243    2222
 4 channels             7542    2629     3886    3708
 5 channels             7790    2868     4660    5742
 6 channels             9082    3145     5663    7901
 7 channels             9565    3571     6881   11216
 8 channels            11875    3952     8196   14873
 9 channels            11100    4389     9433   17692
10 channels            14590    4921    10973   22230
11 channels            13821    5545    12701   27161
12 channels            19657    6122    14530   32472

A Reasonable Multi-channel Use Case Average execution time of a single frame with two input channels and an increasing number of output channels (512 sample frames). All values are in microseconds.

Output channels   5870   C2050   GTX280   Speex
 1 channel        6360    2211     2505     560
 2 channels       6898    2264     2766    1038
 3 channels       7153    2364     3035    1494
 4 channels       7270    2425     3301    1893
 5 channels       6946    2487     3559    2253
 6 channels       7007    2606     3750    2884
 7 channels       7230    2725     4017    3335
 8 channels       7434    2804     4309    3682
 9 channels       7676    2923     4542    4276
10 channels       8022    3087     4818    4556
11 channels       8376    3650     5094    5199
12 channels       8765    3909     5357    5589
13 channels       8588    4094     5609    5909
14 channels       9108    4325     5895    6314
15 channels       9335    4560     6170    6812
16 channels      10499    4831     6670    6997

Appendix B

Source Code

This appendix lists the relevant source code. Host-side OpenCL code has not been included, only device code.

B.1 Test Program

This is the source code for the test application, containing the main function used for testing and benchmarking. Some benchmarking code has been removed from this listing; it only runs the main loop several times with increasing echo tail lengths, input channels or output channels, in order to avoid recompiling the kernel functions between runs. A sketch of such a driver is shown below.
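As an illustration only, a driver of roughly the following shape could perform such a sweep. The helper run_one_configuration and the constants are hypothetical stand-ins and are not part of the original benchmarking code:

#include "speex/speex_echo.h"

#define NN 512                    /* frame size in samples, as in the test program */
#define MAX_TAIL_FRAMES 40        /* hypothetical sweep limit                      */

/* run_one_configuration() is a hypothetical stand-in for the measurement loop
 * in main(); it is not part of the original code. */
extern void run_one_configuration(SpeexEchoState *st, int iterations);

/* Sketch of a benchmark driver that sweeps the echo tail length. Only the
 * Speex echo state is recreated per run; the OpenCL context and the compiled
 * kernels can stay alive, so the kernels are not rebuilt between runs. */
static void sweep_echo_tail(int sample_rate, int iterations)
{
    for (int tail_frames = 2; tail_frames <= MAX_TAIL_FRAMES; tail_frames += 2)
    {
        SpeexEchoState *st = speex_echo_state_init(NN, NN * tail_frames);
        speex_echo_ctl(st, SPEEX_ECHO_SET_SAMPLING_RATE, &sample_rate);

        run_one_configuration(st, iterations);

        speex_echo_state_destroy(st);
    }
}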

1 #ifdefHAVE_CONFIG_H 2 #include"config.h" 3 #endif 4 5 #include 6 #include 7 #include 8 #include 9 #include 10 #include 11 #include"speex/speex_echo.h" 12 #include"speex/speex_preprocess.h" 13 14 #define NN 512 15 #define TAIL 512*20 16 17 #define MAX_ITERATIONS 10000 18 // If benchmarking on number of channels, set these values to the maximum value that 19 // is going to be tested 20 #define NUM_CHANNELS 1// OpenCL version is limited to max 64 channels 21 #define NUM_SPEAKERS 1


22 #define ENABLE_LOAD_BAL 1// Enable load balancer 23 24 // Used to determine between the SpeexCPU reference implementation and the OpenCL version 25 enumMDF_DEVICE 26 { 27 OPENCL = 0, 28 CPU_REFERENCE = 1, 29 }; 30 31 // Load monitor global variables 32 FILE *fp; 33 char *cpustring; 34 struct timeval last_cpu_check, curr_time; 35 int exec_dev = CPU_REFERENCE; 36 int req_dev = CPU_REFERENCE; 37 int prev_total = 0; 38 int prev_idle = 0; 39 40 int get_system_CPU_load(); 41 void interleave_input(short *echo_in, short *ref_in, short *echo_out, short* ref_out ); 42 43 int main(int argc, char **argv) 44 { 45 FILE *echo_fd[NUM_SPEAKERS]; 46 FILE *ref_fd[NUM_CHANNELS]; 47 FILE *e_fd[2];// Write one output fromCPU and one from OpenCL 48 FILE *errors_file_fd; 49 // Frame read from file 50 short *echo_buf = (short *)malloc(sizeof(short)*NN*NUM_SPEAKERS); 51 // Frame passed to filter, interleaved channels 52 short *echo_buf_interleaved = (short *)malloc(sizeof(short)*NN*NUM_SPEAKERS); 53 short *ref_buf = (short *)malloc(sizeof(short)*NN*NUM_CHANNELS); 54 short *ref_buf_interleaved = (short *)malloc(sizeof(short)*NN*NUM_CHANNELS); 55 short *e_buf = (short *)malloc(sizeof(short)*NN*NUM_CHANNELS); 56 short *ref_buf_copy = (short *)malloc(sizeof(short)*NN*NUM_CHANNELS); 57 short *echo_buf_copy = (short *)malloc(sizeof(short)*NN*NUM_SPEAKERS); 58 short *e_buf_copy = (short *)malloc(sizeof(short)*NN*NUM_CHANNELS); 59 float *residual_error = (float *)malloc(sizeof(double)*MAX_ITERATIONS); 60 SpeexEchoState *st; 61 SpeexEchoState *st_copy; 62 SpeexPreprocessState *den; 63 SpeexPreprocessState *den_copy; 64 int sampleRate = 48000; 65 int last_cpu_usage = 0; 66 inti=0; 67 intj=0; 68 int run = 1; 69 char clear_key; 70 71 cpustring = (char *)malloc(sizeof(char) * 80); 72 #ifdef ENABLE_LOAD_BAL 73 gettimeofday(&last_cpu_check, NULL); 74 # endif 75 76 speex_echo_opencl_init_dev(NN); 77 78 for(i=0;i


88 { 89 printf ("Open loop input channels.\n"); 90 ref_fd[i] = fopen(argv[1],"rb"); 91 } 92 for(i=0;i


155 { 156 printf ("Starting loop iteration echo.\n"); 157 // Check if end of file has been reached, for all channels 158 for(j=0;j


215 # endif 216 217 i ++; 218 #ifdef ENABLE_LOAD_BAL 219 gettimeofday(&curr_time, NULL); 220 // Every second, checkCPU usage 221 if(((long long) (curr_time.tv_sec - last_cpu_check.tv_sec) * 1000000 + (curr_time.tv_usec - last_cpu_check.tv_usec)) > 1000000.0) 222 { 223 last_cpu_usage = get_system_CPU_load(); 224 last_cpu_check = curr_time; 225 226 if(last_cpu_usage > 50) 227 { 228 if(exec_dev == CPU_REFERENCE) 229 { 230 printf ("Switching to OpenCL exeuction(i=%d, cpu=%d).\n" , i, last_cpu_usage); 231 exec_dev = OPENCL; 232 copy_echo_state(st, st_copy); 233 speex_echo_opencl_sync_to_GPU(st); 234 } 235 } 236 else 237 { 238 if(exec_dev == OPENCL) 239 { 240 printf ("Switching to referenceCPU exeuction(i=%d, cpu =%d).\n", i, last_cpu_usage); 241 exec_dev = CPU_REFERENCE; 242 printf ("Copying echo state\n"); 243 copy_echo_state(st_copy, st); 244 printf ("Syncing buffers\n"); 245 speex_echo_opencl_sync_to_CPU(st); 246 } 247 } 248 249 printf ("CPU use:%d\n", last_cpu_usage); 250 } 251 # endif 252 } 253 printf ("Ended loop iteration echo.\n"); 254 } 255 256 // Write out residual errors to file 257 errors_file_fd = fopen("errors.txt","wb"); 258 if(errors_file_fd != NULL) 259 { 260 for(i=0;i


280 } 281 for(i=0;i 0) 340 { 341 //printf("Calc: (1000*(%d-%d)/%d+ 5)/ 10\n", diff_total, diff_idle, diff_total); 342 diff_usage = (1000*(diff_total-diff_idle)/diff_total + 5) / 10;


343 } 344 345 prev_total = total; 346 prev_idle = idle; 347 348 fclose(fp); 349 350 return diff_usage; 351 } 352 else 353 return -1; 354 355 } 356 357 /** 358 * Utility-function to interleave multichannel input from files 359 * _in buffers contain concecutive frames from each of the channels: 360 * 1. frame chan 1, 1. frame chan 2, 2. frame chan 1, 2. frame chan2 etc ... 361 * _out buffers contains the _in buffers interleaved: 362 * 1. sample chan 1, 1. sample chan 2, 2.sample chan 1, 2.sample chan2 etc ... 363 */ 364 void interleave_input(short *echo_in, short *ref_in, short *echo_out, short* ref_out ) 365 { 366 int i, speak, chan; 367 for(speak=0;speak

B.2 OpenCL Implementation (Device Code/Kernel Functions)

The kernel functions that are actually run on the device are listed below. All functions are annotated with the local and global work sizes used to execute them.
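Since the host-side OpenCL code is omitted from this appendix, the following minimal sketch (not taken from the actual host code; queue, kernel and buffer handles are assumed to be created elsewhere) illustrates how such an annotation translates into a kernel launch with the standard OpenCL host API, using clMDF_update_foreground_filter ("Global size: frame_size, C", "Local size: Determined by impl.") as the example:

#include <CL/cl.h>

/* Minimal host-side sketch, not the original host code. */
static cl_int launch_update_foreground_filter(cl_command_queue queue, cl_kernel kernel,
                                              cl_mem e, cl_mem window, cl_mem y,
                                              cl_uint window_size, cl_uint frame_size,
                                              cl_uint num_channels /* C */)
{
    /* Two-dimensional global range: one work item per sample and per channel. */
    size_t global[2] = { frame_size, num_channels };

    clSetKernelArg(kernel, 0, sizeof(cl_mem),  &e);
    clSetKernelArg(kernel, 1, sizeof(cl_mem),  &window);
    clSetKernelArg(kernel, 2, sizeof(cl_mem),  &y);
    clSetKernelArg(kernel, 3, sizeof(cl_uint), &window_size);
    clSetKernelArg(kernel, 4, sizeof(cl_uint), &frame_size);

    /* A NULL local work size lets the OpenCL implementation choose it, matching
     * the "Determined by impl." annotation; kernels annotated with an explicit
     * local size (e.g. 128,1) would pass that array here instead of NULL. */
    return clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);
}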

1 #define WORD2INT(x) ((x) < -32767.5f ? (int)-32768 : ((x) > 32766.5f ? (int) 32767 : floor(0.5f+(x)))) 2 #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable 3 4 /**


5 * Applies pre-emphasis to the input data(after it has been converted to float on the host), shifts the main time domain buffer(the echo tail) to make room for new data(far end signal) and inserts the far end signal after applying pre-emphasis. 6 * Globalsize:N,M,K 7 * Localsize: 64,1,1 8 * If device runs out of local memory, global memory temporary buffers are used in place of tmpMemX and tmpMemD here(remember to adapt to global indices) 9 */ 10 __kernel void clMDF_prepare_shift_preemph_freq_data(__global float *input, __global float *x, __global float *X, __global float *tempX, __global short *far_end, __global float *memD, __global float *memX, float preemph, unsigned int frame_size, unsigned int window_size, unsigned int num_channels, unsigned int num_speakers, __local int* tmpMemX, __local float *tmpMemD) 11 { 12 13 unsigned int i = get_global_id(0);// Goes to window_size(N) 14 unsigned int j = get_global_id(1);// Goes toM 15 unsigned int k = get_global_id(2);// Goes toK 16 unsigned int speak = get_global_id(2);// Goes toK 17 unsigned int lsize = get_local_size(0); 18 unsigned int frame_ratio = frame_size / lsize; 19 unsigned int tid = get_local_id(0); 20 21 unsigned int chan;// Usually just one channel, but loop if it not the case 22 23 // IfI is witin frame size, do prepare and shift_preemph 24 if(i < lsize && j == 0 && k == 0) 25 { 26 27 // clMDF_prepare 28 if(i == 0) 29 { 30 for(chan = 0; chan < num_channels; chan++) 31 { 32 tmpMemD[chan*frame_size + frame_ratio*tid + 0] = memD[chan]; 33 for(int s = 1;s


input[chan*frame_size + frame_ratio*i + s] - (preemph * tmpMemD[lsize*chan*frame_ratio + frame_ratio*tid + s]); 56 } 57 58 } 59 60 } 61 62 barrier(CLK_GLOBAL_MEM_FENCE); 63 64 if(i < frame_size/2 && j == 0) 65 { 66 // clMDF_shift_preemph 67 // Temporary value in place of memX(updated on each iteration in sequential code) 68 // Notice that we only store the values needed for update, not the shift 69 // This is done to save local mem.(only frame_size*K, instead of window_size) 70 if(i == 0) 71 { 72 tmpMemX[speak*frame_size] = memX[speak]; 73 tmpMemX[speak*frame_size + 1] = far_end[speak]; 74 memX[speak] = far_end[(frame_size - 1)*num_speakers + speak ]; 75 } 76 else 77 { 78 tmpMemX[speak*frame_size + 2*i] = far_end[((i*2-1) * num_speakers) + speak]; 79 tmpMemX[speak*frame_size + 2*i + 1] = far_end[((i*2) * num_speakers) + speak]; 80 } 81 } 82 barrier(CLK_LOCAL_MEM_FENCE); 83 if(i < frame_size/2 && j == 0) 84 { 85 // Shift data from the last frame in the time domain data 86 x[speak*window_size + i*2] = x[speak*window_size + 2*i + frame_size]; 87 x[speak*window_size + i*2+1] = x[speak*window_size + 2*i + 1 + frame_size]; 88 89 // Insert new data in the new frame and apply preemphasis 90 x[speak*window_size + i*2 + frame_size] = far_end[(2*i)* num_speakers + speak] - (preemph * tmpMemX[speak*frame_size + i *2]) ; 91 x[speak*window_size + i*2 + 1 + frame_size] = far_end[(2*i + 1)* num_speakers + speak] - (preemph * tmpMemX[speak*frame_size + i*2 + 1]) ; 92 } 93 94 // clMDF_shift_freq_data 95 for(speak = 0; speak < num_speakers; speak++) { 96 X[(j+1)*window_size*num_speakers+speak*window_size+i] = tempX[j* window_size*num_speakers+speak*window_size+i]; 97 } 98 } 99 100 /** 101 * Generic inner product function, len is the size of the buffers 102 * Globalsize: len/2 103 * Localsize: len/2 104 * Based on reduction kernel from theNVIDIACUDASDK(reduce2 kernel) 105 * Optional(naive) support for batch runs, batch_size is the number of elements in the batch, batch_pitch is the distance between elements in the batch( must be the same forx andy).


106 */ 107 __kernel void clMDF_inner_product_reduce(__global float *x, unsigned int x_offset, __global float *y, unsigned int y_offset, __global float *g_odata , unsigned int n, unsigned int batch_size, unsigned int batch_pitch, __local float* sdata) 108 { 109 // load shared mem 110 unsigned int tid = get_local_id(0); 111 unsigned int i = get_global_id(0); 112 unsigned int xi = i*2 + x_offset; 113 unsigned int yi = i*2 + y_offset; 114 115 if(i < n) { 116 sdata[tid] = 0.0f; 117 118 for(int k=0;k0; s>>=1) 131 { 132 if (tid < s) 133 { 134 sdata[tid] += sdata[tid + s]; 135 } 136 barrier(CLK_LOCAL_MEM_FENCE); 137 } 138 139 // write result for this block to global mem 140 if (tid == 0) 141 g_odata[get_group_id(0)] = sdata[0]; 142 } 143 144 /** 145 * This kernel combines an inner product with two computations of the power spectrum for the echo and filter response. 146 * Globalsize: len/2 147 * Localsize: 64 148 * Combined kernel with 3*power_spectrum_accum and 1*inner_product_reduce 149 */ 150 __kernel void clMDF_power_spectrum_inner_product( 151 __global float *x, __global float *E, __global float *Y, __global float *X, 152 __global float *Rf, __global float *Yf, __global float *Xf, 153 // Offsets to power pectrum accum 154 unsigned int E_offset, unsigned int Y_offset, unsigned int X_offset, 155 // Offsets to inner product 156 unsigned int x1_offset, unsigned int x2_offset, 157 unsigned intC, unsigned intK, unsigned int window_size, 158 unsigned int N_power_spectrum, 159 unsigned int n_inner_product, 160 __global float *temp_odata, 161 __local float* ldata) 162 { 163 164 unsigned int j = get_global_id(0); 165 unsigned inti=j*2-1; 166


167 unsigned int tid = get_local_id(0); 168 unsigned int xi = j*2 + x1_offset; 169 unsigned int yi = j*2 + x2_offset; 170 171 // Optimize for regular case(mono mic and speaker) 172 if(C == 0 && K == 0) 173 { 174 if(j == 0) 175 { 176 Rf[j] += E[0] * E[0]; 177 Yf[j] += Y[0] * Y[0]; 178 Xf[j] += X[0] * X[0]; 179 } 180 else if(i < (N_power_spectrum-1)) 181 { 182 Rf[j] += E[i] * E[i] + E[i+1] * E[i+1]; 183 Yf[j] += Y[i] * Y[i] + Y[i+1] * Y[i+1]; 184 Xf[j] += X[i] * X[i] + X[i+1] * X[i+1]; 185 } 186 if(j == (get_global_size(0) - 1)) 187 { 188 i += 2; 189 Rf[j+1] += E[i] * E[i]; 190 Yf[j+1] += Y[i] * Y[i]; 191 Xf[j+1] += X[i] * X[i]; 192 } 193 } 194 // In case of multi-channel 195 else 196 { 197 if(j == 0) 198 { 199 unsigned int chan_offset, speak_offset; 200 for(unsigned int chan = 0;chan


226 } 227 if(j == (get_global_size(0) - 1)) 228 { 229 i += 2; 230 unsigned int chan_offset, speak_offset; 231 for(unsigned int chan = 0;chan0; s>>=1) 260 { 261 if (tid < s) 262 { 263 ldata[tid] += ldata[tid + s]; 264 //ldata[tid]= 1.0; 265 } 266 barrier(CLK_LOCAL_MEM_FENCE); 267 } 268 269 // write result for this block to global mem 270 if (tid == 0) 271 temp_odata[get_group_id(0)] = ldata[0]; 272 } 273 // Several speakers 274 else 275 { 276 unsigned int x_speak_offset; 277 for(unsigned int speak = 0;speak


289 290 // do reduction in shared mem 291 for(unsigned int s=get_local_size(0)/2; s>0; s>>=1) 292 { 293 if (tid < s) 294 { 295 ldata[tid] += ldata[tid + s]; 296 //ldata[tid]= 1.0; 297 } 298 barrier(CLK_LOCAL_MEM_FENCE); 299 } 300 301 barrier(CLK_LOCAL_MEM_FENCE); 302 303 // write result for this block to global mem 304 if (tid == 0) 305 temp_odata[get_group_id(0) + get_num_groups(0)* speak] = ldata[0]; 306 307 barrier(CLK_LOCAL_MEM_FENCE); 308 } 309 } 310 311 } 312 313 /** 314 * This is special case kernel that calculates cross-power spectras fora batch of frames added together, 315 * in addition to setting three unrelated buffers to zero to avoid overhead. 316 * Globalsize:N/2,C 317 * Localsize: 32,1 318 * lMem is assumed to be of sizeN 319 */ 320 __kernel void clMDF_spectral_mul_accum(__global float *X, __global float*Y, __global float *acc, __global float *Rf, __global float *Yf, __global float *Xf , unsigned intN, unsigned int num_speakers, unsigned intM) 321 { 322 int i = get_global_id(0); 323 unsigned int tid = get_local_id(0); 324 unsigned int chan = get_global_id(1); 325 unsigned int c; 326 unsigned int M_org = M / num_speakers; 327 328 // Temporary accum registers 329 float tmp1 = 0.0f; 330 float tmp2 = 0.0f; 331 float tmp3 = 0.0f; 332 333 if(i == 0) 334 { 335 for(c=0;c


*M_org + c*N + 2*i]); 347 } 348 } 349 else if(i == (get_global_size(0) - 1)) 350 { 351 for(c=0;c 0 && i < (get_global_size(0) - 1)) 361 { 362 acc[chan*N + 2*i - 1] = tmp1; 363 acc[chan*N + 2*i] = tmp2; 364 } 365 else if(i == 0) 366 { 367 acc[chan*N] = tmp1; 368 } 369 else if(i == (get_global_size(0) - 1)) 370 { 371 acc[chan*N + 2*i - 1] = tmp1; 372 acc[chan*N + 2*i] = tmp2; 373 acc[chan*N + 2*i + 1] = tmp3; 374 } 375 376 if(chan == 0) 377 { 378 Rf[i] = 0.0f; 379 Yf[i] = 0.0f; 380 Xf[i] = 0.0f; 381 if(i == get_global_size(0) - 1) 382 { 383 Rf[N/2] = 0.0f; 384 Yf[N/2] = 0.0f; 385 Xf[N/2] = 0.0f; 386 } 387 } 388 } 389 390 /** 391 * Compute foreground filter 392 * Globalsize: frame_size,C 393 * Localsize: 32,1 394 */ 395 __kernel void clMDF_compute_foreground_filter(__global float *e, __global float *input , unsigned int window_size, unsigned int frame_size, unsigned int num_channels) 396 { 397 if(get_global_id(1) < num_channels) 398 { 399 unsigned int i = get_global_id(0); 400 unsigned int chan = get_global_id(1); 401 402 e[chan*window_size + i] = input[chan*frame_size + i] - e[chan* window_size + i + frame_size]; 403 } 404 } 405


406 /** 407 * Performs one of the calculations when resolving the new adaption rate of the filter. 408 * Calculates the square value of an array of elements and adds them all together thea single value. 409 * Globalsize:N,1,M 410 * Localsize: 128,1,1 411 */ 412 __kernel void clMDF_adjust_prop_phase1_reduce(__global float *W, __global float * g_odata , unsigned intM, unsigned int window_size, unsigned int n, unsigned int P, __local float* sdata) 413 { 414 unsigned int j = get_global_id(0); 415 unsigned int i = get_global_id(2); 416 unsigned int tid = get_local_id(0); 417 418 sdata[tid] = 0.0f; 419 // Loop through channels and add to local memory 420 for(int p=0;p0; s>>=1) 428 { 429 if (tid < s) 430 { 431 sdata[tid] += sdata[tid + s]; 432 } 433 barrier(CLK_LOCAL_MEM_FENCE); 434 } 435 436 // Write result for this work-group to global mem 437 if (tid == 0) 438 g_odata[get_global_id(2)*get_num_groups(0) + get_group_id(0)] = sdata [0]; 439 440 } 441 442 /** 443 * Compute weighted cross-power spectrum ofa half-complex(packed) vector with conjugate 444 * Globalsize:N/2,M,C*K 445 * Localsize: 128,1,1 446 */ 447 __kernel void clMDF_weighted_spectral_mul_conj(__global float *w, __global float *p, __global float *X, __global float*Y, unsigned int Y_offset, __global float *prod, __global float*W, unsigned int window_size, unsigned int num_channels, unsigned int num_speakers) 448 { 449 450 unsigned int i = get_global_id(0) * 2 - 1; 451 unsigned int j = get_global_id(0); 452 unsigned int u = get_global_id(1); 453 unsigned int tid = get_global_id(0); 454 455 int speak = get_global_id(2) / num_channels; 456 int chan = get_global_id(2) - speak*num_channels; 457 458 unsigned int X_offset = (u+1)*window_size*num_speakers+speak*window_size ; 459 Y_offset = chan*window_size; 460 unsigned int W_offset = chan * window_size * num_speakers * get_global_size(1) + u * window_size * num_speakers + speak * window_size;


461 unsigned int p_offset = u; 462 463 float temp, temp2; 464 465 if(tid == 0) 466 { 467 temp = p[p_offset] * w[0]; 468 temp *= X[X_offset + 0] * Y[Y_offset + 0]; 469 W[W_offset] += temp; 470 if(get_global_id(1) == 0) 471 prod[0] = temp; 472 } 473 else if(tid <= get_global_size(0) - 1) 474 { 475 temp = temp2 = p[p_offset] * w[j]; 476 temp *= ((X[X_offset + i] * Y[Y_offset + i]) + X[X_offset + i + 1] * Y[Y_offset + i+1]); 477 temp2 *= ((-X[X_offset + i + 1] * Y[Y_offset + i]) + X[X_offset + i] * Y[Y_offset + i + 1]); 478 W[W_offset + i] += temp; 479 W[W_offset + i + 1] += temp2; 480 481 if(get_global_id(1) == 0) 482 { 483 prod[i] = temp; 484 prod[i+1] = temp2; 485 } 486 } 487 if(tid == get_global_size(0) - 1) 488 { 489 i += 2; 490 j += 1; 491 temp = p[p_offset] * w[j]; 492 temp *= (X[X_offset + i] * Y[Y_offset + i]); 493 W[W_offset + i] += temp; 494 if(get_global_id(1) == 0) 495 prod[i] = temp; 496 } 497 498 } 499 500 /** 501 * Simply setsa value in an array for each work item, offset by the given value . 502 * Globalsize: Array size 503 * Localsize: Determined by impl. 504 */ 505 __kernel void clMDF_set_array_to_float_value(__global float *input_array, unsigned int input_offset, float value) 506 { 507 508 unsigned int i = get_global_id(0); 509 510 input_array[input_offset + i] = value; 511 512 } 513 514 /** 515 * Difference in response calculations, combined with inner product 516 * Globalsize: frame_size,2 517 * Localsize: 128,1 518 * Offsets are special cases 519 * temp_e isa temporary buffer assumed to be at least of the same size ase 520 */ 521 __kernel void clMDF_combined_response_diff_inner_product(__global float *e, __global float *temp_e, __global float *y, __global float *input, 522 unsigned intC, 523 unsigned int window_size,


524 unsigned int frame_size, 525 __global float *temp_odata, 526 __global float *temp_e2, 527 __local float *ldata) 528 { 529 530 unsigned int i = get_global_id(0); 531 unsigned int task = get_global_id(1); 532 533 unsigned int tid = get_local_id(0); 534 unsigned int xi; 535 unsigned int yi; 536 unsigned int n_inner_product = frame_size; 537 538 barrier(CLK_GLOBAL_MEM_FENCE); 539 if(task == 0) 540 { 541 for(unsigned int chan=0;chanwindow_size, ocl_e, chan*st->window_size, st->frame_size); 549 // Runs on all threads, to make sure that they read/write to the same position 550 // in global memory as for the response diff. 551 xi = i; 552 yi = i; 553 554 if(i < n_inner_product) { 555 ldata[tid] = 0.0f; 556 for(unsigned int chan = 0;chan0; s>>=1) 566 { 567 if (tid < s) 568 { 569 ldata[tid] += ldata[tid + s]; 570 } 571 barrier(CLK_LOCAL_MEM_FENCE); 572 } 573 574 barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE); 575 576 // write result for this block to global mem 577 if (tid == 0) 578 { 579 temp_odata[get_group_id(0)] = ldata[0]; 580 } 581 } 582 else 583 { 584 for(unsigned int chan=0;chan


586 temp_e2[chan*window_size + i] = input[chan*frame_size + i] - y[chan*window_size+frame_size + i]; 587 } 588 // See 589 // mdf_opencl_inner_prod(ocl_e, chan*st->window_size, ocl_e, chan*st->window_size, st->frame_size); 590 // Runs on all threads, to make sure that they read/write to the same position 591 // in global memory as for the response diff. 592 xi = i; 593 yi = i; 594 595 if(i < n_inner_product) { 596 ldata[tid] = 0.0f; 597 for(unsigned int chan = 0;chan0; s>>=1) 607 { 608 if (tid < s) 609 { 610 ldata[tid] += ldata[tid + s]; 611 } 612 barrier(CLK_LOCAL_MEM_FENCE); 613 } 614 615 barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE); 616 617 // write result for this block to global mem 618 if (tid == 0) 619 { 620 temp_odata[get_group_id(0) + get_num_groups(0)] = ldata [0]; 621 } 622 } 623 624 } 625 626 /** 627 * Update foreground filter 628 * Globalsize: frame_size,C 629 * Localsize: Determined by impl. 630 */ 631 __kernel void clMDF_update_foreground_filter(__global float *e, __global float* window, __global float *y, unsigned int window_size, unsigned int frame_size) 632 { 633 634 unsigned int i = get_global_id(0); 635 unsigned int j = get_global_id(1); 636 637 e[j*window_size+i+frame_size] = (window[i+frame_size] * e[j*window_size+ i+frame_size]) + (window[i] * y[j*window_size+i+frame_size]); 638 639 } 640 641 /** 642 * Update background filter 643 * Globalsize: frame_size,C 644 * Localsize: Determined by impl. 645 */


646 __kernel void clMDF_update_background_filter(__global float *y, __global float* e, __global float *input, unsigned int window_size, unsigned int frame_size ) 647 { 648 649 unsigned int i = get_global_id(0); 650 unsigned int j = get_global_id(1); 651 652 y[j*window_size+i+frame_size] = e[j*window_size+i+frame_size]; 653 e[j*window_size+i] = input[j*frame_size+i] - y[j*window_size+i+ frame_size]; 654 655 } 656 657 /** 658 * This isa combined kernel of three inner products in addition to calculating the error signal, it also checks for saturation in the microphone signal. 659 * Globalsize: frame_size,C,4 660 * Localsize: 128,1,1 661 */ 662 __kernel void clMDF_error_signal_combined(__global float *input, __global float *e, __global short *in, __global float *temp_results, __global float* temp_reductions, unsigned int window_size, unsigned int frame_size, unsigned int num_channels, __global int *saturated, __global float *y, __local float *sdata) 663 { 664 unsigned int i = get_global_id(0); 665 unsigned int j = get_global_id(1); 666 unsigned int chan = get_global_id(1); 667 unsigned int tid = get_local_id(0); 668 unsigned int i_counter, chan_counter;// Loop variables 669 unsigned int input_set = get_global_id(2); 670 671 barrier(CLK_LOCAL_MEM_FENCE); 672 673 // This is an arbitrary test for saturation in the microphone signal 674 if (in[i*num_channels+chan] <= -32000 || in[i*num_channels+chan] >= 32000) 675 { 676 if (saturated[0] == 0) 677 saturated[0] = 1; 678 } 679 680 // Sey 681 // clMDF_inner_product_reduce(ocl_e, chan*st->window_size+st->frame_size , ocl_y, chan*st->window_size+st->frame_size, g_odata, st-> frame_size, sdata) 682 if(input_set == 1) 683 { 684 temp_results[i*num_channels + chan] = input[chan*frame_size + i] - e[chan*window_size + frame_size + i]; 685 686 e[chan*window_size + frame_size + i] = e[chan*window_size + i]; 687 e[chan*window_size + i] = 0; 688 689 barrier(CLK_GLOBAL_MEM_FENCE); 690 691 if(i < frame_size) { 692 sdata[tid] = 0.0f; 693 sdata[tid] += e[chan*window_size + frame_size + i] * y[ chan*window_size + frame_size + i]; 694 } 695 else 696 sdata[tid] = 0.0f; 697 698 barrier(CLK_LOCAL_MEM_FENCE); 699 700 // do reduction in shared mem


701 for(unsigned int s=get_local_size(0)/2; s>0; s>>=1) 702 { 703 if (tid < s) 704 { 705 sdata[tid] += sdata[tid + s]; 706 } 707 barrier(CLK_LOCAL_MEM_FENCE); 708 } 709 710 // write result for this block to global mem 711 if (tid == 0) 712 { 713 temp_reductions[get_group_id(0)*num_channels + chan] = sdata [0]; 714 } 715 716 barrier(CLK_GLOBAL_MEM_FENCE); 717 } 718 719 // Syy 720 // clMDF_inner_product_reduce(ocl_y, chan*st->window_size+st->frame_size , ocl_y, chan*st->window_size+st->frame_size, g_odata, st-> frame_size, sdata) 721 else if(input_set == 2) 722 { 723 if(i < frame_size) { 724 sdata[tid] = 0.0f; 725 sdata[tid] += y[i + chan*window_size + frame_size] * y[i + chan*window_size + frame_size]; 726 } 727 else 728 sdata[tid] = 0; 729 730 barrier(CLK_LOCAL_MEM_FENCE); 731 732 // do reduction in shared mem 733 for(unsigned int s=get_local_size(0)/2; s>0; s>>=1) 734 { 735 if (tid < s) 736 { 737 sdata[tid] += sdata[tid + s]; 738 } 739 barrier(CLK_LOCAL_MEM_FENCE); 740 } 741 742 // write result for this block to global mem 743 if (tid == 0) 744 { 745 temp_reductions[get_num_groups(0)*num_channels + get_group_id(0)*num_channels + chan] = sdata[0]; 746 } 747 } 748 749 // Sdd 750 // clMDF_inner_product_reduce(ocl_input, chan*st->frame_size, ocl_input, chan*st->frame_size, g_odata, st->frame_size, sdata) 751 else if(input_set == 3) 752 { 753 if(i < frame_size) { 754 sdata[tid] = 0.0f; 755 sdata[tid] += input[i + chan*frame_size] * input[i + chan*frame_size]; 756 } 757 else 758 sdata[tid] = 0; 759 760 barrier(CLK_LOCAL_MEM_FENCE); 761


762 // do reduction in shared mem 763 for(unsigned int s=get_local_size(0)/2; s>0; s>>=1) 764 { 765 if (tid < s) 766 { 767 sdata[tid] += sdata[tid + s]; 768 } 769 barrier(CLK_LOCAL_MEM_FENCE); 770 } 771 772 // write result for this block to global mem 773 if (tid == 0) 774 temp_reductions[get_num_groups(0)*num_channels*2 + get_group_id(0)*num_channels + chan] = sdata[0]; 775 } 776 } 777 778 /** 779 * This kernel combines three tasks: Smoothing the far-end energy estimate over time, computing the filtered spectra and some cross-correlations. 780 * Computing the filtered spectra and the two cross-correlation is done in the same operation as there are data-dependencies between the two. 781 * Globalsize: frame_size 782 * Localsize: 128 783 */ 784 __kernel void clMDF_smooth_far_end_compute_filtered_spectra(__global float* power, __global float *Xf, __global float *Rf, __global float *Eh, __global float *Yf, __global float *Yh, float ss_1, float ss, float spec_average, unsigned int frame_size, __global float *temp_space, __global float* temp_eh_yh, __local float* sdata) 785 { 786 unsigned int tid = get_global_id(0); 787 unsigned int ltid = get_local_id(0); 788 int j; 789 float Eh_local, Yh_local; 790 791 if(tid <= frame_size) 792 { 793 power[tid] = (ss_1 * power[tid]) + 1 + (ss * Xf[tid]); 794 795 j = tid ; 796 } 797 798 // Calculate values to shared memory and update global memory values 799 if(tid <= frame_size) 800 { 801 sdata[ltid] = (Rf[tid] - Eh[tid]) * (Yf[tid] - Yh[tid]); 802 sdata[get_local_size(0) + ltid] = (Yf[tid] - Yh[tid]) * (Yf[tid] - Yh[ tid ]); 803 Eh[tid] = (1-spec_average)*Eh[tid] + spec_average*Rf[tid]; 804 Yh[tid] = (1-spec_average)*Yh[tid] + spec_average*Yf[tid]; 805 } 806 else 807 { 808 sdata[ltid] = 0.0f; 809 sdata[get_local_size(0) + ltid] = 0.0f; 810 } 811 812 barrier(CLK_LOCAL_MEM_FENCE); 813 814 // Do reduction in shared memory 815 for(unsigned int s=get_local_size(0)/2; s>0; s>>=1) 816 { 817 if (ltid < s) 818 { 819 sdata[ltid] += sdata[ltid + s]; 820 sdata[get_local_size(0) + ltid] += sdata[get_local_size (0) + ltid + s];


821 } 822 barrier(CLK_LOCAL_MEM_FENCE); 823 } 824 825 // write result for this block to global mem 826 if (ltid == 0) 827 { 828 temp_space[2*get_group_id(0)] = sdata[0];// Pey 829 temp_space[2*get_group_id(0) + 1] = sdata[get_local_size(0)]; // Pyy 830 } 831 } 832 833 /** 834 * Some calculations done when computing the new learning rate for the filter. 835 * Primarily calculations that depend on device data, the rest is done on the host 836 * Globalsize: frame_size+ 32 837 * Localsize: Determined by impl. 838 */ 839 __kernel void clMDF_learning_rate_calc(__global float *Yf, __global float *Rf, __global float *power_1, __global float *power, floatRER, unsigned int frame_size, float leak_estimate, float adapt_rate, int adapted) 840 { 841 unsigned int i = get_global_id(0); 842 843 if(i < frame_size + 1) 844 { 845 if(adapted == 1) 846 { 847 float r, e; 848 // Compute frequency-domain adaptation mask 849 r = leak_estimate * Yf[i]; 850 e = Rf[i] + 1; 851 if (r > 0.5f * e) 852 r = 0.5 f * e; 853 r = (0.7f * r) + (0.3f * (float)(RER * e)); 854 power_1[i] = r / (e * power[i] + 10.0f); 855 } 856 else 857 { 858 power_1[i] = adapt_rate / (power[i] + 10.0f); 859 } 860 } 861 } 862 863 /** 864 *A simple kernel storing the difference between the raw input and output ina buffer for use in the next iteration, but only if the filter has adapted. 865 * Globalsize: frame_size 866 * Localsize: Determined by impl. 867 */ 868 __kernel void clMDF_adapted_copy(__global float *last_y, __global short *in, __global int *out, unsigned int frame_size) 869 { 870 unsigned int i = get_global_id(0); 871 last_y[frame_size + i] = in[i] - out[i]; 872 } 873 874 /** 875 * Combines an inner product calculation and the first part ofa"spectral multiply accumulate" operation. More details can be found for dedicated kernels on both. 876 * Globalsize:N/2,C,2 877 * Localsize: 128,1,1 878 * Based on reduction kernel from theNVIDIACUDASDK(reduce2 kernel) 879 */ 880 __kernel void clMDF_inner_prod_power_spec_spectral_mul(__global float *Xf,


__global float *data_time_domain, __global float *data_freq_domain, __global float *foreground, __global float *temp_reduction, __global float *acc , unsigned intN, unsigned intK, unsignedM, int frame_size, __local float* ldata) 881 { 882 // mdf_opencl_power_spectrum_accum(data_freq_domain, 0, ocl_Xf, 0, st-> window_size) 883 if(get_global_id(2) == 0 && get_global_id(1) == 0) 884 { 885 unsigned int j = get_global_id(0); 886 unsigned inti=j*2-1; 887 888 if(j == 0) 889 { 890 for(int speak=0;speakwindow_size, st->M*st->K) 912 // Only first part, needs reduce kernel as well 913 //X=data_freq_domain 914 //Y=foreground 915 else if(get_global_id(2) == 1) 916 { 917 int i = get_global_id(0); 918 unsigned int tid = get_local_id(0); 919 unsigned int chan = get_global_id(1); 920 unsigned int c; 921 unsigned int M_org = M / K; 922 923 // Temporary accum registers 924 float tmp1 = 0.0f; 925 float tmp2 = 0.0f; 926 float tmp3 = 0.0f; 927 928 if(i == 0) 929 { 930 for(c=0;c


934 } 935 } 936 else if(i < (get_global_size(0) - 1)) 937 { 938 for(c=0;c 0 && i < (get_global_size(0) - 1)) 956 { 957 acc[chan*N + 2*i - 1] = tmp1; 958 acc[chan*N + 2*i] = tmp2; 959 } 960 else if(i == 0) 961 { 962 acc[chan*N] = tmp1; 963 } 964 else if(i == (get_global_size(0) - 1)) 965 { 966 acc[chan*N + 2*i - 1] = tmp1; 967 acc[chan*N + 2*i] = tmp2; 968 acc[chan*N + 2*i + 1] = tmp3; 969 } 970 } 971 } Listing B.2: MDF OpenCL kernel functions

B.3 OpenCL Implementation (FFT Device Code/Kernel Functions)

The kernel functions containing FFT functions that are actually run on the device are listed below. The core FFT kernel is only listed in full in the first kernel


(clFFT custom), but is marked by a comment in the preceeding kernels (“// FFT KERNEL FUNCTION NOT LISTED, see clFFT custom”). Only the hardcoded kernels for frames with 512 samples is included, but the kernels for 128 samples are very similar. A major difference between the two, is that with frames of 512 samples, the local work size is 128, with frames of 128 samples it is 64 (because of the radices used). The global size is the local size multiplied by the batch size. 1 // This file containsFFT-related kernels: 2 //* Conversion between Speex format and Apple OpenCLFFT format 3 //* HardcodedFFT kernels based on Apple OpenCLFFT(to avoid library overhead) 4 //* Various kernels where theFFT is combined with other operations 5 #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable 6 #ifndefM_PI 7 #define M_PI 0x1.921fb54442d18p+1 8 #endif 9 float2 complexMul(float2 a,float2 B) { return (float2)(mad(-(a).y, (B).y, (a).x * (B).x), mad((a).y, (B).x, (a).x * (B).y));} 10 #define conj(a) ((float2)((a).x, -(a).y)) 11 #define conjTransp(a) ((float2)(-(a).y, (a).x)) 12 13 #define fftKernel2(a,dir) \ 14 {\ 15 float2 c = (a)[0]; \ 16 (a)[0] = c + (a)[1]; \ 17 (a)[1] = c - (a)[1]; \ 18 } 19 20 #define fftKernel2S(d1,d2,dir) \ 21 {\ 22 float2 c = (d1); \ 23 (d1) = c + (d2); \ 24 (d2) = c - (d2); \ 25 } 26 27 #define fftKernel4(a,dir) \ 28 {\ 29 fftKernel2S((a)[0], (a)[2], dir); \ 30 fftKernel2S((a)[1], (a)[3], dir); \ 31 fftKernel2S((a)[0], (a)[1], dir); \ 32 (a)[3] = (float2)(dir)*(conjTransp((a)[3])); \ 33 fftKernel2S((a)[2], (a)[3], dir); \ 34 float2 c = (a)[1]; \ 35 (a)[1] = (a)[2]; \ 36 (a) [2] = c; \ 37 } 38 39 #define fftKernel4s(a0,a1,a2,a3,dir) \ 40 {\ 41 fftKernel2S((a0), (a2), dir); \ 42 fftKernel2S((a1), (a3), dir); \ 43 fftKernel2S((a0), (a1), dir); \ 44 (a3) = (float2)(dir)*(conjTransp((a3))); \ 45 fftKernel2S((a2), (a3), dir); \ 46 float2 c = (a1); \ 47 (a1) = (a2); \ 48 (a2) = c; \ 49 } 50 51 #define bitreverse8(a) \ 52 {\ 53 float2 c; \ 54 c = (a) [1]; \ 55 (a)[1] = (a)[4]; \ 56 (a) [4] = c; \ 57 c = (a) [3]; \ 58 (a)[3] = (a)[6]; \


59 (a) [6] = c; \ 60 } 61 62 #define fftKernel8(a,dir) \ 63 {\ 64 const float2 w1 = (float2)(0x1.6a09e6p-1f, dir*0x1.6a09e6p-1f); \ 65 const float2 w3 = (float2)(-0x1.6a09e6p-1f, dir*0x1.6a09e6p-1f); \ 66 float2 c; \ 67 fftKernel2S((a)[0], (a)[4], dir); \ 68 fftKernel2S((a)[1], (a)[5], dir); \ 69 fftKernel2S((a)[2], (a)[6], dir); \ 70 fftKernel2S((a)[3], (a)[7], dir); \ 71 (a)[5] = complexMul(w1, (a)[5]); \ 72 (a)[6] = (float2)(dir)*(conjTransp((a)[6])); \ 73 (a)[7] = complexMul(w3, (a)[7]); \ 74 fftKernel2S((a)[0], (a)[2], dir); \ 75 fftKernel2S((a)[1], (a)[3], dir); \ 76 fftKernel2S((a)[4], (a)[6], dir); \ 77 fftKernel2S((a)[5], (a)[7], dir); \ 78 (a)[3] = (float2)(dir)*(conjTransp((a)[3])); \ 79 (a)[7] = (float2)(dir)*(conjTransp((a)[7])); \ 80 fftKernel2S((a)[0], (a)[1], dir); \ 81 fftKernel2S((a)[2], (a)[3], dir); \ 82 fftKernel2S((a)[4], (a)[5], dir); \ 83 fftKernel2S((a)[6], (a)[7], dir); \ 84 bitreverse8((a)); \ 85 } 86 87 #define bitreverse4x4(a) \ 88 {\ 89 float2 c; \ 90 c = (a)[1]; (a)[1] = (a)[4]; (a)[4] = c; \ 91 c = (a)[2]; (a)[2] = (a)[8]; (a)[8] = c; \ 92 c = (a)[3]; (a)[3] = (a)[12]; (a)[12] = c; \ 93 c = (a)[6]; (a)[6] = (a)[9]; (a)[9] = c; \ 94 c = (a)[7]; (a)[7] = (a)[13]; (a)[13] = c; \ 95 c = (a)[11]; (a)[11] = (a)[14]; (a)[14] = c; \ 96 } 97 98 #define fftKernel16(a,dir) \ 99 {\ 100 const float w0 = 0x1.d906bcp-1f; \ 101 const float w1 = 0x1.87de2ap-2f; \ 102 const float w2 = 0x1.6a09e6p-1f; \ 103 fftKernel4s((a)[0], (a)[4], (a)[8], (a)[12], dir); \ 104 fftKernel4s((a)[1], (a)[5], (a)[9], (a)[13], dir); \ 105 fftKernel4s((a)[2], (a)[6], (a)[10], (a)[14], dir); \ 106 fftKernel4s((a)[3], (a)[7], (a)[11], (a)[15], dir); \ 107 (a)[5] = complexMul((a)[5], (float2)(w0, dir*w1)); \ 108 (a)[6] = complexMul((a)[6], (float2)(w2, dir*w2)); \ 109 (a)[7] = complexMul((a)[7], (float2)(w1, dir*w0)); \ 110 (a)[9] = complexMul((a)[9], (float2)(w2, dir*w2)); \ 111 (a)[10] = (float2)(dir)*(conjTransp((a)[10])); \ 112 (a)[11] = complexMul((a)[11], (float2)(-w2, dir*w2)); \ 113 (a)[13] = complexMul((a)[13], (float2)(w1, dir*w0)); \ 114 (a)[14] = complexMul((a)[14], (float2)(-w2, dir*w2)); \ 115 (a)[15] = complexMul((a)[15], (float2)(-w0, dir*-w1)); \ 116 fftKernel4((a), dir); \ 117 fftKernel4((a) + 4, dir); \ 118 fftKernel4((a) + 8, dir); \ 119 fftKernel4((a) + 12, dir); \ 120 bitreverse4x4((a)); \ 121 } 122 123 #define bitreverse32(a) \ 124 {\ 125 float2 c1, c2; \ 126 c1 = (a)[2]; (a)[2] = (a)[1]; c2 = (a)[4]; (a)[4] = c1; c1 = (a)[8];


(a)[8] = c2; c2 = (a)[16]; (a)[16] = c1; (a)[1] = c2; \ 127 c1 = (a)[6]; (a)[6] = (a)[3]; c2 = (a)[12]; (a)[12] = c1; c1 = (a) [24]; (a)[24] = c2; c2 = (a)[17]; (a)[17] = c1; (a)[3] = c2; \ 128 c1 = (a)[10]; (a)[10] = (a)[5]; c2 = (a)[20]; (a)[20] = c1; c1 = (a)[9]; (a)[9] = c2; c2 = (a)[18]; (a)[18] = c1; (a)[5] = c2; \ 129 c1 = (a)[14]; (a)[14] = (a)[7]; c2 = (a)[28]; (a)[28] = c1; c1 = (a) [25]; (a)[25] = c2; c2 = (a)[19]; (a)[19] = c1; (a)[7] = c2; \ 130 c1 = (a)[22]; (a)[22] = (a)[11]; c2 = (a)[13]; (a)[13] = c1; c1 = (a) [26]; (a)[26] = c2; c2 = (a)[21]; (a)[21] = c1; (a)[11] = c2; \ 131 c1 = (a)[30]; (a)[30] = (a)[15]; c2 = (a)[29]; (a)[29] = c1; c1 = (a) [27]; (a)[27] = c2; c2 = (a)[23]; (a)[23] = c1; (a)[15] = c2; \ 132 } 133 134 #define fftKernel32(a,dir) \ 135 {\ 136 fftKernel2S((a)[0], (a)[16], dir); \ 137 fftKernel2S((a)[1], (a)[17], dir); \ 138 fftKernel2S((a)[2], (a)[18], dir); \ 139 fftKernel2S((a)[3], (a)[19], dir); \ 140 fftKernel2S((a)[4], (a)[20], dir); \ 141 fftKernel2S((a)[5], (a)[21], dir); \ 142 fftKernel2S((a)[6], (a)[22], dir); \ 143 fftKernel2S((a)[7], (a)[23], dir); \ 144 fftKernel2S((a)[8], (a)[24], dir); \ 145 fftKernel2S((a)[9], (a)[25], dir); \ 146 fftKernel2S((a)[10], (a)[26], dir); \ 147 fftKernel2S((a)[11], (a)[27], dir); \ 148 fftKernel2S((a)[12], (a)[28], dir); \ 149 fftKernel2S((a)[13], (a)[29], dir); \ 150 fftKernel2S((a)[14], (a)[30], dir); \ 151 fftKernel2S((a)[15], (a)[31], dir); \ 152 (a)[17] = complexMul((a)[17], (float2)(0x1.f6297cp-1f, dir*0x1.8f8b84p-3f)); \ 153 (a)[18] = complexMul((a)[18], (float2)(0x1.d906bcp-1f, dir*0x1.87de2ap-2f)); \ 154 (a)[19] = complexMul((a)[19], (float2)(0x1.a9b662p-1f, dir*0x1.1c73b4p-1f)); \ 155 (a)[20] = complexMul((a)[20], (float2)(0x1.6a09e6p-1f, dir*0x1.6a09e6p-1f)); \ 156 (a)[21] = complexMul((a)[21], (float2)(0x1.1c73b4p-1f, dir*0x1.a9b662p-1f)); \ 157 (a)[22] = complexMul((a)[22], (float2)(0x1.87de2ap-2f, dir*0x1.d906bcp-1f)); \ 158 (a)[23] = complexMul((a)[23], (float2)(0x1.8f8b84p-3f, dir*0x1.f6297cp-1f)); \ 159 (a)[24] = complexMul((a)[24], (float2)(0x0p+0f, dir*0x1p+0f)); \ 160 (a)[25] = complexMul((a)[25], (float2)(-0x1.8f8b84p-3f, dir*0x1.f6297cp-1f)) ;\ 161 (a)[26] = complexMul((a)[26], (float2)(-0x1.87de2ap-2f, dir*0x1.d906bcp-1f)) ;\ 162 (a)[27] = complexMul((a)[27], (float2)(-0x1.1c73b4p-1f, dir*0x1.a9b662p-1f)) ;\ 163 (a)[28] = complexMul((a)[28], (float2)(-0x1.6a09e6p-1f, dir*0x1.6a09e6p-1f)) ;\ 164 (a)[29] = complexMul((a)[29], (float2)(-0x1.a9b662p-1f, dir*0x1.1c73b4p-1f)) ;\ 165 (a)[30] = complexMul((a)[30], (float2)(-0x1.d906bcp-1f, dir*0x1.87de2ap-2f)) ;\ 166 (a)[31] = complexMul((a)[31], (float2)(-0x1.f6297cp-1f, dir*0x1.8f8b84p-3f)) ;\ 167 fftKernel16((a), dir); \ 168 fftKernel16((a) + 16, dir); \ 169 bitreverse32((a)); \ 170 } 171 172 __kernel void\ 173 clFFT_1DTwistInterleaved(__global float2 *in, unsigned int startRow, unsigned int numCols, unsigned intN, unsigned int numRowsToProcess, int dir) \


174 {\ 175 float2 a, w; \ 176 float ang; \ 177 unsigned int j;\ 178 unsigned int i = get_global_id(0); \ 179 unsigned int startIndex = i; \ 180 \ 181 if(i < numCols) \ 182 {\ 183 for(j = 0; j < numRowsToProcess; j++) \ 184 {\ 185 a = in[startIndex]; \ 186 ang = 2.0f * M_PI * dir * i * (startRow + j) / N; \ 187 w = (float2)(native_cos(ang), native_sin(ang)); \ 188 a = complexMul(a, w); \ 189 in[startIndex] = a; \ 190 startIndex += numCols; \ 191 }\ 192 }\ 193 }\ 194 195 /** 196 * GenericFFT/IFFT kernel 197 * Kernel output from the Apple OpenCLFFT lib for1D input of length 1024, with modified radices (8-8-4- 4) forATI compatability 198 * Globalsize: 128*S(S isa general batch size parameter) 199 * Localsize: 128 200 */ 201 __kernel void clFFT_custom(__global float *speex_in, unsigned int speex_in_offset, __global float2 *in, __global float2 *out, __global float *speex_out, unsigned int speex_out_offset, int dir, intS) 202 { 203 __local float sMem[1088]; 204 int i, j, r, indexIn, indexOut, index, tid, bNum, xNum, k, l; 205 int s, ii, jj, offset; 206 float2 w; 207 float ang, angf, ang1; 208 __local float *lMemStore, *lMemLoad; 209 float2 a[8]; 210 int lId = get_local_id(0); 211 int groupId = get_group_id(0); 212 ii = lId ; 213 jj = 0; 214 215 offset = mad24(groupId, 1024, ii); 216 __global float2 *out_orig = out;// Keep this pointer for finalize code 217 int workItemOffset = lId * 8 + groupId * 1024; 218 219 // Do preparation step to convert from SpeexFFT format 220 // There is one thread for each four positions in theFFT length 221 222 // clFFT_Forward 223 if(dir == -1) 224 { 225 for(int t=0;t<8;t++) 226 { 227 in[workItemOffset + t].x = speex_in[workItemOffset + t + speex_in_offset] * (1.0f / 1024.0f); 228 in[workItemOffset + t].y = 0.0f; 229 } 230 231 barrier(CLK_GLOBAL_MEM_FENCE); 232 barrier(CLK_LOCAL_MEM_FENCE); 233 } 234 // clFFT_Inverse 235 else if(dir == 1) 236 {


237 int groupOffset = groupId * 1024; 238 if(lId == 0) 239 { 240 in[groupOffset + 0].x = speex_in[groupOffset + speex_in_offset + 0]; 241 in[groupOffset + 0].y = 0.0f; 242 for(int t=1;t<4;t++) 243 { 244 in[groupOffset + t].x = speex_in[groupOffset + speex_in_offset + (2*t - 1)]; 245 in[groupOffset + t].y = speex_in[groupOffset + speex_in_offset + (2*t - 1) + 1]; 246 } 247 in[groupOffset + 512].x = speex_in[groupOffset + speex_in_offset + 1024 - 1]; 248 in[groupOffset + 512].y = 0.0f; 249 } 250 else 251 { 252 for(int t=0;t<4;t++) 253 { 254 in[groupOffset + 4*lId + t].x = speex_in[ groupOffset + speex_in_offset + lId*8 - 1 + 2*t]; 255 in[groupOffset + 4*lId + t].y = speex_in[ groupOffset + speex_in_offset + lId*8 + 2*t ]; 256 //in[groupOffset+ 8*lId+t].x= 9000.0f; 257 //in[groupOffset+ 8*lId+t].y= 9000.0f; 258 } 259 } 260 261 barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE); 262 263 // Generate negative frequencies for symmetric input 264 // Source: http://www.dspguide.com/ch12/1.htm 265 // For the remainingn/2 threads that needs to mirror 266 for(int t=0;t<4;t++) 267 { 268 in[groupOffset + 512 + lId*4 + t].x = in[groupOffset + 1024 - (512 + lId*4 + t)].x; 269 in[groupOffset + 512 + lId*4 + t].y = -in[groupOffset + 1024 - (512 + lId*4 + t)].y; 270 } 271 } 272 273 barrier(CLK_LOCAL_MEM_FENCE); 274 275 in += offset; 276 out += offset; 277 a[0] = in[0]; 278 a[1] = in[128]; 279 a[2] = in[256]; 280 a[3] = in[384]; 281 a[4] = in[512]; 282 a[5] = in[640]; 283 a[6] = in[768]; 284 a[7] = in[896]; 285 fftKernel8(a+0, dir); 286 angf = (float) ii; 287 ang = dir * ( 2.0f * M_PI * 1.0f / 1024.0f ) * angf; 288 w = (float2)(native_cos(ang), native_sin(ang)); 289 a[1] = complexMul(a[1], w); 290 ang = dir * ( 2.0f * M_PI * 2.0f / 1024.0f ) * angf; 291 w = (float2)(native_cos(ang), native_sin(ang)); 292 a[2] = complexMul(a[2], w); 293 ang = dir * ( 2.0f * M_PI * 3.0f / 1024.0f ) * angf; 294 w = (float2)(native_cos(ang), native_sin(ang));


295 a[3] = complexMul(a[3], w); 296 ang = dir * ( 2.0f * M_PI * 4.0f / 1024.0f ) * angf; 297 w = (float2)(native_cos(ang), native_sin(ang)); 298 a[4] = complexMul(a[4], w); 299 ang = dir * ( 2.0f * M_PI * 5.0f / 1024.0f ) * angf; 300 w = (float2)(native_cos(ang), native_sin(ang)); 301 a[5] = complexMul(a[5], w); 302 ang = dir * ( 2.0f * M_PI * 6.0f / 1024.0f ) * angf; 303 w = (float2)(native_cos(ang), native_sin(ang)); 304 a[6] = complexMul(a[6], w); 305 ang = dir * ( 2.0f * M_PI * 7.0f / 1024.0f ) * angf; 306 w = (float2)(native_cos(ang), native_sin(ang)); 307 a[7] = complexMul(a[7], w); 308 lMemStore = sMem + ii; 309 j = ii & 7; 310 i = ii >> 3; 311 lMemLoad = sMem + mad24(j, 130, i); 312 lMemStore[0] = a[0].x; 313 lMemStore[130] = a[1].x; 314 lMemStore[260] = a[2].x; 315 lMemStore[390] = a[3].x; 316 lMemStore[520] = a[4].x; 317 lMemStore[650] = a[5].x; 318 lMemStore[780] = a[6].x; 319 lMemStore[910] = a[7].x; 320 barrier(CLK_LOCAL_MEM_FENCE); 321 a[0].x = lMemLoad[0]; 322 a[1].x = lMemLoad[16]; 323 a[2].x = lMemLoad[32]; 324 a[3].x = lMemLoad[48]; 325 a[4].x = lMemLoad[64]; 326 a[5].x = lMemLoad[80]; 327 a[6].x = lMemLoad[96]; 328 a[7].x = lMemLoad[112]; 329 barrier(CLK_LOCAL_MEM_FENCE); 330 lMemStore[0] = a[0].y; 331 lMemStore[130] = a[1].y; 332 lMemStore[260] = a[2].y; 333 lMemStore[390] = a[3].y; 334 lMemStore[520] = a[4].y; 335 lMemStore[650] = a[5].y; 336 lMemStore[780] = a[6].y; 337 lMemStore[910] = a[7].y; 338 barrier(CLK_LOCAL_MEM_FENCE); 339 a[0].y = lMemLoad[0]; 340 a[1].y = lMemLoad[16]; 341 a[2].y = lMemLoad[32]; 342 a[3].y = lMemLoad[48]; 343 a[4].y = lMemLoad[64]; 344 a[5].y = lMemLoad[80]; 345 a[6].y = lMemLoad[96]; 346 a[7].y = lMemLoad[112]; 347 barrier(CLK_LOCAL_MEM_FENCE); 348 fftKernel8(a+0, dir); 349 angf = (float) (ii >> 3); 350 ang = dir * ( 2.0f * M_PI * 1.0f / 128.0f ) * angf; 351 w = (float2)(native_cos(ang), native_sin(ang)); 352 a[1] = complexMul(a[1], w); 353 ang = dir * ( 2.0f * M_PI * 2.0f / 128.0f ) * angf; 354 w = (float2)(native_cos(ang), native_sin(ang)); 355 a[2] = complexMul(a[2], w); 356 ang = dir * ( 2.0f * M_PI * 3.0f / 128.0f ) * angf; 357 w = (float2)(native_cos(ang), native_sin(ang)); 358 a[3] = complexMul(a[3], w); 359 ang = dir * ( 2.0f * M_PI * 4.0f / 128.0f ) * angf; 360 w = (float2)(native_cos(ang), native_sin(ang)); 361 a[4] = complexMul(a[4], w); 362 ang = dir * ( 2.0f * M_PI * 5.0f / 128.0f ) * angf;


363 w = (float2)(native_cos(ang), native_sin(ang)); 364 a[5] = complexMul(a[5], w); 365 ang = dir * ( 2.0f * M_PI * 6.0f / 128.0f ) * angf; 366 w = (float2)(native_cos(ang), native_sin(ang)); 367 a[6] = complexMul(a[6], w); 368 ang = dir * ( 2.0f * M_PI * 7.0f / 128.0f ) * angf; 369 w = (float2)(native_cos(ang), native_sin(ang)); 370 a[7] = complexMul(a[7], w); 371 lMemStore = sMem + ii; 372 j = (ii & 63) >> 3; 373 i = mad24(ii >> 6, 8, ii & 7); 374 lMemLoad = sMem + mad24(j, 136, i); 375 lMemStore[0] = a[0].x; 376 lMemStore[136] = a[1].x; 377 lMemStore[272] = a[2].x; 378 lMemStore[408] = a[3].x; 379 lMemStore[544] = a[4].x; 380 lMemStore[680] = a[5].x; 381 lMemStore[816] = a[6].x; 382 lMemStore[952] = a[7].x; 383 barrier(CLK_LOCAL_MEM_FENCE); 384 a[0].x = lMemLoad[0]; 385 a[1].x = lMemLoad[32]; 386 a[2].x = lMemLoad[64]; 387 a[3].x = lMemLoad[96]; 388 a[4].x = lMemLoad[16]; 389 a[5].x = lMemLoad[48]; 390 a[6].x = lMemLoad[80]; 391 a[7].x = lMemLoad[112]; 392 barrier(CLK_LOCAL_MEM_FENCE); 393 lMemStore[0] = a[0].y; 394 lMemStore[136] = a[1].y; 395 lMemStore[272] = a[2].y; 396 lMemStore[408] = a[3].y; 397 lMemStore[544] = a[4].y; 398 lMemStore[680] = a[5].y; 399 lMemStore[816] = a[6].y; 400 lMemStore[952] = a[7].y; 401 barrier(CLK_LOCAL_MEM_FENCE); 402 a[0].y = lMemLoad[0]; 403 a[1].y = lMemLoad[32]; 404 a[2].y = lMemLoad[64]; 405 a[3].y = lMemLoad[96]; 406 a[4].y = lMemLoad[16]; 407 a[5].y = lMemLoad[48]; 408 a[6].y = lMemLoad[80]; 409 a[7].y = lMemLoad[112]; 410 barrier(CLK_LOCAL_MEM_FENCE); 411 fftKernel4(a+0, dir); 412 fftKernel4(a+4, dir); 413 angf = (float) (ii >> 6); 414 ang = dir * ( 2.0f * M_PI * 1.0f / 16.0f ) * angf; 415 w = (float2)(native_cos(ang), native_sin(ang)); 416 a[1] = complexMul(a[1], w); 417 ang = dir * ( 2.0f * M_PI * 2.0f / 16.0f ) * angf; 418 w = (float2)(native_cos(ang), native_sin(ang)); 419 a[2] = complexMul(a[2], w); 420 ang = dir * ( 2.0f * M_PI * 3.0f / 16.0f ) * angf; 421 w = (float2)(native_cos(ang), native_sin(ang)); 422 a[3] = complexMul(a[3], w); 423 angf = (float) ((128 + ii) >>6); 424 ang = dir * ( 2.0f * M_PI * 1.0f / 16.0f ) * angf; 425 w = (float2)(native_cos(ang), native_sin(ang)); 426 a[5] = complexMul(a[5], w); 427 ang = dir * ( 2.0f * M_PI * 2.0f / 16.0f ) * angf; 428 w = (float2)(native_cos(ang), native_sin(ang)); 429 a[6] = complexMul(a[6], w); 430 ang = dir * ( 2.0f * M_PI * 3.0f / 16.0f ) * angf;


431 w = (float2)(native_cos(ang), native_sin(ang)); 432 a[7] = complexMul(a[7], w); 433 lMemStore = sMem + ii; 434 j = ii >> 6; 435 i = ii & 63; 436 lMemLoad = sMem + mad24(j, 256, i); 437 lMemStore[0] = a[0].x; 438 lMemStore[256] = a[1].x; 439 lMemStore[512] = a[2].x; 440 lMemStore[768] = a[3].x; 441 lMemStore[128] = a[4].x; 442 lMemStore[384] = a[5].x; 443 lMemStore[640] = a[6].x; 444 lMemStore[896] = a[7].x; 445 barrier(CLK_LOCAL_MEM_FENCE); 446 a[0].x = lMemLoad[0]; 447 a[1].x = lMemLoad[64]; 448 a[2].x = lMemLoad[128]; 449 a[3].x = lMemLoad[192]; 450 a[4].x = lMemLoad[512]; 451 a[5].x = lMemLoad[576]; 452 a[6].x = lMemLoad[640]; 453 a[7].x = lMemLoad[704]; 454 barrier(CLK_LOCAL_MEM_FENCE); 455 lMemStore[0] = a[0].y; 456 lMemStore[256] = a[1].y; 457 lMemStore[512] = a[2].y; 458 lMemStore[768] = a[3].y; 459 lMemStore[128] = a[4].y; 460 lMemStore[384] = a[5].y; 461 lMemStore[640] = a[6].y; 462 lMemStore[896] = a[7].y; 463 barrier(CLK_LOCAL_MEM_FENCE); 464 a[0].y = lMemLoad[0]; 465 a[1].y = lMemLoad[64]; 466 a[2].y = lMemLoad[128]; 467 a[3].y = lMemLoad[192]; 468 a[4].y = lMemLoad[512]; 469 a[5].y = lMemLoad[576]; 470 a[6].y = lMemLoad[640]; 471 a[7].y = lMemLoad[704]; 472 barrier(CLK_LOCAL_MEM_FENCE); 473 fftKernel4(a+0, dir); 474 fftKernel4(a+4, dir); 475 out[0] = a[0]; 476 out[128] = a[4]; 477 out[256] = a[1]; 478 out[384] = a[5]; 479 out[512] = a[2]; 480 out[640] = a[6]; 481 out[768] = a[3]; 482 out[896] = a[7]; 483 484 barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE); 485 486 // Do finalization step to convert to SpeexFFT format 487 // There are one thread for each four positions in theFFT length 488 489 // clFFT_Forward 490 if(dir == -1) 491 { 492 int workItemOffset2 = lId * 4 + groupId*1024; 493 494 // The first thread for eachFFT length 495 if(lId == 0) 496 { 497 speex_out[speex_out_offset + groupId*1024 + 0] = out_orig[workItemOffset2 + 0].x;


498 speex_out[speex_out_offset + groupId*1024 + 1] = out_orig[workItemOffset2 + 1].x; 499 for(int t=1;t<4;t++) 500 { 501 speex_out[speex_out_offset + groupId*1024 + 2*t] = out_orig[workItemOffset2 + t].y; 502 speex_out[speex_out_offset + groupId*1024 + 2*t + 1] = out_orig[workItemOffset2 + t + 1].x; 503 } 504 } 505 else 506 { 507 for(int t=0;t<4;t++) 508 { 509 speex_out[speex_out_offset + workItemOffset + 2* t] = out_orig[workItemOffset2 + t].y; 510 speex_out[speex_out_offset + workItemOffset + 2* t + 1] = out_orig[workItemOffset2 + t + 1]. x; 511 } 512 } 513 } 514 // clFFT_Inverse 515 else if(dir == 1) 516 { 517 for(int t=0;t<8;t++) 518 { 519 speex_out[speex_out_offset + workItemOffset + t] = out_orig[workItemOffset + t].x; 520 } 521 } 522 // End of finalization step 523 } 524 525 /** 526 * Specialized kernel for updating the filter weights in batches 527 * Globalsize: 128*M,C,K 528 * Localsize: 128,1,1 529 */ 530 __kernel void clFFT_update_weights(__global float *W, __global float2 *in, __global float2 *out, __global float *wtmp, intS) 531 { 532 __local float sMem[1088]; 533 int i, j, r, indexIn, indexOut, index, tid, bNum, xNum, k, l; 534 int s, ii, jj, offset; 535 int chan = get_global_id(1); 536 int speak = get_global_id(2); 537 int num_speakers = get_global_size(2); 538 int num_channels = get_global_size(1); 539 float2 w; 540 float ang, angf, ang1; 541 __local float *lMemStore, *lMemLoad; 542 float2 a[8]; 543 int lId = get_local_id( 0 ); 544 int groupId = get_group_id( 0 ); 545 const float m = 1.0f / 1024.0f; 546 int groupOffset = groupId * 1024 * num_speakers;// Group offset for float 547 int groupOffset2 = groupId * 512 * num_speakers;// Group offset for float2 548 unsigned int cs_offset = chan*1024*num_speakers*S + speak*1024;// Channel/speaker offset 549 unsigned int W_offset = cs_offset; 550 ii = lId ; 551 jj = 0; 552 553 // DoIFFT first 554 int dir = 1;


555 556 offset = cs_offset + groupId * 1024 * num_speakers + ii; 557 __global float2 *out_orig = out;// Keep this pointer for finalize code 558 __global float2 *in_orig = in; 559 560 int workItemOffset = lId * 8 + groupOffset + cs_offset; 561 562 barrier(CLK_LOCAL_MEM_FENCE); 563 564 // Do preparation step to convert from SpeexFFT format 565 // There is one thread for each four positions in theFFT length 566 567 if(lId == 0) 568 { 569 in[groupOffset + cs_offset + 0].x = W[groupOffset + cs_offset + 0]; 570 in[groupOffset + cs_offset + 0].y = 0.0f; 571 for(int t=1;t<4;t++) 572 { 573 in[groupOffset + cs_offset + t].x = W[groupOffset + cs_offset + (2*t - 1)]; 574 in[groupOffset + cs_offset + t].y = W[groupOffset + cs_offset + (2*t - 1) + 1]; 575 } 576 in[groupOffset + cs_offset + 512].x = W[groupOffset + cs_offset + 1024 - 1]; 577 in[groupOffset + cs_offset + 512].y = 0.0f; 578 } 579 else 580 { 581 for(int t=0;t<4;t++) 582 { 583 in[groupOffset + cs_offset + 4*lId + t].x = W[ groupOffset + cs_offset + lId*8 - 1 + 2*t]; 584 in[groupOffset + cs_offset + 4*lId + t].y = W[ groupOffset + cs_offset + lId*8 + 2*t]; 585 } 586 } 587 588 barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE); 589 590 // Generate negative frequencies for symmetric input 591 // Source: http://www.dspguide.com/ch12/1.htm 592 // For the remainingn/2 threads that needs to mirror 593 for(int t=0;t<4;t++) 594 { 595 in[groupOffset + cs_offset + 512 + lId*4 + t].x = in[groupOffset + cs_offset + 1024 - (512 + lId*4 + t)].x; 596 in[groupOffset + cs_offset + 512 + lId*4 + t].y = -in[ groupOffset + cs_offset + 1024 - (512 + lId*4 + t)].y; 597 } 598 599 //FFTKERNELFUNCTIONNOTLISTED, see clFFT_custom 600 601 barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE); 602 603 // Do finalization step to convert to SpeexFFT format 604 // There are one thread for each four positions in theFFT length 605 for(int t=0;t<8;t++) 606 { 607 wtmp[workItemOffset + t] = out_orig[workItemOffset + t].x; 608 //wtmp[workItemOffset+t]= 1.0f; 609 } 610 611 // End of finalization step 612 barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE); 613


614 // Set to zero 615 wtmp[groupOffset + cs_offset + 512 + 4*lId] = 0.0f; 616 wtmp[groupOffset + cs_offset + 512 + 4*lId + 1] = 0.0f; 617 wtmp[groupOffset + cs_offset + 512 + 4*lId + 2] = 0.0f; 618 wtmp[groupOffset + cs_offset + 512 + 4*lId + 3] = 0.0f; 619 620 in = in_orig; 621 out = out_orig; 622 623 // Do forwardFFT 624 dir = -1; 625 626 // Do preparation step to convert from SpeexFFT format 627 // There is one thread for each four positions in theFFT length 628 629 // clFFT_Forward 630 for(int t=0;t<8;t++) 631 { 632 in_orig[workItemOffset + t].x = wtmp[workItemOffset + t] * m; 633 //in_orig[workItemOffset+t].x= 1.0f; 634 in_orig[workItemOffset + t].y = 0.0f; 635 } 636 637 barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE); 638 // End of preparation step 639 640 offset = cs_offset + groupId * 1024 * num_speakers + ii; 641 //FFTKERNELFUNCTIONNOTLISTED, see clFFT_custom 642 643 barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE); 644 645 // Do finalization step to convert to SpeexFFT format 646 // There are one thread for each four positions in theFFT length 647 // Not adapted to multi-channel filter, only mono 648 int workItemOffset2 = lId * 4 + groupOffset + cs_offset; 649 650 // The first thread for eachFFT length 651 if(lId == 0) 652 { 653 W[cs_offset + groupOffset + 0] = out_orig[workItemOffset2 + 0].x ; 654 W[cs_offset + groupOffset + 1] = out_orig[workItemOffset2 + 1].x ; 655 for(int t=1;t<4;t++) 656 { 657 W[cs_offset + groupOffset + 2*t] = out_orig[ workItemOffset2 + t].y; 658 W[cs_offset + groupOffset + 2*t + 1] = out_orig[ workItemOffset2 + t + 1].x; 659 } 660 } 661 // note: cs_offset is included in workItemOffset 662 else if(groupId < S) 663 { 664 for(int t=0;t<4;t++) 665 { 666 W[workItemOffset + 2*t] = out_orig[workItemOffset2 + t]. y; 667 W[workItemOffset + 2*t + 1] = out_orig[workItemOffset2 + t + 1]. x; 668 //W[workItemOffset+ 2*t]= 9000.0f; 669 //W[workItemOffset+ 2*t+ 1]= 8000.0f; 670 //W[workItemOffset+ 2*t]= wtmp[groupOffset+ cs_offset + 8*lId+t]; 671 //W[workItemOffset+ 2*t+ 1]= wtmp[groupOffset+ cs_offset+ 8*lId+t]; 672 } 673 }


    // End of finalization step
}

/**
 * Specialized kernel for updating the filter weights in batches
 * Split in two for ATI, as it crashes on a single long kernel
 * Global size: 128*M, C, K
 * Local size: 128, 1, 1
 */
__kernel void clFFT_update_weights_ati_pt1(__global float *W, __global float2 *in,
        __global float2 *out, __global float *wtmp, int S)
{
    __local float sMem[1088];
    int i, j, r, indexIn, indexOut, index, tid, bNum, xNum, k, l;
    int s, ii, jj, offset;
    int chan = get_global_id(1);
    int speak = get_global_id(2);
    int num_speakers = get_global_size(2);
    int num_channels = get_global_size(1);
    float2 w;
    float ang, angf, ang1;
    __local float *lMemStore, *lMemLoad;
    float2 a[8];
    int lId = get_local_id(0);
    int groupId = get_group_id(0);
    const float m = 1.0f / 1024.0f;
    int groupOffset = groupId * 1024 * num_speakers; // Group offset for float
    int groupOffset2 = groupId * 512 * num_speakers; // Group offset for float2
    unsigned int cs_offset = chan*1024*num_speakers*S + speak*1024; // Channel/speaker offset
    unsigned int W_offset = cs_offset;
    ii = lId;
    jj = 0;

    // Do IFFT first
    int dir = 1;

    offset = cs_offset + groupId * 1024 * num_speakers + ii;
    __global float2 *out_orig = out; // Keep this pointer for finalize code
    __global float2 *in_orig = in;

    int workItemOffset = lId * 8 + groupOffset + cs_offset;

    barrier(CLK_LOCAL_MEM_FENCE);

    // Do preparation step to convert from Speex FFT format
    // There is one thread for each four positions in the FFT length
    if (lId == 0)
    {
        in[groupOffset + cs_offset + 0].x = W[groupOffset + cs_offset + 0];
        in[groupOffset + cs_offset + 0].y = 0.0f;
        for (int t = 1; t < 4; t++)
        {
            in[groupOffset + cs_offset + t].x = W[groupOffset + cs_offset + (2*t - 1)];
            in[groupOffset + cs_offset + t].y = W[groupOffset + cs_offset + (2*t - 1) + 1];
        }
        in[groupOffset + cs_offset + 512].x = W[groupOffset + cs_offset + 1024 - 1];
        in[groupOffset + cs_offset + 512].y = 0.0f;
    }
    else


    {
        for (int t = 0; t < 4; t++)
        {
            in[groupOffset + cs_offset + 4*lId + t].x = W[groupOffset + cs_offset + lId*8 - 1 + 2*t];
            in[groupOffset + cs_offset + 4*lId + t].y = W[groupOffset + cs_offset + lId*8 + 2*t];
        }
    }

    barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);

    // Generate negative frequencies for symmetric input
    // Source: http://www.dspguide.com/ch12/1.htm
    // For the remaining n/2 threads that need to mirror
    for (int t = 0; t < 4; t++)
    {
        in[groupOffset + cs_offset + 512 + lId*4 + t].x =  in[groupOffset + cs_offset + 1024 - (512 + lId*4 + t)].x;
        in[groupOffset + cs_offset + 512 + lId*4 + t].y = -in[groupOffset + cs_offset + 1024 - (512 + lId*4 + t)].y;
    }

    barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);

    // FFT KERNEL FUNCTION NOT LISTED, see clFFT_custom

    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

    // Do finalization step to convert to Speex FFT format
    // There is one thread for each four positions in the FFT length
    for (int t = 0; t < 8; t++)
    {
        wtmp[workItemOffset + t] = out_orig[workItemOffset + t].x;
    }

    // End of finalization step
    barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);

    // Set to zero
    wtmp[groupOffset + cs_offset + 512 + 4*lId]     = 0.0f;
    wtmp[groupOffset + cs_offset + 512 + 4*lId + 1] = 0.0f;
    wtmp[groupOffset + cs_offset + 512 + 4*lId + 2] = 0.0f;
    wtmp[groupOffset + cs_offset + 512 + 4*lId + 3] = 0.0f;
}

/**
 * Specialized kernel for updating the filter weights in batches
 * Split in two for ATI, as it crashes on a single long kernel
 * Global size: 128*M, C, K
 * Local size: 128, 1, 1
 */
__kernel void clFFT_update_weights_ati_pt2(__global float *W, __global float2 *in,
        __global float2 *out, __global float *wtmp, int S)
{
    __local float sMem[1088];
    int i, j, r, indexIn, indexOut, index, tid, bNum, xNum, k, l;
    int s, ii, jj, offset;
    int chan = get_global_id(1);
    int speak = get_global_id(2);
    int num_speakers = get_global_size(2);
    int num_channels = get_global_size(1);
    float2 w;
    float ang, angf, ang1;
    __local float *lMemStore, *lMemLoad;
    float2 a[8];
    int lId = get_local_id(0);


    int groupId = get_group_id(0);
    const float m = 1.0f / 1024.0f;
    int groupOffset = groupId * 1024 * num_speakers; // Group offset for float
    int groupOffset2 = groupId * 512 * num_speakers; // Group offset for float2
    unsigned int cs_offset = chan*1024*num_speakers*S + speak*1024; // Channel/speaker offset
    unsigned int W_offset = cs_offset;
    ii = lId;
    jj = 0;

    // Do IFFT first
    int dir = 1;

    offset = cs_offset + groupId * 1024 * num_speakers + ii;
    __global float2 *out_orig = out; // Keep this pointer for finalize code
    __global float2 *in_orig = in;

    int workItemOffset = lId * 8 + groupOffset + cs_offset;

    barrier(CLK_LOCAL_MEM_FENCE);

    // Do forward FFT
    dir = -1;

    // Do preparation step to convert from Speex FFT format
    // There is one thread for each four positions in the FFT length

    // clFFT_Forward
    for (int t = 0; t < 8; t++)
    {
        in_orig[workItemOffset + t].x = wtmp[workItemOffset + t] * m;
        in_orig[workItemOffset + t].y = 0.0f;
    }

    barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);

    // End of preparation step

    offset = cs_offset + groupId * 1024 * num_speakers + ii;
    // FFT KERNEL FUNCTION NOT LISTED, see clFFT_custom

    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

    // Do finalization step to convert to Speex FFT format
    // There is one thread for each four positions in the FFT length
    // Not adapted to multi-channel filter, only mono
    int workItemOffset2 = lId * 4 + groupOffset + cs_offset;

    // The first thread for each FFT length
    if (lId == 0)
    {
        W[cs_offset + groupOffset + 0] = out_orig[workItemOffset2 + 0].x;
        W[cs_offset + groupOffset + 1] = out_orig[workItemOffset2 + 1].x;
        for (int t = 1; t < 4; t++)
        {
            W[cs_offset + groupOffset + 2*t]     = out_orig[workItemOffset2 + t].y;
            W[cs_offset + groupOffset + 2*t + 1] = out_orig[workItemOffset2 + t + 1].x;


        }
    }
    // note: cs_offset is included in workItemOffset
    else if (groupId < S)
    {
        for (int t = 0; t < 4; t++)
        {
            W[workItemOffset + 2*t]     = out_orig[workItemOffset2 + t].y;
            W[workItemOffset + 2*t + 1] = out_orig[workItemOffset2 + t + 1].x;
        }
    }
    // End of finalization step
}

/**
 * Specialized kernel for converting the error to the frequency domain in one operation
 * Global size: 128*C, 2
 * Local size: 128, 1
 */
__kernel void clFFT_convert_error_to_freq_domain(__global float *e, __global float *E,
        __global float *y, __global float *Y, __global float2 *in, __global float2 *in2,
        __global float2 *out, __global float2 *out2, int S)
{
    __local float sMem[1088];
    int i, j, r, indexIn, indexOut, index, tid, bNum, xNum, k, l;
    int s, ii, jj, offset;
    float2 w;
    float ang, angf, ang1;
    __local float *lMemStore, *lMemLoad;
    float2 a[8];
    int lId = get_local_id(0);
    int taskId = get_global_id(1);
    int groupId = get_group_id(0);
    const float m = 1.0f / 1024.0f;
    int groupOffset = groupId * 1024; // Group offset for float
    int groupOffset2 = groupId * 512; // Group offset for float2
    ii = lId;
    jj = 0;

    offset = mad24(groupId, 1024, ii);
    __global float2 *out_orig = out; // Keep this pointer for finalize code
    __global float2 *in_orig = in;
    __global float2 *out2_orig = out2; // Keep this pointer for finalize code
    __global float2 *in2_orig = in2;

    int workItemOffset = lId * 8 + groupOffset;

    barrier(CLK_LOCAL_MEM_FENCE);

    // Do FFT
    // Equiv: speex_echo_opencl_fft_execute2(ocl_e, chan*st->window_size, ocl_E, chan*st->window_size, 1, commandQueue);
    if (taskId == 0)
    {
        // Do FFT first
        int dir = -1;

        // Do preparation step to convert from Speex FFT format
        // There is one thread for each four positions in the FFT length

        // clFFT_Forward
        for (int t = 0; t < 8; t++)
        {


            in_orig[workItemOffset + t].x = e[workItemOffset + t] * m;
            in_orig[workItemOffset + t].y = 0.0f;
        }

        // End of preparation step

        barrier(CLK_LOCAL_MEM_FENCE);

        offset = mad24(groupId, 1024, ii);
        // FFT KERNEL FUNCTION NOT LISTED, see clFFT_custom

        barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

        // Do finalization step to convert to Speex FFT format
        // There is one thread for each four positions in the FFT length
        // Not adapted to multi-channel filter, only mono
        int workItemOffset2 = lId * 4 + groupOffset;

        // The first thread for each FFT length
        if (lId == 0)
        {
            E[groupOffset + 0] = out_orig[workItemOffset2 + 0].x;
            E[groupOffset + 1] = out_orig[workItemOffset2 + 1].x;
            for (int t = 1; t < 4; t++)
            {
                E[groupOffset + 2*t]     = out_orig[workItemOffset2 + t].y;
                E[groupOffset + 2*t + 1] = out_orig[workItemOffset2 + t + 1].x;
            }
        }
        else
        {
            for (int t = 0; t < 4; t++)
            {
                E[workItemOffset + 2*t]     = out_orig[workItemOffset2 + t].y;
                E[workItemOffset + 2*t + 1] = out_orig[workItemOffset2 + t + 1].x;
            }
        }
        // End of finalization step
    }
    // Set to value and do FFT
    // Equiv: mdf_opencl_set_array_to_float_value(ocl_y, chan*st->window_size, 0, st->frame_size);
    //        speex_echo_opencl_fft_execute2(ocl_y, chan*st->window_size, ocl_Y, chan*st->window_size, 1, commandQueue);
    else
    {
        // Set to zero
        y[groupOffset + 4*lId]     = 0.0f;
        y[groupOffset + 4*lId + 1] = 0.0f;
        y[groupOffset + 4*lId + 2] = 0.0f;
        y[groupOffset + 4*lId + 3] = 0.0f;

        // Do FFT
        int dir = -1;

        barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);

        // Do preparation step to convert from Speex FFT format
        // There is one thread for each four positions in the FFT length


        for (int t = 0; t < 8; t++)
        {
            in2[workItemOffset + t].x = y[workItemOffset + t] * m;
            in2[workItemOffset + t].y = 0.0f;
        }

        barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);

        offset = mad24(groupId, 1024, ii);
        // FFT KERNEL FUNCTION NOT LISTED, see clFFT_custom
        // a small modification is made to use the in2 and out2 buffers

        barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

        // Do finalization step to convert to Speex FFT format
        // There is one thread for each four positions in the FFT length
        int workItemOffset2 = lId * 4 + groupOffset;

        // The first thread for each FFT length
        if (lId == 0)
        {
            Y[groupOffset + 0] = out2_orig[workItemOffset2 + 0].x;
            Y[groupOffset + 1] = out2_orig[workItemOffset2 + 1].x;
            for (int t = 1; t < 4; t++)
            {
                Y[groupOffset + 2*t]     = out2_orig[workItemOffset2 + t].y;
                Y[groupOffset + 2*t + 1] = out2_orig[workItemOffset2 + t + 1].x;
            }
        }
        else
        {
            for (int t = 0; t < 4; t++)
            {
                Y[workItemOffset + 2*t]     = out2_orig[workItemOffset2 + t].y;
                Y[workItemOffset + 2*t + 1] = out2_orig[workItemOffset2 + t + 1].x;
            }
        }
        // End of finalization step
    }
}

/**
 * Specialized kernel for doing a forward FFT and then an inner product
 * Global size: 128*K
 * Local size: 128
 */
__kernel void clFFT_inner_prod_combined(__global float *data_time_domain, __global float2 *in,
        __global float2 *out, __global float *data_freq_domain, __global float *Xf,
        __global float *temp_reduction, int frame_size, int S)
{
    __local float sMem[1088];
    int i, j, r, indexIn, indexOut, index, tid, bNum, xNum, k, l;
    int s, ii, jj, offset;
    int lId = get_local_id(0);
    int groupId = get_group_id(0);
    int groupOffset = groupId * 1024; // Group offset for float
    int groupOffset2 = groupId * 512; // Group offset for float2
    unsigned int xi = get_local_id(0)*4 + frame_size + groupOffset; // Used for inner product
    unsigned int yi = get_local_id(0)*4 + frame_size + groupOffset; // Used for inner product


    float2 w;
    float ang, angf, ang1;
    __local float *lMemStore, *lMemLoad;
    float2 a[8];
    const float m = 1.0f / 1024.0f;
    ii = lId;
    jj = 0;

    // Do FFT
    int dir = -1;

    offset = mad24(groupId, 1024, ii);
    __global float2 *out_orig = out; // Keep this pointer for finalize code
    __global float2 *in_orig = in;

    int workItemOffset = lId * 8 + groupOffset;

    barrier(CLK_LOCAL_MEM_FENCE);

    // Do preparation step to convert from Speex FFT format
    // There is one thread for each four positions in the FFT length
    for (int t = 0; t < 8; t++)
    {
        in[workItemOffset + t].x = data_time_domain[workItemOffset + t] * m;
        in[workItemOffset + t].y = 0.0f;
    }

    barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);

    // End of preparation step
    offset = mad24(groupId, 1024, ii);
    // FFT KERNEL FUNCTION NOT LISTED, see clFFT_custom

    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

    // Do finalization step to convert to Speex FFT format
    // There is one thread for each four positions in the FFT length
    int workItemOffset2 = lId * 4 + groupOffset;

    // The first thread for each FFT length
    if (lId == 0)
    {
        data_freq_domain[groupOffset + 0] = out_orig[workItemOffset2 + 0].x;
        data_freq_domain[groupOffset + 1] = out_orig[workItemOffset2 + 1].x;
        for (int t = 1; t < 4; t++)
        {
            data_freq_domain[groupOffset + 2*t]     = out_orig[workItemOffset2 + t].y;
            data_freq_domain[groupOffset + 2*t + 1] = out_orig[workItemOffset2 + t + 1].x;
        }
    }
    else
    {
        for (int t = 0; t < 4; t++)
        {
            data_freq_domain[workItemOffset + 2*t]     = out_orig[workItemOffset2 + t].y;
            data_freq_domain[workItemOffset + 2*t + 1] = out_orig[workItemOffset2 + t + 1].x;
        }
    }


    // End of finalization step

    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

    // Calculate inner product (last reduction is done on host)
    if (get_local_id(0) < 128)
    {
        sMem[lId]  = 0.0f;
        sMem[lId] += data_time_domain[xi]   * data_time_domain[yi];
        sMem[lId] += data_time_domain[xi+1] * data_time_domain[yi+1];
        sMem[lId] += data_time_domain[xi+2] * data_time_domain[yi+2];
        sMem[lId] += data_time_domain[xi+3] * data_time_domain[yi+3];
    }
    else
        sMem[lId] = 0.0f;

    barrier(CLK_LOCAL_MEM_FENCE);

    // Do reduction in shared mem
    for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1)
    {
        if (lId < s)
        {
            sMem[lId] += sMem[lId + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Write result for this block to global mem
    if (lId == 0)
    {
        temp_reduction[get_group_id(0)] = sMem[0];
    }
}

/**
 * Regular inverse FFT, combined with the spectral mul accum (long) reduce kernel
 * Global size: 128*K
 * Local size: 128
 */
__kernel void clFFT_inverse_spectral_mul_accum_reduce(__global float *Y, __global float2 *in,
        __global float2 *out, __global float *e, __global float *temp_acc, int M, int N, int S)
{
    __local float sMem[1088];
    int i, j, r, indexIn, indexOut, index, tid, bNum, xNum, k, l;
    int s, ii, jj, offset;
    float2 w;
    float ang, angf, ang1;
    int dir = 1; // Inverse
    __local float *lMemStore, *lMemLoad;
    float2 a[8];
    int lId = get_local_id(0);
    int groupId = get_group_id(0);
    const float m = 1.0f / 1024.0f;
    int groupOffset = groupId * 1024;
    ii = lId;
    jj = 0;

    offset = mad24(groupId, 1024, ii);
    __global float2 *out_orig = out; // Keep this pointer for finalize code
    int workItemOffset = lId * 8 + groupOffset;

    // Do the spectral mul accum reduce, 8 elements for each thread
    for (int t = 0; t < 8; t++)


    {
        Y[groupOffset + lId*8 + t] = 0.0f;
    }

    for(int c=0;c


        e[workItemOffset + t] = out_orig[workItemOffset + t].x;
    }

    // End of finalization step
}

Listing B.3: MDF OpenCL kernel functions for 512 sample frames
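The header comments above give the launch geometry each kernel expects, e.g. a global size of 128*M x C x K work-items and a local size of 128 x 1 x 1 for the two weight-update kernels. As a point of reference only, the host-side fragment below is a minimal sketch of how such a launch could be expressed with the standard OpenCL C API. The kernel and buffer handles (kern_pt1, kern_pt2, d_W, d_in, d_out, d_wtmp), the command queue, and the values of M, C, K and S are hypothetical placeholders that do not appear in the thesis code and are assumed to have been created elsewhere.

#include <CL/cl.h>

/* Minimal launch sketch for the split weight-update kernels.
 * All handles and sizes used here (queue, kern_pt1, kern_pt2, d_W, d_in,
 * d_out, d_wtmp, M, C, K, S) are assumed to be set up by the host code. */
void launch_weight_update(cl_command_queue queue, cl_kernel kern_pt1, cl_kernel kern_pt2,
                          cl_mem d_W, cl_mem d_in, cl_mem d_out, cl_mem d_wtmp,
                          int M, int C, int K, int S)
{
    size_t global_size[3] = { (size_t)(128 * M), (size_t)C, (size_t)K };
    size_t local_size[3]  = { 128, 1, 1 };
    cl_int s_arg = S; /* the S parameter expected by the kernels */

    /* Part 1: IFFT of the weights and zeroing of the upper half of wtmp */
    clSetKernelArg(kern_pt1, 0, sizeof(cl_mem), &d_W);
    clSetKernelArg(kern_pt1, 1, sizeof(cl_mem), &d_in);
    clSetKernelArg(kern_pt1, 2, sizeof(cl_mem), &d_out);
    clSetKernelArg(kern_pt1, 3, sizeof(cl_mem), &d_wtmp);
    clSetKernelArg(kern_pt1, 4, sizeof(cl_int), &s_arg);
    clEnqueueNDRangeKernel(queue, kern_pt1, 3, NULL, global_size, local_size, 0, NULL, NULL);

    /* Part 2: forward FFT of wtmp back into W, same arguments and geometry */
    clSetKernelArg(kern_pt2, 0, sizeof(cl_mem), &d_W);
    clSetKernelArg(kern_pt2, 1, sizeof(cl_mem), &d_in);
    clSetKernelArg(kern_pt2, 2, sizeof(cl_mem), &d_out);
    clSetKernelArg(kern_pt2, 3, sizeof(cl_mem), &d_wtmp);
    clSetKernelArg(kern_pt2, 4, sizeof(cl_int), &s_arg);
    clEnqueueNDRangeKernel(queue, kern_pt2, 3, NULL, global_size, local_size, 0, NULL, NULL);
}

Splitting the work across two enqueues in this way matches the kernels' own documentation (the single long kernel crashed on ATI hardware); error checking of the return codes is omitted for brevity.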
