Working With Incremental Spatial Data During Parallel (GPU) Computation

By: Robert Chisholm

Supervised By: Dr Paul Richmond & Dr Steve Maddock

A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy

The University of Sheffield Faculty of Engineering Department of Computer Science

24th January 2020

Abstract

Central to many complex systems, spatial actors require an awareness of their local environment to enable behaviours such as communication and navigation. Complex system simulations represent this behaviour with Fixed Radius Near Neighbours (FRNN) search. This allows actors to store data at spatial locations and then query the data structure to find all data stored within a fixed radius of the search origin. The work within this thesis answers the question: What techniques can be used for improving the performance of FRNN searches during complex system simulations on Graphics Processing Units (GPUs)? It is generally agreed that Uniform Spatial Partitioning (USP) is the most suitable data structure for providing FRNN search on GPUs. However, due to the architectural complexities of GPUs, the performance is constrained such that FRNN search remains one of the most expensive stages common to complex system models. Existing innovations to USP highlight a need to take advantage of recent GPU advances, reducing the levels of divergence and limiting redundant memory accesses as viable routes to improve the performance of FRNN search. This thesis addresses these with three separate optimisations that can be used simultaneously. Experiments have assessed the impact of optimisations to the general case of FRNN search found within complex system simulations and demonstrated their impact in practice when applied to full complex system models. Results presented show the performance of the construction and query stages of FRNN search can be improved by over 2x and 1.3x respectively. These improvements allow complex system simulations to be executed faster, enabling increases in scale and model complexity.

Declaration

I, the author, confirm that the Thesis is my own work. I am aware of the University's Guidance on the Use of Unfair Means (www.sheffield.ac.uk/ssid/unfair-means). This work has not previously been presented for an award at this, or any other, university.

Name: Robert Chisholm Signature: Date: 24th January 2020

Acknowledgements

I would like to acknowledge & extend my gratitude to the following persons who have made the completion of this research possible:

Dr Paul Richmond for sharing his GPU passion and knowledge. Dr Steve Maddock for his regular discretionary guidance correcting my grammar and ‘logic train’. Dr Daniela Romano for the earlier supervision she provided. Peter & Mozhgan for general support and rubber duck debugging around the office. My family for the unconditional support. The many friends who got me into fitness, and kept me sane: Aisha, Andres, Andy, Bader, Basti, Chris, ‘Wolf’, James, Jeremy, Kieran, Roman, Sunny & countless others (also Ben). GBC for the spirited conversation. Becca and the many others, not listed, who have also contributed to my general well being.

Contents

Preface
    Abstract
    Declaration
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    List of Acronyms

1 Introduction
    1.1 Background
    1.2 Aims & Research Question
    1.3 Contributions
        1.3.1 Publications
    1.4 Summary of Chapters

2 Background Literature
    2.1 Complex Simulations
        2.1.1 Pedestrian Modelling
        2.1.2 Traffic
        2.1.3 Smoothed-Particle Hydrodynamics
        2.1.4 General Approaches
        2.1.5 Summary
    2.2 Spatial Data Structures
        2.2.1 Hashing Schemes
            Summary
        2.2.2 Trees
            Summary
        2.2.3 Kinetic Data Structures
        2.2.4 Summary
    2.3 High Performance Computation
        2.3.1 General Purpose computation on Graphics Processing Units
            GPU Architecture
            Additional Fixed Function Features of Graphics Processing Units (GPUs)
            Optimisation
            Frameworks for General Purpose computation on Graphics Processing Units (GPGPU) Development
            Case Study of GPU Optimisation
        2.3.2 Cluster, Grid & Distributed Computing
        2.3.3 Many Integrated Cores
        2.3.4 Architectural Specialisation Away From GPUs
        2.3.5 Summary
    2.4 Parallel Spatial Data Structures
        2.4.1 Hashing
        2.4.2 Parallel Trees
        2.4.3 Parallel Verlet Lists
        2.4.4 Neighbourhood Grid
        2.4.5 Primitives
        2.4.6 Summary
    2.5 The Uniform Spatial Partitioning Data Structure
        2.5.1 Memory Layout
        2.5.2 Construction
        2.5.3 Query
        2.5.4 Uniform Spatial Partitioning in Complex System Simulations
        2.5.5 State of the Art
        2.5.6 Summary
    2.6 Conclusion

3 Benchmark Models
    3.1 Continuous: Circles Model
        3.1.1 Model Specification
            Initialisation
            Single Iteration
            Validation
        3.1.2 Effective Usage
    3.2 Abstract: Network Model
        3.2.1 Model Specification
            Initialisation
            Single Iteration
            Validation
        3.2.2 Effective Usage
    3.3 FLAMEGPU Models
        3.3.1 Pedestrian
        3.3.2 Boids
        3.3.3 Keratinocyte
    3.4 Summary

4 Optimising Construction
    4.1 Justification
    4.2 GPU with Atomics
    4.3 Atomic Counting Sort Construction
    4.4 Experiments
        4.4.1 Quantitative Experiments
            Bin Quantity
            Actor Density
        4.4.2 Applied Experiments
            Network Model
            Pedestrian
    4.5 Results
        4.5.1 Quantitative Experiments
            Bin Quantity
            Actor Density
        4.5.2 Applied Experiments
            Network Model
            Pedestrian
    4.6 Conclusion

5 Optimising Queries: Reducing Divergence
    5.1 Justification
    5.2 Strips Optimisation
    5.3 Experiments
    5.4 Results
        5.4.1 Experiment 1: Strips, Scaling Population Size
        5.4.2 Experiment 2: Strips, Scaling Density
        5.4.3 Experiment 3: Strips, Warp Divergence
    5.5 Conclusions

6 Optimising Queries: Surplus Memory Accesses
    6.1 Proportional Bin Width Optimisation
        6.1.1 Combined Technique
    6.2 Implementation
        6.2.1 Construction Stage
        6.2.2 Fixed Radius Near Neighbours (FRNN) Query Stage
            Algorithm
    6.3 Experiments
    6.4 Results
        6.4.1 Experiment 1: Proportional Bin Width
        6.4.2 Experiment 2: Construction
        6.4.3 Experiment 3: Population Scaling
        6.4.4 Experiment 4: Model Testing
    6.5 Conclusions

7 Case Studies & Discussion
    7.1 Overview
    7.2 Optimisation Suitability Criteria
    7.3 Case Studies
    7.4 Conclusion

8 Conclusions
    8.1 Future Work

Glossary

References

List of Figures

2.1 Crowd simulation within Planet Coaster.
2.2 A 1st order Voronoi diagram.
2.3 Simulation of blood flow generated using Smoothed-Particle Hydrodynamics (SPH).
2.4 A visual representation of a hash table that utilises separate chaining.
2.5 A visual representation of a hash table that utilises linear probing.
2.6 A visual representation of a region quadtree.
2.7 Illustration of the Single Instruction Multiple Threads parallel execution model used by GPUs.
2.8 Illustration of the memory hierarchy present within the Kepler GPU architecture.
2.9 A visual representation of a perfect hash table that utilises a secondary offset table.
2.10 The uniform spatial partitioning data structure and how it is used for FRNN search.
2.11 The generation of the Uniform Spatial Partitioning (USP) data structure's storage array (a) and Partition Boundary Matrix (PBM), which can take the form of one (c) or two (b) arrays.
2.12 A level 4 Z-order curve (Morton code) in 2 dimensions.

3.1 Visualisation of the start and end states of the Circles model.
3.2 An example of how the network model might appear.
3.3 A visualisation from the pedestrian model representing an urban transport hub.

4.1 A visualisation of the stages of counting sort.
4.2 Counting Sort vs Atomic Counting Sort (ACS) - Variable Contention.
4.3 The bin size histogram, produced as a by-product of atomic counting sort, may be used to produce a single array PBM.
4.4 The arrays used to generate a PBM.
4.5 Quantitative Experiments, Bin Quantity: USP Construction - Variable Contention.
4.6 Quantitative Experiments, Bin Quantity: Original Construction - Components (unsorted).
4.7 Quantitative Experiments, Bin Quantity: ACS Construction - Components (unsorted).
4.8 Quantitative Experiments, Bin Quantity: ACS Construction - Components (sorted).
4.9 Quantitative Experiments, Actor Density: ACS Construction Speedup (unsorted).
4.10 Quantitative Experiments, Actor Density: ACS Construction Speedup (sorted).
4.11 Quantitative Experiments, Actor Density: ACS Construction Speedup (sorted) - Cross Section.
4.12 Applied Experiments: ACS Network Model - Construction/Query Time (1,024 vertices).
4.13 Applied Experiments: Network Model - ACS Speedup (1,024 vertices).
4.14 Applied Experiments: Network Model - ACS Speedup (16,384 vertices).

5.1 How bins differ under the Strips optimisation.
5.2 A visual example of how implicit synchronisation may impact performance.
5.3 Experiment 1: Strips, Scaling Population Size.
5.4 Experiment 2: Strips, Scaling Density.
5.5 Kernel invocation information presented by the Visual Profiler. The WEE metric is listed as Warp Execution Efficiency.

6.1 How different grid resolutions affect redundant area.
6.2 Visual representation of how, under some bin widths, the query origin affects the size of the required search area to cover the whole radial neighbourhood. R denotes the neighbourhood radius.
6.3 Visual representation of how, under some bin widths and search origins, corner bins do not always need to be accessed. R denotes the search radius.
6.4 Experiment 1: PBW, Strips, Scaling Density.
6.5 Experiment 1: Original, PBW, Strips, Scaling Density.
6.6 Experiment 2: Best PBW, Construction/Query Time.
6.7 Experiment 3: Best PBW, Scaling Population.

7.1 Visualisation interface of the Multi-modal Transport Simulation case study.
7.2 The crucifix arrangement of the rail network, with trains represented as green dots.
7.3 A visualisation of the FRUIN level of service (LOS) modelled by the large event crowd simulation.

List of Tables

3.1 The parameters for configuring the Circles benchmark model.
3.2 The parameters for configuring the network benchmark model.

4.1 The Bin Counts from Figure 4.5 converted to relative representations.

5.1 Warp Execution Efficiency (WEE) during execution of the Circles model. Configuration: 2D, 1 million actors, average radial neighbourhood size of 70.

6.1 An example of the variability caused by changing the bin width as a proportion of the search radius of 1 (with a corresponding radial search area of 3.14) in 2D. These values can be calculated using Equations 6.1 to 6.4.
6.2 The runtime of 3 models within FLAMEGPU before and after the combined optimisation has been applied with a Proportional Bin Width of 0.5, when executed for 10,000 iterations.

7.1 Results from applying the thesis's optimisation to the Multi-modal Transport Simulation.
7.2 Results from applying the optimisations from this thesis to the Large Event Crowd Simulation.

List of Acronyms

ACS Atomic Counting Sort.
AoS Array of Structs.
API Applications Programming Interface.
ASICs Application-Specific Integrated Circuits.
AVL Adelson-Velsky & Landis; a form of binary tree that stores additional information, allowing it to maintain balance.
BVH Bounding Volume Hierarchy.
CA Cellular Automata.
CPU Central Processing Unit.
CUDA Compute Unified Device Architecture.
CUDPP CUDA Parallel Primitives.
ECC Error Correction Code.
fp32 32 bit Floating Point; single precision decimal.
fp64 64 bit Floating Point; double precision decimal.
FPGAs Field Programmable Gate Arrays; programmable integrated circuits organised in a matrix.
FRNN Fixed Radius Near Neighbours.
GPGPU General Purpose computation on Graphics Processing Units.
GPU Graphics Processing Unit.
HPC High Performance Computation.
LOS Level Of Service.
MAS Multi-Agent Systems.
MD Molecular Dynamics.
MIC Many Integrated Core; Intel's (Xeon Phi) compute architecture.
MPI Message Passing Interface.
nvcc NVIDIA CUDA Compiler.
NVLink The brand name for NVIDIA's high-speed GPU interconnect technology.
NVSwitch The brand name for NVLink in a switch arrangement, enabling communication between up to 16 devices simultaneously.
OpenACC Open Accelerators.
OpenCL Open Computing Language.
OpenGL Open Graphics Library.
PBM Partition Boundary Matrix.
PCIe PCI Express.
SETI The Search for Extra-Terrestrial Intelligence.
SFU Special Functions Unit.
SIMT Single Instruction Multiple Threads.
SLI Scalable Link Interface; the brand name for NVIDIA's multi-GPU technology.
SM Stream Multiprocessor.
SoA Struct of Arrays.
SPH Smoothed-Particle Hydrodynamics.
STR Sort-Tile Recursive.
USP Uniform Spatial Partitioning.
WEE Warp Execution Efficiency; the average percentage of active threads in each executed warp, such that a value of 100% indicates no divergence.

Chapter 1

Introduction

1.1 Background

Complex systems can be identified by their large number of autonomous actors, often arranged in organisational hierarchies. It is the interaction of these many actors, combining and feeding back into the system in a continuous cycle, that produces high-level emergent behaviour. This process allows many small independent actions to combine, producing visible and robust high-level patterns of emergent behaviour. These complex systems are found everywhere in daily life, both in our surrounding environment and among the processes that permit our bodies to continue functioning. These systems exist from scales not visible to the human eye, such as the network of neurons within a brain, to those which extend across continents, like the organisational behaviour of ant colonies. Furthermore, abstract complex systems can also be found, such as the global economy. It is often impractical to directly study complex systems because the systems are too large to control, too expensive to measure or simply too harmful to modify. Understanding how a system's chaotic low-level interactions affect its high-level behaviour is of interest to many researchers in fields from Astronomy to Social Science. The introduction of general purpose programming of Graphics Processing Units (GPUs) in 2001 opened up this highly scalable and parallel architecture for the development of complex system simulations. The hierarchical structures, comprised of many similar actors, that form complex systems are ideal for highly parallel computation and, as such, most complex system models can be developed to target GPUs. However, developing parallel algorithms for use with GPUs introduces challenges such as communication between individuals, conflict resolution for limited resources and a requirement to optimise for hardware with different performance-critical features than those present in CPUs. There are now many frameworks that reduce the skills required to implement models of complex systems, although these primarily target CPUs with a minority capable of utilising

GPUs (e.g. FLAMEGPU [137]). However, there is a trade-off between general applicability and performance: the more general a framework, the less it can take advantage of domain specific optimisations. This is particularly apparent when we consider the spatial data structures required by many complex simulations, which is the focus of this thesis. Many complex systems contain mobile spatial actors that are influenced by their neighbours, whether they are particles, people, vehicles or planets. At each timestep, actors first survey neighbours within a spatial radius before deciding their change of state. This process is called Fixed Radius Near Neighbours (FRNN) search. Complex system models require FRNN search to be performed over a variety of distances and some models may share additional data alongside neighbour locations. The performance of the data structure used to manage this dynamic spatial data and deliver FRNN search is an active area of research [149, 44, 81]. Most data structures for storing spatial data are highly application specific. Whilst kd trees are well suited to delivering Nearest Neighbour (NN) search, they are unsuitable for returning the larger numbers of neighbours required by general case FRNN search. Conversely, any data structure suitable for providing FRNN search is unlikely to outperform kd trees when used specifically for NN search. Applications utilising static spatial data have many options covering a wide range of use-cases available to them [51, 110, 53, 63, 144] (discussed in Section 2.2), with many GPU alternatives (discussed in Section 2.4). Many of these static techniques, however, utilise costly optimisation strategies at construction to ensure queries are performant, making them undesirable for highly dynamic data. There are few data structures capable of handling dynamic spatial data on GPUs (also discussed in Section 2.4). This has led most complex simulations handling dynamic spatial data to rely on Uniform Spatial Partitioning (USP) to provide FRNN searches. FRNN search using the USP data structure consists of two distinct stages: construction and querying. Despite its use with complex system simulations, representing data with frequent incremental movement, it is a static data structure requiring a full reconstruction when any item moves, irrespective of the degree to which the data has moved. The construction process of USP decomposes the continuous environment into a discrete grid of bins, sorts all items according to their bin and builds an index to each bin's storage in memory. Existing research towards construction has primarily targeted the sorting stage [149]. However, attempts to produce a more efficient parallel sort, with the same handling of pre-sorted data as serial insertion sorts, have also reduced FRNN search to an approximate technique [86, 85, 41]. Most research into improving the performance of FRNN search with USP has targeted the query stage, whereby all items within a Moore neighbourhood (the set of directly adjacent discrete spatial bins) of the search origin's bin are depth tested to ensure they fall within the fixed radius. These approaches have predominantly attempted to improve the coherence of the highly scattered memory accesses whilst items from different bins are being accessed in parallel. However, recent improvements to GPUs

have significantly improved the caching of memory reads, reducing the impact of such techniques. Researchers have used many names to describe the USP data structure, including uniform subdivision, grid method and cell list [104, 56, 149, 85]. Yet FRNN search still remains one of the most costly aspects of complex system simulation [44], with many avenues left to be investigated to further advance the performance of both the construction and query stages of FRNN searches of dynamic spatial data on GPUs. In particular, the research in Chapter 2 highlights advances to GPU hardware, high levels of divergence and redundant memory accesses as areas for further investigation. This thesis seeks to address the challenges faced when handling the dynamic spatial data used in complex system simulations on GPUs, in particular the process of FRNN search. Existing techniques used for handling spatial data and optimising GPU algorithms have been investigated. This, in combination with an understanding of how FRNN search is currently performed, including state of the art approaches for optimisation, has been used to develop several novel optimisations. These optimisations are then shown to be beneficial to a wide range of research disciplines by facilitating the production of larger and faster complex system simulations.
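To make the preceding description concrete, the sketch below shows the typical shape of a USP-based FRNN query on a GPU: compute the search origin's bin, iterate the Moore neighbourhood of bins, then depth test every candidate in those bins against the fixed radius. This is a minimal 2D CUDA sketch under assumed conventions (a single-array index pbm where bin b's data occupies the range [pbm[b], pbm[b+1]) of the bin-sorted position array); all names are illustrative rather than taken from any particular implementation.

```cuda
// Hedged sketch of a 2D FRNN query over uniform spatial partitioning.
__device__ void frnnQuery(const float2 *sortedPos, const unsigned int *pbm,
                          float2 origin, float radius,
                          float envMin, float binWidth, int gridDim) {
    // Bin containing the search origin
    const int ox = (int)((origin.x - envMin) / binWidth);
    const int oy = (int)((origin.y - envMin) / binWidth);
    // Iterate the Moore neighbourhood (the 3x3 block of adjacent bins)
    for (int dy = -1; dy <= 1; ++dy) {
        const int y = oy + dy;
        if (y < 0 || y >= gridDim) continue;  // skip bins outside the grid
        for (int dx = -1; dx <= 1; ++dx) {
            const int x = ox + dx;
            if (x < 0 || x >= gridDim) continue;
            const unsigned int bin = y * gridDim + x;  // linear bin index
            for (unsigned int i = pbm[bin]; i < pbm[bin + 1]; ++i) {
                const float2 d = make_float2(sortedPos[i].x - origin.x,
                                             sortedPos[i].y - origin.y);
                // Depth test: only candidates inside the fixed radius count
                if (d.x * d.x + d.y * d.y <= radius * radius) {
                    // ... consume neighbour i's data here ...
                }
            }
        }
    }
}
```

Comparing squared distances avoids a square root per candidate; the candidates that fail this test are also the redundant memory accesses targeted by the query-stage optimisations discussed above.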

1.2 Aims & Research Question

The aim of this thesis is to investigate and improve upon the performance and use of data structures capable of providing FRNN searches for complex system simulations on GPUs. This work will demonstrate how the performance of any GPGPU applications requiring FRNN searches of dynamic spatial data can be improved. This research answers the question: What techniques can be used for improving the performance of FRNN searches during complex system simulations on GPUs? In particular this thesis will address the question by focusing on modern GPU approaches to sorting at the construction stage, and reducing divergent code and redundant memory accesses at the query stage. The evaluation of improvements to FRNN search will be carried out using both theoretical analysis and benchmarks, to assess the impact across a wide range of complex system simulations. Whilst analysis can be used to initially reduce the scope, it is necessary to also apply techniques in practice using benchmarks, as the complex architecture of GPU hardware leads to performance which does not always map directly to abstract theoretical understanding. Benchmarks must be carried out in a diagnostic form, to enable understanding of how varying properties, such as population size and density, which can affect performance, are handled by optimisations. Additionally, any developed techniques must be evaluated with real modelling scenarios, as it is important to demonstrate their feasibility in practice.

1.3 Contributions

This thesis makes the following novel contributions in delivering the above objectives:

• C1 - Description and performance evaluation of a novel optimisation for construction of the USP data structure: knowledge of atomic operation performance in modern GPU hardware enables the sort central to construction of the USP data structure to be changed. By instead using atomic counting sort, an intermediate stage of construction is produced as a by-product of the sort (a brief sketch of this approach follows this list).

• C2 - Description and performance evaluation of a novel optimisation for the query operation of FRNN search using the USP data structure: understanding of the FRNN search's query operation allows the removal of redundant operations, providing benefits via the removal of divergent code.

• C3 - Description and performance evaluation of a second novel optimisation for the query operation of FRNN search using the USP data structure: theoretical analysis of the trade-off between redundant area accessed and subdivision resolution leads to an optimisation which adjusts the resolution of the environment's subdivision. This optimisation has the effect of reducing the volume of redundant messages accessed when mapping the radial search area to a set of bins.

• C4 - Demonstration of the potential of these optimisations, in combination, for improving the performance of complex system simulations, by comparing the performance of FRNN search within FLAMEGPU models before and after application of the optimisations. Performance was shown to be improved for all but the most extreme cases of low density and device under-utilisation.
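To make C1 concrete ahead of Chapter 4, the sketch below gives the general shape of atomic counting sort: the atomicAdd that builds the per-bin histogram also returns each item's unique offset within its bin, and an exclusive prefix sum over the histogram then yields the single-array PBM as a by-product of the sort. This is a hedged illustration; kernel names and array layout are assumptions for the example, not the thesis's implementation.

```cuda
// Pass 1: count bin occupancy; the value returned by atomicAdd doubles as
// the item's unique offset within its bin.
__global__ void countBins(const unsigned int *binIdx, unsigned int *binCount,
                          unsigned int *offsetInBin, unsigned int n) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    offsetInBin[i] = atomicAdd(&binCount[binIdx[i]], 1u);
}

// An exclusive scan over binCount (e.g. thrust::exclusive_scan) converts the
// histogram into the PBM: pbm[b] is the first sorted index belonging to bin b.

// Pass 2: scatter each item directly to its final sorted position.
__global__ void scatterItems(const unsigned int *binIdx, const unsigned int *pbm,
                             const unsigned int *offsetInBin,
                             const float2 *pos, float2 *sortedPos, unsigned int n) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    sortedPos[pbm[binIdx[i]] + offsetInBin[i]] = pos[i];
}
```

Note that this sort is not stable, so items within a bin land in arbitrary order; FRNN search can tolerate this, as a bin's contents are treated as an unordered set.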

1.3.1 Publications

The work in this thesis has led to the following publications:

• R. Chisholm, P. Richmond & S. Maddock - A Standardised Benchmark for Assessing the Performance of Fixed Radius Near Neighbours (2016) [38]

• R. Chisholm, S. Maddock & P. Richmond - Improved GPU Near Neighbours Performance Through Proportional Bin Widths (2020) [37]

1.4 Summary of Chapters

• Chapter 2 (Background Literature) introduces: the requirement of FRNN search in Multi-Agent Systems (MAS); how MAS requiring FRNN search are most suited to execution on GPUs; and spatial data structure background literature which demonstrates

why USP is the most appropriate data structure for GPU FRNN search. It then investigates the specification and operation of the USP data structure and FRNN search algorithm, introducing state of the art optimisations, performance considerations and further applications of the data structure.

• Chapter 3 (Benchmark Models) provides a specification and justification for the models used to assess the optimisations presented within the following chapters.

• Chapter 4 (Optimising Construction) demonstrates a potential optimisation for the construction of the USP data structure which takes advantage of recent GPU hardware advances, providing Contribution C1.

• Chapter 5 (Optimising Queries: Reducing Divergence) details the relative costs of the elements of the FRNN query algorithm, identifying that reducing divergent code leads to a potential optimisation, providing Contribution C2.

• Chapter 6 (Optimising Queries: Surplus Memory Accesses) builds on the optimisation from the previous chapter and addresses the redundant memory accessed during the FRNN query algorithm, delivering a further optimisation that works in combination with the prior optimisations, providing Contribution C3.

• Chapter 7 (Case Studies & Discussion) applies the contributions from this thesis to two potential applications, demonstrating how the knowledge contained in this thesis can be applied in practice, providing Contribution C4.

• Chapter 8 (Conclusions) collates the core insights and developments from this thesis to provide a concise reference of how FRNN search on GPUs can be best applied to simulations of complex systems.

Chapter 2

Background Literature

This chapter provides general context and background to Fixed Radius Near Neighbours (FRNN) search and how it is applied. Furthermore, it explores methods which can be used for storing and handling spatial data. The work in this chapter has been used to shape the research question. Section 2.1 introduces the field of complex system simulations. These simulations usually rely on spatial communication, often via FRNN search, ranging from scales not visible to the human eye, like cellular interaction, to communication across a global scale. As such, they are a key application for which FRNN search is used. Section 2.2 defines and introduces spatial data structures, beginning by providing examples from literature of serial techniques and how they manage storage of and access to spatially located data. Many of these data structures would not be suitable for providing FRNN search; however, they provide an important understanding of techniques for handling spatially located data. With High Performance Computation (HPC) as the primary platform of complex system simulations, Section 2.3 delivers an overview of HPC techniques. The focus of this section is the architectural considerations of GPGPU and how this affects algorithm development, as this is the platform of focus within this thesis. Section 2.4 then introduces parallel-capable data structures for handling spatially located data. Most of these we term static, where updating elements requires rebuilding the whole data structure. Dynamic data structures, where elements can be introduced and updated incrementally, could be most suitable for complex system simulations; however, hardware and architectural limitations of GPUs make them more difficult to deliver. Following the previous sections, which give context to the handling of spatial data in high performance simulations, Section 2.5 explores in depth Uniform Spatial Partitioning (USP), which has been shown to be the most prominent technique for handling FRNN search, and the state of the art techniques used to improve its performance. The final section of the chapter concludes by summarising areas for further research,

which have been highlighted by the literature, leading to the following chapters which investigate them.

2.1 Complex Simulations

Complex simulations are one of the primary fields where FRNN search is utilised. Many of these simulations model the behaviour of a large number of mobile actors that interact with neighbouring actors and the local environment. These simulations regularly utilise high-performance computing environments, where improvements to performance allow models to increase in both scale and complexity. The following subsections provide some examples of complex simulations which stand to benefit from improvements to the handling of FRNN search.

2.1.1 Pedestrian Modelling

Pedestrian & crowd modelling are concerned with the replication of the behaviour of pedestrians in reaction to each other and the environment. Models can be separated into two approaches: accurate (data driven) & aesthetic (visually driven). Whilst they are concerned with different properties for evaluation and performance considerations, they both require the replication of natural pedestrian behaviour. Common to both of these fields is the need for performant and often large scale crowd simulations. Accuracy driven implementations are expected to run faster than real-time to allow hundreds of simulations to be completed to account for stochastic behaviours and to potentially review many architectural designs. Visually driven implementations are typically required with video games and overall require the algorithms controlling crowds to run concurrently in real-time with the various other methods required to make a game function. The crowd included in a game must be visually convincing, but performant enough that it does not impact on the performance of the game. This has previously been achieved by either reducing the size of crowds or utilising simpler rules of motion to benefit performance [48].

Figure 2.1: Games such as Planet Coaster aim to simulate visually realistic and appealing crowds. Screenshot taken within the game Planet Coaster.

Earlier data driven pedestrian models initially used a macroscopic approach based on fluid dynamics techniques [73, 28, 155]. Modern techniques however primarily take a microscopic approach, modelling each individual pedestrian. The first of these techniques, the theory of social forces, was proposed by Helbing and Molnar in 1995 [72]. The social

forces model proposes that pedestrians are influenced by 4 forces. This technique has been applied to a range of applications including evacuation and detection of suspicious behaviour [145, 61, 111, 71]. Hardware limitations have also led some to simplify modelling by reducing a continuous environment into a discrete grid via Cellular Automata (CA) [162, 132, 168]. Of the styles of modelling available for crowds, the microscopic nature of the continuous social forces model provides the highest fidelity. Within a social forces model, it should be apparent that pedestrians are essentially spatial actors that smoothly avoid collisions under regular circumstances. For pedestrian actors to handle collision avoidance, it becomes necessary for them to access data related to neighbouring pedestrians. This, coupled with their mobile nature, makes them suited to benefit from improvements to FRNN search.

Figure 2.2: An example 1st order Voronoi diagram, each polygon corresponds to the area closest to the contained seed (black dot). CC by-sa 4.0, wikipedia:Balu.Ertl

Microscopic pedestrian models often use spatial subdivision to store their spatial actors. This stems from Reynolds's original 'flocking' paper which paved the way for much of the early pedestrian simulations [136]. The simplest example is that of Uniform Spatial Partitioning (USP), whereby the environment is decomposed into a uniform grid. Each pedestrian is then stored into the bin within which its location resides. Pedestrians regularly move, changing their location. As a static data structure it must be fully rebuilt each time step to accommodate these movements [58]. There are, however, more complex solutions, such as the Multi-actor Navigation Graph (MaNG), developed by Sud et al [148]. This technique utilises first and second order Voronoi diagrams to identify and encode the clearance between actors into the data structure, however, it still requires full reconstructions each iteration. The use of Voronoi diagrams, which partition space into polygons of area closest to a point (Figure 2.2), acts as a pre-processor by calculating pairwise proximity information which is stored within the MaNG. Paths of maximal clearance are calculated as a by-product of the data structure's construction. However, the time complexity for construction of the MaNG on a discrete grid of resolution m x m is presented as O(m² + n log m). The m² component is unlikely to scale effectively for larger environments. Their largest experiment demonstrated 200 actors on a grid of 1000 x 1000 resolution, producing a graph of 1927 vertices and 14669 edges, suggesting their technique is also limited to smaller crowds.

2.1.2 Traffic

Traffic modelling concerns the replication of the behaviour of vehicles on the road network. It has been used to predict the effect that network changes and events will have on traffic. There is also the wider field of multi-modal transport modelling that seeks to combine traffic, pedestrian, train and other transport models towards a unified model of transport systems [32]. However, this subsection will focus on traffic modelling, as pedestrian modelling has been covered above, and the techniques used in the implementation of other transport models are mostly analogous to either traffic or pedestrian models. Whilst existing traffic models are often concerned with models, within counties, of specific areas and roads, there is always demand for increasing performance which would permit models to be executed more frequently or at a greater scale. For example, within the UK, local authorities and Highways England both utilise various traffic models for evaluation of traffic within their jurisdictions [13]. There are several approaches to traffic modelling: macroscopic, mesoscopic, microscopic and hybrid techniques [94]. The macro, meso and micro techniques are each concerned with modelling traffic at different levels of detail. Macro is concerned with modelling traffic on a per-road basis, meso is concerned with traffic on a grouped, or platooned, basis, and micro is concerned with modelling the interaction of individual vehicles. As the modelled focus becomes smaller (roads, groups, vehicles), more actors must be simulated, leading to increased fidelity at a higher computational cost. The hybrid approach seeks to utilise aspects of microscopic models to enhance macroscopic models, whilst avoiding the full computational cost required of microscopic models. Whether the operations are occurring for each road (macroscopic) or each vehicle (microscopic), the uniform nature of actors makes them ideal for GPU acceleration. Of these techniques, microscopic simulation requires intense computation and simulates many mobile actors (vehicles), therefore it is a suitable candidate to benefit from FRNN search. There are a number of available commercial tools for traffic modelling (e.g. AIMSUN [22], Saturn [65] & VISSIM [50]). However, of the tools available, most underlying traffic models were initially developed decades ago. As such they are primarily macroscopic (due to the available computational power at the time making microscopic simulation less feasible). The open-source microscopic traffic model, Simulation of Urban MObility (SUMO), currently represents lanes as vectors, where each element is a vehicle [123]. Whilst this is intuitive, due to the nature of cars being constrained to a graph, this does not align vehicles with other vehicles in neighbouring lanes. The USP data structure used to provide FRNN search for pedestrian models is a parallel multi-map, a key-value store whereby keys can map to multiple values, and hence can instead be used to represent data tied to discrete locations such as the edges of a graph. This could instead be used to represent cars within a road network, similar to how SUMO operates. Whilst this use case is unlikely to utilise FRNN search, it would still benefit from improvements to how the USP data structure is constructed.

However, shared vehicle-pedestrian space models are increasingly required within modern urban design. A consequence of the inclusion of pedestrians within transport models is the growing demand for the representation of continuous spatial actors within traffic models, which is provided by FRNN search.

2.1.3 Smoothed-Particle Hydrodynamics

Smoothed-Particle Hydrodynamics (SPH) is a computational technique used for the simulation of fluid dynamics. The technique was first developed in 1977 at the Institute of Astronomy in Cambridge, UK, applying the Newtonian equations for particles in order to solve astrophysics problems [54, 105]. Since then SPH has been further applied to Astrophysics, alongside other fields including Ballistics, Graphics, Oceanography, Solid mechanics & Volcanology [116]. Figure 2.3 shows an example of SPH being used to simulate blood motion.

Figure 2.3: Simulation of blood flowing over a hand which was generated using SPH. Used with permission of Matthew Leach, University of Sheffield.

SPH decomposes a fluid into a discrete set of particles. A variable 'smoothing length' is used, such that any particle's properties can be calculated from the neighbouring particles that reside within range of the desired particle. Each particle's contribution to the calculation of a property is weighted according to its distance from the particle whose properties are being computed. A Gaussian function or cubic spline is often used to calculate this weight. The representation of particles within fluids consists of many thousands of mobile particles, each surveying their neighbouring particles. This shows that, similar to pedestrian and crowd modelling, SPH would benefit from a spatial data structure capable of providing FRNN searches. However, whilst pedestrians are most often modelled in 2 dimensions, SPH is almost exclusively carried out in 3 dimensions. Due to the high numbers of particles that are following the same rules, this makes SPH ideal for GPU computation. With the advent of General Purpose computation on Graphics Processing Units (GPGPU) software, many SPH simulations have been accelerated to real-time, and some researchers are now taking advantage of multi-GPU implementations to achieve even larger simulations [45]. Harada et al's early SPH GPU implementation simply stored particle identifiers inside

a bucket texture, whereby the environment is decomposed into a regular grid of buckets and each particle is assigned to a bucket which is represented as a pixel within the texture [66]. More recent implementations using GPGPU techniques utilise spatial subdivision to provide FRNN search, similar to that used in pedestrian modelling. Researchers in the field of SPH have found that the performance of the query stage of FRNN search can be improved through the utilisation of space-filling curves [56, 141] – these are explained in more depth in Section 2.5.
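To illustrate the distance weighting described above, the following sketch implements one common choice, Monaghan's cubic spline kernel, with its 3D normalisation; the function name and layout are assumptions for the example rather than code from a specific SPH package.

```cuda
// Hedged sketch of the cubic spline smoothing kernel W(r, h) in 3D.
// r: distance between two particles, h: smoothing length.
__host__ __device__ float cubicSplineW(float r, float h) {
    const float sigma = 1.0f / (3.14159265f * h * h * h); // 3D normalisation
    const float q = r / h;                                // scaled distance
    if (q < 1.0f)
        return sigma * (1.0f - 1.5f * q * q + 0.75f * q * q * q);
    if (q < 2.0f) {
        const float t = 2.0f - q;
        return sigma * 0.25f * t * t * t;
    }
    return 0.0f; // beyond 2h a neighbour contributes nothing
}
```

A particle's density is then accumulated as ρᵢ = Σⱼ mⱼ W(|rᵢ − rⱼ|, h) over its neighbours; since this kernel's support ends at 2h, those neighbours are exactly what a FRNN search of radius 2h returns.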

2.1.4 General Approaches

Outside of specific models, there are also frameworks and toolkits that ease the development of multi-actor simulations for researchers unfamiliar with advanced programming concepts. Many of these offer functionality for handling dynamic and continuous spatial data which can be utilised to implement the models described above. However, their chief concern lies with providing an intuitive interface for users to develop models with. This creates a trade-off whereby generality is gained at the cost of domain-specific optimisations. These frameworks provide an alternate perspective from which to consider whether a spatial data structure capable of providing FRNN searches would have a simple enough interface and generality to be applicable for their users. There are many agent based modelling frameworks (NetLogo [153], MASON [106], Repast [117], FLAME [92], etc). The majority of these target a single machine and are limited to execution on a CPU. A small number have been extended to utilise distributed computing [39]. More recently FLAMEGPU has provided a GPGPU agent based framework, allowing domain experts with limited programming expertise to utilise GPU computation [92]. The FLAMEGPU framework currently handles spatial data using the same spatial partitioning method used by pedestrian & crowd modelling, USP, whereby spatially located actor messages are partitioned into a uniform grid every time-step, thus allowing neighbourhood searches. Currently, most existing actor modelling frameworks target CPU implementations. The scalability of GPGPU has been shown [46] and this has catalysed the growth of GPU computation. Therefore it is clear that improvements to techniques for working with spatial data on GPUs will become more beneficial over time.

2.1.5 Summary

This section has shown the breadth of fields whereby dynamic spatial data is handled during complex simulations, and how many of these have started to adopt the use of GPU computation within the past decade to boost performance. It has also detailed some existing techniques used by these implementations for handling spatial data, identifying USP as the common data structure currently used for providing FRNN search. This provides confidence that an improvement to the performance of FRNN searches would be

capable of impacting a wide audience.

2.2 Spatial Data Structures

There are many existing techniques used for working with spatial data, covering a plethora of applications. The majority of these data structures are static, relying on a complete reconstruction to accommodate the movement of even a small number of elements. The section below discusses existing data structures, examining the underlying techniques used for organising spatially located data and the functionality they provide. Whilst not all of these data structures provide the FRNN search functionality required of complex system simulations, it is valuable to understand how they handle spatial data to determine whether the techniques are applicable to the delivery of FRNN searches. The following subsections discuss some of the techniques used by hashing and tree data structures. These two families of data structure are both highly suitable for storing spatial data.

2.2.1 Hashing Schemes

Hash-Tables, Hash-Maps and Hash-Sets are a family of data structures that fundamentally rely on hashing to provide an index that is used in the storage and retrieval of data from an array. Hashes are good for working with any data whereby random access is required and keys do not form a contiguous block or regular pattern. This makes them suitable for storing spatially located data. Some hashes used with spatial data allow data to be stored coherently in bins of their local neighbourhood. Whilst this locality benefits memory accesses, most hashing data structures rely on data being uniformly distributed. There are many different methods for implementing hashing data structures and each implementation has slightly different properties that affect the performance of insert and query operations. Hashing structures are inherently statically sized. Items are placed into an array according to how their hash indexes into the array. This does however mean that once the load-factor (the ratio of items to maximum items) becomes too high, collisions are increasingly likely and the data structure must be rebuilt with larger storage (some implementations may also rebuild if an item's hash exceeds a given number of collisions during insertion). This process can be time consuming when working with unknown or variable quantities of data, as every item must be re-hashed. The action of growing a hash-table requires memory for both the old and new hash-tables simultaneously, which can easily lead to exceeding memory bounds when working with large tables or within tight memory constraints. Similarly, as items are deleted, it may be necessary to rebuild the data structure to reduce its memory footprint.

Good Hash Functions

In most cases a hash function should deliver approximate uniform hashing to reduce collisions. It is most common for input keys either to be provided as, or coerced into, an integer. This could take the form of reducing spatial coordinates to a single integer. It can be important that the range of keys is assessed prior to coercion, to ensure the coercion itself is unlikely to generate collisions. The integer form of the key is then mapped to an index for storage via a hash function. There are various techniques for creating an appropriate hashing function. When working with known data there are techniques for creating perfect hash functions, where all hashes are unique. These are more applicable to the handling of spatial data within complex systems, as continuous locations may be discretised and environments are often of a known finite scale. If a 2-dimensional environment is bounded by a square boundary, the environment can be discretised with a uniform grid with cells indexed linearly, providing a means of coercion from a 2-dimensional continuous location to a single integer.
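As a concrete sketch of this coercion, the function below maps a continuous 2D location within a known square environment to a linearly indexed grid cell, yielding a perfect hash (every cell maps to a unique integer). Parameter names and the boundary clamp are illustrative assumptions.

```cuda
// Hedged sketch: coercing a continuous 2D location to a single integer key
// by discretising a bounded square environment into a uniform grid.
__host__ __device__ unsigned int linearCellIndex(float x, float y,
                                                 float envMin, float envMax,
                                                 unsigned int gridDim) {
    const float cellWidth = (envMax - envMin) / gridDim;
    unsigned int cx = (unsigned int)((x - envMin) / cellWidth);
    unsigned int cy = (unsigned int)((y - envMin) / cellWidth);
    if (cx >= gridDim) cx = gridDim - 1; // clamp points on the upper boundary
    if (cy >= gridDim) cy = gridDim - 1;
    return cy * gridDim + cx;            // row-major linear cell index
}
```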

Collision Resolution

When the input set of items to a hash function is unknown, it is not possible to generate a hash function which avoids multiple inputs producing the same hash. Furthermore, some hashing data structures allow multiple values to be attached to the same input. There are two approaches to handling these collisions, whilst retaining efficient insert and lookup performance: separate chaining and open-addressing.

Figure 2.4: A visual representation of a hash table that utilises separate chaining.

Separate chaining is a technique whereby items with the same hash are stored in a list at that address (Figure 2.4). This technique requires the load-factor of a table to be maintained at a level such that chains do not become long [95]. In contrast, open-addressing techniques seek to place all items within the same address space. This can lead to primary and secondary clustering, whereby many items collide before a suitable location is found. It also becomes necessary within open-addressing schemes to replace deletions with a special delete value. If deleted items were instead marked as empty, searches would fail to find items that had previously collided with the now empty location. As such, deletions may not speed up searches under open-addressing. To place items under open-addressing, probing is used, whereby the value of the hash is incremented with a fixed counter (linear probing, Figure 2.5), the second power of a counter (quadratic probing) or a hash of a counter (double hashing). This occurs until a unique index is found.
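For illustration, the host-side sketch below inserts keys under linear probing, matching the probe sequence h(k, i) = ((k + i) mod m) used in Figure 2.5; the fixed table size and empty-slot sentinel are assumptions for the example.

```cuda
#include <vector>

// Hedged sketch of open-addressing insertion via linear probing.
struct LinearProbeTable {
    static const unsigned int EMPTY = 0xFFFFFFFFu; // sentinel for a free slot
    std::vector<unsigned int> slots;
    explicit LinearProbeTable(size_t m) : slots(m, EMPTY) {}

    bool insert(unsigned int k) {
        const size_t m = slots.size();
        for (size_t i = 0; i < m; ++i) {
            const size_t idx = (k + i) % m;  // probe: h(k, i) = (k + i) mod m
            if (slots[idx] == EMPTY) {       // first free slot ends the probe
                slots[idx] = k;
                return true;
            }
        }
        return false; // table full: a grow and re-hash would be required
    }
};
```

Lookup follows the same probe sequence, which is why deletions must leave a special marker rather than an empty slot, as noted above.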

Quadratic probing using triangular numbers rather than a counter avoids secondary clumping issues, assuming a good initial hash function, leading to improved insertion at a high load-factor [93].

Figure 2.5: A visual representation of a hash table that utilises linear probing using the hash function h(k, i) = ((k + i) mod 11).

There are numerous other techniques of collision resolution that build on separate chaining and open-addressing, such as cuckoo hashing [128], hopscotch hashing [75], Robin Hood hashing [34] and coalesced hashing [158]. These techniques utilise eviction of existing items and additional hashing functions to reduce the worst case search time. However, the best and average cases are often outperformed by common techniques (e.g. linear probing), so use is largely limited to tasks requiring their worst case search guarantees.

When dealing with spatial data in continuous space and hashing data-structures, it becomes necessary to discretise the space so that queries can be performed by referring to an area, rather than requiring the exact coordinates. Within a complex system simulation, this is likely to lead to multiple actors sharing the same key, hence a multi-map or form of separate chaining would be more favourable for retrieving all elements at a given discrete location. This is a common feature within parallel implementations of FRNN search using uniform spatial subdivision, discussed further in Section 2.4.

Some applications of hashing data structures, e.g. nearest neighbour search (which returns a single result), are instead concerned with ensuring that collisions occur for pairs within a fixed proximity. This approach is termed locality-sensitive hashing and works around the basis that if two points are close together in high dimensional space they should remain close after projection to a low dimensional space. However, far apart points may also become close after this same projection (e.g. the poles of a sphere, where from the side they appear separated, but from above or below they overlap). Cao et al provide a thorough review of locality-sensitive hashing techniques [31], in particular, Rasmus Pagh's proposed locality-sensitive hashing scheme, which does not produce false negatives [127]. Their theoretical technique improves over earlier approximate methods, with little or no cost to the efficiency (measured in required memory accesses). The approach hashes keys with multiple hashing functions, where each represents different projections, to construct a bitmask identifying projections whereby the points appear within the desired range. The innovation allows projections to be selected such that all possible ranges are covered, such that false negatives are eliminated. This work leaves open both the questions whether it can be applied to areas beyond Hamming distance,

and whether the gap in efficiency can be completely closed, or proven necessary. Importantly, locality-sensitive hashing schemes are concerned with encoding the separation across a population. This contrasts with FRNN search, which seeks to identify each point's separation from every other point independently. Hence, this approach is unlikely to out-perform perfect hashing via the linear discretisation of a continuous environment into a grid, which is far more appropriate for FRNN search and the low dimensionalities found within complex system simulations.

Summary

This subsection has explained the central components of hashing data structures and how they can be applied to the storage of spatial data. However, in order to maintain spatial coherence, open-addressing schemes must be avoided due to their reliance on uniformly distributed keys. It is worth considering the performance benefits of spatial coherence, both in the sense of how it facilitates neighbour access and may improve performance via coalesced memory operations. Of the hashing techniques covered, spatially coherent perfect hashing, whereby continuous space is discretised into a linearly indexed uniform grid, appears most appropriate for retaining the locality information required by FRNN search. Other techniques such as locality-sensitive hashing may have applications within algorithms requiring proximity information, e.g. nearest neighbour; however, the difference in algorithm focus would lead to significant inefficiencies if applied to FRNN search due to the high number of queries required. The challenges introduced by implementation of a hashing scheme in parallel are explained in Section 2.4.1, which includes greater detail on spatial techniques.

2.2.2 Trees

A tree in computing consists of a root node, which is the parent of one or more child nodes. Subsequently each of these child nodes may have one or more children of their own, and so on. A node with no children is referred to as a leaf node. The maximum number of children a node may have is referred to as a tree's branching factor. This value controls how the tree will branch out as nodes are inserted. Different types of tree have different branching factors. There are many classes of tree: search trees (2-3, 2-3-4, AVL, B, B+, Binary Search, Red-Black, etc), heaps (Binary, Binomial, etc), tries (C-, Hash, Radix), spatial partitioning trees (BSP, k-d, Octree, Quad, R, R+, etc) and several other miscellaneous types.

Spatial Trees

There are several forms of spatial tree and their purposes vary from rendering to providing access to large databases of spatial data.

BSP trees are the data structure used in binary spatial partitioning, an algorithm for the recursive subdivision of space by hyperplanes into convex subspaces. First published in 1980 by Fuchs et al [53], BSP trees provide rapid access to the front-to-back ordering of objects within a scene from the perspective of an arbitrary direction. This makes them ideal for rendering, whereby the visibility of faces within a scene must be computed regularly. The binary spatial partitioning algorithm splits entities that cross the partitioning hyperplanes, which can cause a BSP tree to have far more nodes than entities present in the original input.

k-d trees are binary trees (branch factor of two) that form a special case of BSP tree. Published by Bentley in 1975 [26], they are used for organising elements in a k-dimensional space. They are still widely used today, often to provide approximate k-nearest neighbour search [84, 79], which has similarities to FRNN search. However, k-nearest neighbour search is usually applied to machine learning which often operates in much higher dimensionalities than required of complex system simulations. Every non-leaf node acts as an axis-aligned hyperplane, splitting all elements below the node on the chosen axis from those above the node. The layers of a k-d tree must cycle through the axes in a continuous order (e.g. x-axis, y-axis, z-axis, x-axis...), such that all nodes at a specific height partition the same axis. Depending on the implementation, elements may be stored in all nodes or only leaf nodes. The structure of k-d trees makes them efficient for performing nearest-neighbour searches (which provide a single result, as opposed to FRNN searches) in low dimensional spaces, such as collision calculations. In high dimensional spaces (n < 2^k, where n is the number of elements & k the number of dimensions) most elements will be evaluated during a search.

Figure 2.6: A visual representation of a region quadtree. The black outline shows the root node, whose child nodes have a red outline; this continues through orange, green and blue.

Quadtrees are tree data structures in which every parent node has 4 children, most often used for partitioning 2 dimensional space. They were first defined in 1974 by Finkel & Bentley, as a means for storing information to be retrieved using a composite key [51]. Each leaf node holds a bucket; when a bucket reaches capacity it splits, creating 4 children. The most recognisable form of quadtree is that of a region quadtree (Figure 2.6), whereby partitioning creates 4 nodes of equal volume. Region quadtrees are often used for representing variable resolution data. The point quadtree is used to represent 2 dimensional point data and requires that a point lie at the center of each subdivision.

Octrees are a 3 dimensional analogue of quadtrees, first described by Meagher in 1980 [110], although the initial quadtree paper [51] had stated the ease in which quadtrees could be scaled to higher dimensions. Every parent node within an octree has 8 children, allowing

them to be used for partitioning 3 dimensional space. Octrees are different from k-d trees in that an octree splits about a point whereas a k-d tree splits about a dimension. Octrees are often used for volume rendering and ray tracing.

R trees are a balanced tree data structure used in the storage of multi-dimensional data. They were first proposed by Guttman in 1984 as a dynamic structure for spatial searches [63], and have since formed the basis for many variants [144, 24, 20, 130, 125] which have been applied to a variety of problems, such as providing k-nearest neighbour search [82], which relates closely to FRNN search. The strategy behind R trees is to group nearby objects under their minimum bounding rectangle (hence the R, in R tree). Data structures which partition based on object proximity are known as Bounding Volume Hierarchy (BVH) structures. As all objects are contained within bounding rectangles, any query that does not intersect a rectangle cannot intersect any of the contained objects. R trees are implemented such that all leaf nodes reside at the same height (similar to B trees used in databases), which allows data to be organised into pages, allowing easy storage to disk and fast sequential access. Additional guarantees are provided as to the occupancy of leaf nodes, which improves query performance by guaranteeing leaves are at least 30-40% occupied. However, challenges are introduced in trying to maintain balance whilst reducing the empty space covered by rectangles and inter-rectangle overlap. There are several variants of the R tree which are designed to improve certain qualities at the cost of others. R+ trees [144] are a variant that permit elements to be stored in multiple leaf nodes, whilst preventing non-leaf nodes from overlapping and providing no guarantees as to the occupancy of leaf nodes. R* trees [24] use an insertion algorithm that attempts to reinsert certain items when an insertion causes a node to overflow, and an alternate node split algorithm which improves queries at the cost of slower construction. Priority R trees [20] are a hybrid between the R tree and k-d tree, providing a guarantee of worst case performance. However, this comes at the cost of stricter rules that must be enforced during construction. M trees [130], whilst not directly related to R trees, can be visualised as an R tree whereby rectangles have been replaced with circles.
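To make the axis-cycling and plane-pruning behaviour of k-d trees concrete, the following sketch (our own C++ illustration, not drawn from any cited implementation) shows a minimal 2-dimensional k-d tree node and a fixed-radius query of the kind FRNN search would issue. Subtrees are only descended when the search radius straddles their splitting plane.

    // Minimal 2D k-d tree sketch: each level splits on alternating axes.
    #include <cmath>
    #include <vector>

    struct KdNode {
        float p[2];        // point stored at this node
        KdNode* child[2];  // [0] = below the splitting plane, [1] = above
    };

    // Gather all nodes within radius r of origin, cycling the split axis
    // (x, y, x, ...) and pruning subtrees beyond the splitting plane.
    void radius_query(const KdNode* n, const float origin[2], float r,
                      int axis, std::vector<const KdNode*>& out) {
        if (!n) return;
        const float dx = n->p[0] - origin[0], dy = n->p[1] - origin[1];
        if (dx * dx + dy * dy <= r * r) out.push_back(n);
        const float d = origin[axis] - n->p[axis];  // signed distance to plane
        const int next = (axis + 1) % 2;
        radius_query(n->child[d > 0], origin, r, next, out);
        if (std::fabs(d) <= r)  // search sphere straddles the plane
            radius_query(n->child[d <= 0], origin, r, next, out);
    }

In low dimensional spaces most subtrees fail the straddle test and are pruned; in high dimensional spaces the test passes frequently, which is why most elements end up being evaluated.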

Techniques for Balancing

Balanced trees are trees where all leaf nodes are at the same height. When data is only stored in leaf nodes, this ensures every query traverses the same number of levels, giving consistent, logarithmic query times. A tree in its most unbalanced state is effectively a linked-list, requiring a worst case of every node being considered during queries. As such it is important to ensure that trees are created and remain balanced, or as quasi-balanced as feasible. Self-balancing binary trees, such as Adelson-Velsky & Landis (AVL) trees [12] and red-black trees [59], operate by storing additional balancing information at each node. On insertion, a new node surveys the balancing information of its lineage. From this it is able to determine its own balancing information and update a parent's if necessary. When another node's balancing information is changed, its child subtrees may need to be rotated.

A rotation performs a static transformation of the subtree's connectivity, to create a quasi-balanced state. Spatial trees most often use algorithms at construction to ensure balance; these differ based on both the requirements of the particular tree and the distribution of data. For example the algorithms of Sort-Tile Recursive (STR) [100] and Hilbert sort [89] can both be used for construction of R-trees, however STR should be used for uniformly distributed data, whereas Hilbert sort performs best on highly skewed data. However, as the scale of the region being queried increases, the difference in performance reduces. Balancing has the effect of increasing the cost of insertion/deletion and/or construction, to provide significantly faster queries. In most applications of FRNN search within complex system simulations, each actor inserts a single message and performs a single query, such that an expensive construction may be unsuitable for the relative quantity of queries.
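The rotations mentioned above are small, constant-time pointer manipulations. As a concrete illustration, the following sketch (ours, in C++) shows the left rotation used by AVL and red-black trees; the node layout is illustrative.

    struct Node {
        int key;
        Node *left, *right;
    };

    // Hoist x's right child y above x, preserving in-order ordering while
    // shortening y's (previously deeper) right subtree by one level.
    Node* rotate_left(Node* x) {
        Node* y = x->right;   // y becomes the new subtree root
        x->right = y->left;   // y's left subtree is re-parented under x
        y->left = x;          // x becomes y's left child
        return y;             // caller re-attaches y in x's old position
    }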

Summary

This section has shown that trees make up a wide array of data structures, many of which are useful for handling different forms of spatial data. There are techniques used by binary trees that permit the maintenance of balance during inserts, however, when the branching factor increases due to multi-dimensional data this becomes infeasible. Of the spatial trees, k-d trees and the family of R-trees appear most appropriate for delivering FRNN search, as evidenced by their use in providing k-nearest neighbour search. k-d trees appear most widely used for applications similar to FRNN search, however they typically serve approximate methods in high dimensional spaces, which does not align with the requirements of complex system simulations. R-trees on the other hand are most often used with large spatial databases which are too large for available memory; whilst this has parallels with distributed workloads, it may limit their suitability for FRNN search in complex system simulations. In general for the trees in this section, whilst able to provide FRNN search, the cost of balancing, the depth of the leaf nodes and parallel suitability may all be prohibitive for use within complex system simulations. Section 2.4 expands on this by exploring the application of trees to parallel architectures.

2.2.3 Kinetic Data Structures

Independent of the previously discussed data-agnostic structures, there exist kinetic data structures, as coined by Basch and Guibas [23, 60]. These data structures seek to take advantage of the knowledge that entities in physical space often move along continuous trajectories. Kinetic data structures intend to exploit this knowledge to improve efficiency, whilst still allowing entities to change direction without an expensive update. Within their research, they applied these data structures to numerous spatial problems

including Convex hulls¹, Closest pair and Minimum spanning trees². Whilst of their own admission not all solutions are efficient, they demonstrate the theoretical complexity of their techniques when applied to variations of the closest pair problem with moving points.

Research around kinetic data structures has continued, with Rahmati and Chan demonstrating further improvements for the closest pair problem [134]. Their technique acts by maintaining a set of clusters of the moving points. Their clustering follows the technique of Hershberger [76], which clusters in strips perpendicular to the x axis. In 1 dimension, it can be thought of as maintaining the leftmost and rightmost members of each set via a kinetic tournament (a tournament maintained based on the points' trajectories). When two sets collide, their outermost points are exchanged. If a set becomes too wide (the separation between the outermost points is greater than 1), its rightmost point is separated into a singleton set³. Similar rules apply for merging sets, insertion and deletion, however, in order to maintain efficiency, more complex rules must be applied.

Of note within Hershberger's research is that, to utilise their data structure in 2 dimensions, points must be projected into the two axes separately. They clarify that, due to the complexity of the rules required by their 1-dimensional solution, a significant breakthrough would be required to apply their technique directly to a two dimensional scheme. The research surrounding kinetic data structures appears limited to the theoretical domain, with papers highlighting favourable complexities via big-O notation, and no evidence of implementation or performance metrics.

¹ The smallest convex set (of points) which contains the entire set.
² A tree, without cycles, which is a subset of the edges of a graph and connects all vertices together for the minimum possible total edge weight.
³ A set with exactly one element, also known as a unit set.

2.2.4 Summary

This section has explored techniques capable of handling spatial data. These can be divided into hashing, trees and kinetic data structures. Whilst kinetic data structures appear limited to the theoretical domain, hashing and trees have been used for decades to store spatial data. Due to the diverse range of applications for spatial data, not all data structures covered are appropriate for providing FRNN search within complex systems simulations. However, this section has highlighted relevant techniques such as perfect hashing which can be used to decompose a continuous space into a coherent key space. Similarly, trees have been shown to provide a good representation of locality, with k-d trees and R trees capable of delivering FRNN search. A key difference between hashing and tree data structures is how directly data can be accessed. Hashing often provides near direct access to items, whereas trees require traversal which may require iterating over many branches. Considering each technique scaled up to many


FRNN queries being performed in parallel, the many additional memory accesses required to traverse a tree may act as a handicap. Section 2.4 expands on this section by exploring how these and other data structures are used in parallel and in the provision of complex system simulations.

2.3 High Performance Computation

The previous section's spatial data structures, whilst optimised for performance in single threaded execution, are still often unsuitable for operating at scales involving potentially millions of actors. In order to scale to support millions of actors concurrently, complex system simulations require higher performance than can be provided by a single thread of execution. This section discusses the hardware methods used to achieve the high performance necessary to execute simulations at scale, and considerations which impact code development for parallel and distributed architectures.

Due to the thermal limitations which have largely stalled increases to the clock speed of processors, it has become necessary to utilise parallel forms of computation to improve the throughput of algorithms which demand high performance. Parallel algorithms can be decomposed by a programmer in one of two styles: task-parallel or data-parallel. Task-parallelism decomposes a problem into threads which operate independently, often on distinct tasks, communicating when necessary. Comparatively, data-parallelism decomposes a problem according to the data set being operated on, such that each thread performs the same algorithm on a different block of data. Task-parallelism is most often found in the form of multi-core CPUs and computing clusters. Data-parallelism is most often associated with General Purpose computation on Graphics Processing Units (GPGPU), due to the dedicated architecture found in GPUs. Modern CPU and GPU processors also take advantage of pipeline parallelism, a specialised form of task parallelism, where concurrent independent instructions are executed to reduce idle time. The following subsections discuss some of the techniques used to provide HPC. Greater focus is applied to GPGPU acceleration as it corresponds to the hardware most appropriate to complex system simulations.

2.3.1 General Purpose computation on Graphics Processing Units

GPGPU has been greatly popularised in the past decade, moving from programmers repurposing the rendering pipeline to dedicated languages, frameworks and specialised hardware options. This has led to the power of GPU computation being leveraged to speed up processing within many research domains. The energy efficient architecture of GPUs has made GPGPU a more affordable alternative to the more traditional HPC offerings [126]. GPUs are increasingly used within the world's most powerful supercomputers, as a single GPU can often provide a significant speed-up to otherwise CPU limited code [11]. The performance increase from multi-core CPU to GPU varies widely between algorithms.

Some research suggests 100-fold increases can be achieved, whereas Intel's own research insists that if code is correctly optimised for multi-core CPUs, this falls to a maximum of 5-fold benefit. This can be attributed to memory bandwidth, for problems which when optimised are found to be memory bound. There is certainly potential for the lesser awareness of, and greater difficulty in applying, CPU optimisation techniques having an effect on the divide [97].

It is possible to leverage multiple GPUs simultaneously to further increase performance. NVIDIA GPUs have NVLink and NVSwitch functionality which allows multiple GPUs to work in parallel. With this technology, devices are able to communicate directly via the high bandwidth interconnect, essentially allowing multiple GPUs to act as a single device. Without this technology the programmer is required to pass data between devices via the slower PCI Express (PCIe). This can be achieved using Message Passing Interface (MPI), a standardised protocol for message passing during parallel computation [6]. There are many implementations of MPI available, and they can be used for everything from passing messages between individual GPUs in the same machine to passing messages between many machines over the internet. Earlier GPU interconnect technologies used to improve the performance of graphical rendering, NVIDIA's Scalable Link Interface (SLI) and AMD's Crossfire functionality, have not been used for GPGPU computation.

GPUs do not have direct access to the memory of the host system, although programmatically the combined host and device memory are able to be represented using a single unified address space. Transferring data between the host system and a GPU occurs over the PCIe bus and incurs latency and throughput limitations. Volkov and Demmel found these limitations to be a latency of 15µs and throughput of 3.3GB/s [161]. However, their research targeted PCIe 1.1 hardware, whereas current GPUs use PCIe 3.0 which has greatly improved throughput. This latency is 150 times the documented latency of CPU memory accesses [101], therefore it is necessary to ensure that enough processing is occurring on the GPU to offset any costs incurred by this bottleneck. A version of NVIDIA's NVLink technology, available in high end enterprise solutions, can transfer data between CPU and GPU at throughputs 5-12 times faster than those offered by PCIe, thus reducing this gap [4]. The following subsections provide an understanding of the underlying architecture found in GPU hardware, frameworks used to utilise this hardware and a case study of the performance gains achieved with GPGPU computation on several algorithms.

GPU Architecture

In a broad sense each GPU consists of several Stream Multiprocessors (SMs) and multiple GBs of global memory. Each SM contains multiple schedulers and caches, where the schedulers issue the instructions which perform the execution. However, the Single Instruction Multiple Threads (SIMT) execution model is used, so each instruction is executed for a group of multiple threads, referred to as a warp. If one thread within a warp

needs to branch, then every thread must branch, and any that are not required to follow this code path remain idle until the branch returns (see Figure 2.7). This general architecture is the same for all GPUs. The remainder of this section provides specific architecture details of NVIDIA GPUs for completeness.

Each NVIDIA GPU consists of 1-72 SMs and several GBs of global memory (DRAM). There are also several (significantly smaller, KB scale) caches within each SM. These caches are used for: accessing read-only texture memory; reading global memory; storing data within constant, local and shared memory.

When using CUDA, a function to be executed on the device is called a kernel. When a kernel is launched, the programmer must specify how many threads to launch, which controls how many instances of the code within the kernel are executed. The number of threads is defined by providing grid and block dimensions. Grids and blocks are both 1-3 dimensional structures, where each index within a grid is a block and each index within a block is a thread. At runtime blocks are assigned to SMs, and are then further partitioned into warps of 32 threads that simultaneously execute. If less than 32 threads are required, the resources for 32 threads are still used but the excess threads are masked from execution, remaining idle. Blocks must be capable of executing independently, as they may execute in any order across multiple SMs.

Figure 2.7: Illustration of the SIMT parallel execution model used by GPUs. The threads within the first warp (threads 0-31) contain multiple values of x, therefore both branches must be executed separately. All threads within the second warp (threads 32-63) have the value false for x, so only the green branch is executed.

Since the introduction of the Kepler architecture (2012) [1], each SM has consisted of four schedulers capable of issuing dual instructions. This feature allows a scheduler to issue two consecutive instructions to a warp simultaneously if the instructions do not interact. Dual instructions cannot be dispatched for two different warps simultaneously. Each scheduler is capable of managing several warps (from different kernels) concurrently, and instructions can be dispatched for a queued warp while the previous warp waits. This allows latencies to be hidden if enough warps are managed concurrently. The number of concurrent warps that can be managed by each scheduler is limited by the resources required by each warp (e.g. registers or shared memory). There is also a device specific hardware limitation of the maximum threads per SM. Current devices have this limit at 2048, which divides (32 threads and 4 schedulers) to 16 warps per scheduler [114].

Each executing thread is allocated up to 255 32-bit registers. However, each SM only

has 65,536 registers available [118, 122]⁴. Therefore, as the number of warps each SM manages increases, the available registers per thread decreases. If a thread requires more local memory than is available within the registers, memory accesses will spill from the registers into the L1 cache, increasing the latency of memory accesses. The size of the L1 cache varies between architecture and configuration, from 16-128KB. If the L1 cache is filled, memory accesses subsequently spill into the L2 cache, further increasing the latency of memory accesses. The L2 cache is shared by all SMs and has a size from 256-6144KB. Furthermore, should the L2 cache become filled, memory accesses will be deferred to the global memory. Global memory has a latency 100x that of the L1 cache, therefore large quantities of register spilling can become a significant overhead.

⁴ Per-SM register capacity is currently the same for all GPU architectures (Kepler through Turing).

Whilst a thread is executing, it also has access to both global memory and texture memory. Global memory accesses are always cached inside the L2 cache. When accessing large amounts of read-only data, utilisation of the texture cache can provide faster memory reads. Devices prior to CUDA compute 3.5 are required to manually copy data to texture memory to make use of this cache, whereas 3.5 and later devices will automatically make use of this feature where the compiler detects its suitability, instead referring to it as the read-only cache. Use of const and __restrict__ qualifiers will assist in compiler detection, and __ldg() can be used to force data loads via the read-only cache. Architectures since Pascal have combined the texture cache with the L1 cache, reducing the significance of texture memory in many cases.

Figure 2.8: Illustration of the memory hierarchy present within the Kepler GPU architecture. The newer Maxwell architecture has moved the L1 cache so that it is shared with the read-only cache.

There are two additional types of memory available to threads: constant and shared memory. However, they both require special consideration. The constant cache can only process one request at a time. Memory accesses to the same address are combined, therefore the effective throughput is divided by the number of separate requests. The constant cache is 64KB in size, and cache misses must be loaded from device memory. Shared memory on the other hand is on-chip memory that is shared between all threads within a block. This memory has higher bandwidth and lower latency than both local and global memory. Shared memory is, however, divided into equally sized memory banks. Memory banks are organised such that consecutive 32-bit words are assigned to separate


banks. Each memory bank can only service a single 32-bit word per request, therefore if multiple threads attempt to access different addresses within the same bank, accesses to each address will be serialised. Each SM has access to 16-64KB of shared memory, depending on architecture and configuration.

Each SM also contains a mixture of Single Precision (fp32) units (often referred to as CUDA cores or stream processors) and Double Precision (fp64) units. These units are capable of performing 32bit and 64bit arithmetic, respectively. Special Function Units (SFUs) are able to perform 'fast math' operations on fp32 values⁵. The quantities of these three units vary between architecture, chip and product line. For example, the workstation and HPC product lines (Quadro and Tesla respectively) generally contain Error Correction Code (ECC) memory and have a higher ratio of fp64 units to fp32 units than the domestic counterpart (GeForce). These quantities affect the operations that each device excels at. Additionally, as the price for a GPU increases, the memory bus width usually increases too. The memory bus width refers to the number of bits of global memory that are read per transaction. Therefore, if data requested by a thread is a contiguous block of 128 bits, it will be read in one transaction by a memory bus of 128 bits width, rather than the multiple transactions required if the data were scattered or the memory bus smaller. Use of read-only caches can reduce the influence of this on scattered reads; unnecessary data is cached when read, and later instructions that request cached addresses then do not incur the latency of a global memory read transaction.

⁵ CUDA provides several low precision intrinsic functions such as __sinf() and __logf() that execute in significantly fewer instructions than their more precise and versatile counterparts.
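As an illustration of both bank conflicts and coalescing, the sketch below (a common CUDA idiom, our own rather than from a cited source) transposes a 32x32 tile via shared memory; it assumes a square matrix whose width is a multiple of the tile size. The single column of padding changes the stride so that a warp reading a column touches 32 different banks rather than one.

    #define TILE 32

    __global__ void transpose_tile(const float* in, float* out, int width) {
        // Without the +1 padding, tile[threadIdx.x][...] would place every
        // thread of a warp in the same bank, serialising their accesses.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced read

        __syncthreads();  // the whole tile must be written before reading

        x = blockIdx.y * TILE + threadIdx.x;  // transposed block position
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
    }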

Additional Fixed Function Features of GPUs

As GPU technology inches closer to the limits of Moore's Law, which predicted regular increases in transistor density on processors, architectural trends have started to move towards task specialisation. New GPU architectures, released by NVIDIA, increasingly target the demands of common GPU applications. Despite this specialisation, these features are also often made available through CUDA, allowing developers to find ways to apply this specialisation to speed up applications from unrelated domains.

Tensor cores, available since the release of NVIDIA's Volta architecture in late 2017, are engineered to accelerate the training and inference of neural networks [120]. They were designed in response to the explosive growth in research surrounding deep learning, engineered specifically to execute the 4x4 matrix calculations central to both the training and inference of neural networks. The specialisation of Tensor cores to training and inference has delivered peak performance improvements of 12x and 6x respectively over the previous Pascal architecture. The more recent T4 Turing Tensor cores, with greater support for the multi-precision calculations required by inference, have enabled up to


32x inference calculation throughput over Pascal when working with lower precision data [121]. Following the introduction of Tensor cores, CUDA 9 provided access to the special functionality of Tensor cores to perform matrix multiplication. The matrix multiplication is available both at the intended matrix size (4x4) and for arbitrary sized matrices, enabling application to fields such as linear algebra. By instead utilising the Tensor cores for mixed precision calculations, performance can be improved by 6x and 3x over single and half precision respectively when utilising previous methods on the same hardware [109]. The cost of mixed precision however is an error rate which increases with the size of the input matrix, due to the rounding from single to half precision. Techniques have been proposed to mitigate these precision errors, so it is expected that HPC developers will increasingly look to take advantage of Tensor cores in their applications.

RT cores, first available with the introduction of NVIDIA's Turing architecture, were developed to enable faster and higher fidelity graphical realism in commercial rendering and videogames. In combination with machine learning noise reduction, RT cores have enabled real-time ray-traced graphics on domestic GPUs. RT cores have also enabled significant speed improvements for commercial graphical applications, by calculating 10⁹ rays per second, allowing scenes to be rendered 10x faster than when using prior GPU hardware [121]. The technology behind RT cores is highly specialised for the task of bounding-volume hierarchy traversal, which is central to the process of ray-tracing [121]. Whilst bounding-volume hierarchies such as Octrees are highly applicable to spatial tasks such as FRNN, at the time of writing, September 2019, RT cores have not yet been made accessible to CUDA developers.
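For completeness, the sketch below shows the warp-level interface CUDA exposes for the Tensor cores discussed above (nvcuda::wmma, available since CUDA 9). It performs a single 16x16x16 half-precision multiply-accumulate; the matrix pointers, layouts and single-warp launch are assumptions of this illustration rather than details from the cited works.

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // c (16x16, float) += a (16x16, half) * b (16x16, half), on Tensor cores.
    // Launch with one warp (32 threads); a, b, c reside in device memory.
    __global__ void wmma_16x16x16(const half* a, const half* b, float* c) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);      // zero the accumulator
        wmma::load_matrix_sync(a_frag, a, 16);  // 16 = leading dimension
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
    }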

Optimisation

In addition to obvious considerations of the architecture introduced above, there are further techniques used to improve the performance of GPGPU applications. With the primary application of GPGPU being HPC, optimisation guidance is far more prevalent than for CPU optimisation, although in many cases GPU optimisation techniques are relevant to CPU optimisation.

Traditionally in CUDA, occupancy refers to the number of active warps divided by the maximum number of warps. Ensuring maximal device occupation, so that the hardware within the architecture is utilised as much as theoretically possible, is a primary target of optimisation. This can also be referred to as thread-level parallelism. Occupancy is bounded by the limit of 2048 threads per SM and the capacity of registers and caches. Providing SMs with multiple warps or instructions to execute simultaneously helps hide memory latencies, as other instructions can be executed whilst waiting for memory. Optimal grid and block dimensions can increase occupancy via thread-level parallelism and reduce execution times in some scenarios by 23% [154]. Whilst it is theoretically possible to calculate the best grid and block dimensions for a kernel when applied to a specific

device, in most cases that can be avoided. It is often far simpler to benchmark the kernel against a large variety of dimensions to find the best performing combination, which better accounts for missed variables that may affect occupancy. When purely trying to increase occupancy, register spilling is quite a realistic risk — typically occupancy above 50% does not translate to increased speed-up. CUDA 6.5 introduced functionality for automatically optimising block dimensions for occupancy [67], however, these are not a complete solution to optimisation as they do not account for kernels instead bounded by memory accesses. Using the CUDA compiler (NVIDIA CUDA Compiler (nvcc)) option -Xptxas -v,-abi=no will print the number of bytes of local memory used by each kernel, providing better information for manually partitioning kernels. The CUDA profiler also contains several counters that can aid in identifying register spilling [113].

Instruction-level parallelism provides an alternate approach for optimising memory access bounded algorithms — increasing performance whilst reducing shared memory traffic and the number of active warps. Instruction-level parallelism seeks to decrease occupancy by performing enough work within each thread to hide memory latencies without SMs switching active warps. Additionally, the increased utilisation of registers instead of shared memory allows the avoidance of bank conflicts, which occur when accesses to shared memory are misaligned and result in serialised accesses.

Algorithm 1 Pseudo-code for a vector addition kernel without instruction level parallelism.

    addition_kernel(a,b,c)
        i = thread.id
        c[i] = a[i] + b[i]

Algorithm 2 Pseudo-code for a vector addition kernel that has been optimised with x2 instruction level parallelism.

    addition_kernel_ilp_x2(a,b,c)
        //Read
        i = thread.id
        ai = a[i]
        bi = b[i]
        j = i + (grid.width * block.width)
        aj = a[j]
        bj = b[j]
        //Compute
        ci = ai + bi
        cj = aj + bj
        //Write
        c[i] = ci
        c[j] = cj

In essence, instructions are separated and ordered within a kernel, so that operations are not performed until required data is in local memory. This allows the device to avoid waiting for memory returns or the minor overhead of rotating between warps. The pseudo-code in Algorithms 1 and 2 provides an example of how x2 instruction-level parallelism may be performed, although in practise using x4 or x8 (or even as high as register capacity permits) is likely to provide even greater results, due to the number of clock cycles required to perform accesses. In an ideal scenario, instruction-level parallelism was able to achieve 87% of memory peak when copying 56 floats per thread. This was higher than the 71% achieved by the core CUDA function cudaMemcpy [159].

As GPUs follow the SIMT execution model, the hardware is optimised towards static branch prediction where each thread within a warp should perform the same operation (CPUs have more advanced branch prediction). To aid branch prediction, the structure of

code can be simplified by unrolling loops so that multiple iterations are combined into a single iteration. Unrolling a loop can reduce branching penalties, allowing improvements to performance. Volkov identified that a mutated form of loop unrolling can be used to additionally provide the benefits of instruction level parallelism [160].

As stated earlier, the memory bus width is the number of contiguous bits accessed per read/write operation to DRAM. For this reason it becomes necessary to lay out large collections of data as a Struct of Arrays (SoA) rather than an Array of Structs (AoS). This layout ensures that neighbouring threads, accessing the same member of a struct specific to the thread, are accessing contiguously stored data. Reading/writing data in this format is considered a coalesced read/write operation and the resulting global memory bandwidth occupancy is maximal. CUDA compute 3.0 and higher devices are slightly less affected by this optimisation, due to more advanced striding, however, it remains a worthwhile technique, having similar relevance to CPU optimisation.
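The difference between the two layouts is easiest to see in code. In the sketch below (an illustrative example of ours, not from a cited source), the AoS kernel strides through memory by the struct size, whereas the SoA kernel's warp reads 32 contiguous words that coalesce into a minimal number of transactions.

    struct AgentAoS { float x, y, z; };      // AoS: one struct per agent

    struct AgentsSoA { float *x, *y, *z; };  // SoA: one array per member

    // AoS: thread i reads agents[i].x, so a warp's accesses are strided by
    // sizeof(AgentAoS) bytes and span several memory transactions.
    __global__ void shift_aos(AgentAoS* agents, float dx, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) agents[i].x += dx;
    }

    // SoA: thread i reads pos.x[i], so a warp's accesses are contiguous
    // 32-bit words and coalesce into the minimum number of transactions.
    __global__ void shift_soa(AgentsSoA pos, float dx, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) pos.x[i] += dx;
    }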

Frameworks for GPGPU Development

There exist two distinct frameworks for GPGPU application development, OpenCL and CUDA. In terms of performance, both frameworks are capable of achieving similar benchmarks once tuned to the targeted device, as the code produced by each framework is very similar at a device level before compiler optimisations [90, 47]. Additionally, OpenGL and DirectX both provide functionality within their graphics APIs for compute shaders. However, there is not much evidence of widespread use of compute shaders past 2010, likely due to the rise of the dedicated frameworks OpenCL and CUDA. An optimised OpenGL (compute shader) scan operation was found to execute 7 times slower than an optimised CUDA implementation. This difference was attributed to CUDA compiler optimisations not available to compute shaders [68]. Compute shaders may receive a resurgence with the release of Vulkan, as this aims to introduce lower-level controls into the graphics APIs, benefiting GPGPU by allowing better integration with graphical outputs.

OpenCL is an open standard maintained by the Khronos Group, the same consortium that manages the OpenGL standard, capable of targeting any parallel architecture that has appropriate vendor provided drivers. At this point in time OpenCL supports: both AMD and NVIDIA GPUs; Intel CPUs and integrated GPUs; several IBM products. CUDA by comparison is a proprietary offering, restricted to NVIDIA devices. CUDA's primary advantage has been in the greater range of libraries to assist computation and use of hardware features unique to NVIDIA GPUs. It is also easier to optimise between devices, due to the architectural uniformity among NVIDIA devices, whereas OpenCL supports a diverse array of hardware architectures, each requiring different optimisation priorities.

Alternatively, OpenACC is a compiler directive standard for accelerating serial code through the use of GPGPU [7]. Similar to OpenMP, programmers can label their code with simple compiler directives which are then processed at compile time to include the necessary GPGPU code. Directive standards provide a simple technique for accelerating

existing code, however, directives used incorrectly can provide suboptimal and potentially slower results.
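As a brief illustration of the directive approach, the sketch below (a standard SAXPY example, ours rather than from a cited source) annotates an ordinary C/C++ loop with an OpenACC directive; the compiler generates the GPU kernel and the host-device data movement requested by the clauses.

    // The pragma requests parallel GPU execution of the loop; copyin/copy
    // describe which arrays move to and from the device.
    void saxpy(int n, float a, const float* x, float* y) {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }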

Case Study of GPU Optimisation

Grauer-Gray et al [57] used GPU processing to accelerate four computational finance applications – Black Scholes, Monte-Carlo, Bonds & Repo – from the library QuantLib. They compared application speeds when: executed serially; executed in parallel on an 8-core CPU; executed on a K20 GPU when rewritten with CUDA and OpenCL; and executed as GPU code generated by the directive based accelerators HMPP and OpenACC. Their results showed that the GPU programs executed at over 100 times the speed of the sequential CPU code and 10 times the parallel CPU code for the Black Scholes algorithm. The speeds of the GPU code varied between sources, although there is nothing to suggest each could not be rewritten to achieve similar speed. For the Monte-Carlo program, they found, for runs processing over 5000 samples, there was again a speed-up of at least 75 times sequential CPU code and 10 times parallel CPU code. This increased to a 1000 times speed-up against sequential CPU code when processing 50,000 samples. Once 500,000 samples was reached, however, the speed suddenly dropped off. Profiling identified the cause of this as cache misses from the random number generation. Restructuring the code to make better use of the cache would likely alleviate this fall off. The bonds application only managed an 80 times speed-up compared with sequential CPU code, which is likely due to the amount of divergent branching within the algorithm.

2.3.2 Cluster, Grid & Distributed Computing

A computer cluster is a local collection of connected computers that work together as a single system. Clusters are often deployed as they can provide improved performance, availability and cost over a singular machine of similar specification. A Beowulf cluster specifically refers to a cluster assembled from consumer grade computers.

Grid and distributed computing are techniques whereby large groups of computers at different locations (potentially including clusters) work in parallel on the same computational projects. In the case of grid computing, the groups of computers are often owned by various research groups, who each share access with others. In the case of distributed computing, public engagement is often leveraged, encouraging people to install the necessary software on their computers (and even phones), so that idle processing power can benefit research in fields such as medical simulations (Folding@home) and SETI [16]. Distributed computing marketing often neglects to inform users that use of their 'spare' processing power will increase power usage.

When using computer clusters, it is often the case that, due to the cost of the architecture, such a cluster is shared between several research departments or universities. Similar to grid computing this is likely to lead to a scheduling system, whereby jobs must be queued,

removing potential for real-time applications.

Distributing applications across multiple machines introduces much higher latencies for the sharing of data. Whilst CPU and GPU cache times are measured in terms of nanoseconds (10⁻⁹s), access across network interfaces is measured in milliseconds (10⁻³s), so latencies are in the order of 1 million times higher. Whilst distributed systems are often designed to minimise this latency, large parts of it simply cannot be reduced. This has a significant negative impact on the performance of algorithms whereby the division of a problem requires frequent live data to be shared across boundaries.

To maximise performance of FRNN search in an environment with a uniform distribution, and hence frequent sharing in any division, it is necessary to limit the impact of these latencies, such that a single device approach should be used as far as viable. Modern GPUs are capable of hosting millions of concurrent threads, and multiple GPUs can be used together in the same machine as an interim step.

2.3.3 Many Integrated Cores

Similar to GPU compute, Intel offered Many Integrated Core (MIC) architecture supercomputing cards (Xeon Phi), which were intended as a direct competitor to GPGPU, consisting of up to 72 cores capable of executing the x86 instruction set. The potential performance difference between MIC and GPU is negligible, with each performing best in the algorithms their architectures best suit, and with GPU providing double the bandwidth of MIC for random data access [152]. After the debut of MIC in 2011, most of the product line has been discontinued as of 2018 due to limited interest.

2.3.4 Architectural Specialisation Away From GPUs

Similar to NVIDIA's movements towards task specific chips within their GPUs, interest has returned to Field Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs): integrated circuits designed or programmed for highly specific tasks. The specialisation of such hardware can provide task specific high performance with lower energy requirements than more general hardware. Whilst FPGAs have been considered for use in HPC and accelerating complex system simulations, mostly predating the growth in GPGPU development, the desire to accelerate training of neural networks saw a reemergence of interest in applying the technology to HPC workloads [74, 14, 164].

Companies such as Google, Amazon and Facebook, with greater financial resources, have taken to designing their own ASICs for neural networks in partnership with existing semiconductor manufacturers. Furthermore, seeking to improve both speed and energy consumption, Bitcoin miners have largely switched to bespoke ASICs developed by companies such as Bitmain and Bitewei [151, 70].

Google's 'Tensor processing unit' ASIC,

developed specifically for Google's TensorFlow machine learning framework, has been made available commercially to purchase or use via cloud computing services. Testing of first generation Tensor processing units showed their inference performance to be more than 15x that of NVIDIA's K80 GPU, with 30x the operations per Watt (energy efficiency) [87]. It currently remains unclear how competitive the latest iteration of Tensor processing unit is versus NVIDIA's similarly specialised Volta and Turing architectures. However, with both architectures initially releasing around 2017, the use of specialised hardware for machine learning can still be considered in its infancy. Due to the specialisation of ASICs, currently they are unlikely to fit the requirements of most MAS, however it is possible that wider adoption of ASICs for machine learning workloads will encourage other investors in HPC to invest in task specific hardware, increasing the chances of specificity relevant to MAS.

2.3.5 Summary

This section has shown that among the HPC offerings, GPUs provide an accessible alternative to traditional HPC systems, capable of providing up to 10x speed-up vs parallel CPU algorithms. Growth in task specific hardware, largely driven by machine learning, may eventually benefit other fields. However, the current investment required to develop and manufacture high performance ASICs suggests it is unlikely that they will be developed to target complex system simulations.

Due to the high level of communication found within complex system simulations and during FRNN search, it is pertinent to avoid introducing unnecessary latency through the chosen HPC architecture. Therefore models should be confined to a single GPU as far as possible. Once model scales exceed a single GPU, multi-GPU implementations should be used, reserving the highest latencies of a distributed architecture as a last resort should model scales require it. However, the ability of a single GPU to handle upwards of 10 million threads provides a large buffer space.

Optimisation is important when developing HPC applications, and there are many optimisation techniques and architecture features to consider when writing algorithms for GPUs. These techniques range from interlacing instructions to ensuring memory is accessed appropriately, helping maximise utilisation of the available hardware.

2.4 Parallel Spatial Data Structures

With an understanding of how HPC utilises parallel algorithms, it is possible to discuss the techniques used to adapt spatial data structures to parallel-specific nuances, such as the added requirements of conflict resolution and synchronisation.

The most obvious of these nuances is that of arbitration when concurrent threads wish to access the same resource. In multi-threaded CPU systems this is usually synchronised via locking protocols utilising mutexes (mutual exclusion), whereby a thread must wait to acquire a lock to

a resource before accessing it. Other solutions make use of atomic operations, whereby a limited set of thread-safe operations are serialised to avoid potential race conditions. In addition to these hardware features, algorithms can often be restructured to avoid race conditions on write operations through increased communication. A potentially important benefit of restructuring algorithms in this manner is the avoidance of stochastic operations which can produce differences in results across repeated executions.

The following subsections discuss some of these existing spatial data structures which have been implemented for parallel architectures. Whilst these data structures are not all suitable for providing the FRNN searches that this thesis is concerned with, the underlying techniques that have been used for handling these parallel specific nuances may prove relevant.
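As a minimal illustration of atomic arbitration (our own sketch, not drawn from a cited work), the CUDA kernel below counts how many agents fall within each environment bin. Without the atomic, two threads incrementing the same counter could race and lose updates.

    // agent_bin[i] holds the bin id of agent i; bin_count must be zeroed
    // before launch. atomicAdd serialises conflicting increments to the
    // same address, guaranteeing each is applied exactly once.
    __global__ void count_agents_per_bin(const int* agent_bin,
                                         int* bin_count, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bin_count[agent_bin[i]], 1);
    }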

2.4.1 Hashing

Hashing structures are powerful data structures for memory-bounded algorithms, as their almost direct access to elements can provide an advantage over many ordered structures. Below are details of techniques that have been used and how they handle arbitration.

Perfect Spatial Hashing

Lefebvre & Hoppe presented a scheme for generating perfect spatial hashes of point-data, capable of maintaining spatial coherence [98]. Their technique combines two imperfect hash functions and an offset table to provide queries in exactly two memory accesses. The offset table is compacted by the need to only store a single byte for 15-25% of the number of elements to be stored. The use of a perfect hash however makes the construction of the table significantly more computationally expensive. Figure 2.9 demonstrates the construction algorithm, which is explained below.

Figure 2.9: A visual representation of a perfect hash table that utilises linear probing and a secondary offset table, built from the hash functions h(k) = ((k + O[hO(k)]) mod 8) and hO(k) = (k mod 3). The element 37¹ denotes that 37 has the offset hash 1.

The algorithm for constructing the table first calculates how many of the items hash to the same location within the offset table. The 8-bit offsets are then generated, starting from the most common offset, placing elements into the hash-table. They found this to work appropriately with suitably sized offset tables, however, they noted that backtracking

could be used were the algorithm to reach a stalemate state.

In order to maintain coherence within the main hash-table, when selecting offset values, the most coherent offset is chosen. To accomplish this, the inputs which share the same offset table hash are temporarily mapped into the main hash-table to identify how many of their spatial neighbours would be their neighbours within memory at each offset. This algorithm is simplified with two heuristic rules: (i) the offset values of neighbouring entries in the offset-table are attempted first (due to the coherence of the offset-table); (ii) following this, free neighbouring slots to spatial neighbours that have already been placed into the hash-table are attempted.

Their data structure is compact and utilises a low number of memory accesses when performing queries. However the construction times for their hash-tables are costly. Using GPU hardware (from 2006), their algorithm was capable of constructing a fast table of 11,000 elements in 0.2 seconds, however, the optimised table took 0.9 seconds. Whilst this performance is likely to have improved in the decade since their research, it is unlikely to provide fast enough construction for real-time motion at an interactive frame-rate as is required for complex system simulations. This is highlighted by their note that future work could involve extending the hashing to support efficient dynamic updates.
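The query side of the scheme is what makes it attractive: exactly two dependent memory accesses. The sketch below (ours) uses the example hash functions from Figure 2.9; the table sizes and types are illustrative.

    // First access reads the offset table, second reads the hash table;
    // being a perfect hash, no probing or collision handling is required.
    __device__ int perfect_hash_query(int k, const unsigned char* offsets,
                                      const int* hash_table) {
        int h_o = k % 3;                 // offset-table hash, hO(k)
        int h = (k + offsets[h_o]) % 8;  // main-table hash, h(k)
        return hash_table[h];
    }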

Uniform Spatial Partitioning

As seen in Sections 2.1.3 & 2.1.4, the use of spatial subdivision into a uniform grid of bins is a common technique used for managing spatial data in complex system simulations. The technique is known by many names: uniform subdivision [104], spatial partitioning [138], grid approach [56], spatial subdivision [85], cell list [139], cell [124] and likely others. Within this thesis, (uniform) spatial partitioning is used. It has been used in complex system simulations to represent a range of spatial actors across fields such as nano robotics, crowd dynamics, SPH and Molecular Dynamics (MD) [103, 91, 124, 55].

The spatial partitioning data structure functions by decomposing the environment into a uniform grid of bins, each given a linearly assigned id number. Spatially located entities are therefore able to identify their containing bins, allowing them to be sorted and stored in order of their bins' ids. Additionally an index must be constructed, to map each bin's id to its data within the main storage array. This storage is compact, containing no additional buffer space, ideal for the memory limitations of GPU computation. However, it therefore requires that the entire data structure be reconstructed to insert a new entity, or move an existing entity (to a different bin). Although this data structure is not dynamic, its construction primarily utilises the highly optimised parallel primitive operations sort and scan. This enables the construction algorithm to be well suited to the SIMT architecture of GPUs. In order to access entities within a spatial area, stored within the spatial partitioning data structure, the environmental bins which contain any of this spatial area are identified, and their contained entities are then all individually tested against the search area.
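As a sketch of how this construction maps onto parallel primitives (our own illustration using the Thrust library, assuming a square 2D environment of gridDim x gridDim bins of width binWidth), entities are hashed to bin ids, sorted by id, and an index of each bin's start location is built; here a vectorised binary search stands in for the equivalent histogram-and-scan approach.

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/sort.h>
    #include <thrust/binary_search.h>
    #include <thrust/sequence.h>

    struct Point { float x, y; };

    // Compute the linear id of the bin containing a point.
    struct BinId {
        float binWidth; int gridDim;
        __host__ __device__ int operator()(const Point& p) const {
            return (int)(p.y / binWidth) * gridDim + (int)(p.x / binWidth);
        }
    };

    void build_usp(thrust::device_vector<Point>& points,
                   float binWidth, int gridDim,
                   thrust::device_vector<int>& bin_start) {
        const int n = points.size(), bins = gridDim * gridDim;
        // 1. Hash: each entity identifies its containing bin.
        thrust::device_vector<int> keys(n);
        thrust::transform(points.begin(), points.end(), keys.begin(),
                          BinId{binWidth, gridDim});
        // 2. Sort entities by bin id, making each bin's data contiguous.
        thrust::sort_by_key(keys.begin(), keys.end(), points.begin());
        // 3. Index: locate where each bin's data begins in the sorted array.
        thrust::device_vector<int> bin_ids(bins);
        thrust::sequence(bin_ids.begin(), bin_ids.end());
        bin_start.resize(bins);
        thrust::lower_bound(keys.begin(), keys.end(),
                            bin_ids.begin(), bin_ids.end(), bin_start.begin());
    }

An FRNN query then reduces to identifying the bins overlapping the search area and iterating each bin's contiguous range within the sorted array.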

Section 2.5 explores the USP data structure in greater detail, providing analysis and discussion of the existing research.

2.4.2 Parallel Trees

The advantage that trees have over hashing data structures is their provision of links for traversing between data – the order of this data is dictated by the particular tree. This comes at the cost of multiple memory accesses to locate nodes by following the tree structure, whereas hashing structures are often able to provide direct access. The sections below provide a discussion of the implementations behind these data structures.

Parallel k-d Trees

As k-d trees are primarily applied to the ray tracing process of rendering, they have been implemented many times over to utilise parallel computation as researchers strive for real-time ray-tracing of complex scenes.

Shevtsov et al have described how k-d tree construction is easily made parallel by partitioning the input geometry between available threads and only synchronising between each stage of construction [146]. The alternative technique is that of constructing multiple sub-trees, however, this requires an initial decomposition to prevent the sub-trees from overlapping. Calculating a balanced decomposition of the input geometry is computationally intensive, whereas the alternative of partitioning space is likely to lead to poor load balancing. Their implementation was configured to utilise a quad-core processor and only calculate approximate bounding boxes. This allowed it to outperform an existing optimised single threaded accurate k-d tree construction by over 100x, constructing a k-d tree of over 1 million triangles in 0.45 seconds. Developers at Pixar have used k-d trees in parallel for the simulation of agent-based crowds, requiring them for the complex rules of inter-agent and environmental interaction [62].

Popov et al utilised a k-d tree with 'ropes' between adjacent nodes to perform efficient GPU ray tracing [133]. By having each node maintain a link to a neighbouring node from each face, the stack traditionally used during ray-tracing k-d tree traversal is no longer necessary. This makes the technique more suitable for GPU computation. Once the initial node of entry has been identified, the ray tracing algorithm is able to compare a ray trajectory against a node. If the ray does not intersect with any of the contained geometry, the face of the k-d tree node which the ray exits via is identified, and the rope to the neighbouring node attached to that face is utilised until the ray collides or exits the scene. Further speedup is achieved by grouping similar rays into packets, such that rays passing through the same node can share computation. This however requires the use of a heuristic, such as the surface area heuristic (SAH), to select the best node when rays exit via different faces. They compared their single and packet ray techniques against the OpenRT ray tracing library [43] and found they were able to trace at 60% and 550% higher FPS

respectively.

Parallel Matrix Tree

In 2011, Andrysco & Tricoche developed a dynamic tree implementation, capable of parallel construction and representing any tree with a constant branching factor and regular structure [19]. Their research was concerned with improving the representation of Octrees and k-d trees [18], whereby each layer of a tree is represented as a matrix. This makes the structure regular, removing the need for scattered allocations and allowing them to better take advantage of parallel hardware when used for ray-tracing. In essence, the bottom layer of the tree is assembled as an array, containing each of the non-empty nodes in a known order. The tree structure above is then easily computed in parallel. The results show that it was able to construct k-d trees for 17,000 triangles in 0.002 seconds, with 174k triangles only taking 0.008 seconds. Similarly, testing showed that the implementation significantly reduced the nodes traversed during multiple ray-tracing samples, compared to the existing state of the art.

Parallel R-Trees

R-trees are primarily used with large databases of spatial data; as such, early work targeted the primary bottleneck, I/O. This research consisted of multiplexing the R-tree across multiple disks, using a branching factor equal to the number of available disks, such that each child node of the same parent maps to a unique disk [88].

Luo et al implemented the first comprehensive GPGPU R-tree [108]. Their experiments found the GPGPU implementation to perform constructions 20 times faster, and queries up to 30 times faster, than existing sequential algorithms when working with on-chip data. Their technique for construction was to only parallelise the two operations of sorting and packing, which are heavily used in sequential construction algorithms, whilst utilising the CPU for the remaining sequential operations. This enabled their construction to produce R-trees of identical quality to existing algorithms, avoiding a limitation of earlier parallel techniques that operate by merging smaller trees.

Their technique for parallel queries was to represent the R-tree as two linear arrays, one providing an index mapping the nodes to their parents and the other storing rectangle information for each node. They then utilise a two tiered parallel approach to queries. Each block of threads performs a different query, and multiple blocks can be launched to perform multiple queries concurrently. The threads within each block are able to work together performing a breadth first search, using a frontier queue to communicate [107]. However, in order to take advantage of memory coalescing, each thread is responsible for an entry, rather than a node. This includes threads responsible for empty entries. Further performance was gained by storing the top levels of the R-tree within the constant cache, which reduced the execution time of queries by 10%.

Whilst their parallel query technique has high utilisation of threads when query rectangles have significant overlaps, they found it unable to fully utilise resources when queries did not have many overlaps. To remedy this they were able to encode their queues such that each block could handle multiple queries.

Parallel Quadtrees & Octrees

Burtscher and Pingali described a GPGPU implementation of the Barnes Hut n-body algorithm [30], which uses a parallel octree that holds a single body within each leaf node. The tree nodes and leaves are both stored in the same array, whereby leaves index from 0, and internal nodes index in reverse from the end of the array. They also have a single array per field (Struct of Arrays), to take advantage of memory coalescing.

They construct their octree via an iterative insert algorithm. Each thread, inserting a different body, attempts to lock the appropriate child pointer via an atomic operation. If the lock is gained and the child pointer is empty, the body is simply added. Otherwise the node is partitioned and both the existing and new bodies are inserted. A memory fence operation is then executed to ensure other threads are made aware of the new tree structure before the lock is released. Other threads that fail to gain a lock continue to retry until success is gained. The SIMT paradigm heavily reduces the retries that must occur, as the thread divergence causes threads which failed to gain a lock to wait for any successful threads within a warp to complete, reducing the number of failed memory accesses. The __syncthreads() operation is used to ensure that warps in which no locks are gained do not occupy the GPU with unnecessary memory accesses. This then iterates until all bodies have been inserted. They also detail the techniques that they use for traversing and sorting the tree, where the sorting is necessary to speed up the neighbourhood searches when calculating the forces applied to each body. However, they do note that some of their techniques are constrained by the lack of global memory caching in earlier GPUs. In summary, their complete algorithm provided around 10x speed-up over an existing CPU algorithm. As would be expected, the tree construction and neighbour search kernels both consumed the most GPU time.

Jian et al extended this to produce a faster GPGPU quadtree implementation (CUDA-quadtrees) whereby the tree was contained within on-chip shared memory rather than global memory [83]. In order to construct quadtrees, each block of threads builds a quadtree for a different region, with each thread performing a different insertion. Notably they used a similar locking technique to that of Burtscher and Pingali, with the key difference that each block constructs a different quadtree in shared memory, which is copied to global memory after construction. Similarly, searching the quadtrees involves loading the related trees back into shared memory, and performing a parallel depth first search. When comparing their quadtrees against the official implementation of Burtscher and Pingali's octree, they found their implementation to perform significantly faster (100x) in their construction

experiments, which extended to 1,000,000 items, and around 10x faster in searches. Octrees have also been demonstrated as a more performant alternative to USP when performing the similar k-nearest neighbours algorithm [99, 15]. The non-uniform partitioning allows k neighbours to be found within skewed distributions more efficiently than uniform subdivision approaches. It seems clear that in cases of skew a lone actor could be required to access a much larger Moore neighbourhood in order to find additional neighbours. The effect of this would be a high overhead of bin accesses before the k neighbours are found. In contrast, the fixed radial cutoff of FRNN search ensures that the number of bins to be accessed is always constant. As such, FRNN search is unlikely to see the benefits of an Octree approach. However, the task-specific RTX cores for accelerating tree traversal during ray tracing, found in some recent GPUs, may in future provide acceleration which changes this dynamic; these are not currently available to utilise via CUDA.
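A minimal sketch of this lock-based insertion pattern is given below; the array layout, the constants, and the allocateCell() helper are illustrative assumptions, not details of Burtscher and Pingali's implementation.

#define NULL_PTR (-1)  // empty child slot
#define LOCKED   (-2)  // slot locked by another thread

__device__ int child[1048576 * 8];  // 8 child pointers per node (SoA layout, size illustrative)

__device__ int allocateCell();  // hypothetical: allocates a new internal node

__device__ void insertBody(int body, int node, int childIdx) {
    bool success = false;
    while (!success) {
        int old = child[node * 8 + childIdx];
        if (old != LOCKED) {
            // Attempt to atomically lock the child pointer.
            if (atomicCAS(&child[node * 8 + childIdx], old, LOCKED) == old) {
                if (old == NULL_PTR) {
                    child[node * 8 + childIdx] = body;  // empty slot: the store also releases the lock
                } else {
                    int cell = allocateCell();
                    // ... insert both 'old' and 'body' beneath 'cell' ...
                    __threadfence();                    // publish the new subtree first
                    child[node * 8 + childIdx] = cell;  // then release the lock
                }
                success = true;
            }
        }
        __syncthreads();  // throttle warps whose threads failed to gain a lock
    }
}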

Fast Approximate Near Neighbour Graphs
To improve the performance of spatial algorithms, especially in the high dimensionality spaces required by AI applications, many have resorted to approximate methods. Harwood and Drummond presented Fast Approximate Near Neighbour Graphs (FANNG) [69] for the nearest neighbour algorithm (which returns a single neighbour, as opposed to FRNN search). Their technique relies on building a graph, where edges exist such that if the graph has an edge from p1 to p2, there is no requirement for an edge from p1 to any vertex p3 that is closer to p2 than to p1. The construction of this data structure requires iterating all possible edges to filter the desired ones. They state their technique can be extended to approximate k-nearest neighbours, which is closer to the FRNN search algorithm. This operates by instead returning the list of the k closest vertices found during graph exploration (ordered via a priority queue). The GPU performance of their implementation's recall is only compared against a linear search, with as much as a 5x speedup. However, the significance of this result is to be expected when comparing against a brute-force technique; the brute force instead occurs in the pre-processing to generate their graph. As their paper does not discuss construction performance, it is fair to assume their technique is not intended for real-time reconstruction.

Dynamic Bounding Volume Hierarchies
Whilst not intended for spatial point data, as with the previously covered spatial data structures, Larson and Möller utilised a dynamic data structure for use in collision detection [96]. Their data structure is a hierarchy of axis-aligned bounding boxes which enclose polygonal meshes.

Initially a single root node exists enclosing each dynamic object. Then, as the simulation progresses, these trees are incrementally rebuilt to represent changes to the spatial configurations of the represented objects. During the collision detection stage of execution, all nodes within the hierarchy which are accessed are marked as active. This allows updates to the hierarchy to explicitly target reconstructions (referred to as refits) to the subtrees containing active nodes. This refit process begins from the active leaf nodes and works its way upwards. During the process of refitting, the difference between the volume of a parent node and the sum of its children's volumes is used to determine whether a subtree has become invalid. Invalid subtrees are then later repartitioned to reduce bounding overlap during the collision detection query stage. They stated among their results that their data structure's lazy refits would be challenged by unnatural motions whereby initially neighbouring primitives moved in opposite directions. They also found that its primary bottleneck was when an object was colliding with itself. It was not stated why this creates a bottleneck; however, such a state would likely cause all of the nodes within the hierarchy to be marked as active. This shows that such a structure, whilst dynamic, is better suited to cohesive movements whereby many elements have similar trajectories, and performance quickly degrades when trajectories are highly variable between elements within the same local area. Hence, this style of lazy refit may have applications to ordered behaviour, but appears unsuitable for the chaotic behaviour found within complex system simulations.

Spatial Trees in Complex System Simulations
Dmitrenok et al performed a review of spatial trees in the simulation of biological tissue [44]. Concerned with simulations of cell behaviour using the BioDynaMo platform, which has a centre-based representation of cells and intends to utilise a cloud approach to parallelism [29], they identified spatial search and reinsertion as the most expensive stage of simulation. In particular they require FRNN search for detecting collisions between up to 1,000,000 cells. They focus on three styles of tree: Octrees, k-d trees and R-trees. In discussing results, they noted the trade-off of tree depth: as depth is increased search time shortens, however, insertion time (construction of the tree) increases. Most prominent in their results is that the R-tree was outperformed in both search and insertion at 10,000 nodes, so much so that it did not warrant testing at higher capacities. Overall, however, they found that searching provided the bottleneck, with k-d trees providing a total performance around 2x faster than Octrees in the configurations and search ranges tested. Despite BioDynaMo being intended for cloud parallelisation, their experiments were limited to a basic CPU with 4 threads, so it is hard to gauge the impact that latency from the intended distribution would have on performance. Furthermore, these results are not indicative of how the trees may perform on the highly parallel architecture of GPUs.

2.4.3 Parallel Verlet Lists
Verlet lists have also been used as a potential alternative for providing FRNN search. In contrast to USP, once the FRNN search has been performed each actor stores its list of neighbours, essentially caching the search result [157]. However, this technique relies on an additional step on top of USP and FRNN search to produce the data structure [156]. Furthermore, the primary value of Verlet lists lies in reducing the need to re-perform the neighbour list construction, only rebuilding each neighbour list when its central actor has moved a set distance. Cost therefore scales with the frequency of list reconstruction, alongside the inflated memory usage due to each element holding a list of its unique neighbourhood, limiting the applicability to systems of low entropy and scale [165]. Verlet lists have primarily been used in CPU code, although many components of the overall technique are easily mapped to a thread-parallel architecture. Lipscomb et al have developed a GPU parallel implementation of the Verlet list neighbour algorithm for use with MD simulations [102]. Their technique primarily takes advantage of the primitive scan and pair sort functionalities provided by CUDA Parallel Primitives (CUDPP) to calculate neighbour lists on the GPU, bypassing the latency experienced if otherwise calculating them on the CPU. The results of Lipscomb et al showed a peak speedup of 30x over the CPU implementation when processing an MD system of 10,219 particles. This improvement scales with the count of particles; with a mere 76 particles the CPU implementation performed fastest. Similarly, Páll and Hess modified Verlet lists to overcome the limiting factor imposed by the regular reconstruction [129]. Their approach adds a buffer distance so that neighbour lists can include actors slightly outside the range. Additionally, actors are processed in clusters of 2-8 to improve alignment of threads under the parallel architectures of CPUs and GPUs. They conclude that their method is only appropriate for cheap neighbour interactions, as the redundant interaction calculations due to the buffer distance can outweigh other improvements. Others, such as Howard et al, have explored further techniques for GPU optimisation of Verlet lists [80]; however, their optimisations address requirements of MD applications, such as different entities having different search radii. Their approaches include stencilling, whereby cells of the Moore neighbourhood are skipped, and the introduction of linear bounding volume hierarchies.
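The rebuild criterion central to such buffered Verlet lists can be sketched as follows; the struct layout and names are illustrative assumptions, not drawn from [129]. Lists are built with radius r + s (where s is the buffer 'skin'), and a list is only rebuilt once its actor has drifted more than s/2 from the position at which the list was built.

struct Actor {
    float x, y, z;     // current position
    float bx, by, bz;  // position when the neighbour list was last built
};

// Returns true when the actor's neighbour list may have become stale.
bool needsRebuild(const Actor &a, float skin) {
    float dx = a.x - a.bx, dy = a.y - a.by, dz = a.z - a.bz;
    float half = 0.5f * skin;
    return dx * dx + dy * dy + dz * dz > half * half;  // squared-distance test
}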

2.4.4 Neighbourhood Grid
Joselli et al have described the neighbourhood grid data structure. This data structure has been designed to optimise SPH by assigning each particle to its own bin, rather than the uniformly sized bins found in USP [86]. This binning system instead creates an approximate spatial neighbourhood, which has allowed performance improvements of up to 9x when

compared to USP methods [85]. By storing particles in unique bins within a regular grid, neighbourhood searches can be carried out surveying a constant radius of directly adjacent bins. With a radius of 1 cell, this reduces all neighbourhood searches to checking 26 bins in 3 dimensions, and only 8 bins in 2 dimensions. This provides constant-time queries, rather than queries that increase with the density of particles. As particles move, the data structure must be sorted such that spatial neighbours remain close within the grid. A bitonic sort was used in their development and testing, which sorted each dimension in a separate pass; however, the focus was on their data structure and the sorting algorithm used is independent of it. Earlier passes are not repeated if the 2nd or 3rd passes make further changes; they found that doing so would impact performance whilst only correcting around a single percent of particles. This method sees great performance improvements due to the imposed uniformity, which reduces divergence and is favoured by the SIMT architecture of GPUs. However, this uniformity comes at the cost of an approximate method, incapable of providing the precision needed by many MAS. Furthermore, despite demonstrating pedestrians, it is not explained how their approach would handle less uniform and skewed distributions. In these cases, some actors would not map cleanly to the grid whilst retaining locality.

2.4.5 Primitives
There are several parallel primitive libraries available for use with CUDA. These each provide GPU-optimised implementations of algorithms common to most GPGPU tasks (e.g. sort, reduce, partition, scan) in a templated fashion. These allow developers to utilise highly optimised algorithms without needing to spend their own time profiling. Thrust [5] and CUDPP [3] are both C++ template libraries. Their algorithms overlap in many cases and provide similar performance (this may vary per algorithm). However, Thrust is primarily targeted at CUDA programmers, providing container classes to abstract the sharing of data between host and device, whereas CUDPP is capable of being used by non-GPU code. CUB [2] instead targets every layer of CUDA programming, allowing primitives to be applied at warp, block and device-wide scopes. This lower-level architecture allows the algorithms to be used inside kernels, removing any overhead of additional kernel launches. CUB's documentation details the performance of its device-level algorithms against those available in Thrust across several CUDA-capable devices. CUB is shown to range from matching Thrust's performance to handling twice as many inputs per second across the tested devices.
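As an illustration of how such primitives are typically invoked, the sketch below pairs a Thrust device-wide pair sort with a CUB device-wide exclusive scan; the array names are placeholders rather than part of any specific implementation.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cub/cub.cuh>

void sortAndScan(thrust::device_vector<unsigned int> &binIdx,
                 thrust::device_vector<unsigned int> &itemIdx,
                 unsigned int *d_counts, unsigned int *d_offsets,
                 int numBins) {
    // Pair sort: reorder item indices by their bin index (radix sort internally).
    thrust::sort_by_key(binIdx.begin(), binIdx.end(), itemIdx.begin());

    // CUB device-wide exclusive scan: the first call only queries the
    // required temporary storage size.
    void *d_temp = nullptr;
    size_t tempBytes = 0;
    cub::DeviceScan::ExclusiveSum(d_temp, tempBytes, d_counts, d_offsets, numBins);
    cudaMalloc(&d_temp, tempBytes);
    cub::DeviceScan::ExclusiveSum(d_temp, tempBytes, d_counts, d_offsets, numBins);
    cudaFree(d_temp);
}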

2.4.6 Summary
This section has shown that there are a number of parallel (GPU) data structures for use with spatial data, and that they are often maximally performant only under limited circumstances, leaving space for improvements in many areas. However, due to the diverse applications of these data structures, not all are suitable for providing FRNN searches. Handling construction in parallel has been performed using bespoke and parallel primitive algorithms (e.g. sort) to manage arbitration. Others have subdivided construction into smaller units to be later joined. This, however, is unlikely to scale well across highly parallel architectures such as GPUs, and requires data to be ordered prior to subdivision. USP has been most widely used across fields to provide FRNN search for complex system simulations [103, 91, 124, 55]. However, others have considered the applicability of k-d trees and R-trees [62, 44]. One reason for this is likely the ease with which USP parallelises, whereas approaches utilising trees require greater levels of data-specific optimisation to produce an efficient tree. However, many of the complex simulations utilising FRNN search highlight it as the most costly component of their models. Whilst USP and k-d trees have both been used for FRNN search in GPU parallel complex system simulations, the consensus shows USP as the predominant data structure for this application. Section 2.5 explores the construction and query stages of FRNN search using USP in greater detail, and highlights state-of-the-art approaches for improving its performance.

2.5 The Uniform Spatial Partitioning Data Structure

Following the earlier sections of this chapter, which have provided an overview of data structures and techniques capable of handling spatial data, this section delivers an in-depth assessment of the uniform spatial partitioning (USP) data structure, the primary data structure used to provide FRNN searches across a range of fields and the focus of this thesis. The USP data structure is a static multi-map with a known key space, which uses a coherent perfect hashing scheme to map continuous spatial locations to its discrete keys. However, the construction of the data structure, and how it is queried to deliver FRNN search on GPUs, does not share many parallels with the traditional dynamic hashing data structures explored earlier in this chapter. The USP data structure represents the environment by subdividing it into a grid of uniformly sized bins. Bins are assigned indices, which are calculated based on their grid position (e.g. posY × dimX + posX). This discretisation of the environment allows entities, with continuous coordinates, located within the environment to be stored according to the discrete bin in which they reside. Unbounded environments may either perform clamping or wrapping to ensure that all locations are first mapped to the regular environment bounds, so that they can be assigned

to discrete bins. However, the optimal choice of solution is dependent on the application. As such, this thesis focuses only on regularly bounded environments. The delivery of FRNN search can be divided into two stages: construction and querying. The incremental nature of the complex system simulations relying on FRNN search leads to both stages being required each time-step. As such, the performance of both stages significantly impacts the overall runtime of a simulation. The remainder of this chapter first details the memory structure of the USP data structure. This is followed by explanations of the two stages of operation, construction and querying, highlighting their possible areas for improvement. After which, prior research from the literature which has sought to improve the USP data structure is discussed. The chapter concludes with a summary of the important highlights.

2.5.1 Memory Layout
Figure 2.10 illustrates a simple environment and its associated USP data structure [58]. 10 items are spatially stored in an environment of 4R × 4R scale (shown top left); the discretisation of this continuous environment and how it is mapped to the USP data structure is then demonstrated. Similarly, those items which fall within a FRNN query of the source item are highlighted. In a complex system simulation, the environment would likely be many times larger than shown, with thousands of actors performing queries concurrently. The USP data structure utilises two arrays: the storage array, which provides a compact store of all neighbour data stored within the data structure, and the Partition Boundary Matrix (PBM), which provides an index mapping the boundaries of each bin's contained data within the neighbour data array. Elements within the storage array are stored in the order of their containing bin's index. Where multiple elements exist within the same bin (e.g. bin 6 of Figure 2.10), the sub-ordering of elements within a bin is undefined and may vary according to the sort implementation used. Queries of the data structure always consider all elements within a given bin rather than isolating specific elements, hence the order of elements within a bin does not matter. The amount of memory required by the USP data structure can therefore be calculated as the total required for both the storage array and the PBM. The size of the neighbour data array is simply that of all the neighbour data to be stored, as this array contains no buffer space. There are two options for PBM layout, described in the following section. The memory requirements of each are tied to the number of bins. The first style of PBM requires two unsigned integers (or unsigned shorts) for each bin, to form an array of start indices and an array of end indices of each bin within the storage array. As the end index of one bin is equal to the start index of the following bin, a more compact PBM is possible, requiring a single unsigned integer per bin, plus one additional unsigned integer. These integers denote the starting index of the corresponding bin's data within the storage array, and the additional index at the end will always hold the length of the storage array.


Figure 2.10: Visual representation of an environment and how its data is stored and accessed under GPU uniform spatial partitioning. The PBM acts as an index to each bin's subset of the neighbour data array. Bin ID and Array Index denote the indices of elements within the arrays, and are not stored in memory.

This compact technique allows the bounds of any bin's data within the storage array to be identified by reading two neighbouring values from the PBM: for bin i, the half-open range [pbm[i], pbm[i+1]). This is demonstrated for bins 6 and 11 within Figure 2.10; for bin 6, PBM indices 6 & 7 point to the range [3, 5) within the storage array, which contains the two items found within bin 6 of the environment. The total number of bins within the environment is highly dependent on the level of discretisation required. If bin widths are a small proportion of the environment dimensions, the total quantity of bins can easily reach the millions (256³) or even billions (1024³) in 3D due to the cubic relationship. This is harmful to both the memory utilisation of the PBM and the construction time, as in such a case most bins are likely to remain empty.
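A device-side lookup via the compact PBM can be sketched as follows; the variable names (pbm, storage, dimX) are illustrative.

// Iterate all neighbour data stored within the bin at grid position (binX, binY).
__device__ void iterateBin(const unsigned int *pbm, const float3 *storage,
                           unsigned int binX, unsigned int binY, unsigned int dimX) {
    const unsigned int bin   = binY * dimX + binX;  // grid position -> bin index
    const unsigned int begin = pbm[bin];            // first item of this bin
    const unsigned int end   = pbm[bin + 1];        // one past the last item
    for (unsigned int i = begin; i < end; ++i) {
        const float3 neighbour = storage[i];
        // ... process neighbour data ...
    }
}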

2.5.2 Construction
Each time an element stored within the data structure moves spatially, the data structure must be reconstructed. In practice within MAS this leads to the data structure being

reconstructed each time-step, after actors update their locations. Therefore it is important that the cost of construction is minimised. The construction of the USP data structure uses the parallel primitives sort and scan, and a reorder kernel to move elements to their sorted indices. Both sort and scan implementations are widely available in highly optimised GPU primitive libraries such as CUB [2], Thrust [5] and CUDPP [3]. Although the USP data structure is static, requiring a full reconstruction to relocate a single element, the construction's utilisation of these common operations allows it to remain viable on the SIMT architecture of GPUs.

Figure 2.11: The generation of the USP data structure's storage array (a) and PBM, which can take the form of one (c) or two (b) arrays. (a) A collection of items may be pair sorted according to their bin index into the storage array. (b) A PBM constructed from start and end index arrays can be produced from the storage array by detecting the boundaries; in this example '/' denotes the invalid value flag. (c) As a more costly alternative, a single-array PBM can be produced with an extra scan pass; 0 must be used as the invalid value flag, and the final index is manually assigned based on the total number of bins.

Figure 2.11a shows a collection of elements, each paired with their corresponding bin's index, which are to be stored in the USP data structure. First, the elements must be pair-sorted in ascending order of their bin indices. This sorted array is the storage array of the USP data structure. Next, the PBM must be generated. Figure 2.11b shows how this can be achieved by allocating start and end index arrays of length corresponding to the number of bins, initially filled with an invalid value flag (denoted in Figure 2.11b as '/'). A kernel is then launched, with one thread corresponding to each stored item. Each thread checks both

the following and their own item. If a boundary is found, the start and end arrays are updated for the corresponding bins. This produces a PBM across two arrays, identifying the storage bounds of each bin. Where the invalid value flag is read, a bin contains no items. Alternatively, as shown in Figure 2.11c, a more compact, single-array PBM (start array only) can be produced to reduce the memory footprint. However, this requires an additional reverse exclusive scan (max operation) to update the invalid flags (from 0) to the following valid boundary.
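A minimal sketch of such a boundary detection kernel is given below, assuming binIdx holds each sorted item's bin index; all names are illustrative.

__global__ void detectBoundaries(const unsigned int *binIdx, unsigned int n,
                                 unsigned int *start, unsigned int *end) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const unsigned int bin = binIdx[i];
    if (i == 0)
        start[bin] = 0;                    // the first item always begins its bin
    if (i == n - 1 || binIdx[i + 1] != bin) {
        end[bin] = i + 1;                  // exclusive end of this bin's storage
        if (i < n - 1)
            start[binIdx[i + 1]] = i + 1;  // and the start of the following bin
    }
}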

2.5.3 Query
To access data located within the radial neighbourhood of a position, the position's containing environment bin is identified and all neighbour data stored within that bin and its adjacent bins (the Moore neighbourhood) are then iterated. As shown in Figure 2.10, only those with a position inside the radial neighbourhood are considered by the simulation. In two dimensions this means that neighbour data within a spatial area 2.86x larger than required (2D neighbourhood area: πR², 2D Moore neighbourhood area: (3R)²) are accessed. In three dimensions this increases to 6.45x (3D neighbourhood volume: (4/3)πR³, 3D Moore neighbourhood volume: (3R)³). Thus, in both 2D and 3D environments the majority of memory accesses can be assumed to be redundant. By subdividing the environment uniformly, it is possible that non-uniform distributions of data may lead to many empty bins alongside a small number of densely populated bins. However, the benefit of uniform subdivision is that it allows neighbours to be searched using 1-2 reads per bin. Under non-uniform subdivision this access to bins would not be as direct, likely requiring significantly more reads per bin due to the need to traverse the division hierarchy.
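The query described above can be sketched for the 2D case as follows, under the simplifying assumption that bin widths equal the search radius R; all names are illustrative.

__device__ void frnnQuery(const float2 origin, const float R,
                          const unsigned int dimX, const unsigned int dimY,
                          const unsigned int *pbm, const float2 *storage) {
    const int binX = (int)(origin.x / R);  // containing bin of the search origin
    const int binY = (int)(origin.y / R);
    for (int dy = -1; dy <= 1; ++dy) {     // iterate the 3x3 Moore neighbourhood
        for (int dx = -1; dx <= 1; ++dx) {
            const int bx = binX + dx, by = binY + dy;
            if (bx < 0 || by < 0 || bx >= (int)dimX || by >= (int)dimY)
                continue;                  // clamp to the environment bounds
            const unsigned int bin = by * dimX + bx;
            for (unsigned int i = pbm[bin]; i < pbm[bin + 1]; ++i) {
                const float dxp = storage[i].x - origin.x;
                const float dyp = storage[i].y - origin.y;
                if (dxp * dxp + dyp * dyp <= R * R) {
                    // storage[i] lies within the radial neighbourhood
                }
            }
        }
    }
}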

2.5.4 Uniform Spatial Partitioning in Complex System Simulations
As briefly discussed in Chapter 1 and previous sections of this chapter, USP is actively used to provide FRNN search within many fields which simulate complex systems [103, 35, 124, 55]. This subsection explores how USP has been applied within different fields in greater detail, examining reported properties of execution to understand the range of properties required to be reflected in a general case of FRNN search.

Pedestrian, Crowd, Flock
Since Reynolds' seminal flocking paper [135], modellers of crowds of pedestrians and flocks of animals have sought to move from discrete cellular automata to actors represented in continuous space. This has seen implementations move to GPU acceleration, utilising USP to provide their FRNN search. Two recent examples of this are Charlton et al [35] and Demšar et al [42].

This field is often concerned with real-time models for visualisation, and hence the population size is limited to what can be simulated for real-time rendering. Furthermore, many applications of pedestrian modelling are concerned with sparsely populated environments, which may not facilitate high populations. Models are often simulating a single plane, and can hence be modelled in two dimensions; however, aerial flocking and buildings with multiple floors may require models to utilise three-dimensional partitioning. Across these two papers, populations are demonstrated from below 1×10³ actors to 500×10³. Furthermore, it is noted that the crowd model has actors which move at variable speeds with variable radii. This is representative of a diverse demographic within the modelled population. Whilst Charlton et al mention 'large densities' found in crowd models, and concede that their model is unable to handle the highest densities, exact figures representing the density of actors relative to the search radius are not provided. Whilst exclusively high-density crowd models may support short ranges of actor vision, due to the impact of being tightly packed, visually realistic modelling of environments which combine sparse and dense areas is likely to require a greater range of vision. This may lead to the densest areas of an environment (e.g. environmental bottlenecks) producing much higher than average densities. This could be further exacerbated in multi-level environments, whereby three-dimensional USP is required.

Biological
There are various approaches to cell-level biological modelling found in the literature. Of particular relevance to USP are those of cell-centre modelling and lattice site modelling. Whilst cell-centre modelling makes full use of FRNN search, lattice site modelling uses discrete locations, such that direct access to the USP data structure's multi-map is required. Similar to crowd modelling, biological models are gradually shifting to utilise agent-based approaches, where individual cells are represented. Three examples are discussed in this section, covering medical nanorobotics [103], immune reaction [150] and epithelial skin cells [36]. Across these three examples, populations range from 20×10³ actors to over 300×10³. Notably, Tamrakar states that the actor population was limited to 20,000 by the CPU-based simulator (Repast) that they were assessing in parallel with the GPU-based simulator FLAMEGPU. Additionally, the population within the model of epithelial skin cells increases by over 2x over 1000 iterations of the model. This is notable behaviour, not previously discussed in pedestrian models. As before, none of these papers discuss the details related to density. However, Liu does note the worst case, whereby all nano-robots are located within the same subdivision. This may suggest a model which can experience high densities for brief periods. Due to the diversity of biological structures modelled, it seems likely that densities and distributions are highly variable between models. Models of tissue growth are likely to represent packed cells which apply forces to their direct neighbours, whereas models

representative of cells within biological fluids are likely to require a greater range in order to model highly mobile cells.

Molecular Dynamics
Molecular Dynamics (MD) forms a special case, whereby USP has been used extensively and tuned for the specific properties of the field to provide efficient FRNN search. MD involves incredibly high actor populations, over 100 million, with a consistent actor density [55]. This enables several techniques outside of the scope of the general case targeted within this thesis. In order to facilitate such high actor populations, MD implementations of USP both perform more work per thread and utilise distributed approaches with thousands of GPUs operating simultaneously [55]. These scales are significantly higher than those found in all other fields of complex system simulation; as such, they sit outside the general case of FRNN search. One approach seen is that of replacing 1 thread per actor with 1 thread per bin [17]. This reduces repeated memory accesses and can provide additional benefits via instruction-level parallelism. However, the approach is limited to low densities, as a high number of actors per bin would lead to registers spilling into higher latency GPU memory. Some higher densities can be handled by instead processing 1 bin per warp or thread block [9]; however, again, this is only suitable for limited actor densities and distributions. Furthermore, the comparatively low actor populations seen in other fields are unlikely to achieve full device utilisation if multiple actors are executed per thread, which would additionally harm performance. When distributed, it becomes necessary for actors within the boundaries of a GPU's simulated area to be communicated to devices handling neighbouring areas. This has been solved with techniques such as ghost particles [55], where boundary particles are shared each time the data structure is built, and chequerboard decomposition, where only actors within 1/8 of bins are executed simultaneously to avoid conflicts between neighbouring bins [17].

Smoothed Particle Hydrodynamics
Whilst Smoothed-Particle Hydrodynamics (SPH) has similarities with MD, most applications are still confined to a single GPU's memory limitations. Whilst this limitation will vary between models and hardware, the upper bound appears to be in the low millions [115, 77]. However, others, particularly those using SPH to drive graphical applications, are content with actor populations of less than a million [56]. Yang et al suggest two densities are present in SPH simulations: whilst low densities of 30 neighbours may be utilised for visual applications, computational physics instead requires around 80 neighbours to achieve the required precision [167].

Summary
This subsection is only able to demonstrate a small sample of the applications for which USP provides FRNN search. Complex system simulations are employed across a number of broad fields such as social science, earth sciences, physics and engineering. Many of these applications will require the representation of spatial actors, so likely also rely on FRNN search, which may be provided by USP; however, researchers with domain-specific modelling knowledge are often unconcerned with the supporting algorithms such as FRNN search, due to their reliance on frameworks such as FLAMEGPU. The properties highlighted in this subsection provide a representative sample of the general case to test optimisations against in the following chapters. The applications of USP and FRNN search explored in this subsection have shown agent populations ranging from 1×10⁴ to 5×10⁶, limited by the requirements of performance and a single GPU's memory capacity. Measures of neighbours returned per FRNN query are less available; however, Yang et al provide a range of 30-80 neighbours per query [167], covering SPH applications interested in both visuals and precision. Again, in this case the density limitation for visuals is bound by performance. Distributions found within complex simulations vary from the mostly compact arrangements of actors found in some biological and SPH models to the loose non-uniform distributions found within other biological models and some pedestrian environments. Other general properties relevant to FRNN search include actors travelling at variable speeds, actors of variable radii, and actor populations which change during execution. Entropy, a metric for actor speed, is unlikely to impact performance, as GPU sorting methods are not incremental, so perform equally for pre-sorted and unsorted data. Support for actors of variable radii requires the addition of actor radii within shared neighbour data, and a search radius capable of capturing the centre point of the largest-radius actors; this can be achieved by testing higher densities. As each iteration of a complex system simulation is independent, so long as appropriate memory is allocated, a growing population simply defines a range of populations to be assessed.

2.5.5 State of the Art
The preceding sections of this chapter have introduced the basic data structure and algorithms related to FRNN search using the USP data structure. This section introduces innovations from the literature which have sought to augment them to improve performance. Common GPU optimisation techniques apply to FRNN search, such as use of the texture cache, faster mathematical operations, structure of arrays (coalesced memory accesses) and radix sort [27, 141]. These approaches are familiar concepts to those working with GPGPU development and are often discussed in official NVIDIA documentation and guidance for

CUDA developers. The scope of research towards the construction stage has been rather limited due to its reliance on fundamental primitive algorithms, which are already the focus of research by other disciplines concerned with improving parallel sorts and similar [112, 78, 49]. Sun et al optimised the reconstruction of the data structure by developing a novel sorting technique that takes advantage of the knowledge that particles can only move to neighbouring bins, utilising a record of the changes to each bin's capacity [149]. They were able to improve performance in a small SPH simulation of 8192 particles within a 16³ grid. However, overall performance was equal to that of their initial unoptimised reconstruction when applied to larger simulations. Section 4.4 shows how in larger simulations the construction time is a small fraction of that required for the FRNN search, so optimisation efforts targeting the search process are likely to have a greater overall impact. Research regarding the USP data structure has primarily considered improvements towards the query stage. It is hypothesised that this is a result of the relatively low cost of construction in most applications; Chapter 4 explores this fully. Other data structures such as the 'neighbourhood grid' and Verlet lists have been proposed and used (see Section 2.4); however, the USP data structure remains the most performant for GPU hardware.

Figure 2.12: A level 4 Z-order curve (Morton code) in 2 dimensions.

Goswami et al were able to improve the performance of FRNN searches during SPH on GPUs [56]. They adjusted the indexing of the bins, into which the environment is subdivided, so that they are indexed according to a Z-order space-filling curve (also known as a Morton code). The space-filling curve maps multi-dimensional data into a single dimension whilst preserving greater spatial locality than regular linear indexing. This creates power-of-two aligned square grids of bins with contiguous Z-indices, such that bins containing neighbour data are stored in a more spatially coherent order. This additional locality intends to ensure that more neighbourhoods consist of contiguous blocks of memory, such that surplus data in cache lines is more likely to be required in subsequent requests, reducing additional latency caused by cache evictions. Also using space-filling curves, the AMBER GPU molecular dynamics toolkit, as described by Salomon-Ferrer et al [141], sorts particles within bins according to a 4 × 4 × 4 Hilbert space-filling curve in order to calculate the forces between molecules during simulation. In addition to the possible benefits of spatial locality, this may allow the neighbourhood cut-off to be extended via a bitmask filter, reducing the quantity of redundant accesses to neighbourhood data. As a small part of a much larger unit of research, the precise method of implementation and its impact are not detailed, hence understanding this remains an open research question. Subdividing bins into a further 32 sub-bins may only be viable for densities where the sub-bin occupancy remains close to 1.
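A common 2D Morton encoding of the kind Goswami et al apply to bin indexing can be sketched as follows; this is the standard bit-interleaving formulation, not their specific implementation.

// Spread the lower 16 bits of x so a bit of y can be interleaved between each.
__device__ unsigned int part1By1(unsigned int x) {
    x &= 0x0000ffff;
    x = (x | (x << 8)) & 0x00ff00ff;
    x = (x | (x << 4)) & 0x0f0f0f0f;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

// Z-order bin index: replaces the linear index (posY * dimX + posX).
__device__ unsigned int mortonIndex(unsigned int posX, unsigned int posY) {
    return part1By1(posX) | (part1By1(posY) << 1);
}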

Furthermore, Rustico et al utilised an interleaved storage of neighbour data to improve the coherency of memory accesses [140]; however, again this remains an open research question, as the impact was not assessed in isolation. In reproducing the techniques of Goswami et al and trialling other possible optimisations linked to memory locality and access coherence, it was discovered that performance improvements were only found when testing code which lacked compiler optimisations (debug builds). Therefore, it is likely that the architectural and compiler improvements since 2010, which have greatly improved caching on GPUs, have rendered these techniques more costly than any savings achieved. This aligns with the observed reduction in the significance of coalesced memory accesses on newer GPU architectures, as discussed in Section 2.3.1. This raises the question of which other recent GPU hardware changes could impact the performance of FRNN search. Howard et al have proposed using multiple threads to process each individual neighbourhood query [81]. Warp shuffle commands are then employed in order to reduce the local results from threads working on the same query (a sketch of this style of reduction follows below). They found this was able to improve performance at low device utilisation, by utilising a greater number of threads. This approach has merit; however, it limits itself to cases of low actor population/occupancy. Whilst FRNN search via the USP data structure is not utilised for approximate techniques, the earlier discussed neighbourhood grid technique was inspired by USP for use in SPH [85]. This approach sorts neighbour data such that a maximum of 1 item is stored per bin. The query is then performed by only surveying the Moore neighbourhood of 26 bins. Whilst this has great implications for performance, through the likely elimination of divergence between threads, 26 neighbours is below even the lower bound of 30 neighbours suggested for visual applications of SPH by [167]. Therefore, it is unlikely to be an appropriate alternative for all but the most performance-constrained models. However, it does highlight the significant cost that divergence has on the performance of FRNN search.
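The style of warp-level reduction referred to above can be sketched as follows; this is a generic shuffle-based sum, not Howard et al's implementation.

// Sum a per-thread partial result across the 32 threads of a warp.
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);  // fold the upper half down
    return val;  // lane 0 holds the warp-wide total
}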

2.5.6 Summary
This chapter has introduced USP, which is the data structure most commonly used to provide FRNN search on GPUs, and it has detailed the operation of the two stages, construction and query, which are required to deliver FRNN search. Furthermore, it has examined existing optimisations to the data structure and its use for FRNN search as found in the literature. In combination, these demonstrate the shortcomings of the data structure, which, whilst the most promising data structure for delivering FRNN search on GPUs, still presents an opportunity for improved performance for a general case of FRNN search. The literature has identified several open research questions: How can modern GPU hardware/features be applied to improve FRNN search? How can the redundant area between the Moore neighbourhood and radial search area during FRNN queries be reduced? Is it possible to reduce divergence during queries without reducing FRNN queries to an

approximate method? These are investigated further in the remainder of this thesis.

2.6 Conclusion

This chapter has explored applications of FRNN search within complex system simulations, techniques for handling spatial data, and the utilisation of GPUs to accelerate this handling of spatial data. This has provided an understanding of how FRNN search is central to a wide range of complex system simulations where thousands of interacting actors require awareness of their spatial surroundings. Yet it also remains the most costly common process between many of them. The properties of many of these applications have been documented to provide guidelines representing realistic bounds for the general case targeted for improvement within this thesis. There are many unique applications for spatial data, with many similarly unique data structures suited to their purpose. An exploration of classic techniques for handling spatial data, primarily via hashing and tree data structures, has highlighted the features most applicable, like perfect hashing, to the requirements of FRNN search. Although problems like nearest neighbour and k-nearest neighbours are similar to FRNN search, their differences make them more suited to tree data structures. k-d trees are highly suited to nearest neighbour and k-nearest neighbours search in high-dimensional spaces, both of which are widely used in machine learning. In contrast, k-nearest neighbours in the spatial domain (two and three dimensions) can perform well with uniform subdivision techniques like USP; however, when dealing with non-uniform distributions the uniform subdivision quickly leads to high quantities of bin accesses. As such, Octrees have been recommended for use with k-nearest neighbours where skewed actor distributions are likely to occur. In contrast, the USP data structure has been shown to be the most appropriate technique across a diverse range of fields to deliver FRNN search on GPUs. To date, state-of-the-art optimisations have primarily attempted to optimise the query stage by improving memory locality and coherence, the impact of which has been greatly reduced by architectural improvements in modern GPUs, or through the removal of divergence by switching to approximate methods. Improvements to construction have been limited, as the algorithm relies heavily on common parallel primitive algorithms which are already highly optimised, although bespoke sorting has been applied, in an attempt to achieve a parallel equivalent of incremental sorting for the incremental movement. The review of state-of-the-art methods has raised several open research questions related to USP and FRNN search: How can modern GPU hardware/features be applied to improve FRNN search? Is it possible to reduce divergence during queries without reducing FRNN queries to an approximate method? How can the redundant area between the Moore neighbourhood and radial search area during FRNN queries be reduced? These are

investigated in Chapters 4, 5 & 6 respectively. The following chapter introduces models capable of assessing the performance of FRNN search. Some of these benchmark models are capable of representing the range of properties identified as forming the general case in Section 2.5.4. Others are examples of real-world models, used to demonstrate the impact of FRNN search in practice.

Chapter 3

Benchmark Models

Benchmark models are important, as they provide a controlled environment where variables which impact performance can be modified independently. Chapter 2 showed that FRNN search has been applied across a wide range of complex system simulations, each with unique actor representations and distributions to be handled by FRNN search. In order to target the general case and understand how variations in properties such as population size, density and entropy impact performance, it is necessary to utilise benchmark models capable of controlling these properties when assessing performance. This chapter contributes two benchmark models and discusses their effective usage. These models are essential in understanding the performance of FRNN search during complex system simulations in order to answer the main research question and demonstrate evidence of the main thesis contributions: C1, Atomic counting sort for construction of the USP data structure; C2, Strips optimisation for FRNN queries; C3, Proportional Bin Width optimisation for FRNN search. The Circles model provides a benchmark for FRNN search. It both allows scaling across a range of properties and moves actors through a range of distributions. Hence, it is able to provide a more general overview of the range of impacts to FRNN search performance than selecting a specific real-world model and/or scenario. This model has been used throughout when assessing the impact of changes to the query stage of FRNN search. Similarly, the Network model provides a benchmark for assessing the performance of discrete accesses to the data structure across a range of properties. As a secondary application for the data structure, accesses to individual bins can be used for the representation of networks or biological lattice sites. This model has been used primarily in Chapter 4 to highlight the greater impact improvements to the construction stage may have on these applications of the data structure. Additionally, this chapter introduces three models from FLAMEGPU which act as benchmarks of how changes may look when applied to a full model with additional model logic unrelated to the USP data structure. These have been used throughout the thesis to

further assess the impact of the most favourable optimisation configurations, as determined using the benchmark models. The following sections describe the models summarised above in greater detail.

3.1 Continuous: Circles Model

The Circles benchmark model provides a means for benchmarking FRNN search using a simple particle model analogue, typical of those seen in cellular biology and other multi-agent simulations. Within the model, scattered actors located in a 2D or 3D environment converge over time to form circles (Figure 3.1b) or hollow spheres respectively. An earlier version of the Circles model was published as 'A Standardised Benchmark for Assessing the Performance of Fixed Radius Near Neighbours' [38]. The version presented in this thesis has a smoothed force equation (Equation 3.4) which significantly reduces actor jitter at convergence. The Circles model is highly configurable, operable in both two and three dimensions and allowing a range of parameters to be controlled. During execution the model moves actors from an initial uniform random distribution to a non-uniform clustered distribution. This flexibility allows the model to be used as a general benchmark for FRNN search, highlighting the impacts of optimisation across a broad range of population sizes, densities and entropies, as observed in a range of complex system simulations. Within the model each actor represents a particle whose location is clamped between 0 and W−1 in each axis. Each particle's motion is driven by forces applied from other particles within their local neighbourhood, with forces applied between particles to encourage a separation of r. Examples of equivalent behaviour are shown in cell-centre models [163, 44], where damped force resolution is applied, enforcing strict separation of biological cells which require close contact in tissue formation. This model has been used in Chapters 5 and 6 to assess the impact of optimisations to FRNN search across a broad range of configurations.

3.1.1 Model Specification
The benchmark model is configured using the parameters in Table 3.1. In addition to these parameters, the dimensionality of the environment (Edim) must be decided, which in most cases will be 2 or 3. The value of Edim is not considered a model parameter, as changes to this value are likely to require implementation changes. Parameters are managed so that they can be controlled independently. As the density parameter is increased, the number of particles per grid square and ring increases, and the environment dimensions decrease. Changes to the actor population size simply affect the environment dimensions so as to not change the density.

Parameter   Description
k           The unified force dampening argument. Increasing this value adds energy, increasing particle speeds.
r           The radial distance from the particle to which other particles are attracted. Twice this value is the interaction radius.
ρ           The density of actors within the environment.
W           The diameter of the environment. This value is shared by each dimension, therefore in a two dimensional environment it represents the width and height. Increasing this value is equivalent to increasing the scale of the problem (e.g. the number of actors), assuming ρ remains unchanged.

Table 3.1: The parameters for configuring the Circles benchmark model.

Initialisation

Each actor is solely represented by their location. The total number of actors Apop is calculated using Equation 3.1¹. Initially the particle actors are uniform randomly positioned within the environment of diameter W and Edim dimensions. Figure 3.1a provides a visual example of this. A denser population would see more actors per bin, and a larger population would see the environment grow.

Apop = ⌊W^Edim · ρ⌋   (3.1)

Single Iteration
For each timestep of the benchmark model, every actor's location must be updated. The position x of an actor i at the discrete timestep t + 1 is given by Equation 3.2, where Fi denotes the force exerted on the actor i, as calculated by Equation 3.3. Within Equation 3.3, Fij represents the respective force applied to actor i from actor j. The value of Fij is calculated using Equation 3.4: the unified force parameter k is multiplied by the sine of the normalised separation of xi and xj, and by the unit vector from xi to xj. After calculation, the actor's location is then clamped between 0 and W − 1 in each axis.

xi(t+1) = xi(t) + Fi   (3.2)

Fi = Σj Fij   (3.3)

¹ ⌊⌋ represents the mathematical operation floor.


Figure 3.1: An area from the visualisation of the state of the Circles model. The spheres represent point-based actors. The grid lines show the subdivision of the environment into bins. (a) The start state under uniform random initialisation whilst executing in 2 dimensions with a Moore neighbourhood volume average of 37. (b) The end state whereby the model has converged to a steady state with a Moore neighbourhood volume average of 41.

Fij = sin(−2π(∥xj − xi∥ / r)) · k · (xj − xi) / ∥xj − xi∥   (3.4)

Algorithm 3 provides a pseudo-code implementation of the calculation of a single particle's new location, where each actor only iterates their neighbours rather than the global actor population.

Algorithm 3 Pseudo-code for the calculation of a single particle's new location.

vec myOldLoc;
vec myNewLoc = myOldLoc;
foreach neighbourLoc {
    vec toVec = neighbourLoc - myOldLoc;             // vector towards the neighbour
    float separation = length(toVec);
    if (separation < RADIUS and separation > 0) {    // ignore self and out-of-range neighbours
        float k = sin((separation / RADIUS) * -2PI); // smoothed force scalar (Equation 3.4)
        myNewLoc += UNIFIED_FORCE * k * normalize(toVec);
    }
}
myNewLoc = clamp(myNewLoc, envMin, envMax);          // keep within environment bounds

Validation
There are several checks that can be carried out to ensure that the Circles model has been implemented correctly. The initial validation technique relies on visual assessment. During execution, if the force k is positive, particles can be expected to form clusters arranged as rings in two dimensions (Figure 3.1b) and hollow spheres in three dimensions. This has the effect of creating hotspots within the environment where bins contain many particles, and dead zones where bins contain few or no particles. This structured grouping can be observed in complex systems such as pedestrian models, where a bottleneck in the environment creates a non-uniform distribution, with high and low actor densities before and after the bottleneck respectively [131, 64]. These clusters also lead to the Moore neighbourhood volumes more closely matching the radial neighbourhood volumes. Instead, when k is negative, particles can be expected to separate, although small clusters may remain. Executed with uniform random initialisation, this sees insignificant movement, such that the pattern is harder to visually classify. More precise validation can be carried out by seeding two independent implementations with the same initial particle locations. With appropriate model parameters (such as those in Table 3.1), it is possible to then export actor positions after a single iteration from each implementation². Comparing these exported positions should show parity to several decimal places, with significant differences between the initial state and the exported states.

² It is recommended to export actors in the same order that they were loaded, as sorting diverged actors may provide inaccurate pairings.

The precision of this validation is likely to vary between hardware implementations. Non-deterministic elements of implementations lead to floats accumulating in different orders, such that the resulting floating point representations may differ between runs. Due to these floating point arithmetic limitations, it was found that a single particle crossing a boundary between two FRNN search radii can subsequently cause many other particles to differ between simulation results, due to the chaotic nature of emergent models such as this.

3.1.2 Effective Usage
In order to best utilise the Circles benchmark model it is necessary to execute the model over a range of configurations, to subject the target implementation to a range of properties which may impact performance. The properties which may affect the performance of FRNN search implementations are actor quantity, neighbourhood size, actor speed and location uniformity. Whilst it is not possible to directly parameterise all of these properties within the Circles benchmark, a significant number can be controlled to provide understanding of how the performance of different implementations is affected. To modify the scale of the problem, the environment width W can be changed. This directly adjusts the actor population size, according to the formula in Equation 3.1, whilst leaving the density unaffected. Modulating the scale of the population is used to benchmark how well implementations scale with increased problem sizes. In multi-core and GPU implementations this may also allow the point of maximal hardware utilisation to be identified, where smaller population sizes do not fully utilise the available hardware. Modifying either the density ρ or the radius r can be used to affect the number of actors found within each neighbourhood. The number of actors within a neighbourhood can be estimated using Equation 3.5. This value assumes that actors are uniformly distributed, and in practice will vary slightly depending on the search origin. High-density neighbourhoods, beyond the bounds of the general case (30-80 neighbours), can be used as a rough analogue for increasing the size of neighbour data. Both have the effect of accessing more message data per bin, so impacts to scaling will be related.

Nsize = ρπ(2r)^Edim   (3.5)

Modifying the speed of the actors' motion affects the rate at which the data structure holding the neighbourhood data must change (referred to as changing the entropy, the energy within the system). Many implementations are unaffected by changes to this value; however, optimisations such as those by Sun et al [149] should see performance improvements at lower speeds, due to a reduced number of actors transitioning between cells within the environment per timestep. The speed of an actor within the Circles model is calculated using Equation 3.3. There are many parameters which impact this speed within the Circles model. Since a particle's motion is calculated as the sum of vectors to neighbours, it is clear that the parameters affecting neighbourhood size (ρ & r) impact particle speed in addition to the forces Fatt & Frep.

The final metric, location uniformity, refers to how uniformly distributed the actors are within the environment. When actors are distributed non-uniformly, as may be found within many natural scenarios such as pedestrian dynamics [131, 64], the sizes of actor neighbourhoods are likely to vary more significantly. This can be detrimental to the performance of implementations which parallelise the FRNN search, as it creates unbalanced computation, with some actors having significantly larger neighbourhoods to consider. It is not currently possible to strictly control the location uniformity within the Circles model; however, if initialised with a uniform random population, the model progresses slowly towards a steady state whereby agents form non-uniformly distributed clusters. Visual testing showed that, with a unified force dampening argument of 0.05, 200 iterations were required to consistently progress the model from uniform random initialisation to a clustered steady state. As such, these two parameters have been used throughout experimentation, to ensure a range of distributions are covered. Additionally, benchmarks should be performed without visualisation, to avoid additional graphics overhead impacting performance results.

3.2 Abstract: Network Model

The network benchmark model is designed to utilise the USP data structure as a multi-map, in a manner analogous to how a simplified lattice site model (used for simulating biological immune systems) operates. Actors are assigned discrete bins and also access data from one or more discrete bins. This removes two elements required of FRNN search: the discretisation of a continuous environment and the iteration of a Moore neighbourhood of bins. The significance of this difference is that the runtime split between data structure construction and access is inverted, with each actor accessing significantly fewer bins and less neighbour data.

The impact that improvements to construction time have on overall runtime is primarily dictated by the scale of the queries performed on the USP data structure. Hence, this model best demonstrates the impact of optimisations to construction of the USP data structure and has been used in the assessment of optimisations to the construction stage in Chapter 4.

Due to the SIMT architecture of GPUs, the impact of changing bins can be equally significant compared to that of accessing many items from the data structure. The SIMT architecture enforces synchronisation of warps and blocks such that divergence between threads may leave many threads within the warp or block idle. For this reason, to explore how significant the impact on the overall performance under both techniques is, a model representative of travel through a graph has been utilised.

3.2.1 Model Specification

The model represents a directed graph with uniform connectivity (each vertex has e outward edges and e inward edges). A demonstrative projection of a potential network is shown in Figure 3.2. Actors traverse the graph, reporting their current edge via the spatial partitioning data structure. Once an actor has surpassed the length of their current edge, they survey all edges from their new vertex, counting the number of actors on each edge. Edges have a common capacity (c) and, if possible, the actor will move to the surveyed edge with the greatest remaining capacity. This process then repeats, with the actor moving through their new edge's length. To reduce the impact of actors completing edge traversal at the same time, actors begin iterating a vertex's edge list from a random point3.

The model was designed to enable a parameter sweep over the number of edges per vertex and the size of the actor population, to demonstrate the impact of ACS construction across a range of potential discrete applications. The parameters (explained in Table 3.2) of the network benchmark allow it to be used to assess how the performance of spatial partitioning construction affects the total runtime when factors such as actors per bin and edges per vertex are changed.

Parameter   Description
a           The number of actors within the environment.
v           The number of vertices within the network.
e           The number of (directed) edges per vertex. This value affects how
            many bins each actor accesses per timestep.
c           Edge (soft) capacity. This must be high enough to accommodate the
            other parameters (c ≥ ⌈a / (v·e)⌉).

Table 3.2: The parameters for configuring the network benchmark model.

The model is inefficient, as the iteration of elements within a bin to calculate its size could instead be read directly from the PBM. However, this model exists to evaluate how modifying the number of bins accessed impacts the division of execution time. In practice, a real-world model would likely contain both more costly model logic and larger data stored per actor in the USP data structure.
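The direct PBM read mentioned above is trivial; a minimal sketch, assuming the single-array PBM layout described in Chapter 4 where PBM[bin] holds the start offset of each bin's storage, might be:

// The number of items stored in a bin can be read directly from a
// single-array PBM, as the difference between consecutive offsets.
__device__ unsigned int binSize(const unsigned int *pbm, unsigned int bin) {
    return pbm[bin + 1] - pbm[bin];
}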

Initialisation

Vertices within the network can be implicit. Edges are assigned to vertices in consecutive order, such that the first e edges begin at the first vertex, the second e edges at the second vertex, and so forth.

3Our implementation uses a combination of GPU time and thread indexes to generate a pseudo-random index.

Therefore the network can be represented as several arrays of length equal to the number of edges. The arrays represent each edge's destination vertex, length and capacity. The destination vertex can be assigned in any uniformly distributed pattern, essentially shuffling elements. Length is randomly assigned with respect to the actor speed. Capacity is currently a fixed value c, so may be replaced with a constant. Edge destinations are randomly assigned, and each edge must be assigned a length between 1-2.

Actors within the model are represented by current edge index, distance along edge and speed. The speed and starting edge can either be assigned with fixed or randomly distributed values.
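A minimal host-side sketch of this initialisation is given below; the array names and the use of a fixed-seed std::mt19937 generator are illustrative assumptions, not the benchmark's actual implementation.

#include <algorithm>
#include <random>
#include <vector>

// Illustrative sketch: initialise a network of v vertices with e edges each.
void initNetwork(unsigned int v, unsigned int e,
                 std::vector<unsigned int> &destination,
                 std::vector<float> &length) {
    const unsigned int edgeCount = v * e;
    destination.resize(edgeCount);
    length.resize(edgeCount);
    // Give each vertex e inward edges, then shuffle so destinations are
    // uniformly distributed whilst uniform connectivity is preserved.
    for (unsigned int i = 0; i < edgeCount; ++i)
        destination[i] = i / e;
    std::mt19937 rng(12345);
    std::shuffle(destination.begin(), destination.end(), rng);
    // Edge lengths are uniform randomly assigned in the range [1, 2].
    std::uniform_real_distribution<float> lenDist(1.0f, 2.0f);
    for (unsigned int i = 0; i < edgeCount; ++i)
        length[i] = lenDist(rng);
}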

Single Iteration

To begin each iteration, actors are stored in the spatial partitioning data structure according to their edge index.

Algorithm 4 Pseudo-code for an individual actor's behaviour during a single iteration of the network model.

uint myEdge;
float myEdgeProgress, mySpeed;
uint myVert = myEdge / EDGES_PER_VERT;
myEdgeProgress += mySpeed;
uint nextVert = edges[myEdge].destination;
uint nextEdge = nextVert * EDGES_PER_VERT;
uint minId = UINT_MAX, minCount = UINT_MAX;
for (uint i = 0; i < EDGES_PER_VERT; i++) {
    // Survey each outward edge of the destination vertex (beginning from a
    // pseudo-random offset, omitted here), tracking the least occupied edge.
    uint edgeId = nextEdge + i;
    uint count = countActorsOnEdge(edgeId); // iterate the edge's bin via the PBM
    if (count < minCount) {
        minCount = count;
        minId = edgeId;
    }
}
if (myEdgeProgress >= edges[myEdge].length) {
    if (minCount < c) { // the least occupied edge has remaining capacity
        myEdge = minId;
        myEdgeProgress = 0;
    }
}

Algorithm 4 provides pseudo-code for an actor's behaviour within the network model. Each actor increments their distance along their edge according to their speed.

They then survey the potential following edges leading from their destination vertex, identifying the edge with the most remaining capacity. Following these actions, if the actor has both exceeded their edge's length and found a connected edge with available capacity, the actor will switch to that edge and reset their edge progress. This model could clearly be implemented in a more efficient manner; however, it serves to facilitate benchmarking of particular features.

Validation

Firstly, the generated networks should be validated. This can be performed by checking that each vertex has been assigned e edges inwards and e edges outwards. This is important, as it ensures uniform connectivity such that actors should remain roughly uniformly distributed throughout the network. It is not a problem, for the purposes of the benchmark, if the random generation of the graph produces disconnected sub-graphs.

Secondly, the correct behaviour of actors should be validated. This can be performed by executing a small network (e.g. 10 v, 3 e and 90 a) for multiple iterations (e.g. 100), with stochastic elements removed. Edge lengths and actor speeds should be initialised to constant values, e.g. 1 and 0.5 respectively. It becomes necessary to temporarily remove the stochastic elements of actors' behaviour: actors should be initialised in order, such that the first 3 actors begin on edge 0, the second 3 on edge 1, etc., and the pseudo-random behaviour when deciding the next edge should use each agent's global thread index. Now, if an equal number of iterations are executed, actors should still be uniformly distributed about the network. This can be tested by counting, or producing a histogram of, actors per edge for visual confirmation.
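A host-side sketch of this uniformity check might look as follows; the array names are assumptions for illustration.

#include <cstdio>
#include <vector>

// Illustrative check: count actors per edge and print the histogram.
void validateDistribution(const std::vector<unsigned int> &actorEdge,
                          unsigned int edgeCount) {
    std::vector<unsigned int> histogram(edgeCount, 0);
    for (unsigned int e : actorEdge)
        histogram[e]++;
    // With 90 actors over 30 edges (10 v, 3 e), each edge should hold ~3.
    for (unsigned int i = 0; i < edgeCount; ++i)
        printf("edge %u: %u actors\n", i, histogram[i]);
}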

3.2.2 Effective Usage

The network benchmark model is primarily concerned with assessing the impact of bin accesses per thread during the query stage of FRNN search. This can be affected by changing the parameter e, which modifies the quantity of edges per vertex.

The other metrics, which may affect the performance of both stages of FRNN search, are actor quantity, bin quantity and actor location uniformity. Actor and bin quantity can be modified directly by changing the total number of actors and edges (via vertex count and/or edges per vertex) respectively. To affect the actor location uniformity, the initial actor assignment to edges should be modified according to the desired distribution function. Actor travel is designed to encourage uniform travel, such that the distribution should persist unless the edge capacity value is set too low to permit such a high density (this value can simply be set to the type's maximum value if not desired).

The total number of bins accessed can be modified by changing the number of edges per vertex.

3.3 FLAMEGPU Models

The existing benchmark models presented allow isolation of specific parameters to observe how changing their values affects performance. They neglect to account for the impact once applied to a full model, where model logic unrelated to the FRNN search reduces the relative impact of performance improvements. Therefore, the Pedestrian (LOD), Boids & Keratinocyte models (described below), available with FLAMEGPU, which is a general framework for the implementation of MAS on GPUs, have been used to demonstrate the techniques applied to full models. FLAMEGPU has been modified to support the optimisations presented within this thesis, rather than needing to re-implement real-world models independently. Further clarity on optimisation-specific implementation details is provided in Sections 4.4.2 and 6.3, where FLAMEGPU has been used. Additionally, both of the case studies in Chapter 7 make use of FLAMEGPU.

3.3.1 Pedestrian

FLAMEGPU provides a pedestrian model [91], handled via an implementation of the social forces model in a 2D environment, as described by Helbing and Molnar [72]. There is no configuration for this model, as it is calibrated for the scenario and environment, which is representative of an urban transport hub with a relatively low density of pedestrians. The local collision avoidance of pedestrian modelling relies heavily on FRNN search, where the environment is decomposed into a grid and pedestrians survey others within their Moore neighbourhood. As such, it provides a suitable application with which to test the impact of changes to the spatial partitioning data structure on a model utilising FRNN search. Pedestrian models can be expected to contain 1,000+ actors depending on the scale of the simulation. Similarly, pedestrian densities can vary significantly based on the nature of the environment used. Figure 3.3 provides a visualisation of the environment; however, benchmarking is carried out without a visualisation so that model performance is not constrained by the requirement of a real-time simulation and visualisation.

3.3.2 Boids

FLAMEGPU also provides an implementation of Reynolds' flocking model [136], which provides an example of emergent behaviour represented by independently operating birds in a 3D environment.

Figure 3.2: A visualisation of how a network model instantiated with 5 vertices, 2 edges in each direction per vertex and 17 actors might appear. Actual edge lengths would normally have greater variance. Vertices do not have defined spatial locations, such that it may not be possible to project generated networks to Cartesian space.

Figure 3.3: A visualisation from the pedestrian model representing an urban transport hub.

This model operates in an unstructured environment. Therefore, its population size can be scaled to hundreds of thousands, or millions, of actors.

3.3.3 Keratinocyte

Keratinocyte is a cellular biological model representative of keratinocyte cells, which are the predominant cell found on the surface layer of the skin. This model utilises two instances of the USP data structure, both operating in a 3D environment, corresponding to inter-cellular forces and bonding behaviour respectively. The Keratinocyte model was modified slightly to decrease the resolution of the force USP. This both allowed the model to support a higher agent population and improved performance 2x, without affecting the model's operation.4 As a particularly resource intensive model, it does not support large agent populations due to the memory limitations of a single GPU.

3.4 Summary

This chapter has contributed two benchmark models, 'Circles' and 'Network', for evaluating the performance of accessing the USP data structure via FRNN queries and direct bin accesses respectively. It has detailed how they can be used effectively to assess the impact that model properties have on the performance of FRNN search and the USP data structure. The Circles model predominantly supports the contributions of Chapters 5 & 6, which investigate the optimisation of FRNN search's query stage. The Network model primarily supports the contribution of Chapter 4, which investigates an optimisation for the construction stage of FRNN search.

This chapter has also introduced three FLAMEGPU models. These have been used in Chapters 4 & 6 to contrast the impact of optimisations on the benchmark models, over a range of properties, against their impact on three full complex system models.

4The original force resolution structure had around 400 bins per actor, most of which would therefore remain unused. This was reduced to around 1 bin per actor. It is possible that greater reductions would further improve performance, however, that is outside the scope of this thesis.

Chapter 4

Optimising Construction

Section 2.5 introduced the USP data structure which is used to provide FRNN search. It highlighted the limited approaches to improving the construction of the data structure, which have resulted in an approximate construction. It also raised the open research question: How can modern GPU hardware/features be applied to improve FRNN search?

This chapter proposes that the construction of the USP data structure can be improved by replacing the sorting algorithm. The proposed optimisation, termed 'Atomic Counting Sort' (ACS), aims to take advantage of modern improvements to the performance of atomic operations on GPUs. Replacing the traditional radix sort with Atomic Counting Sort has the benefit of reducing the work required to construct the PBM.

This chapter provides contribution C1, via the proposed 'Atomic Counting Sort' optimisation for construction of the USP data structure during FRNN search, and experiments which demonstrate its impact on performance. These experiments utilise the Network and Pedestrian models, described in Chapter 3, to investigate how a range of properties from the general case of complex system simulation properties, presented in Section 2.5, affect the impact of the 'Atomic Counting Sort' optimisation.

The remainder of this chapter is divided into sections which give the background of the optimisation's development, present how counting sort is applied atomically, introduce how this can be used to optimise construction, present experiments to be used in assessment of the optimised construction, discuss the results obtained and draw high level conclusions from the work within this chapter. The following chapter instead looks at the query stage of FRNN search, proposing an optimisation to reduce divergent code during FRNN queries.

4.1 Justification

Limited research towards FRNN search has targeted the construction of the data structure. Existing approaches have attempted to optimise the sorting to take advantage of the incremental movement of data. However, unlike serial insertion and merge sorts, there is not a parallel sort which is faster when sorting pre-sorted data. Therefore these approaches have often further reduced FRNN search to an approximate method in addition to applying this change.

Sorting is central to the construction of the USP data structure's storage array. As detailed in Section 2.5.2, all items that are to be stored must be sorted according to their containing bin's index. The PBM is then constructed by detecting the boundaries of bins with a simple kernel, producing two arrays representing the start and end indices of each bin.

Many sorting algorithms have been re-implemented to work on the SIMT architecture of GPUs, such as radix sort [142] and quick sort [33]. These have then been further improved, reducing data movement costs with techniques such as improving data locality and cache coherence [21, 143, 147]. Efficient GPU radix sort implementations have been made available through sources such as CUDPP [3], Thrust [25, 5] and CUB [2], leading to its wide adoption. A 2015 study found CUB to provide the greatest sort performance from a selection of libraries providing GPU sort implementations [112].

Counting sort fits neatly into the construction of the USP data structure, as it creates an intermediate representation of the PBM as a by-product of the sorting algorithm. This can then be converted to a single array PBM with a primitive scan operation, at less cost than the existing boundary detection kernel. Furthermore, NVIDIA GPU architectures since Maxwell [10] have greatly improved the performance of atomic operations. This chapter explores the impact of applying Atomic Counting Sort to the construction of the USP data structure.

4.2 GPU Counting Sort with Atomics

Counting sort is a non-comparative integer sorting algorithm which operates by counting the number of items of each value, effectively building a histogram and subsequent PBM [40]. Counting sort can be implemented in serial using 3 loops, with best, average and worst time complexities given by O(n + k), where n represents the number of items and k is the size of the key range. Counting sort operates in a similar way to a single pass of GPU radix sort, sorting the entire data set at once, rather than each digit in turn. Radix sort, by contrast, has the time complexity O(n + rw), where r refers to the radix (number of buckets) used and w the word length, such that k is replaced by rw.

Figure 4.1 provides an overview of how counting sort operates. The algorithm first requires that each item increments the counter assigned to its relevant bin to produce the histogram of bin sizes. As each item contributes to the histogram, the increment operation returns the bin's previous total. This is stored as the item's index within the bin (offset index).
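For reference, a minimal serial sketch of this process is shown below (an illustrative implementation, not the GPU code assessed in this chapter); the three loops correspond to histogram construction, the prefix sum and the reorder.

#include <vector>

// Illustrative serial counting sort: keys in [0, k); pbm sized k + 1 and
// zero-initialised; out sized to match in.
void countingSort(const std::vector<unsigned int> &in,
                  std::vector<unsigned int> &out,
                  std::vector<unsigned int> &pbm) {
    std::vector<unsigned int> offset(in.size()); // per-item offset within its bin
    for (size_t i = 0; i < in.size(); ++i)       // 1: build histogram; the
        offset[i] = pbm[in[i] + 1]++;            //    previous total is the offset
    for (size_t b = 0; b + 1 < pbm.size(); ++b)  // 2: prefix sum produces the PBM
        pbm[b + 1] += pbm[b];
    for (size_t i = 0; i < in.size(); ++i)       // 3: reorder into sorted positions
        out[pbm[in[i]] + offset[i]] = in[i];
}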

In the example of Figure 4.1, there are two 1s in the initial unordered list, so the histogram bin for 1 is set to 2, and their offsets are set to 0 and 1, respectively. A prefix sum is then performed over the histogram of bin sizes, cumulatively adding each histogram value. This creates what we refer to as the PBM. To calculate each item's new unique ordered index, the storage offset of the bin containing the item (as retrieved from the PBM) is added to the item's offset (Offset List in Figure 4.1). Items are subsequently reordered by copying them to their new index in a second array (Ordered List in Figure 4.1). For example, the two 1s from the initial unordered list have offsets of 0 and 1, respectively. These are added to the PBM value for 1 (which is 1), to produce their ordered indices of 1 and 2.

Figure 4.1: A visualisation of the stages of a sample counting sort. A histogram is constructed; as elements contribute to the histogram they store the previous value into the offset list. A prefix sum is then performed across the histogram to produce the PBM. Each element is then relocated according to the PBM value of their containing bin and their value from the offset list.

When implemented in parallel, such as on GPUs, counting sort requires a mode of conflict resolution to construct the histogram, to ensure that items sharing a bin are assigned unique indexes within the bin. There are two main strategies for conflict resolution on GPUs, which can be classified as push and pull [166]. The push strategy requires competing threads to write to a global shared variable, which can be performed with atomic operations (such as max) to ensure the writes do not conflict. A synchronisation must be performed prior to threads then accessing the final result. Alternatively, the pull strategy requires competing threads to write their decision to a unique location within an array. Subsequently, this array can be scanned to produce a resolution that can be read back by each thread.

The key difference between these techniques is that atomic operations (used in the push conflict resolution strategy) are considered non-deterministic, where, in the case of counting sort, the order of items of equal value may differ between repeated sorts. When applied to construction of the USP data structure, however, this has negligible impact, as the items within each bin have no required order.

NVIDIA's Maxwell [10] and Pascal [119] architectures have both improved the performance of atomic operations. These changes enable atomic resolution at every level of the cache hierarchy. An experiment was carried out, producing Figure 4.2, to demonstrate how the performance of ACS responds to varying contention of the data being sorted on a Pascal architecture GPU. These results demonstrate that ACS favours sorting data of low contention, where fewer items share the same value.

In particular, where the range is greater than 2^11, the performance of ACS is similar to that of CUB. This equates to 512 or fewer threads competing. This value is far higher than necessary for performance within the USP data structure; for example, the densest SPH case of 80 neighbours [167] would evaluate to 19 items per bin if uniformly distributed1. Whilst it is possible that other applications, such as crowd modelling, may achieve greater densities, it is unlikely they would come close to surpassing 512 items per bin.

In contrast, to utilise a pull approach, a scan would need to be completed for each of the data structure's bins. Due to the presumed low number of items per bin, the overhead of handling hundreds of small scan operations would likely harm performance.

As demonstrated in Figure 4.3, a prefix sum can be applied to the intermediate histogram produced during ACS to generate a single array PBM. This replaces the boundary detection kernel with a generic parallel primitive operation available from a highly optimised library such as CUB [2] and produces the single array PBM which has half the memory footprint.

Figure 4.2: Radix sort vs ACS - Variable Contention: The runtime of both CUB radix sort (limited to minimum required word length) and atomic counting sort when applied to a dataset of 1,048,576 (2^20) keys. The range of key values varies from 2 to 1,048,576. This result was captured with a Titan X (Pascal) using CUB 1.8.0 and CUDA 9.0 on Windows 10.
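The push strategy underpinning ACS can be illustrated with a minimal CUDA sketch; the kernel and array names are illustrative, not those of the thesis implementation.

__global__ void atomicHistogram(const unsigned int *binIndex,
                                unsigned int *histogram,
                                unsigned int *offset,
                                unsigned int n) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // atomicAdd() returns the bin's previous total, which doubles as this
    // item's unique offset within its bin (the push conflict resolution).
    offset[i] = atomicAdd(&histogram[binIndex[i]], 1u);
}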

4.3 Atomic Counting Sort Construction

Figure 4.3: The bin size histogram, produced as a by-product of atomic counting sort, may be used to produce a single array PBM.

Figure 4.4 illustrates how the PBM is constructed from the neighbour data and neighbour bin index arrays. First, an array of length equal to that of the neighbour data array is created. In this array, each element from the neighbour data array has its containing environment bin's index stored (Neighbour Bin Index Array in Figure 4.4).

1 (6.45 · 80) / 27 ≈ 19, where 6.45 represents the multiplier from 3D radial neighbourhood volume to 3D Moore neighbourhood volume, and 27 the number of bins within the 3D Moore neighbourhood.

Equation 4.1 can be used to transform a spatial coordinate located within the environment to its corresponding environment bin's coordinate (gridPos). This can then be transformed to the bin's 1-dimensional index using either Equation 4.2 for a 2D environment, or Equation 4.3 for a 3D environment.

gridPos = ⌊(envPos / envWidth) ∗ gridWidth⌋ (4.1)

gridId2D = (gridPosy ∗ gridWidthx) + gridPosx (4.2)

gridId3D = (gridPosz ∗ gridWidthy ∗ gridWidthx) + gridId2D (4.3)
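A CUDA sketch of these transforms is shown below, assuming a cubic environment with a uniform grid width; the function and parameter names are illustrative.

// Illustrative device function mapping a 3D position to its 1D bin index
// (Equations 4.1-4.3); integer truncation implements the floor for
// in-bounds (non-negative) positions.
__device__ unsigned int binIndex3D(float3 envPos, float envWidth,
                                   unsigned int gridWidth) {
    const unsigned int gx = (unsigned int)((envPos.x / envWidth) * gridWidth); // Eq 4.1
    const unsigned int gy = (unsigned int)((envPos.y / envWidth) * gridWidth);
    const unsigned int gz = (unsigned int)((envPos.z / envWidth) * gridWidth);
    const unsigned int gridId2D = gy * gridWidth + gx;                         // Eq 4.2
    return gz * gridWidth * gridWidth + gridId2D;                              // Eq 4.3
}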

Next, the neighbour bin index array is used to calculate the offset of each bin's storage (the PBM). This is achieved by first producing a histogram representing the number of messages within each bin. An exclusive sum scan is then performed over the histogram to produce the single-array PBM. For the PBM to be produced correctly, it must have a length one greater than the total number of bins, so that the final value in the PBM denotes the total number of neighbour data items. Finally, the neighbour data array is sorted, so that neighbour data is stored in order of its environment bin indices. This can be achieved using a primitive pair sort, such as that found within CUB [2], and a kernel to subsequently reorder the neighbour data.
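Combining these stages, a host-side sketch of the pipeline might look as follows; the atomicHistogram kernel is the illustrative sketch from Section 4.2, and cub::DeviceScan::ExclusiveSum is the real CUB primitive.

#include <cub/cub.cuh>

// Illustrative ACS construction pipeline (device pointers assumed allocated):
// zero histogram, atomic histogram + per-item offsets, exclusive scan to
// produce the single-array PBM, then a reorder of the neighbour data.
void constructUSP(unsigned int *d_binIndex, unsigned int *d_histogram,
                  unsigned int *d_offset, unsigned int *d_pbm,
                  unsigned int n, unsigned int binCount) {
    cudaMemset(d_histogram, 0, (binCount + 1) * sizeof(unsigned int));
    const unsigned int block = 256;
    atomicHistogram<<<(n + block - 1) / block, block>>>(d_binIndex, d_histogram,
                                                        d_offset, n);
    // CUB exclusive scan converts bin counts into bin start offsets (the PBM).
    void *d_temp = nullptr;
    size_t tempBytes = 0;
    cub::DeviceScan::ExclusiveSum(d_temp, tempBytes, d_histogram, d_pbm,
                                  binCount + 1);
    cudaMalloc(&d_temp, tempBytes);
    cub::DeviceScan::ExclusiveSum(d_temp, tempBytes, d_histogram, d_pbm,
                                  binCount + 1);
    cudaFree(d_temp);
    // A reorder kernel would then copy each item to d_pbm[bin] + d_offset[i].
}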

Figure 4.4: The arrays used to generate a PBM: the neighbour data array is mapped to a neighbour bin index array (bin index identification), a bin size histogram is generated from the bin indices, and a prefix sum over the histogram produces the PBM.

4.4 Experiments

The experiments presented within this section compare the ACS construction technique described in the previous section with the original construction technique described within Section 2.5.2.

Original Algorithm:

• Hash Items

• Radix Sort Hashes

• Reorder Messages & Construct PBM

ACS Algorithm:

• Atomic Histogram

• Scan (Construct PBM)

• Reorder Messages

Firstly, the techniques were tested with quantitative experiments of the construction algorithms in isolation. The purpose of these experiments is to evaluate them under a range of configurations, not inherently representative of real-world applications, but covering the experimental parameter space. This style of experiment provides a greater understanding of the circumstances which most favour each technique and how their performance is affected as parameters change (e.g. scaling of population size or density).

Secondly, the techniques are applied to models within FLAMEGPU representative of real-world applications. These were earlier defined in Chapter 3. The purpose of these experiments is to demonstrate that performance features identified in the previous quantitative tests are observable when applied to real configurations. Furthermore, these experiments are able to demonstrate the impact an optimisation to construction has on the overall performance of a model.

The following two sections explain the experiments of each style that have been performed.

4.4.1 Quantitative Experiments

Two levels of abstraction have been used to quantitatively assess the impact of ACS construction. Firstly, the most abstract assessment considers how performance scales in response to the number of bins. At the code level, the only two parameters which are likely to significantly affect the performance of the algorithms are the size of the actor population and the number of bins which they are divided between. Secondly, building on the abstract assessment, the quantity of actors within bins is assessed. It is apparent (from Figure 4.2) that the distribution of actors among the bins is likely to affect the performance of construction, especially in cases of high contention (many actors per bin) which are unfavourable to ACS.

Bin Quantity

These experiments use a standalone implementation2 specifically for the purpose of timing the construction algorithms and their constituent stages in isolation. A fixed size actor population is used for construction over a large range of bin quantities. In each instance the actors are uniform randomly distributed among the bins. This produces a trend for each technique as the quantity of bins increases. Furthermore, this trend can be separated into its constituent components, allowing identification of how the individual stages of construction are affected.

Actor Density

These experiments were carried out using the same standalone implementation as the previous experiment. However, in this case, only the overall performance of each construction technique is being considered. Both the parameters of actor population and actors per bin are scaled; to ensure a uniform density, the total number of bins is decreased to accommodate the value of actors per bin. In each instance the actors are uniform randomly distributed among the bins.

By then comparing the performance of the two techniques, a heatmap of their relative performance across the two-dimensional parameter sweep is produced. This allows identification of bands of favourable configurations for each technique, using parameters more easily mappable to specific forms of MASs.

4.4.2 Applied Experiments

The USP data structure can be accessed in two forms: individually accessing bins (like a multi-map), or performing a FRNN query which accesses multiple bins (as is the focus of this thesis). These two cases have different access patterns, which are likely to affect the relative significance of construction with respect to the overall runtime of a given model. As such, the experiments described below assess the impact on both forms of access. A more detailed explanation of these models can be found in Chapter 3.

Network Model

These experiments were carried out using the main spatial partitioning test suite attached to this thesis3, using the Network benchmark model. This model is representative of communication about a network, allowing accesses to the data structure to be scaled, varying the impact of construction on overall runtime.

2 https://github.com/Robadob/sp_con_test
3 https://github.com/Robadob/SP-Bench

Feature               Bin Count    Actors Per Bin    Neighbourhood    Neighbourhood
                                                     Volume (2D)      Volume (3D)
Leftmost              5,000        200               629.37           837.21
Rightmost             1,000,000    1                 3.15             4.19
Intersection          780,000      1.28              4.03             5.36
ACS gradient change   54,000       18.52             57.28            77.53

Table 4.1: The Bin Counts from Figure 4.5 converted to relative representations.

Pedestrian

This pedestrian model is available within the main FLAMEGPU repository4, under the name PedestrianLOD, although replacing the environment requires a tool not available within the public FLAMEGPU repository. FLAMEGPU can be built with or without the ACS construction by toggling the pre-processor macro FAST_ATOMIC_SORTING. This model is a functional pedestrian simulation of an urban environment, validating that the optimisation persists when applied to a full model with more costly logic and FRNN search.
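Schematically, this build toggle takes the usual pre-processor form within the construction code path (a simplified sketch, not FLAMEGPU's verbatim source):

#ifdef FAST_ATOMIC_SORTING
    // ACS construction: atomic histogram, prefix sum, then message reorder.
#else
    // Original construction: radix sort followed by boundary detection.
#endif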

4.5 Results

Within these experiments, the original technique, radix sort (provided by CUB, limited to the shortest suitable radix) and a boundary detection kernel, has been compared against the proposed technique, atomic counting sort and prefix sum. These are referred to as 'Original' and 'Atomic' (or ACS) respectively throughout the results.

4.5.1 Quantitative Experiments

Bin Quantity

These experiments were performed using an Nvidia Titan-V GPU under CUDA 9.1 on Ubuntu 16.04.6. For the quantitative experiments a total of 200 configurations, uniformly distributed throughout the tested range (e.g. bin count 1 to 10^6), were each executed 200 times. In order to remove outliers and stochastic noise from the results, the minimum time measured from each of the 200 executions has been used.

Figure 4.5 shows the performance scaling when executed with 1 million actors. The original technique has constant time performance, on 3 tiers, increasing at 260,000 and 525,000 bins respectively. In contrast, the performance of ACS construction quickly degrades until around 54,000 bins, after which it continues to degrade at a much slower linear gradient. As such, after 780,000 bins, the runtime of ACS construction surpasses that of the original technique.

4https://github.com/FLAMEGPU/FLAMEGPU

Figure 4.5: USP Construction - Variable Contention: The performance of USP construction with 1 million unsorted actors, and how this changes as the quantity of bins rises from 5,000 to 1 million.

Figure 4.6: Original Construction - Components (unsorted): The performance of the constituent components of the original USP construction technique with 1 million actors, and how this changes as the quantity of bins rises from 5,000 to 1 million.

Table 4.1 puts these numbers in context, using the equations from Section 6.1⁵ to calculate the corresponding approximate actors per bin and neighbourhood volumes. This demonstrates that the intersection (in Figure 4.5) occurs at a very low density if translated into 2D or 3D environments, with fewer than 5 neighbours appearing within FRNN search's radial neighbourhood. Furthermore, the ACS construction gradient reduces at an approximate radial neighbourhood size of 77.53 actors. In context, literature shows that 80 results per query is an upper bound for SPH models [167], whereas crowd simulations can see this value both much higher and lower dependent on the behaviour of individuals within the model.

Figures 4.6 and 4.7 build upon these results, by considering the performance of the constituent processes of both the original and ACS construction techniques. The legends for these figures list the components in the order of execution, with the overall execution time added to the start. The copy to texture cache operations and the primitive operations memset and scan have negligible cost; however, they are included for completeness. The full Original and ACS construction algorithms are outlined in Sections 2.5.2 & 4.3 respectively.

From Figure 4.6 it is clear that each process, aside from the radix sort, has an almost constant runtime. The points at which the sort runtime increases are where the word length of radix sort increases to accommodate 2^18 and 2^19 bins, respectively.

In contrast, the primary contributor to the runtime of ACS construction is the reorder kernel, which performs 3 reads from global memory to calculate the mapping, before copying the messages to their new sorted position. The performance of the atomic histogram remains constant until 560,000 bins, after which it begins to slow; however, due to the low density at this point it should not be a concern.

When executed with alternate actor populations (not shown in figures), the point of intersection moves. The trend of the original technique remains mostly constant; its time is clearly linked to the total number of actors. In contrast, the performance of ACS construction is closely linked to atomic operations and actors per bin, which causes its performance to degrade with a steeper gradient as the actor population increases. 100,000 actors causes ACS construction to remain at around 1/3 of the runtime of the original technique throughout. On the other hand, 5 million actors places the intersection point at 200,000 bins. Digging into the underlying processes of ACS construction shows that the reorder kernel again dominates the runtime, with the atomic histogram's performance remaining constant until around 590,000 bins, very close to the earlier breakdown seen in Figure 4.7.

These experiments have highlighted a threshold of around 575,000 bins, after which the performance of the atomic histogram begins to degrade. This value is likely a consequence of device specific properties combined with atomic contention.

5The area of the Moore neighbourhood is 2.86x and 6.45x larger than the area of the radial neighbourhood in 2D and 3D respectively.

Figure 4.7: ACS Construction - Components (unsorted): The performance of the constituent components of the atomic counting sort USP construction technique with 1 million actors, and how this changes as the quantity of bins rises from 5,000 to 1 million.

Figure 4.8: ACS Construction - Components (sorted): The performance of ACS construction with 1 million sorted actors, and how this changes as the quantity of bins rises from 5,000 to 1 million.

Additionally, these experiments have demonstrated that the most significant component of ACS construction is the reorder kernel, due to its scattered memory accesses which quickly bottleneck the GPU's caches, rather than the atomic histogram. In practice, when applied to a model, actor movement is incremental, such that the level of scatter should be heavily reduced. After the first iteration of a model, actors should be sorted. From here they often only move incrementally, so the number of actors out of order should be significantly less than under the uniform random allocation of this experiment. Figure 4.8 demonstrates this by reproducing the experiment behind Figure 4.7 with sorted actors. Therefore the earlier Figure 4.7 represents the worst case, whereas Figure 4.8 with sorted actors represents the best case. In this best case it is visible that the runtime of ACS construction is less than 50% of the original technique throughout. The runtime of Reorder is now equal to Histogram throughout. However, as noted in Section 4.2, ACS has no guarantee on the sort order of items of the same value; therefore it is likely that data movement is still occurring, albeit at a more local and cache friendly scale. It can thus be assumed that if the memory footprint of bins increases, e.g. the number of actors within each bin increases, this may become more significant.

Actor Density

These experiments were performed using an Nvidia Titan-V GPU under CUDA 9.1 on Ubuntu 16.04.6. A total of 288 configurations, uniformly distributed between the minimum and maximum actor population sizes and actors per bin tested, were each executed 200 times. In order to remove outliers, the minimum time measured from each of the 200 executions has been recorded. In order to allow testing of all configurations, as the environment must be a single coherent rectangle, bins were organised as a single strip. As actors are static, this only impacts the hashing arithmetic, hence leaving the measured performance unchanged.

As with the previous experiment, results are presented in Figures 4.9 and 4.10 for sorted and unsorted actors, respectively, to demonstrate the effective best and worst cases. Initially, it is visible from both these figures that ACS construction outperforms the original technique in all of the tested cases, potentially only falling behind for the most extreme combination of high actor population and low density with an unsorted population. This is in agreement with results from the previous experiment, whereby Figure 4.6 demonstrated the same switch when density became very sparse. Notably, in this case, these extreme combinations are the only instances whereby the original implementation's radix sort requires an increased word length of 2^18 (hence harming the performance of the original technique, otherwise linked only to actor population).

Looking closer, Figure 4.9 has a blue central band, wherein the most performant configurations are found. Around two thirds of the configurations tested fall within this blue band, representative of ACS construction completing in 35-55% of the original technique's runtime. This shows that when handling unsorted actor populations, as the scale of the actor population increases, so does the required density for the most significant performance improvements.

Figure 4.9: ACS Construction Speedup (unsorted): A 2D parameter sweep across a range of population sizes and densities, demonstrating the configurations whereby ACS construction performs best for unsorted actor populations, relative to the original technique.

Figure 4.10: ACS Construction Speedup (sorted): A 2D parameter sweep across a range of population sizes and densities, demonstrating the configurations whereby ACS construction performs best for sorted actor populations, relative to the original technique.

The sorted actor populations represented in Figure 4.10 again show ACS construction outperforming the original construction in all cases, and by a wider margin than with unsorted data. In particular, it displays a similar distinct pattern in the same position as Figure 4.9's blue band. To better understand this band, Figure 4.11 provides a cross section of Figure 4.10, for configurations with 99.4 actors per bin. Here it is clear that the performance of both techniques is almost linear; however, minor shifts in gradient lead to the band seen in Figure 4.10. Given the similar placement of the band in Figure 4.9, it is likely a sign that the performance of the original technique scales better after the first edge of the band (e.g. (0.25, 70) to (0.7, 160)). As seen earlier in Figure 4.8, the sorted actor population in Figure 4.10 benefits both techniques by reducing the cost of the Reorder; however, ACS appears to benefit more.

These experiments have further highlighted the significant impact that sorted data has on the relative performance of ACS construction. Furthermore, they have highlighted an inconsistency between relative performance on sorted and unsorted actors. For sorted populations the best case appears to be the lowest density throughout. In contrast, for unsorted populations the worst case is the combination of high population and low density. The following applied experiments handle the movement of actors and as such can provide a better understanding of how significant the rate of change is.

4.5.2 Applied Experiments

The faster that actors move between bins of the environment, the further the actors will move from a sorted state each step. The quantitative results have shown that this is an important performance factor for construction of the USP data structure. Additionally, the higher the dimensionality of the environment, the higher the potential for an actor to move. This is due to the need to reduce the dimensionality of the environment from two or three dimensions into a single dimension for storage. By performing the following experiments, representative of actual MASs, an understanding can be gained of where these models place the rate of change between the two previously identified best and worst case scenarios.

Network Model

These experiments, using the Network model (introduced in Section 3.2), were performed using an Nvidia Titan-X (Pascal) GPU under CUDA 8.0 on Windows 10. A total of 320 configurations, uniformly distributed between the minimum and maximum actor population sizes and edges per vertex, were each executed 100 times. In order to reduce outliers, the average step time has been recorded.

A parameter sweep was performed across both edges per vertex, a proxy for the number of bins accessed by each thread, and actor population size.

Figure 4.11: ACS Construction Speedup (sorted) - Cross Section: A cross section of Figure 4.10 with an average of 99.4 actors per bin.

Figure 4.12: ACS Network Model - Construction/Query Time (1,024 vertices): A 2D parameter sweep of the network model under ACS construction when initialised with 1,024 vertices, demonstrating how the relative cost of construction changes according to the actor count and quantity of bins accessed per step.

Two configurations were used, one with 1,024 vertices (Figures 4.12 and 4.13) and one with 16,384 vertices (Figure 4.14), in order to cover a range of densities and bin counts representative of the general case. The number of vertices multiplied by the number of edges per vertex provides the total number of bins within the USP structure. By only modifying the edges per vertex (bins accessed count) and maintaining a static vertex count, the total messages accessed should remain constant (calculated as actor count divided by vertex count), allowing the focus of results to be on the impact of the number of bins accessed.

Figure 4.12 demonstrates the significance of construction (within the network model). The results show that the relative cost of construction decreases fastest (as the edges per vertex increases) at around 250,000 actors. This suggests that construction scales more favourably than accessing the data structure at lower actor populations. As the actor population is increased, this balance shifts more in favour of accessing the data structure, as the relative cost of construction declines at a slower pace. However, the network model's logic is a simplified version of a real application, for the purposes of scaling the number of bins accessed, so the cost of accesses may be higher in practice.

In Figure 4.13, a fairly smooth gradient starts from the top left corner (few actors, high bin accesses) and sweeps across to the bottom right corner (high actors, low bin accesses). The peak improvement observed is around 225% speedup throughout the lowest actor population tested. The original construction technique has outperformed ACS construction in the cases of highest contention (high actors, low bin accesses, larger bins). However, overall, ACS construction was faster in 77% of cases tested (245/320).

When the total number of vertices is increased to 16,384, as shown in Figure 4.14, it appears as though the same pattern as in Figure 4.13 persists, with the smallest actor count providing the optimal case for ACS construction. The increased vertex count and subsequent bin count, however, leads to ACS construction's improved performance persisting through to the highest tested actor population size (1.5 million), where the ACS construction model still executes with 150% speedup. This can be seen as a consequence of reduced overall contention (more vertices leads to more edges, fewer actors per edge). All the tested configurations in Figure 4.14 show that ACS construction improved over the original technique, with only the configurations of highest contention (high actors, few edges per vertex) close to parity.

This experiment has primarily considered the impact that increasing the number of bins accessed has on construction's contribution to the overall runtime. By maintaining a near constant total number of messages accessed, it has allowed assessment of bins accessed independent of additional influences on overall runtime. The core difference demonstrated between Figures 4.13 and 4.14 is that with a higher total number of vertices, and therefore a lower number of messages per bin, the improvement of ACS construction over the original technique is greater. This is in agreement with earlier experiments, where it was shown that the lowest density configurations were most favourable.

Figure 4.13: Network Model - ACS Speedup (1,024 vertices): A 2D parameter sweep of the network model when initialised with 1,024 vertices, demonstrating where ACS improves over the original construction technique, in terms of overall runtime.

Figure 4.14: Network Model - ACS Speedup (16,384 vertices): A 2D parameter sweep of the network model when initialised with 16,384 vertices, demonstrating where ACS improves over the original construction technique.

The impact of dynamic actors within the model (rate of movement between bins) has not harmed the relative performance of ACS construction over the original technique. This suggests that the rate of change is low enough to prevent significant loss of order between each step of the simulation.

Pedestrian

This experiment was performed using an Nvidia Titan-X (Maxwell) GPU under CUDA 8.0 on Windows 10. When filled, the model operates with a peak of approximately 1,950 pedestrian actors, with a variety of densities ranging up to 30 pedestrians per neighbourhood search. When comparing the time to execute 10,000 iterations of the model, the original technique performed in 7,555ms, whereas ACS performed in 7,343ms. Therefore, ACS provided a modest 2.8% improvement to the total runtime. Due to the expensive model logic and larger neighbourhoods (FRNN search operates by searching a Moore neighbourhood of bins), the impact of construction on the overall runtime is less significant than in the network case.

4.6 Conclusion

The work taken to apply ACS to the construction of the USP data structure has demonstrated that the performance of construction can be improved by as much as 3x, dependent on several important properties, providing Contribution C1.

Firstly, the order of the actor population before construction has a significant impact on the performance of both techniques, due to a reduction in widely scattered memory accesses. ACS construction, however, benefits more significantly from ordered actors, as the original technique has greater memory traffic, due to the border detection operation which is performed in the same kernel as message reordering. The impact of this is that ACS construction, whilst almost always more performant than the original technique, is likely to provide a greater improvement where actors are moving between bins at a slower rate.

Secondly, through isolation of experimental parameters, the optimal cases for ACS construction have been identified. Low density environments, with few actors per bin, provide the optimal case for ACS construction over the original technique. Although it favours lower actor populations (achieving 3x improvement with 100,000 actors), even when tested with as many as 1.5 million actors a 2.5x improvement was still achieved. Notably, the maximum density tested (175 actors per bin) is significantly higher than is likely to be found in real complex system simulations. For a 2D model this would equate to an average radial neighbourhood of 550 neighbours, and in 3D it equates to 732 neighbours. Therefore, the most favourable range of 1-45 actors per bin, which provides 2x improvement or greater throughout, still provides support for densities as significant as 188 neighbours per radial neighbourhood.

This is more than enough to satisfy the requirements of SPH, which has a suggested upper bound of 80 neighbours [167]. Neighbourhoods as populated as 188 actors may only be achieved in the most diverse pedestrian models, which pair sparsely populated open areas with high density crowds, due to their unsuitability for the uniform discretisation provided by USP.

Furthermore, the experiments have shown the impact of the ACS construction optimisation on the overall performance of simulations. Construction under the simple network model was around 74-90% of runtime in all tested scenarios. However, when applied to a full pedestrian model, with significantly more expensive computation, the impact of ACS construction was a modest 1.03x (two decimal places) speedup to the total runtime. This is attributable to the high cost of the query stage under FRNN search, where a greater number of bin accesses occur, reducing the relative cost of construction.

As identified at the start of this chapter, improvements to construction will have a greater overall impact on models such as biological lattice site models, which instead use USP as a multi-map. However, as absolute improvements to the construction time are present in all configurations within the general case that were tested, at least modest improvements should be present to the overall runtime of any model that the optimisation is applied to.

In addition to the results presented in this chapter, FLAMEGPU was amended during the research (2018 commit 5146a95) to ensure that when using radix sort the minimum suitable word length is used. This change was necessary to ensure fair assessment between construction techniques and was fed back into the main release of FLAMEGPU. The stepping visible in Figure 4.6 demonstrates the impact of increasing the word length from 17 to 19.

The next chapter instead focuses on the query stage of FRNN search, proposing an optimisation to reduce divergent code during FRNN queries.

Chapter 5

Optimising Queries: Reducing Divergence

Section 2.5 addressed the USP data structure and the two stages, 'construction' and 'query', required to provide FRNN search functionality. In particular it highlighted areas of investigation for optimisation, including the impact of divergence on the performance of the query stage. This raises the open research question: Is it possible to reduce divergence during queries without reducing FRNN queries to an approximate method?

In this chapter it is proposed that the query stage of FRNN search can be improved by reducing the number of bin changes when iterating a Moore neighbourhood. The proposed optimisation, termed 'Strips', aims to reduce the quantity of divergent code. Divergence between threads is costly under the SIMT architecture of GPUs, and hence this optimisation has the potential to improve the performance of FRNN queries.

This chapter provides contribution C2, via the proposed 'Strips' optimisation for the query stage of FRNN search and experiments demonstrating its impact on performance. These experiments use the Circles benchmark model, described in Section 3.1, across a wide range of properties applicable to the general case, identified in Section 2.5, in order to assess the applicability of the 'Strips' optimisation.

The remainder of this chapter is divided into sections which analyse the FRNN search's query algorithm, present the optimisation of the query algorithm, called the Strips optimisation, present experiments to be used for assessment of the optimised query algorithm, discuss the results gained from the experiments, and draw conclusions from the work within this chapter as a whole. The following chapter builds on this work, by proposing a further optimisation targeting redundant memory accesses during the query stage. These two optimisations are compatible, and can be applied simultaneously.

5.1 Justification

FRNN search operates by decomposing a continuous environment, containing spatially located data, into a regular grid. The query stage returns all items within a fixed radius of the search origin. The search origin is mapped to its bin within the regular grid to identify the Moore neighbourhood of bins which may contain items within range. A distance test between the search origin and the location of each item in these bins is performed to decide the results of the query. A general query algorithm for FRNN search is outlined in Algorithm 5.

Algorithm 5 Pseudo-code showing the FRNN search's query algorithm for a two dimensional environment, before optimisation.

vector2i absoluteBin = getBinPosition(origin)
loop -1 <= relativeBin.x <= 1:
    loop -1 <= relativeBin.y <= 1:
        selectedBin = absoluteBin + relativeBin
        if selectedBin is valid:
            hash = binToHash(selectedBin)
            loop i in range(PBM[hash], PBM[hash + 1]):
                handle(neighbourData[i])

The algorithm simply consists of several nested loops to iterate the bins of the desired Moore neighbourhood, containing a final loop to iterate items within the current bin. With an understanding of the USP data structure's memory layout from Section 2.5.1, it is clear that several of the bins accessed can have their data stored contiguously in memory (if messages are sorted). It is proposed that contiguously stored bins could be accessed as a single strip. This may reduce divergent code, where neighbouring threads change bins at different times, and remove memory accesses for the shared boundaries. This chapter will explore the possibility of this approach using a minimal case implementation.

FRNN Query Within a MAS Framework (FLAMEGPU)

FLAMEGPU, as described in Section 3.3, has been used for experiments with functional MASs within this thesis. It has an alternative implementation for handling bin changes during an FRNN search's query, compared to Algorithm 5 from the previous section, which may also benefit from the proposed Strips optimisation. The reason FLAMEGPU uses an alternate implementation is to abstract away the complexity of the query from the user of the framework, allowing them to iterate all items within a Moore neighbourhood via a single loop. Within FLAMEGPU the items accessed via FRNN queries are messages for spatial communication between agents (actors).

Under the FLAMEGPU technique for FRNN queries, the user calls an init function (get_first_NAME_message), which returns a structure containing both the first message and search status information. To access subsequent messages the user must call the next message function (get_next_NAME_message), passing the last returned structure. This then retrieves the subsequent message and updates the search status information. Once the returned structure holds no message, the search is complete.

returned structure holds no message, the search is complete.

Algorithm 6 The algorithm used by FLAMEGPU to calculate the next bin during an FRNN search’s query of a 2D environment.

    boolean nextBin(ivec2 *relative_cell) {
        if (relative_cell->x < 1) {
            relative_cell->x++;
            return true;
        }
        relative_cell->x = -1;
        if (relative_cell->y < 1) {
            relative_cell->y++;
            return true;
        }
        relative_cell->y = -1;
        return false;
    }

Within the next message function, it is first checked whether all messages within the current bin have been returned. If not, the next message is returned; otherwise, the algorithm shown in Algorithm 6 (FLAMEGPU’s next bin operation) is performed to identify the next bin. It should be clear from this algorithm that FLAMEGPU also performs these bin accesses individually and may benefit from handling neighbouring bins as a single strip, as this would reduce the number of times the code in Algorithm 6 must be called.
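For illustration, a user-side agent function following this pattern might be structured as below. This is a hedged sketch in the style of FLAMEGPU’s generated message API, with NAME = location; the structure and parameter types are illustrative rather than quoted from the framework.

    // Hedged sketch of FLAMEGPU's FRNN query pattern from the user's view.
    // 'location' stands in for NAME; type names are illustrative.
    __FLAME_GPU_FUNC__ int inputdata(xmachine_memory_Agent *agent,
                                     xmachine_message_location_list *messages,
                                     xmachine_message_location_PBM *partition_matrix) {
        // The init function returns the first message plus search status.
        xmachine_message_location *message = get_first_location_message(
            messages, partition_matrix, agent->x, agent->y, agent->z);
        while (message) {  // The search is complete when no message is returned.
            // Model-specific handling of the message occurs here.
            // The next message function advances the search, invoking the
            // next bin algorithm (Algorithm 6) once the current bin is exhausted.
            message = get_next_location_message(message, messages, partition_matrix);
        }
        return 0;
    }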

5.2 Strips Optimisation

It is anticipated that changing bins adds latency to the query process. As such, it is proposed that the number of bin changes can be reduced by treating bins within a query’s Moore neighbourhood, which are stored contiguously in memory, as a single strip. In two dimensions, the Moore neighbourhood of a query origin’s containing bin consists of a 3 × 3 block of bins. These nine bins exist as three strips of three bins, where each strip’s storage exists contiguously in memory. In three dimensions, the 27-bin Moore neighbourhood is formed of nine strips of three bins. Within a strip, only the first bin’s start index and the last bin’s end index need to be loaded from the PBM to identify the range within the neighbour data array that contains the neighbour data from each of the strip’s bins.

Figure 5.1: Expanding on the Moore neighbourhood from Figure 2.10 (on page 42). The bold borders denote the individual accesses to bins in the original technique (left) and with the Strips optimisation (right).

This optimisation only affects how bins are accessed, leaving how the data is structured in memory, as described in Section 2.5.1, unchanged. In the example shown in Figure 5.1, each row of the Moore neighbourhood would become a strip, i.e. bins 1-3, 5-7 and 9-11 would be treated as strips.

The SIMT architecture of GPUs does not execute divergent threads in each synchronous thread group (warp) in parallel. This leads to one or more threads within a warp becoming idle whenever branching occurs, until the implicit synchronisation point where the divergence ends. Therefore, it is assumed that when threads within the same warp are accessing different bins, which will most often have differing sizes, all threads within the warp must wait for the thread accessing the largest bin to complete before continuing to the next bin. An example of the impact of this implicit synchronisation at bin changes is shown in Figure 5.2. By merging contiguous bins, the Strips optimisation reduces the number of these implicit synchronisation points. Additionally, the use of Strips can only reduce divergence between threads of a warp accessing differently sized bins. As demonstrated in Figure 5.2, this should reduce both the quantity and duration of idle threads; it is not possible for the divergence to increase as a consequence of this optimisation.

Figure 5.2: A visual example of how implicit synchronisation at bin changes may impact runtime. Checked lines denote the bin changes; synchronisation of bin changes leads to a lot of unusable idle thread time.

The proposed Strips optimisation essentially removes two bin changes from every central strip (of 3 bins), where the middle bin is not at the edge of the environment, and one from every boundary strip. This occurs three times per neighbourhood in two dimensions and nine times in three dimensions. Therefore the optimisation provides a near constant time speed up through removal of code, related to the problem dimensionality and number of simultaneous threads. Additionally, the removal of the bin switches, as discussed above, is likely to reduce divergence, whereby threads within the same warp are operating on differently sized bins and hence would change bins at different times. Algorithm 7 shows pseudo-code for the original FRNN search’s query technique after the Strips optimisation has been applied, in contrast with Algorithm 5, removing one of the loops and introducing an additional bounds check.

The Strips optimisation is not compatible with techniques which utilise non-linear indexing of bins, such as space filling curves [56]. However, as discussed in Section 2.5, techniques such as space filling curves, which improve memory locality and coherence, are less applicable for modern GPU architectures due to improved caching.

Algorithm 7 Pseudo-code showing the Strips optimisation applied to the FRNN search’s query algorithm for a two dimensional environment.

    vector2i absoluteBin = getBinPosition(origin)
    loop -1 <= relativeStrip <= 1:
        startBin = absoluteBin + vector2i(-1, relativeStrip)
        endBin = absoluteBin + vector2i(1, relativeStrip)
        startBin = clampToValid(startBin)
        endBin = clampToValid(endBin)
        startHash = binToHash(startBin)
        endHash = binToHash(endBin)
        loop i in range(PBM[startHash], PBM[endHash + 1]):
            handle(neighbourData[i])
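A device-code counterpart of Algorithm 7 can be sketched as follows, reusing the illustrative conventions of the earlier sketch (d_pbm, d_data, gridWidth, binWidth); names remain assumptions rather than the thesis’s implementation.

    // Minimal CUDA sketch of the Strips-optimised 2D FRNN query (Algorithm 7).
    __device__ void frnnQueryStrips(const float2 origin, const float radius,
                                    const unsigned int *d_pbm,
                                    const float2 *d_data,
                                    const int gridWidth, const float binWidth) {
        const int bx = (int)(origin.x / binWidth);
        const int by = (int)(origin.y / binWidth);
        // One iteration per strip (row), rather than one per bin.
        for (int dy = -1; dy <= 1; ++dy) {
            const int y = by + dy;
            if (y < 0 || y >= gridWidth)
                continue;                        // strip outside environment
            const int xMin = max(bx - 1, 0);     // clamp strip ends to valid bins
            const int xMax = min(bx + 1, gridWidth - 1);
            // Two PBM loads bound the whole strip's items, which the sorted
            // storage guarantees are contiguous in memory.
            const unsigned int start = d_pbm[y * gridWidth + xMin];
            const unsigned int end   = d_pbm[y * gridWidth + xMax + 1];
            for (unsigned int i = start; i < end; ++i) {
                const float ix = d_data[i].x - origin.x;
                const float iy = d_data[i].y - origin.y;
                if (ix * ix + iy * iy <= radius * radius) {
                    // handle(d_data[i]);
                }
            }
        }
    }

Compared with the earlier sketch, the inner x loop and its per-bin PBM loads disappear entirely; the only remaining bin-change points are the three strip boundaries.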

Application of the Strips Optimisation Within FLAMEGPU

FLAMEGPU still iterates bins and messages during an FRNN search’s query in the same order as the general case. With the algorithm for changing bin being significantly different (for loop vs if statements), the relative constant cost of bin changes may be slightly different. Hence, experimental results under the Strips optimisation are likely to differ between FLAMEGPU and the general case of the minimal implementation.

5.3 Experiments

There are two key areas where the Strips optimisation can be expected to have an impact: through removal of code and through a reduction in divergence of threads sharing warps. Firstly, the removal of operations should provide a near constant speedup. In 2D, 9 bin changes are reduced to 3, and, in 3D, 27 are reduced to 9. The respective removal of 6 and 18 occurrences of the bin change algorithm (such as that shown in the previous section) should find similarly scaled reductions in runtime between 2D and 3D.

Secondly, there is the impact on divergence, although this is more difficult to assess directly. By reducing the bin changes, it is possible that there may be an observably greater improvement for models with more computationally intensive handling of items from the FRNN search. However, the handling of divergence is decided by the compiler and GPU firmware. Whilst proportions of active threads can be observed directly using profiling tools, the reason for the particular levels of divergence, especially in such complicated algorithms, is influenced by many runtime properties such as the quantity and distribution of actors.

The Circles model [38] (explained in Section 3.1), with uniform random initialisation, is used to evaluate the impact of the Strips optimisation on the performance of FRNN search. The Circles model allows parameters that can affect performance scaling, such as population size and population density, to be controlled, and hence provides understanding of how the impact of the Strips optimisation is affected by runtime properties.

The experiments have been evaluated in both two and three dimensions, as the three dimensional case has a greater reduction in bin changes.

FRNN search is composed of two stages, construction and query. As the Strips optimisation does not impact construction, these experiments only investigate impacts on the query stage. In this case, the query stage refers to the neighbourhood search performed by each actor in parallel, including any model specific logic that occurs per item accessed during the FRNN query. The following experiments were performed:

• Experiment 1: Strips, Scaling Population Size - This experiment aims to evaluate the Strips optimisation with respect to speedup of the query operation through a range of population sizes.

• Experiment 2: Strips, Scaling Density - This experiment aims to evaluate the Strips optimisation with respect to speedup of the query operation through a range of neighbourhood sizes (densities).

• Experiment 3: Strips, Warp Divergence - This experiment aims to evaluate the Strips optimisation with respect to its impact on divergence.

Experiments 1 and 2 assess the impact of the Strips optimisation on performance scaling through population size and density, in two and three dimensions, to identify the properties under which the optimisation is favourable.

In order to assess the impact on divergence, Experiment 3 observes an instance of the Circles model using the NVIDIA Visual Profiler before and after the Strips optimisation has been applied. This allows the amount of divergence within warps to be measured, termed Warp Execution Efficiency (WEE). The Circles model contains an appropriate level of mathematical operations such that any impact on divergence should be visible if present.

A bespoke minimal implementation1 has been used to carry out these experiments. The use of a minimal implementation isolates the FRNN search from additional overheads found within applied frameworks (such as FLAMEGPU) and provides greater control over timing, allowing greater accuracy in the collected results. This bespoke implementation follows the general case presented in Section 2.5. The targeting of the general case ensures optimisations are more widely applicable, whilst still remaining relevant when applied to real models. FLAMEGPU has not been used to carry out these experiments; however, testing has demonstrated similar results to those produced by the minimal implementation. In the following chapter both the minimal implementation and FLAMEGPU are used by the experiments.

1 https://github.com/Robadob/sp-2018-08

5.4 Results

The following experiments were performed using an NVIDIA Titan-X (Pascal) GPU in TCC mode under CUDA 9.1 on Windows 10. The two cases, before and after application of the Strips optimisation, are referred to as ‘Original’ and ‘Strips’ respectively.

5.4.1 Experiment 1: Strips, Scaling Population Size

To assess the impact of the Strips optimisation on the performance of the query operation, the Circles model was executed before and after the Strips optimisation was applied. In these experiments, the density is configured such that an average neighbourhood query should return 70 results. This density was chosen as it is an expected average of the densities observed in SPH [167] and MAS. The actor population has been scaled from 100,000 to 5 million, in order to assess the impact through a wide range of viable population sizes and to ensure maximal device utilisation. A total of 50 configurations, uniformly distributed between the minimum and maximum actor population sizes, were each executed 200 times. In order to reduce the impact of outliers, the average time of the 200 executions has been recorded.

In both the 2D (Figure 5.3a) and 3D (Figure 5.3b) results, execution after application of the Strips optimisation performed the fastest throughout. In 2D the Strips optimised version performs 1.2x faster, whereas in 3D the proportional improvement is a slightly lower 1.17-1.18x throughout. These results show that the impact of applying the Strips optimisation is a performance improvement which scales linearly with actor population.

5.4.2 Experiment 2: Strips, Scaling Density

This experiment explores the impact of scaling density, which provides more context to the previous experiment’s results, allowing more detailed analysis. To further assess the impact of the Strips optimisation on the performance of the query operation, the Circles model was executed before and after the Strips optimisation was applied. In this experiment, the impact of changes to density is evaluated with a fixed actor population of 1 million in order to ensure full device utilisation. A total of 50 configurations, uniformly distributed between the minimum and maximum radial neighbourhood volumes, were each executed 200 times. The radial neighbourhood volume refers to the average number of items returned by each FRNN search query. In order to reduce the impact of outliers, the average time of the 200 executions has been recorded.

With a fixed actor population of 1 million in 2D (Figure 5.4a), query time increases linearly with radial neighbourhood size. Throughout, the queries under the Strips optimisation perform a stable 1-1.2ms faster than those without the optimisation. A similar pattern is visible in 3D (Figure 5.4b). Here the Strips optimisation performs a stable 2.3-3.5ms faster than without the optimisation. This roughly three-fold increase is in line with the reduction in bin changes, from 9 to 3 in 2D and 27 to 9 in 3D, reductions of 6 and 18 bin changes, respectively. Due to the stable speedup, the proportional speedup decreases significantly as the radial neighbourhood size increases.

In combination with the previous population scaling experiment, it becomes clear that the constant time improvements with respect to population density should also scale proportionally with respect to population size. Low density, low population configurations are shown to have the greater proportional improvement, and high density, high population configurations have a significantly smaller proportional improvement.

Figure 5.3: Strips, Scaling Population: The performance of the query operation before and after the Strips optimisation was applied during the Circles benchmark model, with an average radial neighbourhood population of 70 actors in (a) 2D and (b) 3D.

Figure 5.4: Strips, Scaling Density: The performance of the query operation before and after the Strips optimisation was applied during the Circles benchmark model, with a population of 1 million actors in (a) 2D and (b) 3D. Radial neighbourhood size refers to the average number of items found by actors during FRNN search.

5.4.3 Experiment 3: Strips, Warp Divergence

A sample configuration of the Circles model was executed via the NVIDIA Visual Profiler for 200 iterations, enabling observation of the performance properties of individual kernel invocations. This allows insight to be gained into how application of the Strips optimisation impacts divergence. Figure 5.5 provides an example of some of the information provided about a single kernel invocation. In particular, the measurement of divergence is concerned with ‘Warp Execution Efficiency’ (WEE), which is documented as the ‘Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor’. The optimal value of WEE is 100%, which represents no divergence; lower values are harmful to performance, as divergence leads to idle threads. The ‘Non-Predicated’ variant only concerns conditional instructions, therefore it only represents a subset of the total.

Figure 5.5: Kernel invocation information presented by the Visual Profiler. The WEE metric is listed as Warp Execution Efficiency.

Table 5.1 presents the measured WEE before and after application of the Strips optimisation, for the first and last kernel invocations of the FRNN search’s query operation during the Circles model. The first and last invocations have been tested, as they represent uniform random and clustered distributions respectively. A more uniform distribution is likely to have more equally sized bins and thus accounts for an observably higher WEE. The results in Table 5.1 show that the Strips optimisation has reduced divergence during the query operation.

    Invo. #   Actor Distribution   Original   Strips   Difference
    0         Uniform Random       49.7%      52.1%    2.4%
    199       Clustered            68.9%      80.0%    11.1%

Table 5.1: WEE during execution of the Circles model. Configuration: 2D, 1 million actors, average radial neighbourhood size of 70.

This improvement to the WEE ranges from an absolute improvement of 2.4% (49.7% to 52.1%), where actors are most uniformly distributed and divergence is highest, to 11.1% (68.9% to 80.0%), where actors are most clustered. The improvement when actors were in a clustered distribution could also be considered a 1.16x improvement to WEE. When comparing the runtime of the same kernel invocations, 6.58ms and 5.47ms respectively, a greater improvement of 1.2x is seen. This potentially accounts for both the removal of code and the impact of reduced divergence, which is evidence to suggest the decreased divergence is a greater contributor to the performance improvements seen than the removal of code.

The implication of improving WEE is that models which spend more time within the query stage, on processing returned items, would see a greater absolute improvement, as a result of more time being spent within the highly divergent query stage. More time spent within the query stage may be due to higher densities, or MAS with more complex logic for processing returned items (e.g. certain biological cellular models with high numbers of properties). Contrary to this implication, the results from the previous experiment, which explored scaling density, only showed a very small absolute increase between the lowest and highest densities. In part, some of the improvement in WEE can be attributed to the reduction in the number of bin change operations; however, this further confirms the complex relationship between thread divergence and performance.

The emergent clusters within the Circles model (earlier shown within Figure 3.1b) have a diameter slightly less than the FRNN search radius; however, they are not centrally located within bins, so occupy multiple bins. This causes actors within a cluster to have several different Moore neighbourhoods which they access. Similarly, one bin may contain only a few actors from a cluster, whereas another may contain a high number. Furthermore, due to the 2D mapping of bins to a 1D index, bin offsets in the Y-axis do not form strips. When extended to 3D, this would likely see even lower WEE, due to greater potential for divergence. By reducing the individual numbers of bins accessed, the Strips optimisation has allowed bin volumes to merge, such that divergent cases overlap (earlier shown in Figure 5.1). In many cases this will have no impact, due to the initial divergence arithmetic; however, as the profiling data has shown, in some cases (as high as 1/3) divergence has been avoided by the Strips optimisation.
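For reference, the ratios quoted above follow directly from the profiled figures:

$$\frac{80.0\%}{68.9\%} \approx 1.16\times \qquad \frac{6.58\,\text{ms}}{5.47\,\text{ms}} \approx 1.20\times$$

where the first ratio is the relative WEE improvement for the clustered invocation, and the second the corresponding kernel runtime improvement.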

5.5 Conclusions

This chapter has demonstrated an optimisation for the FRNN search’s query stage, termed Strips, which removes redundant operations from the query algorithm. Experiments were carried out using the Circles model, due to its versatility for controlling a range of actor population parameters which influence the performance of FRNN queries.

The direct impact of the Strips optimisation is the removal of redundant code. As demonstrated with the experiments, this provides a near constant speed-up, relative to the actor density and environment’s dimensionality. A secondary impact of the Strips optimisation is the improvement of WEE, such that threads spend less time inactive due to divergence. This is achieved by merging potentially unevenly sized bins, such that the combined total offset, leading to divergence, may be reduced. In the worst case this has no effect on divergence of threads within a warp, whereas in the best case this can cancel all divergence within a warp.

The direct impact of divergence on performance is not easily assessed in isolation, due to thousands of warps executing simultaneously, each with different potentials for divergence. Furthermore, the impact of the optimisation on divergence changes as actors move between bins. However, using profiling tools, it has been demonstrated that WEE can be improved from 68.9% to 80.0%. This could potentially provide an improvement as significant as 1.16x to performance. Observed values show an overall improvement of 1.2x, which additionally accounts for effects such as the removal of instructions between the Original and Strips optimised kernels.

The improvement to WEE was greatest with a clustered actor population. These distributions are more likely to be seen in MAS representing structured environments (e.g. macro-biological actors) rather than the highly compacted actor populations found in molecular models and similar.

The experiments in this chapter were carried out using a Pascal architecture GPU and other architectures may differ. In particular, it is not possible to assess the impact with respect to earlier architectures, such as Fermi and Kepler, which benefit from memory coherence optimisations such as space filling curves [56]. Goswami’s application of space filling curves does not isolate their impact, and hence a parallel cannot be drawn.

In summary, the Strips optimisation has been shown to provide a consistent improvement to the performance of the FRNN search’s query operation. The absolute speedup of this improvement scales linearly with population and remains static with changes to population density. Performance results suggest improvements of 1.2x and 1.18x in 2D and 3D, respectively. Actual improvements are likely to vary as a consequence of the computational expense of the model logic performed, per actor, within the query algorithm and the distribution of actors among the data structure’s bins. As such, this chapter has provided contribution C2.

The next chapter applies a further optimisation in conjunction with the Strips optimisation to address the level of redundant data accessed during the FRNN query stage.

This sees the peak improvement to the query operation increase to 1.32x and expands the experiments to consider the performance impact of the optimisations on full models.

Chapter 6

Optimising Queries: Surplus Memory Accesses

Chapter 5 proposed the Strips optimisation for the query stage of FRNN search, building on knowledge of the USP data structure and FRNN search algorithm from Section 2.5, to improve performance through the reduction of divergent threads. Section 2.5 also raised another question relating to FRNN queries: How can the redundant area between the Moore neighbourhood and radial search area during FRNN queries be reduced?

This chapter proposes that the query stage of FRNN search can be further improved by reducing the number of redundant memory accesses. It is posited that by changing the ‘Proportional Bin Width’, relative to the query radius, the trade-off between redundant items accessed and total bin accesses can be optimised. By optimising this trade-off and reducing the high level of redundant items accessed, reductions in memory traffic may lead to improved performance.

This chapter provides contribution C3, via the proposed ‘Proportional Bin Width’ optimisation for the query stage of FRNN search, which is compatible with the earlier proposed ‘Strips’ optimisation, and evaluation of the optimal ‘Proportional Bin Width’ candidate. Additionally, experiments demonstrating the efficacy of the optimisation are contributed. These experiments utilise the Circles benchmark and a range of full complex simulation models, described in Chapter 3, to both assess the optimisation with regard to a range of properties applicable to the general case, identified in Section 2.5, and demonstrate the impact when applied to full models.

The remainder of this chapter is partitioned into sections which summarise the significance of redundant memory accesses during the FRNN search’s query algorithm, present the Proportional Bin Width and combined Strips and Proportional Bin Width optimisations, evaluate these optimisations against benchmark and real-world applied models, discuss the results obtained from the experiments, and conclude the work from this chapter as a whole in the wider context of complex system simulations.

The following chapter then combines the work from this and the previous two chapters, providing a case study which applies all three optimisations simultaneously to two complex system simulations with greatly differing properties, in order to confirm the understanding of their suitability.

6.1 Proportional Bin Width Optimisation

Standard implementations of USP subdivide the environment into bins with a width equal to the radial search radius. This decision ensures that all radial neighbours are guaranteed to be found in the same or surrounding bins of the query’s origin. This is referred to as the Moore neighbourhood. The implication of this decision is that many surplus items outside of the radial neighbourhood must be accessed. In 2D the Moore neighbourhood accessed has 2.86x the spatial area of the radial neighbourhood. In 3D this becomes 6.45x the volume.

It is proposed within this chapter that an alternate ratio between the radial search radius and bin width (‘Proportional Bin Width’) can reduce the surplus area accessed during FRNN queries, which may result in improved FRNN search performance. By adjusting the width of the bins relative to the neighbourhood radius, the number of bins that must be accessed (to ensure the entire radial neighbourhood is covered) and the volume of individual bins change. Thus the total (and surplus) area of the environment accessed also changes. Figure 6.1 provides an example of how alternate Proportional Bin Widths change these values. When the width of bins is R, the total grid area is greater than when the width of bins is R/2, equivalent to (3R)^2 and (2.5R)^2 respectively. These areas are spatial, but provide an estimate for the proportion of redundant items accessed in the average case or under uniformly distributed items.

As detailed in the previous chapter, additional bin accesses have a negative impact on performance. Therefore, in order to minimise redundant items accessed during the query operation, the impact of increased bin accesses per query must also be balanced. Hoetzlein [77] considered a similar approach in his work; however, he did not suggest an optimal proportion, only considering the trade-off between bin volume and bins per query. The following calculations give a wider consideration to the theoretical impact of adjusting the Proportional Bin Width.

The min (a_g) and max (a_G) grid areas of access under any Proportional Bin Width in 2 dimensions can be calculated using Equations 6.1 & 6.2, where R is the chosen search radius and W the absolute bin width. In Table 6.1, a_g increases when W = 0.7, as 0.7 is not a factor of the search diameter, whereas W = 1 and W = 0.5 are both able to achieve the minimum grid area that would fit the search radius.

$$a_g = \left\lceil \frac{2R}{W} \right\rceil^2 W^2 \qquad (6.1)$$

Figure 6.1: Visual representation of how different bin widths (grid cells) require different search areas to cover the whole radial neighbourhood (green circles). Left, BinWidth = R requires 9 bins with a total area of (3R)^2. Right, BinWidth = R/2 requires 25 bins with a total area of (2.5R)^2.

$$a_G = \left( \left\lceil \frac{2R}{W} \right\rceil + 1 \right)^2 W^2 \qquad (6.2)$$

By additionally calculating the proportion of queries corresponding to the min grid area (p) and the max grid area (p'), in a 1-dimensional implementation, Equation 6.4 can be used to calculate the average grid area accessed (a_a) in two dimensions.

$$p = \frac{2R - \left\lfloor \frac{2R}{W} \right\rfloor W}{W} \qquad (6.3)$$

$$p' = 1 - p$$

$$a_a = p^2 a_g + 2pp'\sqrt{a_g a_G} + p'^2 a_G \qquad (6.4)$$

Dividing by the radial neighbourhood area (a_r), Equation 6.5 then provides the surplus multiplier, which denotes the search grid’s area relative to the radial neighbourhood area. This represents the level of redundant memory access where data is uniformly distributed.

$$a_r = \pi R^2 \qquad (6.5)$$

Table 6.1 gives the values resulting from the above Equations 6.1-6.5 for three differing Proportional Bin Widths. These equations can be extended to 3 dimensions or more, as required, as they are constructed from the 1-dimensional mathematics surrounding the search diameter (2R) divided by the bin width (W).

    Proportional Bin Width               1       0.7     0.5
    Min Grid Area (a_g)                  4       4.41    4
    Max Grid Area (a_G)                  9       7.84    6.25
    Average Grid Area (a_a)              9       4.84    6.25
    Bin Count Max ((⌈2R/W⌉ + 1)^2)       9       16      25
    Strip Count Max (⌈2R/W⌉ + 1)         3       4       5
    Max Surplus Mult. (a_G/a_r)          2.86    2.50    1.99
    Average Surplus Mult. (a_a/a_r)      2.86    1.54    1.99

Table 6.1: This table provides an example of the variability introduced by changing the bin width as a proportion of a search radius of 1 (with a corresponding radial search area of 3.14) in 2D. These values can be calculated using Equations 6.1 to 6.5.

Figure 6.2: Visual representation of how under some bin widths the query origin affects the size of the required search area to cover the whole radial neighbourhood. R denotes the neighbourhood radius.
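The quantities in Table 6.1 follow mechanically from Equations 6.1-6.5. As a sanity check, the following host-side sketch (plain C++, compilable with nvcc alongside the other examples; the variable names mirror the thesis’s symbols rather than any particular implementation) reproduces the table’s grid areas and surplus multipliers for R = 1:

    #include <cmath>
    #include <cstdio>

    // Reproduces the 2D quantities of Table 6.1 from Equations 6.1-6.5, R = 1.
    int main() {
        const double PI = 3.141592653589793;
        const double R = 1.0;
        const double widths[] = {1.0, 0.7, 0.5};
        for (const double W : widths) {
            const double ag = pow(ceil(2 * R / W), 2) * W * W;      // Eq. 6.1
            const double aG = pow(ceil(2 * R / W) + 1, 2) * W * W;  // Eq. 6.2
            const double p = (2 * R - floor(2 * R / W) * W) / W;    // Eq. 6.3
            const double pp = 1.0 - p;                              // p'
            const double aa = p * p * ag + 2 * p * pp * sqrt(ag * aG)
                            + pp * pp * aG;                         // Eq. 6.4
            const double ar = PI * R * R;                           // Eq. 6.5
            printf("W=%.2f: ag=%.2f aG=%.2f aa=%.2f maxSurplus=%.2f avgSurplus=%.2f\n",
                   W, ag, aG, aa, aG / ar, aa / ar);
        }
        return 0;
    }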

Some bin widths lead to different bin counts dependent on the query origin’s position within its own bin – see Figure 6.2, where the query origin can lead to either a square or rectangular search grid. Due to the SIMT architecture of the GPU, all 32 threads of each warp will step through the maximum number of bins any individual thread within the warp requires, although this may lead to some threads remaining idle through the later bins. Similarly, it is expected that threads processing surplus items (which do not require handling by the model) will incur a runtime cost similar to that of the desired items, which are likely to incur additional compute.

Figure 6.3: Visual representation of how, under some bin widths and search origins, corner bins do not always need to be accessed. R denotes the search radius.

Some combinations of Proportional Bin Width and query origin do permit corners of Moore neighbourhoods to be skipped, as they do not intersect with the radial neighbourhood (see Figure 6.3). However, the expensive mathematical operations (e.g. trigonometric functions) required to test for bins which may be skipped, the frequency of checks required, and the low incidence of bins which may be skipped, suggest that in most cases the additional compute necessary to detect and skip the few redundant corner bins would incur a prohibitive runtime cost. This cost would worsen as the Proportional Bin Width is decreased and bins become smaller, outweighing any performance savings in all but the highest densities.

A further consideration with respect to increasing the number of bins is that this leads to a larger partition boundary matrix. This both requires more memory for storage and increases construction time. These impacts are considered further in Sections 6.3 and 6.4.

Algorithm 5 (Repeated from page 85) Pseudo-code showing the FRNN search’s query algorithm for a two dimensional environment, before optimisation.

    vector2i absoluteBin = getBinPosition(origin)
    loop -1 <= relativeBin.x <= 1:
        loop -1 <= relativeBin.y <= 1:
            selectedBin = absoluteBin + relativeBin
            if selectedBin is valid:
                hash = binToHash(selectedBin)
                loop i in range(PBM[hash], PBM[hash + 1]):
                    handle(neighbourData[i])

Algorithm 8 Pseudo-code showing the Proportional Bin Width optimisation applied to the query operation for a two dimensional environment.

    vector2i binMin = getBinPosition(origin - radius)
    vector2i binMax = getBinPosition(origin + radius)
    loop binMin.x <= selectedBin.x <= binMax.x:
        loop binMin.y <= selectedBin.y <= binMax.y:
            hash = binToHash(selectedBin)
            loop i in range(PBM[hash], PBM[hash + 1]):
                handle(neighbourData[i])

Algorithm 8 demonstrates how the Proportional Bin Width optimisation may be applied by altering the selection of bins iterated, in contrast to that of Algorithm 5 (the original FRNN query). In the 2D example, the nested loop changes from iterating a 3 × 3 Moore neighbourhood to an N × M Moore neighbourhood. The exact values of N and M are dependent on the values returned by getBinPosition(), which clamps the continuous space coordinates to within the environment bounds and then maps them to their containing bin’s discrete coordinates.
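As a concrete illustration of the clamped mapping performed by getBinPosition(), a device-side sketch might look as follows; the environment bounds and parameter names are assumptions for illustration only.

    // Hedged sketch of getBinPosition(): clamp a continuous coordinate to the
    // environment bounds, then map it to discrete bin coordinates.
    __device__ int2 getBinPosition(float2 pos, const float2 envMin,
                                   const float2 envMax, const int gridWidth,
                                   const float binWidth) {
        // Clamp continuous coordinates inside the environment.
        pos.x = fminf(fmaxf(pos.x, envMin.x), envMax.x);
        pos.y = fminf(fmaxf(pos.y, envMin.y), envMax.y);
        // Map to discrete bin coordinates, clamped to the last valid bin.
        int2 bin;
        bin.x = min((int)((pos.x - envMin.x) / binWidth), gridWidth - 1);
        bin.y = min((int)((pos.y - envMin.y) / binWidth), gridWidth - 1);
        return bin;
    }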

6.1.1 Combined Technique

The removal of the W^2 term from Equations 6.1 & 6.2, used to calculate the min (a_g) and max (a_G) grid areas, can be used to calculate the number of bins. Similarly, taking the square root of these values provides the number of strips, following the optimisation from Section 5.2.

Algorithm 7 (Repeated from page 87) Pseudo-code showing the Strips optimisation applied to the FRNN search’s query algorithm for a two dimensional environment.

    vector2i absoluteBin = getBinPosition(origin)
    loop -1 <= relativeStrip <= 1:
        startBin = absoluteBin + vector2i(-1, relativeStrip)
        endBin = absoluteBin + vector2i(1, relativeStrip)
        startBin = clampToValid(startBin)
        endBin = clampToValid(endBin)
        startHash = binToHash(startBin)
        endHash = binToHash(endBin)
        loop i in range(PBM[startHash], PBM[endHash + 1]):
            handle(neighbourData[i])

Algorithm 9 demonstrates how Algorithms 7 (Strips) and 8 (Proportional Bin Width) can be combined to utilise the benefits of each optimisation simultaneously during the query operation. The nested loop over selectedBin from Algorithm 8 (Proportional Bin Width) has been replaced by the strip loop and validation from Algorithm 7 (Strips).

Algorithm 9 Pseudo-code showing the combined optimisation applied to the query operation for a two dimensional environment.

    vector2i binMin = getBinPosition(origin - radius)
    vector2i binMax = getBinPosition(origin + radius)
    loop binMin.y <= strip <= binMax.y:
        startBin = vector2i(binMin.x, strip)
        endBin = vector2i(binMax.x, strip)
        startHash = binToHash(startBin)
        endHash = binToHash(endBin)
        loop i in range(PBM[startHash], PBM[endHash + 1]):
            handle(neighbourData[i])
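A device-code sketch of Algorithm 9 is given below; it assumes the getBinPosition() helper and row-major PBM conventions of the earlier sketches, and all names are illustrative rather than the thesis’s implementation.

    // Hedged CUDA sketch of the combined Strips + Proportional Bin Width
    // query (Algorithm 9) in 2D.
    __device__ void frnnQueryCombined(const float2 origin, const float radius,
                                      const unsigned int *d_pbm,
                                      const float2 *d_data,
                                      const float2 envMin, const float2 envMax,
                                      const int gridWidth, const float binWidth) {
        // Bounding bins of the radial search area (the N x M block).
        const float2 lo = make_float2(origin.x - radius, origin.y - radius);
        const float2 hi = make_float2(origin.x + radius, origin.y + radius);
        const int2 binMin = getBinPosition(lo, envMin, envMax, gridWidth, binWidth);
        const int2 binMax = getBinPosition(hi, envMin, envMax, gridWidth, binWidth);
        // Each row of the block is one strip: a contiguous range of items.
        for (int strip = binMin.y; strip <= binMax.y; ++strip) {
            const unsigned int start = d_pbm[strip * gridWidth + binMin.x];
            const unsigned int end   = d_pbm[strip * gridWidth + binMax.x + 1];
            for (unsigned int i = start; i < end; ++i) {
                const float dx = d_data[i].x - origin.x;
                const float dy = d_data[i].y - origin.y;
                if (dx * dx + dy * dy <= radius * radius) {
                    // handle(d_data[i]); items outside the radius are surplus
                }
            }
        }
    }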

6.2 Implementation

To evaluate the impact of the proposed Proportional Bin Width optimisation, it was first applied to the Circles benchmark using the minimal bespoke implementation. This was used in the previous chapter to assess the Strips technique. This provides more controlled experiments, removing any potential for results to be skewed by FLAMEGPU’s abstraction layer. As the Strips optimisation is independent of the implementation of Proportional Bin Width, and is likely to further benefit the optimisation, the tested optimisation combines the Strips and Proportional Bin Width optimisations. After isolated experimentation, the impact of the optimisation is evaluated within the context of FLAMEGPU in order to demonstrate how the optimisations apply to existing multi-agent simulations.

The remainder of this section details the implementation of the minimal bespoke version (available online1), due to the additional considerations required to implement the Proportional Bin Width technique.

6.2.1 Construction Stage

The Proportional Bin Width and combined optimisation both affect construction. The impact is limited to the scale of the environment’s subdivision, subsequently increasing (or decreasing) the memory footprint of the Partition Boundary Matrix (PBM). Figure 2.10 (on page 42) highlights how the PBM consists of a single unsigned integer array, with a length one greater than the number of environment bins (N_bins + 1). Each entry in the array, except for the final entry, identifies the first index of the corresponding bin’s items in the neighbour data array. The final element of the array denotes the total number of items. The layout of the PBM allows the bounds of a bin’s data in the neighbour data array to be identified by accessing both the index of the desired bin and the subsequent index within the PBM.

The construction is implemented according to the earlier algorithm detailed in Section 4.3 on page 68. The only changes necessary for construction under the Combined technique are applied to the utility methods used for mapping continuous environment coordinates to discrete bin coordinates and indices (e.g. Equations 4.2 (2D) or 4.3 (3D)).

1 https://github.com/Robadob/sp-2018-08

6.2.2 FRNN Query Stage

To access the items within a radial search area during a query, the bins which cover that area must be identified and accessed. Whilst the neighbour data from a single bin can be identified with two accesses to the PBM, to perform a query a contiguous block of bins must be accessed. This contiguous block of bins must include all bins which may intersect the search area. The (maximum) number of bins accessed can be calculated using Equation 6.6. When PBW = 1.0, a 3 × 3(×3) block of bins must be accessed, and when PBW = 0.5, a 5 × 5(×5) block of bins must be accessed.

$$bins = \left(1 + 2\left\lceil \frac{1}{PBW} \right\rceil\right)^{Dims} \qquad (6.6)$$
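The two-load bin lookup described above can be expressed directly; a minimal device-side sketch, using the same illustrative names as the earlier examples:

    // Minimal sketch of a PBM lookup: two loads bound one bin's items.
    // d_pbm holds (numBins + 1) entries; the final entry is the item total.
    __device__ void iterateBin(const unsigned int bin,
                               const unsigned int *d_pbm,
                               const float2 *d_data) {
        const unsigned int start = d_pbm[bin];      // first item of this bin
        const unsigned int end   = d_pbm[bin + 1];  // first item of the next bin
        for (unsigned int i = start; i < end; ++i) {
            // handle(d_data[i]);
        }
    }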

Algorithm

The FRNN search’s query algorithm can be described using the pseudo-code for the combined Strips and Proportional Bin Width in Algorithm 9 on page 103, which combines the individual optimisations from Algorithms 7 and 8 (Strips and Proportional Bin Width respectively). Each GPU thread operates independently, concerned with its own unique FRNN query.

In this two dimensional example, the loop is over strip, which represents the Y axis: the rows of the block of bins. In three dimensions this would be a nested loop over the Y and Z axes. The bounds of the strip can then be calculated by taking the x coordinates from binMin and binMax, respectively, and using strip as the y coordinate. This provides the start and end coordinates of the current strip of bins to be accessed. These values must be clamped within a valid range and transformed into 1-dimensional bin indices (using Equations 4.2 or 4.3). This can be applied during the initial calls to getBinPosition() rather than per strip. After this, the calculated bin indices (startHash and endHash) can be used to access the PBM to identify the range of elements within the storage array (neighbourData) that must be accessed.

This algorithm accesses all items within potentially intersecting bins. The locations of items must additionally be compared with the search origin, to only handle those which lie within the radial area of the search’s origin.

To ensure maximal performance of FRNN search queries, it is necessary to limit the kernel’s block size to 64 threads. This maximises hardware utilisation, whilst reducing the number of threads remaining idle where item distributions are not uniform (hence leading to an unbalanced workload). The optimisation is visible in the NVIDIA-distributed CUDA particles example and has implications similar to those regarding divergence from the previous chapter; however, details of this technique have not been found in general literature pertaining to FRNN search.
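For illustration, the 64-thread block size corresponds to a host-side launch configuration along these lines; the kernel name and parameters are hypothetical.

    // Hypothetical launch: 64-thread blocks limit the number of threads left
    // idle where item distributions are non-uniform.
    extern __global__ void frnnQueryKernel(const unsigned int *d_pbm,
                                           const float2 *d_data,
                                           unsigned int actorCount);

    void launchQuery(const unsigned int *d_pbm, const float2 *d_data,
                     const unsigned int actorCount) {
        const unsigned int blockSize = 64;  // one thread per actor's query
        const unsigned int gridSize = (actorCount + blockSize - 1) / blockSize;
        frnnQueryKernel<<<gridSize, blockSize>>>(d_pbm, d_data, actorCount);
    }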

FLAMEGPU’s algorithm for handling bin iteration is implemented in a different way. As explained in the previous chapter, the query state is tracked and updated each time an item is retrieved. This implementation abstracts the behaviour from users, allowing them to access FRNN queries within their model logic using a single while loop. Despite these differences, the actual order of item access remains consistent.

6.3 Experiments

The experiments have been designed to demonstrate four important characteristics of this research:

• Experiment 1: Proportional Bin Width - This experiment aims to evaluate the Proportional Bin Width optimisation (Section 6.1) in combination with the Strips optimisation with respect to the trade-off between the query operation’s performance and proportional bin width.

• Experiment 2: Construction - This experiment aims to evaluate the impact of the presented optimisations on the cost of USP construction relative to performing queries.

• Experiment 3: Population Scaling - This experiment aims to evaluate the impact of the presented optimisations on the linear scaling of query operations with respect to increasing population size.

• Experiment 4: Model Testing - This experiment aims to evaluate the optimisations presented with respect to existing multi-agent simulations within FLAMEGPU.

Within these experiments, construction refers to the construction of the USP data structure, which involves sorting items and generating an index to the start of each bin’s storage (PBM). The query stage refers to the neighbourhood search performed by each actor in parallel, which includes any model specific logic that occurs per neighbour accessed.

The minimal bespoke implementation2 follows the generalised case observed in many frameworks and examples, e.g. FLAMEGPU [8] and the NVIDIA CUDA particles sample [58]3. The targeting of the general case ensures optimisations are more widely applicable, whilst still remaining relevant when applied to real models. Additionally, the use of a bespoke implementation isolates the FRNN search from additional overheads found within applied frameworks and provides greater control over timing. The minimal bespoke implementation uses the Circles model, as described in Section 3.1 on page 53, to evaluate the impact of the optimisations presented in Sections 5.2 and 6.1 on the performance of FRNN search. The Circles model allows parameters that can affect performance scaling, such as population size and population density, to be controlled. The presented optimisations have been evaluated in both two and three dimensions.

Additionally, FLAMEGPU has been modified4 so that the optimisations presented within this chapter can be applied to the Pedestrian, Boids and Keratinocyte models in Experiment 4. These models were introduced earlier in Section 3.3.

2 https://github.com/Robadob/sp-2018-08
3 The CUDA particles sample code is available on installation of the CUDA toolkit.

6.4 Results

Benchmarks of a USP implementation, before and after the application of the optimisations presented in Sections 5.2 and 6.1, were collected using the Circles model presented in Section 3.1. Data in this section was collected using an NVIDIA Titan-X (Pascal) in TCC mode with CUDA 9.1 on Windows 10. In the following results, the two cases, before and after application of the Strips optim- isation, are referred to as ‘Original’ and ‘Strips’, respectively. Trailing numbers such as 0.50 and 1.00 refer to the Proportional Bin Width, such that ‘Original 1.00’ refers to the Original technique before changes. Similarly ‘Radial Neighbourhood Size’ has been used as a proxy for density, as it refers to the average number of items returned by an FRNN query. The following sections correspond to the experiments outlined in Section 6.3.

6.4.1 Experiment 1: Proportional Bin Width

To assess the impact of the combined Strips and Proportional Bin Width optimisation on the performance of the query operation, the Circles model was executed with a range of Proportional Bin Widths (1.0, 0.7, 2/3, 0.5, 0.4). The Proportional Bin Width 1.0 is equivalent to the plain Strips optimisation from the previous chapter. The others were selected based on an extended version of Table 6.1, as their properties had the lowest combined maximum strip count and surplus multipliers.

In 2D (Figure 6.4a), the most successful Proportional Bin Width found is 0.5. By reducing the surplus to 1.99x (equation and details available in Section 6.1), and only increasing the number of strips by one, it is able to improve on the performance of the original Strips implementation’s query operation. This improvement provides a roughly stable 15% speed up over the original Strips implementation throughout all densities tested.

The remaining three Proportional Bin Widths tested, 0.7, 2/3 and 0.4, have worse performance than the original Strips implementation (Proportional Bin Width of 1.0). As expected, whilst a Proportional Bin Width of 0.7 should have an average surplus of 1.54x (relative to the Original’s 2.86x), its performance can be attributed to its maximum surplus of 2.50x coupled with the additional strip, due to the divergence caused by alternate query origins having varying numbers of bins. The Proportional Bin Width of 2/3 has an average surplus of 2.26x with no additional bin changes. Its lack of performance suggests that floating point precision limitations may have removed the primary benefit of 2/3 being a factor of 2 (thus adding redundant bin changes). The Proportional Bin Width of 0.4 also harmed performance; however, the runtime was much closer to that of the original Strips implementation. 0.4 is calculated to have average and maximum surpluses of 1.54x and 1.83x, respectively. However, this comes at the cost of 5 strips rather than 3, such that the cost of additional bin changes and the impact on divergence harm the runtime.

Figure 6.4: PBW, Strips, Scaling Density: The performance of the query operation during execution of the Circles model with 1 million actors, with the Strips optimisation applied in combination with a selected range of Proportional Bin Widths, in (a) 2D and (b) 3D.

In 3D (Figure 6.4b), as seen in 2D, the Proportional Bin Width 0.5 produces the best result, achieving a 27% speed up over the Original technique at the highest density.

Figure 6.5 takes the most promising Proportional Bin Width from Figure 6.4 and applies it to the Original and Strips techniques. When compared with the Original technique, the combined Strips optimised technique with a Proportional Bin Width of 0.5 allows the query operation to execute 27% faster at the highest density in 2D. This increases to 34% in 3D. Similarly, the graph demonstrates how the combination of Strips and a Proportional Bin Width of 0.5 exceeds the improvement of both optimisations in isolation.

In conclusion, the Proportional Bin Width of 0.5 was among the most promising based on calculations from the earlier Section 6.1, reducing the surplus message accesses by 30% with 2 additional bin changes under the Strips optimisation in 2D. This produced a total improvement of 1.27x and 1.34x over the Original unoptimised technique in 2D and 3D, respectively.

The outlier to these calculations was the Proportional Bin Width of 2/3. This can be attributed to its tight fit to the minimum Moore neighbourhood (3 × (2/3)R = 2R), with floating point imprecision leading to an extended Moore neighbourhood. This greatly increases its redundant area and adds an additional bin change. Furthermore, due to the understanding of divergence gained from the previous chapter, it is clear this need only occur for a small proportion of threads to affect the performance of many others due to the impact of divergence.

4 https://github.com/Robadob/FLAMEGPU (Changes are present in the branch titled ‘PBW’)

6.4.2 Experiment 2: Construction

To assess the impact of the Proportional Bin Width on construction, this experiment instead explores the ratio of construction time to query time for the previous experiment (Figure 6.5). Figure 6.6 demonstrates the small impact changes to construction time have relative to the query operation’s runtime. Construction accounts for 3% or less of the combined construction and query runtimes, reducing further as density and dimensionality increase. Regardless, comparing times between Figures 6.5 & 6.6 shows that the proportion of construction time is reduced slightly by the Proportional Bin Width of 0.5, despite the earlier speedup to the query operations. This can be attributed to the increased number of bins (4x in 2D, 8x in 3D), which reduces contention. This was shown earlier in Figure 4.2 to benefit the underlying atomic counting sort algorithm.

In conclusion, after application of the Strips and 0.5 Proportional Bin Width optimisations, construction remains an insignificant contributor to the total runtime impact of FRNN search.

Figure 6.5: Original, PBW, Strips, Scaling Density: A comparison of the performance of the query operation for the Original and Strips optimised techniques, with and without the best Proportional Bin Width from the previous experiment applied, during execution of the Circles model with 1 million actors in (a) 2D and (b) 3D.

Figure 6.6: Best PBW, Construction/Query Time: A comparison of the ratio of construction time to query time for the Original unoptimised technique and the combined Strips and 0.5 Proportional Bin Width optimised implementation during execution of the Circles model with 1 million actors in (a) 2D and (b) 3D.

6.4.3 Experiment 3: Population Scaling

To assess the impact of the Proportional Bin Width with regards to population scaling, this experiment investigates how the performance of the Original unoptimised technique and the combined Strips optimised technique with a Proportional Bin Width of 0.50 (‘Strips 0.50’) scales with population size. Figure 6.7 demonstrates how both techniques scale linearly as population size increases (O(n)). As discussed in Section 6.1, the optimisations applied affect the number of bins accessed and the scale of each bin relative to the search radius. These two variables are constant with respect to a scaling population.

The jagged performance seen in Figure 6.7b is a consequence of the environment dimensions, locked to a multiple of bin width (and consequently population density), not scaling as smoothly as the population size. 3,000,000 actors is adequate to fully utilise the computational capacity of the GPU used, such that the linear performance trend can be expected to continue until the GPU’s memory capacity becomes a limiting factor [140]. In 2D the optimised version performs 1.23-1.33x faster. In 3D it is 1.18-1.22x faster.

6.4.4 Experiment 4: Model Testing

Three example models found within FLAMEGPU, introduced in Section 3.3, have been utilised to demonstrate the benefit of the optimisation in practice. This application to differing models aims to highlight how differences between their applications of FRNN search (e.g. actor density, 2D or 3D environment) affect the optimisation.

The results in Table 6.2 show that the combined optimisation with a Proportional Bin Width of 0.5 has allowed the query operation within the Boids model to operate 1.15x faster than the original. When combined with construction, which sees a slight increase, the overall runtime executes 1.15x faster than prior to optimisation.

The Keratinocyte model sees a greater speedup of 1.32x for the query operation’s runtime, reduced to 1.21x for the overall runtime. As the Keratinocyte model utilises two different densities of FRNN search, this greater speedup can be attributed to the higher density one exceeding the densities found in other models, which, as was shown in earlier experiments, is more favourable for the Proportional Bin Width optimisation. The Keratinocyte model does, however, have several additional layers unaffected by FRNN search, reducing the proportional impact on the overall runtime.

Similarly, the only 2D model tested, Pedestrian, executed its query 1.03x faster than the original. This resulted in an overall speedup of 1.03x (two decimal places) when compared with the original.

Figure 6.7: Best PBW, Scaling Population: Comparisons of the performance of the original and the combined Strips and 0.5 Proportional Bin Width optimised implementations during execution of the Circles model across a range of actor population sizes in (a) 2D with radial neighbourhood volumes of ∼60 and (b) 3D with radial neighbourhood volumes of ∼100.

                               Query Time (s)               Construction Time (s)    Total Execution Time (s)
    Model         Actors       Original  Optimised Speedup  Orig.  Opt.  Speedup     Orig.    Opt.    Speedup
    Boids         262144       746.52    646.56    1.15x    5.10   7.02  0.73x       752.04   654.00  1.15x
    Pedestrian    16384        1.01      0.98      1.03x    0.65   0.66  0.98x       2.25     2.18    1.03x
    Keratinocyte  25000-41789  5.80      4.38      1.32x    0.94   1.87  0.50x       8.14     6.70    1.21x

Table 6.2: The runtime of 3 models within FLAMEGPU before and after the combined optimisation has been applied with a Proportional Bin Width of 0.5, when executed for 10,000 iterations.

The Pedestrian model has the lowest population size of the three tested, and in combination with a low density of actors there will be relatively lower levels of memory traffic. This limits the necessity and impact of the optimisation, as memory access bottlenecks are less restrictive.

6.5 Conclusions

This chapter has proposed an optimisation called Proportional Bin Width, which reduces the number of redundant items accessed during the FRNN search’s query algorithm. This optimisation has been demonstrated in combination with the Strips optimisation to highlight the most promising bin width to search radius ratio and the subsequent improved performance.

The theoretical analysis and experimental testing carried out found the best ratio between search radius and environment subdivision bin width to be 0.5, which reduces redundant results from the FRNN query by 47% over the Original technique. Combined with the Strips optimisation, the number of strips accessed in 2D increases from 3 to 5; however, the optimisation still consistently outperforms the individual optimisations and the original technique in all but the lowest densities tested.

The direct impact of this Proportional Bin Width is that the quantity of redundant area accessed reduces by 47% in 2D (from 1.86x to 0.99x) and 50% in 3D (from 5.44x to 2.73x). In combination with the Strips optimisation this has provided a peak speedup to the query operation under the Circles model of 1.27x in 2D and 1.34x in 3D. A further impact is that the number of bins required to represent an environment of equal size increases by 4x in 2D and 8x in 3D, affecting both the memory required and the time to construct the data structure’s PBM. However, experiments have confirmed that despite these changes the runtime impact of construction remains negligible in the majority of cases.

When assessing the impact of the combined optimisation against applied models, a more complex pattern of results was found. MAS utilise FRNN search in a wide variety of ways, with differing fidelity, actor distributions and dimensionality on top of each model’s unique logic. These variables all impact the performance of FRNN search, often as a secondary consequence of the underlying GPU architecture. The results of the applied model testing demonstrated this with a peak improvement to the FRNN query of 1.32x. However, due to additional model logic, independent of the optimisation, the impact on overall runtime in this best case is reduced to a 1.21x speed up.

Due to the trade-off between surplus item reads and total bins accessed, it is imperative that the Strips optimisation is used alongside the Proportional Bin Width optimisation. Application of the Strips optimisation reduces the total bin accesses to $\sqrt{n}$ in 2D, and $\sqrt[3]{n^2}$ in 3D, where n is the number of bins covering the search area. This has a significant impact under the Proportional Bin Width optimisation. For example, in the case of a 0.5 Proportional Bin Width, the 2D Moore neighbourhood search grid expands from 3 × 3 to 5 × 5.

The theoretical analysis carried out highlighted the trade-off between two performance impacting properties for different values of Proportional Bin Width, redundant area accessed during query and total bins accessed, and allowed the scope of Proportional Bin Widths to be reduced. However, the complex relationship between this trade-off and performance, due to the secondary impacts on divergence and cache utilisation, meant that experimental testing was required to further reduce the list of feasible options and identify the most viable.

In summary, the combined Strips and Proportional Bin Width optimisations have been shown to provide a performance improvement which increases with density and the computational cost of handling each item surveyed. As such, the performance improvement seen is likely to vary significantly between models. However, across the models tested, improvements as high as 1.21x were observed to the overall runtime of a model. As such, this chapter has provided contribution C3.

Chapter 7

Case Studies & Discussion

The previous chapters of this thesis have introduced FRNN search as used by complex system simulations and proposed three optimisations for improving the performance of the construction and query stages. This chapter combines these optimisations to demonstrate their impact within the context of a number of large-scale complex system simulation case studies. The case studies are examples of transport and crowd simulation which are required to have high performance.

This chapter provides contribution C4 via the specification and demonstration of properties which classify the potential for a complex system simulation to benefit from the optimisations proposed within this thesis. These specifications are applied to two case studies as a demonstration, allowing an understanding of the wider implications, including how best to utilise newly available execution time, and the justification for applying these optimisations to other viable applications.

7.1 Overview

The previous chapters of this thesis have identified three novel optimisations which can be applied to FRNN search; a sketch of how the first two might look in code follows the list:

• Atomic counting sort for construction of the USP data structure: this uses a novel sorting and construction algorithm to generate part of the data structure as a by-product of the sort.

• Strips, for the query operation: this removes redundant operations, providing benefits via both code removal and improved WEE.

• Proportional Bin Width, for the query operation: this adjusts the resolution of the environment’s subdivision, which has the effect of reducing the volume of redundant messages accessed when mapping the radial search area to a set of bins.
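
To make the first two concrete, the following CUDA sketch outlines one plausible shape for the construction and query kernels. All names, the float2 message layout, the square grid, and the PBM representation (an exclusive-scan offset array with one trailing total entry) are assumptions made for illustration; this is not FLAMEGPU's actual implementation.

    #include <cuda_runtime.h>

    // ACS construction, counting pass (sketch): each thread bins one message.
    // atomicAdd returns the previous per-bin count, a stable index within the
    // bin, so sorted order falls out of construction as a by-product. An
    // exclusive scan over binCount then yields the PBM's bin start offsets,
    // and binSubIdx plus the PBM gives each message's final sorted position.
    __global__ void acsHistogram(const float2 *pos, unsigned int *binCount,
                                 unsigned int *binSubIdx, int n,
                                 float binWidth, int gridWidth) {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        const int gx = (int)(pos[i].x / binWidth);
        const int gy = (int)(pos[i].y / binWidth);
        binSubIdx[i] = atomicAdd(&binCount[gy * gridWidth + gx], 1u);
    }

    // Strips query (sketch): the bins of each Moore neighbourhood row are
    // contiguous in memory, so each row is read as a single [start, end)
    // range: 3 PBM lookups instead of 9 in 2D, with fewer divergent bin
    // changes at the boundaries.
    __global__ void stripsQuery(const float2 *sorted, const unsigned int *pbm,
                                int n, float binWidth, int gridWidth,
                                float radius) {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        const float2 me = sorted[i];
        const int gx = (int)(me.x / binWidth);
        const int gy = (int)(me.y / binWidth);
        for (int dy = -1; dy <= 1; ++dy) {
            const int y = gy + dy;
            if (y < 0 || y >= gridWidth) continue;       // clamp rows to grid
            const int x0 = max(gx - 1, 0);               // clamp strip ends
            const int x1 = min(gx + 1, gridWidth - 1);
            const unsigned int start = pbm[y * gridWidth + x0];
            const unsigned int end = pbm[y * gridWidth + x1 + 1];
            for (unsigned int j = start; j < end; ++j) { // one strip of bins
                const float dx = sorted[j].x - me.x;
                const float dyv = sorted[j].y - me.y;
                if (j != (unsigned int)i &&
                    dx * dx + dyv * dyv <= radius * radius) {
                    // per-neighbour model logic would be applied here
                }
            }
        }
    }

Merging each row of the neighbourhood into a single [start, end) range removes the per-bin loop bookkeeping, which the earlier results identified as a significant source of divergence.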

These optimisations, in combination, have been demonstrated to provide speedups as significant as 3x and 1.34x to the construction and query operations of FRNN search, respectively. However, the results in Chapter 6 showed that, once applied in practice to a full simulation model, the impact of this speedup on the simulation's overall runtime is reduced. This occurs as a consequence of underlying, model-specific logic, both independent of and dependent on the FRNN search algorithm. Additionally, these results demonstrated that the specific actor population's distribution and properties found within a given model have a significant impact on how well the optimisations perform.

The following section outlines model design criteria which can be applied to decide whether the optimisations presented can be expected to deliver an improvement, and potentially the significance of this improvement.

The remainder of the chapter then presents two case studies exploring sample optimisation candidates from the field of pedestrian modelling. Despite both representing forms of transport and pedestrian modelling which rely on FRNN search, they have vastly different properties of relevance to optimisation suitability. For each case study, the optimisation suitability criteria are discussed, followed by a demonstration of the outcome of applying the optimisations, alongside analysis.

The chapter is then concluded by high-level recommendations regarding the optimisations and how they can be applied to real-world applications, based on what has been discussed in the chapter.

By performing the case studies, the combined impact of the optimisations presented within this thesis can be viewed when applied to real-world models. The overall value of the case studies is an understanding of the wider implications, including how best to utilise newly available execution time, and justification for applying these optimisations to other viable applications.

7.2 Optimisation Suitability Criteria

The results from Chapters 4, 5 & 6 have shown that instances of high population, high density and high entropy create the conditions for the greatest bottlenecks to the performance of FRNN search, and consequently present the most desirable areas for optimisation.

The optimisations presented within this thesis primarily address performance bottlenecks, such that they are unlikely to show improvements in applications which do not utilise the GPU heavily enough to reach the targeted bottlenecks.

Population Size

The primary GPU used throughout this thesis has been the Titan X (Pascal). This high-end model (circa 2016) has 28 SMs containing a total of 3584 fp32 units. Therefore a minimum of 1792-3584 threads is required to fully saturate the device's compute capacity.

In practice, workloads are rarely purely compute dependent, so this number can be expected to be several times higher. It is for this reason that results within this thesis have used a minimum population of 10,000. These properties, however, differ between devices, such that smaller populations are more appropriate for lesser hardware. Similarly, the current trajectory of GPU development is likely to see the minimum appropriate population size increase. The most recent GPU models to be released, Titan V and Titan RTX, have 5120 and 4608 fp32 units respectively, with the latter also featuring 576 new tensor cores. At the time of writing, tensor cores are not yet available outside of ray tracing and deep learning applications.

However, lower populations may still benefit from the optimisations due to the reductions in memory accesses and operations. Additionally, in cases of reduced GPU performance or availability, such as lower-end GPUs or the partitioned GPUs present in some recent HPC systems, the available capacity, and hence the requirements to maximise utilisation, can be much lower.
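
As a rough, illustrative check of this criterion (not a tool used to produce the thesis's results), the relevant device properties can be queried at runtime via the CUDA runtime API:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // The criterion scales with the number of fp32 units; CUDA does not
        // report these directly, but the SM count gives the order of
        // magnitude (e.g. 28 SMs x 128 fp32 units on the Titan X (Pascal)).
        printf("%s: %d SMs, %d max resident threads per SM\n",
               prop.name, prop.multiProcessorCount,
               prop.maxThreadsPerMultiProcessor);
        return 0;
    }

The multiplier from compute saturation to a practical minimum population (around 3x here, giving the 10,000 floor) remains a judgement call rather than a queryable property.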

Density

Calculating the minimum memory access volumes required to fully saturate the hardware cannot be performed as cleanly as identifying compute bottlenecks. This is due both to the differing properties of the many caches on GPUs and to how the order, or lack thereof, within memory accesses can affect performance. Furthermore, the relevant device properties, such as memory bus width and cache sizes, have continued to increase through ongoing developments in GPU hardware, such that recommendations will vary to some degree between both GPU architectures and models.

Results from Chapters 5 & 6 have shown performance improvements for cases of 30 neighbours and higher. As density decreases below this point, the cost of bin changes grows relative to that of message accesses, as each bin holds only a small quantity of messages. Therefore, at the lowest densities there is a case for using the Strips optimisation without Proportional Bin Width, due to the increase in bin accesses the latter causes.

Entropy

In this context, entropy is used to describe the rate of change, or more precisely the degree to which actors become out of order during a single time step. This was mostly highlighted by the ACS optimisation to construction of the data structure, whereby the cost of reordering messages was greatest for randomly ordered messages.

Hence, the ACS construction optimisation is most suitable for populations with incremental movement, such that order is largely preserved. The vast majority of MAS involve incremental movement, so this is unlikely to be a significant consideration.

One potential edge case would be any model whereby most actors move fast enough to switch bins each timestep. However, as shown in the results of unsorted message construction (Chapter 4), this would only negatively affect application of the ACS optimisation in environments with around 700,000 bins, or potentially even more, as messages will still be loosely ordered.

Distribution

Due to the difficulty in classifying the differences in the distributions of agents between models, and within different applications of models, there has not been a significant focus on distribution within this thesis. It is, however, understood that the SIMT architecture of GPUs favours uniformity, such that divergence between threads of the same warp requires the instructions of the divergent branches to be executed sequentially. Distribution does not map cleanly to divergence; however, in the context of FRNN search, unequal bin volumes have a noticeable impact on performance. Most significantly, a strictly uniform distribution can see runtimes as much as 2x faster. Strictly uniform distributions are unlikely to utilise FRNN search, though, as more appropriate techniques exist for discretely ordered actors, so it is unlikely that distribution will significantly impact use of the presented optimisations.

Furthermore, we have seen that limiting block size to 64 threads, as shown by Green [58], can improve performance by reducing the impact of divergent threads. In the case of FRNN search, this is primarily a consequence of bin membership not cleanly aligning with thread-block membership, such that all threads in a block must wait for the thread with the largest neighbourhood to finish its search.
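
A minimal sketch of this launch configuration, reusing the hypothetical stripsQuery kernel sketched in Section 7.1 (all variables assumed to be prepared beforehand):

    // Launch the FRNN query in 64-thread blocks, per Green [58], so that a
    // block stalls on at most a few oversized neighbourhoods.
    const int blockSize = 64;   // rather than a more typical 128 or 256
    const int gridSize = (n + blockSize - 1) / blockSize;
    stripsQuery<<<gridSize, blockSize>>>(d_sorted, d_pbm, n, binWidth,
                                         gridWidth, radius);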

Summary

The above optimisation suitability criteria (Population Size, Density, Entropy & Distribution) provide a clear structure by which to evaluate whether a candidate MAS for optimisation is likely to benefit from the work within this thesis. MAS with large agent populations, high density neighbourhoods and non-uniform distributions all approach bottlenecks of the GPU hardware, such that the presented optimisations can aid performance. MAS which do not fit any of these criteria are unlikely to have performance that is constrained, and hence are unlikely to justify use of these targeted optimisations.

7.3 Case Studies

The following two case studies examine two simulators which contain microscopic pedestrian simulations utilising FRNN search. Despite model similarities, their alignment with the optimisation suitability criteria proposed in this chapter differs significantly. Each is first presented with regard to the optimisation suitability criteria, after which the results of applying the optimisations are demonstrated.

118 Figure 7.1: The visualisation interfaces of the road and pedestrian models within the multi-modal transport simulation. When vehicles enter the car park (bottom-right), pedestrians are able to move between the vehicle and the pedestrian simulation (top-left).

The applied optimisations do not impact accuracy. Due to the stochastic nature of the underlying parallel algorithms (e.g. atomic operations), compounded over thousands of iterations, no two runs are likely to produce identical results at a per-agent level. Nonetheless, as a whole, outcomes will remain statistically equivalent.

To evaluate the impact of the optimisations, their normal operating conditions have been used. Each of the models was first executed for an internal time period of 10 minutes. This adequately 'warms up' the models, allowing actors to fully distribute throughout the system and approach a steady state. After the models had been warmed up, 10,000 timesteps were executed and timed for both the original and optimised implementations, inclusive of finer-grained timing of the elements directly impacted by FRNN search. Both models were executed on Windows 10 x64, using an Nvidia Titan X (Pascal) GPU.
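
A minimal sketch of this measurement loop using CUDA events is shown below; stepSimulation() stands in for one full model timestep (construction, query and model logic) and is not a function from either model.

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float totalMs = 0.0f;
    for (int step = 0; step < 10000; ++step) {
        cudaEventRecord(start);
        stepSimulation();               // placeholder: one model timestep
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);     // wait for this step to complete
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        totalMs += ms;
    }
    printf("Mean iteration time: %.3f ms\n", totalMs / 10000.0f);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);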

Case Study 1: Multi-modal Transport Simulation

This simulation is the result of an industrial project between Siemens Rail Automation and The University of Sheffield. The goal of this industrial project was to develop a prototype of a traveller-focused multi-modal transport model, combining pedestrian, road and rail simulations. The project sought to appropriately model known 'pinch points', such as an exit from a station onto a pavement and subsequently a road network, and passenger movement between platform and train.

By using high-performance, microscopic simulation techniques, the combined model has the potential to support a wide range of applications in the fields of transport infrastructure and management.

The project consists of three independent simulation models, road, rail and pedestrian (Figure 7.1 shows the road and pedestrian interfaces), which communicate over the High Level Architecture (HLA) protocol to send and receive updates when actors reach transition points. For example, when a train nears a station, the rail model sends continuous location updates, which allows the train to move into the pedestrian model in synchronisation with the rail model and eventually hand control of the train over to the station. Similarly, when passengers have finished boarding, the pedestrian model sends a notification returning control back to the rail model, allowing the train to depart the station.

The pedestrian model within the multi-modal transport model is the target of the optimisations in this thesis. A central station with four platforms and two entrances across two floors is modelled via a modified FLAMEGPU. As the pedestrian simulation operates using FLAMEGPU, application of the optimisations can be completed similarly to earlier tests within this thesis, whereby FLAMEGPU was modified to support them.

The example scenario used for this case study combines 15 trains travelling between 9 stations arranged in a crucifix formation (Figure 7.2), with a pedestrian simulation of the central interchange station. Similarly, passengers can arrive and depart the central station via the model of the road network, representative of the central roads of a city with a population of 300,000 people.

Figure 7.2: The crucifix arrangement of the rail network, with trains represented as green dots.

The multi-modal transport simulation has been implemented to execute in real-time to facilitate a live visualisation. The rail and road models execute with a fidelity of 1 step per second. The pedestrian model requires a higher fidelity for both the model and a smooth visualisation, so it operates at 30 steps per second. Therefore each step of the pedestrian simulation must be completed in less than 33ms.

The multi-modal transport model in regular operation has around 300 concurrent pedestrians active within the modelled train station. At peak demand this value rises to slightly over 1000. As such, the volume of actors within this model is insufficient to create the bottlenecks targeted by the optimisations presented in this thesis, despite small areas of high density within a much larger environment.

Time            Construction   Query   Iteration
Original (ms)   2.66           3.04    11.28
Optimised (ms)  3.16           5.53    13.95
Change (ms)     +0.50          +2.49   +2.67
Speedup         0.84x          0.55x   0.81x

Table 7.1: Results from applying the thesis's optimisations to the Multi-modal Transport Simulation. The model's low device utilisation leads to the cost of the optimisations outweighing improvements.

Optimisation Impact

Population size was measured at the start and end of the timing run. The initial and final population sizes were 244 and 282 respectively; the population fluctuates as passengers enter and leave the train station. This population is well below the figure required for appropriate device utilisation.

Table 7.1 shows the results of applying the optimisations. As expected, the small population makes this model an unsuitable candidate for the optimisations. Runtimes increased for all directly affected functions; without any bottlenecks to address, the additional code merely increases the runtime cost. The construction cost is close to that of the FRNN search, which is indicative of a very low density model unsuitable for the optimisations presented in this thesis.

Additionally, the runtimes of other, unaffected components of the model also changed; for example, the global navigation function's runtime decreased from 73.4ms to 45.0ms. As the model was shown to execute correctly after the optimisations were applied, this suggests that, at such low levels of device utilisation, minor changes can have a large impact on relative performance. One possibility is that the increased grid resolution introduced by the Proportional Bin Width optimisation has a larger relative impact on how the caches are utilised than the same change would have in a bottlenecked simulation. A consequence of the low device utilisation, however, is that the pedestrian population could likely be increased 100-fold or greater without significantly impacting the overall performance of the model, dependent on how memory bound it is shown to be.

When passenger influx is increased to achieve a population averaging 1000 (peak demand), average iteration time increases to 20.8ms and 20.4ms for the Original and Optimised versions, respectively. In this case, the Optimised version provides a slight improvement (1.02x). This is evidence that device utilisation is the limiting factor, and that higher populations or less performant GPU hardware can be expected to increase the efficacy of the optimisations for this particular case study. However, this model is incapable of extending to greater population sizes: the density of passengers within the confined environment leads to their behaviour breaking down, rendering any further results invalid.

Figure 7.3: A visualisation of the Fruin LOS modelled by the large event crowd simulation.

Case Study 2: Large Event Crowd Simulation

This project was developed to model the impact of crowd flow around the stadium of a large sporting event, and to visualise the Fruin LOS [52] measure to identify zones which could benefit from alteration (visualisation shown in Figure 7.3).

Fruin LOS is a qualitative measure used to assess congestion from crowd density, assigning six coloured grades from the best case 'LOS A' (blue), at <=0.08 pedestrians per m², to 'LOS F' (red), at >=1.66 pedestrians per m². Different density scales exist for cases such as queuing and stairs. The implication is that LOS can be mapped to an assumed flow rate, which can be improved by removing bottlenecks and increasing capacity.

The project consists of an independent pedestrian simulation and visualisation of the urban environment surrounding the event venue. The model was developed using FLAMEGPU. As with the multi-modal transport simulation, the techniques from earlier in the thesis can be utilised to apply the optimisations to the model.

Whilst the large event crowd simulation can be executed in real-time with a demonstrative visualisation (Figure 7.3), it can also be executed in a headless batch mode to facilitate testing of Fruin LOS in response to changes to the environment.

In regular operation the large event crowd model simulates as many as 100,000 pedestrians. The model is intended for highlighting impacts to Fruin LOS, hence peak densities are often reached. As such, the volume and density of actors within this model satisfy two of the identified properties of an appropriate optimisation candidate.
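
For illustration, the grading described above might be implemented as follows. Only the LOS A and LOS F boundaries are taken from the text; the interior thresholds for grades B-E are hypothetical placeholders, not Fruin's published values.

    // Map a crowd density to a Fruin LOS grade for visualisation.
    __host__ __device__ inline char fruinLosGrade(float pedestriansPerM2) {
        if (pedestriansPerM2 >= 1.66f) return 'F';   // worst grade (red)
        // Upper bounds for grades A-E; the B-E values are placeholders.
        const float upperBound[5] = { 0.08f, 0.43f, 0.72f, 1.08f, 1.66f };
        for (int g = 0; g < 5; ++g)
            if (pedestriansPerM2 <= upperBound[g])
                return (char)('A' + g);
        return 'E';   // unreachable: densities below 1.66 match the loop
    }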

Average Time    Construction   Query   Full Simulation Step
Original (ms)   0.49           1.30    4.78
Optimised (ms)  0.37           1.01    4.31
Change (ms)     -0.12          -0.29   -0.47
Speedup         1.32x          1.28x   1.11x

Table 7.2: Results from applying the optimisations from this thesis to the Large Event Crowd Simulation.

Optimisation Impact

Population size and neighbourhood size were measured at the start and end of the timing run. The initial population size was 62,210 ± 30, which rose to a final population of 86,739 ± 42. By reversing the Fruin LOS calculation, the average neighbourhood size was calculated. Initially, neighbourhoods consisted of 16.4 neighbours with a standard deviation of 9.0; by the end this had risen to 21.4 neighbours with a standard deviation of 11.4. The original and optimised simulations had the same neighbourhood properties, to 1 decimal place, at the points tested.
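
One plausible form of this reversal (an assumed reading, as the exact calculation is not restated here) is that, for a locally uniform density ρ and search radius r, the expected 2D neighbourhood size is the density multiplied by the search area:

\[
E[n] \approx \rho \pi r^2 \quad\Longrightarrow\quad \rho \approx \frac{E[n]}{\pi r^2}
\]

so mean neighbourhood sizes can be recovered from the same per-area densities that drive the LOS grading.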

Table 7.2 shows the results of applying the optimisations. Most visible is an overall speedup to each step of 1.11x. This is comprised of a 1.32x speedup to construction and a 1.28x speedup to the FRNN search, showing that the ACS optimisation to construction was able to more than overcome any additional overhead introduced by the Proportional Bin Width optimisation.

The runtimes of the other components of the model remained fairly stable, as would be expected; for example, the stage which combined movement forces executed in 0.17ms for both models, and the navigation structure export executed in 0.20ms and 0.19ms, respectively.

These results confirm that the Large Event Crowd Simulation is an appropriate candidate for the optimisations presented within this thesis, due to its agent population and neighbourhood density. Results within this thesis have assessed populations and densities to higher levels than might be seen in biological or physics models, therefore it can be assumed that models with larger populations and densities would likely see the relative improvement increase. Additionally, models which have larger spatial items returned by FRNN queries are likely to respond in a similar way to increased density, due to both increasing memory accesses and cache utilisation.

In total, the FRNN search functionality takes up 45% of the model's runtime within the original, unoptimised version. The next most expensive operation is the global navigation, which costs around 34% of the total runtime; whilst simple, this model-specific behaviour has high levels of branching, leading to its cost. However, this clarifies why the impact to overall model runtime (1.11x) is lower than that of the improvement to the individual FRNN search components (1.28x).

As this model is intended for batch execution, the impact of the 1.11x improvement to runtime enables a greater throughput of simulations for any given period of time. As the model is already lightweight, the requirement for its intended visual framerate of 25FPS (40ms per frame) was already achievable prior to optimisation. Even VR, requiring a higher framerate of 90FPS (allowing a shorter 11ms per frame), is within its scope of operation.

7.4 Conclusion

This chapter has outlined optimisation suitability criteria and demonstrated their efficacy by presenting two contrasting applications containing pedestrian simulations, discussing their applicability to our three optimisations (reviewed in Section 7.1) and demonstrating how their performance responds after the optimisations have been applied, therefore providing contribution C4.

The Large Event Crowd Simulation case study demonstrates a positive result, which shows how the optimisations are able to improve performance by 1.11x. Results throughout the thesis suggest that this could be higher or lower, dependent on the exact specifics of the candidate application, its FRNN search and how significant FRNN search is among the performance-critical components.

By delivering a 1.11x speedup to the Large Event Crowd Simulation, a greater throughput of pedestrian simulations can be completed. This could be used to allow for greater environment or model complexity, or simply to facilitate faster evolution within a genetic algorithm which uses the Fruin LOS as a fitness measure in optimising the design of the simulated environment.

Additionally, the Multi-modal Transport Simulation case has shown how the advanced optimisations presented within this thesis address performance bottlenecks, and as such are unsuitable for application to models which underutilise the GPU in use. This demonstrates the harmful effects of premature optimisation, both in that it can have an unintended effect on performance and that it often additionally reduces the readability and maintainability of algorithms.

The use of two simulations from the same background of pedestrian modelling has shown how, even within a small field, various factors within the application of a model can impact the suitability of, and need for, optimisation of FRNN search. Within pedestrian modelling the environment has a huge impact on the distribution of actors, and similarly the context of the simulation will affect the population's overall size and density. These properties have all been shown, through results and the optimisation suitability criteria, to affect how beneficial the optimisations are to a candidate application.

FRNN search is used across a wide range of MAS domains. Visual models may have actor populations in the low hundreds, whereas models using cellular- and particle-scale actors will often run into the millions. Areas of influence (neighbourhood density) also vary significantly. Uniformly distributed actors may allow a lower effective density whilst all actors still survey their nearest neighbours. However, models mixing sparse and densely populated areas, such as transport models mixing urban and rural areas, may suffer from a need to capture actors over a wider distance to allow all actors visibility of their closest neighbours. Similarly, entropy can also impact the applicability of a candidate application, although most applications relying on FRNN search represent actors with incremental movement relative to the scale of their environment.

Whilst it is not possible to predict the exact impact of applying the optimisations from this thesis, due to the many dependent properties, the provided optimisation suitability criteria highlight the four factors (Population Size, Density, Entropy, Distribution) which act towards creating bottlenecks in the handling of FRNN search. At a high level, these are suggestive of the absolute significance of FRNN search.

Chapter 8

Conclusions

The opening chapter of this thesis asks the question: What techniques can be used for improving the performance of FRNN searches during complex system simulations on GPUs? After carrying out the work documented within this thesis, this can now be answered.

It was first demonstrated in Chapter 2 that a common technique is used among complex system simulations to provide FRNN search, and that a general case for the operation of FRNN search within complex system simulations could be defined: across a range of fields utilising FRNN search within complex system simulations, a number of properties remain within a common range. The experiments carried out have demonstrated, for this general case [167, 35], that the performance of FRNN search can be improved with a number of techniques. Furthermore, these techniques have been applied to real-world models within FLAMEGPU to demonstrate the scope of impact when assessed with regard to the overall runtime of a model.

The investigation carried out into the construction of the USP data structure showed that existing researchers had addressed the sorting of the elements. These attempts had limited success, only providing improvements for very small populations [149], or reducing the query operation to an approximate method [85]. An understanding of the recent developments in GPU hardware directed explorations towards atomic counting sort. By utilising this sorting algorithm, a more performant construction of the USP data structure can be delivered. The experiments carried out showed that, in all instances present within the general case, atomic counting sort was able to speed up construction of the data structure, with peak improvements of 2.5x, hence providing contribution C1.

The most effective technique for optimising FRNN queries discovered within the existing literature is that of restricting the block size. This technique, demonstrated by Green's seminal introduction to particle simulation on GPUs [58], reduces the impact of block-level divergence within the FRNN query algorithm.

The knowledge of the high rates of divergence within the FRNN query operation led to the Strips optimisation, providing contribution C2. This optimisation combines accesses to bins which share boundaries in memory. This has the effect of reducing the number of times the bin being accessed has to change. This removal of operations results in a 35% reduction in warp divergence, leading to a peak speedup of 1.2x to the FRNN query. Experiments suggested that the improvement was closely linked to the number of bin changes removed, rather than the size of bins or the population size. This demonstrates the significant cost of warp divergence to GPU algorithms.

Many earlier techniques attempt to improve the coherence and spatial locality of memory accesses [56, 141, 140]. However, this work, carried out on the earliest GPGPU-capable GPU architectures such as Fermi and Kepler, has been superseded in modern architectures, such as Pascal and Volta, by improvements to compiler optimisations and caching strategy. This benefits all GPGPU developers, reducing the attention that need be paid to memory access coherence when reading data.

To provide contribution C3, another novel technique to improve the performance of FRNN queries was identified, reducing redundant memory accesses. This technique considered the cost of surplus message reads due to the difference between the radial search area and that of the Moore neighbourhood search grid. It was found that changing to a proportional bin width of 0.5 then requires a 5x5 Moore neighbourhood search grid. In combination with the earlier Strips optimisation, reducing the bin accesses from 25 to 5, a peak speedup of 1.32x was obtained.

These three optimisations are novel and have been demonstrated to provide benefit throughout the properties of the general case of FRNN search that was identified. Therefore, they can benefit a wide range of complex system simulations by providing improved performance. Faster execution of simulations facilitates larger and more complex models. Furthermore, in some cases, faster performance may allow models to be utilised in new ways, e.g. operating at speeds appropriate for real-time visualisation.

The impact of the optimisations on full complex system models was demonstrated using FLAMEGPU, providing contribution C4. These models have both demonstrated the high cost of FRNN search relative to model-specific logic and the impact this has on reducing the proportional speedup to total runtime. These complex system models have also provided further evidence of the properties of FRNN search that were identified as the general case.

Additionally, the research has surfaced and condensed knowledge relating to other techniques for handling spatial data. Chapter 2's review of literature provides a condensed overview of approaches used in the handling of spatial data. In particular, it highlights octrees as the data structure of preference for the similar k-nearest neighbours algorithm, and the performance benefits present if instead performing approximate methods of FRNN search. With regards to the non-approximate version of FRNN search, the focus of this thesis, the many names used to refer to the technique of FRNN search were identified. Furthermore, the shortcomings of earlier approaches when applied to modern GPU hardware were highlighted.

The nature of the proposed optimisations, in particular how they target divergence and redundant memory accesses, in combination with increasing demand for model scale and complexity, suggests they will remain relevant to future GPU architectures. The only limiting factor for applicability is that of high device utilisation, which is unlikely to disappear given the signs that Moore's law is beginning to stall. Furthermore, there is often a significant lag time (years) between the release of new GPU architectures and their wide availability via HPC systems, due to the cost of upgrading infrastructure.

In summary, three optimisations benefiting FRNN search have been proposed and demonstrated: Atomic Counting Sort for construction of the USP data structure, and Strips and Proportional Bin Width for FRNN queries. These optimisations respectively take advantage of modern improvements to GPU hardware, reduce divergent code and limit redundant memory accesses, addressing contributions C1, C2 & C3. They have been demonstrated to improve the performance of FRNN search under both benchmark and full complex system models, which addresses contribution C4. In combination, these optimisations are able to improve the performance of construction by as much as 2x, and FRNN queries by as much as 1.32x. Therefore, the answer to the question asked at the start of this thesis is that the performance of FRNN search can be improved through acknowledging advances in GPU technology and addressing divergence and redundancy.

8.1 Future Work

This thesis has delivered results using the recent Pascal and Volta GPU architectures. It has exposed differences in these modern GPU architectures not present in earlier Kepler GPUs. Therefore, there remains an open question of how future innovations to GPU hardware, particularly in the domain of task-specific architectures, will impact the performance of FRNN search and complex system simulations. Currently, task-specific architecture targets neural networks (matrix multiplication) and ray tracing (octree traversal). CUDA availability of this ray-tracing-specific hardware may have large implications for the performance of FRNN search, as octrees have already been demonstrated as highly performant for the similar task of k-nearest neighbours. Furthermore, future architectures are likely to be specialised for as-yet unknown tasks, which could be of benefit.

Similarly, results have been confined to FRNN search on a single GPU. At this early stage, most complex system simulations are yet to approach the scale necessary to require more than a single GPU. However, it is likely that in time models will grow in scale, e.g. biological simulations where the number of cells in bodily organs ranges into the billions. Performing an algorithm across multiple GPUs or nodes introduces higher latencies for communication. This is likely to be particularly harmful for FRNN queries and agent motion across the partitions into which a model has been divided between GPUs. MD has already made this leap, producing specialised techniques for performing FRNN search across multiple GPUs.

However, MD entails a restricted set of properties which may affect performance; hence it is also an open question as to whether MD techniques for multi-GPU FRNN search are applicable to other domains of complex system simulation, or whether these wider properties necessitate their own approaches.

Two approaches not considered within this thesis are those of restricting the radius or Moore neighbourhood, for example using a limited field of view (e.g. pedestrians cannot see behind themselves) or a hexagonal decomposition of the environment. Early analysis of these approaches suggested they were not as straightforward as initially thought. A 180° field of view from many search origins/directions may only skip a single bin, and hence the additional computation and divergence is likely to outweigh any improvements. Similarly, a hexagonal decomposition may lead to accessing as much as 4.79x the search area in 2D, which is greater than that of the original square grid (2.86x). It is possible that an alternative regular environmental decomposition, or even storage format, exists to facilitate more precise bin access. However, it remains an open question as to whether one does and at what cost (e.g. additional memory), or whether the inverse can be mathematically proven.

An inverse approach to that taken by this thesis would be to consider whether other, more performant techniques also merit use. Whilst FRNN search appears to dominate most complex system simulations (and supporting frameworks), the choice over the applicability of instead using k-nearest neighbours or approximate techniques is highly specific to individual models. The best outcome for modellers of complex systems would be for general frameworks for complex system modelling (such as FLAMEGPU) to facilitate multiple options for spatial searches through a common interface. This would allow modellers to easily evaluate each technique and consider the impacts on performance and result precision, to decide the appropriate choice for their model.

Glossary

atomic An operation whereby a processor can guarantee that read, modify and write actions will occur in serial without interference from concurrent threads. These operations are usually limited to simple types and instructions such as add, min, max & exchange. 31, 35

coherence The act of being in a logical order. A spatially coherent data-structure is therefore a data-structure whereby elements are stored in an order similar to how they naturally appear. 31, 32

Hamming distance While comparing two binary strings of equal length, Hamming distance is the number of bit positions in which the two bits are different. 14

hyperplane A plane which is one dimension less than the space in which it exists. If a space is 3-dimensional, then its hyperplanes are 2-dimensional planes. 16

load-factor A hashing data-structure statistic calculated as the number of entries divided by the maximum number of entries (without resizing). 12–14

memory coalescing When multiple memory accesses are combined in a way such that an entire single transaction is used. For example, with a 128 byte bus, it would be possible to read 32 neighbouring 4-byte integers in a single transaction. This is achieved in GPGPU programming by having neighbouring threads access neighbouring locations in memory simultaneously. 27

memory fence An operation which is used as a barrier to enforce a degree of order among concurrent threads. Its usage ensures all memory accesses before the memory fence have been executed prior to those after the fence being started. 35

Moore neighbourhood Within a uniform 2D grid, the Moore neighbourhood of a cell is the cell itself and the 8 adjacent cells. This can be extended to higher dimensionalities, for example in 3D 27 cells form the Moore neighbourhood. This is known as a range 1 Moore neighbourhood; the process can be applied recursively to define higher range Moore neighbourhoods. 44, 49, 84–87, 97, 108, 114, 127, 129

on-chip Refers to memory and caches stored on the same chip as the processor; these offer the highest throughput and lowest latencies, however they are often of limited size. 23, 34, 35

perfect hash A hash function capable of mapping a set of distinct elements into a set of distinct hashes with no collisions. Performed with known data, these are able to reach a maximal load factor. 13, 31

race condition A condition whereby the output of a concurrent block of code is dependent on the sequence in which the concurrent threads execute. 31

triangular numbers Numbers of the sequence: 1, 3, 6, 10. They are calculated by summing the integers between 1 and the desired index (1-based, inclusive); this can be more simply formulated as T_n = n(n+1)/2. 13

Bibliography

[1] Nvidia’s next generation cuda™ compute architecture: Kepler™ gk110. http://www.nvidia.co.uk/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, 2012.

[2] Cub documentation. https://nvlabs.github.io/cub/index.html, 11 2015.

[3] Cuda data parallel primitives library. http://cudpp.github.io/, 11 2015.

[4] Nvidia nvlink: High speed interconnect. http://www.nvidia.com/object/nvlink.html, 11 2015.

[5] Thrust. https://thrust.github.io/, 11 2015.

[6] Message passing interface forum. http://www.mpi-forum.org/, 1 2016.

[7] Openacc: Directives for accelerators. http://www.openacc.org/, 1 2016.

[8] Flamegpu: Flexible large scale agent modelling environment for the gpu. http://www.flamegpu.com/, 1 2018.

[9] Fluids v.3: A large scale open source fluid simulator. http://fluids3.com/, 1 2018.

[10] Nvidia maxwell architecture. https://developer.nvidia.com/maxwell-compute-architecture, 1 2018.

[11] Top500 supercomputer sites. https://www.top500.org/, 1 2019.

[12] M Adelson-Velskii and Evgenii Mikhailovich Landis. An algorithm for the organization of information. Technical report, DTIC Document, 1963.

[13] Highways Agency. North east and yorkshire & humber regional modelling library. http://assets.highways.gov.uk/specialist-information/guidance-and-best-practice-regional-planning/NEYH%20Model%20Database%20June%202013.pdf, 1 2016.

[14] Sadaf R Alam, Pratul K Agarwal, Melissa C Smith, Jeffrey S Vetter, and David Caliga. Using fpga devices to accelerate biomolecular simulations. Computer, 40(3):66–73, 2007.

[15] Siamak Alimirzazadeh, Ebrahim Jahanbakhsh, Audrey Maertens, Sebastián Leguizamón, and François Avellan. Gpu-accelerated 3-d finite volume particle method. Computers & Fluids, 171:79–93, 2018.

[16] David P Anderson. Boinc: A system for public-resource computing and storage. In Grid Computing, 2004. Proceedings. Fifth IEEE/ACM International Workshop on, pages 4–10. IEEE, 2004.

[17] Joshua A Anderson, Eric Jankowski, Thomas L Grubb, Michael Engel, and Sharon C Glotzer. Massively parallel monte carlo for many-particle simulations on gpus. Journal of Computational Physics, 254:27–38, 2013.

[18] Nathan Andrysco and Xavier Tricoche. Matrix trees. In Computer Graphics Forum, volume 29, pages 963–972. Wiley Online Library, 2010.

[19] Nathan Andrysco and Xavier Tricoche. Implicit and dynamic trees for high performance rendering. In Proceedings of Graphics Interface 2011, pages 143–150. Canadian Human-Computer Communications Society, 2011.

[20] Lars Arge, Mark De Berg, Herman J Haverkort, and Ke Yi. The priority r-tree: A practically efficient and worst-case optimal r-tree. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 347–358. ACM, 2004.

[21] Shibdas Bandyopadhyay and Sartaj Sahni. Grs—gpu radix sort for multifield records. In High Performance Computing (HiPC), 2010 International Conference on, pages 1–10. IEEE, 2010.

[22] Jaime Barceló and Jordi Casas. Dynamic network simulation with aimsun. In Simulation approaches in transportation analysis, pages 57–98. Springer, 2005.

[23] Julien Basch and Leonidas J Guibas. Kinetic data structures. Citeseer, 1999.

[24] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R*-tree: an efficient and robust access method for points and rectangles, volume 19. ACM, 1990.

[25] Nathan Bell and Jared Hoberock. Thrust: A productivity-oriented library for cuda. In GPU computing gems Jade edition, pages 359–371. Elsevier, 2011.

[26] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.

[27] Ansel L Blumers, Yu-Hang Tang, Zhen Li, Xuejin Li, and George E Karniadakis. Gpu-accelerated red blood cells simulations with transport dissipative particle dynamics. Computer physics communications, 217:171–179, 2017.

[28] GE Bradley. A proposed mathematical model for computer prediction of crowd movements and their associated risks. Engineering for crowd safety. Amsterdam: Elsevier, pages 303–11, 1993.

[29] Lukas Breitwieser, Roman Bauer, Alberto Di Meglio, Leonard Johard, Marcus Kaiser, Marco Manca, Manuel Mazzara, Fons Rademakers, and Max Talanov. The biodynamo project: Creating a platform for large-scale reproducible biological simulations. arXiv preprint arXiv:1608.04967, 2016.

[30] Martin Burtscher and Keshav Pingali. An efficient cuda implementation of the tree-based barnes hut n-body algorithm. GPU computing Gems Emerald edition, page 75, 2011.

[31] Yuan Cao, Heng Qi, Wenrui Zhou, Jien Kato, Keqiu Li, Xiulong Liu, and Jie Gui. Binary hashing for approximate nearest neighbor search on big data: A survey. IEEE Access, 6:2039–2054, 2017.

[32] Kristof Carlier, Stella Fiorenzo-Catalano, Charles Lindveld, and Piet Bovy. A supernetwork approach towards multimodal travel modeling. In 82nd Annual Meeting of the Transportation Research Board, 2003.

[33] Daniel Cederman and Philippas Tsigas. Gpu-quicksort: A practical quicksort algorithm for graphics processors. Journal of Experimental Algorithmics (JEA), 14:4, 2009.

[34] Pedro Celis. Robin Hood Hashing. PhD thesis, University of Waterloo, 1986.

[35] John Charlton, Luis Rene Montana Gonzalez, Steve Maddock, and Paul Richmond. Fast simulation of crowd collision avoidance. In Computer Graphics International Conference, pages 266–277. Springer, 2019.

[36] Mozhgan K Chimeh and Paul Richmond. Simulating heterogeneous behaviours in complex systems on gpus. Simulation Modelling Practice and Theory, 83:3–17, 2018.

[37] Robert Chisholm, Steve Maddock, and Paul Richmond. Improved gpu near neighbours performance for multi-agent simulations. Journal of Parallel and Distributed Computing, 137:53 – 64, 2020.

[38] Robert Chisholm, Paul Richmond, and Steve Maddock. A standardised benchmark for assessing the performance of fixed radius near neighbours. In European Conference on Parallel Processing, pages 311–321. Springer, 2016.

[39] Gennaro Cordasco, Rosario De Chiara, Ada Mancuso, Dario Mazzeo, Vittorio Scarano, and Carmine Spagnuolo. A framework for distributing agent-based simulations. In European Conference on Parallel Processing, pages 460–470. Springer, 2011.

[40] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Counting sort. In Introduction to Algorithms, chapter 8.2, pages 168–170. MIT Press, 2 edition, 2001.

[41] Marcelo de Gomensoro Malheiros and Marcelo Walter. Spatial sorting: an efficient strategy for approximate nearest neighbor searching. Computers & Graphics, 57:112–126, 2016.

[42] Jure Demšar, Will Blewitt, and Iztok Lebar Bajec. A hybrid model for simulating grazing herds in real time. Computer Animation and Virtual Worlds, page e1914, 2019.

[43] Andreas Dietrich, Ingo Wald, Carsten Benthin, and Philipp Slusallek. The openrt application programming interface–towards a common api for interactive ray tracing. In Proceedings of the 2003 OpenSG Symposium, pages 23–31. Citeseer, 2003.

[44] Ilya Dmitrenok, Viktor Drobnyy, Leonard Johard, and Manuel Mazzara. Evaluation of spatial trees for the simulation of biological tissue. In 2017 31st International Conference on Advanced Information Networking and Applications Workshops (WAINA), pages 276–282. IEEE, 2017.

[45] José M Domínguez, Alejandro JC Crespo, Daniel Valdez-Balderas, Benedict D Rogers, and Moncho Gómez-Gesteira. New multi-gpu implementation for smoothed particle hydrodynamics on heterogeneous clusters. Computer Physics Communications, 184(8):1848–1860, 2013.

[46] Roshan M D’Souza, Mikola Lysenko, and Keyvan Rahmani. Sugarscape on steroids: simulating over a million agents at interactive rates. In Proceedings of Agent conference. 2007 Chicago, IL, 2007.

[47] Jianbin Fang, Ana Lucia Varbanescu, and Henk Sips. A comprehensive performance comparison of cuda and opencl. Parallel Processing (ICPP), 2011 IEEE International Conference on, pages 216–225, 2011.

[48] Kasper Fauerby. Crowds in hitman: Absolution. In Game Developers Conference. IO Interactive, 2012. http://www.gdcvault.com/play/1015315/Crowds-in-Hitman.

[49] Neetu Faujdar and Satya Prakash Ghrera. A practical approach of gpu with cuda hardware. In 2017 7th International Conference on Cloud Computing, Data Science & Engineering-Confluence, pages 7–12. IEEE, 2017.

[50] Martin Fellendorf. Vissim: A microscopic simulation tool to evaluate actuated signal control including bus priority. In 64th Institute of Transportation Engineers Annual Meeting, pages 1–9, 1994.

[51] Raphael A. Finkel and Jon Louis Bentley. Quad trees a data structure for retrieval on composite keys. Acta informatica, 4(1):1–9, 1974.

[52] John J. Fruin. Pedestrian planning and design. Metropolitan Association of Urban Designers and Environmental Planners, 1971.

[53] Henry Fuchs, Zvi M Kedem, and Bruce F Naylor. On visible surface generation by a priori tree structures. In ACM Siggraph Computer Graphics, volume 14, pages 124–133. ACM, 1980.

[54] Robert A Gingold and Joseph J Monaghan. Smoothed particle hydrodynamics: theory and application to non-spherical stars. Monthly notices of the royal astronomical society, 181(3):375–389, 1977.

[55] Jens Glaser, Trung Dac Nguyen, Joshua A Anderson, Pak Lui, Filippo Spiga, Jaime A Millan, David C Morse, and Sharon C Glotzer. Strong scaling of general-purpose molecular dynamics simulations on gpus. Computer Physics Communications, 192:97–107, 2015.

[56] Prashant Goswami, Philipp Schlegel, Barbara Solenthaler, and Renato Pajarola. Interactive sph simulation and rendering on the gpu. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 55–64. Eurographics Association, 2010.

[57] Scott Grauer-Gray, William Killian, Robert Searles, and John Cavazos. Accelerating financial applications on the gpu. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, pages 127–136. ACM, 2013.

[58] Simon Green. Particle simulation using cuda. NVIDIA whitepaper, 6:121–128, 2010.

[59] Leo J Guibas and Robert Sedgewick. A dichromatic framework for balanced trees. In 19th Annual Symposium on Foundations of Computer Science, pages 8–21. IEEE, 1978.

[60] Leonidas J Guibas. Kinetic data structures–a state of the art report. 1998.

[61] RY Guo and Hai-Jun Huang. A mobile lattice gas model for simulating pedestrian evacuation. Physica A: Statistical Mechanics and its Applications, 387(2):580–586, 2008.

[62] Stephen Gustafson, Hemagiri Arumugam, Paul Kanyuk, and Michael Lorenzen. Mure: fast agent based crowd simulation for vfx and animation. In ACM SIGGRAPH 2016 Talks, page 56. ACM, 2016.

[63] Antonin Guttman. R-trees: A dynamic index structure for spatial searching. SIGMOD Rec., 14(2):47–57, June 1984.

[64] Stephen J Guy, Sean Curtis, Ming C Lin, and Dinesh Manocha. Least-effort trajectories lead to emergent crowd behaviors. Physical review E, 85(1):016110, 2012.

[65] M Hall and LG Willumsen. Saturn-a simulation-assignment model for the evaluation of traffic management schemes. Traffic Engineering & Control, 21(4), 1980.

[66] Takahiro Harada, Seiichi Koshizuka, and Yoichiro Kawaguchi. Smoothed particle hydrodynamics on gpus. In Computer Graphics International, pages 63–70. SBC Petropolis, 2007.

[67] M. Harris. Cuda pro tip: Occupancy api simplifies launch configuration. http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/, 11 2015.

[68] Mark Harris, Shubhabrata Sengupta, and John D Owens. Parallel prefix sum (scan) with cuda. GPU gems, 3(39):851–876, 2007.

[69] Ben Harwood and Tom Drummond. Fanng: Fast approximate nearest neighbour graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5713–5722, 2016.

[70] Yoshinori Hashimoto and Shunya Noda. Pricing of mining asic and its implication to the high volatility of cryptocurrency prices. Available at SSRN, 2019.

[71] Dirk Helbing, Illés Farkas, and Tamas Vicsek. Simulating dynamical features of escape panic. Nature, 407(6803):487–490, 2000.

[72] Dirk Helbing and Peter Molnár. Social force model for pedestrian dynamics. Physical review E, 51(5), May 1995.

[73] LF Henderson. The statistics of crowd fluids. Nature, 229:381–383, 1971.

[74] Martin C Herbordt, Tom VanCourt, Yongfeng Gu, Bharat Sukhwani, Al Conti, Josh Model, and Doug DiSabello. Achieving high performance with fpga-based computing. Computer, 40(3):50–57, 2007.

[75] Maurice Herlihy, Nir Shavit, and Moran Tzafrir. Hopscotch hashing. In Distributed Computing, pages 350–364. Springer, 2008.

[76] John Hershberger. Smooth kinetic maintenance of clusters. Computational Geometry, 31(1-2):3–30, 2005.

[77] RC Hoetzlein. Fast fixed-radius nearest neighbors: Interactive million–particle fluids. In GPU Technology Conference, 2014.

[78] Kaixi Hou, Weifeng Liu, Hao Wang, and Wu-chun Feng. Fast segmented sort on gpus. In Proceedings of the International Conference on Supercomputing, page 12. ACM, 2017.

[79] Wenfeng Hou, Daiwei Li, Chao Xu, Haiqing Zhang, and Tianrui Li. An advanced k nearest neighbor classification algorithm based on kd-tree. In 2018 IEEE International Conference of Safety Produce Informatization (IICSPI), pages 902–905. IEEE, 2018.

[80] Michael P Howard, Joshua A Anderson, Arash Nikoubashman, Sharon C Glotzer, and Athanassios Z Panagiotopoulos. Efficient neighbor list calculation for molecular simulation of colloidal systems using graphics processing units. Computer Physics Communications, 203:45–52, 2016.

[81] Michael P Howard, Athanassios Z Panagiotopoulos, and Arash Nikoubashman. Efficient mesoscale hydrodynamics: Multiparticle collision dynamics with massively parallel gpu acceleration. Computer Physics Communications, 230:10–20, 2018.

[82] Hong-Jun Jang, Byoungwook Kim, and Soon-Young Jung. k-nearest reliable neighbor search in crowdsourced lbss. International Journal of Communication Systems, page e4097, 2019.

[83] Liheng Jian, Weidong Yi, and Ying Liu. Fast on-chip quad-trees on gpu. In Information Science and Control Engineering 2012 (ICISCE 2012), IET International Conference on, pages 1–5. IET, 2012.

[84] Jaemin Jo, Jinwook Seo, and Jean-Daniel Fekete. A progressive kd tree for approximate k-nearest neighbors. In 2017 IEEE Workshop on Data Systems for Interactive Analysis (DSIA), pages 1–5. IEEE, 2017.

[85] Mark Joselli, José Ricardo da S Junior, Esteban W Clua, Anselmo Montenegro, Marcos Lage, and Paulo Pagliosa. Neighborhood grid: A novel data structure for fluids animation with gpu computing. Journal of Parallel and Distributed Computing, 75:20–28, 2015.

[86] Mark Joselli, Erick Baptista Passos, Marcelo Zamith, Esteban Clua, Anselmo Montenegro, and Bruno Feijó. A neighborhood grid data structure for massive 3d crowd simulation on gpu. In 2009 VIII Brazilian Symposium on Games and Digital Entertainment, pages 121–131. IEEE, 2009.

[87] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 1–12. IEEE, 2017.

[88] Ibrahim Kamel and Christos Faloutsos. Parallel r-trees. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, SIGMOD ’92, pages 195–204, New York, NY, USA, 1992. ACM.

[89] Ibrahim Kamel and Christos Faloutsos. On packing r-trees. In Proceedings of the second international conference on Information and knowledge management, pages 490–499. ACM, 1993.

[90] Kamran Karimi, Neil G Dickson, and Firas Hamze. A performance comparison of cuda and opencl. arXiv:1005.2581V3, 2011.

[91] Twin Karmakharm, Paul Richmond, and Daniela M. Romano. Agent-based large scale simulation of pedestrians with adaptive realistic navigation vector fields. Theory and Practice of Computer Graphics, The Eurographics Association 2010, pages 66–74, 2010.

[92] Mariam Kiran, Paul Richmond, Mike Holcombe, Lee Shawn Chin, David Worth, and Chris Greenough. Flame: simulating large populations of agents on parallel hardware architectures. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: volume 1-Volume 1, pages 1633–1636. International Foundation for Autonomous Agents and Multiagent Systems, 2010.

[93] Donald E. Knuth. The Art of Computer Programming, volume 3. Addison-Wesley Professional, 2nd edition, 1998. Answers to Exercises: Chapter 6.4, Exercise 20 (Page 731).

[94] G Kotusevski and KA Hawick. A review of traffic simulation software. 2009.

[95] Loudon Kyle. Mastering Algorithms with C, chapter 8, pages 144–145. O’Reilly & Associates, 1999.

[96] Thomas Larsson and Tomas Akenine-Möller. A dynamic bounding volume hierarchy for generalized collision detection. Computers & Graphics, 30(3):450–459, 2006.

[97] Victor W Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, et al. Debunking the 100x gpu vs. cpu myth: an evaluation of throughput computing on cpu and gpu. In ACM SIGARCH Computer Architecture News, volume 38, pages 451–460. ACM, 2010.

[98] Sylvain Lefebvre and Hugues Hoppe. Perfect spatial hashing. In ACM Transactions on Graphics (TOG), volume 25, pages 579–588. ACM, 2006.

[99] Francesco Lettich, Salvatore Orlando, and Claudio Silvestri. Processing streams of spatial k-nn queries and position updates on manycore gpus. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, page 26. ACM, 2015.

[100] Scott T Leutenegger, Mario Lopez, Jeffrey Edgington, et al. Str: A simple and efficient algorithm for r-tree packing. In Data Engineering, 1997. Proceedings. 13th International Conference on, pages 497–506. IEEE, 1997.

[101] David Levinthal. Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors, page 22. Intel.

[102] Tyson J Lipscomb, Anqi Zou, and Samuel S Cho. Parallel verlet neighbor list algorithm for gpu-optimized md simulations. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine, pages 321–328. ACM, 2012.

[103] Elvis S Liu. On the scalability of agent-based modeling for medical nanorobotics. In 2015 Winter Simulation Conference (WSC), pages 1427–1435. IEEE, 2015.

[104] Elvis S Liu and Georgios K Theodoropoulos. An approach for parallel interest matching in distributed virtual environments. In Proceedings of the 2009 13th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, pages 57–65. IEEE Computer Society, 2009.

[105] Leon B Lucy. A numerical approach to the testing of the fission hypothesis. The astronomical journal, 82:1013–1024, 1977.

[106] Sean Luke, Claudio Cioffi-Revilla, Liviu Panait, Keith Sullivan, and Gabriel Balan. Mason: A multiagent simulation environment. Simulation, 81(7):517–527, 2005.

[107] Lijuan Luo, Martin Wong, and Wen-mei Hwu. An effective gpu implementation of breadth-first search. In Proceedings of the 47th design automation conference, pages 52–55. ACM, 2010.

[108] Lijuan Luo, Martin DF Wong, and Lance Leong. Parallel implementation of r-trees on the gpu. In Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific, pages 353–358. IEEE, 2012.

[109] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. Nvidia tensor core programmability, performance & precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 522–531. IEEE, 2018.

[110] Donald JR Meagher. Octree encoding: A new technique for the representation, manipulation and display of arbitrary 3-d objects by computer. Technical Report IPL-TR-80-111, Image Processing Laboratory, Electrical and Systems Engineering Department, Rensselaer Polytechnic Institute, 1980.

[111] Ramin Mehran, Akira Oyama, and Mubarak Shah. Abnormal crowd behavior detection using social force model. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 935–942. IEEE, 2009.

[112] Bruce Merry. A performance comparison of sort and scan libraries for gpus. Parallel Processing Letters, 25(04):1550007, 2015.

[113] Paulius Micikevicius. Local memory and register spilling. In GPU Technology Conference, 2011. http://on-demand.gputechconf.com/gtc-express/2011/presentations/register_spilling.pdf.

[114] Paulius Micikevicius. Performance optimization: Programming guidelines and gpu architecture reasons behind them. In GPU Technology Conference, 2013.

[115] Athanasios Mokos, Benedict D Rogers, Peter K Stansby, and José M Domínguez. Multi-phase sph modelling of violent hydrodynamics on gpus. Computer Physics Communications, 196:304–316, 2015.

[116] JJ Monaghan. Smoothed particle hydrodynamics and its diverse applications. Annual Review of Fluid Mechanics, 44:323–346, 2012.

[117] Michael J North, Nicholson T Collier, Jonathan Ozik, Eric R Tatara, Charles M Macal, Mark Bragen, and Pam Sydelko. Complex adaptive systems modeling with repast simphony. Complex adaptive systems modeling, 1(1):1–26, 2013.

[118] NVIDIA. Tuning cuda applications for maxwell. http://docs.nvidia.com/cuda/maxwell-tuning-guide/, 11 2015.

[119] NVIDIA. Nvidia tesla p100 whitepaper. Technical report, NVIDIA, 2016.

[120] NVIDIA. Nvidia tesla v100 gpu architecture whitepaper. Technical report, NVIDIA, 2017.

[121] NVIDIA. Nvidia turing gpu architecture whitepaper. Technical report, NVIDIA, 2018.

[122] NVIDIA. Tuning cuda applications for volta. http://docs.nvidia.com/cuda/volta-tuning-guide/, 11 2019.

[123] Institute of Transportation Systems at the German Aerospace Center. Github: Simulation of urban mobility (sumo). https://github.com/planetsumo/sumo/tree/master/sumo/src/microsim, 11 2015.

[124] Kazuhiko Ohno, Tomoki Nitta, and Hiroto Nakai. Sph-based fluid simulation on gpu using verlet list and subdivided cell-linked list. In Computing and Networking (CANDAR), 2017 Fifth International Symposium on, pages 132–138. IEEE, 2017.

[125] Wendy Osborn and Ken Barker. The 2dr-tree: a 2-dimensional spatial access method. 2004.

[126] John D Owens, Mike Houston, David Luebke, Simon Green, John E Stone, and James C Phillips. Gpu computing. Proceedings of the IEEE, 96(5):879–899, 2008.

[127] Rasmus Pagh. Locality-sensitive hashing without false negatives. In Proceedings of the twenty-seventh annual ACM-SIAM symposium on Discrete algorithms, pages 1–9. SIAM, 2016.

[128] Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122–144, 2004.

[129] Szilárd Páll and Berk Hess. A flexible algorithm for calculating pair interactions on simd architectures. Computer Physics Communications, 184(12):2641–2650, 2013.

[130] M Patella, P Ciaccia, and P Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proc. of the International Conference on Very Large Databases (VLDB), Athens, Greece, pages 1241–1253, 1997.

[131] Nuria Pelechano, Jan M Allbeck, and Norman I Badler. Controlling individual agents in high-density crowd simulation. In Proceedings of the 2007 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 99–108. Eurographics Association, 2007.

[132] Gay Jane Perez, Giovanni Tapang, May Lim, and Caesar Saloma. Streaming, disruptive interference and power-law behavior in the exit dynamics of confined pedestrians. Physica A: Statistical Mechanics and its Applications, 312(3):609–618, 2002.

[133] Stefan Popov, Johannes Günther, Hans-Peter Seidel, and Philipp Slusallek. Stackless kd-tree traversal for high performance gpu ray tracing. In Computer Graphics Forum, volume 26, pages 415–424. Wiley Online Library, 2007.

[134] Zahed Rahmati and Timothy M Chan. A clustering-based approach to kinetic closest pair. Algorithmica, pages 1–15, 2018.

[135] Craig Reynolds. Big fast crowds on ps3. In Proceedings of the 2006 ACM SIGGRAPH symposium on Videogames, pages 113–121. ACM, 2006.

[136] Craig W. Reynolds. Flocks, herds and schools: A distributed behavioral model. ACM SIGGRAPH Computer Graphics, 21(4):25–34, 1987.

[137] Paul Richmond and Daniela Romano. Template-driven agent-based modeling and simulation with cuda. GPU Computing Gems Emerald Edition, page 313, 2011.

[138] Paul Richmond, Dawn Walker, Simon Coakley, and Daniela Romano. High performance cellular level agent-based simulation with flame for the gpu. Briefings in bioinformatics, 11(3):334–347, 2010.

[139] Kamel Rushaidat, Loren Schwiebert, Brock Jackman, Jason Mick, and Jeffrey Potoff. Evaluation of hybrid parallel cell list algorithms for monte carlo simulation. In High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pages 1859–1864. IEEE, 2015.

[140] E. Rustico, G. Bilotta, A. Hérault, C. Del Negro, and G. Gallo. Advances in multi-gpu smoothed particle hydrodynamics simulations. IEEE Transactions on Parallel and Distributed Systems, 25(1):43–52, Jan 2014.

[141] Romelia Salomon-Ferrer, Andreas W Götz, Duncan Poole, Scott Le Grand, and Ross C Walker. Routine microsecond molecular dynamics simulations with amber on gpus. 2. explicit solvent particle mesh ewald. Journal of chemical theory and computation, 9(9):3878–3888, 2013.

[142] Nadathur Satish, Mark Harris, and Michael Garland. Designing efficient sorting algorithms for manycore gpus. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–10. IEEE, 2009.

[143] Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D Nguyen, Victor W Lee, Daehyun Kim, and Pradeep Dubey. Fast sort on cpus, gpus and intel mic architectures. Intel Labs, pages 77–80, 2010.

[144] Timos K. Sellis, Nick Roussopoulos, and Christos Faloutsos. The r+-tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th International Conference on Very Large Data Bases, VLDB ’87, pages 507–518, San Francisco, CA, USA, 1987. Morgan Kaufmann Publishers Inc.

[145] Wei Shao and Demetri Terzopoulos. Autonomous pedestrians. In Proceedings of the 2005 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 19–28. ACM, 2005.

[146] Maxim Shevtsov, Alexei Soupikov, and Alexander Kapustin. Highly parallel fast kd-tree construction for interactive ray tracing of dynamic scenes. In Computer Graphics Forum, volume 26, pages 395–404. Wiley Online Library, 2007.

[147] Elias Stehle and Hans-Arno Jacobsen. A memory bandwidth-efficient hybrid radix sort on gpus. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 417–432. ACM, 2017.

[148] Avneesh Sud, Erik Andersen, Sean Curtis, Ming C Lin, and Dinesh Manocha. Real-time path planning in dynamic virtual environments using multiagent navigation graphs. Visualization and Computer Graphics, IEEE Transactions on, 14(3):526–538, 2008.

[149] Hongyu Sun, Yanshan Tian, Yulong Zhang, Jiong Wu, Sen Wang, Qiong Yang, and Qingguo Zhou. A special sorting method for neighbor search procedure in smoothed particle hydrodynamics on gpus. In Parallel Processing Workshops (ICPPW), 44th International Conference on, pages 81–85. IEEE, 2015.

[150] Shailesh Tamrakar. Performance optimization and statistical analysis of basic immune simulator (bis) using the flame gpu environment. 2015.

[151] Michael Bedford Taylor. Bitcoin and the age of bespoke silicon. In 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), pages 1–10. IEEE, 2013.

[152] George Teodoro, Tahsin Kurc, Jun Kong, Lee Cooper, and Joel Saltz. Comparative performance analysis of intel xeon phi, gpu, and cpu. arXiv preprint arXiv:1311.0378, 2013.

[153] Seth Tisue and Uri Wilensky. Netlogo: A simple environment for modeling complexity. In International conference on complex systems, pages 16–21. Boston, MA, 2004.

[154] Yuri Torres, Arturo Gonzalez-Escribano, and Diego R Llanos. ubench: exposing the impact of cuda block geometry in terms of performance. The Journal of Supercomputing, 65(3):1150–1163, 2013.

[155] Adrien Treuille, Seth Cooper, and Zoran Popović. Continuum crowds. ACM Transactions on Graphics (TOG), 25(3):1160–1168, 2006.

[156] Jacobus Antoon Van Meel, Axel Arnold, Daan Frenkel, SF Portegies Zwart, and Robert G Belleman. Harvesting graphics power for md simulations. Molecular Simulation, 34(3):259–266, 2008.

[157] Loup Verlet. Computer “experiments” on classical fluids. i. thermodynamical properties of lennard-jones molecules. Physical review, 159(1):98, 1967.

[158] Jeffrey Scott Vitter. Implementations for coalesced hashing. Communications of the ACM, 25(12):911–926, 1982.

[159] Vasily Volkov. Better performance at lower occupancy. In GPU Technology Conference, 2010. http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf.

[160] Vasily Volkov. Tutorial at sc11: Unrolling parallel loops. http://www.cs.berkeley.edu/~volkov/volkov11-unrolling.pdf, 11 2011.

[161] Vasily Volkov and James W Demmel. Benchmarking gpus to tune dense linear algebra. In High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for, pages 1–11. IEEE, 2008.

[162] John Von Neumann. The general and logical theory of automata. Cerebral mechanisms in behavior, pages 1–41, 1951.

[163] DC Walker, J Southgate, G Hill, M Holcombe, DR Hose, SM Wood, S Mac Neil, and RH Smallwood. The epitheliome: agent-based modelling of the social behaviour of cells. Biosystems, 76(1-3):89–100, 2004.

[164] Chao Wang, Lei Gong, Qi Yu, Xi Li, Yuan Xie, and Xuehai Zhou. Dlau: A scalable deep learning accelerator unit on fpga. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 36(3):513–517, 2016.

[165] Daniel Winkler, Massoud Rezavand, and Wolfgang Rauch. Neighbour lists for smoothed particle hydrodynamics on gpus. Computer Physics Communications, 225:140–148, 2018.

[166] Mingyu Yang, Philipp Andelfinger, Wentong Cai, and Alois Knoll. Evaluation of conflict resolution methods for agent-based simulations on the gpu. In Proceedings of the 2018 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, SIGSIM-PADS ’18, pages 129–132, New York, NY, USA, 2018. ACM.

[167] Tao Yang, Ralph R Martin, Ming C Lin, Jian Chang, and Shi-Min Hu. Pairwise force sph model for real-time multi-interaction applications. IEEE transactions on visualization and computer graphics, 23(10):2235–2247, 2017.

[168] Xiaoping Zheng, Tingkuan Zhong, and Mengting Liu. Modeling crowd evacuation of a building based on seven methodological approaches. Building and Environment, 44(3):437–445, 2009.
