The Process of Parallelizing the Conjunction Prediction Algorithm of ESA’s SSA Conjunction Prediction Service using GPGPU

ESAC Trainee Project September 2012 – March 2013

Marius Fehr

Mentors: Vicente Navarro and Luis Martin

1 Space Situational Awareness

SWE – Space Weather

NEO – Near Earth Objects

SST – Space Surveillance and Tracking

2 Space Surveillance and Tracking

Catalog (JSpOC, US Air Force): 16'000 objects > 10 cm

Estimates: 600'000 objects > 1 cm

3 The Conjunction Prediction System

4 All vs. All Conjunction Analysis

[Diagram: objects 1 to 7 expanded into all unique pairs [1,2] [1,3] [1,4] [1,5] ... [5,6] [5,7] [6,7]]

> Number of pairs grows quadratically with the number of objects

> The analyses of all object pairs are independent → huge potential for parallelism

> 10k objects could theoretically be analyzed in millions of parallel threads

> CPUs usually run no more than about a dozen threads in parallel

> How can we exploit that?
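For scale, a quick worked count behind the bullets above: n objects yield n(n-1)/2 unique pairs, so 10'000 objects give 10'000 x 9'999 / 2 = 49'995'000, which is the ~50M pairs quoted on the Conjunction Analysis slide.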

5 GPU – NVIDIA's Fermi Architecture

[Figure: NVIDIA Fermi GF100 GPU block diagram. Sources:
http://benchmarkreviews.com/images/reviews/processor/NVIDIA_Fermi/-fermi-gf100-gpu--model-block-diagram-full.png
http://benchmarkreviews.com/images/reviews/processor/NVIDIA_Fermi/nvidia-fermi-gf100-gpu-block-diagram-benchmarkreviews.png]

6 CUDA – Grid, Blocks and Threads

Example: matrix multiplication of two 4 x 4 matrices A and B: 4 blocks with 4 threads each, 16 threads in total (see the sketch after this list).

[Diagram: Grid → Blocks → Threads]

> Abstraction of multiprocessors and cores

> GPU distributes blocks to idle multiprocessors

> Idle/waiting threads are swapped out instantly

> Up to 65k x 65k x 65k blocks

> Up to 1024 threads per block
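A minimal sketch of how the 4 x 4 example above could map one thread to one output element; the kernel name and row/column layout are illustrative, not the CPS code:

    // One thread computes one element of C = A * B (row-major, n x n).
    __global__ void matmul4(const float *A, const float *B, float *C, int n)
    {
        // blockIdx.x selects the row (one block per row),
        // threadIdx.x selects the column (one thread per column).
        int row = blockIdx.x;
        int col = threadIdx.x;
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }

    // Launch: 4 blocks with 4 threads each, 16 threads in total.
    // matmul4<<< 4, 4 >>>(dev_A, dev_B, dev_C, 4);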

7 CUDA – Program Flow

[Diagram: CUDA program flow. The CPU copies input data from host memory to GPU memory, launches a kernel that executes as a grid of blocks of threads on the GPU, and copies the result back from GPU memory; see the vector-addition example on slide 22.]

8 CAN – Conjunction Analysis

Inputs: list of objects (10k objects, 50M pairs) and the prediction period

Apogee-Perigee Filter: 50M → 20M pairs

Loop over epochs:
> Load ephemeris data into memory if necessary (files/DB)
> Smart Sieve: 20M → 40k pairs
> Linear Search
> Find time and distance of closest approach
> Calculate collision risk and write conjunctions to files/DB

(A code skeleton of this flow follows below.)
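Put as code, the flow above could be skeletonized as follows; every type and function here is a placeholder stub, not an actual CPS routine:

    /* Skeleton of the per-epoch pipeline (illustrative stubs only). */
    typedef struct { int a, b; } Pair;

    static int  apogee_perigee_filter(Pair *p, int n) { (void)p; return n; }
    static void load_ephemerides(int epoch)           { (void)epoch; }
    static int  smart_sieve(Pair *p, int n)           { (void)p; return n; }
    static int  linear_search(Pair *p, int n)         { (void)p; return n; }
    static void find_tca_write_conjunctions(Pair *p, int n) { (void)p; (void)n; }

    void conjunction_analysis(Pair *pairs, int n_pairs, int n_epochs)
    {
        /* Apogee-Perigee Filter: 50M -> 20M pairs, done once. */
        int n = apogee_perigee_filter(pairs, n_pairs);

        for (int epoch = 0; epoch < n_epochs; ++epoch) {
            load_ephemerides(epoch);             /* from files/DB, if needed */
            int n_cand = smart_sieve(pairs, n);  /* 20M -> 40k per epoch     */
            int n_hits = linear_search(pairs, n_cand);
            find_tca_write_conjunctions(pairs, n_hits);  /* to files/DB      */
        }
    }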

9 Identifying Parallelizable Functions

10'000 Objects - 8 Days

[Chart: runtime per function for the original code. Profiled functions: Apogee-Perigee Filter, Loading Ephemerides (OpenMP), Interpolating in Smart Sieve, Smart Sieve, Interpolating in Linear Search, Linear Search, Find TCA, Conjunction Definition, Penetration Factor, Remaining Operations. Axis: Runtime [s], 0–900.]

10 Smart Sieve

[Diagram: one kernel; each thread takes one object pair that passed the Apogee-Perigee Filter ([1,2] [1,3] [1,4] [1,5] [1,6] [1,7] [2,3] [2,4] ... [5,6] [5,7] [6,7]) and runs it through a cascade of filters (Filter 1, Filter 2, Filter 3, ..., Filter N). Each surviving pair increases the potential-pair counter with atomicAdd() (critical section).]

> In general: Having a large number of threads compete for a resource is expensive

> BUT: only about 0.2% of all threads actually reach the critical section (sketched below)
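A minimal sketch of this filter-and-count pattern, with a dummy predicate standing in for the real filter cascade; types and names are illustrative, not the actual Smart Sieve code:

    struct Pair { int i, j; };

    // Placeholder for the sieve's geometric filter cascade.
    __device__ bool passes_all_filters(Pair p)
    {
        return ((p.i ^ p.j) & 511) == 0;   // dummy predicate, ~0.2% pass rate
    }

    // Each thread tests one candidate pair; survivors are appended to a
    // compact output list. atomicAdd serializes only the few threads that
    // pass all filters, so contention stays low.
    __global__ void smart_sieve(const Pair *pairs, int n_pairs,
                                Pair *survivors, int *n_survivors)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_pairs) return;

        if (passes_all_filters(pairs[i])) {
            int slot = atomicAdd(n_survivors, 1);   // critical section
            survivors[slot] = pairs[i];
        }
    }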

11 Linear Search

[Diagram, repeated over three animation slides (11–13): stepping through the interpolation steps of an epoch, the sign of a pair's relative velocity stays -1 for consecutive time steps until it flips to +1, marking the step that brackets the closest approach. Labels: Epoch, Interpolation, Search for Sign Change.]

> Kernel 1: 1 thread = 1 object:
> Interpolating the state vectors needed for the current time step

> Kernel 2: 1 thread = 1 potential pair:
> Searching for a sign change in the relative velocity (sketched below)
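A hedged sketch of Kernel 2, the per-pair sign-change test; the array layout and names are assumptions for illustration:

    // One thread per potential pair: compare the sign of the pair's
    // relative velocity at the previous and current time step. A flip
    // from - to + brackets a minimum of the separation distance.
    __global__ void sign_change_search(const double *rdot_prev,
                                       const double *rdot_curr,
                                       int n_pairs, int *flagged)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_pairs) return;
        flagged[i] = (rdot_prev[i] < 0.0 && rdot_curr[i] >= 0.0);
    }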

14 Find TCA

[Diagram: one kernel; each potential pair is checked for a sign change and, if one was found, handed to a zero finder.]

> Kernel: 1 thread = 1 potential pair:
> Check if Linear Search found a time step with a sign change
> Start zero finder (Regula Falsi, sketched below)
> Requires state vector interpolation for every intermediate step

[Figure: Regula Falsi iteration, http://commons.wikimedia.org/wiki/File:Regula_falsi.gif]
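For reference, a textbook regula falsi iteration in this setting might look like the sketch below; PairState and rdot_at() are hypothetical placeholders for the pair's data and the state vector interpolation, not CPS routines:

    struct PairState;                               // opaque placeholder
    __device__ double rdot_at(const PairState *s, double t);

    // Regula falsi: given times a < b with f(a) < 0 < f(b), repeatedly
    // replace one endpoint with the secant-line root until converged.
    __device__ double find_tca(const PairState *s, double a, double b,
                               double tol)
    {
        double fa = rdot_at(s, a);                  // f(a) < 0
        double fb = rdot_at(s, b);                  // f(b) > 0
        double c  = a;
        for (int iter = 0; iter < 100; ++iter) {
            c = (a * fb - b * fa) / (fb - fa);      // secant-line root
            double fc = rdot_at(s, c);              // one interpolation
            if (fabs(fc) < tol) break;              // converged
            if (fc < 0.0) { a = c; fa = fc; }       // keep the bracket
            else          { b = c; fb = fc; }
        }
        return c;                                   // TCA estimate
    }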

15 GPU Timeline

> Goal: Minimize memory transfers and (re)allocations (see the sketch below)

[Timeline diagram: per epoch, the GPU executes Smart Sieve, several Linear Search (LS) passes and Find TCA back-to-back. In parallel, the CPU runs the Apogee-Perigee Filter, loads ephemerides, pairs and constants, resizes the potential-pairs allocation if necessary, and retrieves the results of each epoch (Epoch 1, Epoch 2, ...).]
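The "resize only if necessary" step suggests a grow-only buffer; a minimal sketch of that pattern, not the CPS code:

    #include <cuda_runtime.h>
    #include <stddef.h>

    // Grow-only device buffer: reallocate only when an epoch needs more
    // capacity than any previous one, so most epochs reuse the allocation.
    static void  *d_pairs    = NULL;
    static size_t d_capacity = 0;      // current capacity in bytes

    void ensure_capacity(size_t bytes)
    {
        if (bytes <= d_capacity) return;   // fast path: reuse the buffer
        cudaFree(d_pairs);                 // cudaFree(NULL) is a no-op
        cudaMalloc(&d_pairs, bytes);
        d_capacity = bytes;
    }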

16 Test Environment

> CPU: Intel Xeon E5620, 4 cores @ 2.4 GHz
> Memory: 6 GB
> GPU: NVIDIA Geforce GTX 580

> 1.5 GB of Memory

> 512 CUDA Cores

> Compute Capability 2.0

> Double precision is faster on the Geforce 500 series than on the newer 600 series

17 Computation Time - Results I

[Chart: runtime vs. number of objects (313, 625, 1250, 2500, 5000, 10000) for the Fortran version, the C version after algorithm optimization (with OpenMP), and the CUDA/GPU version. Y-axis: Runtime [s], 0–1400. For 10'000 objects the GPU version cuts the runtime by 88%.]

18 Computation Time - Results II

10'000 Objects - 8 Days

[Chart: runtime breakdown of the Fortran, C and CUDA versions by component: the CUDA kernels (Interpolating in Smart Sieve, Interpolating in Linear Search, Smart Sieve, Find TCA, Linear Search), the algorithm optimizations (Find TCA, Conjunction Definition, Penetration Factor), the Apogee-Perigee Filter, Remaining Operations, and CUDA memory operations. Axis: Runtime [s], 0–500.]

19 Conclusion

> Considerable improvement of the all vs. all computation time (10'000 objects, 8 days)

> Algorithm optimization: -41%
> Parallelization with CUDA: -88%

> Bottleneck: I/O

> Reading ephemeris data from file/DB
> Writing conjunctions to file/DB

> Future Work

> Parallelize other parts of the CPS: computation of conjunction risk, orbit propagation, ...
> Recompute ephemerides instead of loading them from file/DB

20 What about your program?

> Can your program be divided into thousands of parallel (and equal) computations?

> Is any communication or cooperation between threads necessary?
> Only efficient between threads of the same block (< 1024)

> Is the computational effort large compared to the size of the data?

> Does your program use libraries like BLAS or FFT?
> Try cuBLAS and cuFFT (see the sketch after this list)

> Be aware: the GPU has a very flat memory hierarchy and small caches
> 64 KB L1 cache, 0–768 KB L2 cache
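For the library route, a minimal cuBLAS matrix multiply looks roughly like this; error handling is omitted and the matrices are assumed to be column-major and already on the device:

    #include <cublas_v2.h>

    // C = alpha * A * B + beta * C, all matrices n x n, column-major,
    // d_A/d_B/d_C already allocated and filled on the GPU.
    void gemm_example(const float *d_A, const float *d_B, float *d_C, int n)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);

        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);

        cublasDestroy(handle);
    }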

21 Questions ?

> Marius Fehr [email protected]

> Vicente Navarro [email protected]

> Luis Martin [email protected]

22 The CUDA C Extension

> All threads execute the same piece of code, the kernel
> Kernel replaces loop
> Index of each thread is computed from block and thread id and block dimension
> Global memory can be accessed by every thread
> Make sure there are no race conditions

    __global__ void add( int *a, int *b, int *c )
    {
        // Global index of this thread, from block/thread id and block dimension.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        c[ idx ] = a[ idx ] + b[ idx ];
    }

    int vector_addition( )
    {
        int U = 1000;                 // blocks
        int V = 256;                  // threads per block
        int N = U * V;
        int a[N], b[N], c[N];
        // fill a and b

        int *cuda_a, *cuda_b, *cuda_c;
        cudaMalloc( (void**) &cuda_a, N * sizeof(int) );
        cudaMalloc( (void**) &cuda_b, N * sizeof(int) );
        cudaMalloc( (void**) &cuda_c, N * sizeof(int) );

        cudaMemcpy( cuda_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
        cudaMemcpy( cuda_b, b, N * sizeof(int), cudaMemcpyHostToDevice );

        add<<< U, V >>>( cuda_a, cuda_b, cuda_c );

        cudaMemcpy( c, cuda_c, N * sizeof(int), cudaMemcpyDeviceToHost );

        cudaFree( cuda_a );
        cudaFree( cuda_b );
        cudaFree( cuda_c );
        return 0;
    }

23 CUDA GPUs

[Chart: comparison of Quadro 4000, Quadro 6000, Geforce GTX 580, Geforce GTX 680 and Tesla C2075 by FP32 [Gflops], FP64 [Gflops], price [$] and memory [GB].]

24

> Creates a GPU-enabled version of the Fortran code
> Works either using directives (like OpenMP) or automatic analysis and optimization
> 30-day trial available

Warps and Threads

Threads are grouped in warps: 1 warp = 32 threads

1 core: 1 instruction in 1 cycle

BUT all cores in the same group (Group 1, Group 2 in the diagram) execute the same instruction at the same time: Single Instruction Multiple Thread (SIMT, sketched below)

Example: Fermi architecture, 2 groups x 16 cores: 1 warp in 1 cycle

BUT not every core has its own Special Function Unit
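One practical consequence of SIMT, shown in a minimal illustrative kernel: branches that diverge within a warp are serialized, with inactive lanes masked off, so a divergent warp pays roughly for both paths:

    __global__ void divergent(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)            // even/odd lanes diverge within a warp
            x[i] = x[i] * 2.0f;
        else
            x[i] = x[i] + 1.0f;
    }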

Shared Memory

[Diagram: blocks of the grid copy data from global GPU memory into their own shared memory and back.]

● Copy from global to shared memory and back
● Typically 16 KB per block
● Very low latency
● Shared memory is only visible to threads inside the block
● CUDA provides tools for synchronization (see the reduction sketch below)
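A typical use is sketched below: a generic block-level sum reduction that stages data in shared memory and synchronizes with __syncthreads(). This is an illustrative textbook example, not CPS code, and assumes a launch with 256 threads per block:

    // Each block stages its slice of the input in shared memory, then
    // halves the number of active threads each step.
    __global__ void block_sum(const float *in, float *block_out, int n)
    {
        __shared__ float buf[256];               // per-block scratch space
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        buf[tid] = (i < n) ? in[i] : 0.0f;       // global -> shared
        __syncthreads();                         // barrier for the block

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) buf[tid] += buf[tid + s];
            __syncthreads();
        }
        if (tid == 0) block_out[blockIdx.x] = buf[0];   // shared -> global
    }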

Texture Memory

[Diagram: texture reads pass from global GPU memory through dedicated hardware to the blocks and threads of the grid.]

● Global memory accessed by dedicated hardware, used for textures
● Read cache; ~6–8 KB per multiprocessor, optimized for spatial locality in texture coordinates
● Serves as a read-through cache and supports multiple simultaneous reads through hardware-accelerated filtering