
The Process of Parallelizing the Conjunction Prediction Algorithm of ESA's SSA Conjunction Prediction Service using GPGPU

ESAC Trainee Project, September 2012 – March 2013
Marius Fehr
Mentors: Vicente Navarro and Luis Martin

Space Situational Awareness

> SWE – Space Weather
> NEO – Near-Earth Objects
> SST – Space Surveillance and Tracking

Space Surveillance and Tracking

> Catalog (JSpOC, US Air Force): 16,000 objects > 10 cm
> Estimates: 600,000 objects > 1 cm

The Conjunction Prediction System

All vs. All Conjunction Analysis

Objects 1, 2, ..., 7 are paired up as [1,2], [1,3], [1,4], [1,5], ..., [5,6], [5,7], [6,7].

> The number of pairs grows quadratically with the number of objects
> The analyses of all object pairs are independent → huge potential for parallelism
> 10k objects could theoretically be analyzed in millions of parallel threads
> CPUs usually run no more than about a dozen threads in parallel
> How can we exploit that?

GPU – NVIDIA's Fermi Architecture

http://benchmarkreviews.com/images/reviews/processor/NVIDIA_Fermi/nvidia-fermi-gf100-gpu-shader-model-block-diagram-full.png
http://benchmarkreviews.com/images/reviews/processor/NVIDIA_Fermi/nvidia-fermi-gf100-gpu-block-diagram-benchmarkreviews.png

CUDA – Grid, Blocks and Threads

Example: multiplying two 4 x 4 matrices A and B with 4 blocks of 4 threads each, 16 threads in total.

> Grids, blocks and threads are an abstraction of multiprocessors and cores
> The GPU distributes blocks to idle multiprocessors
> Idle/waiting threads are swapped out instantly
> Up to 65k x 65k x 65k blocks
> Up to 1024 threads per block

CUDA – Program Flow

[Diagram: the CPU copies input data into GPU memory, launches a kernel as a grid of blocks of threads, and copies the results back.]

CAN – Conjunction Analysis

> Input: a list of 10k objects (50M pairs) and a prediction period
> Apogee-Perigee Filter: 50M → 20M pairs
> Loop over epochs:
  > Load ephemeris data (files/DB) into memory if necessary
  > Smart Sieve: 20M → 40k pairs
  > Linear Search
  > Find time and distance of closest approach
  > Calculate collision risk and write conjunctions to files/DB

Identifying Parallelizable Functions

[Chart: runtime breakdown for 10,000 objects over 8 days (0–900 s): Apogee-Perigee Filter, loading ephemerides (OpenMP), interpolating in Smart Sieve, Smart Sieve, interpolating in Linear Search, Linear Search, Find TCA, conjunction definition, penetration factor, remaining operations.]

Smart Sieve Kernel

The object pairs that passed the Apogee-Perigee Filter ([1,2], [1,3], [1,4], ...) run through Filters 1 to N; the survivors increase the potential-pair counter with atomicAdd() (critical section).

> In general: having a large number of threads compete for a resource is expensive
> BUT: only about 0.2% of all threads actually reach the critical section

Linear Search

Stepping through the epoch, the sign of the relative velocity of a pair evolves as -1 -1 -1 -1 -1 +1: a sign change marks a close approach between the last two time steps.

> Kernel 1: 1 thread = 1 object
  > Interpolates the state vectors needed for the current time step
> Kernel 2: 1 thread = 1 potential pair
  > Searches for a sign change in the relative velocity

Find TCA Kernel

Potential pairs → sign change? → zero finder

> Kernel: 1 thread = 1 potential pair
  > Check if the Linear Search found a time step with a sign change
  > Start the zero finder (regula falsi)
  > Requires state vector interpolation for every intermediate step

http://commons.wikimedia.org/wiki/File:Regula_falsi.gif

The next three code sketches illustrate these kernels.
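First, a minimal sketch of the Smart Sieve compaction pattern. This is hypothetical, simplified code, not the CPS source: the filter cascade is collapsed into a placeholder distance test (positions assumed to be in km), and the index-to-pair mapping is one simple variant among several.

// Placeholder for the real filter cascade (illustrative only): keep the
// pair if the two objects are within 10 km of each other.
__device__ bool passes_filters(float4 a, float4 b)
{
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz < 10.0f * 10.0f;
}

// Map a flat pair index k in [0, n*(n-1)/2) to the pair (i, j), i < j,
// by walking the rows of the upper triangle of the pair matrix.
__device__ int2 pair_from_index(long long k, int n)
{
    int i = 0;
    long long row = n - 1;
    while (k >= row) { k -= row; ++i; --row; }
    return make_int2(i, i + 1 + (int)k);
}

// One thread per candidate pair; the few survivors append themselves to a
// compact list through a single atomicAdd() on a global counter.
__global__ void smart_sieve(long long n_pairs, int n_objects,
                            const float4 *pos,      // interpolated positions
                            int2 *potential_pairs,  // compacted output list
                            int *n_potential)       // global counter
{
    for (long long k = blockIdx.x * (long long)blockDim.x + threadIdx.x;
         k < n_pairs;
         k += (long long)gridDim.x * blockDim.x) {   // grid-stride loop
        int2 p = pair_from_index(k, n_objects);
        if (!passes_filters(pos[p.x], pos[p.y])) continue;
        int slot = atomicAdd(n_potential, 1);  // critical section, ~0.2% of threads
        potential_pairs[slot] = p;
    }
}

Because only a tiny fraction of threads reach the atomicAdd(), the contention that would normally make such a counter expensive never materializes.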
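Next, a sketch of Linear Search Kernel 2 (again hypothetical and simplified). Kernel 1 is assumed to have already filled `pos` and `vel` with the interpolated state vectors for the current step; one common formulation of the sign test is the projection of the relative velocity onto the relative position, i.e. the range rate.

// The range rate d/dt |r_rel| has the same sign as dot(r_rel, v_rel):
// negative while the objects approach, positive once they separate.
__device__ float range_rate(float4 ra, float4 va, float4 rb, float4 vb)
{
    float rx = ra.x - rb.x, ry = ra.y - rb.y, rz = ra.z - rb.z;
    float vx = va.x - vb.x, vy = va.y - vb.y, vz = va.z - vb.z;
    return rx * vx + ry * vy + rz * vz;
}

// One thread per potential pair: flag a sign change between the previous
// and the current time step for the Find TCA kernel.
__global__ void linear_search_step(int n_potential, const int2 *pairs,
                                   const float4 *pos, const float4 *vel,
                                   float *prev_rate,  // initialized at the epoch start
                                   int *sign_change)  // per-pair flag
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n_potential) return;

    int2 p = pairs[k];
    float rate = range_rate(pos[p.x], vel[p.x], pos[p.y], vel[p.y]);

    // Negative -> positive transition: the pair passed a local closest approach.
    if (prev_rate[k] < 0.0f && rate >= 0.0f) sign_change[k] = 1;
    prev_rate[k] = rate;
}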
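And a sketch of the zero finder itself: regula falsi keeps the root bracketed between a step where the pair was still approaching and one where it was already separating. `rate_at()` stands in for the real state-vector interpolation and is a toy placeholder here.

// Placeholder: the real code interpolates both state vectors at t and
// returns dot(r_rel, v_rel). Here: a toy function with a root at t = 0.5.
__device__ float rate_at(int2 pair, float t)
{
    (void)pair;
    return t - 0.5f;
}

// Regula falsi on the range rate f(t) inside the bracketing interval
// [t0, t1] found by the Linear Search.
__device__ float find_tca(int2 pair, float t0, float t1, int max_iter)
{
    float f0 = rate_at(pair, t0);   // < 0: still approaching
    float f1 = rate_at(pair, t1);   // > 0: already separating
    float t  = t0;
    for (int i = 0; i < max_iter; ++i) {
        // Root of the secant through (t0, f0) and (t1, f1); stays in [t0, t1].
        t = t1 - f1 * (t1 - t0) / (f1 - f0);
        float f = rate_at(pair, t);
        if (fabsf(f) < 1e-6f) break;          // converged
        if (f < 0.0f) { t0 = t; f0 = f; }     // keep the root bracketed
        else          { t1 = t; f1 = f; }
    }
    return t;  // time of closest approach
}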
GPU Timeline

> Goal: minimize memory transfers and (re)allocations

[Timeline: for each epoch, the CPU runs the Apogee-Perigee Filter, loads the ephemerides, transfers ephemeris data, pairs and constants to the GPU, resizes the potential-pair allocation if necessary and retrieves the results, while the GPU executes the Smart Sieve, the Linear Search steps and Find TCA.]

Test Environment

> CPU: Intel Xeon E5620, 4 cores @ 2.4 GHz
> Memory: 6 GB
> GPU: NVIDIA Geforce GTX 580
  > 1.5 GB of memory
  > 512 CUDA cores
  > Compute Capability 2.0
  > Faster double-precision calculations on the Geforce 500 series than on the newer 600 series

Computation Time – Results I

[Chart: runtime (0–1400 s) vs. number of objects (313, 625, 1250, 2500, 5000, 10000) for the Fortran version, the C version (algorithm optimization) and the CUDA version (GPU); at 10,000 objects the GPU version reduces the runtime by 88%.]

Computation Time – Results II

[Chart: runtime breakdown (0–500 s) for 10,000 objects over 8 days, comparing Fortran, C (algorithm optimization) and CUDA: the CUDA kernels for interpolation in Smart Sieve and Linear Search, Smart Sieve, Linear Search and Find TCA, plus conjunction definition, penetration factor, Apogee-Perigee Filter, remaining operations and CUDA memory operations.]

Conclusion

> Considerable improvement of the computation time (all vs. all, 10,000 objects, 8 days)
  > Algorithm optimizations: -41%
  > Parallelization with CUDA: -88%
> Bottleneck: I/O
  > Reading ephemeris data from file/DB
  > Writing conjunctions to file/DB
> Future work
  > Parallelize other parts of the CPS: computation of conjunction risk, orbit propagation, ...
  > Recompute ephemerides instead of loading them from file/DB

What about your program?

> Can your program be divided into thousands of parallel (and equal) computations?
> Is any communication or cooperation between threads necessary?
  > Only efficient between threads of the same block (< 1024)
> Is the computational effort huge compared to the size of the data?
> Does your program use libraries like BLAS or FFT?
  > Try cuBLAS and cuFFT (see the sketch after this slide)
> Be aware: the GPU has a very flat memory hierarchy and small caches
  > 64 KB L1 cache, 0 – 768 KB L2 cache
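For the BLAS case, the switch can be as small as one call. A minimal sketch of a cuBLAS matrix multiplication (standard cuBLAS v2 API; the handle and device buffers are assumed to exist, error handling is omitted):

#include <cublas_v2.h>

// C = A * B for column-major n x n matrices already resident on the GPU
// (d_A, d_B, d_C are device pointers; link with -lcublas).
void gemm_on_gpu(cublasHandle_t handle, int n,
                 const float *d_A, const float *d_B, float *d_C)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, d_A, n,
                d_B, n,
                &beta, d_C, n);
}

The handle comes from cublasCreate(&handle); cuFFT follows the same pattern with a plan created by cufftPlan1d() and executed by cufftExecC2C().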
Questions?

> Marius Fehr [email protected]
> Vicente Navarro [email protected]
> Luis Martin [email protected]

The CUDA C Extension

> All threads execute the same piece of code, the kernel
> The kernel replaces the loop
> The index of each thread is computed from its block id, thread id and the block dimension
> Global memory can be accessed by every thread
> Make sure there are no race conditions

__global__ void add(const int *a, const int *b, int *c)
{
    // One thread per element: the thread index replaces the loop variable.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[idx] = a[idx] + b[idx];
}

int vector_addition()
{
    const int U = 1000;            // number of blocks
    const int V = 256;             // threads per block
    const int N = U * V;
    static int a[N], b[N], c[N];   // static: too large for the stack
    // fill a and b ...

    int *cuda_a, *cuda_b, *cuda_c;
    cudaMalloc((void **)&cuda_a, N * sizeof(int));
    cudaMalloc((void **)&cuda_b, N * sizeof(int));
    cudaMalloc((void **)&cuda_c, N * sizeof(int));

    cudaMemcpy(cuda_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(cuda_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<U, V>>>(cuda_a, cuda_b, cuda_c);

    cudaMemcpy(c, cuda_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(cuda_a);
    cudaFree(cuda_b);
    cudaFree(cuda_c);
    return 0;
}

CUDA GPUs

[Chart: FP32 and FP64 throughput (Gflops, 0–3500), price ($) and memory (GB, 0–7) for the Quadro 2000, Quadro 4000, Quadro 6000, Geforce GTX 580, Geforce GTX 680 and Tesla C2075.]

GPU-Enabled Fortran Compilers

> Create a GPU-enabled version of the Fortran code
> Work either with directives (like OpenMP) or through automatic analysis and optimization
> 30-day trial available

Warps and Threads

> Threads are grouped in warps: 1 warp = 32 threads
> 1 core executes 1 instruction in 1 cycle, BUT all cores in the same group execute the same instruction at the same time (Single Instruction, Multiple Threads)
> Example, Fermi architecture: 2 groups x 16 cores execute 1 warp in 1 cycle
> BUT not every core has its own Special Function Unit

Shared Memory

> Copy from global to shared memory and back
> Typically 16 KB per block
> Very low latency
> Shared memory is only visible to threads inside the same block
> CUDA provides tools for synchronization (see the sketch after the next slide)

Texture Memory

> Global memory accessed through dedicated hardware, originally used for textures
> Read cache, ~6 – 8 KB per multiprocessor, optimized for spatial locality in texture coordinates
> Serves as a read-through cache and supports multiple simultaneous reads through hardware-accelerated filtering
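A minimal sketch of the shared-memory pattern from the Shared Memory slide above: each block stages a tile in shared memory, synchronizes, and then lets all of its threads reuse it. The tile reversal is just an illustrative access pattern.

// Each block copies a 256-element tile from global to shared memory,
// synchronizes, and writes the tile back reversed.
__global__ void reverse_tiles(const float *in, float *out, int n)
{
    __shared__ float tile[256];              // blockDim.x must be 256 here

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];    // global -> shared, coalesced

    __syncthreads();                         // whole block sees the full tile

    // Each thread reads an element that a *different* thread loaded,
    // which is exactly why the barrier above is needed.
    int j = blockDim.x - 1 - threadIdx.x;
    int src = blockIdx.x * blockDim.x + j;
    if (i < n && src < n) out[i] = tile[j];
}

A matching launch would be reverse_tiles<<<(n + 255) / 256, 256>>>(d_in, d_out, n);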
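And a sketch of a read through the texture cache, using the texture-reference API that was current for Compute Capability 2.0 at the time of this project (it has since been replaced by texture objects and removed in CUDA 12); `ephem_tex` is an illustrative name.

texture<float, 1, cudaReadModeElementType> ephem_tex;  // file-scope texture reference

__global__ void read_through_texture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch(ephem_tex, i);  // fetch through the texture cache
}

// Host side: bind an existing global-memory buffer to the reference.
//   cudaBindTexture(NULL, ephem_tex, d_data, n * sizeof(float));
//   read_through_texture<<<(n + 255) / 256, 256>>>(d_out, n);
//   cudaUnbindTexture(ephem_tex);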