The Process of Parallelizing the Conjunction Prediction Algorithm of ESA’s SSA Conjunction Prediction Service using GPGPU
ESAC Trainee Project September 2012 – March 2013
Marius Fehr
Mentors: Vicente Navarro and Luis Martin
1 Space Situational Awareness
SWE – Space Weather
NEO – Near Earth Objects
SST – Space Surveillance and Tracking
2 Space Surveillance and Tracking
Catalog (JSpOC, US Air Force): 16’000 objects > 10 cm
Estimates: 600’000 objects > 1 cm
3 The Conjunction Prediction System
4 All vs. All Conjunction Analysis
[Diagram: objects 1 … 7 combined into all possible pairs [1,2] [1,3] [1,4] [1,5] … [5,6] [5,7] [6,7]]
> Number of pairs grows quadratically with the number of objects
> The analyses of all object pairs are independent → huge potential for parallelism
> 10k objects could theoretically be analyzed in millions of parallel threads
> CPUs typically run no more than about a dozen threads in parallel
> How can we exploit that?
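For instance, one common way to hand each candidate pair to its own GPU thread is to map a flat thread index back to a unique pair (i, j). This is only an illustrative sketch, not the CPS implementation; all names are made up here.

// Map a flat pair index k in [0, n*(n-1)/2) to the unique pair (i, j), i < j.
__device__ void pairFromIndex(long long k, int n, int *i, int *j)
{
    // Row r of the "upper triangle" of pairs contributes (n - 1 - r) pairs.
    int row = 0;
    while (k >= n - 1 - row) {
        k -= n - 1 - row;
        ++row;
    }
    *i = row;
    *j = row + 1 + (int)k;
}

// One thread per candidate pair; the pair analysis itself is omitted here.
__global__ void analyseAllPairs(int n, long long numPairs)
{
    long long k = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= numPairs) return;

    int i, j;
    pairFromIndex(k, n, &i, &j);
    // ... apply the conjunction filters to objects i and j ...
}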
5 GPU – NVIDIA’s Fermi Architecture
[Figures: NVIDIA Fermi GF100 shader-model and GPU block diagrams, benchmarkreviews.com]
6 CUDA – Grid, Blocks and Threads
Example: matrix multiplication of 4 x 4 matrices A and B — 4 blocks with 4 threads each, 16 threads in total (grid → blocks → threads)
> Abstraction of multiprocessors and cores
> GPU distributes blocks to idle multiprocessors
> Idle/waiting threads are swapped out instantly
> Up to 65k x 65k x 65k blocks
> Up to 1024 threads per block
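As a sketch of this mapping, a hypothetical kernel for the 4 x 4 example could compute one element of C = A x B per thread, launched as 4 blocks of 4 threads; the names and the row/column assignment are illustrative only.

#define N 4  // matrix dimension from the example

// Each thread computes one element of C; blockIdx.x selects the row,
// threadIdx.x selects the column (4 blocks x 4 threads = 16 threads).
__global__ void matMul(const float *A, const float *B, float *C)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += A[row * N + k] * B[k * N + col];
    C[row * N + col] = sum;
}

// Launch (device pointers devA, devB, devC assumed): matMul<<< N, N >>>(devA, devB, devC);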
7 CUDA – Program Flow
[Diagram: the CPU launches a kernel, which runs on the GPU as a grid of blocks of threads; input data is copied from host memory to GPU memory and results are copied back — see the vector-addition example on slide 23]
8 CAN – Conjunction Analysis
Input: list of objects (10k objects → 50M pairs) and the prediction period
Apogee-Perigee Filter: 50M → 20M pairs
Loop over epochs:
> Load ephemeris data into ephemeris memory if necessary (files/DB)
> Smart Sieve: 20M → 40k pairs
> Linear Search
> Find time and distance of closest approach
> Calculate collision risk and write conjunctions to files/DB
9 Identifying Parallelizable Functions
10’000 objects - 8 days
[Bar chart: runtime [s] per function — Apogee-Perigee Filter, Loading Ephemerides (OpenMP), Interpolating in Smart Sieve, Smart Sieve, Interpolating in Linear Search, Linear Search, Find TCA, Conjunction Definition, Penetration Factor, Remaining Operations]
10 Smart Sieve
[Diagram: one kernel; each thread takes one object pair that passed the Apogee-Perigee Filter ([1,2], [1,3], …, [6,7]) and runs it through filters 1 … N; surviving pairs increase the potential-pair counter with atomicAdd() — the critical section]
> In general: Having a large number of threads compete for a resource is expensive
> BUT: only about 0.2% of all threads actually reach the critical section
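A minimal sketch of this atomicAdd()-based compaction pattern follows; the filter chain is reduced to a placeholder predicate and all names are illustrative, not the actual CPS routines.

// Placeholder predicate standing in for filters 1..N of the Smart Sieve.
__device__ bool passesAllFilters(int2 pair)
{
    return false;  // real filters would test the pair's geometry here
}

// Each thread tests one candidate pair; survivors are appended to a compact
// output list using an atomically incremented counter (the critical section).
__global__ void smartSieve(const int2 *pairs, int numPairs,
                           int2 *potentialPairs, int *potentialCount)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= numPairs) return;

    int2 pair = pairs[k];

    if (passesAllFilters(pair)) {
        // Only ~0.2% of threads reach this point, so contention stays low.
        int slot = atomicAdd(potentialCount, 1);
        potentialPairs[slot] = pair;
    }
}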
11–13 Linear Search
[Diagram, built up over three slides: stepping through the epoch, the sign of the relative velocity stays -1, -1, …, -1 until it flips to +1, marking the time step of the closest approach]
> Kernel 1: 1 Thread = 1 object :
> Interpolating the state vectors needed for current time step
> Kernel 2: 1 Thread = 1 potential pair :
> Searching for a sign change in the relative velocity
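The following is a rough sketch of what Kernel 2 could look like. It uses the sign of the relative range rate (the dot product of relative position and velocity) as the tracked quantity — an assumption on my part — and all names (searchSignChange, prevSign, signChangeStep, …) are illustrative, not the CPS code.

// Kernel 2 of the linear search (1 thread = 1 potential pair): compare the
// sign of the relative range rate at the current step with the previous one
// and remember the step at which it flips from - to +.
__global__ void searchSignChange(const double3 *pos, const double3 *vel,
                                 const int2 *pairs, int numPairs,
                                 int step, int *prevSign, int *signChangeStep)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= numPairs) return;

    int2 p = pairs[k];
    double3 dr = make_double3(pos[p.y].x - pos[p.x].x,
                              pos[p.y].y - pos[p.x].y,
                              pos[p.y].z - pos[p.x].z);
    double3 dv = make_double3(vel[p.y].x - vel[p.x].x,
                              vel[p.y].y - vel[p.x].y,
                              vel[p.y].z - vel[p.x].z);

    // d/dt of the separation distance has the same sign as dr . dv
    double rdot = dr.x * dv.x + dr.y * dv.y + dr.z * dv.z;
    int sign = rdot < 0.0 ? -1 : +1;

    if (prevSign[k] < 0 && sign > 0)   // - to + : closest approach passed
        signChangeStep[k] = step;
    prevSign[k] = sign;
}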
14 Find TCA
[Diagram: one kernel; for each potential pair, check whether the linear search found a sign change and, if so, hand the bracketing interval to a zero finder]
> Kernel: 1 Thread = 1 potential pair :
> Check if Linear Search found a time step with a sign change
> Start zero finder (Regula Falsi)
> Requires state vector interpolation for every intermediate step (illustration: http://commons.wikimedia.org/wiki/File:Regula_falsi.gif)
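A generic regula falsi iteration, as it might be used inside the Find TCA kernel, is sketched below. The rangeRate() function is a stand-in for “interpolate both states at time t and evaluate the range rate”; here it is just a dummy, and the function names are assumptions, not the actual CPS routines.

// Placeholder: interpolate both objects' states at time t and return the
// relative range rate (d/dt of the separation distance).
__device__ double rangeRate(double t)
{
    return t - 1.0;  // dummy stand-in; real code interpolates the ephemerides
}

// Regula falsi (false position) iteration on [tLo, tHi], which bracket the
// sign change found by the linear search.
__device__ double findTca(double tLo, double tHi, int maxIter, double tol)
{
    double fLo = rangeRate(tLo);
    double fHi = rangeRate(tHi);
    double t = tLo;
    for (int i = 0; i < maxIter; ++i) {
        t = tHi - fHi * (tHi - tLo) / (fHi - fLo);   // false-position step
        double ft = rangeRate(t);
        if (fabs(ft) < tol)
            break;
        if (ft * fLo < 0.0) { tHi = t; fHi = ft; }   // root is in [tLo, t]
        else                { tLo = t; fLo = ft; }   // root is in [t, tHi]
    }
    return t;
}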
15 GPU Timeline
> Goal: Minimize memory transfers and (re)allocations
[Timeline over two epochs: the CPU runs the Apogee-Perigee Filter, loads ephemerides, transfers ephemeris, pairs and constants, resizes the potential-pair allocation if necessary and retrieves the results of each epoch, while the GPU runs the Smart Sieve, Linear Search and Find TCA kernels]
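The “resize only if necessary” part of the timeline can be sketched as a persistent device buffer that is grown only when a new epoch needs more space, avoiding a cudaMalloc per epoch; the names below are illustrative.

#include <cuda_runtime.h>
#include <stddef.h>

// Device buffer reused across epochs; grown only when the current epoch
// needs more space than was allocated before.
static void  *d_potPairs = NULL;
static size_t capacityBytes = 0;

void ensurePotPairCapacity(size_t requiredBytes)
{
    if (requiredBytes <= capacityBytes)
        return;                          // reuse the existing allocation
    if (d_potPairs)
        cudaFree(d_potPairs);
    cudaMalloc(&d_potPairs, requiredBytes);
    capacityBytes = requiredBytes;
}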
16 Test Environment
> CPU: Intel Xeon E5620, 4 cores @ 2.4 GHz
> Memory: 6 GB
> GPU: NVIDIA GeForce GTX 580
> 1.5 GB of Memory
> 512 CUDA Cores
> Compute Capability 2.0
> Faster double-precision calculations on the GeForce 500 series than on the newer 600 series
17 Computation Time - Results I
[Chart: runtime [s] vs. number of objects (313 to 10000) for the Fortran, C, OpenMP, algorithm-optimized and CUDA versions; the GPU version cuts the runtime by about 88 %]
18 Computation Time - Results II
10’000 objects - 8 days
[Bar chart: runtime [s] per function for the Fortran, C and CUDA versions — CUDA kernels: Interpolating in Smart Sieve, Interpolating in Linear Search, Smart Sieve, Find TCA, Linear Search; algorithm optimizations: Find TCA, Conjunction Definition, Penetration Factor; plus Apogee-Perigee Filter, Remaining Operations and CUDA memory operations]
19 Conclusion
> Considerable improvement of the all vs. all computation time for 8 days, 10’000 objects: algorithm optimization -41 %, CUDA -88 %
> Parallelization with CUDA
> Other optimizations
> Bottleneck: I/O
> Reading ephemeris data from file/DB
> Writing conjunctions to file/DB
> Future work
> Parallelize other parts of the CPS
> Computation of conjunction risk, orbit propagation, ...
> Recompute ephemerides instead of loading them from file/DB
20 What about your program?
> Can your program be divided into thousands of parallel (and equal) computations?
> Is any communication or cooperation necessary between threads?
> Only efficient between threads of the same block (< 1024)
> Is the computational effort huge compared to the size of the data?
> Does your program use libraries like BLAS or FFT?
> Try cuBLAS and cuFFT (see the sketch below)
> Be aware: the GPU has a very flat memory hierarchy / small caches
> 64 KB L1 cache, 0 – 768 KB L2 cache
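As a minimal illustration of the cuBLAS suggestion, a single-precision GEMM call can replace a hand-written matrix-multiply loop; the wrapper name and the assumption that the matrices are already in device memory are mine, and error checking is omitted.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Minimal single-precision GEMM via cuBLAS: C = alpha * A * B + beta * C,
// with column-major n x n matrices already resident in device memory.
void gemmOnGpu(const float *dA, const float *dB, float *dC, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f;
    const float beta  = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cublasDestroy(handle);
}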
21 Questions ?
> Marius Fehr [email protected]
> Vicente Navarro [email protected]
> Luis Martin [email protected]
22 The CUDA C Extension
> All threads execute the same piece of code, the kernel
> The kernel replaces the loop
> The index of each thread is computed from the block id, thread id and block dimension
> Global memory can be accessed by every thread
> Make sure there are no race conditions

__global__ void add( int *a, int *b, int *c )
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[idx] = a[idx] + b[idx];
}

int vector_addition()
{
    const int U = 1000;              // number of blocks
    const int V = 256;               // threads per block
    const int N = U * V;
    int a[N], b[N], c[N];
    // fill a and b

    int *cuda_a, *cuda_b, *cuda_c;
    cudaMalloc( (void**) &cuda_a, N * sizeof(int) );
    cudaMalloc( (void**) &cuda_b, N * sizeof(int) );
    cudaMalloc( (void**) &cuda_c, N * sizeof(int) );

    cudaMemcpy( cuda_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( cuda_b, b, N * sizeof(int), cudaMemcpyHostToDevice );

    add<<< U, V >>>( cuda_a, cuda_b, cuda_c );

    cudaMemcpy( c, cuda_c, N * sizeof(int), cudaMemcpyDeviceToHost );

    cudaFree( cuda_a ); cudaFree( cuda_b ); cudaFree( cuda_c );
    return 0;
}
23 CUDA GPUs
[Chart: FP32 and FP64 performance [Gflops], price [$] and memory [GB] for Quadro 2000, Quadro 4000, Quadro 6000, Geforce GTX 580, Geforce GTX 680 and Tesla C2075]
24
> Creates a GPU-enabled version of the Fortran code
> Works either using directives (like OpenMP) or automatic analysis and optimization
> 30-day trial available
Warps and Threads
> Threads are grouped in warps: 1 warp = 32 threads
> 1 core: 1 instruction in 1 cycle
> BUT all cores in the same group execute the same instruction at the same time: Single Instruction Multiple Thread (SIMT)
> Example: Fermi architecture, 2 groups x 16 cores: 1 warp in 1 cycle
> BUT not every core has its own Special Function Unit
Shared Memory
[Diagram: blocks copy data from global GPU memory into their shared memory and back]
● Copy from global to shared memory and back
● Typically 16 KB per block
● Very low latency
● Shared memory is only visible for threads inside the block
● CUDA provides tools for synchronization
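A small illustrative kernel for this pattern: stage a block’s slice of the input from global into shared memory, synchronize, then let every thread work on the low-latency copy. The kernel name and the 3-point average are placeholders for real work.

#define BLOCK_SIZE 256

// Each block stages its slice of the input into shared memory, synchronizes,
// and then every thread reads its neighbours from the staged copy.
__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK_SIZE];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x;

    tile[lid] = (gid < n) ? in[gid] : 0.0f;   // global -> shared
    __syncthreads();                          // whole block sees the tile

    if (gid < n) {
        float left  = (lid > 0)              ? tile[lid - 1] : tile[lid];
        float right = (lid < BLOCK_SIZE - 1) ? tile[lid + 1] : tile[lid];
        out[gid] = (left + tile[lid] + right) / 3.0f;   // shared -> global
    }
}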
Texture Memory
[Diagram: threads read global GPU memory through dedicated texture hardware]
● Global memory accessed by dedicated hardware, used for textures
● Read cache; ~6 – 8 KB per multiprocessor, optimized for spatial locality in texture coordinates
● Serves as a read-through cache and supports multiple simultaneous reads through hardware-accelerated filtering