The Process of Parallelizing the Conjunction Prediction Algorithm of ESA’s SSA Conjunction Prediction Service using GPGPU

ESAC Trainee Project September 2012 – March 2013

Marius Fehr

Mentors: Vicente Navarro and Luis Martin

1 Space Situational Awareness

SWE – Space Weather

NEO – Near Earth Objects

SST – Space Surveillance and Tracking

2 Space Surveillance and Tracking

Catalog (JSpOC, US Air Force): 16'000 objects > 10 cm

Estimates: 600'000 objects > 1 cm

3 The Conjunction Prediction System

4 All vs. All Conjunction Analysis

[Diagram: objects 1 to 7 expanded into all unique pairs [1,2] [1,3] [1,4] [1,5] ... [5,6] [5,7] [6,7]]

> Number of pairs grows quadratically with the number of objects

> The analyses of all object pairs are independent → huge potential for parallelism

> 10k objects could theoretically be analyzed in millions of parallel threads

> CPUs usually run no more than about a dozen threads in parallel

> How can we exploit that?
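For scale, a quick worked count behind the bullets above: n objects yield n(n-1)/2 unique pairs, so 10'000 objects give 10'000 x 9'999 / 2 = 49'995'000, which is the ~50M pairs quoted on the Conjunction Analysis slide.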

5 GPU – NVIDIA's Fermi Architecture

[Figure: NVIDIA Fermi GF100 GPU block diagram. Sources:
http://benchmarkreviews.com/images/reviews/processor/NVIDIA_Fermi/-fermi-gf100-gpu--model-block-diagram-full.png
http://benchmarkreviews.com/images/reviews/processor/NVIDIA_Fermi/nvidia-fermi-gf100-gpu-block-diagram-benchmarkreviews.png]

6 CUDA – Grid, Blocks and Threads

Example: matrix multiplication of two 4 x 4 matrices A and B: 4 blocks with 4 threads each, 16 threads in total (see the sketch after this list).

[Diagram: Grid → Blocks → Threads]

> Abstraction of multiprocessors and cores

> GPU distributes blocks to idle multiprocessors

> Idle/waiting threads are swapped out instantly

> Up to 65k x 65k x 65k blocks

> Up to 1024 threads per block
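A minimal sketch of how the 4 x 4 example above could map one thread to one output element; the kernel name and row/column layout are illustrative, not the CPS code:

    // One thread computes one element of C = A * B (row-major, n x n).
    __global__ void matmul4(const float *A, const float *B, float *C, int n)
    {
        // blockIdx.x selects the row (one block per row),
        // threadIdx.x selects the column (one thread per column).
        int row = blockIdx.x;
        int col = threadIdx.x;
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }

    // Launch: 4 blocks with 4 threads each, 16 threads in total.
    // matmul4<<< 4, 4 >>>(dev_A, dev_B, dev_C, 4);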

7 CUDA – Program Flow

[Diagram: CUDA program flow. The CPU copies input data from host memory to GPU memory, launches a kernel that executes as a grid of blocks of threads on the GPU, and copies the result back from GPU memory; see the vector-addition example on slide 22.]

8 CAN – Conjunction Analysis

Inputs: list of objects (10k objects, 50M pairs) and the prediction period

Apogee-Perigee Filter: 50M → 20M pairs

Loop over epochs:
> Load ephemeris data into memory if necessary (files/DB)
> Smart Sieve: 20M → 40k pairs
> Linear Search
> Find time and distance of closest approach
> Calculate collision risk and write conjunctions to files/DB

(A code skeleton of this flow follows below.)
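Put as code, the flow above could be skeletonized as follows; every type and function here is a placeholder stub, not an actual CPS routine:

    /* Skeleton of the per-epoch pipeline (illustrative stubs only). */
    typedef struct { int a, b; } Pair;

    static int  apogee_perigee_filter(Pair *p, int n) { (void)p; return n; }
    static void load_ephemerides(int epoch)           { (void)epoch; }
    static int  smart_sieve(Pair *p, int n)           { (void)p; return n; }
    static int  linear_search(Pair *p, int n)         { (void)p; return n; }
    static void find_tca_write_conjunctions(Pair *p, int n) { (void)p; (void)n; }

    void conjunction_analysis(Pair *pairs, int n_pairs, int n_epochs)
    {
        /* Apogee-Perigee Filter: 50M -> 20M pairs, done once. */
        int n = apogee_perigee_filter(pairs, n_pairs);

        for (int epoch = 0; epoch < n_epochs; ++epoch) {
            load_ephemerides(epoch);             /* from files/DB, if needed */
            int n_cand = smart_sieve(pairs, n);  /* 20M -> 40k per epoch     */
            int n_hits = linear_search(pairs, n_cand);
            find_tca_write_conjunctions(pairs, n_hits);  /* to files/DB      */
        }
    }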

9 Identifying Parallelizable Functions

10'000 Objects - 8 Days

[Chart: runtime per function for the original code. Profiled functions: Apogee-Perigee Filter, Loading Ephemerides (OpenMP), Interpolating in Smart Sieve, Smart Sieve, Interpolating in Linear Search, Linear Search, Find TCA, Conjunction Definition, Penetration Factor, Remaining Operations. Axis: Runtime [s], 0–900.]

10 Smart Sieve

[Diagram: one kernel; each thread takes one object pair that passed the Apogee-Perigee Filter ([1,2] [1,3] [1,4] [1,5] [1,6] [1,7] [2,3] [2,4] ... [5,6] [5,7] [6,7]) and runs it through a cascade of filters (Filter 1, Filter 2, Filter 3, ..., Filter N). Each surviving pair increases the potential-pair counter with atomicAdd() (critical section).]

> In general: Having a large number of threads compete for a resource is expensive

> BUT: only about 0.2% of all threads actually reach the critical section (sketched below)
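A minimal sketch of this filter-and-count pattern, with a dummy predicate standing in for the real filter cascade; types and names are illustrative, not the actual Smart Sieve code:

    struct Pair { int i, j; };

    // Placeholder for the sieve's geometric filter cascade.
    __device__ bool passes_all_filters(Pair p)
    {
        return ((p.i ^ p.j) & 511) == 0;   // dummy predicate, ~0.2% pass rate
    }

    // Each thread tests one candidate pair; survivors are appended to a
    // compact output list. atomicAdd serializes only the few threads that
    // pass all filters, so contention stays low.
    __global__ void smart_sieve(const Pair *pairs, int n_pairs,
                                Pair *survivors, int *n_survivors)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_pairs) return;

        if (passes_all_filters(pairs[i])) {
            int slot = atomicAdd(n_survivors, 1);   // critical section
            survivors[slot] = pairs[i];
        }
    }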

11 Linear Search

[Diagram, repeated over three animation slides (11–13): stepping through the interpolation steps of an epoch, the sign of a pair's relative velocity stays -1 for consecutive time steps until it flips to +1, marking the step that brackets the closest approach. Labels: Epoch, Interpolation, Search for Sign Change.]

> Kernel 1: 1 thread = 1 object:
> Interpolating the state vectors needed for the current time step

> Kernel 2: 1 thread = 1 potential pair:
> Searching for a sign change in the relative velocity (sketched below)
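A hedged sketch of Kernel 2, the per-pair sign-change test; the array layout and names are assumptions for illustration:

    // One thread per potential pair: compare the sign of the pair's
    // relative velocity at the previous and current time step. A flip
    // from - to + brackets a minimum of the separation distance.
    __global__ void sign_change_search(const double *rdot_prev,
                                       const double *rdot_curr,
                                       int n_pairs, int *flagged)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_pairs) return;
        flagged[i] = (rdot_prev[i] < 0.0 && rdot_curr[i] >= 0.0);
    }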

14 Find TCA

[Diagram: one kernel; each potential pair is checked for a sign change and, if one was found, handed to a zero finder.]

> Kernel: 1 thread = 1 potential pair:
> Check if Linear Search found a time step with a sign change
> Start zero finder (Regula Falsi, sketched below)
> Requires state vector interpolation for every intermediate step

[Figure: Regula Falsi iteration, http://commons.wikimedia.org/wiki/File:Regula_falsi.gif]
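For reference, a textbook regula falsi iteration in this setting might look like the sketch below; PairState and rdot_at() are hypothetical placeholders for the pair's data and the state vector interpolation, not CPS routines:

    struct PairState;                               // opaque placeholder
    __device__ double rdot_at(const PairState *s, double t);

    // Regula falsi: given times a < b with f(a) < 0 < f(b), repeatedly
    // replace one endpoint with the secant-line root until converged.
    __device__ double find_tca(const PairState *s, double a, double b,
                               double tol)
    {
        double fa = rdot_at(s, a);                  // f(a) < 0
        double fb = rdot_at(s, b);                  // f(b) > 0
        double c  = a;
        for (int iter = 0; iter < 100; ++iter) {
            c = (a * fb - b * fa) / (fb - fa);      // secant-line root
            double fc = rdot_at(s, c);              // one interpolation
            if (fabs(fc) < tol) break;              // converged
            if (fc < 0.0) { a = c; fa = fc; }       // keep the bracket
            else          { b = c; fb = fc; }
        }
        return c;                                   // TCA estimate
    }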

15 GPU Timeline

> Goal: Minimize memory transfers and (re)allocations (see the sketch below)

[Timeline diagram: per epoch, the GPU executes Smart Sieve, several Linear Search (LS) passes and Find TCA back-to-back. In parallel, the CPU runs the Apogee-Perigee Filter, loads ephemerides, pairs and constants, resizes the potential-pairs allocation if necessary, and retrieves the results of each epoch (Epoch 1, Epoch 2, ...).]
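The "resize only if necessary" step suggests a grow-only buffer; a minimal sketch of that pattern, not the CPS code:

    #include <cuda_runtime.h>
    #include <stddef.h>

    // Grow-only device buffer: reallocate only when an epoch needs more
    // capacity than any previous one, so most epochs reuse the allocation.
    static void  *d_pairs    = NULL;
    static size_t d_capacity = 0;      // current capacity in bytes

    void ensure_capacity(size_t bytes)
    {
        if (bytes <= d_capacity) return;   // fast path: reuse the buffer
        cudaFree(d_pairs);                 // cudaFree(NULL) is a no-op
        cudaMalloc(&d_pairs, bytes);
        d_capacity = bytes;
    }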

16 Test Environment

> CPU: Intel Xeon E5620, 4 cores @ 2.4 GHz
> Memory: 6 GB
> GPU: NVIDIA Geforce GTX 580

> 1.5 GB of Memory

> 512 CUDA Cores

> Compute Capability 2.0

> Double precision is faster on the Geforce 500 series than on the newer 600 series

17 Computation Time - Results I

[Chart: runtime vs. number of objects (313, 625, 1250, 2500, 5000, 10000) for the Fortran version, the C version after algorithm optimization (with OpenMP), and the CUDA/GPU version. Y-axis: Runtime [s], 0–1400. For 10'000 objects the GPU version cuts the runtime by 88%.]

18 Computation Time - Results II

10'000 Objects - 8 Days

[Chart: runtime breakdown of the Fortran, C and CUDA versions by component: the CUDA kernels (Interpolating in Smart Sieve, Interpolating in Linear Search, Smart Sieve, Find TCA, Linear Search), the algorithm optimizations (Find TCA, Conjunction Definition, Penetration Factor), the Apogee-Perigee Filter, Remaining Operations, and CUDA memory operations. Axis: Runtime [s], 0–500.]

19 Conclusion

> Considerable improvement of the all vs. all computation time (10'000 objects, 8 days)

> Algorithm optimization: -41%
> Parallelization with CUDA: -88%

> Bottleneck: I/O

> Reading ephemeris data from file/DB
> Writing conjunctions to file/DB

> Future Work

> Parallelize other parts of the CPS: computation of conjunction risk, orbit propagation, ...
> Recompute ephemerides instead of loading them from file/DB

20 What about your program?

> Can your program be divided into thousands of parallel (and equal) computations?

> Is any communication or cooperation between threads necessary?
> Only efficient between threads of the same block (< 1024)

> Is the computational effort large compared to the size of the data?

> Does your program use libraries like BLAS or FFT?
> Try cuBLAS and cuFFT (see the sketch after this list)

> Be aware: the GPU has a very flat memory hierarchy and small caches
> 64 KB L1 cache, 0–768 KB L2 cache
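For the library route, a minimal cuBLAS matrix multiply looks roughly like this; error handling is omitted and the matrices are assumed to be column-major and already on the device:

    #include <cublas_v2.h>

    // C = alpha * A * B + beta * C, all matrices n x n, column-major,
    // d_A/d_B/d_C already allocated and filled on the GPU.
    void gemm_example(const float *d_A, const float *d_B, float *d_C, int n)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);

        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);

        cublasDestroy(handle);
    }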

21 Questions ?

> Marius Fehr [email protected]

> Vicente Navarro [email protected]

> Luis Martin [email protected]

22 The CUDA C Extension

> All threads execute the same piece of code, the kernel
> Kernel replaces loop
> Index of each thread is computed from block and thread id and block dimension
> Global memory can be accessed by every thread
> Make sure there are no race conditions

    __global__ void add( int *a, int *b, int *c )
    {
        // Global index of this thread, from block/thread id and block dimension.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        c[ idx ] = a[ idx ] + b[ idx ];
    }

    int vector_addition( )
    {
        int U = 1000;                 // blocks
        int V = 256;                  // threads per block
        int N = U * V;
        int a[N], b[N], c[N];
        // fill a and b

        int *cuda_a, *cuda_b, *cuda_c;
        cudaMalloc( (void**) &cuda_a, N * sizeof(int) );
        cudaMalloc( (void**) &cuda_b, N * sizeof(int) );
        cudaMalloc( (void**) &cuda_c, N * sizeof(int) );

        cudaMemcpy( cuda_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
        cudaMemcpy( cuda_b, b, N * sizeof(int), cudaMemcpyHostToDevice );

        add<<< U, V >>>( cuda_a, cuda_b, cuda_c );

        cudaMemcpy( c, cuda_c, N * sizeof(int), cudaMemcpyDeviceToHost );

        cudaFree( cuda_a );
        cudaFree( cuda_b );
        cudaFree( cuda_c );
        return 0;
    }

23 CUDA GPUs

[Chart: comparison of Quadro 4000, Quadro 6000, Geforce GTX 580, Geforce GTX 680 and Tesla C2075 by FP32 [Gflops], FP64 [Gflops], price [$] and memory [GB].]

24

> Creates a GPU-enabled version of the Fortran code
> Works either using directives (like OpenMP) or automatic analysis and optimization
> 30-day trial available

Warps and Threads

Threads are grouped in warps: 1 warp = 32 threads

1 core: 1 instruction in 1 cycle

BUT all cores in the same group (Group 1, Group 2 in the diagram) execute the same instruction at the same time: Single Instruction Multiple Thread (SIMT, sketched below)

Example: Fermi architecture, 2 groups x 16 cores: 1 warp in 1 cycle

BUT not every core has its own Special Function Unit
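One practical consequence of SIMT, shown in a minimal illustrative kernel: branches that diverge within a warp are serialized, with inactive lanes masked off, so a divergent warp pays roughly for both paths:

    __global__ void divergent(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)            // even/odd lanes diverge within a warp
            x[i] = x[i] * 2.0f;
        else
            x[i] = x[i] + 1.0f;
    }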

Shared Memory

[Diagram: blocks of the grid copy data from global GPU memory into their own shared memory and back.]

● Copy from global to shared memory and back
● Typically 16 KB per block
● Very low latency
● Shared memory is only visible to threads inside the block
● CUDA provides tools for synchronization (see the reduction sketch below)
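A typical use is sketched below: a generic block-level sum reduction that stages data in shared memory and synchronizes with __syncthreads(). This is an illustrative textbook example, not CPS code, and assumes a launch with 256 threads per block:

    // Each block stages its slice of the input in shared memory, then
    // halves the number of active threads each step.
    __global__ void block_sum(const float *in, float *block_out, int n)
    {
        __shared__ float buf[256];               // per-block scratch space
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        buf[tid] = (i < n) ? in[i] : 0.0f;       // global -> shared
        __syncthreads();                         // barrier for the block

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) buf[tid] += buf[tid + s];
            __syncthreads();
        }
        if (tid == 0) block_out[blockIdx.x] = buf[0];   // shared -> global
    }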

Texture Memory

[Diagram: texture reads pass from global GPU memory through dedicated hardware to the blocks and threads of the grid.]

● Global memory accessed by dedicated hardware, used for textures
● Read cache; ~6–8 KB per multiprocessor, optimized for spatial locality in texture coordinates
● Serves as a read-through cache and supports multiple simultaneous reads through hardware-accelerated filtering