
HIGH PERFORMANCE COMPUTING OF FIBER SCATTERING SIMULATION

Leiming Yu, Yan Zhang, Xiang Gong, Nilay Roy, Lee Makowski and David Kaeli
Northeastern University, Boston, MA

GPGPU-8, February 2015, San Francisco, CA

Topics

• Motivation to accelerate Fiber Scattering Simulation
• Proposed Solution: GPU
• Simulation Algorithm
• Optimization on a single GPU
• Optimization on a cluster
• Conclusion and Future Work
• Q&A

Motivation

 Cellulose is a rich bio-energy source
 Previous studies focus on fiber orientation rather than twisting or coiling, which requires simulating a high atom volume
 Fiber scattering simulation takes (too much) time:

    Atom Volume        Elapsed Time [Intel i7-920]
    135 (small)        2 seconds
    10092 (medium)     1 hour
    268992 (large)     >1 week

 Acceleration approach
    Parallelize the algorithm
    Parallelize the implementation on a GPU

Proposed Solution: GPU

 Thousands of cores compute in SIMT fashion
 High FLOPS
 Fast memory
 Performance-oriented programming features
 Special instructions

[NVIDIA Kepler Memory Hierarchy]

 Kepler programming features: Dynamic Parallelism, Hyper-Q, Unified Memory

GPU Cluster

 32 compute nodes, max 24 nodes/user
 Per node: one NVIDIA Tesla K20m
    2496 CUDA cores
    5 GB GDDR5 on-board memory
 Dual Intel Xeon CPUs (E5-2650 @ 2 GHz)
    32 logical cores
    128 GB DRAM
 10 GB/s backplane interconnect
 Software
    IBM MPI v8.1
    CUDA 6.5
    Platform Load Sharing Facility (LSF) scheduler

Simulation Algorithm

[Flowchart: for each scattering interval q, nested loops over the atom list (I from the 1st to the (L-2)th atom, J = I+1 to L-1) load each atom pair and its form factors (f_C, f_H, f_O, f_N), then accumulate the equatorial and meridional intensities until I = L-2.]
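For reference, the loop structure in the flowchart can be sketched serially as below. This is a minimal illustration, not the authors' code: the slides do not give the intensity formulas, so a Debye-style pair sum and a stub formFactor() stand in for the real equatorial/meridional computation.

    // Minimal serial sketch of the flowchart's O(L^2) pair loop.
    // Assumed for illustration: I(q) = sum_{i<j} f_i f_j sin(q*r_ij)/(q*r_ij).
    #include <cmath>
    #include <vector>

    struct Atom { float x, y, z; int type; };          // type selects f_C, f_H, f_O, f_N

    // Hypothetical form-factor lookup; real code tabulates per-type coefficients.
    static float formFactor(int /*type*/, float /*q*/) { return 1.0f; }

    void simulate(const std::vector<Atom>& a, const std::vector<float>& qs,
                  std::vector<float>& I)
    {
        const size_t L = a.size();
        for (size_t k = 0; k < qs.size(); ++k) {       // one scattering interval q
            const float q = qs[k];
            float sum = 0.0f;
            for (size_t i = 0; i + 2 <= L; ++i) {      // I = 1st .. (L-2)th atom
                for (size_t j = i + 1; j < L; ++j) {   // J = I+1 .. L-1
                    const float dx = a[i].x - a[j].x;
                    const float dy = a[i].y - a[j].y;
                    const float dz = a[i].z - a[j].z;
                    const float qr = q * std::sqrt(dx*dx + dy*dy + dz*dz);
                    const float ff = formFactor(a[i].type, q) * formFactor(a[j].type, q);
                    sum += (qr > 0.0f) ? ff * std::sin(qr) / qr : ff;
                }
            }
            I[k] = sum;                                // equatorial/meridional split omitted
        }
    }

The quadratic pair loop is the cost driver: L = 268,992 atoms means roughly 3.6 x 10^10 pairs per q interval, which is why the large case runs for over a week on the CPU.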

Optimizing on a single GPU

 V0 (baseline): global memory only, 7x-37x speedup over the CPU
 Loading atom coefficients, types, and coordinates is expensive: an O(n²) pair loop
 V1 = V0 + constant memory + float4 + page-locked memory
    1.06x speedup, works for the small case

 The 64 KB constant memory limit hurts the medium and large cases
 V1 = V1 + texture memory: same speedup, works for all cases
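A hedged sketch of what the V1 data layout could look like (illustrative names, not the authors' code): coordinates packed as float4 in page-locked host memory, per-type coefficients in the 64 KB __constant__ space, and read-only loads routed through the texture cache via __ldg(), the sm_35-era route to texture-path reads on the K20m. The single coefficient per atom type is an assumed simplification.

    // Hedged V1 sketch: constant memory + float4 + page-locked buffers, with
    // __ldg() read-only loads standing in for texture fetches.
    // Compile with: nvcc -arch=sm_35 v1.cu
    #include <cuda_runtime.h>

    __constant__ float c_coeff[4];                      // one value per atom type (assumed)

    __global__ void pairKernelV1(const float4* __restrict__ atoms, int L,
                                 float q, float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per outer atom I
        if (i >= L - 1) return;
        float4 ai = __ldg(&atoms[i]);                   // read-only (texture cache) path
        float fi = c_coeff[(int)ai.w];                  // atom type stashed in .w
        float sum = 0.0f;
        for (int j = i + 1; j < L; ++j) {
            float4 aj = __ldg(&atoms[j]);
            float dx = ai.x - aj.x, dy = ai.y - aj.y, dz = ai.z - aj.z;
            float qr = q * sqrtf(dx*dx + dy*dy + dz*dz);
            float ff = fi * c_coeff[(int)aj.w];
            sum += (qr > 0.0f) ? ff * sinf(qr) / qr : ff;
        }
        atomicAdd(out, sum);                            // accumulate this q interval
    }

    int main()
    {
        const int L = 135;                              // the "small" case
        float h_coeff[4] = {6.0f, 1.0f, 8.0f, 7.0f};    // placeholder C,H,O,N values
        cudaMemcpyToSymbol(c_coeff, h_coeff, sizeof(h_coeff));

        float4* h_atoms; cudaMallocHost((void**)&h_atoms, L * sizeof(float4));  // page-locked
        for (int i = 0; i < L; ++i)
            h_atoms[i] = make_float4(0.1f * i, 0.0f, 0.0f, (float)(i % 4));

        float4* d_atoms; cudaMalloc((void**)&d_atoms, L * sizeof(float4));
        cudaMemcpy(d_atoms, h_atoms, L * sizeof(float4), cudaMemcpyHostToDevice);
        float* d_out; cudaMalloc((void**)&d_out, sizeof(float));
        cudaMemset(d_out, 0, sizeof(float));

        pairKernelV1<<<(L + 255) / 256, 256>>>(d_atoms, L, 0.5f, d_out);
        cudaDeviceSynchronize();
        return 0;
    }

Page-locked host memory matters here because it enables fast (and, with streams, asynchronous) host-to-device copies of the atom list.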

Optimizing on a single GPU

 Kernel launch overheads
    V2 = V1 + loop inside the kernel: 1.08x speedup

Load Atom J  Compute-bound kernel Equatorial

Intensity + 1

I 1.75x for small case I =I Meridional J J = + 1 V3 = V2 + fast_math Intensity 1.54x for medium case No J=L-1 ?

Yes No  Can we do better ? I=L-2 ? Yes END NVIDIA Visual Profiler 15% occupancy Optimization on a single GPU

Optimization on a single GPU

 How to assign workloads using Hyper-Q?
 Chunking scheme: divide the workloads into equal-sized chunks, unaware of their volumes (sketched after this slide)

[Figure: workload chunks dispatched across the K20m's 13 SMXs by concurrent streams]

 Each stream handles (L-1)/K outer iterations, with K CUDA streams and atom list length L
 Best case: 10.5x speedup over 1 CUDA stream
 Performance plateaus after 32 CUDA streams
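A hedged sketch of the chunking scheme (helper and kernel names are illustrative): the L-1 outer iterations are cut into K equal chunks, one CUDA stream per chunk, and Hyper-Q lets the chunks execute concurrently across the SMXs.

    // Hedged chunking sketch: (L-1)/K outer rows per stream, K streams.
    #include <algorithm>
    #include <cuda_runtime.h>

    __global__ void pairChunk(const float4* atoms, int L, int rowBegin,
                              int rowEnd, float q, float* out)
    {
        int i = rowBegin + blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= rowEnd) return;
        // ... per-row pair loop as in the earlier sketches ...
    }

    void launchChunked(const float4* d_atoms, int L, float q, float* d_out, int K)
    {
        cudaStream_t* streams = new cudaStream_t[K];
        int chunk = (L - 1 + K - 1) / K;                // ceil((L-1)/K) rows per stream
        for (int s = 0; s < K; ++s) {
            cudaStreamCreate(&streams[s]);
            int begin = s * chunk;
            int end   = std::min(begin + chunk, L - 1);
            if (end > begin)
                pairChunk<<<(end - begin + 255) / 256, 256, 0, streams[s]>>>(
                    d_atoms, L, begin, end, q, d_out);
        }
        cudaDeviceSynchronize();                        // plateau observed after K = 32
        for (int s = 0; s < K; ++s) cudaStreamDestroy(streams[s]);
        delete[] streams;
    }

Note the imbalance this scheme ignores: outer row i contributes L-1-i pairs, so the first chunk carries far more work than the last. That is what the greedy scheme on the next slide addresses.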

Optimization on a single GPU

 How about dividing the workloads evenly? Greedy Partitioning Scheme (a code sketch follows):

    1: Initialize K groups
    2: Sort workloads W in non-increasing order
    3: for i in W do
    4:     check the workload of each group G_j, j = 0..K-1
    5:     add i to the least-loaded group, min(G_j)
    6: end for

 60% speedup on average versus Chunking
 Performance drops 30% as we scale from 12 to 14 streams
 Worse than Chunking after 12 streams
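The pseudocode above is the classic longest-processing-time greedy; a compact sketch is below (the workload bookkeeping is illustrative). Here work[i] would be the pair count of outer row i, i.e. L-1-i.

    // Hedged sketch of the greedy partitioning scheme: sort workloads in
    // non-increasing order, then always assign to the least-loaded group.
    #include <algorithm>
    #include <vector>

    std::vector<std::vector<int>> greedyPartition(const std::vector<long>& work, int K)
    {
        std::vector<int> order(work.size());
        for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return work[a] > work[b]; });  // non-increasing

        std::vector<std::vector<int>> groups(K);
        std::vector<long> load(K, 0);
        for (int i : order) {
            int j = (int)(std::min_element(load.begin(), load.end()) - load.begin());
            groups[j].push_back(i);                     // add i to min(G_j)
            load[j] += work[i];
        }
        return groups;                                  // groups[s] feeds CUDA stream s
    }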

Optimization on a single GPU

 Reduce resource requirements to allow more concurrent streams
    Registers: leveraged to load the scattering coefficients
       NVIDIA CUDA compiler option -maxrregcount (sketched after this slide)
       Reduce registers per thread from 31 to 21
    Shared memory: not the bottleneck; allows the maximum of 16 blocks per SMX
    Threads per block: max threads per SMX / block size = 2048 / 256 = 8 blocks
 Low occupancy, high throughput? V. Volkov's proposed solution
    Works for memory-bound kernels, where threads stall on dependencies rather than on memory accesses
    Do more parallel work per thread, at the cost of excessive register usage
    Here, assigning twice the computation to each thread doubled execution time, despite shrinking the threads per block from 256 to 128
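A hedged sketch of the register cap: nvcc's -maxrregcount flag applies per compilation unit, while the __launch_bounds__ qualifier expresses the same budget per kernel; 21 registers per thread is the figure the slide reports, and everything else here is illustrative.

    // Capping registers so more blocks (and thus streams) fit per SMX.
    // Whole-file flag:  nvcc -arch=sm_35 -maxrregcount=21 kernel.cu
    // Per-kernel hint:  __launch_bounds__(maxThreadsPerBlock, minBlocksPerSMX)
    __global__ void __launch_bounds__(256, 8)           // 2048 / 256 = 8 blocks per SMX
    pairKernelCapped(float* out)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0) *out = 0.0f;  // placeholder body
        // ... pair loop as before; excess registers spill to local memory ...
    }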

Optimization on a single GPU

 Global memory and L2 cache traffic is reduced by using read-only memory
 Concurrent streams increase global memory and cache usage
 Math intrinsics increase instructions per cycle (IPC) and FLOPS
 Fewer instructions/workloads per stream lead to lower IPC and FLOPS
 13x speedup for the small case, 14x speedup for the medium case

Optimization on the Discovery Cluster

Baseline: MPI + OpenMP
 Split the workload across MPI processes
    The greedy scheme is applied
 Reduce atom-factor calculations from 16 to 10: the four atom types (T_C, T_H, T_O, T_N) form 16 ordered pairs but only 10 unique unordered ones
 Pre-compute coefficients to avoid extra computation
 Launch OpenMP threads to exploit shared-memory parallelism
 Final results are gathered by the root process

MPI+OpenMP performs 1.3x better than MPI alone:
    V1: 32 MPI processes / node x 24 nodes
    V2: (1 MPI process + 32 OpenMP threads) / node x 24 nodes
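A hedged sketch of the MPI+OpenMP baseline (the workload split and the scattering math are stubbed; only the structure follows the slide): one MPI rank per node, OpenMP threads inside each rank, and partial intensities combined at the root with MPI_Reduce.

    // Hedged MPI+OpenMP sketch: ranks split the workload, threads share it,
    // the root gathers the final intensities.
    // Compile with: mpicxx -fopenmp hybrid.cpp
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int Q = 1024;                             // number of q intervals (assumed)
        std::vector<double> local(Q, 0.0), total(Q, 0.0);

        // This rank's greedily balanced slice of the atom-pair workload;
        // OpenMP threads share its q intervals.
        #pragma omp parallel for
        for (int k = 0; k < Q; ++k)
            local[k] += 0.0;  /* accumulate this rank's pair contributions for interval k */

        MPI_Reduce(local.data(), total.data(), Q, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);                  // root gathers the result
        MPI_Finalize();
        return 0;
    }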

Optimization on the Discovery Cluster

Customized Solution: MPI + GPU
 Workload distribution: Greedy across MPI processes, Chunking across Hyper-Q streams
 Performance flattens out after 12 nodes for the medium case
    Communication overheads outweigh the kernel execution
    Per-node kernel computation shrinks, while the overheads are inevitable
 Speedup ratio decreases for the large case
 Hybrid CPU + GPU computation: assign the smallest job to the CPU statically, instead of scheduling dynamically
 2.5x speedup for the medium case; 27.7x speedup for the large case, compared to MPI+OpenMP
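A hedged sketch of the hybrid distribution described above (all helper names and the job bookkeeping are invented for the example): each rank launches its GPU chunks asynchronously, then computes its statically assigned smallest job on the CPU while the streams are in flight.

    // Hedged MPI+GPU sketch: greedy across ranks, chunking across streams,
    // smallest job assigned statically to the CPU.
    #include <mpi.h>
    #include <vector>

    static void launchChunked(int /*job*/, int /*K*/) { /* enqueue kernel chunks on CUDA streams */ }
    static void computeOnCPU(int /*job*/)             { /* serial pair loop on the host */ }

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Jobs this rank drew from the greedy partition (placeholder values),
        // ordered so the smallest job sits at the back.
        std::vector<int> myJobs = {5, 3, 1};

        if (!myJobs.empty()) {
            for (size_t i = 0; i + 1 < myJobs.size(); ++i)
                launchChunked(myJobs[i], /*K=*/32);     // bulk of the work -> GPU streams (async)
            computeOnCPU(myJobs.back());                // smallest job overlaps on the CPU
        }

        double local = 0.0, total = 0.0;                // partial intensities (stub)
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }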

Conclusions and Future Work

 Fiber scattering simulation for discovering large-scale renewable energy resources

 Apply GPU technology for better performance
    Read-only memory
    Math intrinsics
    Workload distribution
    Programming feature: Hyper-Q
 Cluster solution (MPI + GPU vs. MPI + OpenMP)
 Reduce simulation time from weeks to minutes
 Future work
    A better cooperative CPU + GPU scheme
    Streaming simulation
    Pattern recognition on simulated results
    Continue the next step: decoding the cellulose structure

Q&A

Thank you for your attention!

Leiming Yu [email protected]