HIGH PERFORMANCE COMPUTING OF FIBER SCATTERING SIMULATION
Leiming Yu, Yan Zhang, Xiang Gong, Nilay Roy, Lee Makowski and David Kaeli
Northeastern University, Boston, MA
GPGPU-8, February 2015, San Francisco, CA

Topics
• Motivation to accelerate Fiber Scattering Simulation
• Proposed Solution: GPU
• Simulation Algorithm
• Optimization on a single GPU
• Optimization on a cluster
• Conclusion and Future Work
• Q&A

Motivation
Cellulose is a rich bio-energy source. Previous studies focus on orientation rather than twisting or coiling, which could yield high volume. Fiber scattering simulation takes too much time, motivating an acceleration approach: parallelize the algorithm and parallelize the implementation on a GPU.

Atom Volume     | Elapsed Time [Intel i7-920]
135 (small)     | 2 seconds
10092 (medium)  | 1 hour
268992 (large)  | >1 week

Proposed Solution: GPU
Thousands of cores compute in SIMT fashion
High FLOPS
Fast memory
Performance-oriented programming features
Special instructions

[NVIDIA Kepler Memory Hierarchy]

Kepler features: Dynamic Parallelism, Hyper-Q, Unified Memory

GPU Cluster
32 compute nodes, max 24 nodes/user. Per node:
  One NVIDIA Tesla K20m: 2496 CUDA cores, 5 GB GDDR5 on-board memory
  Dual Intel Xeon CPU (E5-2650 @ 2 GHz): 32 logical cores, 128 GB DRAM
10 GB/s backplane interconnect
Software: IBM MPI v8.1, CUDA 6.5, Platform Load Sharing Facility (LSF) scheduler

Simulation Algorithm
[Flowchart: simulation loop over atom pairs]
for each scattering interval q:
  for I = 1st atom .. (L-2)th atom:      load atom I
    for J = I + 1 .. (L-1)th atom:       load atom J
      accumulate the equatorial intensity and the meridional intensity
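The nested loop above can be sketched on the CPU as follows. This is a minimal, hypothetical stand-in, assuming a Debye-style sinc kernel and constant per-element scattering factors `F`; the real simulation computes separate equatorial and meridional intensities with q-dependent form factors.

```python
import math

# Hypothetical constant scattering factors standing in for f_C, f_H, f_O, f_N;
# real atomic form factors depend on q.
F = {"C": 6.0, "H": 1.0, "O": 8.0, "N": 7.0}

def intensity(atoms, q):
    """O(n^2) pairwise intensity at momentum transfer q.

    atoms: list of (element, x, y, z) with distinct positions. The doubly
    nested I/J loop mirrors the flowchart: each atom I is paired with every
    later atom J.
    """
    L = len(atoms)
    total = sum(F[e] ** 2 for e, *_ in atoms)    # self terms (r = 0 pairs)
    for i in range(L - 1):                       # "Load Atom I"
        ei, xi, yi, zi = atoms[i]
        for j in range(i + 1, L):                # "J = I + 1 ... Load Atom J"
            ej, xj, yj, zj = atoms[j]
            r = math.dist((xi, yi, zi), (xj, yj, zj))
            # Debye sinc kernel stands in for the equatorial/meridional split;
            # factor 2 counts each unordered pair once in the full double sum.
            total += 2.0 * F[ei] * F[ej] * math.sin(q * r) / (q * r)
    return total
```

The pair loop is the O(n^2) hot spot that the GPU versions below parallelize.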
Per-element atomic scattering factors: f_C, f_H, f_O, f_N (L = atom list length)

Optimizing on a single GPU
V0 (baseline): global memory only; 7x-37x speedup over the CPU
Loading atom coefficients, types, and coordinates is expensive: O(n^2) loads

V1 = V0 + constant memory + float4 + page-locked memory
  1.06x speedup; works for the small case
  The 64 KB constant memory limit hurts the medium and large cases

V1 = V1 + texture memory
  Same speedup; works for all cases
Kernel launch overheads
V2 = V1 + loop inside the kernel: 1.08x speedup

Compute-bound kernel
V3 = V2 + fast_math: 1.75x speedup for the small case, 1.54x for the medium case

Can we do better? The NVIDIA Visual Profiler reports only 15% occupancy.
How to assign workloads using Hyper-Q?
Chunking Scheme: divide the workloads, unaware of their volumes
  K CUDA streams; each stream gets (L-1)/K rows of the pairwise loop (L = atom list length)
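A sketch of the chunking split (hypothetical helper, not from the slides): each of the K streams gets a contiguous block of outer-loop rows, but row i of the pairwise loop holds L-1-i pairs, so equal row counts give unequal work.

```python
def chunk_rows(L, K):
    """Split the L-1 outer-loop rows into K contiguous chunks of roughly
    (L-1)/K rows each, one chunk per CUDA stream, ignoring per-row volume.

    Returns (rows, pair_count) per stream; row i contains L-1-i pairs.
    """
    rows = list(range(L - 1))
    size = -(-(L - 1) // K)                       # ceil((L-1)/K) rows/stream
    chunks = [rows[s:s + size] for s in range(0, L - 1, size)]
    return [(c, sum(L - 1 - i for i in c)) for c in chunks]
```

For example, with L = 9 atoms and K = 2 streams the first stream gets 26 pairs while the second gets only 10, which is the imbalance the greedy scheme addresses.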
Best case: 10.5x speedup over 1 CUDA stream
Performance plateaus after 32 CUDA streams
How about dividing the workloads evenly? Greedy Partitioning Scheme:
  1: Initialize K groups
  2: Sort workloads W in non-increasing order
  3: for i in W do
  4:   check the workload of each group G_j, j = 0..K-1
  5:   add i to the group with min workload
  6: end for
60% speedup on average versus the Chunking scheme
Performance drops 30% as we scale from 12 to 14 streams; worse than Chunking after 12 streams
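The greedy pseudocode above can be sketched as follows (hypothetical helper names, using a heap to find the lightest group):

```python
import heapq

def greedy_partition(workloads, K):
    """Sort workloads in non-increasing order, then repeatedly add the next
    item to whichever of the K groups currently has the least total work."""
    groups = [[] for _ in range(K)]
    heap = [(0, g) for g in range(K)]             # (current load, group index)
    heapq.heapify(heap)
    for w in sorted(workloads, reverse=True):
        load, g = heapq.heappop(heap)             # group with min workload
        groups[g].append(w)
        heapq.heappush(heap, (load + w, g))
    return groups
```

Partitioning the row workloads 8, 7, ..., 1 of an L = 9 atom list into K = 2 groups yields two groups of 18 pairs each, versus 26 and 10 under chunking.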
Low occupancy, high throughput? (V. Volkov's proposed solution)
  Threads stall on dependencies, not on memory accesses, so do more parallel work per thread
  Works for memory-bound kernels, but assigning twice the computation to each thread doubles execution time here, even after shrinking the block size from 256 to 128 threads

Reduce resource requirements to allow more concurrent streams
  Registers: excessive register usage; the NVIDIA CUDA compiler option -maxrregcount reduces registers per thread from 31 to 21
  Shared memory: leveraged to load the scattering coefficients; not the bottleneck, allows the max of 16 blocks per SMX
  Threads per block: max threads per SMX / block size = 2048 / 256 = 8 blocks
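The arithmetic above can be checked with a back-of-the-envelope occupancy model (Kepler GK110 per-SMX limits; shared memory omitted since it is not the bottleneck here, and real occupancy also rounds by allocation granularity):

```python
# Kepler GK110 (K20m) per-SMX hardware limits
MAX_THREADS_PER_SMX = 2048
MAX_BLOCKS_PER_SMX = 16
REGISTERS_PER_SMX = 64 * 1024

def blocks_per_smx(threads_per_block, regs_per_thread):
    """Resident blocks per SMX: whichever of the thread, block, or register
    limit binds first (simplified model, shared memory ignored)."""
    by_threads = MAX_THREADS_PER_SMX // threads_per_block
    by_regs = REGISTERS_PER_SMX // (regs_per_thread * threads_per_block)
    return min(by_threads, by_regs, MAX_BLOCKS_PER_SMX)
```

At 256 threads/block the thread limit caps residency at 2048 / 256 = 8 blocks per SMX whether a thread uses 31 or 21 registers, so the register reduction mainly pays off by leaving room for more concurrent streams.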
Global memory and L2 cache traffic is reduced by using read-only memory
Concurrent streams increase global memory and cache usage
Math intrinsics increase instructions per cycle (IPC) and FLOPS
Fewer instructions/workloads per stream lead to low IPC and FLOPS
Overall: 13x speedup for the small case, 14x for the medium case

Optimization on the Discovery Cluster
Baseline: MPI + OpenMP
  Split the workloads across MPI processes; the Greedy scheme is applied
  Reduce atom factor calculations from 16 to 10 (Tc, Th, To, Tn); pre-compute coefficients to avoid extra computation
  Launch OpenMP threads to exploit shared-memory parallelism
  Final results are gathered by the root process
MPI+OpenMP performs 1.3x better than MPI alone
  V1: 32 MPI processes / node * 24 nodes
  V2: (1 MPI process + 32 OpenMP threads) / node * 24 nodes
Customized Solution: MPI + GPU
Workload distribution: Greedy across MPI processes, Chunking across Hyper-Q streams
Performance flattens out after 12 nodes for the medium case
  Per-node kernel computation shrinks, while communication overheads are inevitable and come to outweigh kernel execution
The speedup ratio also decreases for the large case
Hybrid CPU + GPU computation: assign the smallest job to the CPU instead of dynamic scheduling
  2.5x speedup for the medium case
  27.7x speedup for the large case, compared to MPI+OpenMP

Conclusions and Future Work
Fiber scattering simulation supports the study of large-scale renewable energy resources
Apply GPU technology for better performance:
  Read-only memory, math intrinsics
  Workload distribution
  Programming feature: Hyper-Q
  Cluster solution (MPI + GPU vs. MPI + OpenMP)
  Simulation time reduced from weeks to minutes
Future work:
  Better cooperative CPU + GPU scheme
  Streaming simulation
  Pattern recognition on simulated results
  Continue the next step of decoding cellulose structure

Q&A
Thank you for your attention!
Leiming Yu [email protected]