Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives
José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño
7th International Symposium on High-Level Parallel Programming and Applications (HLPP 2014)
July 3-4, 2014, Amsterdam, Netherlands

Outline
• Motivation: General Purpose Computation with GPUs
• GPGPU with CUDA & OpenHMPP
• The KIR: an IR for the Detection of Parallelism
• Locality-Aware Generation of Efficient GPGPU Code
• Case Studies: CONV3D & SGEMM
• Performance Evaluation
• Conclusions & Future Work

J.M. Andión et al. Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives. HLPP 2014.

The Parallel Challenge
[Figure: single-processor performance relative to the VAX-11/780, 1978-2012, with annotated growth rates of 25%/year, 52%/year and 22%/year.]
David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Elsevier, 2014.

General Purpose Computing with GPUs
[Figure: number of TOP500 systems with accelerators/co-processors per year, 2006-2014. Legend: ATI, AMD, Intel, NVIDIA, Clearspeed CSX600, Cell.]
The TOP500 List. June 2014.

GPGPU with CUDA
• The first GPGPU programs looked like graphics applications
• CUDA enables the use of C
• CUDA kernel: specifies the operation of a single GPU thread
• Main ideas:
1. Lightweight parallel threads in hierarchy: grid, block
2. Shared memory
3. Barriers

GPU Programming Features in CUDA
1. Threadification
2. Thread grouping: warps
3. Minimization of CPU-GPU data transfers
4. Coalescing
5. Maximization of the usage of registers and shared memory
6. Divergence
7. Occupancy
8. Threads per block

GPGPU with OpenHMPP
• Directive-based approaches provide several advantages:
  • More readable codes
  • Only one file
  • Independent from the hardware platform
  • Reasonable performance
• Explicit control of software-managed caches
• Standard loop transformations
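
The CUDA execution model summarized in this section (a kernel describes one thread, and threads are organized in blocks within a grid) can be illustrated on the CPU. What follows is a minimal sketch in plain C, not CUDA. The names `scale_add_thread`, `launch_grid`, `block_idx`, `thread_idx` and `THREADS_PER_BLOCK` are hypothetical stand-ins chosen for this illustration, and the two loops in `launch_grid` only emulate sequentially what a GPU would execute in parallel.

```c
/* Hypothetical block size for this sketch; real codes tune this value
   (item 8, "Threads per block", in the feature list). */
#define THREADS_PER_BLOCK 4

/* Emulated kernel: the work of ONE thread, i.e. one iteration of the
   original loop "for (i = 0; i < n; i++) y[i] = a * x[i] + y[i];". */
static void scale_add_thread(int block_idx, int thread_idx,
                             float a, const float *x, float *y, int n)
{
    /* Global index of this thread within the 1D grid. */
    int i = block_idx * THREADS_PER_BLOCK + thread_idx;

    /* Guard: the last block may contain threads with no work to do. */
    if (i < n)
        y[i] = a * x[i] + y[i];
}

/* Emulated launch over a 1D grid; returns the number of blocks used. */
static int launch_grid(float a, const float *x, float *y, int n)
{
    /* Round up so that every loop iteration is covered by some thread. */
    int num_blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    for (int b = 0; b < num_blocks; b++)             /* blocks in the grid   */
        for (int t = 0; t < THREADS_PER_BLOCK; t++)  /* threads in the block */
            scale_add_thread(b, t, a, x, y, n);

    return num_blocks;
}
```

For n = 10 and four threads per block, `launch_grid` uses three blocks, and the guard leaves the two surplus threads of the last block idle.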

GPU Programming Features with OpenHMPP
1. Threadification
2. Thread grouping
3. Minimization of CPU-GPU data transfers
4. Coalescing
5. Maximization of the usage of registers and shared memory
6. Divergence
7. Occupancy
8. Threads per block
OpenHMPP directives shown on the slide: gridify, advancedLoad, delegatedStore, permute, unroll, fuse, tile, shared

diKernel: Domain-Independent Computational Kernel
Levels of program representation:
• Domain-specific concept level (problem solving methods and application domain)
• Domain-independent concept level (programming practice)
• Semantic level (control flow and data dependence graphs)
• Syntactic level (abstract syntax tree)
• Text level (ASCII code)
A diKernel:
• Characterizes the computations carried out in a program without being affected by how they are coded
• Exposes multiple levels of parallelism
M. Arenaz et al. XARK: An Extensible Framework for Automatic Recognition of Computational Kernels. ACM Transactions on Programming Languages and Systems, 30(6), 2008.

Building the KIR
• Non-statement based, high-level, hierarchical IR
1. diKernel recognition on the DDG
2. Identification of flow dependences
3. Hierarchy of execution scopes reflecting the computational stages & diKernel classification
J.M. Andión et al. A Novel Compiler Support for Automatic Parallelization on Multicore Systems. Parallel Computing, 39(9), 2013.

Example of KIR: CONV3D
(Shaded diKernels in the figure are omitted in the discovery of parallelism.)

 1. int i, j, k, size_x, size_y, size_z;
 2. float coefx, coefy, coefz, *input, *output;
 3.
 4. for (i = 0; i < size_x; i++) {
 5.   for (j = 0; j < size_y; j++) {
 6.     for (k = 0; k < size_z; k++) {
 7.       float tempx = input[i][j][k]+coefx*
 8.       (
 9.         input[i-1][j][k]+input[i+1][j][k]+
10.         input[i-2][j][k]+input[i+2][j][k]+
11.         input[i-3][j][k]+input[i+3][j][k]+
12.         input[i-4][j][k]+input[i+4][j][k]
13.       );
14.       float tempy = input[i][j][k]+coefy*
15.       (
16.         input[i][j-1][k]+input[i][j+1][k]+
17.         input[i][j-2][k]+input[i][j+2][k]+
18.         input[i][j-3][k]+input[i][j+3][k]+
19.         input[i][j-4][k]+input[i][j+4][k]
20.       );
21.       float tempz = input[i][j][k]+coefz*
22.       (
23.         input[i][j][k-1]+input[i][j][k+1]+
24.         input[i][j][k-2]+input[i][j][k+2]+
25.         input[i][j][k-3]+input[i][j][k+3]+
26.         input[i][j][k-4]+input[i][j][k+4]
27.       );
28.       output[i][j][k] =
29.         output[i][j][k]+tempx+tempy+tempz;
30.     }
31.   }
32. }

KIR of the code above:
ROOT EXECUTION SCOPE
  ES_for_i,j,k (Fig. 1a, lines 4-32)
    K<tempx_7>: scalar assignment
    K<tempy_14>: scalar assignment
    K<tempz_21>: scalar assignment
    K<output_28>: regular reduction

GPU Programming Features addressed by our Automatic Technique
1. Threadification
2. Thread grouping
3. Minimization of CPU-GPU data transfers
4. Coalescing
5. Maximization of the usage of registers and shared memory
6. Divergence
7. Occupancy
8. Threads per block
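
The CONV3D listing uses 3D subscripts on `float *` pointers and reads up to four elements beyond each loop bound, so it does not compile or run as written. The following is a hedged sketch of one way to make it runnable in C. It assumes (this is not stated in the slides) that both arrays are stored flattened in row-major order and padded with a halo of four cells on every side. The helper `idx`, the macro `R` and the inner loop over `d` are introduced here for illustration only.

```c
/* Stencil radius: the listing accesses offsets -4..+4 in each dimension. */
#define R 4

/* Flattened row-major index into an array padded with R halo cells per side
   (the padding is an assumption of this sketch). */
static int idx(int i, int j, int k, int size_y, int size_z)
{
    return ((i + R) * (size_y + 2 * R) + (j + R)) * (size_z + 2 * R) + (k + R);
}

static void conv3d(int size_x, int size_y, int size_z,
                   float coefx, float coefy, float coefz,
                   const float *input, float *output)
{
    for (int i = 0; i < size_x; i++) {
        for (int j = 0; j < size_y; j++) {
            for (int k = 0; k < size_z; k++) {
                float center = input[idx(i, j, k, size_y, size_z)];
                float sum_x = 0.0f, sum_y = 0.0f, sum_z = 0.0f;

                /* Accumulate the eight neighbours along each axis. */
                for (int d = 1; d <= R; d++) {
                    sum_x += input[idx(i - d, j, k, size_y, size_z)]
                           + input[idx(i + d, j, k, size_y, size_z)];
                    sum_y += input[idx(i, j - d, k, size_y, size_z)]
                           + input[idx(i, j + d, k, size_y, size_z)];
                    sum_z += input[idx(i, j, k - d, size_y, size_z)]
                           + input[idx(i, j, k + d, size_y, size_z)];
                }

                /* Scalar assignments (K<tempx>, K<tempy>, K<tempz> in the KIR). */
                float tempx = center + coefx * sum_x;
                float tempy = center + coefy * sum_y;
                float tempz = center + coefz * sum_z;

                /* Regular reduction on output (K<output> in the KIR). */
                output[idx(i, j, k, size_y, size_z)] += tempx + tempy + tempz;
            }
        }
    }
}
```

With an input of all ones, a zero-initialized output and coefficients 1, 2 and 3, every interior output element becomes 9 + 17 + 25 = 51, while the halo cells of the output stay untouched. Each (i, j, k) iteration writes a distinct output element and reads only `input`, so the iterations are independent of one another.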

Detection of coalesced accesses to the GPU global memory
CHRECS_x = [{0,+,1}][{0,+,1}]

(a) Source code S1.
// only for_i is threadified
for (i = 0; i <= N; i++) {
  for (j = 0; j <= N; j++) {
    ... x[i][j] ...
  }
}

(c) Source code S2.
// only for_j is threadified
for (j = 0; j <= N; j++) {
  for (i = 0; i <= N; i++) {
    ... x[i][j] ...
  }
}

Accesses of threads T0-T2 in S1:
       T0 (i=0)   T1 (i=1)   T2 (i=2)
j=0    x[0][0]    x[1][0]    x[2][0]
j=1    x[0][1]    x[1][1]    x[2][1]
j=2    x[0][2]    x[1][2]    x[2][2]

Accesses of threads T0-T2 in S2:
       T0 (j=0)   T1 (j=1)   T2 (j=2)
i=0    x[0][0]    x[0][1]    x[0][2]
i=1    x[1][0]    x[1][1]    x[1][2]
i=2    x[2][0]    x[2][1]    x[2][2]
..
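
The difference between S1 and S2 can be made concrete by computing how far apart in memory the elements touched by two consecutive threads are. The sketch below is a simplified stand-in for the analysis, not the chrec-based technique itself: it assumes row-major storage of `x` (as in C) and only handles the single access `x[i][j]`. The names `threadified_loop`, `offset`, `thread_stride` and `is_coalesced` are hypothetical.

```c
/* Which loop of the nest is mapped onto GPU threads (hypothetical enum). */
typedef enum { THREADIFY_I, THREADIFY_J } threadified_loop;

/* Linear row-major offset of x[i][j] when each row holds `cols` elements. */
static long offset(long i, long j, long cols)
{
    return i * cols + j;
}

/* Distance, in elements, between the addresses touched by threads t and t+1
   during the same iteration of the remaining sequential loop. */
static long thread_stride(threadified_loop loop, long cols)
{
    const long seq = 0;  /* any fixed iteration of the sequential loop */

    if (loop == THREADIFY_I)  /* S1: thread t handles i = t */
        return offset(1, seq, cols) - offset(0, seq, cols);

    /* S2: thread t handles j = t */
    return offset(seq, 1, cols) - offset(seq, 0, cols);
}

/* Accesses are treated as coalesced here when consecutive threads touch
   consecutive elements. */
static int is_coalesced(threadified_loop loop, long cols)
{
    return thread_stride(loop, cols) == 1;
}
```

For the 3x3 case drawn in the tables, `thread_stride(THREADIFY_I, 3)` is 3 (a full row between T0 and T1, as in S1), while `thread_stride(THREADIFY_J, 3)` is 1 (adjacent elements, as in S2). Under the row-major assumption, only S2 lets consecutive threads read consecutive memory positions.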