
Performance portability via Nim metaprogramming
Xiao-Yong Jin and James C. Osborn
Argonne Leadership Computing Facility, Argonne National Laboratory
A DOE ECP effort from Lattice QCD
DOE COE Performance Portability Meeting 2017, August 23, Denver, Colorado

“Complexity” seems to be a lot like “energy”: you can transfer it from the end user to one/some of the other players, but the total amount seems to remain pretty much constant for a given task. —Ran, 5 Mar 2000

Outline
• Benchmark code in Nim
• Application domain: Lattice Gauge Theory
• Nim programming language
• onGpu
• Benchmark results
  • P100 + Xeon
  • KNL

Benchmark

  proc test(vecLen, memLen: static[int]; N: int) =
    var
      x = newColorMatrixArray(vecLen,memLen,N)  # define complex 3x3 matrix fields
      y = newColorMatrixArray(vecLen,memLen,N)
      z = newColorMatrixArray(vecLen,memLen,N)
    # ...
    threads:                       # CPU threads
      x := 0
      y := 1                       # scalar assignment sets the diagonal elements
      z := 2
      timeit "CPU": x += y * z     # benchmark
    timeit "GPU5":
      onGpu(N, 32):                # run the statement block on the GPU
        x += y * z
    timeit "GPU6": onGpu(N, 64): x += y * z    # rerun with different threads per block
    timeit "GPU7": onGpu(N, 128): x += y * z
    threads: timeit "CPU": x += y * z          # back to CPU threads

CPU threads
• Takes a block of code statements
• Wraps it in a function with lexically scoped thread-local objects
• References to variables outside the code block are managed by Nim
• Runs the function under an omp parallel directive
• A custom iterator over the array indices takes care of the actual data-parallel operations:

  x += y * z   →   x[i] += (y * z)[i]

AST-based overloading for data parallel ops

  type ArrayIndex* = SomeInteger or ShortVectorIndex

  template indexArray*(x: ArrayObj, i: ArrayIndex): untyped =
    x.p[i]

  macro indexArray*(x: ArrayObj{call}, y: ArrayIndex): untyped =
    result = newCall(ident($x[0]))
    for i in 1..<x.len:
      let xi = x[i]
      result.add( quote do:
        indexArray(`xi`,`y`) )

  template `[]`*(x: ArrayObj, i: ArrayIndex): untyped =
    indexArray(x, i)

• When an ArrayObj is indexed, if the object is a function call, the indexing goes inside the call:

  x[i] += (y * z)[i]   →   x[i] += y[i] * z[i]

Lattice gauge theory
• Large 4D (5D) grid of small vectors/matrices with homogeneous stencil operations — large sparse linear algebra

Nim
• Modern language (since 2008)
• “Efficient, Expressive, Elegant”
• Statically typed systems language (full access to low-level objects & code) with type inference
• Generates C or C++ code that can be compiled with any compiler
• Integrated build system (no Makefile necessary): copy the main program, modify, compile
• https://nim-lang.org

Nim—both low-level and high-level
• Low-level efficiency
  • Can manually manage memory instead of using the GC
  • Cross-module inlining and constant folding
  • Whole-program dead code elimination
• High-level wrappers & libraries
  • gmp, bignum, nimblas, linalg (LAPACK), …
  • Bindings to GTK2, the Windows API, the POSIX API, OpenGL, SDL, Cairo, Python, Lua, TCL, X11, libzip, PCRE, libcurl, mySQL, SQLite, …
  • exportc to create static/dynamic libraries
• NimScript: shell-like scripting
  • Used in the compiler for compile-time evaluation
  • Can be plugged into an application and interface with the rest of the application

Nim—metaprogramming
• Templates: inline code substitution; also allow overloading; completely hygienic (if desired)
• Generics: apply to types, procedures, templates, and macros; also allow type classes and concepts
• Macros: similar to Lisp; the syntax tree of the arguments is passed to the macro at compile time (type checked or untyped)
• AST-based overloading: allows specialization based on the AST of the arguments (a small sketch of templates and macros follows)
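As a minimal sketch of these two layers (not from the QEX/cudanim sources; the names timed and twice are illustrative only), the following shows a hygienic template that substitutes its body at the call site, and a macro that receives that body as a syntax tree and rewrites it:

  import macros, times

  # Hygienic template: the body is pasted in at the call site at compile time.
  template timed(label: string, body: untyped): untyped =
    let t0 = cpuTime()
    body
    echo label, ": ", cpuTime() - t0, " s"

  # Macro: the argument arrives as an AST at compile time and can be rewritten;
  # here it is simply duplicated, a trivial form of unrolling.
  macro twice(body: untyped): untyped =
    result = newStmtList(copyNimTree(body), copyNimTree(body))

  var acc = 0.0
  timed("sum"):
    twice:
      for i in 0 ..< 1_000_000:
        acc += float(i)
  echo acc

The timeit and onGpu constructs used in the benchmark above are built from the same ingredients, with the macro layer doing the heavier AST surgery.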
New framework: QEX (Quantum EXpressions)
• Data parallel library for tensor objects on a lattice, including shifts and reductions
• Mostly in Nim, with the USQCD SciDAC C libraries
• High-level interface in development
• Available at https://github.com/jcosborn/qex
• Performance portability study: cudanim
  • Supports arrays on both CPU and GPU
  • Check out https://github.com/jcosborn/cudanim

onGpu

  template onGpu*(nn,tpb: untyped, body: untyped): untyped =
    block:
      # copy memory to the GPU if necessary and collect the pointers in v
      var v = packVars(body, getGpuPtr)
      type ByCopy {.bycopy.} [T] = object
        d: T
      # generate the kernel function
      proc kern(xx: ByCopy[type(v)]) {.cudaGlobal.} =
        template deref(k: int): untyped = xx.d[k]
        substVars(body, deref)
      let ni = nn.int32
      let threadsPerBlock = tpb.int32
      let blocksPerGrid = (ni+threadsPerBlock-1) div threadsPerBlock
      cudaLaunch(kern, blocksPerGrid, threadsPerBlock, v)
      discard cudaDeviceSynchronize()
  template onGpu*(nn: untyped, body: untyped): untyped =
    onGpu(nn, 64, body)
  template onGpu*(body: untyped): untyped =
    onGpu(512*64, 64, body)

The generated kern

  proc kern(xx: ByCopy[type(v)])
      {.codegenDecl: "__global__ $# $#$#".} =
    # some definitions omitted
    inlineProcs:   # inlines all procedure calls
      template deref(k: int): untyped =
        xx.d[k]
      substVars((x += y * z), deref)

• xx is an object holding pointers to GPU memory
• substVars, with the help of deref, transforms

  x += y * z   →   deref(0) += deref(1) * deref(2)   →   xx.d[0] += xx.d[1] * xx.d[2]

Memory
[Figure: linear memory addresses ptr, ptr+v, ptr+2v, …; the inner (short) vector maps onto a GPU register]
• Coalesced in-memory data layout: AoSoAoS (array of structures of arrays of structures)
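As a rough illustration of the AoSoAoS idea (my own sketch, not QEX code; V, Cplx, VecCplx, and ColorMatrixSoA are made-up names), the innermost dimension of the storage is a short vector whose length matches the SIMD or warp granularity, so that consecutive lanes or threads touch consecutive addresses:

  const V = 32   # inner short-vector length, assumed to match the warp/SIMD width

  type
    Cplx = object
      re, im: float32
    VecCplx = object   # structure of short vectors: V sites interleaved
      re, im: array[V, float32]
    ColorMatrixSoA = array[3, array[3, VecCplx]]   # 3x3 complex matrix for V lattice sites

  # Site s splits into (outer, inner) = (s div V, s mod V); threads in a block
  # share `outer` and differ only in `inner`, so loads of m[r][c].re[inner]
  # from consecutive threads are coalesced.
  proc elem(m: ColorMatrixSoA; r, c, inner: int): Cplx =
    Cplx(re: m[r][c].re[inner], im: m[r][c].im[inner])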
Benchmark results: Tesla P100 + 2x Xeon E5-2687W v2 (CUDA 8.0.61 + GCC 4.8.5)
[Figure: effective bandwidth (GB/s) versus memory footprint (KB) for the CPU and for the GPU at several threads-per-block settings]

Benchmark results: KNL (Xeon Phi 7210), flat MCDRAM (GCC 7.1.0)
[Figure: effective bandwidth (GB/s) versus memory footprint (KB)]

Summary & Outlook
• Nim metaprogramming helps hide architecture differences under a unified data parallel API
• The toy benchmark saturates GPU bandwidth
• Considering an API to use both CPU & GPU in a heterogeneous setting
• Many possibilities for using the AST
  • Apply AST-based optimizations (inlining, loop unrolling, temporary-variable elimination) across multiple statements
  • Specialization at different levels
  • Craft application-specific AST transformations
  • Help the general-purpose compiler with application specifics

Backup slides

GPU offloading (1 of 4)

  template cudaDefs(body: untyped): untyped {.dirty.} =
    var gridDim{.global,importC,noDecl.}: CudaDim3
    var blockIdx{.global,importC,noDecl.}: CudaDim3
    var blockDim{.global,importC,noDecl.}: CudaDim3
    var threadIdx{.global,importC,noDecl.}: CudaDim3
    template getGridDim: untyped {.used.} = gridDim
    template getBlockIdx: untyped {.used.} = blockIdx
    template getBlockDim: untyped {.used.} = blockDim
    template getThreadIdx: untyped {.used.} = threadIdx
    template getThreadNum: untyped {.used.} = blockDim.x * blockIdx.x + threadIdx.x
    template getNumThreads: untyped {.used.} = gridDim.x * blockDim.x
    bind inlineProcs
    inlineProcs:
      body

• Overloaded template definitions
• Inlines all Nim procedures in body to prepare for generating the kernel function

GPU offloading (2 of 4)

  macro cuda*(s,p: untyped): auto =
    let ss = s.strVal
    p.expectKind nnkProcDef
    result = p
    result.addPragma parseExpr("{.codegenDecl:\""&ss&" $# $#$#\".}")[0]
    result.body = getAst(cudaDefs(result.body))
    var sl = newStmtList()
    sl.add( quote do:
      {.push checks: off.}
      {.push stacktrace: off.} )
    sl.add result
    result = sl

  template cudaGlobal*(p: untyped): auto = cuda("__global__",p)

• Converts a procedure definition, p, using the overloaded templates in cudaDefs
• Turns it into a __global__ kernel

GPU offloading (3 of 4)

  template cudaLaunch*(p: proc; blocksPerGrid,threadsPerBlock: SomeInteger;
                       arg: varargs[pointer,dataAddr]) =
    var pp: proc = p
    var gridDim, blockDim: CudaDim3
    gridDim.x = blocksPerGrid
    gridDim.y = 1
    gridDim.z = 1
    blockDim.x = threadsPerBlock
    blockDim.y = 1
    blockDim.z = 1
    var args: array[arg.len, pointer]
    for i in 0..<arg.len: args[i] = arg[i]
    #echo "really launching kernel"
    let err = cudaLaunchKernel(pp, gridDim, blockDim, addr args[0])
    if err:
      echo err
      quit cast[cint](err)

• Calls the CUDA runtime function, cudaLaunchKernel, with the passed-in procedure, p

GPU offloading (4 of 4)

  template onGpu*(nn,tpb: untyped, body: untyped): untyped =
    block:
      var v = packVars(body, getGpuPtr)
      type ByCopy {.bycopy.} [T] = object
        d: T
      proc kern(xx: ByCopy[type(v)]) {.cudaGlobal.} =
        template deref(k: int): untyped = xx.d[k]
        substVars(body, deref)
      let ni = nn.int32
      let threadsPerBlock = tpb.int32
      let blocksPerGrid = (ni+threadsPerBlock-1) div threadsPerBlock
      cudaLaunch(kern, blocksPerGrid, threadsPerBlock, v)
      discard cudaDeviceSynchronize()

  template onGpu*(nn: untyped, body: untyped): untyped = onGpu(nn, 64, body)
  template onGpu*(body: untyped): untyped = onGpu(512*64, 64, body)

• Takes a chunk of code, body, and puts it in a kernel definition, kern
• kern calls cudaGlobal to set up the other definitions, and takes care of syncing CPU memory to the GPU
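To make the substVars step concrete, here is a minimal stand-alone sketch of the idea (my own illustration; substVarsDemo and deref here are hypothetical names, not the cudanim API): a macro walks the body's AST and replaces each listed variable with an indexed dereference, which is the x += y * z  →  deref(0) += deref(1) * deref(2) rewrite shown earlier.

  import macros

  macro substVarsDemo(names: untyped; body: untyped): untyped =
    ## Replace every identifier listed in `names` (e.g. [x, y, z]) inside
    ## `body` with deref(i), where i is its position in the list.
    names.expectKind nnkBracket
    proc walk(n, names: NimNode): NimNode =
      if n.kind == nnkIdent:
        for i in 0 ..< names.len:
          if $n == $names[i]:
            return newCall(ident"deref", newLit(i))
        return n
      result = copyNimNode(n)
      for c in n: result.add walk(c, names)
    result = walk(body, names)

  # Toy use: deref plays the role of xx.d[k]; x, y, z never need to exist.
  var d = [0.0, 1.0, 2.0]
  template deref(k: int): untyped = d[k]

  substVarsDemo([x, y, z]):
    x += y * z              # rewritten to deref(0) += deref(1) * deref(2)
  echo d                    # prints [2.0, 1.0, 2.0]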