Julia: A Fresh Approach to GPU Computing

What is Julia?

Technical computing language
High-level like Python
Performance of C

    function mandel(z)
        c = z
        maxiter = 80
        for n = 1:maxiter
            if abs(z) > 2
                return n-1
            end
            z = z^2 + c
        end
        return maxiter
    end

Julia for GPU programming

What is GPU programming?

                 CUDA ecosystem        Julia
High-level       TensorFlow, Keras     Flux.jl, Knet.jl
                 ArrayFire, Thrust     GPUArrays.jl
                 cuBLAS, cuDNN         CuArrays.jl
                 CUB, MGPU             CUDAnative.jl
Low-level        CUDA C

Programming with libraries

Library wrappers (cuBLAS.jl, cuFFT.jl, cuRAND.jl, cuSPARSE.jl, cuDNN.jl, cuSOLVER.jl), unified behind CuArrays.jl:

    a = CuArray(Float32, 2)
    b = curand(Float32, 2)
    a * a
    fft(a)
    qrfact(a)
    softmax(a)
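A minimal runnable sketch of this library-backed style (assuming CuArrays.jl, whose functionality has since moved into CUDA.jl):

    using CuArrays

    a = CuArray(rand(Float32, 1024, 1024))   # upload to the GPU
    b = CuArray(rand(Float32, 1024, 1024))

    c = a * b            # matrix multiply dispatches to cuBLAS
    d = a .+ 2f0 .* b    # broadcasting compiles to a fused GPU kernel
    s = sum(d)           # reduction runs on the device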

Programming with kernels

Much harder!

Designed for performance

Multiple dispatch

    # manual type checks inside one function...
    function foo(x)
        if isa(x, Int64)
            …
        elseif isa(x, Float64)
            …
        end
    end

    # ...versus separate methods, selected by dispatch
    foo(x::Int64) = …
    foo(x::Float64) = …
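A small runnable sketch of that dispatch style (the describe function is purely illustrative):

    describe(x::Int64)   = "an integer: $x"
    describe(x::Float64) = "a float: $x"

    describe(42)     # "an integer: 42"
    describe(4.2)    # "a float: 4.2"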

Type inference

    function sigmoid(x)
        temp = exp(-x)
        return (1 / (1 + temp))
    end

Called with an Int, every intermediate type is inferred:

    function sigmoid(x::Int)
        temp = exp(-x)::Float64
        return (1 / (1 + temp))::Float64
    end

Called with a Float32, everything stays Float32:

    function sigmoid(x::Float32)
        temp = exp(-x)::Float32
        return (1 / (1 + temp))::Float32
    end
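The inferred types can be checked interactively; a quick sketch using the standard @code_warntype macro:

    sigmoid(x) = 1 / (1 + exp(-x))

    @code_warntype sigmoid(1.0f0)   # every intermediate is a Float32
    @code_warntype sigmoid(1)       # exp(-1) is a Float64, so the result is Float64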

Machine-native types

Designed for performance

Multiple dispatch
Type inference
Machine-native types
Specializing JIT compiler
→ High-quality, stand-alone machine code

Extensible language

Inspect & Inject

Source → AST → Julia IR → LLVM IR → Machine code
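Every stage of that pipeline can be inspected from the REPL; for example, with the standard reflection macros:

    f(x) = x^2 + 1

    @code_lowered f(1.0)   # lowered Julia IR
    @code_typed   f(1.0)   # Julia IR after type inference
    @code_llvm    f(1.0)   # LLVM IR
    @code_native  f(1.0)   # machine code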

Configure

How does it look?

    function vadd(a, b, c)
        i = threadIdx().x
        c[i] = a[i] + b[i]
        return
    end

    a = CuArray(randn(2,2))
    b = CuArray(randn(2,2))
    c = similar(a)

    @cuda threads=4 vadd(a,b,c)

No DSL, no subset, just Julia
CUDA abstraction level
Performance parity

How does it run?

How does it work?

When vadd is launched with @cuda, it is specialized for the concrete argument types, here CuArray{Float64,2}: the Julia code is compiled to LLVM IR, then to PTX, and loaded onto the GPU.

A run-time JIT compiler caches these specializations. Calling the same vadd with differently typed arguments, e.g. CuArray{Int32,3}, generates new LLVM IR and PTX for those types.

Fully transparent. No overhead!
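A sketch of that transparent respecialization, using the CUDAnative.jl/CuArrays.jl packages from the talk (now merged into CUDA.jl):

    using CUDAnative, CuArrays

    function vadd(a, b, c)
        i = threadIdx().x
        c[i] = a[i] + b[i]
        return
    end

    # First call: vadd is specialized for CuArray{Float64,2}, compiled to PTX, cached, and launched.
    a = CuArray(randn(2, 2)); b = CuArray(randn(2, 2)); c = similar(a)
    @cuda threads=4 vadd(a, b, c)

    # Different argument types trigger a fresh specialization, again cached for later calls.
    x = CuArray(ones(Int32, 2, 2)); y = CuArray(ones(Int32, 2, 2)); z = similar(x)
    @cuda threads=4 vadd(x, y, z)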

High-level GPU programming

Great performance

Clean & concise:

    ((a .+ b) ./ d) .- e
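A complete sketch of that expression on the GPU (the array names are illustrative); the dotted operations fuse into a single kernel:

    using CuArrays

    a = CuArray(rand(Float32, 1000))
    b = CuArray(rand(Float32, 1000))
    d = CuArray(rand(Float32, 1000))
    e = CuArray(rand(Float32, 1000))

    result = ((a .+ b) ./ d) .- e   # one fused GPU kernel, no intermediate arrays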

Generic code:

    function vadd(a, b, c)
        i = threadIdx().x
        c[i] = a[i] + b[i]
        return
    end
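Because the kernel is generic, it also works for user-defined element types; a sketch with a plain isbits struct (the Point type is hypothetical), which the GPU compiler can handle much like the built-in numbers:

    struct Point
        x::Float32
        y::Float32
    end
    Base.:+(p::Point, q::Point) = Point(p.x + q.x, p.y + q.y)

    a = CuArray([Point(1, 2), Point(3, 4)])
    b = CuArray([Point(5, 6), Point(7, 8)])
    c = similar(a)
    @cuda threads=2 vadd(a, b, c)   # the same vadd kernel now adds Points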

    W = randn(2, 10)
    b = randn(2)

    f(x) = softmax(W * x .+ b)

    model = Chain(
        Dense(10, 5, σ),
        Dense(5, 2),
        softmax)

From GPU kernels, to differentiable algorithms, to high-level stacking. All on one platform.
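A sketch of moving that model onto the GPU, assuming Flux.jl with CuArrays installed; Flux's gpu helper converts the parameters to CuArrays:

    using Flux, CuArrays

    model = Chain(Dense(10, 5, σ), Dense(5, 2), softmax)

    gpu_model = gpu(model)         # parameters become CuArrays
    x = gpu(rand(Float32, 10))     # move the input as well
    gpu_model(x)                   # the forward pass runs as GPU kernels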

    Pkg.add("Flux")

The Julia Magic

Everything just works with everything else!

Everything You Build, combined with:

Differential Equations
Automatic Differentiation
CUDA
All the HPC Tooling
Operations Research
JuliaDB.jl & StructOfArrays.jl
DistributedArrays.jl
DataFrames.jl
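One small sketch of that composability, assuming ForwardDiff.jl: automatic differentiation applied to a generic function without changing it.

    using ForwardDiff

    sigmoid(x) = 1 / (1 + exp(-x))

    ForwardDiff.derivative(sigmoid, 1.0)   # dual numbers flow through the generic code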

Generic programming is extremely powerful.

    function model(tree)
        if isleaf(tree)
            tree.value
        else
            model(tree.left) + model(tree.right)
        end
    end
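A sketch of a tree this generic model could walk (the Leaf/Node types here are illustrative):

    struct Leaf
        value::Float64
    end
    struct Node
        left
        right
    end

    isleaf(t) = t isa Leaf

    tree = Node(Leaf(1.0), Node(Leaf(2.0), Leaf(3.0)))
    model(tree)   # 6.0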

Case Studies

JuliaCon – 300 attendees, 150 talks

Julia: https://github.com/JuliaGPU/
NVIDIA Parallel Forall blog

https://github.com/FluxML/