Distributed computing with Julia (Day 2)

May 23rd, 2018 09:00-11:00 UNISA

Przemek Szufel
https://szufel.pl/
Materials for this course: https://szufel.pl/unisa/

Day 2 Agenda

• Parallelizing Julia on a single machine
  • SIMD in Julia
  • Threading
    • Configuring the threading mechanism in Julia
    • Multithreaded code efficiency issues
• Multiprocessing
  • Local multiprocessing
  • Parallelizing loops
  • Introduction to interprocess communication issues

JuliaBox – the easiest way to start (pure cloud: https://juliabox.com)

Learning more about Julia

• Website: https://julialang.org/
• Learning materials: https://julialang.org/learning/
• Where it is taught: https://julialang.org/teaching/
• Blogs about Julia: https://www.juliabloggers.com/
• Julia forum: https://discourse.julialang.org/
• Q&A for Julia: https://stackoverflow.com/questions/tagged/julia-lang

Parallelization options in programming languages

• Single instruction, multiple data (SIMD)
• Green threads
• Multi-threading
  • Language
  • Libraries
• Multi-processing
  • single machine
  • distributed (cluster)
  • distributed (cluster) via external tools

SIMD

• Single instruction, multiple data (SIMD) describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Such machines exploit data-level parallelism, but not concurrency: there are simultaneous (parallel) computations, but only a single process (instruction) at a given moment.
Source: https://en.wikipedia.org/wiki/SIMD

Data level parallelism (1_dot/dot_simd.jl)

function dot1(x, y)
    s = 0.0
    for i in 1:length(x)
        @inbounds s += x[i]*y[i]
    end
    s
end

function dot2(x, y)
    s = 0.0
    @simd for i in 1:length(x)
        @inbounds s += x[i]*y[i]
    end
    s
end

Image source: https://en.wikipedia.org/wiki/SIMD

Dot product: output

$ julia 1_dot/dot_simd.jl
dot1 elapsed time: 0.832743291 seconds
dot2 elapsed time: 0.303591816 seconds
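Whether @simd actually vectorizes the loop depends on the CPU and the Julia version; one way to check is to inspect the generated LLVM IR (a quick sketch, not part of the course files):

x = rand(10^6); y = rand(10^6)
@code_llvm dot2(x, y)   # vectorized code shows <4 x double> (or wider) operations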

Green threading

• In computer programming, green threads are threads that are scheduled by a runtime library or virtual machine (VM) instead of natively by the underlying operating system (OS). Green threads emulate multithreaded environments without relying on any native OS capabilities, and they are managed in user space instead of kernel space, enabling them to work in environments that do not have native thread support.
Source: https://en.wikipedia.org/wiki/Green_threads
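In Julia, green threads are Tasks, created for example with @async; they cooperatively interleave on a single OS thread. A minimal sketch (not from the course files):

t = @async begin
    sleep(1)                # yields control while waiting
    println("task done")
end
println("main continues")   # prints immediately, before "task done"
wait(t)                     # block until the task finishes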

A simple web server with green threading (2_web/webserver.jl)

@async begin
    server = listen(8080)
    while true
        sock = accept(server)
        @async begin
            data = readline(sock)
            print("Got request\n", data, "\n")
            header = "\nHTTP/1.1 200 OK\nContent-Type: text/html\n\n"
            message = string("Hello from Julia at ", now())
            write(sock, string(header, message))
            close(sock)
        end
    end
end

Comparison of parallelism types

Threading:
• Single process (cheap)
• Shared memory
• Number of threads running simultaneously limited by number of processors
• Possible issues with locking and false sharing

Multiprocessing:
• Multiple processes
• Separate memory
• Number of processes running simultaneously limited by cluster size
• Possible issues if inter-process communication is needed

Threading

Simple example – threading (3_sum/sum_thread.jl)

Single threaded:

function ssum(x)
    r, c = size(x)
    y = zeros(c)
    for i in 1:c
        for j in 1:r
            y[i] += x[j, i]
        end
    end
    y
end

Multithreaded:

function tsum(x)
    r, c = size(x)
    y = zeros(c)
    Threads.@threads for i in 1:c
        for j in 1:r
            y[i] += x[j, i]
        end
    end
    y
end

Sum: output

$ 3_sum/run_sum_thread.sh
threads: 1
  1.147527 seconds (4.71 k allocations: 420.445 KiB)
  1.132901 seconds (6 allocations: 156.484 KiB)
  1.207195 seconds (10.22 k allocations: 696.149 KiB)
  1.179634 seconds (7 allocations: 156.531 KiB)
threads: 2
  1.147714 seconds (4.71 k allocations: 420.445 KiB)
  1.133718 seconds (6 allocations: 156.484 KiB)
  0.620536 seconds (10.22 k allocations: 696.149 KiB)
  0.592958 seconds (7 allocations: 156.531 KiB)
threads: 16
  1.147191 seconds (4.71 k allocations: 420.445 KiB)
  1.132812 seconds (6 allocations: 156.484 KiB)
  0.175705 seconds (10.22 k allocations: 696.149 KiB)   (delta is compilation time)
  0.084011 seconds (7 allocations: 156.531 KiB)

Threading: synchronization (4_locking/locking.jl)

Increment x a total of 10^7 times using threads, with three synchronization mechanisms (a sketch of each follows below):
• Atomic operations
• SpinLock (busy waiting)
• Mutex (OS-provided lock)
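The source of 4_locking/locking.jl is not reproduced on the slides. A minimal sketch of what the four benchmarked variants could look like (the function names follow the output below; the exact bodies are an assumption, using the Julia 0.6-era Threads API this course targets):

function f_bad(n)
    x = 0
    Threads.@threads for i in 1:n
        x += 1                     # unsynchronized read-modify-write: updates get lost
    end
    x
end

function f_atomic(n)
    x = Threads.Atomic{Int}(0)
    Threads.@threads for i in 1:n
        Threads.atomic_add!(x, 1)  # hardware atomic increment
    end
    x[]
end

function f_spin(n)
    x = 0
    l = Threads.SpinLock()         # busy-waiting lock
    Threads.@threads for i in 1:n
        lock(l); x += 1; unlock(l)
    end
    x
end

function f_mutex(n)
    x = 0
    l = Threads.Mutex()            # OS-provided lock
    Threads.@threads for i in 1:n
        lock(l); x += 1; unlock(l)
    end
    x
end

With one thread all four variants return 10^7; with 16 threads f_bad loses updates, as the output below shows.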

Locking: output on c4.4xlarge (16 vCPU)

$ 4_locking/run_locking.sh

1 thread:
f_bad
10000000
  0.498997 seconds (10.01 M allocations: 153.318 MiB, 49.89% gc time)
10000000
  0.198711 seconds (10.00 M allocations: 152.580 MiB, 3.04% gc time)
f_atomic
10000000
  0.082628 seconds (7.54 k allocations: 403.376 KiB)
10000000
  0.059487 seconds (11 allocations: 288 bytes)
f_spin
10000000
  0.286315 seconds (10.01 M allocations: 153.074 MiB, 2.25% gc time)
10000000
  0.257490 seconds (10.00 M allocations: 152.580 MiB, 1.52% gc time)
f_mutex
10000000
  0.557977 seconds (10.01 M allocations: 153.260 MiB, 1.17% gc time)
10000000
  0.491197 seconds (10.00 M allocations: 152.580 MiB, 1.02% gc time)

16 threads:
f_bad
950043
  0.449196 seconds (1.63 M allocations: 27.759 MiB)
630661
  0.922549 seconds (1.52 M allocations: 26.963 MiB, 61.86% gc time)
f_atomic
10000000
  0.217921 seconds (7.54 k allocations: 403.376 KiB)
10000000
  0.187748 seconds (12 allocations: 688 bytes)
f_spin
10000000
  2.238537 seconds (10.01 M allocations: 153.074 MiB, 15.81% gc time)
10000000
  1.602330 seconds (10.00 M allocations: 152.581 MiB, 19.85% gc time)
f_mutex
10000000
  4.862945 seconds (10.01 M allocations: 153.260 MiB, 3.67% gc time)
10000000
  4.662214 seconds (10.00 M allocations: 152.580 MiB)

Threading: false sharing (5_falsesharing/falsesharing.jl)

• Calculate the sum of 12 × 10^8 ones (the result should be 12 × 10^8)
• False sharing: threads modify independent variables that share the same cache line (see the sketch below)
• Caution:
  • threading performance is sometimes hard to predict
  • adding cores does not always help
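The source of 5_falsesharing/falsesharing.jl is not shown on the slides; a minimal sketch of the idea (an assumption; the function and parameter names are made up): each thread increments its own slot of a shared array, and the spacing between slots decides whether the slots land on the same cache line.

function fsum(n, space)
    nt = Threads.nthreads()
    acc = zeros(Int, nt * space)       # slot for thread t is acc[(t-1)*space + 1]
    Threads.@threads for t in 1:nt
        id = Threads.threadid()
        @inbounds for i in 1:div(n, nt)
            acc[(id - 1) * space + 1] += 1
        end
    end
    sum(acc)
end

With space = 1 the per-thread counters are adjacent and share cache lines, so adding threads can even slow things down; with a spacing of one cache line (8 Int64s = 64 bytes) or more, each thread owns its line, which is what the chart below illustrates.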

False sharing: output on c4.4xlarge (16 vCPU)

$ 5_falsesharing/run_falsesharing.sh

[Chart: execution time in seconds (0 to 2.5) vs. spacing between per-thread counters (1 to 4096), with series for 1, 2, 3 and 4 threads]

Exercise I (6_sums/sums.jl)

Parallelize the code summing 10^8 random numbers. Compare the performance with the built-in sum function.

sums: output on c4.4xlarge (16 vCPU)

$ sh ~/fields/2_3/5_sums/solution/run_sums.sh
threads: 1
sum  0.070023 seconds (1 allocation: 16 bytes)
s    0.116564 seconds (1 allocation: 16 bytes)
s2   0.117371 seconds (3 allocations: 144 bytes)
s3   0.071282 seconds (1 allocation: 16 bytes)
s4   0.069734 seconds (47 allocations: 1.891 KiB)
threads: 2
sum  0.069787 seconds (1 allocation: 16 bytes)
s    0.116970 seconds (1 allocation: 16 bytes)
s2   0.061721 seconds (3 allocations: 144 bytes)
s3   0.071210 seconds (1 allocation: 16 bytes)
s4   0.039898 seconds (48 allocations: 1.953 KiB)
threads: 16
sum  0.080793 seconds (1 allocation: 16 bytes)
s    0.113512 seconds (1 allocation: 16 bytes)
s2   0.020983 seconds (3 allocations: 256 bytes)
s3   0.078284 seconds (1 allocation: 16 bytes)
s4   0.019935 seconds (52 allocations: 2.578 KiB)
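The solution file is not reproduced here; one possible threaded variant, as a sketch (an assumption; the actual s/s2/s3/s4 implementations in 6_sums/solution may differ, and the name threaded_sum is made up), accumulates per-thread partial sums and combines them at the end:

function threaded_sum(x)
    nt = Threads.nthreads()
    parts = zeros(eltype(x), nt)
    Threads.@threads for t in 1:nt
        s = zero(eltype(x))
        @inbounds for i in t:nt:length(x)   # strided partition of the indices
            s += x[i]
        end
        parts[t] = s
    end
    sum(parts)
end

Keeping each partial sum in a thread-local variable s (rather than updating parts[t] in the inner loop) also avoids false sharing on the parts array.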

Example – multiprocessing (7_rand/rand_process.jl)

using BenchmarkTools

function s_rand()
    n = 10^4
    x = 0.0
    for i in 1:n
        x += sum(rand(10^4))
    end
    x / n
end

function p_rand()
    n = 10^4
    x = @parallel (+) for i in 1:n
        sum(rand(10^4))
    end
    x / n
end

@time s_rand()
@time s_rand()
@time p_rand()
@time p_rand()

$ julia -p $(nproc) rand_process.jl

Parallelizing Julia code

• @parallel
• @spawnat
• @everywhere
• @async
• @sync
• fetch()
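@parallel is shown in the rand example above; a minimal sketch of how the other primitives fit together (Julia 0.6-era syntax; assumes the session was started with at least two workers, e.g. julia -p 2):

@everywhere f(x) = x^2    # define f on the master and on every worker
r = @spawnat 2 f(10)      # run f(10) on worker 2; returns a Future immediately
println(fetch(r))         # fetch blocks until the result arrives: 100
@sync begin               # wait until all enclosed @async tasks finish
    @async println(fetch(@spawnat 2 myid()))
    @async println(fetch(@spawnat 3 myid()))
end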

Rand: output

$ 3_rand/run_rand_process.sh
0.381071 seconds (46.21 k allocations: 765.124 MiB, 37.20% gc time)
0.161149 seconds (20.00 k allocations: 763.703 MiB, 9.64% gc time)
1.661893 seconds (230.81 k allocations: 12.494 MiB, 0.15% gc time)   (delta is compilation and process spawning time)
0.092413 seconds (1.89 k allocations: 155.766 KiB)

Full example – Asian option pricing (8_asianoption/*)

• An Asian option (or average value option) is a special type of option contract. For Asian options the payoff is determined by the average underlying price over some pre-set period of time.

• An asset has a known price $X_0$ at time 0
• Let $X_t$ denote the asset price at time $t$
• We have to calculate the value $v$ of an Asian option exercisable at time $T$:

$$v = E\left[e^{-rT} \max(\bar{X} - K, 0)\right], \qquad \text{where } \bar{X} = \frac{1}{T}\int_0^T X_t \, dt$$

What is geometric Brownian motion (GBM)?

Formally: $\ln \frac{X_p}{X_q}$ has a normal distribution.

Intuitively: the percentage price change is normally distributed.

Numerical approximation of $v$

• Replace $\bar{X}$ by its approximation over $m$ discrete periods:
  $$\hat{x} = \frac{1}{m} \sum_{i=1}^{m} X_{i\Delta}, \qquad \Delta = T/m$$
• Assume that the process $X_t$ is a geometric Brownian motion with drift $r$ and volatility $\sigma^2$:
  $$X_{(i+1)\Delta} = X_{i\Delta} \exp\left(\left(r - \frac{\sigma^2}{2}\right)\Delta + \sigma\sqrt{\Delta}\, Z_i\right), \qquad Z_i \sim N(0, 1)$$
• Average $n$ independent samples of $e^{-rT} \max(\hat{x} - K, 0)$

Example implementations (8_asianoption/*)

• asianoption.jl – single CPU
• asianoption_thread.jl – threaded
• asianoption_parallel.jl – pmap
• asianoption_parallel2.jl – @parallel

Single process (Julia):

function v_asian_sample(T, r, K, σ, X₀, m::Integer)
    X = X₀
    hatx = zero(X)
    Δ = T / m
    for i in 1:m
        X *= exp((r - σ^2/2)*Δ + σ*√Δ*randn())
        hatx += X
    end
    exp(-r*T)*max(hatx/m - K, 0)
end

function v_asian(T, r, K, σ, X₀, m, n)
    mean(v_asian_sample(T, r, K, σ, X₀, m) for i in 1:n)
end

Julia: using all cores of your cluster by adding a single @parallel command:

function v_asian(T, r, K, σ, X₀, m, n)
    res = @parallel (+) for i in 1:n
        X = X₀
        hatx = zero(X)
        Δ = T / m
        for j in 1:m
            X *= exp((r - σ^2/2)*Δ + σ*√Δ*randn())
            hatx += X
        end
        exp(-r*T)*max(hatx/m - K, 0)
    end
    res / n
end
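The threaded variant (asianoption_thread.jl) is not reproduced on the slides. A minimal sketch of what it might look like (an assumption; the real file differs, in particular around the deepcopy discussed below, and the name v_asian_threads is made up): each thread draws from its own RNG, since sharing one global RNG across threads is neither safe nor fast.

function v_asian_threads(T, r, K, σ, X₀, m, n)
    nt = Threads.nthreads()
    rngs = [MersenneTwister(1000 + t) for t in 1:nt]   # one RNG per thread (Julia >= 0.7: using Random)
    part = zeros(nt)
    Threads.@threads for t in 1:nt
        rng = rngs[t]
        acc = 0.0
        Δ = T / m
        for s in 1:div(n, nt)
            X = X₀
            hatx = zero(X)
            for i in 1:m
                X *= exp((r - σ^2/2)*Δ + σ*√Δ*randn(rng))
                hatx += X
            end
            acc += exp(-r*T)*max(hatx/m - K, 0)
        end
        part[t] = acc
    end
    sum(part) / n
end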

Asian option: output

$ 4_asianoption/run_asianoption.sh
one CPU      => 3.065627163 (2.042850249101069)
threads: 1   => 3.017402595 (2.124517036279159)
pmap: 1      => 3.025395902 (2.105147813902863)
@parallel 1  => 3.368927703 (2.074390346543304)
threads: 4   => 0.75593904  (2.086799284142133)
pmap: 4      => 0.772638171 (2.10637071660848)
@parallel 4  => 0.844037915 (2.0170305511851563)
threads: 16  => 0.339850353 (2.0950297040301478)
pmap: 16     => 0.340659614 (2.0348591986978257)
@parallel 16 => 0.3597416   (1.9945444607076115)

Threading is sensitive to details

Comment out lines 12 to 14 in the asianoption_thread.jl file (# is the line comment character).

Asian option: removed deepcopy

$ 4_asianoption/run_asianoption.sh
one CPU      => 3.058250557 (2.059417988239589)
threads: 1   => 3.221794995 (2.124517036279159)
pmap: 1      => 3.062728093 (1.9882438408137915)
@parallel 1  => 3.357377681 (2.115851059772093)
threads: 4   => 0.752315223 (2.086799284142133)
pmap: 4      => 0.769645512 (2.0033415255787226)
@parallel 4  => 0.845096571 (1.995403114618923)
threads: 16  => 0.906846006 (2.0950297040301478)
pmap: 16     => 0.338896529 (1.947256674358549)
@parallel 16 => 0.440467925 (2.0661612812557806)

Asian option: R (dump from the console)

> v_asian_sample <- function(T, r, K, sigma, X0, m) {
+   Delta <- T/m
+   X <- rnorm(m, (r-sigma^2/2)*Delta, sigma*sqrt(Delta))
+   exp(-r*T)*max(mean(exp(cumsum(X)))*X0 - K, 0)
+ }
> v_asian <- function(T, r, K, sigma, X0, m, n) {
+   v <- replicate(n, v_asian_sample(T, r, K, sigma, X0, m))
+   mean(v)
+ }
> system.time(print(v_asian(1, 0.05, 55, 0.3, 50, 20000, 10000)))
[1] 2.109691
   user  system elapsed
  33.30    0.00   33.29

Exercise II (9_pi/spi.jl)

You have code approximating π using 4 × 10^8 iterations.

Write a multithreaded and a multiprocessing version of the code and compare their performance.
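One possible multiprocessing version, as a sketch (the provided solution in 9_pi/solution may differ; Julia 0.6-era @parallel syntax, run with julia -p <n>):

function p_pi(n)
    hits = @parallel (+) for i in 1:n
        rand()^2 + rand()^2 <= 1.0 ? 1 : 0   # is the random point inside the quarter circle?
    end
    4 * hits / n
end

p_pi(4 * 10^8)   # ≈ π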

Pi: output

$ 5_pi/solution/run_pi.sh
one CPU
  1.736201 seconds (5.39 k allocations: 374.625 KiB)
  1.703619 seconds (5 allocations: 176 bytes)
processes: 16
  1.197691 seconds (831.79 k allocations: 44.760 MiB, 4.50% gc time)
  0.216941 seconds (1.67 k allocations: 242.125 KiB)
threads: 16
  0.287703 seconds (15.46 k allocations: 852.435 KiB)
  0.222791 seconds (8 allocations: 640 bytes)

Sudoku

Sudoku as an MIP optimization problem

• Decision variables
  • $x_{i,j,k} \in \{0,1\}$, where $i, j, k \in \{1, 2, \ldots, 9\}$
  • If $x_{i,j,k} = 1$ then cell $(i, j)$ contains the number $k$
• Constraints (see the sketch below)
  1. $\sum_{k=1}^{9} x_{i,j,k} = 1$ (each cell contains exactly one number)
  2. $\sum_{i=1}^{9} x_{i,j,k} = 1$ (each column contains exactly one entry of each number)
  3. $\sum_{j=1}^{9} x_{i,j,k} = 1$ (each row contains exactly one entry of each number)
  4. $\sum_{i=a}^{a+2} \sum_{j=b}^{b+2} x_{i,j,k} = 1$, where $a, b \in \{1, 4, 7\}$ (each 3 × 3 square contains exactly one entry of each number)
  5. $x_{i,j,k} = 1$ for cells that are filled in the given problem
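A sketch of this model in JuMP (an assumption: JuMP 0.18-era syntax with the Cbc solver; the course's A_sudoku code may implement the solver differently):

using JuMP, Cbc

function solve_sudoku(grid)                     # grid: 9×9 Array{Int}, 0 = empty cell
    model = Model(solver = CbcSolver())
    @variable(model, x[1:9, 1:9, 1:9], Bin)
    @constraint(model, [i = 1:9, j = 1:9], sum(x[i, j, k] for k in 1:9) == 1)   # constraint 1
    @constraint(model, [j = 1:9, k = 1:9], sum(x[i, j, k] for i in 1:9) == 1)   # constraint 2
    @constraint(model, [i = 1:9, k = 1:9], sum(x[i, j, k] for j in 1:9) == 1)   # constraint 3
    @constraint(model, [a in [1, 4, 7], b in [1, 4, 7], k = 1:9],               # constraint 4
                sum(x[i, j, k] for i in a:a+2, j in b:b+2) == 1)
    for i in 1:9, j in 1:9                                                      # constraint 5
        grid[i, j] != 0 && @constraint(model, x[i, j, grid[i, j]] == 1)
    end
    solve(model)
    [findfirst(getvalue(x[i, j, :]) .> 0.5) for i in 1:9, j in 1:9]
end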

https://www.kaggle.com/bryanpark/sudoku

• 1,000,000 sudoku problems with solutions (sudoku.csv)

quizzes,solutions
004300209005009001070060043006002087190007400050083000600000105003508690042910300,864371259325849761971265843436192587198657432257483916689734125713528694542916378
040100050107003960520008000000000017000906800803050620090060543600080700250097100,346179258187523964529648371965832417472916835813754629798261543631485792254397186
600120384008459072000006005000264030070080006940003000310000050089700000502000190,695127384138459672724836915851264739273981546946573821317692458489715263562348197
497200000100400005000016098620300040300900000001072600002005870000600004530097061,497258316186439725253716498629381547375964182841572639962145873718623954534897261
005910308009403060027500100030000201000820007006007004000080000640150700890000420,465912378189473562327568149738645291954821637216397854573284916642159783891736425

• Our challenge: write a sudoku solver and check if its output is correct

One line of data

004300209005009001070060043006002087190007400050083000600000105003508690042910300,864371259325849761971265843436192587198657432257483916689734125713528694542916378

004300209    864371259
005009001    325849761
070060043    971265843
006002087    436192587
190007400 ⇒ 198657432
050083000    257483916
600000105    689734125
003508690    713528694
042910300    542916378
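Turning one 81-character quiz string into a 9 × 9 grid is a one-liner; a sketch (the helper name parse_board is made up, not from the course files):

parse_board(s) = [Int(s[(i - 1) * 9 + j] - '0') for i in 1:9, j in 1:9]

board = parse_board("004300209005009001070060043006002087190007400050083000600000105003508690042910300")
board[1, 3]   # 4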

Typical approach for distributed processing

• Define a “reasonably large” work package
  • In our case one job is 1,000 sudoku problems (~10 seconds)
  • We have 100 such jobs (~20 mins on a single core)

• Julia distributed cluster manager
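A sketch of how the jobs could be farmed out with pmap (an assumption: the names jobs, quizzes and solve_many are hypothetical; the actual sudoku_master.jl may be organized differently):

jobs = [(1000*(j - 1) + 1):(1000*j) for j in 1:100]   # 100 ranges of 1,000 puzzle indices
results = pmap(r -> solve_many(quizzes[r]), jobs)     # each worker pulls the next job as it finishes

pmap load-balances dynamically, which suits jobs of uneven duration.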

Running single process

$ julia A_sudoku/sudoku_master.jl
Loading code on worker nodes ... elapsed time: 10.534820255 seconds
Executing computations ... elapsed time: 703.569770363 seconds
All OK

Running multiple processes

$ julia -p 16 A_sudoku/sudoku_master.jl

END OF DAY 2…