Distributed computing with Julia (Day 2)

May 23rd, 2018 09:00-11:00 UNISA

Przemek Szufel
https://szufel.pl/
Materials for this course: https://szufel.pl/unisa/

Day 2 Agenda

• Parallelizing Julia on a single machine
  • SIMD in Julia
  • Threading
    • Configuring the threading mechanism in Julia
    • Multithreaded code efficiency issues
• Multiprocessing
  • Local multiprocessing
  • Parallelizing loops
  • Introduction to interprocess communication issues

JuliaBox – the easiest way to start (pure cloud: https://juliabox.com)

Learning more about Julia

• Website: https://julialang.org/
• Learning materials: https://julialang.org/learning/
• Where it is taught: https://julialang.org/teaching/
• Blogs about Julia: https://www.juliabloggers.com/
• Julia forum: https://discourse.julialang.org/
• Q&A for Julia: https://stackoverflow.com/questions/tagged/julia-lang

Parallelization options in programming languages

• Single instruction, multiple data (SIMD)
• Green threads
• Multi-threading
  • Language
  • Libraries
• Multi-processing
  • single machine
  • distributed (cluster)
  • distributed (cluster) via external tools

SIMD

• Single instruction, multiple data (SIMD) describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Such machines exploit data-level parallelism, but not concurrency: there are simultaneous (parallel) computations, but only a single process (instruction) at a given moment.
Source: https://en.wikipedia.org/wiki/SIMD

Data level parallelism (1_dot/dot_simd.jl)

function dot1(x, y)
    s = 0.0
    for i in 1:length(x)
        @inbounds s += x[i]*y[i]
    end
    s
end

function dot2(x, y)
    s = 0.0
    @simd for i in 1:length(x)
        @inbounds s += x[i]*y[i]
    end
    s
end

Image source: https://en.wikipedia.org/wiki/SIMD

Dot product: output

$ julia 1_dot/dot_simd.jl
dot1 elapsed time: 0.832743291 seconds
dot2 elapsed time: 0.303591816 seconds
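Whether @simd actually vectorizes the loop depends on the CPU and the Julia version; one way to check is to inspect the generated LLVM IR (a quick sketch, not part of the course files):

x = rand(10^6); y = rand(10^6)
@code_llvm dot2(x, y)   # vectorized code shows <4 x double> (or wider) operations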

Green threading

• In computer programming, green threads are threads that are scheduled by a runtime library or virtual machine (VM) instead of natively by the underlying operating system (OS). Green threads emulate multithreaded environments without relying on any native OS capabilities, and they are managed in user space instead of kernel space, enabling them to work in environments that do not have native thread support.
Source: https://en.wikipedia.org/wiki/Green_threads
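In Julia, green threads are Tasks, created for example with @async; they cooperatively interleave on a single OS thread. A minimal sketch (not from the course files):

t = @async begin
    sleep(1)                # yields control while waiting
    println("task done")
end
println("main continues")   # prints immediately, before "task done"
wait(t)                     # block until the task finishes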

A simple web server with green threading (2_web/webserver.jl)

@async begin
    server = listen(8080)
    while true
        sock = accept(server)
        @async begin
            data = readline(sock)
            print("Got request\n", data, "\n")
            header = "\nHTTP/1.1 200 OK\nContent-Type: text/html\n\n"
            message = string("Hello from Julia at ", now())
            write(sock, string(header, message))
            close(sock)
        end
    end
end

Comparison of parallelism types

Threading:
• Single process (cheap)
• Shared memory
• Number of threads running simultaneously limited by number of processors
• Possible issues with locking and false sharing

Multiprocessing:
• Multiple processes
• Separate memory
• Number of processes running simultaneously limited by cluster size
• Possible issues if inter-process communication is needed

Threading

Simple example – threading (3_sum/sum_thread.jl)

Single threaded:

function ssum(x)
    r, c = size(x)
    y = zeros(c)
    for i in 1:c
        for j in 1:r
            y[i] += x[j, i]
        end
    end
    y
end

Multithreaded:

function tsum(x)
    r, c = size(x)
    y = zeros(c)
    Threads.@threads for i in 1:c
        for j in 1:r
            y[i] += x[j, i]
        end
    end
    y
end

Sum: output

$ 3_sum/run_sum_thread.sh
threads: 1
  1.147527 seconds (4.71 k allocations: 420.445 KiB)
  1.132901 seconds (6 allocations: 156.484 KiB)
  1.207195 seconds (10.22 k allocations: 696.149 KiB)
  1.179634 seconds (7 allocations: 156.531 KiB)
threads: 2
  1.147714 seconds (4.71 k allocations: 420.445 KiB)
  1.133718 seconds (6 allocations: 156.484 KiB)
  0.620536 seconds (10.22 k allocations: 696.149 KiB)
  0.592958 seconds (7 allocations: 156.531 KiB)
threads: 16
  1.147191 seconds (4.71 k allocations: 420.445 KiB)
  1.132812 seconds (6 allocations: 156.484 KiB)
  0.175705 seconds (10.22 k allocations: 696.149 KiB)   (delta is compilation time)
  0.084011 seconds (7 allocations: 156.531 KiB)

Threading: synchronization (4_locking/locking.jl)

Increment x a total of 10^7 times using threads, with three synchronization mechanisms (a sketch of each follows below):
• Atomic operations
• SpinLock (busy waiting)
• Mutex (OS-provided lock)
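The source of 4_locking/locking.jl is not reproduced on the slides. A minimal sketch of what the four benchmarked variants could look like (the function names follow the output below; the exact bodies are an assumption, using the Julia 0.6-era Threads API this course targets):

function f_bad(n)
    x = 0
    Threads.@threads for i in 1:n
        x += 1                     # unsynchronized read-modify-write: updates get lost
    end
    x
end

function f_atomic(n)
    x = Threads.Atomic{Int}(0)
    Threads.@threads for i in 1:n
        Threads.atomic_add!(x, 1)  # hardware atomic increment
    end
    x[]
end

function f_spin(n)
    x = 0
    l = Threads.SpinLock()         # busy-waiting lock
    Threads.@threads for i in 1:n
        lock(l); x += 1; unlock(l)
    end
    x
end

function f_mutex(n)
    x = 0
    l = Threads.Mutex()            # OS-provided lock
    Threads.@threads for i in 1:n
        lock(l); x += 1; unlock(l)
    end
    x
end

With one thread all four variants return 10^7; with 16 threads f_bad loses updates, as the output below shows.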

Locking: output on c4.4xlarge (16 vCPU)

$ 4_locking/run_locking.sh

1 thread:
f_bad
10000000
  0.498997 seconds (10.01 M allocations: 153.318 MiB, 49.89% gc time)
10000000
  0.198711 seconds (10.00 M allocations: 152.580 MiB, 3.04% gc time)
f_atomic
10000000
  0.082628 seconds (7.54 k allocations: 403.376 KiB)
10000000
  0.059487 seconds (11 allocations: 288 bytes)
f_spin
10000000
  0.286315 seconds (10.01 M allocations: 153.074 MiB, 2.25% gc time)
10000000
  0.257490 seconds (10.00 M allocations: 152.580 MiB, 1.52% gc time)
f_mutex
10000000
  0.557977 seconds (10.01 M allocations: 153.260 MiB, 1.17% gc time)
10000000
  0.491197 seconds (10.00 M allocations: 152.580 MiB, 1.02% gc time)

16 threads:
f_bad
950043
  0.449196 seconds (1.63 M allocations: 27.759 MiB)
630661
  0.922549 seconds (1.52 M allocations: 26.963 MiB, 61.86% gc time)
f_atomic
10000000
  0.217921 seconds (7.54 k allocations: 403.376 KiB)
10000000
  0.187748 seconds (12 allocations: 688 bytes)
f_spin
10000000
  2.238537 seconds (10.01 M allocations: 153.074 MiB, 15.81% gc time)
10000000
  1.602330 seconds (10.00 M allocations: 152.581 MiB, 19.85% gc time)
f_mutex
10000000
  4.862945 seconds (10.01 M allocations: 153.260 MiB, 3.67% gc time)
10000000
  4.662214 seconds (10.00 M allocations: 152.580 MiB)

Threading: false sharing (5_falsesharing/falsesharing.jl)

• Calculate the sum of 12 × 10^8 ones (the result should be 12 × 10^8)
• False sharing: threads modify independent variables that share the same cache line (see the sketch below)
• Caution:
  • threading performance is sometimes hard to predict
  • adding cores does not always help
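The source of 5_falsesharing/falsesharing.jl is not shown on the slides; a minimal sketch of the idea (an assumption; the function and parameter names are made up): each thread increments its own slot of a shared array, and the spacing between slots decides whether the slots land on the same cache line.

function fsum(n, space)
    nt = Threads.nthreads()
    acc = zeros(Int, nt * space)       # slot for thread t is acc[(t-1)*space + 1]
    Threads.@threads for t in 1:nt
        id = Threads.threadid()
        @inbounds for i in 1:div(n, nt)
            acc[(id - 1) * space + 1] += 1
        end
    end
    sum(acc)
end

With space = 1 the per-thread counters are adjacent and share cache lines, so adding threads can even slow things down; with a spacing of one cache line (8 Int64s = 64 bytes) or more, each thread owns its line, which is what the chart below illustrates.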

False sharing: output on c4.4xlarge (16 vCPU)

$ 5_falsesharing/run_falsesharing.sh

[Chart: execution time in seconds (0 to 2.5) vs. spacing between per-thread counters (1 to 4096), with series for 1, 2, 3 and 4 threads]

Exercise I (6_sums/sums.jl)

Parallelize the code summing 10^8 random numbers. Compare the performance with the built-in sum function.

sums: output on c4.4xlarge (16 vCPU)

$ sh ~/fields/2_3/5_sums/solution/run_sums.sh
threads: 1
sum  0.070023 seconds (1 allocation: 16 bytes)
s    0.116564 seconds (1 allocation: 16 bytes)
s2   0.117371 seconds (3 allocations: 144 bytes)
s3   0.071282 seconds (1 allocation: 16 bytes)
s4   0.069734 seconds (47 allocations: 1.891 KiB)
threads: 2
sum  0.069787 seconds (1 allocation: 16 bytes)
s    0.116970 seconds (1 allocation: 16 bytes)
s2   0.061721 seconds (3 allocations: 144 bytes)
s3   0.071210 seconds (1 allocation: 16 bytes)
s4   0.039898 seconds (48 allocations: 1.953 KiB)
threads: 16
sum  0.080793 seconds (1 allocation: 16 bytes)
s    0.113512 seconds (1 allocation: 16 bytes)
s2   0.020983 seconds (3 allocations: 256 bytes)
s3   0.078284 seconds (1 allocation: 16 bytes)
s4   0.019935 seconds (52 allocations: 2.578 KiB)
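The solution file is not reproduced here; one possible threaded variant, as a sketch (an assumption; the actual s/s2/s3/s4 implementations in 6_sums/solution may differ, and the name threaded_sum is made up), accumulates per-thread partial sums and combines them at the end:

function threaded_sum(x)
    nt = Threads.nthreads()
    parts = zeros(eltype(x), nt)
    Threads.@threads for t in 1:nt
        s = zero(eltype(x))
        @inbounds for i in t:nt:length(x)   # strided partition of the indices
            s += x[i]
        end
        parts[t] = s
    end
    sum(parts)
end

Keeping each partial sum in a thread-local variable s (rather than updating parts[t] in the inner loop) also avoids false sharing on the parts array.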

Example – multiprocessing (7_rand/rand_process.jl)

using BenchmarkTools

function s_rand()
    n = 10^4
    x = 0.0
    for i in 1:n
        x += sum(rand(10^4))
    end
    x / n
end

function p_rand()
    n = 10^4
    x = @parallel (+) for i in 1:n
        sum(rand(10^4))
    end
    x / n
end

@time s_rand()
@time s_rand()
@time p_rand()
@time p_rand()

$ julia -p $(nproc) rand_process.jl

Parallelizing Julia code

• @parallel
• @spawnat
• @everywhere
• @async
• @sync
• fetch()
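@parallel is shown in the rand example above; a minimal sketch of how the other primitives fit together (Julia 0.6-era syntax; assumes the session was started with at least two workers, e.g. julia -p 2):

@everywhere f(x) = x^2    # define f on the master and on every worker
r = @spawnat 2 f(10)      # run f(10) on worker 2; returns a Future immediately
println(fetch(r))         # fetch blocks until the result arrives: 100
@sync begin               # wait until all enclosed @async tasks finish
    @async println(fetch(@spawnat 2 myid()))
    @async println(fetch(@spawnat 3 myid()))
end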

Rand: output

$ 3_rand/run_rand_process.sh
0.381071 seconds (46.21 k allocations: 765.124 MiB, 37.20% gc time)
0.161149 seconds (20.00 k allocations: 763.703 MiB, 9.64% gc time)
1.661893 seconds (230.81 k allocations: 12.494 MiB, 0.15% gc time)   (delta is compilation and process spawning time)
0.092413 seconds (1.89 k allocations: 155.766 KiB)

Full example – Asian option pricing (8_asianoption/*)

• An Asian option (or average value option) is a special type of option contract. For Asian options the payoff is determined by the average underlying price over some pre-set period of time.

• An asset has a known price $X_0$ at time 0
• Let $X_t$ denote the asset price at time $t$
• We have to calculate the value $v$ of an Asian option exercisable at time $T$:

$$v = E\left[e^{-rT} \max(\bar{X} - K, 0)\right], \qquad \text{where } \bar{X} = \frac{1}{T}\int_0^T X_t \, dt$$

What is geometric Brownian motion (GBM)?

Formally: $\ln \frac{X_p}{X_q}$ has a normal distribution.

Intuitively: the percentage price change is normally distributed.

Numerical approximation of $v$

• Replace $\bar{X}$ by its approximation over $m$ discrete periods:
  $$\hat{x} = \frac{1}{m} \sum_{i=1}^{m} X_{i\Delta}, \qquad \Delta = T/m$$
• Assume that the process $X_t$ is a geometric Brownian motion with drift $r$ and volatility $\sigma^2$:
  $$X_{(i+1)\Delta} = X_{i\Delta} \exp\left(\left(r - \frac{\sigma^2}{2}\right)\Delta + \sigma\sqrt{\Delta}\, Z_i\right), \qquad Z_i \sim N(0, 1)$$
• Average $n$ independent samples of $e^{-rT} \max(\hat{x} - K, 0)$

Example implementations (8_asianoption/*)

• asianoption.jl – single CPU
• asianoption_thread.jl – threaded
• asianoption_parallel.jl – pmap
• asianoption_parallel2.jl – @parallel

Single process (Julia):

function v_asian_sample(T, r, K, σ, X₀, m::Integer)
    X = X₀
    hatx = zero(X)
    Δ = T / m
    for i in 1:m
        X *= exp((r - σ^2/2)*Δ + σ*√Δ*randn())
        hatx += X
    end
    exp(-r*T)*max(hatx/m - K, 0)
end

function v_asian(T, r, K, σ, X₀, m, n)
    mean(v_asian_sample(T, r, K, σ, X₀, m) for i in 1:n)
end

Julia: using all cores of your cluster by adding a single @parallel command:

function v_asian(T, r, K, σ, X₀, m, n)
    res = @parallel (+) for i in 1:n
        X = X₀
        hatx = zero(X)
        Δ = T / m
        for j in 1:m
            X *= exp((r - σ^2/2)*Δ + σ*√Δ*randn())
            hatx += X
        end
        exp(-r*T)*max(hatx/m - K, 0)
    end
    res / n
end
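The threaded variant (asianoption_thread.jl) is not reproduced on the slides. A minimal sketch of what it might look like (an assumption; the real file differs, in particular around the deepcopy discussed below, and the name v_asian_threads is made up): each thread draws from its own RNG, since sharing one global RNG across threads is neither safe nor fast.

function v_asian_threads(T, r, K, σ, X₀, m, n)
    nt = Threads.nthreads()
    rngs = [MersenneTwister(1000 + t) for t in 1:nt]   # one RNG per thread (Julia >= 0.7: using Random)
    part = zeros(nt)
    Threads.@threads for t in 1:nt
        rng = rngs[t]
        acc = 0.0
        Δ = T / m
        for s in 1:div(n, nt)
            X = X₀
            hatx = zero(X)
            for i in 1:m
                X *= exp((r - σ^2/2)*Δ + σ*√Δ*randn(rng))
                hatx += X
            end
            acc += exp(-r*T)*max(hatx/m - K, 0)
        end
        part[t] = acc
    end
    sum(part) / n
end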

Asian option: output

$ 4_asianoption/run_asianoption.sh
one CPU      => 3.065627163 (2.042850249101069)
threads: 1   => 3.017402595 (2.124517036279159)
pmap: 1      => 3.025395902 (2.105147813902863)
@parallel 1  => 3.368927703 (2.074390346543304)
threads: 4   => 0.75593904  (2.086799284142133)
pmap: 4      => 0.772638171 (2.10637071660848)
@parallel 4  => 0.844037915 (2.0170305511851563)
threads: 16  => 0.339850353 (2.0950297040301478)
pmap: 16     => 0.340659614 (2.0348591986978257)
@parallel 16 => 0.3597416   (1.9945444607076115)

Threading is sensitive to details

Comment out lines 12 to 14 in the asianoption_thread.jl file (# is the line comment character).

Asian option: removed deepcopy

$ 4_asianoption/run_asianoption.sh
one CPU      => 3.058250557 (2.059417988239589)
threads: 1   => 3.221794995 (2.124517036279159)
pmap: 1      => 3.062728093 (1.9882438408137915)
@parallel 1  => 3.357377681 (2.115851059772093)
threads: 4   => 0.752315223 (2.086799284142133)
pmap: 4      => 0.769645512 (2.0033415255787226)
@parallel 4  => 0.845096571 (1.995403114618923)
threads: 16  => 0.906846006 (2.0950297040301478)
pmap: 16     => 0.338896529 (1.947256674358549)
@parallel 16 => 0.440467925 (2.0661612812557806)

Asian option: R (dump from the console)

> v_asian_sample <- function(T, r, K, sigma, X0, m) {
+   Delta <- T/m
+   X <- rnorm(m, (r-sigma^2/2)*Delta, sigma*sqrt(Delta))
+   exp(-r*T)*max(mean(exp(cumsum(X)))*X0 - K, 0)
+ }
> v_asian <- function(T, r, K, sigma, X0, m, n) {
+   v <- replicate(n, v_asian_sample(T, r, K, sigma, X0, m))
+   mean(v)
+ }
> system.time(print(v_asian(1, 0.05, 55, 0.3, 50, 20000, 10000)))
[1] 2.109691
   user  system elapsed
  33.30    0.00   33.29

Exercise II (9_pi/spi.jl)

You have code approximating π using 4 × 10^8 iterations.

Write a multithreaded and a multiprocessing version of the code and compare their performance.
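One possible multiprocessing version, as a sketch (the provided solution in 9_pi/solution may differ; Julia 0.6-era @parallel syntax, run with julia -p <n>):

function p_pi(n)
    hits = @parallel (+) for i in 1:n
        rand()^2 + rand()^2 <= 1.0 ? 1 : 0   # is the random point inside the quarter circle?
    end
    4 * hits / n
end

p_pi(4 * 10^8)   # ≈ π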

Pi: output

$ 5_pi/solution/run_pi.sh
one CPU
  1.736201 seconds (5.39 k allocations: 374.625 KiB)
  1.703619 seconds (5 allocations: 176 bytes)
processes: 16
  1.197691 seconds (831.79 k allocations: 44.760 MiB, 4.50% gc time)
  0.216941 seconds (1.67 k allocations: 242.125 KiB)
threads: 16
  0.287703 seconds (15.46 k allocations: 852.435 KiB)
  0.222791 seconds (8 allocations: 640 bytes)

Sudoku

Sudoku as an MIP optimization problem

• Decision variables
  • $x_{i,j,k} \in \{0,1\}$, where $i, j, k \in \{1, 2, \ldots, 9\}$
  • If $x_{i,j,k} = 1$ then cell $(i, j)$ contains the number $k$
• Constraints (see the sketch below)
  1. $\sum_{k=1}^{9} x_{i,j,k} = 1$ (each cell contains exactly one number)
  2. $\sum_{i=1}^{9} x_{i,j,k} = 1$ (each column contains exactly one entry of each number)
  3. $\sum_{j=1}^{9} x_{i,j,k} = 1$ (each row contains exactly one entry of each number)
  4. $\sum_{i=a}^{a+2} \sum_{j=b}^{b+2} x_{i,j,k} = 1$, where $a, b \in \{1, 4, 7\}$ (each 3 × 3 square contains exactly one entry of each number)
  5. $x_{i,j,k} = 1$ for cells that are filled in the given problem
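A sketch of this model in JuMP (an assumption: JuMP 0.18-era syntax with the Cbc solver; the course's A_sudoku code may implement the solver differently):

using JuMP, Cbc

function solve_sudoku(grid)                     # grid: 9×9 Array{Int}, 0 = empty cell
    model = Model(solver = CbcSolver())
    @variable(model, x[1:9, 1:9, 1:9], Bin)
    @constraint(model, [i = 1:9, j = 1:9], sum(x[i, j, k] for k in 1:9) == 1)   # constraint 1
    @constraint(model, [j = 1:9, k = 1:9], sum(x[i, j, k] for i in 1:9) == 1)   # constraint 2
    @constraint(model, [i = 1:9, k = 1:9], sum(x[i, j, k] for j in 1:9) == 1)   # constraint 3
    @constraint(model, [a in [1, 4, 7], b in [1, 4, 7], k = 1:9],               # constraint 4
                sum(x[i, j, k] for i in a:a+2, j in b:b+2) == 1)
    for i in 1:9, j in 1:9                                                      # constraint 5
        grid[i, j] != 0 && @constraint(model, x[i, j, grid[i, j]] == 1)
    end
    solve(model)
    [findfirst(getvalue(x[i, j, :]) .> 0.5) for i in 1:9, j in 1:9]
end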

https://www.kaggle.com/bryanpark/sudoku

• 1,000,000 sudoku problems with solutions (sudoku.csv)

quizzes,solutions
004300209005009001070060043006002087190007400050083000600000105003508690042910300,864371259325849761971265843436192587198657432257483916689734125713528694542916378
040100050107003960520008000000000017000906800803050620090060543600080700250097100,346179258187523964529648371965832417472916835813754629798261543631485792254397186
600120384008459072000006005000264030070080006940003000310000050089700000502000190,695127384138459672724836915851264739273981546946573821317692458489715263562348197
497200000100400005000016098620300040300900000001072600002005870000600004530097061,497258316186439725253716498629381547375964182841572639962145873718623954534897261
005910308009403060027500100030000201000820007006007004000080000640150700890000420,465912378189473562327568149738645291954821637216397854573284916642159783891736425

• Our challenge: write a sudoku solver and check if its output is correct

One line of data

004300209005009001070060043006002087190007400050083000600000105003508690042910300,864371259325849761971265843436192587198657432257483916689734125713528694542916378

004300209    864371259
005009001    325849761
070060043    971265843
006002087    436192587
190007400 ⇒ 198657432
050083000    257483916
600000105    689734125
003508690    713528694
042910300    542916378
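Turning one 81-character quiz string into a 9 × 9 grid is a one-liner; a sketch (the helper name parse_board is made up, not from the course files):

parse_board(s) = [Int(s[(i - 1) * 9 + j] - '0') for i in 1:9, j in 1:9]

board = parse_board("004300209005009001070060043006002087190007400050083000600000105003508690042910300")
board[1, 3]   # 4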

Typical approach for distributed processing

• Define a “reasonably large” work package
  • In our case one job is 1,000 sudoku problems (~10 seconds)
  • We have 100 such jobs (~20 mins on a single core)

• Julia distributed cluster manager
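A sketch of how the jobs could be farmed out with pmap (an assumption: the names jobs, quizzes and solve_many are hypothetical; the actual sudoku_master.jl may be organized differently):

jobs = [(1000*(j - 1) + 1):(1000*j) for j in 1:100]   # 100 ranges of 1,000 puzzle indices
results = pmap(r -> solve_many(quizzes[r]), jobs)     # each worker pulls the next job as it finishes

pmap load-balances dynamically, which suits jobs of uneven duration.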

Running single process

$ julia A_sudoku/sudoku_master.jl
Loading code on worker nodes ... elapsed time: 10.534820255 seconds
Executing computations ... elapsed time: 703.569770363 seconds
All OK

Running multiple processes

$ julia -p 16 A_sudoku/sudoku_master.jl

END OF DAY 2…