

2.5D algorithms for distributed-memory computing

Edgar Solomonik

UC Berkeley

July, 2012

Outline

Introduction
- Strong scaling

2.5D dense linear algebra
- 2.5D matrix multiplication
- 2.5D LU factorization
- 2.5D QR factorization

All-pairs shortest-paths

Symmetric tensor contractions

Conclusion

Solving science problems faster

Parallel computers can solve bigger problems
- weak scaling

Parallel computers can also solve a fixed problem faster
- strong scaling

Obstacles to strong scaling
- may increase the relative cost of communication
- may hurt load balance

How to reduce communication and maintain load balance?
- reduce (minimize) communication along the critical path
- exploit the network topology

Topology-aware multicasts (BG/P vs Cray)

[Figure: bandwidth (MB/sec) of a 1 MB multicast vs. #nodes on BG/P, Cray XT5, and Cray XE6]

2D rectangular multicast trees

[Figure: a 2D 4x4 torus with four edge-disjoint spanning trees from the root, and all four trees combined]

Blocking matrix multiplication

[Figure: blocked decomposition of A and B for matrix multiplication]

2D matrix multiplication [Cannon 69], [Van De Geijn and Watts 97]

On 16 CPUs (4x4):
- O(n³/p) flops
- O(n²/√p) words moved
- O(√p) messages
- O(n²/p) bytes of memory

[Figure: 2D blocked layout of A and B on the 4x4 processor grid]

3D matrix multiplication [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Berntsen 89], [McColl and Tiskin 99]

On 64 CPUs (4x4x4), with 4 copies of the matrices:
- O(n³/p) flops
- O(n²/p^(2/3)) words moved
- O(1) messages
- O(n²/p^(2/3)) bytes of memory

[Figure: 3D blocked layout of A and B on the 4x4x4 processor grid]

2.5D matrix multiplication [McColl and Tiskin 99]

On 32 CPUs (4x4x2), with 2 copies of the matrices:
- O(n³/p) flops
- O(n²/√(c·p)) words moved
- O(√(p/c³)) messages
- O(c·n²/p) bytes of memory

[Figure: 2.5D blocked layout of A and B on the 4x4x2 processor grid]
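To make the layered update concrete, here is a minimal serial Python sketch (illustrative, not the actual distributed implementation): the summation dimension is split across c layers, each layer forms a partial product, and a sum over layers plays the role of the inter-layer reduction in 2.5D MM.

import numpy as np

def mm_2p5d_layers(A, B, c):
    # Each of the c grid layers owns a slice of the summation (k) dimension
    # and forms a partial product; summing the partials mirrors the
    # inter-layer reduction of 2.5D MM. Assumes c divides A.shape[1].
    kb = A.shape[1] // c
    partials = [A[:, l*kb:(l+1)*kb] @ B[l*kb:(l+1)*kb, :] for l in range(c)]
    return sum(partials)

# e.g. mm_2p5d_layers(A, B, 2) equals A @ B up to roundoff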

2.5D strong scaling

n = dimension, p = #processors, c = #copies of data
- must satisfy 1 ≤ c ≤ p^(1/3)
- special case: c = 1 yields the 2D algorithm
- special case: c = p^(1/3) yields the 3D algorithm

cost(2.5D MM(p, c)) = O(n³/p) flops
                    + O(n²/√(c·p)) words moved
                    + O(√(p/c³)) messages*

cost(2D MM(p)) = O(n³/p) flops
               + O(n²/√p) words moved
               + O(√p) messages*
               = cost(2.5D MM(p, 1))

*ignoring log(p) factors

cost(2.5D MM(c·p, c)) = O(n³/(c·p)) flops
                      + O(n²/(c·√p)) words moved
                      + O(√p/c) messages
                      = cost(2D MM(p))/c

→ perfect strong scaling
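The perfect-strong-scaling identity can be checked mechanically from the leading-order cost terms; a minimal Python sketch (constants and log(p) factors ignored, names illustrative):

def mm_cost(n, p, c=1):
    # leading-order 2.5D MM costs: (flops, words moved, messages)
    flops = n**3 / p
    words = n**2 / (c * p) ** 0.5
    msgs = (p / c**3) ** 0.5
    return flops, words, msgs

# scaling p -> c*p while keeping c data copies divides every term by c
n, p, c = 65536, 256, 4
assert all(abs(a - b / c) < 1e-6 * b
           for a, b in zip(mm_cost(n, c * p, c), mm_cost(n, p)))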

Strong scaling matrix multiplication

[Figure: 2.5D MM on BG/P (n = 65,536); percentage of machine peak vs. p (256 to 2048) for 2.5D MM, 2D MM, and ScaLAPACK PDGEMM]

2.5D MM on 65,536 cores

[Figure: 2.5D MM on 16,384 nodes of BG/P; percentage of flops peak vs. n (8,192 to 131,072) for 2.5D MM and 2D MM]

Cost breakdown of MM on 65,536 cores

[Figure: matrix multiplication on 16,384 nodes of BG/P; execution time normalized by 2D, broken into communication, idle, and computation, for n = 8,192 and n = 131,072; 2.5D achieves a 95% reduction in communication]

2.5D recursive LU

A = L · U, where L is lower-triangular and U is upper-triangular

- A 2.5D recursive algorithm with no pivoting [A. Tiskin 2002]
- Tiskin gives the algorithm under the BSP (Bulk Synchronous Parallel) model
  - considers communication and synchronization
- We give an alternative distributed-memory adaptation and implementation
- We also lower-bound the latency cost

2D blocked LU factorization

[Figure: blocked LU progression on A: factor the leading block into L₀₀ and U₀₀, compute the corresponding panels of L and U, then form the Schur complement S = A − LU and continue]
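A serial numpy sketch of the blocked right-looking LU step pictured above (no pivoting; assumes the leading blocks are nonsingular; illustrative, not the distributed code):

import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, b):
    # Right-looking blocked LU without pivoting: overwrites a copy of A
    # with unit-lower-triangular L and upper-triangular U.
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # unblocked LU of the diagonal block: A[k:e,k:e] = L00 @ U00
        for j in range(k, e - 1):
            A[j+1:e, j] /= A[j, j]
            A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
        if e < n:
            # panel solves: L10 = A[e:,k:e] U00^{-1}, U01 = L00^{-1} A[k:e,e:]
            A[e:, k:e] = solve_triangular(A[k:e, k:e], A[e:, k:e].T,
                                          lower=False, trans='T').T
            A[k:e, e:] = solve_triangular(A[k:e, k:e], A[k:e, e:],
                                          lower=True, unit_diagonal=True)
            # Schur complement update: S = A - L U
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A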

2D block-cyclic decomposition

[Figure: block-cyclic distribution of the matrix over a 4x4 processor grid]

2D block-cyclic LU factorization

[Figure: LU progression on the block-cyclic layout: panels of L and U, then the Schur complement S = A − LU]

A new latency lower bound for LU

- Relate volume to surface along the critical path of diagonal blocks A₀₀, A₁₁, ..., A_{d−1,d−1}
- For block size n/d, LU does
  - Ω(n³/d²) flops
  - Ω(n²/d) words
  - Ω(d) messages
- Now pick d (= latency cost)
  - d = Ω(√p) to minimize flops
  - d = Ω(√(c·p)) to minimize words
- More generally, latency · bandwidth = Ω(n²)

2.5D LU factorization

[Figure: build-up of 2.5D LU across grid layers, steps (A)-(D): factorization of the leading block into L₀₀ and U₀₀, computation of the panels L₁₀, ..., L₃₀ and U₀₁, ..., U₀₃, and the distributed trailing update of L and U]


Look at how this update is distributed.

What does it remind you of?

It is the same 3D update as in 2.5D matrix multiplication.

[Figure: the Schur-complement update distributed across layers, matching the 3D update in 2.5D matrix multiplication]

2.5D LU strong scaling (without pivoting)

[Figure: 2.5D LU on BG/P (n = 65,536); percentage of machine peak vs. p (256 to 2048) for 2.5D LU (no pivoting) and 2D LU (no pivoting)]

2.5D LU with pivoting

A = P · L · U, where P is a permutation matrix

- 2.5D generic pairwise elimination (neighbor/pairwise pivoting or Givens rotations (QR)) [A. Tiskin 2007]
  - pairwise pivoting does not produce an explicit L
  - pairwise pivoting may have stability issues for large matrices
- Our approach uses tournament pivoting, which is more stable than pairwise pivoting and gives L explicitly
  - pass up rows of A instead of U to avoid error accumulation

Edgar Solomonik 2.5D algorithms 32/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion Tournament pivoting (CA-pivoting) {P, L, U} ← CA-pivot(A,n) if n ≤ b then base case {P, L, U} = partial-pivot(A) else recursive case T T [A1 , A2 ] = A {P1, L1, U1} = CA-pivot(A1) T T T [R1 , R2 ] = P1 A1 {P2, L2, U2} = CA-pivot(A2) T T T [S1 , S2 ] = P2 A2 T T {Pr , L, U} = partial-pivot([R1 , S1 ]) Form P from Pr , P1 and P2 end if

Tournament pivoting

Partial pivoting is not communication-optimal on a blocked matrix
- requires a message/synchronization for each column
- O(n) messages needed

Tournament pivoting is communication-optimal
- performs a tournament to determine the best pivot-row candidates
- passes up the 'best rows' of A (a serial sketch follows)
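A serial Python sketch of the tournament on an n-by-b panel (assumes nonzero pivots; function names are illustrative, not the library's API): each half nominates b candidate rows, and a partial-pivoted playoff among the 2b candidates selects the winners.

import numpy as np

def partial_pivot_rows(A, b):
    # Indices of the first b pivot rows chosen by Gaussian elimination
    # with partial pivoting (row swaps); assumes nonzero pivots.
    A = A.astype(float).copy()
    rows = list(range(A.shape[0]))
    for k in range(b):
        i = k + int(np.argmax(np.abs(A[k:, k])))
        A[[k, i]] = A[[i, k]]
        rows[k], rows[i] = rows[i], rows[k]
        A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])
    return rows[:b]

def ca_pivot_rows(A, b):
    # Tournament pivoting: each half nominates b candidate rows, then a
    # playoff among the 2b candidates picks the b winners.
    n = A.shape[0]
    if n <= 2 * b:
        return partial_pivot_rows(A, min(b, n))
    half = n // 2
    cand = ca_pivot_rows(A[:half], b) \
         + [half + i for i in ca_pivot_rows(A[half:], b)]
    return [cand[i] for i in partial_pivot_rows(A[cand], b)]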

2.5D LU factorization with tournament pivoting

[Figure: build-up of 2.5D LU with tournament pivoting: recursive tournaments on the panel copies PA₀, ..., PA₃ select pivot rows via local partial-pivoted factorizations; the winning rows define the panel factorization into the blocks L₁₀, ..., L₄₀ and U₀₀, ..., U₀₃; the trailing updates are then applied across layers]

Strong scaling of 2.5D LU with tournament pivoting

[Figure: 2.5D LU on BG/P (n = 65,536); percentage of machine peak vs. p (256 to 2048) for 2.5D LU (CA-pivoting), 2D LU (CA-pivoting), and ScaLAPACK PDGETRF]

2.5D LU on 65,536 cores

[Figure: LU on 16,384 nodes of BG/P (n = 131,072); time (sec) broken into communication, idle, and compute for 2D and 2.5D, with and without pivoting; 2.5D is 2x faster in both the no-pivoting and CA-pivoting cases]

2.5D QR factorization

A = Q · R, where Q is orthogonal and R is upper-triangular

- 2.5D QR using Givens rotations (generic pairwise elimination) is given by [A. Tiskin 2007]
- Tiskin minimizes latency and bandwidth by working on slanted panels
- 2.5D QR cannot be done with right-looking updates, as in 2.5D LU, due to the non-commutativity of orthogonalization updates

2.5D QR factorization using the YT representation

The YT representation of Householder QR factorization is more work-efficient when computing only R

- We give an algorithm that performs 2.5D QR using the YT representation
- The algorithm performs left-looking updates on Y
- Householder with YT requires less computation (roughly 2x less) than Givens rotations
- Our approach achieves optimal bandwidth cost, but has O(n) latency
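For reference, a serial sketch of Householder QR accumulated in YT form, Q = I − Y·T·Yᵀ (the forward accumulation of T is standard; assumes m ≥ n and full column rank; illustrative):

import numpy as np

def householder_yt(A):
    # Householder QR in YT form: returns Y, T, R with
    # A = (I - Y @ T @ Y.T) @ R.
    m, n = A.shape
    R = A.astype(float).copy()
    Y = np.zeros((m, n))
    T = np.zeros((n, n))
    for k in range(n):
        v = R[k:, k].copy()
        v[0] += (1.0 if v[0] >= 0 else -1.0) * np.linalg.norm(v)
        v /= np.linalg.norm(v)
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])  # apply I - 2 v v^T
        Y[k:, k] = v
        T[k, k] = 2.0
        # forward accumulation of T over the previous reflectors
        T[:k, k] = -2.0 * (T[:k, :k] @ (Y[:, :k].T @ Y[:, k]))
    return Y, T, np.triu(R)

# check: Q = np.eye(m) - Y @ T @ Y.T reproduces A ≈ Q @ R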

2.5D QR using the YT representation

[Figure: schematic of the 2.5D QR algorithm using the YT representation]

Latency-optimal 2.5D QR

To reduce latency, we can employ the TSQR algorithm (sketched in code after this slide):
1. Given an n-by-b panel, partition it into 2b-by-b blocks
2. Perform QR on each 2b-by-b block
3. Stack the computed Rs into an n/2-by-b panel and recurse
4. Q is given in a hierarchical representation

We then need the YT representation from the hierarchical Q:
- Take Y to be the first b columns of Q minus the identity
- Define T = (YᵀY − I)⁻¹
- This sacrifices the triangular structure of T
- Conjecture: stable if the diagonal elements of Q are selected to be negative
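A serial sketch of the TSQR reduction tree, returning only R (assumes the panel has at least b rows; in the real algorithm each tree node's local Q is kept to form the hierarchical Q):

import numpy as np

def tsqr_r(A, b):
    # Reduce an n-by-b panel to its b-by-b R factor with a binary tree:
    # QR each 2b-by-b block, stack the Rs (halving the rows), and recurse.
    n = A.shape[0]
    if n <= 2 * b:
        return np.linalg.qr(A)[1]
    Rs = [np.linalg.qr(A[i:i + 2*b])[1] for i in range(0, n, 2*b)]
    return tsqr_r(np.vstack(Rs), b)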

All-pairs shortest-paths

Given an input graph G = (V, E)
- Find shortest paths between each pair of nodes vᵢ, vⱼ
- Reduces to semiring matrix multiplication with a dependency along k
- Computational structure is similar to LU factorization

Semiring matrix multiplication (SMM)
- replace scalar multiply with scalar add
- replace scalar add with scalar min
- depending on the processor, can require more instructions
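In the tropical (min, +) semiring, a dense SMM is a one-liner in numpy (illustrative; use np.inf for absent edges):

import numpy as np

def smm(A, B):
    # (min,+) semiring matrix product: C[i,j] = min_k (A[i,k] + B[k,j])
    return np.min(A[:, :, None] + B[None, :, :], axis=1)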

A recursive algorithm for all-pairs shortest-paths

A known algorithm for recursively computing APSP (a serial sketch follows the list):
1. Given adjacency matrix A of graph G
2. Recurse on block A₁₁
3. Compute SMM A₁₂ ← A₁₁ · A₁₂
4. Compute SMM A₂₁ ← A₂₁ · A₁₁
5. Compute SMM A₂₂ ← A₂₁ · A₁₂
6. Recurse on block A₂₂
7. Compute SMM A₂₁ ← A₂₂ · A₂₁
8. Compute SMM A₁₂ ← A₁₂ · A₂₂
9. Compute SMM A₁₁ ← A₁₂ · A₂₁
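A serial sketch of this recursion using smm from above (distance matrix with 0 on the diagonal and np.inf for non-edges; each SMM accumulates with an elementwise min, the semiring analogue of +=):

import numpy as np

def dc_apsp(A):
    # Divide-and-conquer APSP over the (min,+) semiring (serial sketch).
    n = A.shape[0]
    if n == 1:
        return A
    m = n // 2
    A = A.copy()
    A11, A12 = A[:m, :m], A[:m, m:]
    A21, A22 = A[m:, :m], A[m:, m:]
    A11[:] = dc_apsp(A11)
    A12[:] = np.minimum(A12, smm(A11, A12))
    A21[:] = np.minimum(A21, smm(A21, A11))
    A22[:] = np.minimum(A22, smm(A21, A12))
    A22[:] = dc_apsp(A22)
    A21[:] = np.minimum(A21, smm(A22, A21))
    A12[:] = np.minimum(A12, smm(A12, A22))
    A11[:] = np.minimum(A11, smm(A12, A21))
    return A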

Block-cyclic recursive parallelization

[Figure: a cyclic step and a blocked step of the recursive decomposition of the n-by-n matrix over processors P₁₁, ..., P₄₄]

2.5D APSP

2.5D recursive parallelization is straightforward
- perform 'cyclic steps' using a 2.5D process grid
- decompose 'blocked steps' using an octant of the grid
- switch to the 2D algorithm when the grid is 2D
- minimizes latency and bandwidth!

2.5D APSP strong scaling performance

[Figure: strong scaling of DC-APSP on Hopper; GFlops vs. number of compute nodes (1 to 1024) for 2D and 2.5D at n = 8,192 and n = 32,768]

Block size gives a latency-bandwidth tradeoff

[Figure: performance of 2D DC-APSP on 256 nodes of Hopper (n = 8,192); GFlops vs. block size b (1 to 256)]

Blocked vs. block-cyclic vs. cyclic decompositions

[Figure: blocked, block-cyclic, and cyclic layouts; green denotes fill (unique values), red denotes padding / load imbalance]

Cyclops Tensor Framework (CTF)

Big idea:
- decompose tensors cyclically among processors
- pick the cyclic phase to preserve partial/full symmetric structure

3D tensor contraction

[Figure: a contraction of 3D tensors]

3D tensor cyclic decomposition

[Figure: cyclic decomposition of a 3D tensor]

3D tensor mapping

[Figure: mapping of a cyclically decomposed 3D tensor onto a processor grid]

A cyclic layout is still challenging

- In order to retain partial symmetric structure, all symmetric dimensions of a tensor must be mapped with the same cyclic phase (see the check below)
- The contracted dimensions of A and B must be mapped with the same phase
- And yet the virtual mapping needs to be mapped onto a physical topology, which can be of any shape
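Why equal phases preserve symmetry: with the same cyclic phase p in both dimensions of a symmetric matrix, the (i, j) and (j, i) local blocks are transposes of each other, so the decomposition retains the symmetric structure blockwise. A small numpy check (illustrative):

import numpy as np

n, p = 12, 3
X = np.random.rand(n, n)
X = X + X.T  # symmetric matrix
for i in range(p):
    for j in range(p):
        # block (i,j) under cyclic phase p in both dimensions
        assert np.allclose(X[i::p, j::p], X[j::p, i::p].T)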

Virtual processor grid dimensions

- Our virtual cyclic topology is somewhat restrictive, and the physical topology is very restricted
- Virtual processor grid dimensions serve as a new level of indirection (a minimal sketch follows)
- If a tensor dimension must have a certain cyclic phase, adjust the physical mapping by creating a virtual processor dimension
- This allows the physical processor grid to be 'stretchable'
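A minimal sketch of the indirection (hypothetical helper, not the CTF API): the element's cyclic phase equals the virtual grid dimension, and virtual ranks fold round-robin onto the physical dimension.

def cyclic_map(i, p_phys, virt):
    # Map index i under cyclic phase p_phys * virt: return the physical
    # owner and which of its `virt` virtual blocks holds the element.
    virtual_rank = i % (p_phys * virt)
    return virtual_rank % p_phys, virtual_rank // p_phys

# e.g. a dimension that needs cyclic phase 6 on 2 physical processors:
# cyclic_map(i, p_phys=2, virt=3) -> each processor holds 3 virtual blocks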

Virtual processor grid construction

[Figure: matrix multiplication on a 2x3 processor grid; red lines represent the virtualized part of the processor grid; elements are assigned to blocks by cyclic phase]

2.5D algorithms for tensors

We incorporate data replication for communication minimization into CTF
- replicate only one tensor/matrix (minimizes bandwidth but not latency)
- autotune over mappings to all possible physical topologies
- select the mapping with the least amount of communication
- achieve minimal communication for tensors of widely different sizes

Preliminary contraction results on Blue Gene/P

[Figure: Cyclops TF strong scaling on BG/P; percentage of theoretical symmetric peak vs. #nodes (256 to 2048) for 4D symmetric (CCSD), 6D symmetric (CCSDT), and 8D symmetric (CCSDTQ) contractions]

Preliminary Coupled Cluster results on Blue Gene/Q

A Coupled Cluster with Double excitations (CCD) implementation is up and running

- Already scaled to 1,024 nodes of BG/Q, with up to 400 virtual orbitals
- Preliminary results already indicate performance matching NWChem
- Several major optimizations are still in progress
- Expecting significantly better scalability than any existing software

Conclusion

Our contributions:
- 2.5D mapping of matrix multiplication
  - optimal according to the lower bounds of [Irony, Tiskin, Toledo 04] and [Aggarwal, Chandra, and Snir 90]
- A new latency lower bound for LU
- Communication-optimal 2.5D LU, QR, and APSP
  - all are bandwidth-optimal according to the general lower bound of [Ballard, Demmel, Holtz, Schwartz 10]
  - LU is latency-optimal according to the new lower bound
- Cyclops Tensor Framework
  - runtime autotuning to minimize communication
  - topology-aware mapping in any dimension, with symmetry considerations


Backup slides

Rectangular collectives

Performance of multicast (BG/P vs Cray)

[Figure: bandwidth (MB/sec) of a 1 MB multicast vs. #nodes on BG/P, Cray XT5, and Cray XE6 (same data as the earlier multicast figure)]


Why the performance discrepancy in multicasts?

- Cray machines use binomial multicasts
  - form a spanning tree from a list of nodes
  - route copies of the message down each branch
  - network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts
  - require the network topology to be a k-ary n-cube
  - form 2n edge-disjoint spanning trees
  - route in different dimensional orders
  - use both directions of the bidirectional network
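For contrast, a minimal sketch of the binomial broadcast tree used on the Cray machines (recursive doubling from root 0; illustrative):

def binomial_children(rank, p):
    # Children of `rank` in a binomial broadcast tree rooted at 0:
    # at step t, every rank r < 2**t sends to r + 2**t.
    children, t = [], 1
    while t < p:
        if t > rank and rank + t < p:
            children.append(rank + t)
        t *= 2
    return children

# e.g. p = 8: node 0 -> [1, 2, 4], node 1 -> [3, 5], node 2 -> [6]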
