2.5D Algorithms for Distributed-Memory Computing

2.5D Algorithms for Distributed-Memory Computing

Introduction 2.5D dense linear algebra All-pairs shortest-paths Symmetric tensor contractions Conclusion 2.5D algorithms for distributed-memory computing Edgar Solomonik UC Berkeley July, 2012 Edgar Solomonik 2.5D algorithms 1/ 62 Introduction 2.5D dense linear algebra All-pairs shortest-paths Symmetric tensor contractions Conclusion Outline Introduction Strong scaling 2.5D dense linear algebra 2.5D matrix multiplication 2.5D LU factorization 2.5D QR factorization All-pairs shortest-paths Symmetric tensor contractions Conclusion Edgar Solomonik 2.5D algorithms 2/ 62 Introduction 2.5D dense linear algebra All-pairs shortest-paths Strong scaling Symmetric tensor contractions Conclusion Solving science problems faster Parallel computers can solve bigger problems I weak scaling Parallel computers can also solve a fixed problem faster I strong scaling Obstacles to strong scaling I may increase relative cost of communication I may hurt load balance How to reduce communication and maintain load balance? I reduce (minimize) communication along the critical path I exploit the network topology Edgar Solomonik 2.5D algorithms 3/ 62 Introduction 2.5D dense linear algebra All-pairs shortest-paths Strong scaling Symmetric tensor contractions Conclusion Topology-aware multicasts (BG/P vs Cray) 1 MB multicast on BG/P, Cray XT5, and Cray XE6 8192 BG/P XE6 4096 XT5 2048 1024 512 Bandwidth (MB/sec) 256 128 8 64 512 4096 #nodes Edgar Solomonik 2.5D algorithms 4/ 62 Introduction 2.5D dense linear algebra All-pairs shortest-paths Strong scaling Symmetric tensor contractions Conclusion 2D rectangular multicasts trees 2D 4X4 Torus Spanning tree 1 Spanning tree 2 root Spanning tree 3 Spanning tree 4 All 4 trees combined Edgar Solomonik 2.5D algorithms 5/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion Blocking matrix multiplication B A B A B A B A Edgar Solomonik 2.5D algorithms 6/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 2D matrix multiplication [Cannon 69], [Van De Geijn and Watts 97] B A O(n3=p) flops 16 CPUs (4x4) 2 p B CPU CPU CPU CPU O(n = p) words moved CPU CPU CPU CPU p A CPU CPU CPU CPU O( p) messages CPU CPU CPU CPU B O(n2=p) bytes of memory A B A Edgar Solomonik 2.5D algorithms 7/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 3D matrix multiplication [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89], [McColl and Tiskin 99] 64 CPUs (4x4x4) CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU 3 B CPU CPU CPU CPU O(n =p) flops A CPU CPU CPU CPU CPU CPU CPU CPU O(n2=p2=3) words moved B CPU CPU CPU CPU CPU CPU CPU CPU O(1) messages A CPU CPU CPU CPU 2 2=3 B CPU CPU CPU CPU O(n =p ) bytes of memory CPU CPU CPU CPU CPU CPU CPU CPU A B CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU A 4 copies of matrices Edgar Solomonik 2.5D algorithms 8/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 2.5D matrix multiplication [McColl and Tiskin 99] B 32 CPUs (4x4x2) A CPU CPU CPU CPU O(n3=p) flops CPU CPU CPU CPU p B CPU CPU CPU CPU O(n2= c · p) words moved CPU CPU CPU CPU A q CPU CPU CPU CPU O( p=c3) messages B CPU CPU CPU CPU CPU CPU CPU CPU O(c · n2=p) bytes of memory CPU CPU CPU CPU A 2 copies of matrices B A Edgar Solomonik 2.5D algorithms 9/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 2.5D strong scaling n = dimension, p = #processors, c = #copies of data 1=3 I must satisfy 1 ≤ c ≤ p I special case: c = 1 yields 2D algorithm 1=3 I special case: c = p yields 3D algorithm cost(2.5D MM(p; c)) = O(n3=p) flops p + O(n2= c · p) words moved q + O( p=c3) messages∗ *ignoring log(p) factors Edgar Solomonik 2.5D algorithms 10/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 2.5D strong scaling n = dimension, p = #processors, c = #copies of data 1=3 I must satisfy 1 ≤ c ≤ p I special case: c = 1 yields 2D algorithm 1=3 I special case: c = p yields 3D algorithm cost(2D MM(p)) = O(n3=p) flops p + O(n2= p) words moved p + O( p) messages∗ = cost(2.5D MM(p; 1)) *ignoring log(p) factors Edgar Solomonik 2.5D algorithms 11/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 2.5D strong scaling n = dimension, p = #processors, c = #copies of data 1=3 I must satisfy 1 ≤ c ≤ p I special case: c = 1 yields 2D algorithm 1=3 I special case: c = p yields 3D algorithm cost(2.5D MM(c · p; c)) = O(n3=(c · p)) flops p + O(n2=(c · p)) words moved p + O( p=c) messages = cost(2D MM(p))=c perfect strong scaling Edgar Solomonik 2.5D algorithms 12/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion Strong scaling matrix multiplication 2.5D MM on BG/P (n=65,536) 100 2.5D MM 2D MM 80 ScaLAPACK PDGEMM 60 40 20 Percentage of machine peak 0 256 512 1024 2048 p Edgar Solomonik 2.5D algorithms 13/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 2.5D MM on 65,536 cores 2.5D MM on 16,384 nodes of BG/P 60 2.5D MM 2D MM 50 40 30 20 Percentage of flops peak 10 0 8192 32768 131072 n Edgar Solomonik 2.5D algorithms 14/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion Cost breakdown of MM on 65,536 cores Matrix multiplication on 16,384 nodes of BG/P 1.4 communication 1.2 idle 95% reduction in comm computation 1 0.8 0.6 0.4 0.2 Execution time normalized by 2D 0 n=8192, 2Dn=8192, 2.5D n=131072, 2Dn=131072, 2.5D Edgar Solomonik 2.5D algorithms 15/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 2.5D recursive LU A = L · U where L is lower-triangular and U is upper-triangular I A 2.5D recursive algorithm with no pivoting [A. Tiskin 2002] I Tiskin gives algorithm under the BSP model I Bulk Synchronous Parallel I considers communication and synchronization I We give an alternative distributed-memory adaptation and implementation I Also, we lower-bound the latency cost Edgar Solomonik 2.5D algorithms 16/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 2D blocked LU factorization A Edgar Solomonik 2.5D algorithms 17/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 2D blocked LU factorization U₀₀ L₀₀ Edgar Solomonik 2.5D algorithms 18/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 2D blocked LU factorization U L Edgar Solomonik 2.5D algorithms 19/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 2D blocked LU factorization U L S=A-LU Edgar Solomonik 2.5D algorithms 20/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 2D block-cyclic decomposition 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 Edgar Solomonik 2.5D algorithms 21/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 2D block-cyclic LU factorization Edgar Solomonik 2.5D algorithms 22/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 2D block-cyclic LU factorization U L Edgar Solomonik 2.5D algorithms 23/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion 2D block-cyclic LU factorization U L S=A-LU Edgar Solomonik 2.5D algorithms 24/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion A new latency lower bound for LU I Relate volume to surface area to diameter k₀ A₀₀ I For block size n=d LU does k₁ A₁₁ 3 2 I Ω(n =d ) flops 2 k₂ A₂₂ I Ω(n =d) words k₃ A₃₃ I Ω(d) msgs n I Now pick d (=latency cost) p k₄ A₄₄ critical path I d = Ω( p) to minimize flops p I d = Ω( c · p) to minimize

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    65 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us