2.5D algorithms for distributed-memory computing
Edgar Solomonik
UC Berkeley
July, 2012
Outline
- Introduction
  - Strong scaling
- 2.5D dense linear algebra
  - 2.5D matrix multiplication
  - 2.5D LU factorization
  - 2.5D QR factorization
- All-pairs shortest-paths
- Symmetric tensor contractions
- Conclusion
Solving science problems faster
Parallel computers can solve bigger problems
- weak scaling

Parallel computers can also solve a fixed problem faster
- strong scaling

Obstacles to strong scaling:
- may increase the relative cost of communication
- may hurt load balance

How to reduce communication and maintain load balance?
- reduce (minimize) communication along the critical path
- exploit the network topology
Topology-aware multicasts (BG/P vs Cray)
[Figure: bandwidth (MB/sec) of a 1 MB multicast on BG/P, Cray XT5, and Cray XE6 as a function of node count, 8 to 4096 nodes]
2D rectangular multicast trees

[Figure: four edge-disjoint spanning trees on a 2D 4x4 torus, rooted at one node, and all four trees combined]
Blocking matrix multiplication
[Figure: blocked decomposition of matrices A and B for multiplication]
2D matrix multiplication [Cannon 69], [Van De Geijn and Watts 97]
16 CPUs arranged in a 4x4 grid, one copy of the matrices:

- O(n^3/p) flops
- O(n^2/√p) words moved
- O(√p) messages
- O(n^2/p) bytes of memory
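The 2D algorithm can be illustrated with a small serial simulation of Cannon's algorithm: after an initial skew, each step multiplies local blocks and circularly shifts A-blocks left and B-blocks up. This is a sketch for intuition only; the block layout and the name `cannon` are illustrative, not the talk's implementation.

```python
# Serial simulation of Cannon's 2D algorithm on a q x q grid of blocks.
def cannon(A, B, q):
    """Multiply n x n matrices A and B, stored as q x q grids of blocks."""
    n = len(A)
    nb = n // q  # block dimension

    def blk(M, i, j):  # copy out block (i, j)
        return [row[j * nb:(j + 1) * nb] for row in M[i * nb:(i + 1) * nb]]

    # Initial skew: processor (i, j) holds A(i, i+j) and B(i+j, j).
    Ab = [[blk(A, i, (j + i) % q) for j in range(q)] for i in range(q)]
    Bb = [[blk(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    Cb = [[[[0] * nb for _ in range(nb)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        # Each "processor" multiplies its local blocks into C.
        for i in range(q):
            for j in range(q):
                a, b, c = Ab[i][j], Bb[i][j], Cb[i][j]
                for x in range(nb):
                    for k in range(nb):
                        for y in range(nb):
                            c[x][y] += a[x][k] * b[k][y]
        # Circular shift: A-blocks move left, B-blocks move up.
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    # Gather the blocks of C back into a dense matrix.
    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for x in range(nb):
                for y in range(nb):
                    C[i * nb + x][j * nb + y] = Cb[i][j][x][y]
    return C
```

Each of the q steps moves one block of A and one of B per processor, which is where the O(√p) message count and O(n^2/√p) bandwidth of the 2D algorithm come from.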
3D matrix multiplication [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Berntsen 89], [McColl and Tiskin 99]
64 CPUs arranged in a 4x4x4 grid, 4 copies of the matrices:

- O(n^3/p) flops
- O(n^2/p^(2/3)) words moved
- O(1) messages
- O(n^2/p^(2/3)) bytes of memory

2.5D matrix multiplication [McColl and Tiskin 99]
32 CPUs arranged in a 4x4x2 grid, 2 copies of the matrices:

- O(n^3/p) flops
- O(n^2/√(c·p)) words moved
- O(√(p/c^3)) messages
- O(c·n^2/p) bytes of memory
2.5D strong scaling
n = matrix dimension, p = #processors, c = #copies of the data

- must satisfy 1 ≤ c ≤ p^(1/3)
- special case: c = 1 yields the 2D algorithm
- special case: c = p^(1/3) yields the 3D algorithm

cost(2.5D MM(p, c)) = O(n^3/p) flops
                    + O(n^2/√(c·p)) words moved
                    + O(√(p/c^3)) messages*

*ignoring log(p) factors
cost(2D MM(p)) = O(n^3/p) flops
               + O(n^2/√p) words moved
               + O(√p) messages*
               = cost(2.5D MM(p, 1))

*ignoring log(p) factors
cost(2.5D MM(c·p, c)) = O(n^3/(c·p)) flops
                      + O(n^2/(c·√p)) words moved
                      + O(√p/c) messages
                      = cost(2D MM(p))/c

This is perfect strong scaling: running on c times as many processors with c copies of the data reduces every cost term by a factor of c.
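The perfect-strong-scaling identity can be checked numerically from the cost model. A minimal sketch, with unit constants on the asymptotic terms (`mm_cost` is an illustrative name, not part of any library):

```python
# 2.5D MM cost model from the slides, with unit constants:
# flops, words moved, and messages as functions of n, p, c.
def mm_cost(n, p, c):
    flops = n ** 3 / p
    words = n ** 2 / (c * p) ** 0.5
    msgs = (p / c ** 3) ** 0.5
    return flops, words, msgs

# Running on c*p processors with c copies costs 1/c of 2D MM on p:
n, p, c = 4096, 64, 4
cost_2d = mm_cost(n, p, 1)       # 2D algorithm = 2.5D with c = 1
cost_25d = mm_cost(n, c * p, c)  # 2.5D on c*p processors
# every component of cost_25d equals the matching component of cost_2d / c
```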
Strong scaling matrix multiplication
[Figure: percentage of machine peak for 2.5D MM, 2D MM, and ScaLAPACK PDGEMM on BG/P (n = 65,536), for p = 256 to 2048]
2.5D MM on 65,536 cores
[Figure: percentage of flops peak for 2.5D MM vs 2D MM on 16,384 nodes of BG/P, for n = 8,192 to 131,072]
Cost breakdown of MM on 65,536 cores
[Figure: execution time of MM on 16,384 nodes of BG/P, normalized by 2D, broken down into communication, idle, and computation for n = 8,192 and n = 131,072; 2.5D achieves a 95% reduction in communication time]
2.5D recursive LU
A = L · U, where L is lower-triangular and U is upper-triangular.

- A 2.5D recursive algorithm with no pivoting is due to [A. Tiskin 2002]
- Tiskin gives the algorithm under the BSP (Bulk Synchronous Parallel) model
  - considers communication and synchronization
- We give an alternative distributed-memory adaptation and implementation
- We also lower-bound the latency cost
2D blocked LU factorization
[Figure, in steps: factor the leading block as L₀₀·U₀₀, compute the L and U panels, then form the Schur complement S = A − L·U and recurse]

2D block-cyclic decomposition
[Figure: 2D block-cyclic assignment of matrix blocks to a 4x4 processor grid]
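The block-cyclic layout can be described by a one-line owner map (a sketch; `block_cyclic_owner` is an illustrative name):

```python
# 1D block-cyclic owner map: row i with block size b on p processors.
# b = n/p recovers the blocked layout; b = 1 the purely cyclic one.
def block_cyclic_owner(i, b, p):
    return (i // b) % p
```

For example, with n = 8 rows, p = 2 processors, and block size b = 2, the rows map to processors 0, 0, 1, 1, 0, 0, 1, 1; the same map applies independently to each dimension of the 2D grid.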
2D block-cyclic LU factorization

[Figure, in steps: the same LU factorization steps (factor, compute L and U panels, form S = A − L·U) on the block-cyclic layout]
A new latency lower bound for LU
- Relate volume to surface area to diameter
- For block size n/d, LU performs:
  - Ω(n^3/d^2) flops
  - Ω(n^2/d) words
  - Ω(d) messages
- Now pick d (= latency cost):
  - d = Ω(√p) to minimize flops
  - d = Ω(√(c·p)) to minimize words
- More generally: latency · bandwidth = Ω(n^2)

[Figure: the critical path runs through the diagonal blocks A₀₀, A₁₁, ..., A_(d-1,d-1)]
2.5D LU factorization
[Figure, in steps (A)-(D): 2.5D LU — factor the leading block as L₀₀·U₀₀, compute the panels L₁₀...L₃₀ and U₀₁...U₀₃ across processor layers, then apply the trailing update]
Look at how this update is distributed.
What does it remind you of?
The trailing update is distributed in the same way as the 3D update in 2.5D matrix multiplication.
2.5D LU strong scaling (without pivoting)
[Figure: percentage of machine peak for 2.5D LU vs 2D LU (no pivoting) on BG/P (n = 65,536), for p = 256 to 2048]
2.5D LU with pivoting
A = P · L · U, where P is a permutation matrix.

- 2.5D generic pairwise elimination (neighbor/pairwise pivoting, or Givens rotations for QR) is given by [A. Tiskin 2007]
  - pairwise pivoting does not produce an explicit L
  - pairwise pivoting may have stability issues for large matrices
- Our approach uses tournament pivoting, which is more stable than pairwise pivoting and gives L explicitly
  - pass up rows of A instead of U to avoid error accumulation
Tournament pivoting (CA-pivoting)

{P, L, U} ← CA-pivot(A, n)
if n ≤ b then
    base case: {P, L, U} = partial-pivot(A)
else
    recursive case:
    [A_1^T, A_2^T]^T = A
    {P_1, L_1, U_1} = CA-pivot(A_1)
    [R_1^T, R_2^T]^T = P_1^T · A_1
    {P_2, L_2, U_2} = CA-pivot(A_2)
    [S_1^T, S_2^T]^T = P_2^T · A_2
    {P_r, L, U} = partial-pivot([R_1^T, S_1^T]^T)
    form P from P_r, P_1, and P_2
end if
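A runnable sketch of the tournament idea (illustrative, not the talk's code): each half of the rows nominates b candidate pivot rows via a greedy partial-pivoting pass, and a playoff among the 2b candidates picks the b winners. The names `select_rows` and `tournament_pivot` are hypothetical.

```python
# Tournament (CA) pivoting sketch: recursively select b pivot-row indices.
def select_rows(A, rows, b):
    """Pick b pivot-row indices from `rows` by Gaussian elimination with
    row pivoting on the candidate set (a stand-in for partial-pivot)."""
    cand = {i: list(A[i]) for i in rows}
    chosen = []
    for col in range(b):
        piv = max(cand, key=lambda i: abs(cand[i][col]))
        pr = cand.pop(piv)
        chosen.append(piv)
        # eliminate this column from the remaining candidates
        for i, r in cand.items():
            if pr[col]:
                f = r[col] / pr[col]
                cand[i] = [x - f * y for x, y in zip(r, pr)]
    return chosen

def tournament_pivot(A, rows, b):
    """Each half nominates b candidates; a playoff among the 2b
    candidates picks the b winners."""
    if len(rows) <= 2 * b:
        return select_rows(A, rows, b)
    mid = len(rows) // 2
    c1 = tournament_pivot(A, rows[:mid], b)
    c2 = tournament_pivot(A, rows[mid:], b)
    return select_rows(A, c1 + c2, b)
```

Only candidate rows travel up the tree, which is what replaces the per-column synchronization of partial pivoting with a logarithmic-depth reduction.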
Tournament pivoting
Partial pivoting is not communication-optimal on a blocked matrix:

- requires a message/synchronization for each column
- O(n) messages needed

Tournament pivoting is communication-optimal:

- performs a tournament to determine the best pivot-row candidates
- passes up the 'best rows' of A
2.5D LU factorization with tournament pivoting
[Figure, in steps: 2.5D LU with tournament pivoting — each panel PA₀...PA₃ runs a pivoting tournament, the winning rows are factored into the panels L₁₀...L₄₀ and U₀₁...U₀₃, and the resulting updates are applied to the trailing matrix across layers]
Strong scaling of 2.5D LU with tournament pivoting
[Figure: percentage of machine peak for 2.5D LU (CA-pivoting), 2D LU (CA-pivoting), and ScaLAPACK PDGETRF on BG/P (n = 65,536), for p = 256 to 2048]
2.5D LU on 65,536 cores
[Figure: time breakdown (communication, idle, compute) of LU on 16,384 nodes of BG/P (n = 131,072); 2.5D is about 2X faster than 2D both without pivoting and with CA-pivoting]
2.5D QR factorization
A = Q · R, where Q is orthogonal and R is upper-triangular.

- 2.5D QR using Givens rotations (generic pairwise elimination) is given by [A. Tiskin 2007]
- Tiskin minimizes latency and bandwidth by working on slanted panels
- 2.5D QR cannot be done with right-looking updates as in 2.5D LU, due to the non-commutativity of orthogonalization updates
2.5D QR factorization using the YT representation
The YT representation of Householder QR is more work-efficient when only R is computed.

- We give an algorithm that performs 2.5D QR using the YT representation
- The algorithm performs left-looking updates on Y
- Householder with YT requires less computation (roughly 2X) than Givens rotations
- Our approach achieves optimal bandwidth cost, but has O(n) latency
2.5D QR using the YT representation

[Figure: data movement in 2.5D QR with the YT representation]

Latency-optimal 2.5D QR
To reduce latency, we can employ the TSQR algorithm:

1. Given an n-by-b panel, partition it into 2b-by-b blocks
2. Perform QR on each 2b-by-b block
3. Stack the computed Rs into an n/2-by-b panel and recurse
4. Q is given in a hierarchical representation

To obtain the YT representation from the hierarchical Q:

- take Y to be the first b columns of Q minus the identity
- define T = (Y^T·Y − I)^(−1)
- this sacrifices the triangular structure of T
- conjecture: stable if the diagonal elements of Q are selected to be negative
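The TSQR reduction tree can be sketched in a few lines of pure Python (illustrative only — it uses modified Gram-Schmidt for the local QRs, not the Householder kernels an implementation would use; `mgs_qr` and `tsqr_R` are hypothetical names). For a full-column-rank panel, the R it produces matches the R of a direct QR.

```python
# TSQR sketch: QR each 2b-by-b block, stack the Rs, recurse.
def mgs_qr(A):
    """Modified Gram-Schmidt QR; A is a list of rows. Returns (Q, R)."""
    m, n = len(A), len(A[0])
    Q = [row[:] for row in A]          # columns become orthonormal in place
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        nrm = sum(Q[i][j] ** 2 for i in range(m)) ** 0.5
        R[j][j] = nrm
        for i in range(m):
            Q[i][j] /= nrm
        for k in range(j + 1, n):      # orthogonalize later columns
            dot = sum(Q[i][j] * Q[i][k] for i in range(m))
            R[j][k] = dot
            for i in range(m):
                Q[i][k] -= dot * Q[i][j]
    return Q, R

def tsqr_R(A, b):
    """Return the b-by-b R factor of a tall panel A (rows x b columns)
    by the TSQR reduction: local QRs, stack the Rs, recurse."""
    if len(A) <= 2 * b:
        return mgs_qr(A)[1]
    Rs = []
    for s in range(0, len(A), 2 * b):
        Rs.extend(mgs_qr(A[s:s + 2 * b])[1])
    return tsqr_R(Rs, b)
```

The recursion depth is logarithmic in the number of blocks, which is the source of the latency reduction relative to column-by-column Householder QR.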
All-pairs shortest-paths
Given an input graph G = (V, E):

- find the shortest paths between each pair of nodes v_i, v_j
- reduces to semiring matrix multiplication with a dependency along k
- the computational structure is similar to LU factorization

Semiring matrix multiplication (SMM):

- replace scalar multiply with scalar add
- replace scalar add with scalar min
- depending on the processor, this can require more instructions
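In code, the (min, +) semiring product is just a swap of the two scalar operations (a minimal sketch; `smm` is an illustrative name):

```python
INF = float('inf')  # marks "no edge"

def smm(A, B):
    """(min, +) semiring matrix product: + replaces *, min replaces +."""
    n, m, q = len(A), len(B[0]), len(B)
    return [[min(A[i][k] + B[k][j] for k in range(q)) for j in range(m)]
            for i in range(n)]
```

Squaring a weighted adjacency matrix with `smm` yields shortest distances over paths of at most two edges, which is the building block the recursive APSP algorithm repeats.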
A recursive algorithm for all-pairs shortest-paths
A known algorithm for recursively computing APSP:

1. Given adjacency matrix A of graph G
2. Recurse on block A₁₁
3. Compute SMM A₁₂ ← A₁₁ · A₁₂
4. Compute SMM A₂₁ ← A₂₁ · A₁₁
5. Compute SMM A₂₂ ← A₂₁ · A₁₂
6. Recurse on block A₂₂
7. Compute SMM A₂₁ ← A₂₂ · A₂₁
8. Compute SMM A₁₂ ← A₁₂ · A₂₂
9. Compute SMM A₁₁ ← A₁₂ · A₂₁
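The nine steps above can be sketched directly in pure Python (illustrative, not the talk's implementation; `dc_apsp` and `smm_acc` are hypothetical names). It assumes 0 on the diagonal and INF for missing edges, with the semiring product accumulating into its target via min.

```python
INF = float('inf')

def smm_acc(A, B, C):
    """Semiring MM with accumulation: min(C, A·B) in the (min, +) semiring."""
    n, m, q = len(A), len(B[0]), len(B)
    return [[min(C[i][j], min(A[i][k] + B[k][j] for k in range(q)))
             for j in range(m)] for i in range(n)]

def dc_apsp(A):
    """Recursive APSP following steps 1-9 (0 diagonal, INF = no edge)."""
    n = len(A)
    if n == 1:
        return A
    h = n // 2
    A11 = [r[:h] for r in A[:h]]
    A12 = [r[h:] for r in A[:h]]
    A21 = [r[:h] for r in A[h:]]
    A22 = [r[h:] for r in A[h:]]
    A11 = dc_apsp(A11)            # step 2
    A12 = smm_acc(A11, A12, A12)  # step 3
    A21 = smm_acc(A21, A11, A21)  # step 4
    A22 = smm_acc(A21, A12, A22)  # step 5
    A22 = dc_apsp(A22)            # step 6
    A21 = smm_acc(A22, A21, A21)  # step 7
    A12 = smm_acc(A12, A22, A12)  # step 8
    A11 = smm_acc(A12, A21, A11)  # step 9
    return [a + b for a, b in zip(A11, A12)] + \
           [a + b for a, b in zip(A21, A22)]
```

On a directed 4-cycle with unit weights this reproduces the Floyd-Warshall distances, while exposing the semiring matrix multiplications that the 2.5D decomposition parallelizes.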
Block-cyclic recursive parallelization
[Figure: a cyclic step vs. a blocked step of the recursive decomposition — a cyclic step keeps the n/2-by-n/2 block spread over the full 4x4 processor grid, while a blocked step assigns it to a subgrid]
2.5D APSP
The 2.5D recursive parallelization is straightforward:

- perform 'cyclic steps' using a 2.5D process grid
- decompose 'blocked steps' using an octant of the grid
- switch to the 2D algorithm when the grid is 2D
- minimizes latency and bandwidth!
2.5D APSP strong scaling performance
[Figure: strong scaling of DC-APSP on Hopper — GFlops for 2.5D and 2D variants at n = 32,768 and n = 8,192, from 1 to 1024 compute nodes]
Block size gives a latency-bandwidth tradeoff
[Figure: performance (GFlops) of 2D DC-APSP on 256 nodes of Hopper (n = 8,192) as a function of block size b = 1 to 256]
Blocked vs block-cyclic vs cyclic decompositions
[Figure: blocked, block-cyclic, and cyclic layouts of a symmetric matrix; green denotes fill (unique values), red denotes padding / load imbalance]
Cyclops Tensor Framework (CTF)
Big idea:

- decompose tensors cyclically among processors
- pick the cyclic phase to preserve partial/full symmetric structure
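A toy sketch of why matching cyclic phases matter (illustrative names, not CTF's API): distributing both dimensions of a symmetric matrix with the same cyclic phase gives every processor a local set of unique (i ≤ j) elements with the same regular structure.

```python
# Toy cyclic decomposition: element (i, j) of a symmetric matrix goes to
# processor (i % p0, j % p1). Using the same phase for both symmetric
# dimensions preserves the triangular structure of each local block.
def owner(idx, phases):
    return tuple(i % p for i, p in zip(idx, phases))

def local_elems(proc, n, phases):
    """Indices of the unique (i <= j) elements owned by `proc`."""
    return [(i, j) for i in range(n) for j in range(i, n)
            if owner((i, j), phases) == proc]
```

The residual imbalance between processors is evened out by the padding mentioned in the layout comparison; a mismatched phase would destroy the regular local structure entirely.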
3D tensor contraction

[Figures: a 3D tensor contraction, its cyclic decomposition, and its mapping onto a processor grid]

A cyclic layout is still challenging
- In order to retain partial symmetry, all symmetric dimensions of a tensor must be mapped with the same cyclic phase
- The contracted dimensions of A and B must be mapped with the same phase
- And yet the virtual mapping needs to be mapped onto a physical topology, which can be any shape
Virtual processor grid dimensions
- Our virtual cyclic topology is somewhat restrictive, and the physical topology is very restricted
- Virtual processor grid dimensions serve as a new level of indirection
- If a tensor dimension must have a certain cyclic phase, adjust the physical mapping by creating a virtual processor dimension
- This allows the physical processor grid to be 'stretchable'
Virtual processor grid construction

[Figure: matrix multiply C = A · B on a 2x3 processor grid; red lines represent the virtualized part of the processor grid; elements are assigned to blocks by cyclic phase]
2.5D algorithms for tensors
We incorporate data replication for communication minimization into CTF:

- replicate only one tensor/matrix (minimizes bandwidth but not latency)
- autotune over mappings to all possible physical topologies
- select the mapping with the least amount of communication
- achieve minimal communication for tensors of widely different sizes
Preliminary contraction results on Blue Gene/P
[Figure: Cyclops TF strong scaling — percentage of theoretical symmetric peak for 4D symmetric (CCSD), 6D symmetric (CCSDT), and 8D symmetric (CCSDTQ) contractions on 256 to 2048 nodes]
Preliminary Coupled Cluster results on Blue Gene/Q
A Coupled Cluster with Double excitations (CCD) implementation is up and running:

- already scaled on up to 1024 nodes of BG/Q, with up to 400 virtual orbitals
- preliminary results already indicate performance matching NWChem
- several major optimizations are still in progress
- expecting significantly better scalability than any existing software
Conclusion
Our contributions:

- 2.5D mapping of matrix multiplication
  - optimal according to the lower bounds of [Irony, Tiskin, Toledo 04] and [Aggarwal, Chandra, and Snir 90]
- A new latency lower bound for LU
- Communication-optimal 2.5D LU, QR, and APSP
  - bandwidth-optimal according to the general lower bound of [Ballard, Demmel, Holtz, Schwartz 10]
  - LU is latency-optimal according to the new lower bound
- Cyclops Tensor Framework
  - runtime autotuning to minimize communication
  - topology-aware mapping in any dimension, with symmetry considerations
Backup slides

Rectangular collectives
Performance of multicast (BG/P vs Cray)
[Figure: bandwidth (MB/sec) of a 1 MB multicast on BG/P, Cray XT5, and Cray XE6 vs. number of nodes, as in the introduction]
Why the performance discrepancy in multicasts?
- Cray machines use binomial multicasts
  - form a spanning tree from a list of nodes
  - route copies of the message down each branch
  - network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts
  - require the network topology to be a k-ary n-cube
  - form 2n edge-disjoint spanning trees
  - route in a different dimensional order
  - use both directions of the bidirectional network