

2.5D algorithms for distributed-memory computing

Edgar Solomonik

UC Berkeley

July, 2012

Outline

Introduction
- Strong scaling

2.5D dense linear algebra
- 2.5D matrix multiplication
- 2.5D LU factorization
- 2.5D QR factorization

All-pairs shortest-paths

Symmetric tensor contractions

Conclusion

Solving science problems faster

Parallel computers can solve bigger problems
- weak scaling

Parallel computers can also solve a fixed problem faster
- strong scaling

Obstacles to strong scaling
- may increase the relative cost of communication
- may hurt load balance

How to reduce communication and maintain load balance?
- reduce (minimize) communication along the critical path
- exploit the network topology

Topology-aware multicasts (BG/P vs Cray)

[Figure: bandwidth (MB/sec) of a 1 MB multicast vs. #nodes on BG/P, Cray XT5, and Cray XE6]

2D rectangular multicast trees

[Figure: a 2D 4x4 torus with four edge-disjoint spanning trees from the root, and all four trees combined]

Blocking matrix multiplication

[Figure: blocked decomposition of A and B for matrix multiplication]

2D matrix multiplication [Cannon 69], [Van De Geijn and Watts 97]

On 16 CPUs (4x4):
- O(n³/p) flops
- O(n²/√p) words moved
- O(√p) messages
- O(n²/p) bytes of memory

[Figure: 2D blocked layout of A and B on the 4x4 processor grid]

3D matrix multiplication [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Berntsen 89], [McColl and Tiskin 99]

On 64 CPUs (4x4x4), with 4 copies of the matrices:
- O(n³/p) flops
- O(n²/p^(2/3)) words moved
- O(1) messages
- O(n²/p^(2/3)) bytes of memory

[Figure: 3D blocked layout of A and B on the 4x4x4 processor grid]

2.5D matrix multiplication [McColl and Tiskin 99]

On 32 CPUs (4x4x2), with 2 copies of the matrices:
- O(n³/p) flops
- O(n²/√(c·p)) words moved
- O(√(p/c³)) messages
- O(c·n²/p) bytes of memory

[Figure: 2.5D blocked layout of A and B on the 4x4x2 processor grid]
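To make the layered update concrete, here is a minimal serial Python sketch (illustrative, not the actual distributed implementation): the summation dimension is split across c layers, each layer forms a partial product, and a sum over layers plays the role of the inter-layer reduction in 2.5D MM.

import numpy as np

def mm_2p5d_layers(A, B, c):
    # Each of the c grid layers owns a slice of the summation (k) dimension
    # and forms a partial product; summing the partials mirrors the
    # inter-layer reduction of 2.5D MM. Assumes c divides A.shape[1].
    kb = A.shape[1] // c
    partials = [A[:, l*kb:(l+1)*kb] @ B[l*kb:(l+1)*kb, :] for l in range(c)]
    return sum(partials)

# e.g. mm_2p5d_layers(A, B, 2) equals A @ B up to roundoff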

2.5D strong scaling

n = dimension, p = #processors, c = #copies of data
- must satisfy 1 ≤ c ≤ p^(1/3)
- special case: c = 1 yields the 2D algorithm
- special case: c = p^(1/3) yields the 3D algorithm

cost(2.5D MM(p, c)) = O(n³/p) flops
                    + O(n²/√(c·p)) words moved
                    + O(√(p/c³)) messages*

cost(2D MM(p)) = O(n³/p) flops
               + O(n²/√p) words moved
               + O(√p) messages*
               = cost(2.5D MM(p, 1))

*ignoring log(p) factors

cost(2.5D MM(c·p, c)) = O(n³/(c·p)) flops
                      + O(n²/(c·√p)) words moved
                      + O(√p/c) messages
                      = cost(2D MM(p))/c

→ perfect strong scaling
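The perfect-strong-scaling identity can be checked mechanically from the leading-order cost terms; a minimal Python sketch (constants and log(p) factors ignored, names illustrative):

def mm_cost(n, p, c=1):
    # leading-order 2.5D MM costs: (flops, words moved, messages)
    flops = n**3 / p
    words = n**2 / (c * p) ** 0.5
    msgs = (p / c**3) ** 0.5
    return flops, words, msgs

# scaling p -> c*p while keeping c data copies divides every term by c
n, p, c = 65536, 256, 4
assert all(abs(a - b / c) < 1e-6 * b
           for a, b in zip(mm_cost(n, c * p, c), mm_cost(n, p)))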

Strong scaling matrix multiplication

[Figure: 2.5D MM on BG/P (n = 65,536); percentage of machine peak vs. p (256 to 2048) for 2.5D MM, 2D MM, and ScaLAPACK PDGEMM]

2.5D MM on 65,536 cores

[Figure: 2.5D MM on 16,384 nodes of BG/P; percentage of flops peak vs. n (8,192 to 131,072) for 2.5D MM and 2D MM]

Cost breakdown of MM on 65,536 cores

[Figure: matrix multiplication on 16,384 nodes of BG/P; execution time normalized by 2D, broken into communication, idle, and computation, for n = 8,192 and n = 131,072; 2.5D achieves a 95% reduction in communication]

2.5D recursive LU

A = L · U, where L is lower-triangular and U is upper-triangular

- A 2.5D recursive algorithm with no pivoting [A. Tiskin 2002]
- Tiskin gives the algorithm under the BSP (Bulk Synchronous Parallel) model
  - considers communication and synchronization
- We give an alternative distributed-memory adaptation and implementation
- We also lower-bound the latency cost

2D blocked LU factorization

[Figure: blocked LU progression on A: factor the leading block into L₀₀ and U₀₀, compute the corresponding panels of L and U, then form the Schur complement S = A − LU and continue]
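A serial numpy sketch of the blocked right-looking LU step pictured above (no pivoting; assumes the leading blocks are nonsingular; illustrative, not the distributed code):

import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, b):
    # Right-looking blocked LU without pivoting: overwrites a copy of A
    # with unit-lower-triangular L and upper-triangular U.
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # unblocked LU of the diagonal block: A[k:e,k:e] = L00 @ U00
        for j in range(k, e - 1):
            A[j+1:e, j] /= A[j, j]
            A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
        if e < n:
            # panel solves: L10 = A[e:,k:e] U00^{-1}, U01 = L00^{-1} A[k:e,e:]
            A[e:, k:e] = solve_triangular(A[k:e, k:e], A[e:, k:e].T,
                                          lower=False, trans='T').T
            A[k:e, e:] = solve_triangular(A[k:e, k:e], A[k:e, e:],
                                          lower=True, unit_diagonal=True)
            # Schur complement update: S = A - L U
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A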

2D block-cyclic decomposition

[Figure: block-cyclic distribution of the matrix over a 4x4 processor grid]

2D block-cyclic LU factorization

[Figure: LU progression on the block-cyclic layout: panels of L and U, then the Schur complement S = A − LU]

A new latency lower bound for LU

- Relate volume to surface along the critical path of diagonal blocks A₀₀, A₁₁, ..., A_{d−1,d−1}
- For block size n/d, LU does
  - Ω(n³/d²) flops
  - Ω(n²/d) words
  - Ω(d) messages
- Now pick d (= latency cost)
  - d = Ω(√p) to minimize flops
  - d = Ω(√(c·p)) to minimize words
- More generally, latency · bandwidth = Ω(n²)

2.5D LU factorization

[Figure: build-up of 2.5D LU across grid layers, steps (A)-(D): factorization of the leading block into L₀₀ and U₀₀, computation of the panels L₁₀, ..., L₃₀ and U₀₁, ..., U₀₃, and the distributed trailing update of L and U]


Look at how this update is distributed.

What does it remind you of?

It is the same 3D update as in 2.5D matrix multiplication.

[Figure: the Schur-complement update distributed across layers, matching the 3D update in 2.5D matrix multiplication]

2.5D LU strong scaling (without pivoting)

[Figure: 2.5D LU on BG/P (n = 65,536); percentage of machine peak vs. p (256 to 2048) for 2.5D LU (no pivoting) and 2D LU (no pivoting)]

2.5D LU with pivoting

A = P · L · U, where P is a permutation matrix

- 2.5D generic pairwise elimination (neighbor/pairwise pivoting or Givens rotations (QR)) [A. Tiskin 2007]
  - pairwise pivoting does not produce an explicit L
  - pairwise pivoting may have stability issues for large matrices
- Our approach uses tournament pivoting, which is more stable than pairwise pivoting and gives L explicitly
  - pass up rows of A instead of U to avoid error accumulation

Edgar Solomonik 2.5D algorithms 32/ 62 Introduction 2.5D dense linear algebra 2.5D matrix multiplication All-pairs shortest-paths 2.5D LU factorization Symmetric tensor contractions 2.5D QR factorization Conclusion Tournament pivoting (CA-pivoting) {P, L, U} ← CA-pivot(A,n) if n ≤ b then base case {P, L, U} = partial-pivot(A) else recursive case T T [A1 , A2 ] = A {P1, L1, U1} = CA-pivot(A1) T T T [R1 , R2 ] = P1 A1 {P2, L2, U2} = CA-pivot(A2) T T T [S1 , S2 ] = P2 A2 T T {Pr , L, U} = partial-pivot([R1 , S1 ]) Form P from Pr , P1 and P2 end if

Tournament pivoting

Partial pivoting is not communication-optimal on a blocked matrix
- requires a message/synchronization for each column
- O(n) messages needed

Tournament pivoting is communication-optimal
- performs a tournament to determine the best pivot-row candidates
- passes up the 'best rows' of A (a serial sketch follows)
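A serial Python sketch of the tournament on an n-by-b panel (assumes nonzero pivots; function names are illustrative, not the library's API): each half nominates b candidate rows, and a partial-pivoted playoff among the 2b candidates selects the winners.

import numpy as np

def partial_pivot_rows(A, b):
    # Indices of the first b pivot rows chosen by Gaussian elimination
    # with partial pivoting (row swaps); assumes nonzero pivots.
    A = A.astype(float).copy()
    rows = list(range(A.shape[0]))
    for k in range(b):
        i = k + int(np.argmax(np.abs(A[k:, k])))
        A[[k, i]] = A[[i, k]]
        rows[k], rows[i] = rows[i], rows[k]
        A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])
    return rows[:b]

def ca_pivot_rows(A, b):
    # Tournament pivoting: each half nominates b candidate rows, then a
    # playoff among the 2b candidates picks the b winners.
    n = A.shape[0]
    if n <= 2 * b:
        return partial_pivot_rows(A, min(b, n))
    half = n // 2
    cand = ca_pivot_rows(A[:half], b) \
         + [half + i for i in ca_pivot_rows(A[half:], b)]
    return [cand[i] for i in partial_pivot_rows(A[cand], b)]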

2.5D LU factorization with tournament pivoting

[Figure: build-up of 2.5D LU with tournament pivoting: recursive tournaments on the panel copies PA₀, ..., PA₃ select pivot rows via local partial-pivoted factorizations; the winning rows define the panel factorization into the blocks L₁₀, ..., L₄₀ and U₀₀, ..., U₀₃; the trailing updates are then applied across layers]

Strong scaling of 2.5D LU with tournament pivoting

[Figure: 2.5D LU on BG/P (n = 65,536); percentage of machine peak vs. p (256 to 2048) for 2.5D LU (CA-pivoting), 2D LU (CA-pivoting), and ScaLAPACK PDGETRF]

2.5D LU on 65,536 cores

[Figure: LU on 16,384 nodes of BG/P (n = 131,072); time (sec) broken into communication, idle, and compute for 2D and 2.5D, with and without pivoting; 2.5D is 2x faster in both the no-pivoting and CA-pivoting cases]

2.5D QR factorization

A = Q · R, where Q is orthogonal and R is upper-triangular

- 2.5D QR using Givens rotations (generic pairwise elimination) is given by [A. Tiskin 2007]
- Tiskin minimizes latency and bandwidth by working on slanted panels
- 2.5D QR cannot be done with right-looking updates, as in 2.5D LU, due to the non-commutativity of orthogonalization updates

2.5D QR factorization using the YT representation

The YT representation of Householder QR factorization is more work-efficient when computing only R

- We give an algorithm that performs 2.5D QR using the YT representation
- The algorithm performs left-looking updates on Y
- Householder with YT requires less computation (roughly 2x less) than Givens rotations
- Our approach achieves optimal bandwidth cost, but has O(n) latency
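For reference, a serial sketch of Householder QR accumulated in YT form, Q = I − Y·T·Yᵀ (the forward accumulation of T is standard; assumes m ≥ n and full column rank; illustrative):

import numpy as np

def householder_yt(A):
    # Householder QR in YT form: returns Y, T, R with
    # A = (I - Y @ T @ Y.T) @ R.
    m, n = A.shape
    R = A.astype(float).copy()
    Y = np.zeros((m, n))
    T = np.zeros((n, n))
    for k in range(n):
        v = R[k:, k].copy()
        v[0] += (1.0 if v[0] >= 0 else -1.0) * np.linalg.norm(v)
        v /= np.linalg.norm(v)
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])  # apply I - 2 v v^T
        Y[k:, k] = v
        T[k, k] = 2.0
        # forward accumulation of T over the previous reflectors
        T[:k, k] = -2.0 * (T[:k, :k] @ (Y[:, :k].T @ Y[:, k]))
    return Y, T, np.triu(R)

# check: Q = np.eye(m) - Y @ T @ Y.T reproduces A ≈ Q @ R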

2.5D QR using the YT representation

[Figure: schematic of the 2.5D QR algorithm using the YT representation]

Latency-optimal 2.5D QR

To reduce latency, we can employ the TSQR algorithm (sketched in code after this slide):
1. Given an n-by-b panel, partition it into 2b-by-b blocks
2. Perform QR on each 2b-by-b block
3. Stack the computed Rs into an n/2-by-b panel and recurse
4. Q is given in a hierarchical representation

We then need the YT representation from the hierarchical Q:
- Take Y to be the first b columns of Q minus the identity
- Define T = (YᵀY − I)⁻¹
- This sacrifices the triangular structure of T
- Conjecture: stable if the diagonal elements of Q are selected to be negative
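A serial sketch of the TSQR reduction tree, returning only R (assumes the panel has at least b rows; in the real algorithm each tree node's local Q is kept to form the hierarchical Q):

import numpy as np

def tsqr_r(A, b):
    # Reduce an n-by-b panel to its b-by-b R factor with a binary tree:
    # QR each 2b-by-b block, stack the Rs (halving the rows), and recurse.
    n = A.shape[0]
    if n <= 2 * b:
        return np.linalg.qr(A)[1]
    Rs = [np.linalg.qr(A[i:i + 2*b])[1] for i in range(0, n, 2*b)]
    return tsqr_r(np.vstack(Rs), b)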

All-pairs shortest-paths

Given an input graph G = (V, E)
- Find shortest paths between each pair of nodes vᵢ, vⱼ
- Reduces to semiring matrix multiplication with a dependency along k
- Computational structure is similar to LU factorization

Semiring matrix multiplication (SMM)
- replace scalar multiply with scalar add
- replace scalar add with scalar min
- depending on the processor, can require more instructions
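In the tropical (min, +) semiring, a dense SMM is a one-liner in numpy (illustrative; use np.inf for absent edges):

import numpy as np

def smm(A, B):
    # (min,+) semiring matrix product: C[i,j] = min_k (A[i,k] + B[k,j])
    return np.min(A[:, :, None] + B[None, :, :], axis=1)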

A recursive algorithm for all-pairs shortest-paths

A known algorithm for recursively computing APSP (a serial sketch follows the list):
1. Given adjacency matrix A of graph G
2. Recurse on block A₁₁
3. Compute SMM A₁₂ ← A₁₁ · A₁₂
4. Compute SMM A₂₁ ← A₂₁ · A₁₁
5. Compute SMM A₂₂ ← A₂₁ · A₁₂
6. Recurse on block A₂₂
7. Compute SMM A₂₁ ← A₂₂ · A₂₁
8. Compute SMM A₁₂ ← A₁₂ · A₂₂
9. Compute SMM A₁₁ ← A₁₂ · A₂₁
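A serial sketch of this recursion using smm from above (distance matrix with 0 on the diagonal and np.inf for non-edges; each SMM accumulates with an elementwise min, the semiring analogue of +=):

import numpy as np

def dc_apsp(A):
    # Divide-and-conquer APSP over the (min,+) semiring (serial sketch).
    n = A.shape[0]
    if n == 1:
        return A
    m = n // 2
    A = A.copy()
    A11, A12 = A[:m, :m], A[:m, m:]
    A21, A22 = A[m:, :m], A[m:, m:]
    A11[:] = dc_apsp(A11)
    A12[:] = np.minimum(A12, smm(A11, A12))
    A21[:] = np.minimum(A21, smm(A21, A11))
    A22[:] = np.minimum(A22, smm(A21, A12))
    A22[:] = dc_apsp(A22)
    A21[:] = np.minimum(A21, smm(A22, A21))
    A12[:] = np.minimum(A12, smm(A12, A22))
    A11[:] = np.minimum(A11, smm(A12, A21))
    return A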

Block-cyclic recursive parallelization

[Figure: a cyclic step and a blocked step of the recursive decomposition of the n-by-n matrix over processors P₁₁, ..., P₄₄]

2.5D APSP

2.5D recursive parallelization is straightforward
- perform 'cyclic steps' using a 2.5D process grid
- decompose 'blocked steps' using an octant of the grid
- switch to the 2D algorithm when the grid is 2D
- minimizes latency and bandwidth!

2.5D APSP strong scaling performance

[Figure: strong scaling of DC-APSP on Hopper; GFlops vs. number of compute nodes (1 to 1024) for 2D and 2.5D at n = 8,192 and n = 32,768]

Block size gives a latency-bandwidth tradeoff

[Figure: performance of 2D DC-APSP on 256 nodes of Hopper (n = 8,192); GFlops vs. block size b (1 to 256)]

Blocked vs. block-cyclic vs. cyclic decompositions

[Figure: blocked, block-cyclic, and cyclic layouts; green denotes fill (unique values), red denotes padding / load imbalance]

Cyclops Tensor Framework (CTF)

Big idea:
- decompose tensors cyclically among processors
- pick the cyclic phase to preserve partial/full symmetric structure

3D tensor contraction

[Figure: a contraction of 3D tensors]

3D tensor cyclic decomposition

[Figure: cyclic decomposition of a 3D tensor]

3D tensor mapping

[Figure: mapping of a cyclically decomposed 3D tensor onto a processor grid]

A cyclic layout is still challenging

- In order to retain partial symmetric structure, all symmetric dimensions of a tensor must be mapped with the same cyclic phase (see the check below)
- The contracted dimensions of A and B must be mapped with the same phase
- And yet the virtual mapping needs to be mapped onto a physical topology, which can be of any shape
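Why equal phases preserve symmetry: with the same cyclic phase p in both dimensions of a symmetric matrix, the (i, j) and (j, i) local blocks are transposes of each other, so the decomposition retains the symmetric structure blockwise. A small numpy check (illustrative):

import numpy as np

n, p = 12, 3
X = np.random.rand(n, n)
X = X + X.T  # symmetric matrix
for i in range(p):
    for j in range(p):
        # block (i,j) under cyclic phase p in both dimensions
        assert np.allclose(X[i::p, j::p], X[j::p, i::p].T)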

Virtual processor grid dimensions

- Our virtual cyclic topology is somewhat restrictive, and the physical topology is very restricted
- Virtual processor grid dimensions serve as a new level of indirection (a minimal sketch follows)
- If a tensor dimension must have a certain cyclic phase, adjust the physical mapping by creating a virtual processor dimension
- This allows the physical processor grid to be 'stretchable'
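A minimal sketch of the indirection (hypothetical helper, not the CTF API): the element's cyclic phase equals the virtual grid dimension, and virtual ranks fold round-robin onto the physical dimension.

def cyclic_map(i, p_phys, virt):
    # Map index i under cyclic phase p_phys * virt: return the physical
    # owner and which of its `virt` virtual blocks holds the element.
    virtual_rank = i % (p_phys * virt)
    return virtual_rank % p_phys, virtual_rank // p_phys

# e.g. a dimension that needs cyclic phase 6 on 2 physical processors:
# cyclic_map(i, p_phys=2, virt=3) -> each processor holds 3 virtual blocks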

Virtual processor grid construction

[Figure: matrix multiplication on a 2x3 processor grid; red lines represent the virtualized part of the processor grid; elements are assigned to blocks by cyclic phase]

2.5D algorithms for tensors

We incorporate data replication for communication minimization into CTF
- replicate only one tensor/matrix (minimizes bandwidth but not latency)
- autotune over mappings to all possible physical topologies
- select the mapping with the least amount of communication
- achieve minimal communication for tensors of widely different sizes

Preliminary contraction results on Blue Gene/P

[Figure: Cyclops TF strong scaling on BG/P; percentage of theoretical symmetric peak vs. #nodes (256 to 2048) for 4D symmetric (CCSD), 6D symmetric (CCSDT), and 8D symmetric (CCSDTQ) contractions]

Preliminary Coupled Cluster results on Blue Gene/Q

A Coupled Cluster with Double excitations (CCD) implementation is up and running

- Already scaled to 1,024 nodes of BG/Q, with up to 400 virtual orbitals
- Preliminary results already indicate performance matching NWChem
- Several major optimizations are still in progress
- Expecting significantly better scalability than any existing software

Conclusion

Our contributions:
- 2.5D mapping of matrix multiplication
  - optimal according to the lower bounds of [Irony, Tiskin, Toledo 04] and [Aggarwal, Chandra, and Snir 90]
- A new latency lower bound for LU
- Communication-optimal 2.5D LU, QR, and APSP
  - all are bandwidth-optimal according to the general lower bound of [Ballard, Demmel, Holtz, Schwartz 10]
  - LU is latency-optimal according to the new lower bound
- Cyclops Tensor Framework
  - runtime autotuning to minimize communication
  - topology-aware mapping in any dimension, with symmetry considerations


Backup slides

Rectangular collectives

Performance of multicast (BG/P vs Cray)

[Figure: bandwidth (MB/sec) of a 1 MB multicast vs. #nodes on BG/P, Cray XT5, and Cray XE6 (same data as the earlier multicast figure)]


Why the performance discrepancy in multicasts?

- Cray machines use binomial multicasts
  - form a spanning tree from a list of nodes
  - route copies of the message down each branch
  - network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts
  - require the network topology to be a k-ary n-cube
  - form 2n edge-disjoint spanning trees
  - route in different dimensional orders
  - use both directions of the bidirectional network
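For contrast, a minimal sketch of the binomial broadcast tree used on the Cray machines (recursive doubling from root 0; illustrative):

def binomial_children(rank, p):
    # Children of `rank` in a binomial broadcast tree rooted at 0:
    # at step t, every rank r < 2**t sends to r + 2**t.
    children, t = [], 1
    while t < p:
        if t > rank and rank + t < p:
            children.append(rank + t)
        t *= 2
    return children

# e.g. p = 8: node 0 -> [1, 2, 4], node 1 -> [3, 5], node 2 -> [6]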
