
Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms

Edgar Solomonik and James Demmel

UC Berkeley

September 1st, 2011

Outline

Introduction
- Strong scaling

2.5D matrix multiplication
- Strong scaling matrix multiplication
- Performing faster at scale

2.5D LU factorization
- Communication-optimal LU without pivoting
- Communication-optimal LU with pivoting

Conclusion

Solving science problems faster

Parallel computers can solve bigger problems
- weak scaling

Parallel computers can also solve a fixed problem faster
- strong scaling

Obstacles to strong scaling
- may increase the relative cost of communication
- may hurt load balance

Achieving strong scaling

How to reduce communication and maintain load balance?
- reduce communication along the critical path

Communicate less
- avoid unnecessary communication

Communicate smarter
- know your network topology

Strong scaling matrix multiplication

[Figure: Matrix multiplication on BG/P (n = 65,536); percentage of machine peak vs. #nodes (256, 512, 1024, 2048) for 2.5D MM and 2D MM.]

Blocking matrix multiplication

[Figure: matrices A and B partitioned into blocks for the multiplication.]

2D matrix multiplication [Cannon 69], [Van De Geijn and Watts 97]

[Figure: 2D algorithm on 16 CPUs arranged in a 4x4 grid; the blocks of A and B are distributed across the grid.]

3D matrix multiplication [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Berntsen 89]

[Figure: 3D algorithm on 64 CPUs arranged in a 4x4x4 grid, with 4 copies of the matrices.]

2.5D matrix multiplication

[Figure: 2.5D algorithm on 32 CPUs arranged in a 4x4x2 grid, with 2 copies of the matrices.]

2.5D strong scaling

n = dimension, p = #processors, c = #copies of data

- must satisfy $1 \le c \le p^{1/3}$
- special case: $c = 1$ yields the 2D algorithm
- special case: $c = p^{1/3}$ yields the 3D algorithm

$$\mathrm{cost}(\text{2.5D MM}(p, c)) = O(n^3/p) \text{ flops} + O\big(n^2/\sqrt{cp}\big) \text{ words moved} + O\big(\sqrt{p/c^3}\big) \text{ messages}^*$$

*ignoring log(p) factors

$$\mathrm{cost}(\text{2D MM}(p)) = O(n^3/p) \text{ flops} + O\big(n^2/\sqrt{p}\big) \text{ words moved} + O(\sqrt{p}) \text{ messages}^* = \mathrm{cost}(\text{2.5D MM}(p, 1))$$

$$\mathrm{cost}(\text{2.5D MM}(cp, c)) = O\big(n^3/(cp)\big) \text{ flops} + O\big(n^2/(c\sqrt{p})\big) \text{ words moved} + O(\sqrt{p}/c) \text{ messages} = \mathrm{cost}(\text{2D MM}(p))/c$$

⇒ perfect strong scaling
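To check the claim numerically, here is a minimal sketch (unit constants, log factors ignored, per the footnote) that evaluates the three leading-order cost terms:

```python
from math import sqrt

def mm_cost(n, p, c=1):
    """Leading-order 2.5D MM cost terms on p processors with c matrix
    copies (unit constants, log factors ignored); c = 1 gives 2D."""
    assert 1 <= c <= p ** (1 / 3) + 1e-9
    return n**3 / p, n**2 / sqrt(c * p), sqrt(p / c**3)

n, p, c = 65536, 1024, 4
flops2d, words2d, msgs2d = mm_cost(n, p)         # 2D MM on p processors
flops25, words25, msgs25 = mm_cost(n, c * p, c)  # 2.5D MM on c*p processors
# every term drops by exactly a factor of c: perfect strong scaling
print(flops2d / flops25, words2d / words25, msgs2d / msgs25)  # 4.0 4.0 4.0
```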

2.5D MM on 65,536 cores

[Figure: Matrix multiplication on 16,384 nodes of BG/P; percentage of machine peak vs. matrix dimension n (8,192-131,072) for 2.5D MM (using c = 16 matrix copies) and 2D MM; 2.5D is 12X faster at the small end and 2.7X faster at the large end.]

Cost breakdown of MM on 65,536 cores

[Figure: Matrix multiplication on 16,384 nodes of BG/P; execution time normalized by 2D, broken into communication, idle, and computation, for n = 8,192 and n = 131,072 with 2D and 2.5D; 2.5D gives up to a 95% reduction in communication.]

2.5D LU strong scaling (without pivoting)

[Figure: LU without pivoting on BG/P (n = 65,536); percentage of machine peak vs. #nodes (256-2048): ideal scaling, 2.5D LU, 2D LU.]

2D blocked LU factorization

[Figure sequence: start from the matrix A; factor the leading block as L₀₀U₀₀; extend to the full panels L and U; update the trailing matrix with the Schur complement S = A − LU.]
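For concreteness, a minimal NumPy sketch of the right-looking blocked LU (no pivoting) that these diagrams depict; the last line of the loop is the S = A − LU update shown above:

```python
import numpy as np

def blocked_lu(A, b):
    """In-place blocked LU without pivoting (illustrative sketch).
    On return, the strict lower triangle of A holds L (unit diagonal)
    and the upper triangle holds U."""
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # 1) unblocked LU of the diagonal block -> L00, U00
        for j in range(k, e - 1):
            A[j+1:e, j] /= A[j, j]
            A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
        L00 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        U00 = np.triu(A[k:e, k:e])
        # 2) panel solves: L10 = A10 U00^{-1}, U01 = L00^{-1} A01
        A[e:, k:e] = np.linalg.solve(U00.T, A[e:, k:e].T).T
        A[k:e, e:] = np.linalg.solve(L00, A[k:e, e:])
        # 3) Schur-complement update: S = A - L U
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]

A = np.random.rand(8, 8) + 8 * np.eye(8)  # diagonally dominant: safe without pivoting
LU = A.copy(); blocked_lu(LU, 4)
L, U = np.tril(LU, -1) + np.eye(8), np.triu(LU)
assert np.allclose(L @ U, A)
```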

2D block-cyclic decomposition

[Figure: a matrix partitioned block-cyclically, so each processor owns many blocks spread over the whole matrix.]
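The layout is fully specified by a simple index-to-owner map; a small sketch (block size b on a pr x pc processor grid):

```python
def block_cyclic_owner(i, j, b, pr, pc):
    """Processor (row, col) that owns global entry (i, j) in a 2D
    block-cyclic layout with b x b blocks on a pr x pc grid."""
    return (i // b) % pr, (j // b) % pc

# with b = 1 this is a purely cyclic layout; with b = n/pr it is blocked
print(block_cyclic_owner(5, 12, b=2, pr=4, pc=4))  # -> (2, 2)
```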

2D block-cyclic LU factorization

[Figure sequence: the same LU steps on the block-cyclic layout — factor and form the L and U panels, then update the Schur complement S = A − LU; every processor participates in each step.]

2.5D LU factorization

[Figure sequence, steps (A)-(D): each processor layer holds a copy of L₀₀ and U₀₀ together with a different set of panels (L₁₀, L₂₀, L₃₀ and U₀₁, U₀₃); the layers then compute the distributed update of L and U in parallel.]

2.5D LU factorization

Look at how this update is distributed.

What does it remind you of?

2.5D LU factorization

Look at how this update is distributed.

[Figure: the Schur-complement update of the blocks of A and B is spread over the processor layers exactly like the same 3D update in matrix multiplication.]

Communication-avoiding pivoting

Partial pivoting is not communication-optimal on a blocked matrix
- requires a message/synchronization for each column
- O(n) messages required

Tournament pivoting, or Communication-Avoiding (CA) pivoting (see the sketch below)
- performs a tournament to determine the best pivot-row candidates
- the blocked CA-pivoting algorithm is communication-optimal
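A minimal serial sketch of the tournament; in the parallel algorithm each pairwise combine in the while-loop is one message exchange in a reduction tree, giving O(log p) messages per panel instead of one per column:

```python
import numpy as np

def pivot_candidates(rows, gidx, b):
    """Choose b candidate pivot rows from a stack of rows using
    Gaussian elimination with partial pivoting; returns the winning
    (unmodified) rows and their global indices."""
    A = rows.astype(float).copy()
    perm = np.arange(len(gidx))
    for k in range(b):
        r = k + int(np.argmax(np.abs(A[k:, k])))  # largest entry in column k
        A[[k, r]] = A[[r, k]]
        perm[[k, r]] = perm[[r, k]]
        A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])
    return rows[perm[:b]], np.asarray(gidx)[perm[:b]]

def tournament_pivoting(blocks, gidxs, b):
    """Tournament: select b candidates per block, then combine
    candidate sets pairwise until one set of b winners remains."""
    cands = [pivot_candidates(r, g, b) for r, g in zip(blocks, gidxs)]
    while len(cands) > 1:
        pairs = [cands[i:i+2] for i in range(0, len(cands), 2)]
        cands = [pivot_candidates(np.vstack([c[0] for c in p]),
                                  np.concatenate([c[1] for c in p]), b)
                 if len(p) == 2 else p[0] for p in pairs]
    return cands[0][1]  # global indices of the chosen pivot rows

# panel of 16 rows split over 4 "processors"
A = np.random.rand(16, 4)
blocks = [A[i*4:(i+1)*4] for i in range(4)]
gidxs = [np.arange(i*4, (i+1)*4) for i in range(4)]
print(tournament_pivoting(blocks, gidxs, 4))
```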

Strong scaling of 2.5D LU with tournament pivoting

[Figure: LU with tournament pivoting on BG/P (n = 65,536); percentage of machine peak vs. #nodes (256-2048): ideal scaling, 2.5D LU, 2D LU, ScaLAPACK PDGETRF.]

2.5D LU factorization with tournament pivoting

[Figure sequence: each processor layer factors its copy of the pivoted panel (PA₀ ... PA₃) with partial pivoting; the tournament combines the resulting (P, L, U) candidates to select the winning pivot rows; each layer then forms its panel pieces (L₁₀ ... L₄₀ and U₀₁) and applies its share of the updates, yielding L₀₀, U₀₀ and the remaining panels L₂₀, L₃₀ and U₀₁ ... U₀₃.]

2.5D LU on 65,536 cores

[Figure: LU on 16,384 nodes of BG/P (n = 131,072); time (sec) broken into communication, idle, and compute for NO-pivot 2D, NO-pivot 2.5D, CA-pivot 2D, and CA-pivot 2.5D; 2.5D is 2X faster both without pivoting and with CA-pivoting.]

Conclusion

Our contributions:
- 2.5D mapping of matrix multiplication
  - optimal according to the lower bounds in [Irony, Tiskin, Toledo 04] and [Aggarwal, Chandra, and Snir 90]
- a new latency lower bound for LU
- communication-optimal 2.5D LU
  - bandwidth-optimal according to the general lower bound of [Ballard, Demmel, Holtz, Schwartz 10]
  - latency-optimal according to the new lower bound

Open questions:
- 2.5D Householder QR

Reflections:
- replication allows better strong scaling
- topology-aware mapping cuts communication costs

Backup slides

A new latency lower bound for LU

Can we do LU with $O(\sqrt{p/c^3})$ messages?

For block size $n/d$, LU does:
- $\Theta(n^3/d^2)$ flops
- $\Theta(n^2/d)$ words
- $\Theta(d)$ messages

Now pick $d$ (= latency cost):
- $d = \Theta(\sqrt{p})$ to minimize critical-path flops
- $d = \Theta(\sqrt{cp})$ to minimize words

No dice. Let's minimize bandwidth.

[Figure: the critical path of LU runs through the $d$ diagonal blocks $A_{00}, A_{11}, \ldots, A_{d-1,d-1}$, each of size $n/d$.]
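Spelling out the contradiction (a worked restatement of the constraints above, nothing new): to match the flop and bandwidth costs of 2.5D MM we need

$$ \Theta\!\left(\frac{n^3}{d^2}\right) \le O\!\left(\frac{n^3}{p}\right) \Rightarrow d = \Omega(\sqrt{p}), \qquad \Theta\!\left(\frac{n^2}{d}\right) \le O\!\left(\frac{n^2}{\sqrt{cp}}\right) \Rightarrow d = \Omega(\sqrt{cp}), $$

but then the message count is

$$ \Theta(d) = \Omega(\sqrt{cp}) \gg O\!\left(\sqrt{p/c^3}\right), $$

off by a factor of $c^2$, which is why the slide concludes "no dice" and minimizes bandwidth instead.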

Performance of multicast (BG/P vs Cray)

[Figure: bandwidth (MB/sec) of a 1 MB multicast vs. #nodes (8-4096) on BG/P, Cray XT5, and Cray XE6.]

Why the performance discrepancy in multicasts?

- Cray machines use binomial multicasts
  - form a spanning tree from a list of nodes
  - route copies of the message down each branch
  - network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts (see the toy model below)
  - require the network topology to be a k-ary n-cube
  - form 2n edge-disjoint spanning trees
  - route in different dimensional orders
  - use both directions of the bidirectional network
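A toy saturation-bandwidth model of the difference; the contention factor and link rate here are illustrative assumptions, not measured constants:

```python
def multicast_bw(link_bw, n_dims, kind, contention=4.0):
    """Toy model of large-message multicast bandwidth on a k-ary
    n-cube. Rectangular: the message is split over 2n edge-disjoint
    spanning trees, each pipelining at one link's bandwidth.
    Binomial: roughly one tree, degraded by torus link contention."""
    if kind == "rectangular":
        return 2 * n_dims * link_bw
    return link_bw / contention

link_bw = 425.0  # MB/s per BG/P torus link direction (approximate)
print(multicast_bw(link_bw, 3, "rectangular"))  # ~2550 MB/s
print(multicast_bw(link_bw, 3, "binomial"))     # ~106 MB/s
```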

2D rectangular multicast trees

[Figure: a 2D 4x4 torus with four edge-disjoint spanning trees rooted at one node, shown separately and combined.]

Model verification: one dimension

[Figure: DCMF broadcast on a ring of 8 nodes of BG/P; bandwidth (MB/sec) vs. message size (1 KB-256 MB): t_rect model vs. DCMF rectangle dput, t_bnm model vs. DCMF binomial.]

Model verification: two dimensions

[Figure: DCMF broadcast on 64 (8x8) nodes of BG/P; bandwidth (MB/sec) vs. message size: t_rect model vs. DCMF rectangle dput, t_bnm model vs. DCMF binomial.]

Model verification: three dimensions

[Figure: DCMF broadcast on 512 (8x8x8) nodes of BG/P; bandwidth (MB/sec) vs. message size: t_rect model, data from Faraj et al, DCMF rectangle dput, t_bnm model, DCMF binomial.]

Another look at that first plot

Just how much better are rectangular algorithms on P = 4096 nodes?
- binomial collectives on the XE6: 1/30th of the link bandwidth
- rectangular collectives on BG/P: 4.3X the link bandwidth
- over 120X improvement in efficiency!

[Figure: the 1 MB multicast bandwidth plot again (BG/P, Cray XT5, Cray XE6; 8-4096 nodes).]

How can we apply this?

Bridging dense linear algebra techniques and applications

Target application: tensor contractions in electronic structure calculations (quantum chemistry)
- often memory constrained
- most target tensors are oddly shaped
- need support for high-dimensional tensors
- need handling of partial/full tensor symmetry
- would like to use communication-avoiding ideas (blocking, 2.5D, topology-awareness)

Decoupling memory usage and topology-awareness

- 2.5D algorithms couple memory usage and virtual topology
- c copies of a matrix imply c processor layers
- instead, we can nest 2D and/or 2.5D algorithms
- higher-dimensional algorithms allow smarter topology-aware mapping

4D SUMMA-Cannon

How do we map to a 3D partition without using more memory?
- SUMMA (bcast-based) on 2D layers (simulated below)
- Cannon (send-based) along the third dimension
- Cannon calls SUMMA as a sub-routine
- minimizes inefficient (non-rectangular) communication
- allows better overlap
- treats MM as a 4D tensor contraction

[Figure: MM on 512 nodes of BG/P; percentage of flops peak vs. matrix dimension (4096-131072): 2.5D MM, 4D MM, 2D MM (Cannon), PBLAS MM.]
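A serial stand-in for the SUMMA layer named above, to show the data flow; each k iteration corresponds to one pair of row/column broadcasts in the real distributed algorithm:

```python
import numpy as np

def summa_serial(Ablk, Bblk):
    """Serial simulation of SUMMA on a pr x pc grid: Ablk[i][k] and
    Bblk[k][j] are the local blocks; at step k the k-th block column
    of A would be broadcast along processor rows and the k-th block
    row of B along processor columns, then everyone multiplies."""
    pr, pc, K = len(Ablk), len(Bblk[0]), len(Bblk)
    C = [[np.zeros((Ablk[i][0].shape[0], Bblk[0][j].shape[1]))
          for j in range(pc)] for i in range(pr)]
    for k in range(K):                      # one broadcast round per k
        for i in range(pr):
            for j in range(pc):
                C[i][j] += Ablk[i][k] @ Bblk[k][j]
    return C

# 2x2 grid demo
blocks = lambda M: [[M[i*2:(i+1)*2, j*2:(j+1)*2] for j in range(2)] for i in range(2)]
M, N = np.random.rand(4, 4), np.random.rand(4, 4)
assert np.allclose(np.block(summa_serial(blocks(M), blocks(N))), M @ N)
```

In the 4D scheme, a send-based Cannon pass over the third grid dimension would wrap around calls like this one.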

Symmetry is a problem
- a fully symmetric tensor of dimension d requires only $n^d/d!$ storage
- symmetry significantly complicates sequential implementation
- irregular indexing makes alignment and unrolling difficult
- generalizing over all partial symmetries is expensive
- blocked or block-cyclic virtual processor decompositions give irregular or imbalanced virtual grids

[Figure: blocked and block-cyclic decompositions of a symmetric tensor over processors P0-P3, both yielding irregular or imbalanced virtual grids.]

Solving the symmetry problem
- a cyclic decomposition allows balanced and regular blocking of symmetric tensors
- if the cyclic phase is the same in each symmetric dimension, each sub-tensor retains the symmetry of the whole tensor (demonstrated below)

[Figure: a cyclic decomposition of a symmetric tensor; the sub-tensors are themselves symmetric.]
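A quick NumPy check of the second bullet, using a symmetric matrix as the simplest symmetric tensor:

```python
import numpy as np

n, phase = 12, 4                      # same cyclic phase in both dimensions
A = np.random.rand(n, n)
A = A + A.T                           # symmetric "tensor"

for p in range(phase):
    for q in range(phase):
        sub = A[p::phase, q::phase]   # sub-tensor owned by processor (p, q)
        if p == q:
            assert np.allclose(sub, sub.T)                    # retains symmetry
        else:
            assert np.allclose(sub, A[q::phase, p::phase].T)  # transposed pair
print("every cyclic sub-block is symmetric or paired with its transpose")
```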

A generalized cyclic layout is still challenging

- in order to retain partial symmetry, all symmetric dimensions of a tensor must be mapped with the same cyclic phase
- the contracted dimensions of A and B must be mapped with the same phase
- and yet the virtual mapping still needs to be mapped onto a physical topology, which can be any shape

Virtual processor grid dimensions

- our virtual cyclic topology is somewhat restrictive, and the physical topology is very restricted
- virtual processor grid dimensions serve as a new level of indirection
- if a tensor dimension must have a certain cyclic phase, adjust the physical mapping by creating a virtual processor dimension
- allows the physical processor grid to be 'stretchable'

Constructing a virtual processor grid for MM

Matrix multiply on a 2x3 processor grid. Red lines represent the virtualized part of the processor grid. Elements are assigned to blocks by cyclic phase.

[Figure: C = A x B with the blocks of A, B, and C assigned cyclically on the virtualized grid.]
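One way to 'stretch' the 2x3 grid of this example so that both dimensions share a cyclic phase; the lcm-based phase choice here is an illustrative assumption, not necessarily the library's exact rule:

```python
from math import lcm

def virtual_grid(pr, pc):
    """Stretch a pr x pc physical grid to a square virtual grid whose
    side is a common cyclic phase; each processor then handles a
    (phase/pr) x (phase/pc) patch of virtual grid points."""
    phase = lcm(pr, pc)
    return phase, phase // pr, phase // pc

print(virtual_grid(2, 3))  # -> (6, 3, 2): 6x6 virtual grid, 3x2 per CPU
```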

Unfolding the processor grid

- higher-dimensional fully-symmetric tensors can be mapped onto a lower-dimensional processor grid via creation of new virtual dimensions
- lower-dimensional tensors can be mapped onto a higher-dimensional processor grid by unfolding (serializing) pairs of processor dimensions
- however, when possible, replication is better than unfolding, since unfolded processor grids can lead to an unbalanced mapping

A basic parallel algorithm for symmetric tensor contractions

1. Arrange the processor grid in any k-ary n-cube shape
2. Map (via unfolding and virtualization) both A and B cyclically along the dimensions being contracted
3. Map (via unfolding and virtualization) the remaining dimensions of A and B cyclically
4. For each tensor dimension contracted over, recursively multiply the tensors along the mapping
   - each contraction dimension is represented by a nested call to a local multiply or a parallel algorithm (e.g. Cannon); see the sketch below
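A serial sketch of step 4 for a two-index contraction C[i,j] = Σ_{k,l} A[i,k,l] B[k,l,j]: one nested level per contracted dimension, with a local multiply at the innermost level (in the parallel algorithm each level would instead be a distributed MM such as Cannon):

```python
import numpy as np

def nested_contract(A, B):
    """C[i,j] = sum_{k,l} A[i,k,l] * B[k,l,j]: the outer loop handles
    the l dimension, the inner matmul handles the k dimension."""
    C = np.zeros((A.shape[0], B.shape[2]))
    for l in range(A.shape[2]):        # one level per contracted dimension
        C += A[:, :, l] @ B[:, l, :]   # innermost level: local multiply
    return C

A, B = np.random.rand(4, 5, 6), np.random.rand(5, 6, 7)
assert np.allclose(nested_contract(A, B), np.einsum('ikl,klj->ij', A, B))
```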

Tensor library structure

The library supports arbitrary-dimensional parallel tensor contractions, with any symmetries, on n-cuboid processor torus partitions.
1. Load tensor data as (global rank, value) pairs
2. Once a contraction is defined, map the participating tensors
3. Distribute or reshuffle the tensor data/pairs
4. Construct the contraction algorithm with recursive function/argument pointers
5. Contract the sub-tensors with a user-defined sequential contract function
6. Output (global rank, value) pairs on request

Current tensor library status

- dense and symmetric remapping/repadding/contractions implemented
- currently functional only for dense tensors, but with full symmetric logic
- can perform automatic mapping with physical and virtual dimensions, but cannot unfold processor dimensions yet
- complete library interface implemented, including basic auxiliary functions (e.g. map/reduce, sum, etc.)

Next implementation steps

- currently integrating the library with an SCF method code that uses dense contractions
- get symmetric redistribution working correctly
- automatic unfolding of processor dimensions
- implement mapping by replication to enable 2.5D algorithms
- much basic performance debugging/optimization left to do
- more optimization needed for sequential symmetric contractions

Very preliminary contraction library results

Contracts tensors of size 64x64x256x256 in 1 second on 2K nodes.

[Figure: strong scaling of a dense 64x64x256x256 contraction on BG/P; percentage of machine peak vs. #nodes (256-2048): no rephase, rephase every contraction, repad every contraction.]

Potential benefit of unfolding

Unfolding the smallest two BG/P torus dimensions improves performance.

[Figure: strong scaling of a dense 64x64x256x256 contraction on BG/P; percentage of machine peak vs. #nodes (256-2048): no-rephase 2D vs. no-rephase 3D.]