Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms
Edgar Solomonik and James Demmel
UC Berkeley
September 1st, 2011
Outline
Introduction
- Strong scaling

2.5D matrix multiplication
- Strong scaling matrix multiplication
- Performing faster at scale

2.5D LU factorization
- Communication-optimal LU without pivoting
- Communication-optimal LU with pivoting

Conclusion
Solving science problems faster
- Parallel computers can solve bigger problems (weak scaling)
- Parallel computers can also solve a fixed problem faster (strong scaling)
- Obstacles to strong scaling:
  - may increase relative cost of communication
  - may hurt load balance
Achieving strong scaling
How to reduce communication and maintain load balance?

Communicate less:
- reduce communication along the critical path
- avoid unnecessary communication

Communicate smarter:
- know your network topology
Strong scaling matrix multiplication
[Figure: Matrix multiplication on BG/P (n = 65,536) — percentage of machine peak vs. #nodes (256 to 2048) for 2.5D MM and 2D MM.]
Blocking matrix multiplication
[Figure: blocking of the matrices A and B.]
2D matrix multiplication [Cannon 69], [Van De Geijn and Watts 97]
[Figure: 2D algorithm on 16 CPUs (4x4) — blocks of A and B distributed across the processor grid.]
3D matrix multiplication [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89]
[Figure: 3D algorithm on 64 CPUs (4x4x4) — 4 copies of the matrices.]
2.5D matrix multiplication
[Figure: 2.5D algorithm on 32 CPUs (4x4x2) — 2 copies of the matrices.]
2.5D strong scaling
n = dimension, p = #processors, c = #copies of data
- must satisfy 1 ≤ c ≤ p^(1/3)
- special case: c = 1 yields the 2D algorithm
- special case: c = p^(1/3) yields the 3D algorithm
cost(2.5D MM(p, c)) = O(n^3/p) flops + O(n^2/√(c·p)) words moved + O(√(p/c^3)) messages*
*ignoring log(p) factors
2.5D strong scaling
cost(2D MM(p)) = O(n^3/p) flops + O(n^2/√p) words moved + O(√p) messages* = cost(2.5D MM(p, 1))
*ignoring log(p) factors
2.5D strong scaling
cost(2.5D MM(c·p, c)) = O(n^3/(c·p)) flops + O(n^2/(c·√p)) words moved + O(√p/c) messages = cost(2D MM(p))/c
perfect strong scaling
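The per-term factor-of-c saving above can be checked numerically. Below is a toy calculator of the slide's asymptotic cost terms (constants and log(p) factors dropped; not a performance model):

```python
from math import sqrt

def cost_25d_mm(n, p, c):
    """Leading-order cost terms of 2.5D MM, as on the slide (constants dropped)."""
    assert 1 <= c <= p ** (1.0 / 3) + 1e-9, "need 1 <= c <= p^(1/3)"
    flops = n**3 / p
    words = n**2 / sqrt(c * p)
    msgs = sqrt(p / c**3)
    return flops, words, msgs

def cost_2d_mm(n, p):
    """2D MM is the c = 1 special case."""
    return cost_25d_mm(n, p, 1)

n, p, c = 4096, 64, 4              # hypothetical sizes; c must satisfy c <= p^(1/3)
base = cost_2d_mm(n, p)
scaled = cost_25d_mm(n, c * p, c)  # c times more processors, c copies of data
# every term drops by exactly a factor of c: "perfect strong scaling"
print([b / s for b, s in zip(base, scaled)])  # [4.0, 4.0, 4.0]
```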
2.5D MM on 65,536 cores
[Figure: Matrix multiplication on 16,384 nodes of BG/P — percentage of machine peak vs. n (8192 to 131072). Using c = 16 matrix copies, 2.5D MM is 12X faster than 2D MM at n = 8192 and 2.7X faster at n = 131072.]
Cost breakdown of MM on 65,536 cores
[Figure: Matrix multiplication on 16,384 nodes of BG/P — execution time normalized by 2D, broken into communication, idle, and computation, for n = 8192 and n = 131072. 2.5D achieves a 95% reduction in communication.]
2.5D LU strong scaling (without pivoting)
[Figure: LU without pivoting on BG/P (n = 65,536) — percentage of machine peak vs. #nodes (256 to 2048) for ideal scaling, 2.5D LU, and 2D LU.]
2D blocked LU factorization

[Figure sequence: starting from A, factor the top-left block as L₀₀U₀₀; compute the panel factors L and U; then update the trailing matrix with the Schur complement S = A − LU.]
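The blocked factorization sketched in these figures can be written down directly. A minimal sequential NumPy sketch (the local view of the algorithm, not the parallel 2D code; `blocked_lu` and its block size `b` are illustrative names):

```python
import numpy as np

def blocked_lu(A, b):
    """Right-looking blocked LU without pivoting. Overwrites a copy of A with
    L (unit lower triangle, below the diagonal) and U (upper triangle)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # 1. factor the diagonal block A[k:e, k:e] = L00 @ U00 in place
        for j in range(k, e):
            A[j+1:e, j] /= A[j, j]
            A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
        L00 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        U00 = np.triu(A[k:e, k:e])
        # 2. panel solves: L10 = A10 @ inv(U00), U01 = inv(L00) @ A01
        A[e:, k:e] = A[e:, k:e] @ np.linalg.inv(U00)
        A[k:e, e:] = np.linalg.inv(L00) @ A[k:e, e:]
        # 3. Schur-complement update of the trailing matrix: S = A11 - L10 @ U01
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8)) + 8 * np.eye(8)  # diagonally dominant: safe without pivoting
F = blocked_lu(A, b=4)
L = np.tril(F, -1) + np.eye(8)
U = np.triu(F)
print(np.allclose(L @ U, A))  # True
```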
2D block-cyclic decomposition
[Figure: 2D block-cyclic decomposition — each processor in the grid owns many small blocks spread over the whole matrix.]
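A block-cyclic layout is just an index mapping. A small sketch (the `owner` helper is hypothetical, not ScaLAPACK's actual API), showing why work on the shrinking active submatrix during LU stays balanced:

```python
def owner(i, j, b, pr, pc):
    """Processor (row, col) owning matrix element (i, j) in a 2D block-cyclic
    layout with block size b on a pr x pc processor grid."""
    return (i // b) % pr, (j // b) % pc

# with b = 1 this degenerates to a purely cyclic layout; with b = n/pr it is blocked
b, pr, pc = 2, 4, 4
counts = {}
for i in range(16):
    for j in range(16):
        p = owner(i, j, b, pr, pc)
        counts[p] = counts.get(p, 0) + 1
# every processor owns blocks spread across the whole matrix
print(set(counts.values()))  # {16}: perfectly balanced
```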
2D block-cyclic LU factorization

[Figure sequence: the same LU steps on the block-cyclic layout — compute the panel factors L and U, then apply the Schur-complement update S = A − LU.]
2.5D LU factorization
[Figure sequence, steps (A)–(D): (A) each layer holds the factored block L₀₀, U₀₀ together with panel pieces (L₁₀, L₂₀, L₃₀ and U₀₁, U₀₃); (B)–(C) the panel factors are computed and exchanged across layers; (D) the layers jointly perform the trailing-matrix update of L and U.]
Look at how this update is distributed.
What does it remind you of?
It is the same 3D update as in matrix multiplication.

[Figure: the blocked A and B panels of the update, distributed as in the 3D algorithm.]
Communication-avoiding pivoting
Partial pivoting is not communication-optimal on a blocked matrix:
- requires a message/synchronization for each column
- O(n) messages required

Tournament pivoting, or Communication-Avoiding (CA) pivoting:
- performs a tournament to determine the best pivot-row candidates
- the blocked CA-pivoting algorithm is communication-optimal
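The tournament can be sketched as a tree reduction over row blocks. This is a toy single-process illustration of the idea, assuming a power-of-two number of blocks; `best_rows` stands in for the local GEPP-based contest and is not the authors' implementation:

```python
import numpy as np

def best_rows(rows, panel, b):
    """Local contest: run LU with partial pivoting on the given candidate rows
    of the panel and return the b rows it would choose as pivots."""
    sub = panel[rows].astype(float)
    idx = list(rows)
    for k in range(min(b, len(idx))):
        p = k + int(np.argmax(np.abs(sub[k:, k])))   # pick largest pivot
        sub[[k, p]] = sub[[p, k]]
        idx[k], idx[p] = idx[p], idx[k]
        if sub[k, k] != 0:
            sub[k+1:, k:] -= np.outer(sub[k+1:, k] / sub[k, k], sub[k, k:])
    return idx[:b]

def tournament_pivot(panel, b, nblocks):
    """Binary-tree tournament over row blocks (assumes nblocks is a power of 2):
    O(log nblocks) reduction rounds instead of one synchronization per column."""
    m = panel.shape[0]
    groups = [list(range(g, m, nblocks)) for g in range(nblocks)]  # toy row split
    cands = [best_rows(g, panel, b) for g in groups]               # local rounds
    while len(cands) > 1:                                          # tree reduction
        cands = [best_rows(cands[i] + cands[i+1], panel, b)
                 for i in range(0, len(cands), 2)]
    return cands[0]  # the b winning pivot rows for the panel

rng = np.random.default_rng(1)
panel = rng.standard_normal((16, 4))
print(tournament_pivot(panel, b=4, nblocks=4))
```

In the parallel setting each local round runs concurrently on its own processors, and each tree level is one message exchange, which is where the O(n) → O(log p) message saving comes from.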
Strong scaling of 2.5D LU with tournament pivoting
[Figure: LU with tournament pivoting on BG/P (n = 65,536) — percentage of machine peak vs. #nodes (256 to 2048) for ideal scaling, 2.5D LU, 2D LU, and ScaLAPACK PDGETRF.]
2.5D LU factorization with tournament pivoting
[Figure sequence: each layer performs a local LU factorization with pivoting on its panel blocks PA₀ … PA₃; a tournament over the candidate rows selects the winning pivots; the permutation P is applied and the panel factors (L₁₀ … L₄₀ and U₀₁) are computed; finally every layer applies its share of the trailing-matrix update.]
2.5D LU on 65,536 cores
[Figure: LU on 16,384 nodes of BG/P (n = 131,072) — time (sec) broken into communication, idle, and compute. 2.5D is 2X faster than 2D both without pivoting and with CA-pivoting.]
Conclusion

Our contributions:
- 2.5D mapping of matrix multiplication
  - optimal according to lower bounds in [Irony, Tiskin, Toledo 04] and [Aggarwal, Chandra, and Snir 90]
- a new latency lower bound for LU
- communication-optimal 2.5D LU
  - bandwidth-optimal according to the general lower bound [Ballard, Demmel, Holtz, Schwartz 10]
  - latency-optimal according to the new lower bound

Open questions:
- 2.5D Householder QR

Reflections:
- replication allows better strong scaling
- topology-aware mapping cuts communication costs
Backup slides
A new latency lower bound for LU
LU with O(√(P/c³)) messages?

For block size n/d, LU does:
- Θ(n³/d²) flops
- Θ(n²/d) words
- Θ(d) messages

Now pick d (= latency cost):
- d = Θ(√P) to minimize critical-path flops
- d = Θ(√(c·P)) to minimize words

No dice. Let's minimize bandwidth.

[Figure: the d diagonal blocks A₀₀ … A_{d−1,d−1} that must be factored in sequence.]
Performance of multicast (BG/P vs Cray)
[Figure: 1 MB multicast on BG/P, Cray XT5, and Cray XE6 — bandwidth (MB/sec, 128 to 8192) vs. #nodes (8 to 4096).]
Why the performance discrepancy in multicasts?
Cray machines use binomial multicasts:
- form a spanning tree from a list of nodes
- route copies of the message down each branch
- network contention degrades utilization on a 3D torus

BG/P uses rectangular multicasts:
- require the network topology to be a k-ary n-cube
- form 2n edge-disjoint spanning trees
- route in different dimensional orders
- use both directions of the bidirectional network
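A back-of-the-envelope model of why rectangular multicasts win on large messages (the link bandwidth and the contention factor below are assumed figures for illustration, not measurements):

```python
def rect_multicast_bw(link_bw, n_dims):
    """Idealized large-message limit: 2n edge-disjoint spanning trees on a
    k-ary n-cube each pipeline 1/(2n) of the message concurrently."""
    return 2 * n_dims * link_bw

def binomial_multicast_bw(link_bw, contention):
    """A binomial tree pipelines at best one link's bandwidth; torus link
    contention (an assumed model parameter) degrades it further."""
    return link_bw / contention

link_bw = 425.0  # MB/s per link direction; an assumed figure for illustration
print(rect_multicast_bw(link_bw, n_dims=3))   # up to 6x one link on a 3D torus
print(binomial_multicast_bw(link_bw, 30.0))   # a 30x contention penalty
```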
2D rectangular multicast trees
[Figure: on a 2D 4x4 torus, four edge-disjoint spanning trees from the same root; all 4 trees combined cover the torus links.]
Model verification: one dimension
[Figure: DCMF Broadcast on a ring of 8 nodes of BG/P — bandwidth (MB/sec) vs. message size (KB) for the t_rect model, DCMF rectangle dput, the t_bnm model, and DCMF binomial.]
Model verification: two dimensions
[Figure: DCMF Broadcast on 64 (8x8) nodes of BG/P — bandwidth (MB/sec) vs. message size (KB) for the t_rect model, DCMF rectangle dput, the t_bnm model, and DCMF binomial.]
Model verification: three dimensions
[Figure: DCMF Broadcast on 512 (8x8x8) nodes of BG/P — bandwidth (MB/sec) vs. message size (KB) for the t_rect model, DCMF rectangle dput, the t_bnm model, DCMF binomial, and data from Faraj et al.]
Another look at that first plot
Just how much better are rectangular algorithms on P = 4096 nodes?
- binomial collectives on the XE6 achieve 1/30th of link bandwidth
- rectangular collectives on BG/P achieve 4.3X the link bandwidth
- over 120X improvement in efficiency!

How can we apply this?

[Figure: 1 MB multicast on BG/P, Cray XT5, and Cray XE6 (as shown earlier).]
Bridging dense linear algebra techniques and applications
Target application: tensor contractions in electronic structure calculations (quantum chemistry)
- often memory constrained
- most target tensors are oddly shaped
- need support for high-dimensional tensors
- need handling of partial/full tensor symmetries
- would like to use communication-avoiding ideas (blocking, 2.5D, topology-awareness)
Decoupling memory usage and topology-awareness
- 2.5D algorithms couple memory usage and virtual topology
- c copies of a matrix implies c processor layers
- instead, we can nest 2D and/or 2.5D algorithms
- higher-dimensional algorithms allow smarter topology-aware mapping
4D SUMMA-Cannon

How do we map to a 3D partition without using more memory?
- SUMMA (bcast-based) on 2D layers
- Cannon (send-based) along the third dimension
- Cannon calls SUMMA as a sub-routine
- minimizes inefficient (non-rectangular) communication
- allows better overlap
- treats MM as a 4D tensor contraction

[Figure: MM on 512 nodes of BG/P — percentage of flops peak vs. matrix dimension (4096 to 131072) for 2.5D MM, 4D MM, 2D MM (Cannon), and PBLAS MM.]
Symmetry is a problem
- a fully symmetric tensor of dimension d requires only ~n^d/d! storage
- symmetry significantly complicates sequential implementation
- irregular indexing makes alignment and unrolling difficult
- generalizing over all partial symmetries is expensive
- blocked or block-cyclic virtual processor decompositions give irregular or imbalanced virtual grids

[Figure: blocked vs. block-cyclic decompositions of a symmetric tensor over processors P0–P3.]
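The n^d/d! figure is the large-n limit of the exact packed-storage count, which is a binomial coefficient. A quick check (the `sym_storage` helper is illustrative):

```python
from math import comb, factorial

def sym_storage(n, d):
    """Distinct entries of a fully symmetric order-d tensor of dimension n:
    multisets of d indices drawn from n values, i.e. C(n + d - 1, d)."""
    return comb(n + d - 1, d)

n, d = 64, 4
dense = n ** d
packed = sym_storage(n, d)
print(dense / packed)  # approaches d! = 24 as n grows
```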
Solving the symmetry problem
- a cyclic decomposition allows balanced and regular blocking of symmetric tensors
- if the cyclic phase is the same in each symmetric dimension, each sub-tensor retains the symmetry of the whole tensor

[Figure: cyclic decomposition of a symmetric tensor.]
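The same-phase property is easy to verify for an order-2 tensor (a symmetric matrix). A small sketch, with `cyclic_block` as an illustrative helper:

```python
import numpy as np

def cyclic_block(T, phase, r, c):
    """Sub-tensor owned by virtual processor (r, c) under a cyclic layout with
    the same phase in both dimensions (illustrative helper)."""
    return T[r::phase, c::phase]

rng = np.random.default_rng(2)
A = rng.standard_normal((12, 12))
A = A + A.T                       # a symmetric matrix (order-2 symmetric tensor)
phase = 3
for r in range(phase):
    B = cyclic_block(A, phase, r, r)
    print(np.allclose(B, B.T))    # True: diagonal blocks inherit the symmetry
# off-diagonal blocks come in transposed pairs, so no symmetry is lost either
```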
A generalized cyclic layout is still challenging
- in order to retain partial symmetry, all symmetric dimensions of a tensor must be mapped with the same cyclic phase
- the contracted dimensions of A and B must be mapped with the same phase
- and yet the virtual mapping needs to be mapped to a physical topology, which can be any shape
Virtual processor grid dimensions
- our virtual cyclic topology is somewhat restrictive, and the physical topology is very restricted
- virtual processor grid dimensions serve as a new level of indirection
- if a tensor dimension must have a certain cyclic phase, adjust the physical mapping by creating a virtual processor dimension
- allows the physical processor grid to be 'stretchable'
Constructing a virtual processor grid for MM
Matrix multiplication on a 2x3 processor grid. Red lines represent the virtualized part of the processor grid; elements are assigned to blocks by cyclic phase.

[Figure: A x B = C on the virtualized grid.]
Unfolding the processor grid
- higher-dimensional fully symmetric tensors can be mapped onto a lower-dimensional processor grid via creation of new virtual dimensions
- lower-dimensional tensors can be mapped onto a higher-dimensional processor grid by unfolding (serializing) pairs of processor dimensions
- however, when possible, replication is better than unfolding, since unfolded processor grids can lead to an unbalanced mapping
A basic parallel algorithm for symmetric tensor contractions
1. Arrange the processor grid in any k-ary n-cube shape
2. Map (via unfolding & virtualization) both A and B cyclically along the dimensions being contracted
3. Map (via unfolding & virtualization) the remaining dimensions of A and B cyclically
4. For each tensor dimension contracted over, recursively multiply the tensors along the mapping
   - each contraction dimension is represented with a nested call to a local multiply or a parallel algorithm (e.g. Cannon)
Tensor library structure
The library supports arbitrary-dimensional parallel tensor contractions with any symmetries on n-cuboid processor torus partitions:
1. Load tensor data by (global rank, value) pairs
2. Once a contraction is defined, map the participating tensors
3. Distribute or reshuffle the tensor data/pairs
4. Construct the contraction algorithm with recursive function/argument pointers
5. Contract the sub-tensors with a user-defined sequential contract function
6. Output (global rank, value) pairs on request
Current tensor library status
- dense and symmetric remapping/repadding/contractions implemented
- currently functional only for dense tensors, but with full symmetric logic
- can perform automatic mapping with physical and virtual dimensions, but cannot unfold processor dimensions yet
- complete library interface implemented, including basic auxiliary functions (e.g. map/reduce, sum, etc.)
Next implementation steps
- currently integrating the library with an SCF method code that uses dense contractions
- get symmetric redistribution working correctly
- automatic unfolding of processor dimensions
- implement mapping by replication to enable 2.5D algorithms
- much basic performance debugging/optimization left to do
- more optimization needed for sequential symmetric contractions
Very preliminary contraction library results

Contracts tensors of size 64x64x256x256 in 1 second on 2K nodes.
[Figure: strong scaling of a dense contraction on BG/P (64x64x256x256) — percentage of machine peak vs. #nodes (256 to 2048) for no rephase, rephase every contraction, and repad every contraction.]
Potential benefit of unfolding

Unfolding the smallest two BG/P torus dimensions improves performance.
[Figure: strong scaling of a dense contraction on BG/P (64x64x256x256) — percentage of machine peak vs. #nodes (256 to 2048) for no-rephase 2D and no-rephase 3D.]