Design of a High-Performance GEMM-Like Tensor-Tensor Multiplication
Design of a High-Performance GEMM-like Tensor-Tensor Multiplication
Paul Springer and Paolo Bientinesi
Aachen Institute for Advanced Study in Computational Engineering Science
Austin, Sep. 20th 2016
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 1 / 19 Outline
1 Introduction
2 GEMM-like Tensor-Tensor Multiplication
3 Tensor Contraction Code Generator
4 Performance
5 Conclusion and Future Work
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 2 / 19 Nested loops Transpose-Transpose-GEMM-Transpose (TTGT) Loops over GEMM (LoG)
Akin to a high-performance GEMM implementation Adopts the BLIS methodology: Breaking through the BLAS layer
combine GETT, TTGT and LoG into a unified tool
Essentially three approaches:
We propose a novel approach: GETT1
Tensor Contraction Code Generator (TCCG)
Introduction
Tensors can be thought of as higher dimensional matrices Tensor contraction can be thought of as higher dimensional GEMMs
1Paul Springer and Paolo Bientinesi. “Design of a high-performance GEMM-like Tensor-Tensor Multiplication”. In: TOMS, in review (). Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 3 / 19 Akin to a high-performance GEMM implementation Adopts the BLIS methodology: Breaking through the BLAS layer
combine GETT, TTGT and LoG into a unified tool
We propose a novel approach: GETT1
Tensor Contraction Code Generator (TCCG)
Introduction
Tensors can be thought of as higher dimensional matrices Tensor contraction can be thought of as higher dimensional GEMMs Essentially three approaches: Nested loops Transpose-Transpose-GEMM-Transpose (TTGT) Loops over GEMM (LoG)
1Paul Springer and Paolo Bientinesi. “Design of a high-performance GEMM-like Tensor-Tensor Multiplication”. In: TOMS, in review (). Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 3 / 19 combine GETT, TTGT and LoG into a unified tool
Tensor Contraction Code Generator (TCCG)
Introduction
Tensors can be thought of as higher dimensional matrices Tensor contraction can be thought of as higher dimensional GEMMs Essentially three approaches: Nested loops Transpose-Transpose-GEMM-Transpose (TTGT) Loops over GEMM (LoG) We propose a novel approach: GETT1 Akin to a high-performance GEMM implementation Adopts the BLIS methodology: Breaking through the BLAS layer
1Paul Springer and Paolo Bientinesi. “Design of a high-performance GEMM-like Tensor-Tensor Multiplication”. In: TOMS, in review (). Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 3 / 19 Introduction
Tensors can be thought of as higher dimensional matrices Tensor contraction can be thought of as higher dimensional GEMMs Essentially three approaches: Nested loops Transpose-Transpose-GEMM-Transpose (TTGT) Loops over GEMM (LoG) We propose a novel approach: GETT1 Akin to a high-performance GEMM implementation Adopts the BLIS methodology: Breaking through the BLAS layer Tensor Contraction Code Generator (TCCG) combine GETT, TTGT and LoG into a unified tool
1Paul Springer and Paolo Bientinesi. “Design of a high-performance GEMM-like Tensor-Tensor Multiplication”. In: TOMS, in review (). Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 3 / 19 Matrix-Matrix Multiplication
Matrix-Matrix Multiplication
M×K K×N M×N A ∈ R , B ∈ R and C ∈ R be 2D tensors: P Cm,n ← k Am,k Bk,n
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 4 / 19 Matrix-Matrix Multiplication
Matrix-Matrix Multiplication (Einstein notation)
M×K K×N M×N A ∈ R , B ∈ R and C ∈ R be 2D tensors: Cm,n ← Am,k Bk,n
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 4 / 19 Matrix-Matrix Multiplication
Matrix-Matrix Multiplication (Einstein notation)
M×K K×N M×N A ∈ R , B ∈ R and C ∈ R be 2D tensors: Cm,n ← Am,k Bk,n
//N-Loop for j = 0 : N − 1 //M-Loop for i = 0 : M − 1 tmp = 0 //K-Loop(contracted) for k = 0 : K − 1 tmp += Ai,k Bk,j // updateC Ci,j = α tmp + βCi,j
Naive GEMM. Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 4 / 19 Matrix-Matrix Multiplication
Matrix-Matrix Multiplication (Einstein notation)
M×K K×N M×N A ∈ R , B ∈ R and C ∈ R be 2D tensors: Cm,n ← Am,k Bk,n
//N-Loop for n = 0 : nc : N − 1 //K-Loop(contracted) for k = 0 : kc : K − 1 Bb = identify_submatrix(B , n, k ) // pack Bb into Be kc×nc Be = packB (Bb)// Be ∈ R //M-Loop //N-Loop for m = 0 : mc : M − 1 for j = 0 : N − 1 //M-Loop Ab = identify_submatrix(A, m, k ) for i = 0 : M − 1 // pack Ab into Ae tmp = 0 mc×kc //K-Loop(contracted) Ae = packA (Ab)// Ae ∈ R for k = 0 : K − 1 Cb = identify_submatrix(C , m, n) tmp += Ai,k Bk,j // updateC // matrix-matrix product: AeBe Ci,j = α tmp + βCi,j macroKernel(Ae, Be, Cb, α, β)
Naive GEMM. High-performance GEMM. Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 4 / 19 Cm1,m2,n ← Am1,m2,k Bk,n
Cm1,n,m2 ← Am1,m2,k Bk,n
Cm1,n1,n2,m2 ← Am1,m2,k Bn2,k,n1
Cm1,n1,n2,m2 ← Am1,k1,m2,k2 Bk2,n2,k1,n1
Cm1,n1,n2,m2,n3 ← Am1,k1,m2,k2 Bn3,k2,n2,k1,n1 ...
Tensor Contractions
Tensor contraction examples:
Cm,n ← Am,k Bk,n
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 19 Cm1,n,m2 ← Am1,m2,k Bk,n
Cm1,n1,n2,m2 ← Am1,m2,k Bn2,k,n1
Cm1,n1,n2,m2 ← Am1,k1,m2,k2 Bk2,n2,k1,n1
Cm1,n1,n2,m2,n3 ← Am1,k1,m2,k2 Bn3,k2,n2,k1,n1 ...
Tensor Contractions
Tensor contraction examples:
Cm,n ← Am,k Bk,n
Cm1,m2,n ← Am1,m2,k Bk,n
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 19 Cm1,n1,n2,m2 ← Am1,m2,k Bn2,k,n1
Cm1,n1,n2,m2 ← Am1,k1,m2,k2 Bk2,n2,k1,n1
Cm1,n1,n2,m2,n3 ← Am1,k1,m2,k2 Bn3,k2,n2,k1,n1 ...
Tensor Contractions
Tensor contraction examples:
Cm,n ← Am,k Bk,n
Cm1,m2,n ← Am1,m2,k Bk,n
Cm1,n,m2 ← Am1,m2,k Bk,n
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 19 Cm1,n1,n2,m2 ← Am1,k1,m2,k2 Bk2,n2,k1,n1
Cm1,n1,n2,m2,n3 ← Am1,k1,m2,k2 Bn3,k2,n2,k1,n1 ...
Tensor Contractions
Tensor contraction examples:
Cm,n ← Am,k Bk,n
Cm1,m2,n ← Am1,m2,k Bk,n
Cm1,n,m2 ← Am1,m2,k Bk,n
Cm1,n1,n2,m2 ← Am1,m2,k Bn2,k,n1
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 19 Cm1,n1,n2,m2,n3 ← Am1,k1,m2,k2 Bn3,k2,n2,k1,n1 ...
Tensor Contractions
Tensor contraction examples:
Cm,n ← Am,k Bk,n
Cm1,m2,n ← Am1,m2,k Bk,n
Cm1,n,m2 ← Am1,m2,k Bk,n
Cm1,n1,n2,m2 ← Am1,m2,k Bn2,k,n1
Cm1,n1,n2,m2 ← Am1,k1,m2,k2 Bk2,n2,k1,n1
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 19 Tensor Contractions
Tensor contraction examples:
Cm,n ← Am,k Bk,n
Cm1,m2,n ← Am1,m2,k Bk,n
Cm1,n,m2 ← Am1,m2,k Bk,n
Cm1,n1,n2,m2 ← Am1,m2,k Bn2,k,n1
Cm1,n1,n2,m2 ← Am1,k1,m2,k2 Bk2,n2,k1,n1
Cm1,n1,n2,m2,n3 ← Am1,k1,m2,k2 Bn3,k2,n2,k1,n1 ...
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 19 Tensor Contractions
Tensor contraction examples:
Cm,n ← Am,k Bk,n
Cm1,m2,n ← Am1,m2,k Bk,n
Cm1,n,m2 ← Am1,m2,k Bk,n
Cm1,n1,n2,m2 ← Am1,m2,k Bn2,k,n1
Cm1,n1,n2,m2 ← Am1,k1,m2,k2 Bk2,n2,k1,n1
Cm1,n1,n2,m2,n3 ← Am1,k1,m2,k2 Bn3,k2,n2,k1,n1 ...
⇒ Quite similar to GEMM.
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 19 Im := {m1, m2, ..., mγ }: free indices of A In := {n1, n2, ..., nζ }: free indices of B Ik := {k1, k2, ..., kξ}: contracted indices
These index sets Im, In and Ik are critical
GETT
Tensor-Tensor Multiplication (Einstein notation)
SA×SA×...SA SB×SB×...SB Let the input tensors A ∈ R 1 2 rA , and B ∈ R 1 2 rB SC×SC×...SC update the output tensor C ∈ R 1 2 rC :
C C ← αA A B B + βC C . Π (Im∪In) Π (Im∪Ik ) Π (In∪Ik ) Π (Im∪In)
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 6 / 19 GETT
Tensor-Tensor Multiplication (Einstein notation)
SA×SA×...SA SB×SB×...SB Let the input tensors A ∈ R 1 2 rA , and B ∈ R 1 2 rB SC×SC×...SC update the output tensor C ∈ R 1 2 rC :
C C ← αA A B B + βC C . Π (Im∪In) Π (Im∪Ik ) Π (In∪Ik ) Π (Im∪In)
These index sets Im, In and Ik are critical Im := {m1, m2, ..., mγ }: free indices of A In := {n1, n2, ..., nζ }: free indices of B Ik := {k1, k2, ..., kξ}: contracted indices
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 6 / 19 GETT
1 //N-Loops 2 for n1 = 1 : Sn1 3 //... remainingN-loops omitted... 4 for nζ = 1 : Snζ 5 //M-Loops 6 for m1 = 1 : Sm1 7 //... remainingM-loops omitted... 8 for mγ = 1 : Smγ 9 tmp = 0 10 //K-Loops(contracted) 11 for k1 = 1 : Sk1 12 //... remainingK-loops omitted... 13 for kξ = 1 : Skξ 14 tmp += A A B B Π (m1,...,mγ ,k1,...,kξ) Π (k1,...,kξ,n1,...,nζ ) 15 // update C 16 C C = α tmp + βC C Π (m1,...,mγ ,n1,...,nζ ) Π (m1,...,mγ ,n1,...,nζ )
Naive GETT.
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 7 / 19 GETT
1 //N-Loops 2 for n1 = 1 : Sn1 3 //... remainingN-loops omitted... 4 for nζ = 1 : Snζ 5 //M-Loops 6 for m1 = 1 : Sm1 7 //... remainingM-loops omitted... 8 for mγ = 1 : Smγ 9 tmp = 0 10 //K-Loops(contracted) 11 for k1 = 1 : Sk1 12 //... remainingK-loops omitted... 13 for kξ = 1 : Skξ 14 tmp += A A B B Π (m1,...,mγ ,k1,...,kξ) Π (k1,...,kξ,n1,...,nζ ) 15 // update C 16 C C = α tmp + βC C Π (m1,...,mγ ,n1,...,nζ ) Π (m1,...,mγ ,n1,...,nζ )
Naive GETT.
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 7 / 19 GETT
1 //N-Loops 2 for n1 = 1 : Sn1 3 //... remainingN-loops omitted... 4 for nζ = 1 : Snζ 5 //M-Loops 6 for m1 = 1 : Sm1 7 //... remainingM-loops omitted... 8 for mγ = 1 : Smγ 9 tmp = 0 10 //K-Loops(contracted) 11 for k1 = 1 : Sk1 12 //... remainingK-loops omitted... 13 for kξ = 1 : Skξ 14 tmp += A A B B Π (m1,...,mγ ,k1,...,kξ) Π (k1,...,kξ,n1,...,nζ ) 15 // update C 16 C C = α tmp + βC C Π (m1,...,mγ ,n1,...,nζ ) Π (m1,...,mγ ,n1,...,nζ )
Naive GETT.
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 7 / 19 GETT
1 //N-Loops 2 for n1 = 1 : Sn1 3 //... remainingN-loops omitted... 4 for nζ = 1 : Snζ 5 //M-Loops 6 for m1 = 1 : Sm1 7 //... remainingM-loops omitted... 8 for mγ = 1 : Smγ 9 tmp = 0 10 //K-Loops(contracted) 11 for k1 = 1 : Sk1 12 //... remainingK-loops omitted... 13 for kξ = 1 : Skξ 14 tmp += A A B B Π (m1,...,mγ ,k1,...,kξ) Π (k1,...,kξ,n1,...,nζ ) 15 // update C 16 C C = α tmp + βC C Π (m1,...,mγ ,n1,...,nζ ) Π (m1,...,mγ ,n1,...,nζ )
Naive GETT.
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 7 / 19 GETT
1 //N-Loop 2 for n = 1 : nc : SIn 3 //K-Loop(contracted) 4 for k = 1 : kc : SIk 5 Bb = identify_subtensor(B, n, k) 6 // pack Bb into Be 7 Be = packB (Bb) 8 //M-Loop 9 for m = 1 : mc : SIm 10 Ab = identify_subtensor(A, m, k) 11 // pack Ab into Ae 12 Ae = packA (Ab) 13 Cb = identify_subtensor(C, m, n) 14 // compute matrix-matrix product of AeBe 15 macroKernel(Ae, Be, Cb, α, β)
High-performance GETT.
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 8 / 19 GETT
1 //N-Loop 2 for n = 1 : nc : SIn 3 //K-Loop(contracted) 4 for k = 1 : kc : SIk 5 Bb = identify_subtensor(B, n, k) 6 // pack Bb into Be 7 Be = packB (Bb) 8 //M-Loop 9 for m = 1 : mc : SIm 10 Ab = identify_subtensor(A, m, k) 11 // pack Ab into Ae 12 Ae = packA (Ab) 13 Cb = identify_subtensor(C, m, n) 14 // compute matrix-matrix product of AeBe 15 macroKernel(Ae, Be, Cb, α, β)
High-performance GETT.
Key Idea
Pack-and-transpose while moving data into the caches
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 8 / 19 GETT
Cb = A × e Be 6 macro-kernel pack pack update 5 3 6 nc kc nc
mc Bb kc 4 4 2 m1 Cb mc m1 Ab k1
4 m2 2 k1 1 n1 4 m2 1 n1
Cm1,n1,m2 = Am1,m2,k1 × Bk1,n1
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 9 / 19 Written in AVX2 intrinsics
GETT: Macro- /Micro-Kernel
nr
nr mr += mr × update C eme1,ne1 micro-kernel nc
kc mr kc += mc ×
C(m , n , m , n ) Ae nr Be b e1 e1 e2 e2 m1,ke1,m2 n1,ke1,n2 macro-kernel e e e e
Blocking for L3, L2, L1 cache as well as registers
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 10 / 19 GETT: Macro- /Micro-Kernel
nr
nr mr += mr × update C eme1,ne1 micro-kernel nc
kc mr kc += mc ×
C(m , n , m , n ) Ae nr Be b e1 e1 e2 e2 m1,ke1,m2 n1,ke1,n2 macro-kernel e e e e
Blocking for L3, L2, L1 cache as well as registers Written in AVX2 intrinsics
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 10 / 19 ⇒ Efficient packing routines
Preserve stride-1 index
Packing via Tensor Transpositions
k
? m m1 m k 2 A bm1,k,m2 Ae(m1,m2),k
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 11 / 19 Packing via Tensor Transpositions
k
? m m1 m k 2 A TTC bm1,k,m2 2 Ae ≡ (m1,m2),k
m1 k m2
Aem1,m2,k
Preserve stride-1 index ⇒ Efficient packing routines 2Paul Springer, Jeff R. Hammond, and Paolo Bientinesi. “TTC: A high-performance Compiler for Tensor Transpositions”. In: TOMS, in review (). Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 11 / 19 GETT: Summary
Blocking for caches Blocking for registers Explicitly vectorized Use TTC to generate high-performance packing routines Exploits full cache line (avoids non-stride-one memory accesses) Explore large search-space: Different GEMM-variants (e.g., panel-matrix, matrix-panel) Different permutations Different values for mc, nc and kc Prune the search space via a performance model
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 12 / 19 Tensor Contraction Code Generator (TCCG)
Solution Yes TC, sizes Merge indices already contraction.hpp known?
Generate candidates No Store fastest candidate LoG TTGT GETT
Yes Time candidates
apply perf. cost okay More can- No Add candidate to list Compile candidates model didates?
cost too high Figure: Schematic overview of TCCG.
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 13 / 19 Performance
System: Intel Xeon E5-2680 v3 CPU (Haswell) Single core Turbo Boost: disabled Compiler: icpc 16.0.1 20151021 Benchmark Collection of 48 TCs Compiled from four publications Each TC is at least 200 MiB Correctness checked against naive loop-based implementation
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 14 / 19 TTGT good in compute-bound regime TTGT bad in bandwidth-bound regime TTGT faster than CTF everywhere.
Performance
80 TTGT CTF 70 60 50 40
GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 15 / 19 Performance
80 TTGT CTF 70 60 50 40
GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef
TTGT good in compute-bound regime TTGT bad in bandwidth-bound regime TTGT faster than CTF everywhere.
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 15 / 19 Performance
80 GETT LoG TTGT 70 60 50 40
GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef
GETT excels in bandwidth-bound regime. GETT slightly lags behind in compute-bound regime.
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 16 / 19 Performance: i1ji2-i1ki2-jk
80 GETT LoG TTGT GEMM 70 60 50 40
GFLOPS 30 20 10 0 8 16 32 64 128 256 512 1024 SIk GETT especially good in bandwidth-bound regime GETT still attains up to 91.3% of peak floating-point performance TTGT poor in bandwidth-bound regime
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 17 / 19 Performance: i1j1i2j2-i1ki2-j1kj2
80 GETT LoG TTGT GEMM 70 60 50 40
GFLOPS 30 20 10 0 8 16 32 64 128 256 512 1024 SIk GETT especially good in bandwidth-bound regime GETT still attains up to 91.3% of peak floating-point performance TTGT poor in bandwidth-bound regime LoG performance can become arbitrarily bad GETT and TTGT barely affected by higher dimensions
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 17 / 19 Speedup
4.0 11.7 12.4 4.1 9.0 6.9 4.2 4.5 3.7
3.5 3.4 3.3 3.1
3.0 2.9 2.8
2.5 2.3 2.0 1.7 1.7 1.7 1.7
Speedup 1.5 1.4 1.2 1.1 1.1 1.1 1.0 1.0 0.5 0.0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef
Speedup varies between 1.0× and 12.4×
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 18 / 19 Conclusion and Future Work
Conclusion GETT: a systematic way to reduce an arbitrary TC to a GEMM-like macro-kernel GETT exhibits high performance across a wide range of TCs It especially excels in the bandwidth-bound regime It attains up to 91.3% of peak floating-point performance A survey of different approaches to TCs has been presented Give it a try: https://github.com/HPAC/tccg
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 19 / 19 Conclusion and Future Work
Conclusion GETT: a systematic way to reduce an arbitrary TC to a GEMM-like macro-kernel GETT exhibits high performance across a wide range of TCs It especially excels in the bandwidth-bound regime It attains up to 91.3% of peak floating-point performance A survey of different approaches to TCs has been presented Give it a try: https://github.com/HPAC/tccg
Future Work Assess TCCG’s performance on KNL Add parallelism Turn TCCG into a C library?
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 19 / 19 Conclusion and Future Work
Conclusion GETT: a systematic way to reduce an arbitrary TC to a GEMM-like macro-kernel GETT exhibits high performance across a wide range of TCs It especially excels in the bandwidth-bound regime It attains up to 91.3% of peak floating-point performance Thank you for your attention. A survey of different approaches to TCs has been presented Give it a try: https://github.com/HPAC/tccg
Future Work Assess TCCG’s performance on KNL Add parallelism Turn TCCG into a C library?
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 19 / 19 Performance - SP
80 GETT LoG TTGT 70 60 50 40
GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef GETT excels in bandwidth-bound regime. GETT slightly lags behind in compute-bound regime. GETT attains min/avg/max performance of GEMM: SP: 72.4% / 98.1% / 141.4% DP: 60.8% / 97.0% / 132.9%
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 1 / 9 Performance - DP
40 TTGT CTF TT 35 30 25 20
GFLOPS 15 10 5 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef
TTGT faster than CTF everywhere. TTGT good in compute-bound regime TTGT bad in bandwidth-bound regime
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 2 / 9 Performance - DP
40 GETT LoG TTGT 35 30 25 20
GFLOPS 15 10 5 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef GETT excels in bandwidth-bound regime. GETT slightly lags behind in compute-bound regime. GETT attains min/avg/max performance of GEMM: SP: 72.4% / 98.1% / 141.4% DP: 60.8% / 97.0% / 132.9%
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 3 / 9 Performance: i1j1i2j2-i1ki2-j1kj2 - DP
40 GETT LoG TTGT GEMM 35 30 25 20
GFLOPS 15 10 5 0 8 16 32 64 128 256 512 1024 SIk GETT especially good in bandwidth-bound regime GETT still attains up to 91.3% of peak floating-point performance TTGT poor in bandwidth-bound regime LoG performance can become arbitrarily bad GETT and TTGT barely affected by higher dimensions
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 4 / 9 Speedup
4.0 11.7 12.4 4.1 9.0 6.9 4.2 4.5 3.7
3.5 3.4 3.3 3.1
3.0 2.9 2.8
2.5 2.3 2.0 1.7 1.7 1.7 1.7
Speedup 1.5 1.4 1.2 1.1 1.1 1.1 1.0 1.0 0.5 0.0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef
(a) Single-Precision. (b) Double-Precision.
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 9 GETT Performance Model
1.0 1.0
0.8 0.8
0.6 0.6
0.4 0.4 Efficiency 1 Efficiency 1 4 4 0.2 8 0.2 8 16 16 32 32 0.0 0.0 cb cb be be ec ce ec ce bf bf cf cf bd db bd bd db bd dc dc fd fd ------dcb dbc dcb dbc ebd ebd ------ebcd ebcd dfce dfce fbec fbec abed abed geac geac aecd aecd gfbc gfbc fdec fdec gfab gfab gfac gfac ac ac ------bda dca acd adc bda dca acd adc ------dbea deca ebad ea eb ec dbea deca ebad ea eb ec cad acd cad acd ab ab ------adec adec efbad efcad efbad efcad ecbfa ecbfa ------dfgb dfgb aebf aebf - - dega degb degc dega degb degc aebf eafd aebf eafd abc abc ab ab abc abc abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcd abcd abcd abcd abcd abc abcd abcd abcde abcde abcde abcde abcde abcd abcd abcde abcd abcd abcdef abcdef abcdef abcdef abcdef abcdef abcdef abcdef
(a) Single-Precision. (b) Double-Precision.
Figure: Limit the GETT candidates to 1, 4, 8, 16 or 32, respectively.
Average performance without search: 90.7% / 92.3% Average performance of the four best candidates: 98.3% / 97.2%
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 6 / 9 Tensor Contraction Code Generator (TCCG)
C[a,b,i,j] = A[i,m,a] * B[m,j,b] a = 24 b = 24 i = 24 j = 24 m = 24
Figure: Exemplary input file for TCCG.
Argument Description --floatType=[s,d] data type --maxWorkspace=
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 7 / 9 Transpose-Transpose-GEMM-Transpose
1 A m k ← A A // unfold A Π (Im),Π (Ik ) Π (Im∪Ik ) 2 B k n ← B B // unfold B Π (Ik ),Π (In) Π (In∪Ik ) 3 XΠm(I ),Πn(I ) ← op(A) m k × op(B) k n // contract A and B viaaGEMM m n Π (Im),Π (Ik ) Π (Ik ),Π (In) 4 C ← X m n // fold X ΠC (Im∪In) Π (Im),Π (In)
TTGT pseudo code for a general tensor contraction
C A B C CΠ (Im∪In) = AΠ (Im∪Ik )BΠ (In∪Ik ) + CΠ (Im∪In).
m n k Π (Im), Π (In) and Π (Ik ) represent arbitrary, but fixed, permutations Transpositions account for pure overhead Requires additional memory Good if GEMM dominates performance (i.e., compute-bound) Bad if transpositions dominate performance (i.e., bandwidth-bound)
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 8 / 9 Loop-over-GEMM (LoG)
Loop over 2D slices of the tensors Contract these 2D slices via GEMM Disadvantages
Advantages Performance can become arbitrarily poor Exploits GEMM’s Sometimes not applicable high-performance (if stride-one accesses are No additional memory required)
Cm1,n1,m2,n2 = Am1,m2,k1 Bk1,n1,n2
for m2 = 0: M2 for n2 = 0: N2 GEMM(&A[m2 ∗ M1 ],&B[n2 ∗ K1 ∗ N1 ],&C[ m2 ∗ M1 ∗ n1 + n2 ∗ m1 ∗ N1 ∗ M2 ])
Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 9 / 9