Quick viewing(Text Mode)

Design of a High-Performance GEMM-Like Tensor-Tensor Multiplication

Design of a High-Performance GEMM-like -Tensor Multiplication

Paul Springer and Paolo Bientinesi

Aachen Institute for Advanced Study in Computational Engineering Science

Austin, Sep. 20th 2016

Paul Springer (AICES) Code Generator Sep. 20th 2016 1 / 19 Outline

1 Introduction

2 GEMM-like Tensor-Tensor Multiplication

3 Tensor Contraction Code Generator

4 Performance

5 Conclusion and Future Work

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 2 / 19 Nested loops -Transpose-GEMM-Transpose (TTGT) Loops over GEMM (LoG)

Akin to a high-performance GEMM implementation Adopts the BLIS methodology: Breaking through the BLAS layer

combine GETT, TTGT and LoG into a unified tool

Essentially three approaches:

We propose a novel approach: GETT1

Tensor Contraction Code Generator (TCCG)

Introduction

Tensors can be thought of as higher dimensional matrices Tensor contraction can be thought of as higher dimensional GEMMs

1Paul Springer and Paolo Bientinesi. “Design of a high-performance GEMM-like Tensor-Tensor Multiplication”. In: TOMS, in review (). Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 3 / 19 Akin to a high-performance GEMM implementation Adopts the BLIS methodology: Breaking through the BLAS layer

combine GETT, TTGT and LoG into a unified tool

We propose a novel approach: GETT1

Tensor Contraction Code Generator (TCCG)

Introduction

Tensors can be thought of as higher dimensional matrices Tensor contraction can be thought of as higher dimensional GEMMs Essentially three approaches: Nested loops Transpose-Transpose-GEMM-Transpose (TTGT) Loops over GEMM (LoG)

1Paul Springer and Paolo Bientinesi. “Design of a high-performance GEMM-like Tensor-Tensor Multiplication”. In: TOMS, in review (). Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 3 / 19 combine GETT, TTGT and LoG into a unified tool

Tensor Contraction Code Generator (TCCG)

Introduction

Tensors can be thought of as higher dimensional matrices Tensor contraction can be thought of as higher dimensional GEMMs Essentially three approaches: Nested loops Transpose-Transpose-GEMM-Transpose (TTGT) Loops over GEMM (LoG) We propose a novel approach: GETT1 Akin to a high-performance GEMM implementation Adopts the BLIS methodology: Breaking through the BLAS layer

1Paul Springer and Paolo Bientinesi. “Design of a high-performance GEMM-like Tensor-Tensor Multiplication”. In: TOMS, in review (). Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 3 / 19 Introduction

Tensors can be thought of as higher dimensional matrices Tensor contraction can be thought of as higher dimensional GEMMs Essentially three approaches: Nested loops Transpose-Transpose-GEMM-Transpose (TTGT) Loops over GEMM (LoG) We propose a novel approach: GETT1 Akin to a high-performance GEMM implementation Adopts the BLIS methodology: Breaking through the BLAS layer Tensor Contraction Code Generator (TCCG) combine GETT, TTGT and LoG into a unified tool

1Paul Springer and Paolo Bientinesi. “Design of a high-performance GEMM-like Tensor-Tensor Multiplication”. In: TOMS, in review (). Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 3 / 19 -

Matrix-Matrix Multiplication

M×K K×N M×N A ∈ R , B ∈ R and C ∈ R be 2D tensors: P Cm,n ← k Am,k Bk,n

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 4 / 19 Matrix-Matrix Multiplication

Matrix-Matrix Multiplication (Einstein notation)

M×K K×N M×N A ∈ R , B ∈ R and C ∈ R be 2D tensors: Cm,n ← Am,k Bk,n

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 4 / 19 Matrix-Matrix Multiplication

Matrix-Matrix Multiplication (Einstein notation)

M×K K×N M×N A ∈ R , B ∈ R and C ∈ R be 2D tensors: Cm,n ← Am,k Bk,n

//N-Loop for j = 0 : N − 1 //M-Loop for i = 0 : M − 1 tmp = 0 //K-Loop(contracted) for k = 0 : K − 1 tmp += Ai,k Bk,j // updateC Ci,j = α tmp + βCi,j

Naive GEMM. Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 4 / 19 Matrix-Matrix Multiplication

Matrix-Matrix Multiplication (Einstein notation)

M×K K×N M×N A ∈ R , B ∈ R and C ∈ R be 2D tensors: Cm,n ← Am,k Bk,n

//N-Loop for n = 0 : nc : N − 1 //K-Loop(contracted) for k = 0 : kc : K − 1 Bb = identify_submatrix(B , n, k ) // pack Bb into Be kc×nc Be = packB (Bb)// Be ∈ R //M-Loop //N-Loop for m = 0 : mc : M − 1 for j = 0 : N − 1 //M-Loop Ab = identify_submatrix(A, m, k ) for i = 0 : M − 1 // pack Ab into Ae tmp = 0 mc×kc //K-Loop(contracted) Ae = packA (Ab)// Ae ∈ R for k = 0 : K − 1 Cb = identify_submatrix(C , m, n) tmp += Ai,k Bk,j // updateC // matrix-matrix : AeBe Ci,j = α tmp + βCi,j macroKernel(Ae, Be, Cb, α, β)

Naive GEMM. High-performance GEMM. Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 4 / 19 Cm1,m2,n ← Am1,m2,k Bk,n

Cm1,n,m2 ← Am1,m2,k Bk,n

Cm1,n1,n2,m2 ← Am1,m2,k Bn2,k,n1

Cm1,n1,n2,m2 ← Am1,k1,m2,k2 Bk2,n2,k1,n1

Cm1,n1,n2,m2,n3 ← Am1,k1,m2,k2 Bn3,k2,n2,k1,n1 ...

Tensor Contractions

Tensor contraction examples:

Cm,n ← Am,k Bk,n

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 19 Cm1,n,m2 ← Am1,m2,k Bk,n

Cm1,n1,n2,m2 ← Am1,m2,k Bn2,k,n1

Cm1,n1,n2,m2 ← Am1,k1,m2,k2 Bk2,n2,k1,n1

Cm1,n1,n2,m2,n3 ← Am1,k1,m2,k2 Bn3,k2,n2,k1,n1 ...

Tensor Contractions

Tensor contraction examples:

Cm,n ← Am,k Bk,n

Cm1,m2,n ← Am1,m2,k Bk,n

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 19 Cm1,n1,n2,m2 ← Am1,m2,k Bn2,k,n1

Cm1,n1,n2,m2 ← Am1,k1,m2,k2 Bk2,n2,k1,n1

Cm1,n1,n2,m2,n3 ← Am1,k1,m2,k2 Bn3,k2,n2,k1,n1 ...

Tensor Contractions

Tensor contraction examples:

Cm,n ← Am,k Bk,n

Cm1,m2,n ← Am1,m2,k Bk,n

Cm1,n,m2 ← Am1,m2,k Bk,n

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 19 Cm1,n1,n2,m2 ← Am1,k1,m2,k2 Bk2,n2,k1,n1

Cm1,n1,n2,m2,n3 ← Am1,k1,m2,k2 Bn3,k2,n2,k1,n1 ...

Tensor Contractions

Tensor contraction examples:

Cm,n ← Am,k Bk,n

Cm1,m2,n ← Am1,m2,k Bk,n

Cm1,n,m2 ← Am1,m2,k Bk,n

Cm1,n1,n2,m2 ← Am1,m2,k Bn2,k,n1

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 19 Cm1,n1,n2,m2,n3 ← Am1,k1,m2,k2 Bn3,k2,n2,k1,n1 ...

Tensor Contractions

Tensor contraction examples:

Cm,n ← Am,k Bk,n

Cm1,m2,n ← Am1,m2,k Bk,n

Cm1,n,m2 ← Am1,m2,k Bk,n

Cm1,n1,n2,m2 ← Am1,m2,k Bn2,k,n1

Cm1,n1,n2,m2 ← Am1,k1,m2,k2 Bk2,n2,k1,n1

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 19 Tensor Contractions

Tensor contraction examples:

Cm,n ← Am,k Bk,n

Cm1,m2,n ← Am1,m2,k Bk,n

Cm1,n,m2 ← Am1,m2,k Bk,n

Cm1,n1,n2,m2 ← Am1,m2,k Bn2,k,n1

Cm1,n1,n2,m2 ← Am1,k1,m2,k2 Bk2,n2,k1,n1

Cm1,n1,n2,m2,n3 ← Am1,k1,m2,k2 Bn3,k2,n2,k1,n1 ...

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 19 Tensor Contractions

Tensor contraction examples:

Cm,n ← Am,k Bk,n

Cm1,m2,n ← Am1,m2,k Bk,n

Cm1,n,m2 ← Am1,m2,k Bk,n

Cm1,n1,n2,m2 ← Am1,m2,k Bn2,k,n1

Cm1,n1,n2,m2 ← Am1,k1,m2,k2 Bk2,n2,k1,n1

Cm1,n1,n2,m2,n3 ← Am1,k1,m2,k2 Bn3,k2,n2,k1,n1 ...

⇒ Quite similar to GEMM.

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 19 Im := {m1, m2, ..., mγ }: free indices of A In := {n1, n2, ..., nζ }: free indices of B Ik := {k1, k2, ..., kξ}: contracted indices

These index sets Im, In and Ik are critical

GETT

Tensor-Tensor Multiplication (Einstein notation)

SA×SA×...SA SB×SB×...SB Let the input tensors A ∈ R 1 2 rA , and B ∈ R 1 2 rB SC×SC×...SC update the output tensor C ∈ R 1 2 rC :

C C ← αA A B B + βC C . Π (Im∪In) Π (Im∪Ik ) Π (In∪Ik ) Π (Im∪In)

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 6 / 19 GETT

Tensor-Tensor Multiplication (Einstein notation)

SA×SA×...SA SB×SB×...SB Let the input tensors A ∈ R 1 2 rA , and B ∈ R 1 2 rB SC×SC×...SC update the output tensor C ∈ R 1 2 rC :

C C ← αA A B B + βC C . Π (Im∪In) Π (Im∪Ik ) Π (In∪Ik ) Π (Im∪In)

These index sets Im, In and Ik are critical Im := {m1, m2, ..., mγ }: free indices of A In := {n1, n2, ..., nζ }: free indices of B Ik := {k1, k2, ..., kξ}: contracted indices

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 6 / 19 GETT

1 //N-Loops 2 for n1 = 1 : Sn1 3 //... remainingN-loops omitted... 4 for nζ = 1 : Snζ 5 //M-Loops 6 for m1 = 1 : Sm1 7 //... remainingM-loops omitted... 8 for mγ = 1 : Smγ 9 tmp = 0 10 //K-Loops(contracted) 11 for k1 = 1 : Sk1 12 //... remainingK-loops omitted... 13 for kξ = 1 : Skξ 14 tmp += A A B B Π (m1,...,mγ ,k1,...,kξ) Π (k1,...,kξ,n1,...,nζ ) 15 // update C 16 C C = α tmp + βC C Π (m1,...,mγ ,n1,...,nζ ) Π (m1,...,mγ ,n1,...,nζ )

Naive GETT.

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 7 / 19 GETT

1 //N-Loops 2 for n1 = 1 : Sn1 3 //... remainingN-loops omitted... 4 for nζ = 1 : Snζ 5 //M-Loops 6 for m1 = 1 : Sm1 7 //... remainingM-loops omitted... 8 for mγ = 1 : Smγ 9 tmp = 0 10 //K-Loops(contracted) 11 for k1 = 1 : Sk1 12 //... remainingK-loops omitted... 13 for kξ = 1 : Skξ 14 tmp += A A B B Π (m1,...,mγ ,k1,...,kξ) Π (k1,...,kξ,n1,...,nζ ) 15 // update C 16 C C = α tmp + βC C Π (m1,...,mγ ,n1,...,nζ ) Π (m1,...,mγ ,n1,...,nζ )

Naive GETT.

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 7 / 19 GETT

1 //N-Loops 2 for n1 = 1 : Sn1 3 //... remainingN-loops omitted... 4 for nζ = 1 : Snζ 5 //M-Loops 6 for m1 = 1 : Sm1 7 //... remainingM-loops omitted... 8 for mγ = 1 : Smγ 9 tmp = 0 10 //K-Loops(contracted) 11 for k1 = 1 : Sk1 12 //... remainingK-loops omitted... 13 for kξ = 1 : Skξ 14 tmp += A A B B Π (m1,...,mγ ,k1,...,kξ) Π (k1,...,kξ,n1,...,nζ ) 15 // update C 16 C C = α tmp + βC C Π (m1,...,mγ ,n1,...,nζ ) Π (m1,...,mγ ,n1,...,nζ )

Naive GETT.

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 7 / 19 GETT

1 //N-Loops 2 for n1 = 1 : Sn1 3 //... remainingN-loops omitted... 4 for nζ = 1 : Snζ 5 //M-Loops 6 for m1 = 1 : Sm1 7 //... remainingM-loops omitted... 8 for mγ = 1 : Smγ 9 tmp = 0 10 //K-Loops(contracted) 11 for k1 = 1 : Sk1 12 //... remainingK-loops omitted... 13 for kξ = 1 : Skξ 14 tmp += A A B B Π (m1,...,mγ ,k1,...,kξ) Π (k1,...,kξ,n1,...,nζ ) 15 // update C 16 C C = α tmp + βC C Π (m1,...,mγ ,n1,...,nζ ) Π (m1,...,mγ ,n1,...,nζ )

Naive GETT.

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 7 / 19 GETT

1 //N-Loop 2 for n = 1 : nc : SIn 3 //K-Loop(contracted) 4 for k = 1 : kc : SIk 5 Bb = identify_subtensor(B, n, k) 6 // pack Bb into Be 7 Be = packB (Bb) 8 //M-Loop 9 for m = 1 : mc : SIm 10 Ab = identify_subtensor(A, m, k) 11 // pack Ab into Ae 12 Ae = packA (Ab) 13 Cb = identify_subtensor(C, m, n) 14 // compute matrix-matrix product of AeBe 15 macroKernel(Ae, Be, Cb, α, β)

High-performance GETT.

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 8 / 19 GETT

1 //N-Loop 2 for n = 1 : nc : SIn 3 //K-Loop(contracted) 4 for k = 1 : kc : SIk 5 Bb = identify_subtensor(B, n, k) 6 // pack Bb into Be 7 Be = packB (Bb) 8 //M-Loop 9 for m = 1 : mc : SIm 10 Ab = identify_subtensor(A, m, k) 11 // pack Ab into Ae 12 Ae = packA (Ab) 13 Cb = identify_subtensor(C, m, n) 14 // compute matrix-matrix product of AeBe 15 macroKernel(Ae, Be, Cb, α, β)

High-performance GETT.

Key Idea

Pack-and-transpose while moving data into the caches

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 8 / 19 GETT

Cb = A × e Be 6 macro-kernel pack pack update 5 3 6 nc kc nc

mc Bb kc 4 4 2 m1 Cb mc m1 Ab k1

4 m2 2 k1 1 n1 4 m2 1 n1

Cm1,n1,m2 = Am1,m2,k1 × Bk1,n1

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 9 / 19 Written in AVX2 intrinsics

GETT: Macro- /Micro-Kernel

nr

nr mr += mr × update C eme1,ne1 micro-kernel nc

kc mr kc += mc ×

C(m , n , m , n ) Ae nr Be b e1 e1 e2 e2 m1,ke1,m2 n1,ke1,n2 macro-kernel e e e e

Blocking for L3, L2, L1 cache as well as registers

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 10 / 19 GETT: Macro- /Micro-Kernel

nr

nr mr += mr × update C eme1,ne1 micro-kernel nc

kc mr kc += mc ×

C(m , n , m , n ) Ae nr Be b e1 e1 e2 e2 m1,ke1,m2 n1,ke1,n2 macro-kernel e e e e

Blocking for L3, L2, L1 cache as well as registers Written in AVX2 intrinsics

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 10 / 19 ⇒ Efficient packing routines

Preserve stride-1 index

Packing via Tensor Transpositions

k

? m m1 m k 2 A bm1,k,m2 Ae(m1,m2),k

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 11 / 19 Packing via Tensor Transpositions

k

? m m1 m k 2 A TTC bm1,k,m2 2 Ae ≡ (m1,m2),k

m1 k m2

Aem1,m2,k

Preserve stride-1 index ⇒ Efficient packing routines 2Paul Springer, Jeff R. Hammond, and Paolo Bientinesi. “TTC: A high-performance Compiler for Tensor Transpositions”. In: TOMS, in review (). Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 11 / 19 GETT: Summary

Blocking for caches Blocking for registers Explicitly vectorized Use TTC to generate high-performance packing routines Exploits full cache line (avoids non-stride-one memory accesses) Explore large search-space: Different GEMM-variants (e.g., panel-matrix, matrix-panel) Different permutations Different values for mc, nc and kc Prune the search space via a performance model

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 12 / 19 Tensor Contraction Code Generator (TCCG)

Solution Yes TC, sizes Merge indices already contraction.hpp known?

Generate candidates No Store fastest candidate LoG TTGT GETT

Yes Time candidates

apply perf. cost okay More can- No Add candidate to list Compile candidates model didates?

cost too high Figure: Schematic overview of TCCG.

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 13 / 19 Performance

System: Intel Xeon E5-2680 v3 CPU (Haswell) Single core Turbo Boost: disabled Compiler: icpc 16.0.1 20151021 Benchmark Collection of 48 TCs Compiled from four publications Each TC is at least 200 MiB Correctness checked against naive loop-based implementation

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 14 / 19 TTGT good in compute-bound regime TTGT bad in bandwidth-bound regime TTGT faster than CTF everywhere.

Performance

80 TTGT CTF 70 60 50 40

GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 15 / 19 Performance

80 TTGT CTF 70 60 50 40

GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

TTGT good in compute-bound regime TTGT bad in bandwidth-bound regime TTGT faster than CTF everywhere.

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 15 / 19 Performance

80 GETT LoG TTGT 70 60 50 40

GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

GETT excels in bandwidth-bound regime. GETT slightly lags behind in compute-bound regime.

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 16 / 19 Performance: i1ji2-i1ki2-jk

80 GETT LoG TTGT GEMM 70 60 50 40

GFLOPS 30 20 10 0 8 16 32 64 128 256 512 1024 SIk GETT especially good in bandwidth-bound regime GETT still attains up to 91.3% of peak floating-point performance TTGT poor in bandwidth-bound regime

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 17 / 19 Performance: i1j1i2j2-i1ki2-j1kj2

80 GETT LoG TTGT GEMM 70 60 50 40

GFLOPS 30 20 10 0 8 16 32 64 128 256 512 1024 SIk GETT especially good in bandwidth-bound regime GETT still attains up to 91.3% of peak floating-point performance TTGT poor in bandwidth-bound regime LoG performance can become arbitrarily bad GETT and TTGT barely affected by higher

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 17 / 19 Speedup

4.0 11.7 12.4 4.1 9.0 6.9 4.2 4.5 3.7

3.5 3.4 3.3 3.1

3.0 2.9 2.8

2.5 2.3 2.0 1.7 1.7 1.7 1.7

Speedup 1.5 1.4 1.2 1.1 1.1 1.1 1.0 1.0 0.5 0.0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

Speedup varies between 1.0× and 12.4×

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 18 / 19 Conclusion and Future Work

Conclusion GETT: a systematic way to reduce an arbitrary TC to a GEMM-like macro-kernel GETT exhibits high performance across a wide range of TCs It especially excels in the bandwidth-bound regime It attains up to 91.3% of peak floating-point performance A survey of different approaches to TCs has been presented Give it a try: https://github.com/HPAC/tccg

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 19 / 19 Conclusion and Future Work

Conclusion GETT: a systematic way to reduce an arbitrary TC to a GEMM-like macro-kernel GETT exhibits high performance across a wide range of TCs It especially excels in the bandwidth-bound regime It attains up to 91.3% of peak floating-point performance A survey of different approaches to TCs has been presented Give it a try: https://github.com/HPAC/tccg

Future Work Assess TCCG’s performance on KNL Add parallelism Turn TCCG into a C library?

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 19 / 19 Conclusion and Future Work

Conclusion GETT: a systematic way to reduce an arbitrary TC to a GEMM-like macro-kernel GETT exhibits high performance across a wide range of TCs It especially excels in the bandwidth-bound regime It attains up to 91.3% of peak floating-point performance Thank you for your attention. A survey of different approaches to TCs has been presented Give it a try: https://github.com/HPAC/tccg

Future Work Assess TCCG’s performance on KNL Add parallelism Turn TCCG into a C library?

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 19 / 19 Performance - SP

80 GETT LoG TTGT 70 60 50 40

GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef GETT excels in bandwidth-bound regime. GETT slightly lags behind in compute-bound regime. GETT attains min/avg/max performance of GEMM: SP: 72.4% / 98.1% / 141.4% DP: 60.8% / 97.0% / 132.9%

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 1 / 9 Performance - DP

40 TTGT CTF TT 35 30 25 20

GFLOPS 15 10 5 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

TTGT faster than CTF everywhere. TTGT good in compute-bound regime TTGT bad in bandwidth-bound regime

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 2 / 9 Performance - DP

40 GETT LoG TTGT 35 30 25 20

GFLOPS 15 10 5 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef GETT excels in bandwidth-bound regime. GETT slightly lags behind in compute-bound regime. GETT attains min/avg/max performance of GEMM: SP: 72.4% / 98.1% / 141.4% DP: 60.8% / 97.0% / 132.9%

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 3 / 9 Performance: i1j1i2j2-i1ki2-j1kj2 - DP

40 GETT LoG TTGT GEMM 35 30 25 20

GFLOPS 15 10 5 0 8 16 32 64 128 256 512 1024 SIk GETT especially good in bandwidth-bound regime GETT still attains up to 91.3% of peak floating-point performance TTGT poor in bandwidth-bound regime LoG performance can become arbitrarily bad GETT and TTGT barely affected by higher dimensions

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 4 / 9 Speedup

4.0 11.7 12.4 4.1 9.0 6.9 4.2 4.5 3.7

3.5 3.4 3.3 3.1

3.0 2.9 2.8

2.5 2.3 2.0 1.7 1.7 1.7 1.7

Speedup 1.5 1.4 1.2 1.1 1.1 1.1 1.0 1.0 0.5 0.0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

(a) Single-Precision. (b) Double-Precision.

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 5 / 9 GETT Performance Model

1.0 1.0

0.8 0.8

0.6 0.6

0.4 0.4 Efficiency 1 Efficiency 1 4 4 0.2 8 0.2 8 16 16 32 32 0.0 0.0 cb cb be be ec ce ec ce bf bf cf cf bd db bd bd db bd dc dc fd fd ------dcb dbc dcb dbc ebd ebd ------ebcd ebcd dfce dfce fbec fbec abed abed geac geac aecd aecd gfbc gfbc fdec fdec gfab gfab gfac gfac ac ac ------bda dca acd adc bda dca acd adc ------dbea deca ebad ea eb ec dbea deca ebad ea eb ec cad acd cad acd ab ab ------adec adec efbad efcad efbad efcad ecbfa ecbfa ------dfgb dfgb aebf aebf - - dega degb degc dega degb degc aebf eafd aebf eafd abc abc ab ab abc abc abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcd abcd abcd abcd abcd abc abcd abcd abcde abcde abcde abcde abcde abcd abcd abcde abcd abcd abcdef abcdef abcdef abcdef abcdef abcdef abcdef abcdef

(a) Single-Precision. (b) Double-Precision.

Figure: Limit the GETT candidates to 1, 4, 8, 16 or 32, respectively.

Average performance without search: 90.7% / 92.3% Average performance of the four best candidates: 98.3% / 97.2%

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 6 / 9 Tensor Contraction Code Generator (TCCG)

C[a,b,i,j] = A[i,m,a] * B[m,j,b] a = 24 b = 24 i = 24 j = 24 m = 24

Figure: Exemplary input file for TCCG.

Argument Description --floatType=[s,d] data type --maxWorkspace= maximum auxiliary workspace in GB --maxImplementations= maximum #implementations --arch=[hsw,knl,cuda] selected architecture --numThreads= number of threads Table: TCCG’s command line arguments.

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 7 / 9 Transpose-Transpose-GEMM-Transpose

1 A m k ← A A // unfold A Π (Im),Π (Ik ) Π (Im∪Ik ) 2 B k n ← B B // unfold B Π (Ik ),Π (In) Π (In∪Ik ) 3 XΠm(I ),Πn(I ) ← op(A) m k × op(B) k n // contract A and B viaaGEMM m n Π (Im),Π (Ik ) Π (Ik ),Π (In) 4 C ← X m n // fold X ΠC (Im∪In) Π (Im),Π (In)

TTGT pseudo code for a general tensor contraction

C A B C CΠ (Im∪In) = AΠ (Im∪Ik )BΠ (In∪Ik ) + CΠ (Im∪In).

m n k Π (Im), Π (In) and Π (Ik ) represent arbitrary, but fixed, permutations Transpositions account for pure overhead Requires additional memory Good if GEMM dominates performance (i.e., compute-bound) Bad if transpositions dominate performance (i.e., bandwidth-bound)

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 8 / 9 Loop-over-GEMM (LoG)

Loop over 2D slices of the tensors Contract these 2D slices via GEMM Disadvantages

Advantages Performance can become arbitrarily poor Exploits GEMM’s Sometimes not applicable high-performance (if stride-one accesses are No additional memory required)

Cm1,n1,m2,n2 = Am1,m2,k1 Bk1,n1,n2

for m2 = 0: M2 for n2 = 0: N2 GEMM(&A[m2 ∗ M1 ],&B[n2 ∗ K1 ∗ N1 ],&C[ m2 ∗ M1 ∗ n1 + n2 ∗ m1 ∗ N1 ∗ M2 ])

Paul Springer (AICES) Tensor Contraction Code Generator Sep. 20th 2016 9 / 9