The Landscape of High-Performance Tensor Contractions
Paul Springer and Paolo Bientinesi
Aachen Institute for Advanced Study in Computational Engineering Science
Atlanta, Feb. 24th 2017
Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 1 / 17

Introduction
A tensor is a multidimensional array:
  - 0-order tensor: scalar $\alpha$
  - 1-order tensor: vector $A_{i_1}$
  - 2-order tensor: matrix $A_{i_1,i_2}$
  - n-order tensor: $A_{i_1,i_2,\ldots,i_n}$
Tensor contractions can be thought of as generalized GEMMs.
Three approaches to tensor contractions:
  - Nested loops
  - Loops over GEMM (LoG)
  - Transpose-Transpose-GEMM-Transpose (TTGT)
We propose a novel approach: GETT¹
  - Akin to a high-performance GEMM implementation
¹ Paul Springer et al. “Design of a high-performance GEMM-like Tensor-Tensor Multiplication”
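
The "nested loops" approach listed above can be sketched as follows. This is a minimal illustration, not code from the talk: the function name `contract_nested_loops` and the nested-list storage are assumptions, and the contraction chosen is $C_{m_1,n_1} \leftarrow A_{k_1,m_1,k_2} B_{k_2,n_1,k_1}$, which also appears on later slides.

```python
def contract_nested_loops(A, B, M1, N1, K1, K2):
    """C[m1][n1] = sum over k1, k2 of A[k1][m1][k2] * B[k2][n1][k1]."""
    C = [[0.0] * N1 for _ in range(M1)]
    for m1 in range(M1):              # free index of A
        for n1 in range(N1):          # free index of B
            acc = 0.0
            for k1 in range(K1):      # contracted indices
                for k2 in range(K2):
                    acc += A[k1][m1][k2] * B[k2][n1][k1]
            C[m1][n1] = acc
    return C


# Small illustrative input (values are arbitrary).
M1, N1, K1, K2 = 2, 3, 2, 2
A = [[[float(k1 + 2 * m1 + 3 * k2) for k2 in range(K2)]
      for m1 in range(M1)] for k1 in range(K1)]
B = [[[float(k2 - n1 + k1) for k1 in range(K1)]
      for n1 in range(N1)] for k2 in range(K2)]
C = contract_nested_loops(A, B, M1, N1, K1, K2)
```

One loop per index keeps the code trivial to write, but nothing in it is blocked for caches or vectorized, which is the gap the GEMM-based approaches try to close.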

Outline
  - Approaches to Tensor Contractions:
      - Loops over GEMM (LoG)
      - Transpose-Transpose-GEMM-Transpose (TTGT)
      - GEMM-like Tensor-Tensor Multiply (GETT)
  - Tensor Contraction Code Generator²
  - Performance Evaluation
² Source code available at: https://github.com/HPAC/tccg

Loop over GEMM (LoG)
Conceptual Idea: Identify 2D subtensors and contract them via GEMM
$C_{m_1,n_1} \leftarrow \sum_{k_1} A_{m_1,k_1} B_{k_1,n_1}$, abbreviated as $C_{m_1,n_1} \leftarrow A_{m_1,k_1} B_{k_1,n_1}$ (the sum over the repeated index $k_1$ is left implicit)
  gemm(M1, N1, K1, A[:, :], B[:, :], C[:, :])
$C_{m_1,m_2,n_1} \leftarrow A_{m_1,m_2,k_1} B_{k_1,n_1}$, with $(m_1, m_2)$ merged into a single index: $C_{(m_1,m_2),n_1} \leftarrow A_{(m_1,m_2),k_1} B_{k_1,n_1}$
  gemm(M1 × M2, N1, K1, A[:, :], B[:, :], C[:, :])
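
Merging $(m_1, m_2)$ costs nothing when the two indices are adjacent in memory. The sketch below assumes column-major storage (leftmost index fastest), which is an assumption of this example, and uses a plain reference `gemm` written here for illustration rather than a BLAS call.

```python
def gemm(M, N, K, A, B, C):
    """Reference column-major GEMM on flat lists: C += A * B."""
    for n in range(N):
        for k in range(K):
            b = B[k + n * K]
            for m in range(M):
                C[m + n * M] += A[m + k * M] * b


M1, M2, N1, K1 = 2, 3, 2, 4
# A[m1, m2, k1] is stored at offset m1 + m2*M1 + k1*M1*M2, so the same
# buffer is also a column-major (M1*M2)-by-K1 matrix.
A = [float(i % 5) for i in range(M1 * M2 * K1)]
B = [float(i % 3 - 1) for i in range(K1 * N1)]      # B[k1, n1]
C = [0.0] * (M1 * M2 * N1)                          # C[m1, m2, n1]

# One GEMM call on the flattened index, no data movement required.
gemm(M1 * M2, N1, K1, A, B, C)


def reference(m1, m2, n1):
    """Direct evaluation of the contraction for one output element."""
    return sum(A[m1 + m2 * M1 + k1 * M1 * M2] * B[k1 + n1 * K1]
               for k1 in range(K1))
```

The result buffer can be read back with the three-index formula `m1 + m2*M1 + n1*M1*M2`, as `reference` does for the inputs.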
$C_{m_1,n_1,n_2,m_2} \leftarrow A_{m_1,m_2,k_1} B_{k_1,n_2,n_1}$
  Loop over m2 and n1:
    for( m2 = 0; m2 < M2; m2++ )
      for( n1 = 0; n1 < N1; n1++ )
        gemm(M1, N2, K1, A[:, m2, :], B[:, :, n1], C[:, n1, :, m2])
  Loop over m2 and n2:
    for( m2 = 0; m2 < M2; m2++ )
      for( n2 = 0; n2 < N2; n2++ )
        gemm(M1, N1, K1, A[:, m2, :], B[:, n2, :], C[:, :, n2, m2])
  Loop over m2, batched over n2:
    for( m2 = 0; m2 < M2; m2++ )
      gemm_batch(M1, N1, K1, A[:, m2, :], B[:, n2, :], C[:, :, n2, m2], N2)
  Loop over n2, batched over m2:
    for( n2 = 0; n2 < N2; n2++ )
      gemm_batch(M1, N1, K1, A[:, m2, :], B[:, n2, :], C[:, :, n2, m2], M2)
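
The "loop over m2 and n2" variant can be sketched with a GEMM that takes an offset and a leading dimension per operand, which is how a 2D slice of a larger tensor is usually passed to BLAS. Column-major storage and the reference `gemm` below are assumptions of this example, not the talk's code.

```python
def gemm(M, N, K, A, a0, lda, B, b0, ldb, C, c0, ldc):
    """C += A * B on column-major sub-matrices given by (offset, leading dim)."""
    for n in range(N):
        for k in range(K):
            b = B[b0 + k + n * ldb]
            for m in range(M):
                C[c0 + m + n * ldc] += A[a0 + m + k * lda] * b


M1, M2, N1, N2, K1 = 2, 3, 2, 2, 3
A = [float(i % 7) for i in range(M1 * M2 * K1)]       # A[m1, m2, k1]
B = [float(i % 4 - 1) for i in range(K1 * N2 * N1)]   # B[k1, n2, n1]
C = [0.0] * (M1 * N1 * N2 * M2)                       # C[m1, n1, n2, m2]

for m2 in range(M2):
    for n2 in range(N2):
        gemm(M1, N1, K1,
             A, m2 * M1, M1 * M2,                          # A[:, m2, :]
             B, n2 * K1, K1 * N2,                          # B[:, n2, :]
             C, n2 * M1 * N1 + m2 * M1 * N1 * N2, M1)      # C[:, :, n2, m2]


def reference(m1, n1, n2, m2):
    """Direct evaluation of one output element."""
    return sum(A[m1 + m2 * M1 + k1 * M1 * M2] * B[k1 + n2 * K1 + n1 * K1 * N2]
               for k1 in range(K1))
```

Each call works on an M1 × N1 × K1 problem, so the individual GEMMs shrink as more indices are moved into the outer loops.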
$C_{m_1,n_1} \leftarrow A_{k_1,m_1,k_2} B_{k_2,n_1,k_1}$
  Loop over k1:
    for( k1 = 0; k1 < K1; k1++ )
      gemm(M1, N1, K2, A[k1, :, :], B[:, :, k1], C[:, :])
  Loop over k2:
    for( k2 = 0; k2 < K2; k2++ )
      gemm(M1, N1, K1, A[:, :, k2]^T, B[k2, :, :]^T, C[:, :])
  Loop over n1 and k2 (GEMV):
    for( n1 = 0; n1 < N1; n1++ )
      for( k2 = 0; k2 < K2; k2++ )
        gemv(M1, K1, A[:, :, k2]^T, B[k2, n1, :], C[:, n1])
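
The GEMV variant can be sketched as below. Because k2 is a contracted index, every iteration of the k2 loop accumulates into the same column of C. Column-major storage and the helper `gemv_t` (a hand-written y += Aᵀx with a strided x) are assumptions of this example.

```python
def gemv_t(M, K, A, a0, lda, x, x0, incx, y, y0):
    """y[m] += sum_k A[k, m] * x[k] for a K-by-M column-major A (y += A^T x)."""
    for m in range(M):
        acc = 0.0
        for k in range(K):
            acc += A[a0 + k + m * lda] * x[x0 + k * incx]
        y[y0 + m] += acc


M1, N1, K1, K2 = 3, 2, 2, 3
A = [float(i % 5) for i in range(K1 * M1 * K2)]       # A[k1, m1, k2]
B = [float(i % 3 - 1) for i in range(K2 * N1 * K1)]   # B[k2, n1, k1]
C = [0.0] * (M1 * N1)                                 # C[m1, n1]

for n1 in range(N1):
    for k2 in range(K2):
        gemv_t(M1, K1,
               A, k2 * K1 * M1, K1,                   # A[:, :, k2] (K1-by-M1)
               B, k2 + n1 * K2, K2 * N1,              # B[k2, n1, :] (strided)
               C, n1 * M1)                            # C[:, n1]


def reference(m1, n1):
    """Direct evaluation of one output element."""
    return sum(A[k1 + m1 * K1 + k2 * K1 * M1] * B[k2 + n1 * K2 + k1 * K2 * N1]
               for k1 in range(K1) for k2 in range(K2))
```

In this layout the slice `A[:, :, k2]` is contiguous and `B[k2, n1, :]` is a vector with a constant stride, so both fit the usual (leading dimension, increment) calling convention.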

Loop Over GEMM (LoG)
Search space:
  - GEMM indices: m, n, k
  - Loop order
  - Batched index
Advantages:
  - Easy to implement
  - Exploits existing BLAS libraries
  - No additional memory required
Disadvantages:
  - Some contractions cannot be implemented via straight LoG
  - GEMM’s arithmetic intensity can be suboptimal
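
A rough way to see the arithmetic-intensity point is a back-of-the-envelope model. The sketch below assumes double precision and an idealized memory traffic in which A and B are read once and C is read and written once. Both the model and the name `gemm_intensity` are assumptions of this example, not measurements from the talk.

```python
def gemm_intensity(M, N, K, bytes_per_element=8):
    """Flops per byte of an M x N x K GEMM under an idealized traffic model."""
    flops = 2 * M * N * K                      # one multiply and one add per term
    elements_moved = M * K + K * N + 2 * M * N # A, B, and C (read + write)
    return flops / (bytes_per_element * elements_moved)


large = gemm_intensity(512, 512, 512)   # one large GEMM
sliced = gemm_intensity(512, 512, 4)    # a sliced GEMM with a short k dimension
```

Under this model the 512 × 512 × 512 case reaches 32 flops per byte, while the slice with k = 4 stays below 1 flop per byte, which is the kind of small GEMM a LoG loop nest may end up calling.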

Map Tensor Contractions to BLAS
Free indices of A: $I_m := \{m_1, m_2, \ldots, m_\gamma\} = I_A \cap I_C$
Free indices of B: $I_n := \{n_1, n_2, \ldots, n_\zeta\} = I_B \cap I_C$
Contracted indices: $I_k := \{k_1, k_2, \ldots, k_\xi\} = I_A \cap I_B$
Tensor contractions can be mapped to BLAS routines³,⁴:
  - GEMM: $I_m \neq \emptyset$ and $I_n \neq \emptyset$ and $I_k \neq \emptyset$
  - GEMV: ($I_m = \emptyset$ or $I_n = \emptyset$) and $I_k \neq \emptyset$
  - GER: $I_m \neq \emptyset$ and $I_n \neq \emptyset$ and $I_k = \emptyset$
  - AXPY: ($I_m = \emptyset$ or $I_n = \emptyset$) and $I_k = \emptyset$
  - DOT: else
³ Di Napoli et al. “Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions”
⁴ Yang Shi et al. “Tensor Contractions with Extended BLAS Kernels on CPU and GPU”
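
The index sets and the case distinction can be written down directly. The sketch below is a literal transcription of the rules on the slide, applied in the order given. The function name `classify_contraction` is hypothetical, and it assumes every index appears in exactly two of the three tensors.

```python
def classify_contraction(idx_a, idx_b, idx_c):
    """Return (routine, I_m, I_n, I_k) for C[idx_c] <- A[idx_a] * B[idx_b]."""
    i_a, i_b, i_c = set(idx_a), set(idx_b), set(idx_c)
    i_m = i_a & i_c          # free indices of A
    i_n = i_b & i_c          # free indices of B
    i_k = i_a & i_b          # contracted indices

    if i_m and i_n and i_k:
        routine = "GEMM"
    elif (not i_m or not i_n) and i_k:
        routine = "GEMV"
    elif i_m and i_n and not i_k:
        routine = "GER"
    elif (not i_m or not i_n) and not i_k:
        routine = "AXPY"
    else:
        # Fallback case from the slide; with the four tests above it is
        # not reached for well-formed input sets.
        routine = "DOT"
    return routine, sorted(i_m), sorted(i_n), sorted(i_k)


example = classify_contraction(["i", "k", "a"], ["k", "j", "b"],
                               ["a", "b", "i", "j"])
```

For the contraction C[a,b,i,j] = A[i,k,a] * B[k,j,b] this gives I_m = {a, i}, I_n = {b, j}, I_k = {k}, so the GEMM case applies.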

Transpose-Transpose-GEMM-Transpose (TTGT)
Conceptual Idea:
  1. “Flatten” the tensors to matrices
  2. Use GEMM for contraction
  3. “Unflatten” output matrix to tensor
$C_{m_1,n_1} \leftarrow A_{k_1,m_1,k_2} B_{k_2,n_1,k_1}$
Variant 1:
  $\tilde{A}_{m_1,(k_1,k_2)} \leftarrow A_{k_1,m_1,k_2}$
  $\tilde{B}_{(k_1,k_2),n_1} \leftarrow B_{k_2,n_1,k_1}$
  gemm(M1, N1, K1 × K2, Ã, B̃, C)
Variant 2:
  $\tilde{A}_{(k_1,k_2),m_1} \leftarrow A_{k_1,m_1,k_2}$
  $\tilde{B}_{(k_1,k_2),n_1} \leftarrow B_{k_2,n_1,k_1}$
  gemm(M1, N1, K1 × K2, Ã^T, B̃, C)
Variant 3:
  $\tilde{A}_{(k_1,k_2),m_1} \leftarrow A_{k_1,m_1,k_2}$
  $\tilde{B}_{(k_1,k_2),n_1} \leftarrow B_{k_2,n_1,k_1}$
  gemm(N1, M1, K1 × K2, B̃^T, Ã, C̃)
  $C_{m_1,n_1} \leftarrow \tilde{C}_{n_1,m_1}$
Variant 4:
  $\tilde{A}_{(k_2,k_1),m_1} \leftarrow A_{k_1,m_1,k_2}$
  $\tilde{B}_{(k_2,k_1),n_1} \leftarrow B_{k_2,n_1,k_1}$
  gemm(N1, M1, K1 × K2, B̃^T, Ã, C̃)
  $C_{m_1,n_1} \leftarrow \tilde{C}_{n_1,m_1}$
... and more.
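
The first variant (transpose A, transpose B, one GEMM) can be sketched as follows. Column-major flat lists, the explicit-loop transposes, and the reference `gemm` are assumptions of this example. A real implementation would call an optimized transposition routine and a BLAS GEMM instead.

```python
def transpose_a(A, M1, K1, K2):
    """At[m1, (k1,k2)] <- A[k1, m1, k2], both flat column-major."""
    At = [0.0] * (M1 * K1 * K2)
    for k2 in range(K2):
        for k1 in range(K1):
            for m1 in range(M1):
                At[m1 + (k1 + k2 * K1) * M1] = A[k1 + m1 * K1 + k2 * K1 * M1]
    return At


def transpose_b(B, N1, K1, K2):
    """Bt[(k1,k2), n1] <- B[k2, n1, k1], both flat column-major."""
    Bt = [0.0] * (K1 * K2 * N1)
    for n1 in range(N1):
        for k2 in range(K2):
            for k1 in range(K1):
                Bt[(k1 + k2 * K1) + n1 * K1 * K2] = B[k2 + n1 * K2 + k1 * K2 * N1]
    return Bt


def gemm(M, N, K, A, B, C):
    """Reference column-major GEMM on flat lists: C += A * B."""
    for n in range(N):
        for k in range(K):
            b = B[k + n * K]
            for m in range(M):
                C[m + n * M] += A[m + k * M] * b


def ttgt(A, B, M1, N1, K1, K2):
    At = transpose_a(A, M1, K1, K2)      # extra memory: one copy of A
    Bt = transpose_b(B, N1, K1, K2)      # extra memory: one copy of B
    C = [0.0] * (M1 * N1)
    gemm(M1, N1, K1 * K2, At, Bt, C)     # single large GEMM
    return C                             # already in (m1, n1) order here


M1, N1, K1, K2 = 3, 2, 2, 3
A = [float(i % 5) for i in range(K1 * M1 * K2)]       # A[k1, m1, k2]
B = [float(i % 3 - 1) for i in range(K2 * N1 * K1)]   # B[k2, n1, k1]
C = ttgt(A, B, M1, N1, K1, K2)


def reference(m1, n1):
    return sum(A[k1 + m1 * K1 + k2 * K1 * M1] * B[k2 + n1 * K2 + k1 * K2 * N1]
               for k1 in range(K1) for k2 in range(K2))
```

In this variant the output of the GEMM already has the layout of C, so the final transposition is not needed. Variants 3 and 4 on the slide trade that for a transposition of the result.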

Transpose-Transpose-GEMM-Transpose (TTGT)
Search space:
  - Any permutation of $I_m$, $I_n$, $I_k$
  - Transposed A
  - Transposed B
  - Interchange A and B within GEMM
Advantages:
  - Easy to implement
  - Exploits existing BLAS libraries
  - All TCs can be implemented via TTGT
  - Large GEMM ⇒ good performance?
Disadvantages:
  - Transpositions⁵ account for pure overhead
  - Additional memory required
⁵ Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions”

GEMM-like Tensor-Tensor Multiplication (GETT)
Key Idea:
  - Eliminate explicit transpositions
  - Pack-and-transpose while moving data into the caches⁵
  ⇒ Complexity offloaded into packing routines
  // N-Loop
  for n = 1 : nc : S_{I_n}
    // K-Loop (contracted)
    for k = 1 : kc : S_{I_k}
      B̂ = identify_subtensor(B, n, k)
      // pack B̂ into B̃ (L3 cache)
      B̃ = packB(B̂)
      // M-Loop
      for m = 1 : mc : S_{I_m}
        Â = identify_subtensor(A, m, k)
        // pack Â into Ã (L2 cache)
        Ã = packA(Â)
        Ĉ = identify_subtensor(C, m, n)
        // compute matrix-matrix product of Ã B̃
        macroKernel(Ã, B̃, Ĉ, α, β)
⁵ Sharan Chetlur et al. “CUDNN: Efficient Primitives for Deep Learning”
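
The loop structure above can be made concrete for one contraction, $C_{m_1,n_1} \leftarrow \alpha A_{k_1,m_1,k_2} B_{k_2,n_1,k_1} + \beta C_{m_1,n_1}$. This is a structural sketch only: the names `gett`, `pack_a`, `pack_b` and `macro_kernel`, the tiny block sizes, and the column-major flat-list storage are all assumptions of this example, and pure Python obviously says nothing about performance.

```python
def gett(A, B, C, M1, N1, K1, K2, alpha=1.0, beta=0.0, mc=2, nc=2, kc=3):
    """In-place C = alpha * A*B + beta * C with packing done block by block."""
    K = K1 * K2                                  # flattened contracted index (k1,k2)

    def pack_a(m0, mb, k0, kb):
        # Gather A[k1, m1, k2] into a contiguous mb-by-kb block (transposes on the fly).
        buf = [0.0] * (mb * kb)
        for kk in range(kb):
            k1, k2 = (k0 + kk) % K1, (k0 + kk) // K1
            for mm in range(mb):
                buf[mm + kk * mb] = A[k1 + (m0 + mm) * K1 + k2 * K1 * M1]
        return buf

    def pack_b(k0, kb, n0, nb):
        # Gather B[k2, n1, k1] into a contiguous kb-by-nb block.
        buf = [0.0] * (kb * nb)
        for nn in range(nb):
            for kk in range(kb):
                k1, k2 = (k0 + kk) % K1, (k0 + kk) // K1
                buf[kk + nn * kb] = B[k2 + (n0 + nn) * K2 + k1 * K2 * N1]
        return buf

    def macro_kernel(a, b, m0, mb, n0, nb, kb, scale_c):
        # Small matrix-matrix product on the packed blocks, updating C in place.
        for nn in range(nb):
            for mm in range(mb):
                acc = 0.0
                for kk in range(kb):
                    acc += a[mm + kk * mb] * b[kk + nn * kb]
                idx = (m0 + mm) + (n0 + nn) * M1
                C[idx] = alpha * acc + scale_c * C[idx]

    for n0 in range(0, N1, nc):                      # N-loop
        nb = min(nc, N1 - n0)
        for k0 in range(0, K, kc):                   # K-loop (contracted)
            kb = min(kc, K - k0)
            b_packed = pack_b(k0, kb, n0, nb)        # stands in for the L3-resident block
            scale_c = beta if k0 == 0 else 1.0       # apply beta once per C block
            for m0 in range(0, M1, mc):              # M-loop
                mb = min(mc, M1 - m0)
                a_packed = pack_a(m0, mb, k0, kb)    # stands in for the L2-resident block
                macro_kernel(a_packed, b_packed, m0, mb, n0, nb, kb, scale_c)
    return C


M1, N1, K1, K2 = 3, 3, 2, 2
A = [float(i % 5) for i in range(K1 * M1 * K2)]       # A[k1, m1, k2]
B = [float(i % 3 - 1) for i in range(K2 * N1 * K1)]   # B[k2, n1, k1]
C0 = [float(2 * i) for i in range(M1 * N1)]           # initial C[m1, n1]
C = gett(A, B, list(C0), M1, N1, K1, K2, alpha=2.0, beta=0.5)


def reference(m1, n1):
    s = sum(A[k1 + m1 * K1 + k2 * K1 * M1] * B[k2 + n1 * K2 + k1 * K2 * N1]
            for k1 in range(K1) for k2 in range(K2))
    return 2.0 * s + 0.5 * C0[m1 + n1 * M1]
```

The permutation of the input tensors happens inside `pack_a` and `pack_b`, so no full transposed copy of A or B is ever created. Only the small packed blocks need extra storage.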
Announced on the slide: “TCCG: Tensor Contraction Code Generator”, Monday, February 27, 10:15 - 10:35
GEMM-like Tensor-Tensor Multiplication (GETT)
Search space:
  Blocking parameters: mc, nc, kc
  Subtensors Ab, Bb, Cb
Advantages:
  Same arithmetic intensity as GEMM
  No memory overhead
Disadvantages:
  Complex to implement
Tensor Contraction Code Generator (TCCG)
Input: Mathematical description of TC, e.g., C[a,b,i,j] = A[i,k,a] * B[k,j,b];
Output: High-Performance C++ code
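To see how such a description maps onto a GEMM-like problem, the indices can be split into three groups: free indices of A, free indices of B, and contracted indices. The sketch below is only an illustration of that classification. The function name `classify_indices` and the regular expression are assumptions for the example and are not taken from TCCG.

```python
import re


def classify_indices(spec):
    """Split the indices of 'C[...] = A[...] * B[...]' into free and contracted groups."""
    tensors = re.findall(r"(\w+)\[([^\]]*)\]", spec)
    if len(tensors) != 3:
        raise ValueError("expected exactly three tensors: C = A * B")
    (_, c), (_, a), (_, b) = tensors
    c_idx, a_idx, b_idx = ([i.strip() for i in s.split(",")] for s in (c, a, b))
    free_a = [i for i in c_idx if i in a_idx]        # plays the role of GEMM's m
    free_b = [i for i in c_idx if i in b_idx]        # plays the role of GEMM's n
    # Indices shared by A and B but absent from C are summed over (GEMM's k).
    contracted = [i for i in a_idx if i in b_idx and i not in c_idx]
    return free_a, free_b, contracted
```

For the contraction on this slide, the call `classify_indices("C[a,b,i,j] = A[i,k,a] * B[k,j,b];")` groups a and i with A, b and j with B, and marks k as contracted.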
[Flowchart: TC, sizes → Merge indices → Solution already known? If yes: contraction.hpp. If no: generate candidates (LoG, TTGT, GETT) and apply the perf. model to each one; if the cost is okay, add the candidate to the list; if the cost is too high, the candidate is not added. More candidates? If no: compile candidates → time candidates → store fastest candidate → contraction.hpp.]
Figure: Schematic overview of TCCG.
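The selection loop in the schematic can be sketched as follows. This is a hedged illustration of the control flow only: `select_fastest`, `estimate_cost`, `measure`, `max_cost` and `cache` are hypothetical names, the index-merging step is left out, and real candidates would be compiled C++ kernels rather than Python values.

```python
def select_fastest(key, candidates, estimate_cost, measure, max_cost, cache):
    """Pick the fastest candidate for `key`, reusing a stored solution if present."""
    if key in cache:                                  # "Solution already known?"
        return cache[key]
    # Apply the performance model: keep candidates whose estimated cost is okay.
    shortlist = [c for c in candidates if estimate_cost(c) <= max_cost]
    if not shortlist:                                 # always keep at least one candidate
        shortlist = [min(candidates, key=estimate_cost)]
    best = min(shortlist, key=measure)                # time candidates, keep the fastest
    cache[key] = best                                 # store fastest candidate
    return best
```

Pruning with the model before timing means that only the shortlisted candidates pay the cost of compilation and measurement.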
Performance: Haswell (single core)
[Figure: GFLOPS (0 to 80) for each benchmark TC; series: LoG]
Not all TCs can be implemented via LoG
Mixed performance
Performance: Haswell (single core)
[Figure: GFLOPS (0 to 80) for each benchmark TC; series: LoG, TTGT]
TTGT: good for compute-bound TCs
TTGT: bad for bandwidth-bound TCs
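A small sketch of the TTGT scheme makes the extra data movement visible. It handles one fixed contraction, C[m1][n1][m2] = sum over k of A[m1][m2][k] * B[k][n1], with nested lists; the function name `ttgt_contract` and the storage format are assumptions for the example, and a real implementation would call an optimized transposition routine and a BLAS GEMM instead.

```python
def ttgt_contract(A, B, M1, M2, N1, K1):
    """C[m1][n1][m2] = sum_k A[m1][m2][k] * B[k][n1] via explicit reshuffles and one GEMM."""
    # Step 1: unfold A into an (M1*M2) x K1 matrix (row index r = m1*M2 + m2).
    A_mat = [A[m1][m2][:] for m1 in range(M1) for m2 in range(M2)]
    # Step 2: B already has matrix form K1 x N1 here, so this step is a plain copy.
    B_mat = [row[:] for row in B]
    # Step 3 (GEMM): ordinary matrix-matrix product on the unfolded operands.
    C_mat = [[sum(A_mat[r][k] * B_mat[k][n] for k in range(K1)) for n in range(N1)]
             for r in range(M1 * M2)]
    # Step 4: fold C_mat[(m1, m2)][n1] back into the requested layout C[m1][n1][m2].
    return [[[C_mat[m1 * M2 + m2][n] for m2 in range(M2)] for n in range(N1)]
            for m1 in range(M1)]
```

Steps 1, 2 and 4 each make a full pass over a tensor without doing any arithmetic. That extra traffic weighs little when the GEMM dominates, but it is significant when the contraction is bandwidth-bound, in line with the observation on this slide.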
Performance: Haswell (single core)
[Figure: GFLOPS (0 to 80) for each benchmark TC; series: LoG, TTGT, GETT]
GETT: excels for bandwidth-bound TCs
GETT: good for compute-bound TCs
Performance: Multi-threaded
[Figure: GFLOPS for each benchmark TC; series: LoG, TTGT. Panel (a): 2 × Intel Xeon E5-2680 v3, axis 0 to 1800 GFLOPS. Panel (b): NVIDIA Tesla P100, axis 0 to 9000 GFLOPS.]
Performance gap increases for bandwidth-bound TCs
GEMM Performance: Multi-threaded
[Figure: GEMM performance in GFLOPS for each benchmark TC. Panel (a): 2 × Intel Xeon E5-2680 v3, axis 0 to 1800 GFLOPS. Panel (b): NVIDIA Tesla P100, axis 0 to 9000 GFLOPS.]
Performance for equally-sized GEMMs varies greatly
  For different settings: opA, opB, interchanged A and B
Performance Model for TTGT and LoG:
  Account for varying GEMM perf [6]

[6] Elmar Peise et al. “On the Performance Prediction of BLAS-based Tensor Contractions”
Conclusion and Future Work
Conclusion
  A survey of different approaches to TCs has been presented
  GETT exhibits high performance across a wide range of TCs
  TCCG is available at https://github.com/HPAC/tccg
Future Work
  Implement TC library based on GETT
  Parallelize GETT
Thank you for your attention.
Performance
Systems:
  Intel Xeon E5-2680 v3 CPU (Haswell)
  NVIDIA Tesla P100 GPU (Pascal)
Compilers:
  icpc 16.0.1 20151021
  nvcc v8.0.44
Benchmark:
  Collection of 48 TCs
  Compiled from four publications
  Each TC is at least 200 MiB
  Correctness checked against naive loop-based implementation
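A naive loop-based reference of the kind mentioned in the last item might look like the sketch below, written for the example contraction C[a,b,i,j] = A[i,k,a] * B[k,j,b]. The names `naive_contraction` and `max_abs_diff`, the `size` dictionary and the nested-list storage are assumptions for illustration; the benchmark's actual checker is not shown in the slides.

```python
def naive_contraction(A, B, size):
    """Reference for C[a][b][i][j] = sum_k A[i][k][a] * B[k][j][b] using plain nested loops."""
    na, nb, ni, nj, nk = (size[x] for x in "abijk")
    C = [[[[0.0] * nj for _ in range(ni)] for _ in range(nb)] for _ in range(na)]
    for a in range(na):
        for b in range(nb):
            for i in range(ni):
                for j in range(nj):
                    for k in range(nk):
                        C[a][b][i][j] += A[i][k][a] * B[k][j][b]
    return C


def max_abs_diff(X, Y):
    """Largest element-wise difference between two equally shaped nested lists."""
    if isinstance(X, list):
        return max(max_abs_diff(x, y) for x, y in zip(X, Y))
    return abs(X - Y)
```

An optimized result can then be accepted when `max_abs_diff(result, reference)` stays below a tolerance chosen for the precision in use.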
Performance - DP
[Figure: GFLOPS (0 to 40) for each benchmark TC; series: TTGT, CTF, TT]
TTGT faster than CTF everywhere
TTGT good in compute-bound regime
TTGT bad in bandwidth-bound regime
Performance: m1n1m2-m1k1m2-n1k1
[Figure: GFLOPS (0 to 80) versus S_Ik (8, 16, 32, 64, 128, 256, 512, 1024); series: GETT, LoG, TTGT, GEMM]
GETT especially good in bandwidth-bound regime
GETT still attains up to 91.3% of peak floating-point performance
TTGT poor in bandwidth-bound regime
Performance: m1n1m2n2-m1k1m2-n1k1n2
[Figure: GFLOPS (0 to 80) versus S_Ik (8, 16, 32, 64, 128, 256, 512, 1024); series: GETT, LoG, TTGT, GEMM]
GETT especially good in bandwidth-bound regime
GETT still attains up to 91.3% of peak floating-point performance
TTGT poor in bandwidth-bound regime
LoG performance can become arbitrarily bad
GETT and TTGT barely affected by higher dimensions
Speedup
[Figure: speedup for each benchmark TC. Panels: (a) Single-Precision. (b) Double-Precision.]
GETT
[Figure: GETT data flow in six numbered steps. A subtensor Ab (mc × kc) and a subtensor Bb (kc × nc) are packed into Ae and Be; the macro-kernel computes Cb = Ae × Be and updates the mc × nc subtensor Cb.]
C_{m1,n1,m2} = A_{m1,m2,k1} × B_{k1,n1}
GETT: Macro-/Micro-Kernel
[Figure: the macro-kernel updates the mc × nc block Cb from the packed buffers Ae (mc × kc) and Be (kc × nc); the micro-kernel updates one mr × nr tile of C at a time.]
Blocking for L3, L2, L1 cache as well as registers
Written in AVX2 intrinsics
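The split between macro-kernel and micro-kernel can be sketched in scalar Python. This only illustrates the loop structure, assuming packed buffers stored as nested lists with kc ≥ 1. The function names and the default tile sizes mr = nr = 2 are made up for the example, and the AVX2 register blocking of the real kernel is not modelled.

```python
def micro_kernel(A_panel, B_panel, C_tile):
    """C_tile (mr x nr) += A_panel (mr x kc) * B_panel (kc x nr), as kc rank-1 updates."""
    mr, nr, kc = len(A_panel), len(B_panel[0]), len(B_panel)
    for p in range(kc):
        for i in range(mr):
            a = A_panel[i][p]
            for j in range(nr):
                C_tile[i][j] += a * B_panel[p][j]


def macro_kernel(A_packed, B_packed, C_block, mr=2, nr=2):
    """C_block (mc x nc) += A_packed (mc x kc) * B_packed (kc x nc), tile by tile."""
    mc, nc = len(A_packed), len(B_packed[0])
    for j0 in range(0, nc, nr):
        for i0 in range(0, mc, mr):
            rows = range(i0, min(i0 + mr, mc))
            cols = range(j0, min(j0 + nr, nc))
            A_panel = [A_packed[i] for i in rows]
            B_panel = [[row[j] for j in cols] for row in B_packed]
            # Load the tile, update it in the micro-kernel, then write it back.
            C_tile = [[C_block[i][j] for j in cols] for i in rows]
            micro_kernel(A_panel, B_panel, C_tile)
            for ti, i in enumerate(rows):
                for tj, j in enumerate(cols):
                    C_block[i][j] = C_tile[ti][tj]
```

In an optimized kernel the mr × nr tile would stay in vector registers for the whole loop over kc, which is the purpose of the register-level blocking.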
Packing via Tensor Transpositions
[Figure: the subtensor Ab_{m1,k,m2} is transposed by TTC [7] into the packed buffer Ae_{(m1,m2),k} ≡ Ae_{m1,m2,k}.]
Preserve stride-1 index ⇒ Efficient packing routines
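The effect of preserving the stride-1 index can be sketched with flat arrays. The example assumes a column-major style layout in which the first index (m1) is contiguous in memory, both for the source A[m1,k,m2] and for the packed buffer A_packed[m1,m2,k]. The function name `pack_a` and this layout are assumptions for illustration, not the routine that TTC generates.

```python
def pack_a(A_flat, M1, K, M2):
    """Transpose A[m1,k,m2] into A_packed[m1,m2,k]; m1 is the stride-1 index in both."""
    packed = [0.0] * (M1 * M2 * K)
    for k in range(K):
        for m2 in range(M2):
            src = (k + m2 * K) * M1          # offset of A[0, k, m2]
            dst = (m2 + k * M2) * M1         # offset of A_packed[0, m2, k]
            # m1 is contiguous in source and destination: copy a whole run at once.
            packed[dst:dst + M1] = A_flat[src:src + M1]
    return packed
```

Because m1 stays the fastest-varying index, every copy reads and writes a contiguous run of M1 elements instead of touching memory with a large stride.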
[7] Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions”

GETT: Summary
Blocking for caches
Blocking for registers
Explicitly vectorized
Use TTC to generate high-performance packing routines
  Exploits full cache line (avoids non-stride-one memory accesses)
Explore large search-space:
  Different GEMM-variants (e.g., panel-matrix, matrix-panel)
  Different permutations
  Different values for mc, nc and kc
Prune the search space via a performance model
Performance
[Figure: GFLOPS (0 to 80) for each benchmark TC; series: TTGT, CTF]
TTGT good in compute-bound regime
TTGT bad in bandwidth-bound regime
TTGT faster than CTF everywhere
GETT Performance Model
[Figure: efficiency (0.0 to 1.0) for each benchmark TC; series: 1, 4, 8, 16, 32 candidates. Panels: (a) Single-Precision. (b) Double-Precision.]
Figure: Limit the GETT candidates to 1, 4, 8, 16 or 32, respectively.
Average performance without search: 90.7% / 92.3%
Average performance of the four best candidates: 98.3% / 97.2%
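One plausible reading of the efficiency plotted here is the performance of the best candidate among the k that the model ranks highest, divided by the performance of the best candidate overall. The sketch below computes that quantity under this assumption. The function name `efficiency_at_k` is hypothetical and the slides do not give the exact definition.

```python
def efficiency_at_k(predicted_cost, measured_gflops, k):
    """Best measured performance among the k model-preferred candidates,
    relative to the best measured performance over all candidates."""
    # Rank candidate ids by predicted cost, cheapest first.
    ranked = sorted(range(len(predicted_cost)), key=lambda c: predicted_cost[c])
    best_of_k = max(measured_gflops[c] for c in ranked[:k])
    return best_of_k / max(measured_gflops)
```

With k = 1 no search takes place and the model's first choice is used directly; larger k trades more compile and timing effort for a result closer to the best candidate.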