<<

The Landscape of High-Performance Contractions

Paul Springer and Paolo Bientinesi

Aachen Institute for Advanced Study in Computational Science

Atlanta, Feb. 24th 2017

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 1 / 17 Nested loops Loops over GEMM (LoG) -Transpose-GEMM-Transpose (TTGT)

Akin to a high-performance GEMM implementation

1-order tensor: vector Ai1

2-order tensor: Ai1,i2

n-order tensor: Ai1,i2,...,in Tensor contractions can be thought of as generalized GEMMs Three approaches to tensor contractions:

We propose a novel approach: GETT

Introduction

A is a multidimensional array: 0-order tensor: α

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 2 / 17 Nested loops Loops over GEMM (LoG) Transpose-Transpose-GEMM-Transpose (TTGT)

Akin to a high-performance GEMM implementation

2-order tensor: matrix Ai1,i2

n-order tensor: Ai1,i2,...,in Tensor contractions can be thought of as generalized GEMMs Three approaches to tensor contractions:

We propose a novel approach: GETT

Introduction

A tensors is a multidimensional array: 0-order tensor: scalar α

1-order tensor: vector Ai1

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 2 / 17 Nested loops Loops over GEMM (LoG) Transpose-Transpose-GEMM-Transpose (TTGT)

Akin to a high-performance GEMM implementation

n-order tensor: Ai1,i2,...,in Tensor contractions can be thought of as generalized GEMMs Three approaches to tensor contractions:

We propose a novel approach: GETT

Introduction

A tensors is a multidimensional array: 0-order tensor: scalar α

1-order tensor: vector Ai1

2-order tensor: matrix Ai1,i2

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 2 / 17 Nested loops Loops over GEMM (LoG) Transpose-Transpose-GEMM-Transpose (TTGT)

Akin to a high-performance GEMM implementation

Tensor contractions can be thought of as generalized GEMMs Three approaches to tensor contractions:

We propose a novel approach: GETT1

Introduction

A tensors is a multidimensional array: 0-order tensor: scalar α

1-order tensor: vector Ai1

2-order tensor: matrix Ai1,i2

n-order tensor: Ai1,i2,...,in

1Paul Springer et al. “Design of a high-performance GEMM-like Tensor-Tensor Multiplication” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 2 / 17 Nested loops Loops over GEMM (LoG) Transpose-Transpose-GEMM-Transpose (TTGT)

Akin to a high-performance GEMM implementation

Three approaches to tensor contractions:

We propose a novel approach: GETT1

Introduction

A tensors is a multidimensional array: 0-order tensor: scalar α

1-order tensor: vector Ai1

2-order tensor: matrix Ai1,i2

n-order tensor: Ai1,i2,...,in Tensor contractions can be thought of as generalized GEMMs

1Paul Springer et al. “Design of a high-performance GEMM-like Tensor-Tensor Multiplication” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 2 / 17 Akin to a high-performance GEMM implementation

We propose a novel approach: GETT1

Introduction

A tensors is a multidimensional array: 0-order tensor: scalar α

1-order tensor: vector Ai1

2-order tensor: matrix Ai1,i2

n-order tensor: Ai1,i2,...,in Tensor contractions can be thought of as generalized GEMMs Three approaches to tensor contractions: Nested loops Loops over GEMM (LoG) Transpose-Transpose-GEMM-Transpose (TTGT)

1Paul Springer et al. “Design of a high-performance GEMM-like Tensor-Tensor Multiplication” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 2 / 17 Introduction

A tensors is a multidimensional array: 0-order tensor: scalar α

1-order tensor: vector Ai1

2-order tensor: matrix Ai1,i2

n-order tensor: Ai1,i2,...,in Tensor contractions can be thought of as generalized GEMMs Three approaches to tensor contractions: Nested loops Loops over GEMM (LoG) Transpose-Transpose-GEMM-Transpose (TTGT) We propose a novel approach: GETT1 Akin to a high-performance GEMM implementation

1Paul Springer et al. “Design of a high-performance GEMM-like Tensor-Tensor Multiplication” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 2 / 17 Outline

Approaches to Tensor Contractions: Loops over GEMM (LoG) Transpose-Transpose-GEMM-Transpose (TTGT) GEMM-like Tensor-Tensor Multiply (GETT)

Tensor Contraction Code Generator2

Performance Evaluation

2Source code available at: https://github.com/HPAC/tccg Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 3 / 17 Loop over GEMM (LoG) Conceptual Idea Identify 2D subtensors and contract them via GEMM

C ← P A B m1,n1 k1 m1,k1 k1,n1

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 4 / 17 Loop over GEMM (LoG) Conceptual Idea Identify 2D subtensors and contract them via GEMM

Cm1,n1 ← Am1,k1 Bk1,n1

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 4 / 17 Loop over GEMM (LoG) Conceptual Idea Identify 2D subtensors and contract them via GEMM

Cm1,n1 ← Am1,k1 Bk1,n1

m1 k1

k1 n1 Am1,k1 Bk1,n1

gemm (M1 , N1 , K1 , A[:, :], B[:, :], C[:, :])

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 4 / 17 Loop over GEMM (LoG) Conceptual Idea Identify 2D subtensors and contract them via GEMM

Cm1,n1 ← Am1,k1 Bk1,n1

Cm1,m2,n1 ← Am1,m2,k1 Bk1,n1

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 4 / 17 Loop over GEMM (LoG) Conceptual Idea Identify 2D subtensors and contract them via GEMM

Cm1,n1 ← Am1,k1 Bk1,n1

C(m ,m ),n ← A(m ,m ),k Bk1,n1 1 2 1 1 2 1 (m1, m2) k1

k1 n1 A(m1,m2),k1 Bk1,n1

gemm (M1 × M2 , N1 , K1 , A[:, :], B[:, :], C[:, :])

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 4 / 17 Loop over GEMM (LoG) Conceptual Idea Identify 2D subtensors and contract them via GEMM

Cm1,n1 ← Am1,k1 Bk1,n1

C(m1,m2),n1 ← A(m1,m2),k1 Bk1,n1

m1 k1 Cm1,n1,n2,m2 ← Am1,m2,k1 Bk1,n2,n1 k1 n1 m2 n2

Am1,m2,k1 Bk1,n2,n1

for( m2 = 0; m2 < M2 ; m2 ++ ) for( n1 = 0; n1 < N1 ; n1 ++ ) gemm (M1 , N2 , K1 , A[:, m2, :], B[:, :, n1], C[:, n1, :, m2])

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 4 / 17 Loop over GEMM (LoG) Conceptual Idea Identify 2D subtensors and contract them via GEMM

Cm1,n1 ← Am1,k1 Bk1,n1

C(m1,m2),n1 ← A(m1,m2),k1 Bk1,n1

m1 k1 Cm1,n1,n2,m2 ← Am1,m2,k1 Bk1,n2,n1 k1 n1 m2 n2

Am1,m2,k1 Bk1,n2,n1

for( m2 = 0; m2 < M2 ; m2 ++ ) for( n2 = 0; n2 < N2 ; n2 ++ ) gemm (M1 , N1 , K1 , A[:, m2, :], B[:, n2, :], C[:, :, n2, m2])

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 4 / 17 Loop over GEMM (LoG) Conceptual Idea Identify 2D subtensors and contract them via GEMM

Cm1,n1 ← Am1,k1 Bk1,n1

C(m1,m2),n1 ← A(m1,m2),k1 Bk1,n1

m1 k1 Cm1,n1,n2,m2 ← Am1,m2,k1 Bk1,n2,n1 k1 n1 m 2 [n2] Am1,m2,k1 Bk1,[n2],n1

for( m2 = 0; m2 < M2 ; m2 ++ ) gemm_batch(M1 , N1 , K1 , A[:, m2, :], B[:, n2, :], C[:, :, n2, m2], N2 )

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 4 / 17 Loop over GEMM (LoG) Conceptual Idea Identify 2D subtensors and contract them via GEMM

Cm1,n1 ← Am1,k1 Bk1,n1

C(m1,m2),n1 ← A(m1,m2),k1 Bk1,n1

m1 k1 Cm1,n1,n2,m2 ← Am1,m2,k1 Bk1,n2,n1 k1 n1 n [m2] 2 Am1,[m2],k1 Bk1,n2,n1

for( n2 = 0; n2 < N2 ; n2 ++ ) gemm_batch(M1 , N1 , K1 , A[:, m2, :], B[:, n2, :], C[:, :, n2, m2], M2 )

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 4 / 17 Loop over GEMM (LoG) Conceptual Idea Identify 2D subtensors and contract them via GEMM

Cm1,n1 ← Am1,k1 Bk1,n1

C(m1,m2),n1 ← A(m1,m2),k1 Bk1,n1

k1 k2 Cm1,n1,n2,m2 ← Am1,m2,k1 Bk1,n2,n1 k2 k1 m1 n1 Cm1,n1 ← Ak1,m1,k2 Bk2,n1,k1 Ak1,m1,k2 Bk2,n1,k1

for( k1 = 0; k1 < K1 ; k1 ++ ) gemm (M1 , N1 , K2 , A[k1, :, :], B[:, :, k1], C[:, :])

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 4 / 17 Loop over GEMM (LoG) Conceptual Idea Identify 2D subtensors and contract them via GEMM

Cm1,n1 ← Am1,k1 Bk1,n1

C(m1,m2),n1 ← A(m1,m2),k1 Bk1,n1

k1 k2 Cm1,n1,n2,m2 ← Am1,m2,k1 Bk1,n2,n1 k2 k1 m1 n1 Cm1,n1 ← Ak1,m1,k2 Bk2,n1,k1 Ak1,m1,k2 Bk2,n1,k1

for( k1 = 0; k1 < K1 ; k1 ++ ) gemm (M1 , N1 , K2 , A[k1, :, :], B[:, :, k1], C[:, :])

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 4 / 17 Loop over GEMM (LoG) Conceptual Idea Identify 2D subtensors and contract them via GEMM

Cm1,n1 ← Am1,k1 Bk1,n1

C(m1,m2),n1 ← A(m1,m2),k1 Bk1,n1

k1 k2 Cm1,n1,n2,m2 ← Am1,m2,k1 Bk1,n2,n1 k2 k1 m1 n1 Cm1,n1 ← Ak1,m1,k2 Bk2,n1,k1 Ak1,m1,k2 Bk2,n1,k1

for( k2 = 0; k2 < K2 ; k2 ++ ) T T gemm (M1 , N1 , K1 , A[:, :, k2] , B[k2, :, :] , C[:, :])

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 4 / 17 Loop over GEMM (LoG) Conceptual Idea Identify 2D subtensors and contract them via GEMM

Cm1,n1 ← Am1,k1 Bk1,n1

C(m1,m2),n1 ← A(m1,m2),k1 Bk1,n1

k1 k2 Cm1,n1,n2,m2 ← Am1,m2,k1 Bk1,n2,n1 k2 k1 m1 n1 Cm1,n1 ← Ak1,m1,k2 Bk2,n1,k1 Ak1,m1,k2 Bk2,n1,k1

for( k2 = 0; k2 < K2 ; k2 ++ ) T T gemm (M1 , N1 , K1 , A[:, :, k2] , B[k2, :, :] , C[:, :])

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 4 / 17 Loop over GEMM (LoG) Conceptual Idea Identify 2D subtensors and contract them via GEMM

Cm1,n1 ← Am1,k1 Bk1,n1

C(m1,m2),n1 ← A(m1,m2),k1 Bk1,n1

k1 k2 Cm1,n1,n2,m2 ← Am1,m2,k1 Bk1,n2,n1 k2 k1 m1 n1 Cm1,n1 ← Ak1,m1,k2 Bk2,n1,k1 Ak1,m1,k2 Bk2,n1,k1

for( n = 0; n1 < N1 ; n1 ++ ) for( k2 = 0; k2 < K2 ; k2 ++ ) T gemv (M1 , K1 , A[:, :, k2] , B[k2, n1, :], C[:, n])

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 4 / 17 Easy to implement Exploits existing BLAS libraries No additional memory required

Some contractions cannot be implemented via straight LoG GEMM’s arithmetic intensity can be suboptimal

GEMM indices: m, n, k Loop order Batched index Advantages:

Disadvantages:

Loop Over GEMM (LoG)

Search space:

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 5 / 17 Easy to implement Exploits existing BLAS libraries No additional memory required

Some contractions cannot be implemented via straight LoG GEMM’s arithmetic intensity can be suboptimal

Loop order Batched index Advantages:

Disadvantages:

Loop Over GEMM (LoG)

Search space: GEMM indices: m, n, k

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 5 / 17 Easy to implement Exploits existing BLAS libraries No additional memory required

Some contractions cannot be implemented via straight LoG GEMM’s arithmetic intensity can be suboptimal

Batched index Advantages:

Disadvantages:

Loop Over GEMM (LoG)

Search space: GEMM indices: m, n, k Loop order

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 5 / 17 Easy to implement Exploits existing BLAS libraries No additional memory required

Some contractions cannot be implemented via straight LoG GEMM’s arithmetic intensity can be suboptimal

Advantages:

Disadvantages:

Loop Over GEMM (LoG)

Search space: GEMM indices: m, n, k Loop order Batched index

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 5 / 17 Some contractions cannot be implemented via straight LoG GEMM’s arithmetic intensity can be suboptimal

Easy to implement Exploits existing BLAS libraries No additional memory required Disadvantages:

Loop Over GEMM (LoG)

Search space: GEMM indices: m, n, k Loop order Batched index Advantages:

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 5 / 17 Some contractions cannot be implemented via straight LoG GEMM’s arithmetic intensity can be suboptimal

Exploits existing BLAS libraries No additional memory required Disadvantages:

Loop Over GEMM (LoG)

Search space: GEMM indices: m, n, k Loop order Batched index Advantages: Easy to implement

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 5 / 17 Some contractions cannot be implemented via straight LoG GEMM’s arithmetic intensity can be suboptimal

No additional memory required Disadvantages:

Loop Over GEMM (LoG)

Search space: GEMM indices: m, n, k Loop order Batched index Advantages: Easy to implement Exploits existing BLAS libraries

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 5 / 17 Some contractions cannot be implemented via straight LoG GEMM’s arithmetic intensity can be suboptimal

Disadvantages:

Loop Over GEMM (LoG)

Search space: GEMM indices: m, n, k Loop order Batched index Advantages: Easy to implement Exploits existing BLAS libraries No additional memory required

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 5 / 17 Some contractions cannot be implemented via straight LoG GEMM’s arithmetic intensity can be suboptimal

Loop Over GEMM (LoG)

Search space: GEMM indices: m, n, k Loop order Batched index Advantages: Easy to implement Exploits existing BLAS libraries No additional memory required Disadvantages:

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 5 / 17 GEMM’s arithmetic intensity can be suboptimal

Loop Over GEMM (LoG)

Search space: GEMM indices: m, n, k Loop order Batched index Advantages: Easy to implement Exploits existing BLAS libraries No additional memory required Disadvantages: Some contractions cannot be implemented via straight LoG

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 5 / 17 Loop Over GEMM (LoG)

Search space: GEMM indices: m, n, k Loop order Batched index Advantages: Easy to implement Exploits existing BLAS libraries No additional memory required Disadvantages: Some contractions cannot be implemented via straight LoG GEMM’s arithmetic intensity can be suboptimal

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 5 / 17 In := {n1, n2, ..., nζ } = IB ∩ IC

Ik := {k1, k2, ..., kξ} = IA ∩ IB

GEMM: Im 6= ∅ and In 6= ∅ and Ik 6= ∅.

GEMV:(Im = ∅ or In = ∅) and Ik 6= ∅

GER: Im 6= ∅ and In 6= ∅ and Ik = ∅

AXPY:(Im = ∅ or In = ∅) and Ik = ∅ DOT: else.

Free indices of B

Contracted indices

Tensor contractions can be mapped to BLAS routines3,4:

Map Tensor Contractions to BLAS

Free indices of A

Im := {m1, m2, ..., mγ } = IA ∩ IC

3Di Napoli et al. “Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions” 4Yang Shi et al. “Tensor Contractions with Extended BLAS Kernels on CPU and GPU” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 6 / 17 Ik := {k1, k2, ..., kξ} = IA ∩ IB

GEMM: Im 6= ∅ and In 6= ∅ and Ik 6= ∅.

GEMV:(Im = ∅ or In = ∅) and Ik 6= ∅

GER: Im 6= ∅ and In 6= ∅ and Ik = ∅

AXPY:(Im = ∅ or In = ∅) and Ik = ∅ DOT: else.

Contracted indices

Tensor contractions can be mapped to BLAS routines3,4:

Map Tensor Contractions to BLAS

Free indices of A

Im := {m1, m2, ..., mγ } = IA ∩ IC Free indices of B

In := {n1, n2, ..., nζ } = IB ∩ IC

3Di Napoli et al. “Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions” 4Yang Shi et al. “Tensor Contractions with Extended BLAS Kernels on CPU and GPU” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 6 / 17 GEMM: Im 6= ∅ and In 6= ∅ and Ik 6= ∅.

GEMV:(Im = ∅ or In = ∅) and Ik 6= ∅

GER: Im 6= ∅ and In 6= ∅ and Ik = ∅

AXPY:(Im = ∅ or In = ∅) and Ik = ∅ DOT: else.

Tensor contractions can be mapped to BLAS routines3,4:

Map Tensor Contractions to BLAS

Free indices of A

Im := {m1, m2, ..., mγ } = IA ∩ IC Free indices of B

In := {n1, n2, ..., nζ } = IB ∩ IC Contracted indices

Ik := {k1, k2, ..., kξ} = IA ∩ IB

3Di Napoli et al. “Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions” 4Yang Shi et al. “Tensor Contractions with Extended BLAS Kernels on CPU and GPU” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 6 / 17 GEMV:(Im = ∅ or In = ∅) and Ik 6= ∅

GER: Im 6= ∅ and In 6= ∅ and Ik = ∅

AXPY:(Im = ∅ or In = ∅) and Ik = ∅ DOT: else.

Map Tensor Contractions to BLAS

Free indices of A

Im := {m1, m2, ..., mγ } = IA ∩ IC Free indices of B

In := {n1, n2, ..., nζ } = IB ∩ IC Contracted indices

Ik := {k1, k2, ..., kξ} = IA ∩ IB Tensor contractions can be mapped to BLAS routines3,4:

GEMM: Im 6= ∅ and In 6= ∅ and Ik 6= ∅.

3Di Napoli et al. “Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions” 4Yang Shi et al. “Tensor Contractions with Extended BLAS Kernels on CPU and GPU” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 6 / 17 GER: Im 6= ∅ and In 6= ∅ and Ik = ∅

AXPY:(Im = ∅ or In = ∅) and Ik = ∅ DOT: else.

Map Tensor Contractions to BLAS

Free indices of A

Im := {m1, m2, ..., mγ } = IA ∩ IC Free indices of B

In := {n1, n2, ..., nζ } = IB ∩ IC Contracted indices

Ik := {k1, k2, ..., kξ} = IA ∩ IB Tensor contractions can be mapped to BLAS routines3,4:

GEMM: Im 6= ∅ and In 6= ∅ and Ik 6= ∅.

GEMV:(Im = ∅ or In = ∅) and Ik 6= ∅

3Di Napoli et al. “Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions” 4Yang Shi et al. “Tensor Contractions with Extended BLAS Kernels on CPU and GPU” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 6 / 17 AXPY:(Im = ∅ or In = ∅) and Ik = ∅ DOT: else.

Map Tensor Contractions to BLAS

Free indices of A

Im := {m1, m2, ..., mγ } = IA ∩ IC Free indices of B

In := {n1, n2, ..., nζ } = IB ∩ IC Contracted indices

Ik := {k1, k2, ..., kξ} = IA ∩ IB Tensor contractions can be mapped to BLAS routines3,4:

GEMM: Im 6= ∅ and In 6= ∅ and Ik 6= ∅.

GEMV:(Im = ∅ or In = ∅) and Ik 6= ∅

GER: Im 6= ∅ and In 6= ∅ and Ik = ∅

3Di Napoli et al. “Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions” 4Yang Shi et al. “Tensor Contractions with Extended BLAS Kernels on CPU and GPU” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 6 / 17 DOT: else.

Map Tensor Contractions to BLAS

Free indices of A

Im := {m1, m2, ..., mγ } = IA ∩ IC Free indices of B

In := {n1, n2, ..., nζ } = IB ∩ IC Contracted indices

Ik := {k1, k2, ..., kξ} = IA ∩ IB Tensor contractions can be mapped to BLAS routines3,4:

GEMM: Im 6= ∅ and In 6= ∅ and Ik 6= ∅.

GEMV:(Im = ∅ or In = ∅) and Ik 6= ∅

GER: Im 6= ∅ and In 6= ∅ and Ik = ∅

AXPY:(Im = ∅ or In = ∅) and Ik = ∅

3Di Napoli et al. “Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions” 4Yang Shi et al. “Tensor Contractions with Extended BLAS Kernels on CPU and GPU” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 6 / 17 Map Tensor Contractions to BLAS

Free indices of A

Im := {m1, m2, ..., mγ } = IA ∩ IC Free indices of B

In := {n1, n2, ..., nζ } = IB ∩ IC Contracted indices

Ik := {k1, k2, ..., kξ} = IA ∩ IB Tensor contractions can be mapped to BLAS routines3,4:

GEMM: Im 6= ∅ and In 6= ∅ and Ik 6= ∅.

GEMV:(Im = ∅ or In = ∅) and Ik 6= ∅

GER: Im 6= ∅ and In 6= ∅ and Ik = ∅

AXPY:(Im = ∅ or In = ∅) and Ik = ∅ DOT: else.

3Di Napoli et al. “Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions” 4Yang Shi et al. “Tensor Contractions with Extended BLAS Kernels on CPU and GPU” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 6 / 17 Transpose-Transpose-GEMM-Transpose (TTGT)

Conceptual Idea 1 “Flatten” the tensors to matrices 2 Use GEMM for contraction 3 “Unflatten” output matrix to tensor

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 7 / 17 Transpose-Transpose-GEMM-Transpose (TTGT)

Conceptual Idea 1 “Flatten” the tensors to matrices 2 Use GEMM for contraction 3 “Unflatten” output matrix to tensor

Cm1,n1 ← Ak1,m1,k2 Bk2,n1,k1

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 7 / 17 Transpose-Transpose-GEMM-Transpose (TTGT)

Conceptual Idea 1 “Flatten” the tensors to matrices 2 Use GEMM for contraction 3 “Unflatten” output matrix to tensor

Cm1,n1 ← Ak1,m1,k2 Bk2,n1,k1

Aem1,(k1,k2) ← Ak1,m1,k2

Be(k1,k2),n1 ← Bk2,n1,k1 gemm(M1, N1, K1 × K2, Ae, Be, C)

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 7 / 17 Transpose-Transpose-GEMM-Transpose (TTGT)

Conceptual Idea 1 “Flatten” the tensors to matrices 2 Use GEMM for contraction 3 “Unflatten” output matrix to tensor

Cm1,n1 ← Ak1,m1,k2 Bk2,n1,k1

Aem1,(k1,k2) ← Ak1,m1,k2 Ae(k1,k2),m1 ← Ak1,m1,k2

Be(k1,k2),n1 ← Bk2,n1,k1 Be(k1,k2),n1 ← Bk2,n1,k1 T gemm(M1, N1, K1 × K2, Ae, Be, C) gemm(M1, N1, K1 × K2, Ae , Be, C)

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 7 / 17 Transpose-Transpose-GEMM-Transpose (TTGT)

Conceptual Idea 1 “Flatten” the tensors to matrices 2 Use GEMM for contraction 3 “Unflatten” output matrix to tensor

Cm1,n1 ← Ak1,m1,k2 Bk2,n1,k1

Aem1,(k1,k2) ← Ak1,m1,k2 Ae(k1,k2),m1 ← Ak1,m1,k2

Be(k1,k2),n1 ← Bk2,n1,k1 Be(k1,k2),n1 ← Bk2,n1,k1 T gemm(M1, N1, K1 × K2, Ae, Be, C) gemm(M1, N1, K1 × K2, Ae , Be, C)

Ae(k1,k2),m1 ← Ak1,m1,k2

Be(k1,k2),n1 ← Bk2,n1,k1 T gemm(M1, N1, K1 × K2, Be , Ae, Ce)

Cm1,n1 ← Cen1,m1

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 7 / 17 Transpose-Transpose-GEMM-Transpose (TTGT)

Conceptual Idea 1 “Flatten” the tensors to matrices 2 Use GEMM for contraction 3 “Unflatten” output matrix to tensor

Cm1,n1 ← Ak1,m1,k2 Bk2,n1,k1

Aem1,(k1,k2) ← Ak1,m1,k2 Ae(k1,k2),m1 ← Ak1,m1,k2

Be(k1,k2),n1 ← Bk2,n1,k1 Be(k1,k2),n1 ← Bk2,n1,k1 T gemm(M1, N1, K1 × K2, Ae, Be, C) gemm(M1, N1, K1 × K2, Ae , Be, C)

Ae(k1,k2),m1 ← Ak1,m1,k2 Ae(k2, k1),m1 ← Ak1,m1,k2

Be(k1,k2),n1 ← Bk2,n1,k1 Be(k2, k1),n1 ← Bk2,n1,k1 T T gemm(M1, N1, K1 × K2, Be , Ae, Ce) gemm(M1, N1, K1 × K2, Be , Ae, Ce)

Cm1,n1 ← Cen1,m1 Cm1,n1 ← Cen1,m1

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 7 / 17 Transpose-Transpose-GEMM-Transpose (TTGT)

Conceptual Idea 1 “Flatten” the tensors to matrices 2 Use GEMM for contraction 3 “Unflatten” output matrix to tensor

Cm1,n1 ← Ak1,m1,k2 Bk2,n1,k1

Aem1,(k1,k2) ← Ak1,m1,k2 Ae(k1,k2),m1 ← Ak1,m1,k2

Be(k1,k2),n1 ← Bk2,n1,k1 Be(k1,k2),n1 ← Bk2,n1,k1 T gemm(M1, N1, K1 × K2, Ae, Be, C) gemm(M1, N1, K1 × K2, Ae , Be, C)

Ae(k1,k2),m1 ← Ak1,m1,k2 Ae(k2, k1),m1 ← Ak1,m1,k2

Be(k1,k2),n1 ← Bk2,n1,k1 Be(k2, k1),n1 ← Bk2,n1,k1 T T gemm(M1, N1, K1 × K2, Be , Ae, Ce) gemm(M1, N1, K1 × K2, Be , Ae, Ce)

Cm1,n1 ← Cen1,m1 Cm1,n1 ← Cen1,m1 ... and more.

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 7 / 17 Easy to implement Exploits existing BLAS libraries All TCs can be implemented via TTGT Large GEMM ⇒ good performance?

Transpositions5 account for pure overhead Additional memory required

Any of Im, In, Ik Transposed A Transposed B Interchange A and B within GEMM Advantages:

Disadvantages:

Transpose-Transpose-GEMM-Transpose (TTGT)

Search space:

5Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 8 / 17 Easy to implement Exploits existing BLAS libraries All TCs can be implemented via TTGT Large GEMM ⇒ good performance?

Transpositions5 account for pure overhead Additional memory required

Transposed A Transposed B Interchange A and B within GEMM Advantages:

Disadvantages:

Transpose-Transpose-GEMM-Transpose (TTGT)

Search space:

Any permutation of Im, In, Ik

5Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 8 / 17 Easy to implement Exploits existing BLAS libraries All TCs can be implemented via TTGT Large GEMM ⇒ good performance?

Transpositions5 account for pure overhead Additional memory required

Transposed B Interchange A and B within GEMM Advantages:

Disadvantages:

Transpose-Transpose-GEMM-Transpose (TTGT)

Search space:

Any permutation of Im, In, Ik Transposed A

5Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 8 / 17 Easy to implement Exploits existing BLAS libraries All TCs can be implemented via TTGT Large GEMM ⇒ good performance?

Transpositions5 account for pure overhead Additional memory required

Interchange A and B within GEMM Advantages:

Disadvantages:

Transpose-Transpose-GEMM-Transpose (TTGT)

Search space:

Any permutation of Im, In, Ik Transposed A Transposed B

5Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 8 / 17 Easy to implement Exploits existing BLAS libraries All TCs can be implemented via TTGT Large GEMM ⇒ good performance?

Transpositions5 account for pure overhead Additional memory required

Advantages:

Disadvantages:

Transpose-Transpose-GEMM-Transpose (TTGT)

Search space:

Any permutation of Im, In, Ik Transposed A Transposed B Interchange A and B within GEMM

5Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 8 / 17 Transpositions5 account for pure overhead Additional memory required

Easy to implement Exploits existing BLAS libraries All TCs can be implemented via TTGT Large GEMM ⇒ good performance? Disadvantages:

Transpose-Transpose-GEMM-Transpose (TTGT)

Search space:

Any permutation of Im, In, Ik Transposed A Transposed B Interchange A and B within GEMM Advantages:

5Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 8 / 17 Transpositions5 account for pure overhead Additional memory required

Exploits existing BLAS libraries All TCs can be implemented via TTGT Large GEMM ⇒ good performance? Disadvantages:

Transpose-Transpose-GEMM-Transpose (TTGT)

Search space:

Any permutation of Im, In, Ik Transposed A Transposed B Interchange A and B within GEMM Advantages: Easy to implement

5Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 8 / 17 Transpositions5 account for pure overhead Additional memory required

All TCs can be implemented via TTGT Large GEMM ⇒ good performance? Disadvantages:

Transpose-Transpose-GEMM-Transpose (TTGT)

Search space:

Any permutation of Im, In, Ik Transposed A Transposed B Interchange A and B within GEMM Advantages: Easy to implement Exploits existing BLAS libraries

5Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 8 / 17 Transpositions5 account for pure overhead Additional memory required

Large GEMM ⇒ good performance? Disadvantages:

Transpose-Transpose-GEMM-Transpose (TTGT)

Search space:

Any permutation of Im, In, Ik Transposed A Transposed B Interchange A and B within GEMM Advantages: Easy to implement Exploits existing BLAS libraries All TCs can be implemented via TTGT

5Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 8 / 17 Transpositions5 account for pure overhead Additional memory required

Disadvantages:

Transpose-Transpose-GEMM-Transpose (TTGT)

Search space:

Any permutation of Im, In, Ik Transposed A Transposed B Interchange A and B within GEMM Advantages: Easy to implement Exploits existing BLAS libraries All TCs can be implemented via TTGT Large GEMM ⇒ good performance?

5Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 8 / 17 Transpositions5 account for pure overhead Additional memory required

Transpose-Transpose-GEMM-Transpose (TTGT)

Search space:

Any permutation of Im, In, Ik Transposed A Transposed B Interchange A and B within GEMM Advantages: Easy to implement Exploits existing BLAS libraries All TCs can be implemented via TTGT Large GEMM ⇒ good performance? Disadvantages:

5Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 8 / 17 Additional memory required

Transpose-Transpose-GEMM-Transpose (TTGT)

Search space:

Any permutation of Im, In, Ik Transposed A Transposed B Interchange A and B within GEMM Advantages: Easy to implement Exploits existing BLAS libraries All TCs can be implemented via TTGT Large GEMM ⇒ good performance? Disadvantages: Transpositions5 account for pure overhead

5Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 8 / 17 Transpose-Transpose-GEMM-Transpose (TTGT)

Search space:

Any permutation of Im, In, Ik Transposed A Transposed B Interchange A and B within GEMM Advantages: Easy to implement Exploits existing BLAS libraries All TCs can be implemented via TTGT Large GEMM ⇒ good performance? Disadvantages: Transpositions5 account for pure overhead Additional memory required

5Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 8 / 17 GEMM-like Tensor-Tensor Multiplication (GETT)

Key Idea

Eliminate explicit transpositions Pack-and-transpose while moving data into the caches5 ⇒ Complexity offloaded into packing routines

5Sharan Chetlur et al. “CUDNN: Efficient Primitives for Deep Learning” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 9 / 17 GEMM-like Tensor-Tensor Multiplication (GETT)

Key Idea

Eliminate explicit transpositions Pack-and-transpose while moving data into the caches5 ⇒ Complexity offloaded into packing routines

1 //N-Loop 2 for n = 1 : nc : SIn 3 //K-Loop(contracted) 4 for k = 1 : kc : SIk 5 Bb = identify_subtensor(B, n, k) 6 // pack Bb into Be (L3 cache) 7 Be = packB (Bb) 8 //M-Loop 9 for m = 1 : mc : SIm 10 Ab = identify_subtensor(A, m, k) 11 // pack Ab into Ae (L2 cache) 12 Ae = packA (Ab) 13 Cb = identify_subtensor(C, m, n) 14 // compute matrix-matrix product of AeBe 15 macroKernel(Ae, Be, Cb, α, β)

5Sharan Chetlur et al. “CUDNN: Efficient Primitives for Deep Learning” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 9 / 17 GEMM-like Tensor-Tensor Multiplication (GETT)

Key Idea

Eliminate explicit transpositions Pack-and-transpose while moving data into the caches5 ⇒ Complexity offloaded into packing routines

1 //N-Loop 2 for n = 1 : nc : SIn 3 //K-Loop(contracted) 4 for k = 1 : kc : S Monday, February 27Ik 10:15 - 10:35 5 Bb = identify_subtensor(B, n, k) 6 // pack Bb into Be (L3 cache) 7 TCCG:Be Tensor= packB ( ContractionBb) Code Generator 8 //M-Loop 9 for m = 1 : mc : SIm 10 Ab = identify_subtensor(A, m, k) 11 // pack Ab into Ae (L2 cache) 12 Ae = packA (Ab) 13 Cb = identify_subtensor(C, m, n) 14 // compute matrix-matrix product of AeBe 15 macroKernel(Ae, Be, Cb, α, β)

5Sharan Chetlur et al. “CUDNN: Efficient Primitives for Deep Learning” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 9 / 17 Same arithmetic intensity as GEMM No memory overhead

Complex to implement

Blocking parameters: mc, nc, kc Subtensors Ab, Bb, Cb Advantages:

Disadvantages:

GEMM-like Tensor-Tensor Multiplication (GETT)

Search space:

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 10 / 17 Same arithmetic intensity as GEMM No memory overhead

Complex to implement

Subtensors Ab, Bb, Cb Advantages:

Disadvantages:

GEMM-like Tensor-Tensor Multiplication (GETT)

Search space: Blocking parameters: mc, nc, kc

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 10 / 17 Same arithmetic intensity as GEMM No memory overhead

Complex to implement

Advantages:

Disadvantages:

GEMM-like Tensor-Tensor Multiplication (GETT)

Search space: Blocking parameters: mc, nc, kc Subtensors Ab, Bb, Cb

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 10 / 17 Complex to implement

Same arithmetic intensity as GEMM No memory overhead Disadvantages:

GEMM-like Tensor-Tensor Multiplication (GETT)

Search space: Blocking parameters: mc, nc, kc Subtensors Ab, Bb, Cb Advantages:

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 10 / 17 Complex to implement

No memory overhead Disadvantages:

GEMM-like Tensor-Tensor Multiplication (GETT)

Search space: Blocking parameters: mc, nc, kc Subtensors Ab, Bb, Cb Advantages: Same arithmetic intensity as GEMM

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 10 / 17 Complex to implement

Disadvantages:

GEMM-like Tensor-Tensor Multiplication (GETT)

Search space: Blocking parameters: mc, nc, kc Subtensors Ab, Bb, Cb Advantages: Same arithmetic intensity as GEMM No memory overhead

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 10 / 17 Complex to implement

GEMM-like Tensor-Tensor Multiplication (GETT)

Search space: Blocking parameters: mc, nc, kc Subtensors Ab, Bb, Cb Advantages: Same arithmetic intensity as GEMM No memory overhead Disadvantages:

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 10 / 17 GEMM-like Tensor-Tensor Multiplication (GETT)

Search space: Blocking parameters: mc, nc, kc Subtensors Ab, Bb, Cb Advantages: Same arithmetic intensity as GEMM No memory overhead Disadvantages: Complex to implement

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 10 / 17 Code Generator (TCCG)

Input: Mathematical description of TC e.g., C[a,b,i,j] = A[i,k,a] * B[k,j,b]; Output: High-Performance C++ code

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 11 / 17 Tensor Contraction Code Generator (TCCG)

Input: Mathematical description of TC e.g., C[a,b,i,j] = A[i,k,a] * B[k,j,b]; Output: High-Performance C++ code

Solution Yes TC, sizes Merge indices already contraction.hpp known?

Generate candidates No Store fastest candidate LoG TTGT GETT

Yes Time candidates

Apply cost okay More can- No perf. Add candidate to list Compile candidates didates? model cost too high Figure: Schematic overview of TCCG.

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 11 / 17 Not all TCs can be implemented via LoG Mixed performance

Performance — Haswell (single core)

80 LoG 70 60 50 40

GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 12 / 17 Mixed performance

Performance — Haswell (single core)

80 LoG 70 60 50 40

GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

Not all TCs can be implemented via LoG

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 12 / 17 Performance — Haswell (single core)

80 LoG 70 60 50 40

GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

Not all TCs can be implemented via LoG Mixed performance

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 12 / 17 TTGT: good for compute-bound TCs TTGT: bad for bandwidth-bound TCs

Performance — Haswell (single core)

80 LoG TTGT 70 60 50 40

GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 13 / 17 TTGT: bad for bandwidth-bound TCs

Performance — Haswell (single core)

80 LoG TTGT 70 60 50 40

GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

TTGT: good for compute-bound TCs

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 13 / 17 Performance — Haswell (single core)

80 LoG TTGT 70 60 50 40

GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

TTGT: good for compute-bound TCs TTGT: bad for bandwidth-bound TCs

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 13 / 17 GETT: excels for bandwidth-bound TCs GETT: good for compute-bound TCs

Performance — Haswell (single core)

80 LoG TTGT GETT 70 60 50 40

GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 14 / 17 GETT: good for compute-bound TCs

Performance — Haswell (single core)

80 LoG TTGT GETT 70 60 50 40

GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

GETT: excels for bandwidth-bound TCs

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 14 / 17 Performance — Haswell (single core)

80 LoG TTGT GETT 70 60 50 40

GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

GETT: excels for bandwidth-bound TCs GETT: good for compute-bound TCs

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 14 / 17 9000 LoG TTGT 8000 7000 6000 5000 4000 GFLOPS 3000 2000 1000 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - dfce fbec geac fdec ac - - - - - bda dca acd adc - - - - dbea deca ebad cad acd ab - - - - - adec efbad efcad ecbfa - - - dfgb aebf - aebf eafd abc abc ab ab abc abc - - - - abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef (b) NVIDIA Tesla P100

Performance — Multi-threaded

1800 LoG TTGT 1600 1400 1200 1000 800 GFLOPS 600 400 200 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - dfce fbec geac fdec ac - - - - - bda dca acd adc - - - - dbea deca ebad cad acd ab - - - - - adec efbad efcad ecbfa - - - dfgb aebf - aebf eafd abc abc ab ab abc abc - - - - abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef (a)2 ×Intel Xeon E5-2680 v3

Performance gap increases for bandwidth-bound TCs

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 15 / 17 Performance — Multi-threaded

9000 LoG TTGT 9000 LoG TTGT 8000 8000 7000 7000 6000 6000 5000 5000 4000 4000 GFLOPS 3000 GFLOPS 3000 2000 2000 1000 1000 0 0 cb cb be be ec ce ec ce bf bf cf cf bd db bd bd db bd dc dc fd fd ------dcb dbc dcb dbc ebd ebd ------dfce dfce fbec fbec geac geac fdec fdec ac ac ------bda dca acd adc bda dca acd adc ------dbea deca ebad dbea deca ebad cad acd cad acd ab ab ------adec adec efbad efcad efbad efcad ecbfa ecbfa ------dfgb dfgb aebf aebf - - aebf eafd aebf eafd abc abc ab ab abc abc abc abc ab ab abc abc ------abcd abcd abcd abc abcd abcd abcd abc abcd abcd abcde abcde abcde abcde abcde abcd abcd abcde abcd abcd abcdef abcdef (a)2 ×Intel Xeon E5-2680 v3 (b) NVIDIA Tesla P100

Performance gap increases for bandwidth-bound TCs

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 15 / 17 For different settings: opA, opB, interchanged A and B

Account for varying GEMM perf6

Performance for equally-sized GEMMs varies greatly

Performance Model for TTGT and LoG:

GEMM Performance — Multi-threaded

1800 9000 1600 8000 1400 7000 6000 1200 5000 1000 4000

GFLOPS 800 GFLOPS 3000 600 2000 400 1000 200 0 cb cb be be ec ce ec ce bf bf cf cf bd db bd bd db bd dc dc fd fd ------dcb dbc dcb dbc ebd ebd ------dfce dfce fbec fbec geac geac fdec fdec ac ac ------bda dca acd adc bda dca acd adc ------dbea deca ebad dbea deca ebad cad acd cad acd ab ab ------adec adec efbad efcad efbad efcad ecbfa ecbfa ------dfgb dfgb aebf aebf - - aebf eafd aebf eafd abc abc ab ab abc abc abc abc ab ab abc abc ------abcd abcd abcd abc abcd abcd abcd abc abcd abcd abcde abcde abcde abcde abcde abcd abcd abcde abcd abcd abcdef abcdef (a)2 ×Intel Xeon E5-2680 v3 (b) NVIDIA Tesla P100

6Elmar Peise et al. “On the Performance Prediction of BLAS-based Tensor Contractions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 16 / 17 Account for varying GEMM perf6

Performance Model for TTGT and LoG:

GEMM Performance — Multi-threaded

1800 9000 1600 8000 1400 7000 6000 1200 5000 1000 4000

GFLOPS 800 GFLOPS 3000 600 2000 400 1000 200 0 cb cb be be ec ce ec ce bf bf cf cf bd db bd bd db bd dc dc fd fd ------dcb dbc dcb dbc ebd ebd ------dfce dfce fbec fbec geac geac fdec fdec ac ac ------bda dca acd adc bda dca acd adc ------dbea deca ebad dbea deca ebad cad acd cad acd ab ab ------adec adec efbad efcad efbad efcad ecbfa ecbfa ------dfgb dfgb aebf aebf - - aebf eafd aebf eafd abc abc ab ab abc abc abc abc ab ab abc abc ------abcd abcd abcd abc abcd abcd abcd abc abcd abcd abcde abcde abcde abcde abcde abcd abcd abcde abcd abcd abcdef abcdef (a)2 ×Intel Xeon E5-2680 v3 (b) NVIDIA Tesla P100

Performance for equally-sized GEMMs varies greatly For different settings: opA, opB, interchanged A and B

6Elmar Peise et al. “On the Performance Prediction of BLAS-based Tensor Contractions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 16 / 17 GEMM Performance — Multi-threaded

1800 9000 1600 8000 1400 7000 6000 1200 5000 1000 4000

GFLOPS 800 GFLOPS 3000 600 2000 400 1000 200 0 cb cb be be ec ce ec ce bf bf cf cf bd db bd bd db bd dc dc fd fd ------dcb dbc dcb dbc ebd ebd ------dfce dfce fbec fbec geac geac fdec fdec ac ac ------bda dca acd adc bda dca acd adc ------dbea deca ebad dbea deca ebad cad acd cad acd ab ab ------adec adec efbad efcad efbad efcad ecbfa ecbfa ------dfgb dfgb aebf aebf - - aebf eafd aebf eafd abc abc ab ab abc abc abc abc ab ab abc abc ------abcd abcd abcd abc abcd abcd abcd abc abcd abcd abcde abcde abcde abcde abcde abcd abcd abcde abcd abcd abcdef abcdef (a)2 ×Intel Xeon E5-2680 v3 (b) NVIDIA Tesla P100

Performance for equally-sized GEMMs varies greatly For different settings: opA, opB, interchanged A and B Performance Model for TTGT and LoG: Account for varying GEMM perf6

6Elmar Peise et al. “On the Performance Prediction of BLAS-based Tensor Contractions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 16 / 17 Conclusion and Future Work

Conclusion A survey of different approaches to TCs has been presented GETT exhibits high performance across a wide range of TCs TCCG is available at https://github.com/HPAC/tccg

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 17 / 17 Conclusion and Future Work

Conclusion A survey of different approaches to TCs has been presented GETT exhibits high performance across a wide range of TCs TCCG is available at https://github.com/HPAC/tccg

Future Work Implement TC library based on GETT Parallelize GETT

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 17 / 17 Conclusion and Future Work

Conclusion A survey of different approaches to TCs has been presented GETT exhibits high performance across a wide range of TCs TCCG is available at https://github.com/HPAC/tccg

Future Work Implement TC library based on GETT Parallelize GETT

Thank you for your attention.

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 17 / 17 Performance

Systems: Intel Xeon E5-2680 v3 CPU (Haswell) NVIDIA Tesla P100 GPU (Pascal) Compilers: icpc 16.0.1 20151021 nvcc v8.0.44 Benchmark Collection of 48 TCs Compiled from four publications Each TC is at least 200 MiB Correctness checked against naive loop-based implementation

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 1 / 10 Performance - DP

40 TTGT CTF TT 35 30 25 20

GFLOPS 15 10 5 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

TTGT faster than CTF everywhere. TTGT good in compute-bound regime TTGT bad in bandwidth-bound regime

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 2 / 10 Performance: m1n1m2-m1k1m2-n1k1

80 GETT LoG TTGT GEMM 70 60 50 40

GFLOPS 30 20 10 0 8 16 32 64 128 256 512 1024 SIk GETT especially good in bandwidth-bound regime GETT still attains up to 91.3% of peak floating-point performance TTGT poor in bandwidth-bound regime

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 3 / 10 Performance: m1n1m2n2-m1k1m2-n1k1n2

80 GETT LoG TTGT GEMM 70 60 50 40

GFLOPS 30 20 10 0 8 16 32 64 128 256 512 1024 SIk GETT especially good in bandwidth-bound regime GETT still attains up to 91.3% of peak floating-point performance TTGT poor in bandwidth-bound regime LoG performance can become arbitrarily bad GETT and TTGT barely affected by higher

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 3 / 10 Speedup

4.0 11.7 12.4 4.1 9.0 6.9 4.2 4.5 3.7

3.5 3.4 3.3 3.1

3.0 2.9 2.8

2.5 2.3 2.0 1.7 1.7 1.7 1.7

Speedup 1.5 1.4 1.2 1.1 1.1 1.1 1.0 1.0 0.5 0.0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

(a) Single-Precision. (b) Double-Precision.

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 4 / 10 GETT

Cb = A × e Be 6 macro- pack pack update 5 3 6 nc kc nc

mc Bb kc 4 4 2 m1 Cb mc m1 Ab k1

4 m2 2 k1 1 n1 4 m2 1 n1

Cm1,n1,m2 = Am1,m2,k1 × Bk1,n1

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 5 / 10 Written in AVX2 intrinsics

GETT: Macro- /Micro-Kernel

nr

nr

mr += mr × update C eme1,ne1 micro-kernel

nc

kc

mr kc += mc ×

C(m , n , m , n ) Ae nr Be b e1 e1 e2 e2 m1,ke1,m2 n1,ke1,n2 macro-kernel e e e e

Blocking for L3, L2, L1 cache as well as registers

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 6 / 10 GETT: Macro- /Micro-Kernel

nr

nr

mr += mr × update C eme1,ne1 micro-kernel

nc

kc

mr kc += mc ×

C(m , n , m , n ) Ae nr Be b e1 e1 e2 e2 m1,ke1,m2 n1,ke1,n2 macro-kernel e e e e

Blocking for L3, L2, L1 cache as well as registers Written in AVX2 intrinsics

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 6 / 10 ⇒ Efficient packing routines

Preserve stride-1 index

Packing via Tensor Transpositions

k

? m m1 m k 2 A bm1,k,m2 Ae(m1,m2),k

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 7 / 10 Packing via Tensor Transpositions

k

? m m1 m k 2 A TTC bm1,k,m2 7 Ae ≡ (m1,m2),k

m1 k m2

Aem1,m2,k

Preserve stride-1 index ⇒ Efficient packing routines

7Paul Springer et al. “TTC: A high-performance Compiler for Tensor Transpositions” Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 7 / 10 GETT: Summary

Blocking for caches Blocking for registers Explicitly vectorized Use TTC to generate high-performance packing routines Exploits full cache line (avoids non-stride-one memory accesses) Explore large search-space: Different GEMM-variants (e.g., panel-matrix, matrix-panel) Different Different values for mc, nc and kc Prune the search space via a performance model

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 8 / 10 TTGT good in compute-bound regime TTGT bad in bandwidth-bound regime TTGT faster than CTF everywhere.

Performance

80 TTGT CTF 70 60 50 40

GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 9 / 10 Performance

80 TTGT CTF 70 60 50 40

GFLOPS 30 20 10 0 cb be ec ce bf cf bd db bd dc fd ------dcb dbc ebd - - - - ebcd dfce fbec abed geac aecd gfbc fdec gfab gfac ac ------bda dca acd adc - - - - dbea deca ebad ea eb ec cad acd ab ------adec efbad efcad ecbfa - - - dfgb aebf - dega degb degc aebf eafd abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcde abcde abcde abcd abcd abcdef abcdef abcdef abcdef

TTGT good in compute-bound regime TTGT bad in bandwidth-bound regime TTGT faster than CTF everywhere.

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 9 / 10 GETT Performance Model

1.0 1.0

0.8 0.8

0.6 0.6

0.4 0.4 Efficiency 1 Efficiency 1 4 4 0.2 8 0.2 8 16 16 32 32 0.0 0.0 cb cb be be ec ce ec ce bf bf cf cf bd db bd bd db bd dc dc fd fd ------dcb dbc dcb dbc ebd ebd ------ebcd ebcd dfce dfce fbec fbec abed abed geac geac aecd aecd gfbc gfbc fdec fdec gfab gfab gfac gfac ac ac ------bda dca acd adc bda dca acd adc ------dbea deca ebad ea eb ec dbea deca ebad ea eb ec cad acd cad acd ab ab ------adec adec efbad efcad efbad efcad ecbfa ecbfa ------dfgb dfgb aebf aebf - - dega degb degc dega degb degc aebf eafd aebf eafd abc abc ab ab abc abc abc abc ab ab abc abc ------abcd abcd abcd abcd abcd abcd abc abcd abcd abcd abcd abcd abcd abc abcd abcd abcde abcde abcde abcde abcde abcd abcd abcde abcd abcd abcdef abcdef abcdef abcdef abcdef abcdef abcdef abcdef

(a) Single-Precision. (b) Double-Precision.

Figure: Limit the GETT candidates to 1, 4, 8, 16 or 32, respectively.

Average performance without search: 90.7% / 92.3% Average performance of the four best candidates: 98.3% / 97.2%

Paul Springer (AICES) High-Performance Tensor Contractions Feb. 24th 2017 10 / 10