
Tensor Contractions with Extended BLAS Kernels on CPU and GPU

Yang Shi∗, U. N. Niranjan†, Animashree Anandkumar∗
∗EECS Department, †ICS Department
University of California, Irvine
Irvine, USA
Email: {shiy4,un.niranjan,a.anandkumar}@uci.edu

Cris Cecka
NVIDIA Research
Santa Clara, USA
Email: [email protected]

Abstract—Tensor contractions constitute a key computational ingredient of numerical multi-linear algebra. However, as the order and dimension of tensors grow, the time and space complexities of tensor-based computations grow quickly. In this paper, we propose and evaluate new BLAS-like primitives that are capable of performing a wide range of tensor contractions on CPU and GPU efficiently. We begin by focusing on single-index contractions involving all the possible configurations of second-order and third-order tensors. Then, we discuss extensions to more general cases.

Existing approaches for tensor contractions spend large amounts of time restructuring the data, which typically involves explicit copy and transpose operations. In this work, we summarize existing approaches and present library-based approaches that avoid memory movement. Through systematic benchmarking, we demonstrate that our approach can achieve 10x speedup on a K40c GPU and 2x speedup on dual-socket Haswell-EP CPUs, using CUBLAS and MKL respectively, for small and moderate tensor sizes. This is relevant in many machine learning applications such as deep learning, where tensor sizes tend to be small but numerous tensor contraction operations must be performed successively. Concretely, we implement a Tucker decomposition and show that using our kernels yields at least an order of magnitude speedup compared to state-of-the-art libraries.

Keywords: Parallelism; BLAS; GPU; Tensor

I. INTRODUCTION AND SCOPE

Multilinear algebraic computations are ubiquitous in multiple scientific domains such as machine learning and modern data science [4], quantum chemistry and physics [14], signal and image processing [9], chemometrics [7], and biochemistry [13]. The study of tensor computations has a long and diverse history, as early as in the work by Hitchcock [11]. The domains and references provided herein are by no means exhaustive but merely a small representative sample of the various flavors in which tensor computations are used in science. Tensors are multi-way arrays which can be viewed as a generalization of matrices to allow multi-modality in data. Tensor contractions play a central role in a variety of algorithms and applications; for a motivating example, see Section II-C. However, non-trivial performance bottlenecks are encountered in several application areas due to the high space and time complexities associated with tensor computations. In this paper, motivated by the recent increased interest from machine learning and deep learning, we propose and study library-based communication-avoiding approaches for performing tensor contractions.

Conventional approaches for computing general tensor contractions rely on matricization, the logical or explicit restructuring of the data so that the computation can be performed with a sequence of Basic Linear Algebra Subroutine (BLAS) library calls. The BLAS routines provide efficient and portable implementations of linear algebra primitives, with many fast implementations existing across many architectures [8].

To this point, the GEneral Matrix Multiply (GEMM) primitive specified within the BLAS library is possibly the most optimized and widely used routine in scientific computing. Noting that the basic theoretical computational and communication complexities of most tensor contractions are equivalent to those of GEMM, these computations should scale equally well. However, we find that existing tensor libraries such as the TENSOR TOOLBOX and the CYCLOPS TENSOR FRAMEWORK perform explicit data transposition to compute almost all tensor contractions, and the cost of data restructuring often dominates the cost of the actual computation. Other approaches have previously proposed intrusive compiler and static analysis solutions, whereas we provide a much simpler library-based solution [16], [17].

Findings and contributions: We introduce a new BLAS primitive, known as STRIDEDBATCHEDGEMM, that allows the majority of tensor contractions to be computed without any explicit memory motion. We detail the so-called exceptional cases that cannot be evaluated with STRIDEDBATCHEDGEMM and demonstrate that an efficient solution exists with another small extension to the primitive. We demonstrate performance improvement using our approach on both CPU and GPU in direct benchmarks in addition to an application study. The Tucker decomposition is an important tensor application in machine learning wherein the advantage of our strategy compared to existing libraries is clear. Finally, the value of this approach and its applications are being recognized by NVIDIA. As of this writing, the proposed interface exists in the CUBLAS 8.0 Release Candidate and is likely to appear in the official release later this summer.

II. BACKGROUND

A. Related Work

Peise et al [22] extended results from Napoli et al [19] in mapping tensor contractions to sequences of BLAS routines and modeling the performance of these mappings. In that work, they systematically enumerate and benchmark combinations of possible BLAS kernels one could use to compute a given tensor contraction, concluding that the best performing algorithms involve the GEMM kernel. Some evaluation strategies are not considered, such as flattening or developing new, generic linear algebraic subroutines that could yield improved performance.

Li et al [16] also recognize the cost of explicit copies and propose evaluation strategies directly comparable to the flattening and batching strategies addressed in this paper. Their discussion of loop modes and component modes maps to our discussion of batch modes and GEMM modes. However, Li et al do not discuss strategies beyond tensor-times-matrix multiply. Furthermore, they only consider mode-n tensor-times-matrix contractions of the form Y_{i1···i(n-1) j i(n+1)···iN} = Σ_{in} X_{i1···iN} U_{j in}, which avoids the more complicated cases in this paper.

Abdelfattah et al [3] present a framework using batched GEMM for tensor contractions on GPUs. However, they focus on optimizing only a limited number of tensor contraction kernels on extremely small tensors. Other works [1], [20] improve tensor computation performance through loop reorganization and fusion.

The STRIDEDBATCHEDGEMM interface proposed in this paper has previously been mentioned by Jhurani et al [12] as a low-overhead interface for multiple small matrices on NVIDIA GPUs. Jhurani proposes the same interface for CUBLAS that we propose in this paper and focuses on implementation concerns. In this work, we treat STRIDEDBATCHEDGEMM as an available primitive, benchmark evaluation strategies that utilize it, and examine how it may be further extended for use in multi-linear algebra.

The BLAS-like Library Instantiation Software (BLIS) framework [26] offers GEMMs which support non-unit strides in both the row and column dimensions, which are attractive solutions to some of the problems in this paper. However, performance is expected to suffer due to decreases in cache line utilization and SIMD opportunities.

Recent improvements in parallel and distributed computing systems have made complex tensor computation feasible. TensorFlow [2] can handle multi-linear algebra operations, but it is primarily a data-flow and task-scheduling framework for machine learning.

B. Notation

We denote tensors by uppercase letters, indices by lowercase letters, and index lists by calligraphic letters. We assume all indexing is zero-based. R denotes the set of real numbers.

The order of a tensor is the number of modes it admits. A scalar is a zeroth-order tensor, a vector is a first-order tensor, a matrix (say Amn) is a second-order tensor with the rows (indexed by m) being the first mode and the columns (indexed by n) being the second mode, and a three-way array (say Amnp) is a third-order tensor with the first, second, and third modes indexed by m, n, and p, respectively. Note that we use the term index to name a mode and iterate through the elements in that mode.

The dimension of the ith mode, denoted dim<i>, is the number of elements it contains. The dimension of a mode of a tensor is denoted by the bold lowercase letter of the respective index; for example, the third-order tensor Amnp has dimension dim<0> × dim<1> × dim<2> or m × n × p, where the first mode (indexed by m) takes values 0, ..., m-1, the second mode (indexed by n) takes values 0, ..., n-1, and the third mode (indexed by p) takes values 0, ..., p-1.

We follow the Einstein summation convention to represent tensor contractions. A general tensor contraction is written as

C_C = α A_A B_B + β C_C    (1)

where A, B, C are ordered sequences of indices such that C ≡ (A ∪ B) \ (A ∩ B). The indices in A ∩ B are called contracted indices. The indices in C are called free indices.

C. An Important Practical Application

In unsupervised learning, tensor decomposition [4] is gaining a lot of attention and is the crux of model estimation via the method of moments. A variety of problems such as topic model estimation, Gaussian mixture model estimation, and social network learning can be provably, consistently, and efficiently solved via tensor decomposition techniques under certain mild assumptions.

The basic building blocks of these algorithms involve tensor contractions. Two frequently used tensor decomposition methods are the CP decomposition [10] and the Tucker decomposition [25]. In [27], the authors use the Tucker decomposition to extract new representations of face images despite different expressions or camera viewpoints. To illustrate the fundamental importance of tensor contractions, we pick one of the most common tensor decomposition algorithms, namely the higher-order orthogonal iteration (HOOI) [15] for asymmetric Tucker decomposition, and use it as a case study. In the Tucker decomposition, the factorization of a third-order tensor T ∈ R^{m×n×p} is given by Tmnp = Gijk Ami Bnj Cpk, where G ∈ R^{i×j×k} is the core tensor, A ∈ R^{m×i}, B ∈ R^{n×j}, and C ∈ R^{p×k}. From Kolda et al [24], we summarize the algorithm for the third-order tensor case in Algorithm 1. Following their notation, T(r) denotes the mode-r unfolding of tensor T. For further technical details, we refer the reader to Kolda et al [24].

Algorithm 1 Tucker decomposition algorithm.
Require: Tensor T ∈ R^{m×n×p}, core tensor size i, j, k, number of iterations T.
Ensure: Factors A^T, B^T, C^T and core tensor G.
1: Set t = 0
2: Initialize A^0 ← i leading left singular vectors of T(1);
   B^0 ← j leading left singular vectors of T(2);
   C^0 ← k leading left singular vectors of T(3)
3: while t < T do
4:   Ymjk = Tmnp B^t_nj C^t_pk
5:   A^{t+1} ← i leading left singular vectors of Y(1)
6:   Yink = Tmnp A^{t+1}_mi C^t_pk
7:   B^{t+1} ← j leading left singular vectors of Y(2)
8:   Yijp = Tmnp B^{t+1}_nj A^{t+1}_mi
9:   C^{t+1} ← k leading left singular vectors of Y(3)
10: end while
11: Gijk = Tmnp A^T_mi B^T_nj C^T_pk
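To pin down the semantics that the BLAS mappings developed below must reproduce, the following is a minimal reference sketch (our own illustration, not the paper's optimized path) of the first of the two single-mode contractions in line 4 of Algorithm 1, Wmjp = Tmnp B^t_nj, with every array packed column-major as described in Section III-B:

#include <cstddef>

// W(m,j,p) = sum_n T(m,n,p) * B(n,j); all arrays packed column-major,
// so T(m,n,p) lives at T[m + n*M + p*M*N], and similarly for B and W.
void ttm_mode2(const float* T, const float* B, float* W,
               std::size_t M, std::size_t N, std::size_t P, std::size_t J) {
  for (std::size_t p = 0; p < P; ++p)
    for (std::size_t j = 0; j < J; ++j)
      for (std::size_t m = 0; m < M; ++m) {
        float acc = 0.0f;
        for (std::size_t n = 0; n < N; ++n)
          acc += T[m + n*M + p*M*N] * B[n + j*N];
        W[m + j*M + p*M*J] = acc;
      }
}

Contracting the remaining p mode of W against C^t_pk then yields Ymjk. Section III shows how both steps become GEMM-shaped library calls instead of raw loops.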

D. Conventional Tensor Contraction

The conventional approach for tensor contraction is to matricize the tensors via transpositions and copies. Libraries such as the Basic Tensor Algebra Subroutines (BTAS) [18], the MATLAB Tensor Toolbox [6], [5], and the Cyclops Tensor Framework [23] all perform some version of matricization, which is typically performed in four steps:

1) Consider a general tensor contraction of the form (1). Define the index sets K, I, J as
   K = A ∩ B,  I = A \ (A ∩ B),  J = B \ (A ∩ B)
2) Permute tensors A, B, and C into the form
   C_IJ = α A_IK B_KJ + β C_IJ    (2)
3) Evaluate (2) using one of four BLAS kernels:
   DOT   if |K| = |A| and |K| = |B|
   GER   if |K| = 0
   GEMV  if |K| = |A| xor |K| = |B|
   GEMM  else
4) Permute the result, C_IJ, into the desired output, C_C.

This approach to tensor contractions is completely general – it works for any two tensors of arbitrary order and any number of contraction indices. However, for even the simplest contractions, the cost of explicitly permuting the tensor data typically outweighs the cost of the computation to be performed. See Section III-A for examples.

III. APPROACH

In this section, we present library-based evaluation strategies for performing general tensor contractions in-place – without explicit copies and/or transpositions.

A. Motivating Observations

Case study 1: Consider Cmnp = AmkBnkp. The conventional approach presented in Section II-D results in an evaluation wherein one switches, by means of explicit copy operations, modes n and k in B to produce Cmnp = AmkBknp, which is now of the form (2) and can be evaluated directly with a GEMM. Alternatively, we observe that we may perform the computation without explicit copy by launching p individual GEMMs.

Case study 2: Consider Cmnp = AkmBpkn. The conventional approach presented in Section II-D may require more than one transposition. For concreteness, we analyzed how BTAS performs this contraction. We observed that BTAS uses four explicit transpositions, resulting in the following algorithm:

1) Permute Akm to Amk.
2) Permute Bpkn to Bkpn.
3) Permute Cmnp to Cmpn.
4) Compute Cmpn = αAmkBkpn + βCmpn with GEMM.
5) Permute Cmpn to Cmnp.

Similarly, in the MATLAB Tensor Toolbox, the main idea is to reshape all tensors to matrices. For instance, in Case 2.4 in Table II, it reshapes Akm to Amk and reshapes tensor Bpkn to matrix Bk(pn) with the first dimension k and the second dimension p ∗ n. Cyclops also uses index reordering methods for fully dense tensors. The reordering is avoided only in the more restrictive case of high-dimensional symmetric tensors.

We note that some of the steps in the above approach can certainly be avoided with an improved algorithm that still implements the conventional approach. For example, Step 1 can be avoided by using a GEMM that implicitly transposes the first matrix via a CblasTrans parameter or equivalent in Step 4. Another optimization would be to avoid Step 3 altogether when β = 0. Other approaches require even fewer transposition steps. Ultimately, observe that we may perform the computation without explicit copy by performing p individual GEMMs; a sketch for Case study 1 follows.
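For Case study 1, the copy-free evaluation is just a loop of ordinary GEMMs over slices of B and C. A minimal sketch (ours, assuming a standard CBLAS such as MKL's, and packed column-major storage):

#include <cstddef>
#include <mkl.h>  // or any CBLAS header providing cblas_sgemm

// C_mnp = A_mk * B_nkp evaluated as p GEMMs: C[:,:,pp] = A * B[:,:,pp]^T.
// No copies or transposes are materialized; only pointers and strides move.
void case_study_1(const float* A, const float* B, float* C,
                  int m, int n, int k, int p) {
  for (int pp = 0; pp < p; ++pp)
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                m, n, k, 1.0f,
                A, m,                            // A is m-by-k, lda = m
                B + std::size_t(pp) * n * k, n,  // B[:,:,pp] is n-by-k
                0.0f,
                C + std::size_t(pp) * m * n, m); // C[:,:,pp] is m-by-n
}

Section III-D collapses exactly this loop into a single STRIDEDBATCHEDGEMM call (Case 1.3 of Table II).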
In Figure 1, we measure the cost of these explicit transpose operations in a representative tensor contraction on CPU and GPU. On the CPU we use MKL's mkl_somatcopy and cblas_sgemm, and on the GPU we use CUBLAS's cublasSgeam and cublasSgemm to perform each required matrix transposition and GEMM, respectively. Note that transposition primitives are not specified in BLAS; they are vendor-specific BLAS-like extensions provided to perform common transpose operations. For this reason, these optimized functions are not typically used in tensor libraries; instead, custom transposition implementations take their place. These custom implementations are likely not as optimized as the vendor implementations.

Figure 1: The fraction of time spent in copies/transpositions when computing the contraction Cmnp = AmkBpkn using the conventional approach. Lines are shown with 1, 2, 3, and 6 total transpositions performed on either the input or output. (Left) CPU. (Right) GPU.

As we can see from Figure 1, on the CPU almost 40% of the time is spent in copy and transpose operations, even when only a single mode transposition is performed. Clearly, with more transpose operations, the fraction is higher, reaching 60-80% of the total time. This correlates well with data presented in [16], where it is reported that the Tensor Toolbox spends approximately 70% of the total time performing copies and transpositions in one algorithm. By avoiding these transpositions, we may obtain 10x speedup on the GPU for small tensors with n ≲ 100, and more than 2x speedup on the CPU for almost all n.

Although the fraction of time spent in transposition asymptotically approaches zero as n grows in both cases, the high bandwidth of the GPU allows the computation to dominate the communication much more quickly. Indeed, the reported maximum bandwidth of the K40c GPU is 288GB/sec, whereas the dual-socket Xeon E5-2630 v3 CPU achieves 118GB/sec. Additionally, the gap between computational performance and communication performance continues to increase, so the relative cost of transposition is likely to grow in the future. Even now, especially for small tensor sizes, it is clear that the cost of performing explicit copies and transpositions is significant and should be avoided.

B. Extended Notation

We would like to express evaluation strategies for tensor contractions succinctly, so we introduce additional notation. In this paper, tensors are assumed to be stored in the column-major format. In other words, the ith mode has a memory stride – termed "leading dimension" in BLAS – denoted ld<i>, with ld<0> = 1. Using this notation, Amnp is stored as A[m + n ∗ ld<1> + p ∗ ld<2>]. Note that the common packed-storage case is obtained when, for all i, ld<i> = ∏_{k<i} dim<k>.

1) Batching: Square brackets denote a batched mode. For example, Cmn[p] denotes a batch of dim<2> matrices Cmn, one for each value of the batched mode p. Batched modes are free modes.
2) Flattening: Parentheses denote two modes merged into a single mode. The combined mode h = (ij) is considered free.
3) Transpose: Amnᵀ denotes a matrix transpose. Transposes may only be applied to tensors with exactly two free modes.

The purpose of these notations is that they map directly to looped BLAS calls, and the appropriate evaluation can often be read directly from the notated expression. Next, we review some rules that the above notation must follow in order to obtain a well-formed evaluation expression.

1) A batched mode [i] cannot be the first mode of any matrix term. That is, A[m]nk is not allowed. Batching in the first mode would cause the m resulting logical n × k matrices to be strided in both rows and columns and, therefore, such a term cannot be used as a matrix in any BLAS routine.
2) A flattening (ij) requires that ld<j> = ld<i> · dim<i>. Unfortunately, the notation alone is therefore not sufficient to determine which modes may be flattened; it is contingent on the representation as well. In the common packed-storage case, however, this flattening condition always holds.
3) If a flattening operation occurs on the right side, it must occur on the left side with the same modes in the same order. For example, Cm(np) = AmkBk(pn) is not allowed.
4) Standard transposition rules apply: Cnmᵀ = AmkBkn implies Cnm = Bknᵀ Amkᵀ. However, modes may not be swapped under transposition. For example, Amkᵀ cannot be replaced with Akm.
This notation allows us to quickly read off the intended extended BLAS evaluation expression for an arbitrary tensor contraction; Table I lists representative examples, and the sketch below works one of them out in code.

Table I: Example mapping between tensor contractions with batched and flattened modes in our notation and the corresponding BLAS expression evaluation. Note that the appropriate BLAS primitive, transposition, matrix pointer, and leading dimension parameters to GEMM can be read off directly from the notation.

Cm(np) = AmkBk(np)   →  GEMM('N','N', m, np, k, 1, A, lda<1>, B, ldb<1>, 0, C, ldc<1>);
C(mn)p = B(mn)kApkᵀ  →  GEMM('N','T', mn, p, k, 1, B, ldb<2>, A, lda<1>, 0, C, ldc<2>);
Cm[n]p = AmkBk[n]p   →  for n in [0,n): GEMM('N','N', m, p, k, 1, A, lda<1>, B + n·ldb<1>, ldb<2>, 0, C + n·ldc<1>, ldc<2>);
Cmn[p] = Bk[p]mᵀAkn  →  for p in [0,p): GEMM('T','N', m, n, k, 1, B + p·ldb<1>, ldb<2>, A, lda<1>, 0, C + p·ldc<2>, ldc<1>);
C[n]p = BpkAk[n]     →  for n in [0,n): GEMV('N', p, k, 1, B, ldb<1>, A + n·lda<1>, 1, 0, C + n, ldc<1>);
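For instance, the first row of Table I needs no loop at all. A minimal sketch (ours, with the same CBLAS and packed-storage assumptions as before) reads every parameter directly off the notation:

#include <mkl.h>  // or any CBLAS header

// C_m(np) = A_mk * B_k(np): with packed storage, the n and p modes of B and C
// are contiguous and fuse into a single free mode of size n*p (one big GEMM).
void flattened_case_1_1(const float* A, const float* B, float* C,
                        int m, int n, int k, int p) {
  cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
              m, n * p, k, 1.0f,
              A, m,    // lda<1> = m
              B, k,    // B_knp viewed as k-by-(n*p), ldb<1> = k
              0.0f,
              C, m);   // C_mnp viewed as m-by-(n*p), ldc<1> = m
}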

C. BATCHEDGEMM

Of course, the evaluation strategies that rely on level-3 BLAS primitives (GEMM) rather than level-2 primitives (GEMV, GER) are much more efficient. This often results in the need for many small GEMMs to be performed, which usually does not achieve ideal performance.

The need to compute many small GEMMs has not gone unnoticed by the leading implementations of BLAS. NVIDIA supplied the capability to multiply many pairs of small matrices in CUBLAS v4.1 [CUDA Toolkit v4.1] via the function cublasXgemmBatched. Similarly, as of MKL 11.3β, cblas_Xgemm_batch is available with a similar interface and is also specifically optimized for small matrix sizes.

In Figure 2, we plot the achieved performance on CPU and GPU of these BATCHEDGEMM functions by evaluating n GEMMs of size n × n using each strategy with MKL 11.3.1 and CUBLAS 7.5. Note that both achieve much higher performance than looped GEMMs when n is small. When n is large, there is clearly room for optimization in cublasSgemmBatched.

Figure 2: The arithmetic intensity of computing n GEMMs of size n × n versus the achieved performance on a K40c GPU and 16 cores (32 threads) of a dual-socket CPU.

Both of these interfaces are based on pointers to matrix pointers, which often require allocation and/or precomputation at the point of call, as the sketch following this paragraph illustrates. This makes them awkward to use in the context of tensor contractions, where the strides between matrices are regular and the generality provided by these interfaces goes unused.
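To make that point concrete, here is a minimal sketch (our illustration; error handling elided) of what a regularly-strided batch must do today to pass through the pointer-array interface of cublasXgemmBatched:

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Even though every matrix sits at a constant stride from the previous one,
// the pointer interface requires building, allocating, and copying explicit
// per-matrix pointer arrays at the point of call. Assumes packed A (m-by-k),
// B (k-by-n), C (m-by-n) per batch entry.
void batched_via_pointer_array(cublasHandle_t h,
                               const float* A, long long loa,
                               const float* B, long long lob,
                               float* C, long long loc,
                               int m, int n, int k, int batch) {
  std::vector<const float*> hA(batch), hB(batch);
  std::vector<float*> hC(batch);
  for (int i = 0; i < batch; ++i) {
    hA[i] = A + i * loa; hB[i] = B + i * lob; hC[i] = C + i * loc;
  }
  const float **dA, **dB; float **dC;
  cudaMalloc(&dA, batch * sizeof(float*));  // extra allocations ...
  cudaMalloc(&dB, batch * sizeof(float*));
  cudaMalloc(&dC, batch * sizeof(float*));
  cudaMemcpy(dA, hA.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
  cudaMemcpy(dB, hB.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
  cudaMemcpy(dC, hC.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
  float alpha = 1.0f, beta = 0.0f;
  cublasSgemmBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                     &alpha, dA, m, dB, k, &beta, dC, m, batch);
  cudaFree(dA); cudaFree(dB); cudaFree(dC);
}

None of this setup survives in the strided interface proposed next.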
D. STRIDEDBATCHEDGEMM

Building on the BATCHEDGEMM extensions to BLAS, we propose STRIDEDBATCHEDGEMM, which offers a simplified interface for the constant-strided BATCHEDGEMM and more optimization opportunities for implementors. The interface and reference implementation of STRIDEDBATCHEDGEMM are provided in Listing 1. The lda, ldb, ldc parameters are the standard "leading dimension" parameters that appear in level-3 BLAS primitives and denote the stride between columns of the matrix. We refer to the new loa, lob, loc parameters as the "leading order" parameters; they denote the stride between matrices of the batch.

There are a number of advantages to a STRIDEDBATCHEDGEMM primitive. First, STRIDEDBATCHEDGEMM is actually more restrictive than the BATCHEDGEMM that has already appeared in MKL and CUBLAS, but we argue that a BATCHEDGEMM with a constant stride between matrices is a common enough case to consider specializing for. By providing this interface, the common case with constant strides between matrices is not forced to perform the allocations or precomputations that it currently must perform in order to use BATCHEDGEMM. Additionally, these extra restrictions provide additional knowledge of the memory layout of the computation and offer additional optimization opportunities in SIMDization, prefetching, and tiling. In other words, the "batch-loop" in STRIDEDBATCHEDGEMM now directly participates in the polyhedral computation as an affine for-loop. With the pointer interface in BATCHEDGEMM, the "batch-loop" cannot fully participate in a polyhedral model of the computation and is certainly not a candidate for vectorization or cache blocking.

In Table II, we have enumerated all 36 unique single-mode contractions between a second-order and a third-order tensor using the notation from Section III-B. All but 8 contractions can be computed with only a single call to STRIDEDBATCHEDGEMM.

Table II: List of 36 possible single-mode contraction operations between a second-order tensor and a third-order tensor and possible mappings to Level-3 BLAS routines. Note that 8 cases may be performed with GEMM, 28 cases may be performed with STRIDEDBATCHEDGEMM, and 8 cases remain exceptional.

1.1 AmkBknp: Cm(np) = AmkBk(np); Cmn[p] = AmkBkn[p]; Cm[n]p = AmkBk[n]p
1.2 AmkBkpn: Cmn[p] = AmkBk[p]n; Cm[n]p = AmkBkp[n]
1.3 AmkBnkp: Cmn[p] = AmkBnk[p]ᵀ
1.4 AmkBpkn: Cm[n]p = AmkBpk[n]ᵀ
1.5 AmkBnpk: Cm(np) = AmkB(np)kᵀ; Cmn[p] = AmkBn[p]kᵀ
1.6 AmkBpnk: Cm[n]p = AmkBp[n]kᵀ
2.1 AkmBknp: Cm(np) = AkmᵀBk(np); Cmn[p] = AkmᵀBkn[p]; Cm[n]p = AkmᵀBk[n]p
2.2 AkmBkpn: Cmn[p] = AkmᵀBk[p]n; Cm[n]p = AkmᵀBkp[n]
2.3 AkmBnkp: Cmn[p] = AkmᵀBnk[p]ᵀ
2.4 AkmBpkn: Cm[n]p = AkmᵀBpk[n]ᵀ
2.5 AkmBnpk: Cm(np) = AkmᵀB(np)kᵀ; Cmn[p] = AkmᵀBn[p]kᵀ
2.6 AkmBpnk: Cm[n]p = AkmᵀBp[n]kᵀ
3.1 AnkBkmp: Cmn[p] = Bkm[p]ᵀAnkᵀ
3.2 AnkBkpm: Cmn[p] = Bk[p]mᵀAnkᵀ
3.3 AnkBmkp: Cmn[p] = Bmk[p]Ankᵀ
3.4 AnkBpkm: TRANS(AnkBpk[m]); C[m][n]p = Bpk[m]A[n]k  (exceptional)
3.5 AnkBmpk: Cmn[p] = Bm[p]kAnkᵀ
3.6 AnkBpmk: TRANS(AnkBp[m]k); C[m][n]p = Bp[m]kA[n]k  (exceptional)
4.1 AknBkmp: Cmn[p] = Bkm[p]ᵀAkn
4.2 AknBkpm: Cmn[p] = Bk[p]mᵀAkn
4.3 AknBmkp: Cmn[p] = Bmk[p]Akn
4.4 AknBpkm: TRANS(AknBpk[m]); C[m][n]p = Bpk[m]Ak[n]  (exceptional)
4.5 AknBmpk: Cmn[p] = Bm[p]kAkn
4.6 AknBpmk: TRANS(AknBp[m]k); C[m][n]p = Bp[m]kAk[n]  (exceptional)
5.1 ApkBkmn: C(mn)p = Bk(mn)ᵀApkᵀ; Cm[n]p = Bkm[n]ᵀApkᵀ
5.2 ApkBknm: Cm[n]p = Bk[n]mᵀApkᵀ
5.3 ApkBmkn: Cm[n]p = Bmk[n]Apkᵀ
5.4 ApkBnkm: TRANS(Bnk[m]Apk); C[m]n[p] = Bnk[m]A[p]k  (exceptional)
5.5 ApkBmnk: C(mn)p = B(mn)kApkᵀ; Cm[n]p = Bm[n]kApkᵀ
5.6 ApkBnmk: TRANS(Bn[m]kApk); C[m]n[p] = Bn[m]kA[p]k  (exceptional)
6.1 AkpBkmn: C(mn)p = Bk(mn)ᵀAkp; Cm[n]p = Bkm[n]ᵀAkp
6.2 AkpBknm: Cm[n]p = Bk[n]mᵀAkp
6.3 AkpBmkn: Cm[n]p = Bmk[n]Akp
6.4 AkpBnkm: TRANS(Bnk[m]Akp); C[m]n[p] = Bnk[m]Ak[p]  (exceptional)
6.5 AkpBmnk: C(mn)p = B(mn)kAkp; Cm[n]p = Bm[n]kAkp
6.6 AkpBnmk: TRANS(Bn[m]kAkp); C[m]n[p] = Bn[m]kAk[p]  (exceptional)

E. Exceptional Cases

The eight exceptional cases in Table II – Cases 3.4, 3.6, 4.4, 4.6, 5.4, 5.6, 6.4, and 6.6 – occur when batching forces the evaluation either to be a BATCHEDGEMV or to violate the no-first-mode rule.

This can be resolved by making an extension to the operation parameters allowed for BATCHEDGEMM. Typically, the available operation parameters are "normal", "transpose", "conjugate", and "Hermitian". To account for the exceptional cases, "extended X" operations could be added that allow violations of the no-first-mode rule and consider all three modes involved in the batching simultaneously. For example, Cases 3.6 and 6.4 could then be written

Cmn[p] = B[p]mk Ankᵀ, evaluated via
sb_gemm(OP_EX_N, OP_T, m, n, k, 1, B, ldb<1>, ldb<2>, A, lda<1>, 0, 0, C, ldc<1>, ldc<2>, p);

and Cm[n]p = B[n]kmᵀ Akp, evaluated via
sb_gemm(OP_EX_T, OP_N, m, p, k, 1, B, ldb<1>, ldb<2>, A, lda<1>, 0, 0, C, ldc<2>, ldc<1>, n);

When the extended operation is passed, it is known that batching is in the first mode of the input, which always has leading dimension 1. Thus, the leading order parameter to sb_gemm contains no information. Instead, the leading dimensions of the other two modes, in row-column order of the batched matrix, are passed as the leading dimension and leading order parameters. The implementation of such a computation is expected to perform a "3D" tiling of B into cache in order to contract efficiently with the standard 2D cache tiling of A.

Listing 1: Interface and reference implementation of BLAS-like strided batched GEMM.

// C_p = alpha * opA(A_p) * opB(B_p) + beta * C_p
template <typename T>
void sb_gemm(op_type opA, op_type opB,
             int m, int n, int k,
             T alpha,
             const T* A, int lda, int loa,
             const T* B, int ldb, int lob,
             T beta,
             T* C, int ldc, int loc,
             int batch_size)
{
  // EXPOSITION ONLY
  for (int p = 0; p < batch_size; ++p)
    gemm(opA, opB,
         m, n, k,
         alpha,
         A + p*loa, lda,
         B + p*lob, ldb,
         beta,
         C + p*loc, ldc);
}

F. Generalization

In this section, we explain the generality of our approach and how it can be easily applied and extended to single-mode contractions involving tensors of arbitrary order.

Consider an arbitrary single-mode tensor contraction of the form (1). It is straightforward to see by simple counting that the number of unique contractions is [(|A| + |B| − 2)!] · |A| · |B|. We note that Table II is obtained with |A| = 2 and |B| = 3. All of these contractions may be performed without explicit mode transpositions by nesting the BATCHEDGEMM operations. For example, consider Cmn[p][q] = Amk[p]Bnk[q], wherein we can batch in either p or q. We prefer to choose the mode with the larger dimension for the BATCHEDGEMM batching loop over the other (nested batching), as in Listing 2.

Listing 2: Nested batching.

// C_mn[p][q] = A_mk[p] * B_nk[q]^T
for (int q = 0; q < Q; ++q)
  sb_gemm(OP_N, OP_T,
          m, n, k,
          1,
          A, lda<1>, lda<2>,
          B + q*ldb<2>, ldb<1>, 0,
          0,
          C + q*ldc<3>, ldc<1>, ldc<2>,
          P);

The nested-batching strategy in Listing 2 is general and extends to any two tensors of any order. Algorithms and heuristics for choosing the looped, batched, and GEMM-ed modes are provided in Section IV-D.
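As a usage reference, the proposed primitive as it appears in the CUBLAS 8.0 Release Candidate is named cublasSgemmStridedBatched. The following minimal sketch (ours; we assume the stride-0 convention for an operand shared across the whole batch) evaluates Case 1.3 of Table II, Cmn[p] = AmkBnk[p]ᵀ, the same computation as the p-GEMM loop in Section III-A, in one call:

#include <cublas_v2.h>

// C_mn[p] = A_mk * B_nk[p]^T with packed column-major tensors:
// one call replaces p GEMMs; no pointer arrays, no copies, no transposes.
void case_1_3_strided(cublasHandle_t h, const float* A, const float* B,
                      float* C, int m, int n, int k, int p) {
  float alpha = 1.0f, beta = 0.0f;
  cublasSgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k,
                            &alpha,
                            A, m, 0,                 // shared A: stride 0
                            B, n, (long long)n * k,  // B[:,:,p] is n-by-k
                            &beta,
                            C, m, (long long)m * n,  // C[:,:,p] is m-by-n
                            p);
}

On CPU, the same computation is realized with looped calls to cblas_sgemm, as in the reference implementation of Listing 1.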

IV. RESULTS AND DISCUSSION

In this section, we benchmark varying evaluation strategies in order to define heuristics for computing general tensor contractions without copy or transposition. Additionally, we demonstrate the feasibility of the extended transpose parameter for exceptional-case evaluations.

All performance measurements are performed on a heterogeneous CPU-GPU system with a dual-socket Intel Xeon E5-2630 v3 2.4GHz processor and an NVIDIA K40c GPU. Each CPU socket has 8 cores and 16 threads with an 8 × 256KB L2 cache and a 20MB L3 cache. The K40c has 2880 streaming cores distributed across 15 multiprocessors operating at 0.75GHz and a 1.5MB L2 cache.

All data used are randomized dense matrices. To eliminate noise from parallel competition between sockets, all CPU results are generated from serial runs (one core, one thread).

A. Conventional Evaluation

We further motivate the use of STRIDEDBATCHEDGEMM evaluations by plotting the speedup of the conventional approach – transpositions until a single GEMM can be called – over a single STRIDEDBATCHEDGEMM call in evaluation of Case 1.3 from Table II for tensors of size n × n × n. Figure 3 shows that STRIDEDBATCHEDGEMM is significantly faster than performing even a single transposition followed by a flattened GEMM, especially for small matrices. Here, a single transposition means n calls to mkl_somatcopy on CPU or cublasSgeam on GPU in order to fully exchange two modes. The dark lines include only a single transposition and the lighter lines include 2, 3, and 6 transpositions.

Figure 3: Performance ratio between the conventional approach with κ mode transpositions over a BATCHEDGEMM in [p] for Case 1.3. For color from deep to light, κ = 1, 2, 3, 6. Performance on CPU using MKL's mkl_somatcopy, cblas_sgemm, and cblas_sgemm_batch. Performance on GPU using CUBLAS's cublasSgeam, cublasSgemm, and our modified cublasSgemmBatched.

On CPU, the STRIDEDBATCHEDGEMM evaluation outperforms the conventional approach for all n < 512. On GPU, the benefit from performing a single flattened GEMM eventually outweighs the cost of performing the transposition, and for n ≳ 200 the conventional approach achieves a speedup over the STRIDEDBATCHEDGEMM. This speaks to the highly optimized GEMM in CUBLAS and suggests that additional optimization gains in CUBLAS's BATCHEDGEMM may be available.

B. Extended BLAS Evaluation

In this section, we compare evaluation strategies given the extended BLAS kernels. On GPU, the STRIDEDBATCHEDGEMM interface is provided by modifying cublasSgemmBatched from CUBLAS 7.5. On CPU, the STRIDEDBATCHEDGEMM interface is implemented in serial with looped calls to cblas_sgemm from MKL 11.2. Both implementations thereby avoid additional allocation and/or precomputation at the call site. The serial execution on CPU emphasizes the cache effects discussed in the following sections.

1) Flattening: Cases 1.1, 1.5, and 6.1 can be evaluated without explicit transpositions with either a single flattened GEMM or a single BATCHEDGEMM. We expect the flattened GEMM evaluation to outperform the BATCHEDGEMM evaluation due to the optimization level of existing GEMMs over that of the recently emerging BATCHEDGEMM functions.

In Figure 4, we plot the speedup achieved by using a flattened GEMM evaluation over a STRIDEDBATCHEDGEMM evaluation; the speedup is greater than one when the flattened GEMM is faster than the STRIDEDBATCHEDGEMM. Clearly, most of the time, the flattened GEMM is faster. Furthermore, we note the CUBLAS implementation of STRIDEDBATCHEDGEMM is a strong candidate for optimization, as it appears to significantly underperform with respect to GEMM.

Figure 4: Performance ratio for a BATCHEDGEMM over a flattened GEMM in evaluation of Cases 1.1, 1.5, and 6.1. (Left) CPU. (Right) GPU.

We also note the dependence of the performance on the shape of the flattened GEMM and the mode of the STRIDEDBATCHEDGEMM. On CPU, we find that the major determining factor in performance is the batching mode of the output. That is, the STRIDEDBATCHEDGEMM evaluation performs best when batched in the third mode of C, as in Case 1.5 [p] and Case 1.1 [p]. On GPU, the output batching mode makes no difference. It is unclear why the batched evaluation performs so well on Case 1.5 [p].

2) Batching: In this section, we attempt to quantify the performance gain from batching in the last mode versus an earlier mode, and to determine whether the input tensor or the output tensor should be prioritized for this optimization.

Cases 1.1 and 2.1 can both be batched in the second ([n]) or third ([p]) mode. In Figure 5, we plot the speedup obtained by performing the BATCHEDGEMM in [p] over performing it in [n]. When the size of the tensor is small, n ≲ 256, batching in the third mode is advantageous and can result in up to 1.25x speedup on CPU. When n ≳ 256, it is approximately 1.1x faster to batch in the second mode rather than the third. We expect this is an effect of the 256KB per-core L2 cache, which would house the contiguous Bkn submatrix for each p when n ≲ 256. Beyond that size, both batching strategies incur forced cache misses within each GEMM, but by batching in the middle mode more data is shared between individual GEMMs.

On GPU, we see no discernible preference in the choice of batching mode. The GPU has a much less sophisticated memory system with no prefetcher, and the performance difference is primarily determined by the number of global memory transactions issued. When n ≥ 32, the coalescing width is reached, so nearly the same number of transactions will be issued in each case, with small differences caused by alignment. We confirmed this by profiling the number of global memory reads and writes issued by each kernel and verifying that they correlate with the small differences in performance observed.

Figure 5: Speedup obtained from batching in the last mode, [p], rather than the middle mode, [n], for Cases 1.1 and 2.1. (Left) CPU. (Right) GPU.

Additionally, we consider mixed-mode batching evaluations to determine whether the input or the output array is the primary determinant of batching performance. In Figure 6, we plot the speedup from performing STRIDEDBATCHEDGEMM in the last mode of the output but the middle mode of the input, [p], over performing it in the middle mode of the output and the last mode of the input, [n], for Cases 1.2 and 2.2. The results are very similar to those of Figure 5, indicating that the batching mode of the output tensor C is more important than the batching mode of the input tensor B on CPU. This is consistent with reference implementations of GEMM, which accumulate results directly into the output matrix.

Figure 6: Speedup obtained from batching in the last output mode, [p], rather than the middle output mode, [n], for Cases 1.2 and 2.2. (Left) CPU. (Right) GPU.

3) Exceptional Cases: In this section, we demonstrate the feasibility of evaluation strategies for the exceptional cases. The Polyhedral Parallel Code Generator (PPCG) [28] is a source-to-source compiler capable of generating CUDA kernels from nested control loops in C. We use PPCG to generate a CUDA kernel for exceptional Case 6.4 and compare its performance against other evaluation strategies. Case 6.4 has four nested loops, one per free mode plus the contracted mode, and PPCG accepts a tiling parameter for each; the plain loop nest handed to PPCG is sketched below.
We search the parameter space (m, n, p, k) ∈ {1, 2, 4, 8, 16, 32, 64, 128}^4 for the most efficient variant in Figure 7. The kernels were generated with α = 1 and β = 0 statically known, as generated versions with dynamic α, β had significant branching and divergence overhead, whereas we are primarily interested in the access patterns and tiling. The tiling parameters that result in the highest performance are (16, 4, 32, 4). Via inspection, we verify that the generated kernel performs a 2D shared memory tiling for A, a "3D" shared memory tiling for B, and accumulates the C results in registers.

Figure 7: GPU tiling parameter profile from PPCG on K40c for Case 6.4. Performance values are log10([µsec]) and tests performed for m = n = k = p = 256. White indicates the run failed.

Using the (16, 4, 32, 4) kernel, we benchmark against two other evaluation strategies: (1) a BATCHEDGEMV, which requires no explicit transposition, and (2) a mode transposition in k and m followed by a BATCHEDGEMM in [n]. In Figure 8, we show the execution time for each, with the explicit transposition and GEMM stacked to show their relative proportions in the two-step evaluation. The PPCG kernel outperforms the explicit transposition/GEMM evaluation for small matrices and remains within a factor of 2-3x as n grows. We expect an expert implementation of the extended transpose parameter kernel would be able to close this gap and remain competitive with BATCHEDGEMM for all n.

Figure 8: Benchmark of three evaluation strategies for Case 6.4: a BATCHEDGEMV, a mode transposition followed by a BATCHEDGEMM, and an extended transpose kernel generated by PPCG.

C. Machine Learning Application

In this section, we present benchmarking results for the application discussed in Section II-C. For simulations on the CPU, we compare the performance of the Tucker decomposition using TensorToolbox, BTAS, CYCLOPS, and our STRIDEDBATCHEDGEMM. For simulations on the GPU, no comparable GPU library is available, so we evaluate our GPU implementation against our CPU STRIDEDBATCHEDGEMM. We fix the number of iterations as T = 200, set the core tensor size as i = j = k = 10, and set the dimensions as m = n = p. From Figure 9, using our CPU STRIDEDBATCHEDGEMM, we obtain more than 10 times speedup compared to CYCLOPS/TensorToolbox and almost four orders of magnitude compared to BTAS. Also, as expected, our GPU STRIDEDBATCHEDGEMM confers further speedup.

Figure 9: Performance on Tucker decomposition.

The dominant contractions in each HOOI iteration, e.g. line 4 of Algorithm 1, map directly onto the proposed primitive; a sketch follows.
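For illustration, here is a sketch (ours, written against the sb_gemm interface of Listing 1 with lob = 0 for the shared operand, plus a plain gemm for the flattened step; this is not the paper's benchmark code) of how line 4 of Algorithm 1, Ymjk = Tmnp Bnj Cpk, decomposes into two copy-free calls:

// Step 1: W_mj[p] = T_mn[p] * B_nj  (batch over p; B shared, so lob = 0).
// Step 2: Y_(mj)k = W_(mj)p * C_pk  (flatten (m,j), finish with one GEMM).
// Packed column-major storage; dimensions assumed to fit in int.
void hooi_line4(const float* T, const float* B, const float* C,
                float* W, float* Y, int m, int n, int p, int j, int k) {
  sb_gemm(OP_N, OP_N, m, j, n,
          1.0f,
          T, m, m * n,   // T[:,:,pp] is m-by-n
          B, n, 0,       // same B for every pp
          0.0f,
          W, m, m * j,   // W[:,:,pp] is m-by-j
          p);
  gemm(OP_N, OP_N, m * j, k, p,
       1.0f, W, m * j, C, p, 0.0f, Y, m * j);
}

Analogous pairs of calls realize lines 6 and 8, so each HOOI iteration proceeds without any permutation of T.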

D. Evaluation Priorities

Rather than attempt to model the algorithm and machine as in [22], [19], we simply provide evaluation guidelines based on the data presented. There are a number of heuristics that may be important in constructing the most efficient evaluation strategy:

1) Flatten modes whenever possible. A single large GEMM is more efficient.
2) In the interest of performing the highest intensity computation within a BATCHEDGEMM, we recommend performing the largest GEMMs possible within a BATCHEDGEMM and batching in the mode with the largest dimension.
3) Preferring to batch in the last mode versus earlier modes can depend on the input parameters and the machine.

We summarize these evaluation guidelines with pseudocode for performing a single-index tensor contraction without copy or transposition in Algorithm 2.

Algorithm 2 Single-Mode Tensor Contraction
1: In: Tensor A_A, A = [a1, . . . , aM]
2: In: Tensor B_B, B = [b1, . . . , bN], A ∩ B = {k}
3: In, Out: Tensor C_C, C = [c1, . . . , c(N+M−2)]. WLOG, c1 ∈ A.
4: Find common substrings in A, B and/or C as flattening candidates
5: Relabel flattened modes
6: Compute P = {ci | i ≠ 1, ci ≢ a1, ci ≢ b1}
7: if |C \ P| = |{c1}| = 1 then
8:   [Case C_c1··· = A_c1···k··· B_k···]
9:   Let c∗ ∈ P \ A be the index with max dimension
10:  Let c+ ∈ P \ {c1, c∗} be the index with max dimension
11:  Nested in all cj ∈ P \ {c∗, c+}, BATCHEDGEMM in c1, c∗, k, [c+]
12: else if |C \ P| = |{c1, cb}| = 2 then
13:  [Case C_c1···cb··· = A_k···c1··· B_cb···k···]
14:  Let c∗ ∈ P be the index with max dimension
15:  Nested in all cj ∈ P \ {c∗}, BATCHEDGEMM in c1, cb, k, [c∗]
16: else if |C \ P| = |{c1, ca}| = 2 then
17:  [Case C_c1···ca··· = A_ca···c1···k··· B_k···]
18:  Let c∗ ∈ P \ A be the index with max dimension
19:  Nested in all cj ∈ P \ {c∗}, Ex. BATCHEDGEMM in c1, c∗, k, [ca]
20: else if |C \ P| = |{c1, ca, cb}| = 3 then
21:  [Case C_c1···ca···cb··· = A_ca···c1···k··· B_cb···k···]
22:  Nested in all cj ∈ P, Ex. BATCHEDGEMM in c1, cb, k, [ca]
23: end if

V. CONCLUSIONS AND FUTURE WORK

Our experience reveals that the emergence of BATCHEDGEMM provides significant computational advantages for multi-linear algebraic computations. The primitive allows us to push larger, higher-intensity computations to vendor-provided implementations, and leading implementations already provide BATCHEDGEMM on highly parallel machines. To simplify its use and provide additional optimization opportunities, we propose STRIDEDBATCHEDGEMM and demonstrate its use for generalized tensor contractions. Calls to STRIDEDBATCHEDGEMM have significant opportunity to perform at or near the performance of GEMM and, by avoiding explicit transpositions or permutations of the data, accelerate these computations significantly.

Our improvement is most significant on small and moderately sized tensors. This is very important because many applications, e.g. deep learning for training a recursive tensor network, require evaluating a large number of tensor contractions of small sizes.

Although we focused on single-node performance, these evaluations may be used as building blocks for distributed memory implementations, which we intend to pursue as part of our future work. Further study into optimized implementations, architecture-dependent implementations, and the performance of the exceptional-case kernels is warranted. More complicated contractions, such as multi-index contractions and sparse tensor algebra, also pose challenging problems.

ACKNOWLEDGMENT

The authors would like to thank Aparna Chandramowlishwaran for providing computational resources and suggestions.
Animashree Anandkumar is supported in part by a Microsoft Faculty Fellowship, NSF Career Award CCF-1254106, ONR Award N00014-14-1-0665, ARO YIP Award W911NF-13-1-0084, and AFOSR YIP FA9550-15-1-0221. Yang Shi is supported by NSF Career Award CCF-1254106 and ONR Award N00014-15-1-2737. Niranjan is supported by NSF BigData Award IIS-1251267 and ONR Award N00014-15-1-2737.

REFERENCES

[1] A. Allam, J. Ramanujam, G. Baumgartner, and P. Sadayappan. Memory minimization for tensor contractions using integer linear programming. In IPDPS'06, 2006.
[2] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
[3] A. Abdelfattah, M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar, I. Karlin, Tz. Kolev, I. Masliah, and S. Tomov. High-performance tensor contractions for GPUs. In ICCS'16.
[4] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. JMLR, 15(1):2773–2832, 2014.
[5] B. W. Bader and T. G. Kolda. Algorithm 862: MATLAB tensor classes for fast algorithm prototyping. ACM Trans. Math. Softw., 32(4):635–653, 2006.
[6] B. W. Bader, T. G. Kolda, et al. MATLAB Tensor Toolbox version 2.6. Available online, http://www.sandia.gov/~tgkolda/TensorToolbox/, February 2015.
[7] R. Bro and H. A. Kiers. A new efficient method for determining the number of components in PARAFAC models. Journal of Chemometrics, 17(5):274–286, 2003.
[8] A. Buluç and J. R. Gilbert. The combinatorial BLAS: design, implementation, and applications. IJHPCA'11.
[9] N. Goyal, S. Vempala, and Y. Xiao. Fourier PCA and robust tensor decomposition. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 584–593.
[10] R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an explanatory multi-modal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.
[11] F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. J. of Math. and Physics, 6(1):164–189, 1927.
[12] C. Jhurani and P. Mullowney. A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices. J. of Parallel and Distributed Computing, 75:133–140, 2015.
[13] V. Kazeev, M. H. Khammash, M. Nip, and C. Schwab. Direct solution of the chemical master equation using quantized tensor trains. ETH-Zürich, 2013.
[14] V. Khoromskaia and B. N. Khoromskij. Tensor numerical methods in quantum chemistry: from Hartree–Fock to excitation energies. Physical Chemistry Chemical Physics, 2015.
[15] L. De Lathauwer, B. De Moor, and J. Vandewalle. On the best rank-1 and rank-(r1, r2, ..., rn) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl., 21:1324–1342, 2000.
[16] J. Li, C. Battaglino, I. Perros, J. Sun, et al. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In SC'15, pages 76:1–76:12.
[17] Q. Lu, X. Gao, et al. Empirical performance model-driven data layout optimization and library call selection for tensor contraction expressions. J. Parallel Distrib. Comput., 72(3):338–352, March 2012.
[18] N. Nakatani, B. Verstichel, G. Chan, J. Calvin, et al. BTAS v0.0.1. Available online, https://github.com/BTAS/BTAS.
[19] E. Di Napoli, D. Fabregat-Traver, G. Quintana-Ortí, and P. Bientinesi. Towards an efficient use of the BLAS library for multilinear tensor contractions. AMC, 235:454–468, 2014.
[20] T. Nelson, A. Rivera, P. Balaprakash, et al. Generating efficient tensor contractions for GPUs. In ICPP'15.
[21] E. Peise and P. Bientinesi. Performance modeling for dense linear algebra. In SC'12: High Performance Computing, Networking, Storage and Analysis.
[22] E. Peise, D. Fabregat-Traver, and P. Bientinesi. On the performance prediction of BLAS-based tensor contractions. In PMBS'15.
[23] E. Solomonik, D. Matthews, J. R. Hammond, and J. Demmel. Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions. In Parallel & Distributed Processing, pages 813–824, 2013.
[24] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, March 2009.
[25] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
[26] F. G. Van Zee and R. A. van de Geijn. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Trans. Math. Softw., 41(3):14:1–14:33, June 2015.
[27] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In ECCV'02.
[28] S. Verdoolaege, J. C. Juega, A. Cohen, et al. Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim., 9(4):54:1–54:23, January 2013.