
High-Performance Tensor Contraction without BLAS

Devin A. Matthews
Institute for Computational Engineering and Sciences, University of Texas at Austin, Austin, Texas 78712, USA
[email protected]

I. INTRODUCTION

Tensors are an integral part of many scientific disciplines [22], [16], [2], [12], [11]. At their most basic, tensors are simply a multidimensional collection of data (or a multidimensional array, as expressed in many programming languages). In other cases, tensors represent multidimensional transformations, extending the theory of vectors and matrices. The logic of handling, transforming, and operating on tensors is a common task in many scientific computing applications (SCAs), often being reimplemented many times as needed for a particular project. Calculations on tensors also often account for a significant fraction of the running time of such tensor-based codes, and so their efficiency has a significant impact on the rate at which the resulting scientific advances can be achieved.

In order to perform operations on tensors, such as tensor contraction (TC), there are currently two commonly used algorithms: (1) transpose-transpose-DGEMM-transpose (TTDT), where the input tensors are physically permuted (transposed) into a shape suitable for matrix multiplication (MM) with the BLAS interface [13], [7], [6] and the output is then transposed back into a tensor, and (2) loop-over-GEMM (LoG), where two suitable dimensions from each tensor are chosen to form sub-matrices that are multiplied, while the remaining dimensions are explicitly looped over. The TTDT approach is ubiquitous in SCAs in quantum chemistry [18], [10], [9], [8], [3], [4] and in other scientific fields [1], [19], despite several drawbacks (a sketch of TTDT follows the list):

• The storage space required is increased by up to a factor of two, since full copies of all three tensors are required.
• The tensor transpositions require a (sometimes quite significant) amount of time relative to the matrix multiplication step.
• Significant algorithmic and code complexity is required when a pre-packaged solution is not available or when customization is necessary (this is the case in the vast majority of SCAs).
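To make the TTDT pattern concrete, the following sketch shows the transpose-GEMM-transpose structure. It is purely illustrative: the contraction C[b][c][a] = sum_d A[b][d][a] * B[c][d], the routine names, and the index orderings are our own choices for exposition, not code from any tensor library. Arrays are contiguous with the last index having unit stride.

    #include <vector>

    // Standard Fortran BLAS interface for DGEMM (column-major).
    extern "C" void dgemm_(const char* transa, const char* transb,
                           const int* m, const int* n, const int* k,
                           const double* alpha, const double* a, const int* lda,
                           const double* b, const int* ldb,
                           const double* beta, double* c, const int* ldc);

    // Illustrative TTDT: C[b][c][a] = sum_d A[b][d][a] * B[c][d].
    void ttdt(int na, int nb, int nc, int nd,
              const double* A, const double* B, double* C)
    {
        // Transpose step 1: A[b][d][a] -> At[d][b][a], grouping the free
        // indices (a, b) together and isolating the contracted index d.
        std::vector<double> At((std::size_t)nd * nb * na);
        for (int b = 0; b < nb; b++)
            for (int d = 0; d < nd; d++)
                for (int a = 0; a < na; a++)
                    At[((std::size_t)d * nb + b) * na + a] =
                        A[((std::size_t)b * nd + d) * na + a];

        // Transpose step 2: B[c][d] already has d unit-stride, so viewed
        // column-major it is the nd x nc matrix we need; in general B
        // would also have to be permuted here.

        // One DGEMM: Ct[(ab), c] = At[(ab), d] * B[d, c], where the row
        // index (ab) runs over m = na*nb values with a fastest.
        int m = na * nb, n = nc, k = nd;
        double one = 1.0, zero = 0.0;
        std::vector<double> Ct((std::size_t)m * n);
        dgemm_("N", "N", &m, &n, &k, &one, At.data(), &m, B, &k,
               &zero, Ct.data(), &m);

        // Transpose step 3: unpack Ct[(ab), c] back into C[b][c][a].
        for (int b = 0; b < nb; b++)
            for (int c = 0; c < nc; c++)
                for (int a = 0; a < na; a++)
                    C[((std::size_t)b * nc + c) * na + a] =
                        Ct[((std::size_t)c * nb + b) * na + a];
    }

The extra passes over memory in steps 1 and 3, plus the workspace for At and Ct, are exactly the overheads listed above.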
The LoG approach [5], [15] can be efficient in certain circumstances, such as tensor-times-matrix multiplication [14], but in general it suffers from poor cache reuse and locality, low computational intensity (ratio of FLOPS to bytes) in the MM step, and a lack of generality, since tensor dimensions suitable for use in matrix multiplication (i.e. with one unit-stride dimension in each tensor) may not exist.
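For comparison, a LoG version of the same illustrative contraction (again our own sketch, reusing the dgemm_ prototype declared in the TTDT sketch above) loops over the b dimension and issues one small GEMM per slice:

    // Illustrative LoG: C[b][c][a] = sum_d A[b][d][a] * B[c][d].
    // Viewed column-major, each slice A[b][.][.] is an na x nd matrix and
    // C[b][.][.] an na x nc matrix, so one GEMM handles each slice.
    void log_contract(int na, int nb, int nc, int nd,
                      const double* A, const double* B, double* C)
    {
        int m = na, n = nc, k = nd;
        double one = 1.0, zero = 0.0;
        for (int b = 0; b < nb; b++)
            dgemm_("N", "N", &m, &n, &k, &one,
                   A + (std::size_t)b * nd * na, &m,
                   B, &k, &zero,
                   C + (std::size_t)b * nc * na, &m);
    }

No workspace or transposition is needed, but each GEMM sees only an na x nc x nd subproblem, which is the source of the poor cache reuse and low computational intensity noted above.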

II. TENSOR CONTRACTION IN TBLIS

The exponential explosion in the variety of tensor contractions compared to matrix multiplication would seem to require the use of code-generation techniques, searches over large parameter spaces, and the like. In this work we show that casting tensor contraction logically as matrix multiplication provides a much simpler and more general alternative which achieves high performance and parallel scalability. We leverage the BLIS framework [21], [20], which exposes the entire internal structure of algorithms such as matrix multiplication, down to the level of the micro-kernel, the only part that must be implemented in highly optimized assembly language.
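As a concrete illustration of this logical view (the particular contraction is our own example, not taken from the text), consider

    C[a,b,c,d] = sum over e,f of A[a,e,b,f] * B[f,d,e,c].

Grouping the free indices shared by A and C into a composite row index (ab), the free indices shared by B and C into a composite column index (cd), and the contracted indices into (ef) yields an ordinary matrix multiplication

    C'[(ab),(cd)] = sum over (ef) of A'[(ab),(ef)] * B'[(ef),(cd)],

except that the "matrix" elements of A', B', and C' are located by the original tensor strides rather than by constant row and column strides.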

The exposure of the micro-kernel as the unit of work enables the physical transition from tensors to matrices to be delayed as long as possible, so that the tensors may be logically viewed and manipulated (partitioned) as matrices during the majority of the algorithm. Additionally, the small size of the micro-kernel (4x4, 8x4, 6x8, etc.) allows the individual tensor blocks to be treated as regular matrix blocks (i.e. with constant row and column strides) most of the time, provided the leading tensor dimension is large enough. Blocks that fall on the edge of the tensor's leading dimension are treated using a scatter-matrix layout, in which the memory offset of each row and column is stored explicitly. Threading is enabled in precisely the same manner as in BLIS [17], with similarly high scalability.
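The scatter-matrix layout can be made concrete with a small sketch. The code below is a minimal illustration under our own assumptions (the function and variable names are ours, not TBLIS internals): it flattens one group of tensor dimensions, described by lengths and strides, into a scatter vector of memory offsets.

    #include <cstddef>
    #include <vector>

    // Illustrative sketch (not TBLIS source): flatten a group of tensor
    // dimensions, given fastest-varying first, into a scatter vector.
    // scat[i] is the memory offset of the i-th logical row (or column).
    std::vector<std::ptrdiff_t> make_scatter(const std::vector<std::ptrdiff_t>& len,
                                             const std::vector<std::ptrdiff_t>& stride)
    {
        std::vector<std::ptrdiff_t> scat{0};
        for (std::size_t d = 0; d < len.size(); d++)
        {
            std::vector<std::ptrdiff_t> next;
            next.reserve(scat.size() * len[d]);
            // Each new dimension varies more slowly than those already folded in.
            for (std::ptrdiff_t i = 0; i < len[d]; i++)
                for (std::ptrdiff_t off : scat)
                    next.push_back(off + i * stride[d]);
            scat = std::move(next);
        }
        return scat;
    }

Element (i, j) of the logical matrix view then resides at data[rscat[i] + cscat[j]], where rscat and cscat are the scatter vectors for the row and column index groups. When the strides within a group chain contiguously, the scatter vector degenerates to a constant stride, which is what allows interior blocks to be treated as regular matrix blocks; only edge blocks need the explicit offsets.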

The specific contributions of this work are:

• A novel logical mapping from general tensor layouts to a non-standard (neither row-major nor column-major) matrix layout.
• Implementations of key BLIS kernels using this novel matrix layout, eliminating the need for explicit transposition of tensors while retaining a matrix-oriented algorithm.
• A new BLIS-like framework, TBLIS, incorporating these kernels, which achieves high performance for tensor contraction and does not require external workspace.
• Efficient multithreading of tensor contraction within the aforementioned framework.
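As one illustration of the second contribution (a sketch under assumed names, not the actual TBLIS kernels), the BLIS-style packing step can gather an MR x k micro-panel by walking the scatter vectors from the earlier sketch; the assembly micro-kernel itself is then reused unchanged on the packed, contiguous panels.

    #include <cstddef>

    // Illustrative packing of one MR x k micro-panel from a scatter-matrix
    // view (hypothetical helper, not TBLIS source). `data` is the tensor's
    // base pointer; rscat/cscat are scatter vectors as built earlier.
    template <int MR>
    void pack_panel(const double* data,
                    const std::ptrdiff_t* rscat,  // MR row offsets
                    const std::ptrdiff_t* cscat,  // k column offsets
                    std::ptrdiff_t k,
                    double* panel)                // out: MR x k, contiguous
    {
        for (std::ptrdiff_t j = 0; j < k; j++)
            for (int i = 0; i < MR; i++)
                *panel++ = data[rscat[i] + cscat[j]];
    }

Because packing already touches every element once in ordinary BLIS, gathering through scatter vectors adds little cost, and interior blocks with constant strides can fall back to the regular packing path.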

[Figure 1: Using twelve cores of a Xeon E5-2690 v3 (Haswell) processor. (a) Random “square” tensor contractions with TBLIS and TTDT, compared to the equivalent matrix multiplication (GFLOPS/s vs. M = N = K up to 5000; curves for MKL, BLIS, TBLIS, and TTDT). (b) Speedup of TBLIS over TTDT for a range of tensor contractions.]

III. SUMMARY AND CONCLUSIONS

As exemplified by Figure 1, the TBLIS approach allows for up to a 17x speedup over TTDT and approaches the efficiency of bare matrix multiplication for a wide range of problem sizes and shapes. Additionally, the use of the BLIS framework provides a high degree of parallel scalability. Since the TBLIS approach avoids the major drawbacks of TTDT (variable performance, excessive workspace requirements, and, in some circumstances, complicated program logic in user applications), we conclude that this algorithm should be considered a high-performance alternative to existing tensor contraction algorithms. The efficiency and simplicity of TBLIS also highlight the utility and flexibility of the BLIS approach to matrix multiplication.

REFERENCES

[1] B. W. Bader and T. G. Kolda. Algorithm 862: MATLAB tensor classes for fast algorithm prototyping. ACM Trans. Math. Softw., 32(4):635–653, 2006.
[2] R. J. Bartlett and M. Musiał. Coupled-cluster theory in quantum chemistry. Rev. Mod. Phys., 79(1):291–352, 2007.
[3] J. A. Calvin, C. A. Lewis, and E. F. Valeev. Scalable task-based algorithm for multiplication of block-rank-sparse matrices. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, IA3 ’15, pages 4:1–4:8, New York, NY, USA, 2015. ACM.
[4] J. A. Calvin and E. F. Valeev. Task-based algorithm for matrix multiplication: A step towards block-sparse tensor computing. arXiv:1504.05046 [cs], 2015.
[5] E. Di Napoli, D. Fabregat-Traver, G. Quintana-Orti, and P. Bientinesi. Towards an efficient use of the BLAS library for multilinear tensor contractions. Appl. Math. Comput., 235:454–468, 2014.
[6] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw., 16(1):1–17, 1990.
[7] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Softw., 14(1):1–17, 1988.
[8] E. Epifanovsky, M. Wormit, T. Kus, A. Landau, D. Zuev, K. Khistyaev, P. Manohar, I. Kaliman, A. Dreuw, and A. I. Krylov. New implementation of high-level correlated methods using a general block-tensor library for high-performance electronic structure calculations. J. Comp. Chem., 34(26):2293–2309, 2013.
[9] M. Hanrath and A. Engels-Putzka. An efficient matrix-matrix multiplication based antisymmetric tensor contraction engine for general order coupled cluster. J. Chem. Phys., 133(6):064108, 2010.
[10] A. Hartono, Q. Lu, T. Henretty, S. Krishnamoorthy, H. Zhang, G. Baumgartner, D. E. Bernholdt, M. Nooijen, R. Pitzer, J. Ramanujam, and P. Sadayappan. Performance optimization of tensor contraction expressions for many-body methods in quantum chemistry. J. Phys. Chem. A, 113(45):12715–12723, 2009.
[11] T. Kolda and B. Bader. Tensor decompositions and applications. SIAM Rev., 51(3):455–500, 2009.
[12] P. M. Kroonenberg. Applied multiway data analysis. John Wiley & Sons, 2008.
[13] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for FORTRAN usage. ACM Trans. Math. Softw., 5(3):308–323, 1979.
[14] J. Li, C. Battaglino, I. Perros, J. Sun, and R. Vuduc. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’15, pages 76:1–76:12, New York, NY, USA, 2015. ACM.
[15] E. Peise, D. Fabregat-Traver, and P. Bientinesi. On the performance prediction of BLAS-based tensor contractions. In S. A. Jarvis, S. A. Wright, and S. D. Hammond, editors, High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, number 8966 in Lecture Notes in Computer Science, pages 193–212. Springer International Publishing, 2014. DOI: 10.1007/978-3-319-17248-4_10.
[16] A. Smilde, R. Bro, and P. Geladi. Multi-way analysis: Applications in the chemical sciences. John Wiley & Sons, 2005.
[17] T. M. Smith, R. A. van de Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. Van Zee. Anatomy of high-performance many-threaded matrix multiplication. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 1049–1059, 2014.
[18] J. F. Stanton, J. Gauss, J. D. Watts, and R. J. Bartlett. A direct product decomposition approach for symmetry exploitation in many-body methods. I. Energy calculations. J. Chem. Phys., 94(6):4334–4345, 1991.
[19] S. van der Walt, S. C. Colbert, and G. Varoquaux. The NumPy array: A structure for efficient numerical computation. Comput. Sci. Eng., 13(2):22–30, 2011.
[20] F. G. Van Zee, T. M. Smith, B. Marker, T. M. Low, R. A. van de Geijn, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Austel, J. A. Gunnels, and L. Killough. The BLIS framework: Experiments in portability. ACM Trans. Math. Softw., in press.
[21] F. G. Van Zee and R. A. van de Geijn. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Trans. Math. Softw., 41(3):14:1–14:33, 2015.
[22] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, editors, Computer Vision - ECCV 2002, number 2350 in Lecture Notes in Computer Science, pages 447–460. Springer Berlin Heidelberg, 2002. DOI: 10.1007/3-540-47969-4_30.