
High-Performance Tensor Contraction without BLAS

Devin A. Matthews
Institute for Computational Engineering and Sciences, University of Texas at Austin, Austin, Texas 78712, USA
[email protected]

I. INTRODUCTION

Tensors are an integral part of many scientific disciplines [22], [16], [2], [12], [11]. At their most basic, tensors are simply a multidimensional collection of data (or a multidimensional array, as expressed in many programming languages). In other cases, tensors represent multidimensional transformations, extending the theory of vectors and matrices. The logic of handling, transforming, and operating on tensors is a common task in many scientific computing applications (SCAs), often being reimplemented many times as needed for a particular project. Calculations on tensors also often account for a significant fraction of the running time of such tensor-based codes, and so their efficiency has a significant impact on the rate at which the resulting scientific advances can be achieved.

In order to perform operations on tensors, such as tensor contraction (TC), there are currently two commonly used algorithms: (1) transpose-transpose-DGEMM-transpose (TTDT), where the input tensors are physically permuted (transposed) into a shape suitable for matrix multiplication (MM) with the BLAS interface [13], [7], [6] and the output is then transposed back into a tensor, and (2) loop-over-GEMM (LoG), where two suitable dimensions from each tensor are chosen to form sub-matrices that are multiplied, while the remaining dimensions are explicitly looped over. The TTDT approach is ubiquitous in SCAs in quantum chemistry [18], [10], [9], [8], [3], [4] and in other scientific fields [1], [19], despite several drawbacks (a sketch of TTDT follows the list):

• The storage space required is increased by up to a factor of two, since full copies of all three tensors are required.
• The tensor transpositions require a (sometimes quite significant) amount of time relative to the matrix multiplication step.
• Significant algorithmic and code complexity is required when a pre-packaged solution is not available or when customization is necessary (this is the case in the vast majority of SCAs).
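To make the TTDT pattern concrete, the following sketch shows the transpose-GEMM-transpose structure. It is purely illustrative: the contraction C[b][c][a] = sum_d A[b][d][a] * B[c][d], the routine names, and the index orderings are our own choices for exposition, not code from any tensor library. Arrays are contiguous with the last index having unit stride.

    #include <vector>

    // Standard Fortran BLAS interface for DGEMM (column-major).
    extern "C" void dgemm_(const char* transa, const char* transb,
                           const int* m, const int* n, const int* k,
                           const double* alpha, const double* a, const int* lda,
                           const double* b, const int* ldb,
                           const double* beta, double* c, const int* ldc);

    // Illustrative TTDT: C[b][c][a] = sum_d A[b][d][a] * B[c][d].
    void ttdt(int na, int nb, int nc, int nd,
              const double* A, const double* B, double* C)
    {
        // Transpose step 1: A[b][d][a] -> At[d][b][a], grouping the free
        // indices (a, b) together and isolating the contracted index d.
        std::vector<double> At((std::size_t)nd * nb * na);
        for (int b = 0; b < nb; b++)
            for (int d = 0; d < nd; d++)
                for (int a = 0; a < na; a++)
                    At[((std::size_t)d * nb + b) * na + a] =
                        A[((std::size_t)b * nd + d) * na + a];

        // Transpose step 2: B[c][d] already has d unit-stride, so viewed
        // column-major it is the nd x nc matrix we need; in general B
        // would also have to be permuted here.

        // One DGEMM: Ct[(ab), c] = At[(ab), d] * B[d, c], where the row
        // index (ab) runs over m = na*nb values with a fastest.
        int m = na * nb, n = nc, k = nd;
        double one = 1.0, zero = 0.0;
        std::vector<double> Ct((std::size_t)m * n);
        dgemm_("N", "N", &m, &n, &k, &one, At.data(), &m, B, &k,
               &zero, Ct.data(), &m);

        // Transpose step 3: unpack Ct[(ab), c] back into C[b][c][a].
        for (int b = 0; b < nb; b++)
            for (int c = 0; c < nc; c++)
                for (int a = 0; a < na; a++)
                    C[((std::size_t)b * nc + c) * na + a] =
                        Ct[((std::size_t)c * nb + b) * na + a];
    }

The extra passes over memory in steps 1 and 3, plus the workspace for At and Ct, are exactly the overheads listed above.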
The LoG approach [5], [15] can be efficient in certain circumstances, such as tensor-times-matrix multiplication [14], but in general it suffers from poor cache reuse and locality, low computational intensity (ratio of FLOPS to bytes) in the MM step, and a lack of generality, since tensor dimensions suitable for use in matrix multiplication (i.e. with one unit-stride dimension in each tensor) may not exist.
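For comparison, a LoG version of the same illustrative contraction (again our own sketch, reusing the dgemm_ prototype declared in the TTDT sketch above) loops over the b dimension and issues one small GEMM per slice:

    // Illustrative LoG: C[b][c][a] = sum_d A[b][d][a] * B[c][d].
    // Viewed column-major, each slice A[b][.][.] is an na x nd matrix and
    // C[b][.][.] an na x nc matrix, so one GEMM handles each slice.
    void log_contract(int na, int nb, int nc, int nd,
                      const double* A, const double* B, double* C)
    {
        int m = na, n = nc, k = nd;
        double one = 1.0, zero = 0.0;
        for (int b = 0; b < nb; b++)
            dgemm_("N", "N", &m, &n, &k, &one,
                   A + (std::size_t)b * nd * na, &m,
                   B, &k, &zero,
                   C + (std::size_t)b * nc * na, &m);
    }

No workspace or transposition is needed, but each GEMM sees only an na x nc x nd subproblem, which is the source of the poor cache reuse and low computational intensity noted above.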

II. TENSOR CONTRACTION IN TBLIS

The exponential explosion in the variety of tensor contractions compared to matrix multiplication would seem to require the use of code-generation techniques, searches over large parameter spaces, and the like. In this work we show that casting tensor contraction logically as matrix multiplication provides a much simpler and more general alternative which achieves high performance and parallel scalability. We leverage the BLIS framework [21], [20], which exposes the entire internal structure of algorithms such as matrix multiplication, down to the level of the micro-kernel, the only part that must be implemented in highly optimized assembly language.
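As a concrete illustration of this logical view (the particular contraction is our own example, not taken from the text), consider

    C[a,b,c,d] = sum over e,f of A[a,e,b,f] * B[f,d,e,c].

Grouping the free indices shared by A and C into a composite row index (ab), the free indices shared by B and C into a composite column index (cd), and the contracted indices into (ef) yields an ordinary matrix multiplication

    C'[(ab),(cd)] = sum over (ef) of A'[(ab),(ef)] * B'[(ef),(cd)],

except that the "matrix" elements of A', B', and C' are located by the original tensor strides rather than by constant row and column strides.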

The exposure of the micro-kernel as the unit of work enables the physical transition from tensors to matrices to be delayed as long as possible, so that the tensors may be logically viewed and manipulated (partitioned) as matrices during the majority of the algorithm. Additionally, the small size of the micro-kernel (4x4, 8x4, 6x8, etc.) allows the individual tensor blocks to be treated as regular matrix blocks (i.e. with constant row and column strides) most of the time, provided the leading tensor dimension is large enough. Blocks that fall on the edge of the tensor's leading dimension are treated using a scatter-matrix layout, in which the memory offset of each row and column is stored explicitly. Threading is enabled in precisely the same manner as in BLIS [17], with similarly high scalability.
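The scatter-matrix layout can be made concrete with a small sketch. The code below is a minimal illustration under our own assumptions (the function and variable names are ours, not TBLIS internals): it flattens one group of tensor dimensions, described by lengths and strides, into a scatter vector of memory offsets.

    #include <cstddef>
    #include <vector>

    // Illustrative sketch (not TBLIS source): flatten a group of tensor
    // dimensions, given fastest-varying first, into a scatter vector.
    // scat[i] is the memory offset of the i-th logical row (or column).
    std::vector<std::ptrdiff_t> make_scatter(const std::vector<std::ptrdiff_t>& len,
                                             const std::vector<std::ptrdiff_t>& stride)
    {
        std::vector<std::ptrdiff_t> scat{0};
        for (std::size_t d = 0; d < len.size(); d++)
        {
            std::vector<std::ptrdiff_t> next;
            next.reserve(scat.size() * len[d]);
            // Each new dimension varies more slowly than those already folded in.
            for (std::ptrdiff_t i = 0; i < len[d]; i++)
                for (std::ptrdiff_t off : scat)
                    next.push_back(off + i * stride[d]);
            scat = std::move(next);
        }
        return scat;
    }

Element (i, j) of the logical matrix view then resides at data[rscat[i] + cscat[j]], where rscat and cscat are the scatter vectors for the row and column index groups. When the strides within a group chain contiguously, the scatter vector degenerates to a constant stride, which is what allows interior blocks to be treated as regular matrix blocks; only edge blocks need the explicit offsets.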

The specific contributions of this work are:

• A novel logical mapping from general tensor layouts to a non-standard (neither row-major nor column-major) matrix layout.
• Implementations of key BLIS kernels using this novel matrix layout, eliminating the need for explicit transposition of tensors while retaining a matrix-oriented algorithm.
• A new BLIS-like framework, TBLIS, incorporating these kernels, which achieves high performance for tensor contraction and does not require external workspace.
• Efficient multithreading of tensor contraction within the aforementioned framework.
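As one illustration of the second contribution (a sketch under assumed names, not the actual TBLIS kernels), the BLIS-style packing step can gather an MR x k micro-panel by walking the scatter vectors from the earlier sketch; the assembly micro-kernel itself is then reused unchanged on the packed, contiguous panels.

    #include <cstddef>

    // Illustrative packing of one MR x k micro-panel from a scatter-matrix
    // view (hypothetical helper, not TBLIS source). `data` is the tensor's
    // base pointer; rscat/cscat are scatter vectors as built earlier.
    template <int MR>
    void pack_panel(const double* data,
                    const std::ptrdiff_t* rscat,  // MR row offsets
                    const std::ptrdiff_t* cscat,  // k column offsets
                    std::ptrdiff_t k,
                    double* panel)                // out: MR x k, contiguous
    {
        for (std::ptrdiff_t j = 0; j < k; j++)
            for (int i = 0; i < MR; i++)
                *panel++ = data[rscat[i] + cscat[j]];
    }

Because packing already touches every element once in ordinary BLIS, gathering through scatter vectors adds little cost, and interior blocks with constant strides can fall back to the regular packing path.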

[Figure 1: Using twelve cores of a Xeon E5-2690 v3 (Haswell) processor. (a) Random “square” tensor contractions with TBLIS and TTDT, compared to the equivalent matrix multiplication (GFLOPS/s vs. M = N = K up to 5000; curves for MKL, BLIS, TBLIS, and TTDT). (b) Speedup of TBLIS over TTDT for a range of tensor contractions.]

III. SUMMARY AND CONCLUSIONS

As exemplified by Figure 1, the TBLIS approach allows for up to a 17x speedup over TTDT and approaches the efficiency of bare matrix multiplication for a wide range of problem sizes and shapes. Additionally, the use of the BLIS framework provides a high degree of parallel scalability. Since the TBLIS approach avoids the major drawbacks of TTDT (variable performance, excessive workspace requirements, and, in some circumstances, complicated program logic in user applications), we conclude that this algorithm should be considered a high-performance alternative to existing tensor contraction algorithms. The efficiency and simplicity of TBLIS also highlight the utility and flexibility of the BLIS approach to matrix multiplication.

REFERENCES

[1] B. W. Bader and T. G. Kolda. Algorithm 862: MATLAB tensor classes for fast algorithm prototyping. ACM Trans. Math. Softw., 32(4):635–653, 2006.
[2] R. J. Bartlett and M. Musiał. Coupled-cluster theory in quantum chemistry. Rev. Mod. Phys., 79(1):291–352, 2007.
[3] J. A. Calvin, C. A. Lewis, and E. F. Valeev. Scalable task-based algorithm for multiplication of block-rank-sparse matrices. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, IA3 ’15, pages 4:1–4:8, New York, NY, USA, 2015. ACM.
[4] J. A. Calvin and E. F. Valeev. Task-based algorithm for matrix multiplication: A step towards block-sparse tensor computing. arXiv:1504.05046 [cs], 2015.
[5] E. Di Napoli, D. Fabregat-Traver, G. Quintana-Orti, and P. Bientinesi. Towards an efficient use of the BLAS library for multilinear tensor contractions. Appl. Math. Comput., 235:454–468, 2014.
[6] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw., 16(1):1–17, 1990.
[7] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Softw., 14(1):1–17, 1988.
[8] E. Epifanovsky, M. Wormit, T. Kus, A. Landau, D. Zuev, K. Khistyaev, P. Manohar, I. Kaliman, A. Dreuw, and A. I. Krylov. New implementation of high-level correlated methods using a general block-tensor library for high-performance electronic structure calculations. J. Comp. Chem., 34(26):2293–2309, 2013.
[9] M. Hanrath and A. Engels-Putzka. An efficient matrix-matrix multiplication based antisymmetric tensor contraction engine for general order coupled cluster. J. Chem. Phys., 133(6):064108, 2010.
[10] A. Hartono, Q. Lu, T. Henretty, S. Krishnamoorthy, H. Zhang, G. Baumgartner, D. E. Bernholdt, M. Nooijen, R. Pitzer, J. Ramanujam, and P. Sadayappan. Performance optimization of tensor contraction expressions for many-body methods in quantum chemistry. J. Phys. Chem. A, 113(45):12715–12723, 2009.
[11] T. Kolda and B. Bader. Tensor decompositions and applications. SIAM Rev., 51(3):455–500, 2009.
[12] P. M. Kroonenberg. Applied multiway data analysis. John Wiley & Sons, 2008.
[13] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for FORTRAN usage. ACM Trans. Math. Softw., 5(3):308–323, 1979.
[14] J. Li, C. Battaglino, I. Perros, J. Sun, and R. Vuduc. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’15, pages 76:1–76:12, New York, NY, USA, 2015. ACM.
[15] E. Peise, D. Fabregat-Traver, and P. Bientinesi. On the performance prediction of BLAS-based tensor contractions. In S. A. Jarvis, S. A. Wright, and S. D. Hammond, editors, High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, number 8966 in Lecture Notes in Computer Science, pages 193–212. Springer International Publishing, 2014. DOI: 10.1007/978-3-319-17248-4_10.
[16] A. Smilde, R. Bro, and P. Geladi. Multi-way analysis: Applications in the chemical sciences. John Wiley & Sons, 2005.
[17] T. M. Smith, R. A. van de Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. Van Zee. Anatomy of high-performance many-threaded matrix multiplication. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 1049–1059, 2014.
[18] J. F. Stanton, J. Gauss, J. D. Watts, and R. J. Bartlett. A direct product decomposition approach for symmetry exploitation in many-body methods. I. Energy calculations. J. Chem. Phys., 94(6):4334–4345, 1991.
[19] S. van der Walt, S. C. Colbert, and G. Varoquaux. The NumPy array: A structure for efficient numerical computation. Comput. Sci. Eng., 13(2):22–30, 2011.
[20] F. G. Van Zee, T. M. Smith, B. Marker, T. M. Low, R. A. van de Geijn, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Austel, J. A. Gunnels, and L. Killough. The BLIS framework: Experiments in portability. ACM Trans. Math. Softw., in press.
[21] F. G. Van Zee and R. A. van de Geijn. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Trans. Math. Softw., 41(3):14:1–14:33, 2015.
[22] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, editors, Computer Vision - ECCV 2002, number 2350 in Lecture Notes in Computer Science, pages 447–460. Springer Berlin Heidelberg, 2002. DOI: 10.1007/3-540-47969-4_30.