TTC: A high-performance Compiler for Tensor Transpositions

Paul Springer, AICES, RWTH Aachen · Jeff R. Hammond, Intel Corporation · Paolo Bientinesi, AICES, RWTH Aachen

We present TTC, an open-source parallel compiler for multidimensional tensor transpositions. In order to generate high-performance C++ code, TTC explores a number of optimizations, including software prefetching, blocking, loop-reordering, and explicit vectorization. To evaluate the performance of multidimensional transpositions across a range of possible use-cases, we also release a benchmark covering arbitrary transpositions of up to six dimensions. Performance results show that the routines generated by TTC achieve close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures, and yield significant performance gains over modern compilers. By implementing a set of pruning heuristics, TTC allows users to limit the number of potential solutions; this option is especially useful when dealing with high-dimensional tensors, as the search space might become prohibitively large. Experiments indicate that when only 100 potential solutions are considered, the resulting performance is about 99% of that achieved with exhaustive search.

CCS Concepts: • Mathematics of computing → Mathematical software; • Theory of computation → Parallel algorithms; • Software and its engineering → Compilers; • Computing methodologies → Linear algebra algorithms;

Additional Key Words and Phrases: domain-specific compiler, multidimensional transpositions, high-performance computing

ACM Reference Format:
Paul Springer, Jeff R. Hammond and Paolo Bientinesi. 2016. TTC: A high-performance Compiler for Tensor Transpositions. ACM Trans. Math. Softw. V, N, Article A (2016), 18 pages.
DOI: http://dx.doi.org/10.1145/0000000.0000000

1. INTRODUCTION

Tensors appear in a wide range of applications, including electronic structure theory [Bartlett and Musiał 2007], multiresolution analysis [Harrison et al. 2015], quantum many-body theory [Pfeifer et al. 2014], quantum computing simulation [Markov and Shi 2008], machine learning [Chetlur et al. 2014; Abadi et al. 2015; Vasilache et al. 2014], and data analysis [Kolda and Bader 2009]. While a range of software tools exist for computations involving one- and two-dimensional arrays, i.e. vectors and matrices, the availability of high-performance software tools for tensors is much more limited. In part, this is due to the combinatorial explosion of different operations that need to be supported: There are only four ways to multiply two matrices, but (m choose k) · (n choose k) ways to contract an m- and an n-dimensional tensor over k indices. Furthermore, the variety of dimensions and data layouts that are relevant to applications is much larger in the case of tensors, and these issues lead to memory access patterns that are particularly difficult to execute efficiently on modern computing platforms.

Author's address: P. Springer ([email protected]) and P. Bientinesi ([email protected]), AICES, RWTH Aachen University, Schinkelstr. 2, 52062 Aachen, Germany.


Efficient computation of tensor operations, particularly contractions, exposes a tension between generality and mathematical expression on one hand, and performance and software reuse on the other. If one implements tensor contractions in a naive way—using perfectly-nested loops—the connection with the mathematical formulae is obvious, but the performance will be suboptimal in all nontrivial cases. High performance can be obtained by using the Basic Linear Algebra Subprograms (BLAS) [Dongarra et al. 1990], but mapping from tensors to matrices efficiently is nontrivial [Di Napoli et al. 2014; Li et al. 2015] and optimal strategies are unlikely to be performance portable, due to the ways that multidimensional array striding taxes the memory hierarchy of modern microprocessors.

A well-known approach for optimizing tensor computations is to use the level-3 BLAS for contracting matrices at high efficiency, and to always permute tensor objects into a compatible matrix format. This approach has been used successfully in the NWChem Tensor Contraction Engine module [Hirata 2003; Kowalski et al. 2011], the Cyclops Tensor Framework [Solomonik et al. 2014], and numerous other coupled-cluster codes dating back more than 25 years [Gauss et al. 1991]. The critical issue for this approach is the existence of high-performance tensor permutations, or tensor transpositions.

While tensor contractions appear in a range of scientific domains (e.g., climate simulation [Drake et al. 1995] and multidimensional Fourier transforms [Frigo and Johnson 2005; Pekurovsky 2012]), they are perhaps of greatest importance in quantum chemistry [Baumgartner et al. 2005; Hirata 2003], where the most expensive widely used methods—coupled-cluster methods—consist almost entirely of contractions which deal with 4-, 6-, and even 8-dimensional tensors. Such computations consume millions of processor-hours of supercomputing time at facilities around the world, so any improvement in their performance is of significant value.

To this end, we developed the Tensor Transpose Compiler (TTC), an open-source code generator for high-performance multidimensional transpositions that also supports multithreading and vectorization.1 Together with TTC, we provide a transpose benchmark that can be used to compare different algorithms and implementations on a range of multidimensional transpositions.

Let Ai1,i2,...,iN be an N-dimensional tensor, and let Π(i1, i2, ..., iN) denote an arbitrary permutation of the indices i1, i2, ..., iN. The transposition of A into BΠ(i1,i2,...,iN) is expressed as BΠ(i1,i2,...,iN) ← α × Ai1,i2,...,iN. To make TTC flexible and applicable to a wide range of applications, we designed it to support the class of transpositions

    BΠ(i1,i2,...,iN) ← α × Ai1,i2,...,iN + β × BΠ(i1,i2,...,iN),    (1)

where α, β ∈ R; that is, the output tensor B can be updated (as opposed to overwritten), and both A and B can be scaled.2 As an example, let Ai1,i2 be a two-dimensional tensor, Π(i1, i2) = (i2, i1), and α = 1, β = 0; then Eqn. 1 reduces to an ordinary out-of-place 2D matrix transposition of the form Bi2,i1 ← Ai1,i2.

Throughout this article, we adopt a Fortran storage scheme, that is, the tensors are assumed to be stored following the indices from left to right (i.e., the leftmost index has stride 1). Additionally, we use the notation π(ia) = ib to denote that the a-th index of the input operand will become the b-th index of the output operand.

1 TTC is available at www.github.com/HPAC/TTC.
2 An alternative representation for Eqn. 1 is Bi1,i2,...,iN ← α × AΠ̃(i1,i2,...,iN) + β × Bi1,i2,...,iN, where Π̃ is a suitable permutation.
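For reference, Eqn. 1 can be realized naively as N perfectly nested loops. The following sketch is our own illustration, not code emitted by TTC; it hard-codes the 3D permutation Π(i1, i2, i3) = (i3, i2, i1) and assumes the Fortran storage scheme described above:

// Naive rendition of Eqn. 1 for Pi(i1,i2,i3) = (i3,i2,i1):
// B[i3,i2,i1] <- alpha * A[i1,i2,i3] + beta * B[i3,i2,i1].
// Fortran storage: the leftmost index of each tensor has stride 1.
void transpose_321_naive(const float *A, float *B,
                         int size0, int size1, int size2,
                         float alpha, float beta)
{
   for (int i3 = 0; i3 < size2; ++i3)
      for (int i2 = 0; i2 < size1; ++i2)
         for (int i1 = 0; i1 < size0; ++i1)
            B[i3 + i2*size2 + i1*size2*size1] =
                 alpha * A[i1 + i2*size0 + i3*size0*size1]
               + beta  * B[i3 + i2*size2 + i1*size2*size1];
}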


(Figure: two panels of Bandwidth [GiB/s] versus the 120 variants, each with a horizontal "fixed size" reference line.)
Fig. 1: Single-threaded bandwidth for all the 120 possible loop orders of two different 5D transpositions: (a) Π(i1, i2, i3, i4, i5) = (i2, i5, i3, i1, i4); (b) Π(i1, i2, i3, i4, i5) = (i5, i3, i1, i4, i2). explicit-vec and hw-pre respectively denote whether explicit vectorization and hardware prefetching are enabled (Y) or disabled (N).

For each input transposition, TTC explores a number of optimizations (e.g., blocking, vectorization, parallelization, prefetching, loop-reordering, non-temporal stores), each of which exposes one or more parameters; to achieve high performance, these parameters have to be tuned for the specific hardware on which the transposition will be executed. Since the effects of such parameters are not independent, the optimal configuration is a point (or a region) of an often large parameter space; indeed, as the size of the space grows with N!, an exhaustive search is feasible only for tensors of low dimensionality. In general, it is widely believed that the parameter space is too complex to design a perfect transpose from first principles [McCalpin and Smotherman 1995; Lu et al. 2006]. For all these reasons, whenever an exhaustive search is not applicable, TTC's generation relies on heuristics.

In Fig. 1, we use two exemplary 5D transpositions (involving tensors of equal volume) as compelling evidence in favor of a search-based approach.3 The figures show the bandwidth attained by each of the possible 5! = 120 loop orders4—sorted in descending order, from left to right—with and without hardware prefetching (hw-pre:N, hw-pre:Y), and with and without explicit vectorization (explicit-vec:Y, explicit-vec:N).5 The compiler-vectorized code consists of perfectly nested loops including #pragma ivdep, to assist the compiler. The horizontal line labeled "fixed size" denotes the performance of the compiler-vectorized versions (with hardware prefetching enabled), where the size of the tensor was known at compile time; this enabled the compiler to reorder the loops.

Several observations can be made. (1) Despite the fact that the two transpositions move exactly the same amount of data, the resulting top bandwidth is clearly different. (2) The difference between the leftmost and the rightmost datapoints—of any color—provides clear evidence that the loop order has a huge impact on performance: 9.2× and 4.3× in Fig. 1a and Fig. 1b, respectively. This is in line with the findings of Jeff Hammond [Hammond 2009], who pointed out that the best loop order for a multidimensional transpose can have a huge impact on performance.

3 The experiments were performed on an Intel Xeon E5-2670 v2 CPU, using one thread.
4 A d-dimensional transposition can be implemented as d perfectly nested loops around an update statement. These loops can be ordered in d! different ways.
5 We refer to explicit vectorization to denote code which is written with AVX intrinsics; the vectorized versions use a blocking of 8×8 (see Section 3.1).



(3) By comparing the leftmost data point of the beige and blue lines, one concludes that the explicit vectorization improves the performance over the fastest compiler-vectorized version by at least 20%. (4) Since the cyan lines in Fig. 1a and 1b are practically the same, one can conclude that the difference in performance between the two transpositions is due to hardware prefetching. (5) The difference between the blue line and the horizontal black line in Fig. 1a indicates that when it is possible for the compiler to reorder the loops, the generated code is much better than most loop orders, but still about 40% away from the best one; similarly for Fig. 1b, the compiler fails to identify the best loop order.

While observation (5) suggests that modern compilers struggle to find the best loop order at compile time, an even bigger incentive to adopt a search-based approach is provided by observation (4), since detailed information about the mechanism of hardware prefetchers is not well documented. Moreover, the implementations of hardware prefetchers can vary between architectures and manufacturers. Hence, designing a generic analytical model for different architectures seems infeasible at this point.

The contributions of this paper can be summarized as follows.

— We introduce TTC, a high-performance transpose generator that supports single, double, single-complex, double-complex and mixed-precision data types, generates multithreaded C++ code with explicit vectorization for AVX-enabled processors, and exploits spatial locality.
— We also introduce a comprehensive multidimensional transpose benchmark, to provide the means of comparing different implementations. By means of this benchmark, we perform a thorough performance comparison on two architectures.
— We analyze the effects of four different optimizations in isolation, and quantify the impact of a dramatic reduction of the search space.

The remainder of the paper is structured as follows. Section 2 summarizes the related work on tensor transpositions. In Section 3 we introduce TTC, with a special focus on its optimizations. In Section 4 we present a thorough performance analysis of TTC's generated implementations—showing both the attained peak bandwidth as well as the speedups over a realistic baseline implementation. Finally, Section 5 concludes our findings and outlines future work.

2. RELATED WORK

McCalpin et al. [McCalpin and Smotherman 1995] realized that search is necessary for high-performance 2D transpositions as early as 1995. Their code-generator explored the optimization space in an exhaustive fashion.

Mateescu et al. [Mateescu et al. 2012] developed a cache model for IBM's Power7 processor. Their optimizations include blocking, prefetching and data alignment to avoid conflict misses. They also illustrate the effect of large TLB6 page sizes on performance.

Lu et al. [Lu et al. 2006] developed a code-generator for 2D transpositions using both an analytical model and search. They carried out extensive work covering vectorization and blocking for both the L1 cache and the TLB, while parallelization was not explored. Vladimirov [Vladimirov 2013] presented his research on in-place, square transpositions on Intel Xeon and Intel Xeon Phi processors.

Chatterjee et al. [Chatterjee and Sen 2000] investigated the effect of cache and TLB misses on performance for square, in-place, 2D transpositions. Among other things, they concluded that "the limited number of TLB entries can easily lead to thrashing" and that "hierarchical non-linear layouts are inherently superior to the standard canonical layouts".

6 The translation lookaside buffer, or TLB, serves as a cache for the expensive virtual-to-physical memory address translation required to convert software memory addresses to hardware memory addresses.



While there has been a lot of research targeting 2D transpositions [McCalpin and Smotherman 1995; Mateescu et al. 2012; Goldbogen 1981; Vladimirov 2013; Chatterjee and Sen 2000; Lu et al. 2006] and 3D transpositions [Jodra et al. 2015; Ding 2001; van Heel 1991], higher-dimensional transpositions have not yet received the same level of attention.

Ding et al. [Ding 2001; He and Ding 2002] present an algorithm dedicated to multidimensional in-place transpositions. Their approach is optimal with respect to the number of bytes moved. However, their results suggest that this approach does not yield good performance if the position of the stride-1 index changes (i.e., π(i1) ≠ i1).

The work of Wei et al. [Wei and Mellor-Crummey 2014] is probably the most complete study of multidimensional transpositions so far. Their code-generator, which "uses exhaustive global search", explores blocking, in-cache buffers to avoid conflict misses [Gatlin and Carter 1999], loop unrolling, software prefetching and vectorization. The generated code achieves a significant percentage of the system's memcpy bandwidth on an Intel Xeon and an IBM Power7 node for cache-aligned transpositions. However, a parallelization approach is not described, and different loop orders are not considered.

Lyakh et al. [Lyakh 2015] designed a generic multidimensional transpose algorithm and evaluated it across different architectures (e.g., Intel Xeon, Intel Xeon Phi, AMD and NVIDIA K20X). In contrast to our approach, theirs does not rely on search. Their results suggest that on both the Xeon Phi and the NVIDIA architectures, there still is a significant performance gap between their transposition algorithm and a direct copy.

3. TENSOR TRANSPOSE COMPILER

TTC is a domain-specific compiler for tensor transpositions of arbitrary dimension. It is written in Python and generates high-performance C++ code.7 For a given permutation8 and tensor size, TTC explores a search space of possible implementations. These implementations differ in properties which have a direct effect on performance (e.g., loop order, prefetch distance, blocking); each of these properties is discussed in the following sections. Henceforth, we use the term 'candidate' for all these implementations; similarly, we use the term 'solution' to denote the fastest (i.e., best performing) candidate.

To reduce the compilation time, TTC allows the user to limit the number of candidates to explore; by default, TTC explores up to 200 candidates. Unless the user specifically wants to explore the entire search space exhaustively, TTC applies heuristics to prune the search space and to identify promising candidates that are expected to yield high performance. A good heuristic should be generic enough to be applicable on different architectures, and it should prune the search space to a degree such that the remaining implementations can be evaluated exhaustively. Once the search space has been pruned, TTC generates the C++ routines for all the remaining candidates, which are compiled and timed.

The minimum required input to the compiler is the actual permutation and the size of each index; several additional arguments to pass extra information and to guide the code generation process can be supplied on the command line; a subset of the input arguments is listed in Table I, and an example invocation is sketched below. For instance, by using the --lda and --ldb arguments, it is possible to transpose tensors that are a portion of larger tensors. This feature is particularly interesting because it enables TTC to generate efficient packing routines of scattered data elements into contiguous buffers; such routines are frequently used in dense linear algebra algorithms such as matrix-matrix multiplication [Low et al. 2015].

7 Changing TTC to generate C code instead of C++ is trivial.
8 We use the terms permutation and transposition interchangeably.


Argument                           Description
--perm=<...>,<...>,...             permutation∗
--size=<...>,<...>,...             size of each index∗
--dataType=<...>                   data type of A and B (default: s)
--alpha=<...>                      alpha (default: 1.0)
--beta=<...>                       beta (default: 0.0)
--lda=<...>,<...>,...              leading dimension of each index of A
--ldb=<...>,<...>,...              leading dimension of each index of B
--prefetchDistances=<...>,...      allowed prefetch distances
--no-streaming-stores              disable non-temporal stores (default: off)
--blockings=<...>x<...>,...        block sizes to be explored
--maxImplementations=<...>         max #implementations (default: 200)

Table I: TTC's command-line arguments. Required arguments are marked with a ∗.

Moreover, TTC supports single-, double-, single-complex and double-complex data types for both the input A and the output B—via --dataType. TTC is also able to generate mixed-precision transpositions—where A and B are of different data types (e.g., --dataType=sd denotes that A uses single-precision while B uses double-precision). This feature is again especially interesting in the context of linear algebra libraries, since it allows one to implement mixed-precision routines effortlessly. Furthermore, the user can guide the search by choosing certain blockings, prefetch distances or loop orders—and thereby reduce the search space. With the argument --maxImplementations, the user influences the compile time by imposing a maximum size on the search space; in the extreme case in which this flag is set to one, the solution is returned without performing any search. On the other hand, setting --maxImplementations=-1 effectively disables the heuristics and instructs TTC to explore the search space exhaustively.
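As an illustration, generating a single-precision 3D transposition could look as follows; the permutation and sizes are made-up example values, and the executable name ttc reflects the repository's front-end rather than anything specified in this article:

    ttc --perm=1,0,2 --size=1000,768,16 --dataType=s --beta=0.0 --maxImplementations=200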

9 Notice that merging two indices im and im+1 is only possible if ld(im+1) = size(im) × ld(im) holds, with size(i) and ld(i) respectively denoting the size and the leading dimension of a given index i.


A flowchart outlining the stages (1)–(9) of TTC is shown in Fig. 2. (1) To reduce complexity, TTC starts off by merging indices, whenever possible, in the input and output tensor. For instance, given the permutation Π(i1, i2, i3) = (i2, i3, i1), the indices i2 and i3 are merged into a new 'super index' ĩ2 := (i2, i3) of the same size as the combined indices (i.e., size(ĩ2) = size(i2) × size(i3)); as a consequence, the permutation becomes Π(i1, ĩ2) = (ĩ2, i1).9 Next, TTC queries a local SQL database of known/previous solutions, to check whether a solution for the input transposition already exists; if so, no generation takes place, and the previous solution is returned. Otherwise, the code generation proceeds as follows: (2) one of the possible blockings is chosen, (3) a loop order is selected, and (4) other optimizations (e.g., software prefetching, streaming stores) are set. The combination of the chosen blocking, the loop order and the optimizations uniquely identifies a candidate. After these steps, (5) an estimated cost for the current candidate is calculated; this cost is used to determine whether the current candidate should be (6) added to the queue of candidates, or whether it should be neglected. The aforementioned input argument --maxImplementations determines the capacity of this queue. The loop starting and ending at (2) is repeated as long as different combinations of blockings, loop orders and optimizations are still possible. Once all candidates have been generated, they are (7) compiled by an external C++ compiler and (8) timed. Finally, (9) the best candidate (i.e., the solution) is selected and stored to a .cpp/.h file, and its timing information as well as its properties (e.g., blocking, loop order) are saved in the SQL database for future reference.

(Figure: flowchart of stages (1)–(9), from index merging and the database lookup through candidate generation, compilation and timing, to storing the fastest candidate.)
Fig. 2: Schematic overview of our tensor transpose compiler TTC; vectorization and parallelization are always enabled.

Fig. 3: Two exemplary 3D transpositions. (a) Case 1: π(i1) = i1. (b) Case 2: π(i1) ≠ i1.

For each and every transposition, TTC explores the following tuning opportunities:

— Explicit vectorization (Section 3.1)
— Blocking (Section 3.2)
— Loop-reordering (Section 3.3)
— Software prefetching (Section 3.4)
— Parallelization (Section 3.5)
— Non-temporal stores, if applicable

The following sections discuss these optimizations in greater detail. For the remainder of this article, let w denote the vector width, in elements, for any given precision and architecture (e.g., w = 8 for single-precision calculations on an AVX-enabled architecture).

3.1. Vectorization

With respect to vectorization, we distinguish two different cases: either the stride-1 index (i.e., the leftmost index) of the input and output tensors remains the same (see Fig. 3a), or it does not (see Fig. 3b). To achieve optimal performance, these cases require significantly different implementations.

3.1.1. Case 1: π(i1) = i1. When the stride-1 index does not change, vectorization is straightforward. In this case, the transposition moves a contiguous chunk of memory



(i.e., the first column) of the input/output tensor at once, as opposed to a single element. Hence, the operation is essentially a series of nested loops around a memcopy of the size of the first column. In terms of the memory access pattern, this scenario is especially favorable because of the available spatial data locality. Since the vectorization in this case does not require any in-register transpositions and merely boils down to a couple of vectorized loads and stores, we leave the vectorization to the compiler, which in this specific scenario is expected to yield good performance.
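As an illustrative sketch (our own simplification, not TTC's generated code), consider Π(i1, i2, i3) = (i1, i3, i2) with β = 0:

// pi(i1) = i1: the innermost loop is contiguous in both A and B, so each
// inner iteration amounts to a scaled copy of a full stride-1 column.
void transpose_132(const float *A, float *B,
                   int size0, int size1, int size2, float alpha)
{
   for (int i3 = 0; i3 < size2; ++i3)
      for (int i2 = 0; i2 < size1; ++i2)
         for (int i1 = 0; i1 < size0; ++i1)   // stride-1 in A and B
            B[i1 + i3*size0 + i2*size0*size2] =
               alpha * A[i1 + i2*size0 + i3*size0*size1];
}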

3.1.2. Case 2: π(i1) ≠ i1. In order to take full advantage of the SIMD capabilities of modern processors, this second case requires a more sophisticated approach. Without loss of generality, let us assume that the index ib, with ib ≠ i1, will become the stride-1 index in B (e.g., ib = i3 in Fig. 3b). Accesses to B are contiguous in memory for successive values of ib, while accesses to A are contiguous for successive values of i1. Full vectorization is achieved by unrolling the i1 and ib loops by multiples of w elements, giving rise to a w × w transpose. Henceforth, we refer to such a w × w tile as a micro-tile. The transposition of a micro-tile is fully vectorized by using an in-register transposition.10 Using this scheme, an arbitrary-dimensional out-of-place tensor transposition is reduced to a series of independent two-dimensional w × w transpositions, each of which accesses w runs of w consecutive elements in both A and B.
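The in-register transposition used by TTC is emitted by a separate generator (see footnote 10) and is not listed in this article. Purely as an illustration of what such a micro-kernel can look like, the following is the widely known AVX unpack/shuffle/permute sequence for an 8×8 single-precision tile (α/β scaling omitted for brevity; the function name is ours):

#include <immintrin.h>

// 8x8 in-register transpose of floats: after the call,
// B[c*ldb + r] == A[r*lda + c] for all r, c in 0..7.
static inline void transpose8x8_avx(const float *A, int lda,
                                    float *B, int ldb)
{
   __m256 r0 = _mm256_loadu_ps(A + 0*lda), r1 = _mm256_loadu_ps(A + 1*lda),
          r2 = _mm256_loadu_ps(A + 2*lda), r3 = _mm256_loadu_ps(A + 3*lda),
          r4 = _mm256_loadu_ps(A + 4*lda), r5 = _mm256_loadu_ps(A + 5*lda),
          r6 = _mm256_loadu_ps(A + 6*lda), r7 = _mm256_loadu_ps(A + 7*lda);

   // Stage 1: interleave pairs of rows at 32-bit granularity.
   __m256 t0 = _mm256_unpacklo_ps(r0, r1), t1 = _mm256_unpackhi_ps(r0, r1),
          t2 = _mm256_unpacklo_ps(r2, r3), t3 = _mm256_unpackhi_ps(r2, r3),
          t4 = _mm256_unpacklo_ps(r4, r5), t5 = _mm256_unpackhi_ps(r4, r5),
          t6 = _mm256_unpacklo_ps(r6, r7), t7 = _mm256_unpackhi_ps(r6, r7);

   // Stage 2: interleave 64-bit pairs within each 128-bit lane.
   __m256 s0 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(1,0,1,0)),
          s1 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(3,2,3,2)),
          s2 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(1,0,1,0)),
          s3 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(3,2,3,2)),
          s4 = _mm256_shuffle_ps(t4, t6, _MM_SHUFFLE(1,0,1,0)),
          s5 = _mm256_shuffle_ps(t4, t6, _MM_SHUFFLE(3,2,3,2)),
          s6 = _mm256_shuffle_ps(t5, t7, _MM_SHUFFLE(1,0,1,0)),
          s7 = _mm256_shuffle_ps(t5, t7, _MM_SHUFFLE(3,2,3,2));

   // Stage 3: exchange the 128-bit lanes and store the transposed rows.
   _mm256_storeu_ps(B + 0*ldb, _mm256_permute2f128_ps(s0, s4, 0x20));
   _mm256_storeu_ps(B + 1*ldb, _mm256_permute2f128_ps(s1, s5, 0x20));
   _mm256_storeu_ps(B + 2*ldb, _mm256_permute2f128_ps(s2, s6, 0x20));
   _mm256_storeu_ps(B + 3*ldb, _mm256_permute2f128_ps(s3, s7, 0x20));
   _mm256_storeu_ps(B + 4*ldb, _mm256_permute2f128_ps(s0, s4, 0x31));
   _mm256_storeu_ps(B + 5*ldb, _mm256_permute2f128_ps(s1, s5, 0x31));
   _mm256_storeu_ps(B + 6*ldb, _mm256_permute2f128_ps(s2, s6, 0x31));
   _mm256_storeu_ps(B + 7*ldb, _mm256_permute2f128_ps(s3, s7, 0x31));
}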

3.2. Blocking

In addition to the w × w micro-tiles, we introduce a second level of blocking to further increase locality. The idea is to combine multiple micro-tiles into a so-called macro-tile of size bA × bB, where bA and bB correspond to the blocking in the stride-1 index of A and B, respectively.11 This approach is illustrated in Fig. 4.

(Figure: the macro-tiles of A and B, each composed of 2×2 micro-tiles of width w.)
Fig. 4: Overview of the blocking mechanism for the permutation Π(i1, i2, i3) = (i2, i1, i3), with bA = bB = 2w.

By default, TTC explores the search space of all macro-tiles of size bA × bB with bA, bB ∈ {w, 2w, 3w, 4w}. The flexibility of supporting multiple sizes of macro-tiles has several desirable advantages: first, it enables TTC to adapt to different memory systems, which might favor contiguous writes (e.g., 16 × 32) over contiguous reads (e.g., 32 × 16) and vice versa; second, it implicitly exploits architectural features such as adjacent-cacheline prefetching and the cacheline size (e.g., 16 elements for single precision on modern CPUs); finally, it reduces false sharing of cache lines between different threads in a parallel setting. A macro-tile is simply processed as a grid of micro-tiles, as sketched below.
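A minimal sketch of that decomposition, assuming w = 8 and the illustrative transpose8x8_avx kernel from Section 3.1.2 (TTC's generated code additionally handles α, β and remainder elements):

// Process a bA x bB macro-tile as (bA/w) x (bB/w) micro-tiles (w = 8).
// ia runs along the stride-1 index of A, ib along the stride-1 index of B;
// lda and ldb are the corresponding leading dimensions.
void transpose_macro_tile(const float *A, int lda, float *B, int ldb,
                          int bA, int bB)
{
   const int w = 8;
   for (int ia = 0; ia < bA; ia += w)
      for (int ib = 0; ib < bB; ib += w)
         transpose8x8_avx(A + ia + ib*lda, lda,   // w x w in-register
                          B + ib + ia*ldb, ldb);  // transpose per tile
}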

10 This in-register transposition—written in AVX intrinsics—is automatically generated by another code generator of ours and will be the topic of a later publication.
11 In the case of π(i1) = i1, such a blocking would not make sense; hence, the blocking takes place along the second index of A and B. For instance, for Bi1,i3,i2 ← Ai1,i2,i3, the blocking involves i2 and i3.



If desired by the user, TTC can effectively prune the search space and only evaluate the performance of a subset of the available tiles. We designed a heuristic which ranks the blockings for a given transposition and size. Specifically, the blocking is chosen such that (1) bA and bB are both multiples of the cacheline size (in elements), and that (2) the remainder ri = Si1 mod bi, i ∈ {A, B}, with Si1 being the size of the stride-1 index of tensor i, is minimized. The quality of this heuristic is demonstrated in Section 4.4.

3.3. Loop order

As Fig. 1 already suggested, the choice of the proper loop order has a significant influence on performance. Since the number of available orderings for a tensor with d dimensions is d!, determining the best loop order by exhaustive search is expensive even for modest values of d.

Our heuristic to choose the loop order is designed to increase data locality in both A and B. This strategy fulfills multiple purposes: (1) it reduces cache and TLB misses, and (2) it reduces the stride within the innermost loop. The latter is especially important because large strides can prevent modern hardware prefetchers from learning the memory access patterns. For instance, the maximal stride supported by hardware prefetchers of Intel Sandy Bridge CPUs is limited to 2 KiB [Intel Corporation 2015]. Other aspects of the hardware implementation affect the cost of different loop orders; for example, the write-through cache policy of the IBM Blue Gene/Q architecture makes it extremely important to exploit write locality, since writing to a cache line evicts it from cache [Finkel 2015]. The reader interested in further details on this heuristic is referred to the available source code at www.github.com/HPAC/TTC; a toy version of such a cost function is sketched below.
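The actual ranking heuristic lives in the TTC sources; purely as a rough illustration of the idea (our own simplification, not TTC's exact cost function), the following sketch enumerates all d! loop orders and scores each one by penalizing large strides in the inner loops:

#include <algorithm>
#include <numeric>
#include <vector>

// Toy locality score for one loop order (order[0] = innermost loop).
// strideA/strideB hold the stride, in elements, of each index in A and B;
// inner loops are weighted more heavily than outer ones.
static double loop_order_cost(const std::vector<int> &order,
                              const std::vector<long> &strideA,
                              const std::vector<long> &strideB)
{
   double cost = 0.0, weight = 1.0;
   for (int idx : order) {                   // innermost to outermost
      cost += weight * (double)(strideA[idx] + strideB[idx]);
      weight *= 0.5;                         // outer loops matter less
   }
   return cost;
}

// Enumerate all d! loop orders and return the best-scoring one.
std::vector<int> best_loop_order(const std::vector<long> &strideA,
                                 const std::vector<long> &strideB)
{
   const int d = (int)strideA.size();
   std::vector<int> order(d);
   std::iota(order.begin(), order.end(), 0);  // 0, 1, ..., d-1
   std::vector<int> best = order;
   double bestCost = loop_order_cost(order, strideA, strideB);
   while (std::next_permutation(order.begin(), order.end())) {
      double c = loop_order_cost(order, strideA, strideB);
      if (c < bestCost) { bestCost = c; best = order; }
   }
   return best;
}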

3.4. Software Prefetching

Software prefetching is only enabled for the case of π(i1) ≠ i1; indeed, the memory access pattern for π(i1) = i1 is so regular that it should be easily caught by the hardware prefetcher.

We designed the software prefetching to operate on micro-tiles; hence, a given prefetch distance d has the same meaning irrespective of the chosen macro-blocking. The prefetching mechanism is depicted in Fig. 5. TTC always prefetches entire w × w micro-tiles. Before transposing the current micro-tile j, TTC prefetches the micro-tile which is at distance d ahead of the current tile. This is illustrated by the colors in Fig. 5, where the macro-tile contains n = 4 micro-tiles: before processing the orange w × w block of the current macro-tile Ai, TTC already prefetches the corresponding tile of the macro-tile Ap, with p = i + ⌊(j + d)/n⌋ (i.e., Ai+1 in Fig. 5).

Fig. 5: Overview of the software prefetching mechanism for a distance d = 5. Arrows denote the order in which the micro-tiles are being processed.
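A sketch of this mechanism with the _mm_prefetch intrinsic; the computation of the address next of the micro-tile lying d tiles ahead is omitted, and the hint choice is our assumption, not TTC's exact code:

#include <immintrin.h>

// Prefetch one w x w micro-tile (w = 8): one cache-line touch per row.
static inline void prefetch_micro_tile(const float *next, int lda)
{
   for (int r = 0; r < 8; ++r)
      _mm_prefetch((const char*)(next + r*lda), _MM_HINT_T2);
}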


3.5. Parallelization

//variable declaration
...

//main loops
#pragma omp parallel
#pragma omp for collapse(3) schedule(static)
for(int i3 = 0; i3 < size2 - 15; i3+= 16)
   for(int i1 = 0; i1 < size0 - 31; i1+= 32)
      for(int i2 = 0; i2 < size1; i2+= 1)
         transpose32x16(&A[i1 + i2*lda1 + i3*lda2], lda2,
                        &B[i1*ldb2 + i2*ldb1 + i3], ldb2, alpha, beta);
//remainder loops
...

Listing 1: Parallel code generated by TTC for the permutation (i3, i2, i1).

Listing 1 contains the generated parallel code12 for a given tensor transposition with software prefetching disabled. The collapse clause increases the available parallelism and improves load balancing among the threads. Each thread has to process roughly the same number of bA × bB tiles.

A detailed description of the parallelization of the prefetching algorithm is beyond the scope of this paper. However, the overall idea of prefetching the blocks remains identical to the scalar version (see previous section). The only difference is that each thread has to keep track of the tiles it will access in the near future in a local data structure; for this task, we use a queue of tiles. The interested reader is again referred to the source code for further details.

4. PERFORMANCE EVALUATION

We evaluate the performance of TTC on two different systems, Intel and AMD. The Intel system consists of two Intel Xeon E5-2680 v3 CPUs (with 12 cores each) based on the Haswell microarchitecture. For all measurements, ECC is enabled, and both Intel Speedstep and Intel TurboBoost are disabled. The compiler of choice is Intel icpc 15.0.4 with flags -O3 -openmp -xhost. Unless otherwise mentioned, this is the default configuration and system for the experiments. The AMD system consists of a single AMD A10-7850K APU with 4 cores based on the Steamroller microarchitecture. The compiler for this system is gcc 5.3 with flags -O3 -fopenmp -march=native.

All measurements are based on 24 threads and 4 threads for the Intel and AMD system, respectively (i.e., one thread per physical core). Experimental results suggest that optimal performance is attained with one thread per physical core. We also experimented with thread affinity on both systems.13

The reported bandwidth BW(x) for solution x is computed as

    BW(x) = (3 × S) / (2^30 × Time(x)) GiB/s,    (2)

where S is the size in bytes of the tensor; the prefactor 3 is due to the fact that, since in all our measurements B is updated (i.e., β ≠ 0, see Eqn. 1), one has to account for reading B as well.

The memory bandwidth reported by the STREAM benchmark [McCalpin 1995] for the Intel (AMD) system is 105.6 (12.2), 105.9 (12.2), 111.6 (13.1) and 112.2 (13.0) GiB/s

12 We do not present the code that handles the remainder generated when bA or bB do not evenly divide the leading dimension of A or B.
13 On the Intel systems, setting the affinity to KMP_AFFINITY=granularity=fine,compact,1,0 (i.e., hyperthreading will not be used) yields optimal results. The performance on the AMD system, on the other hand, was not sensitive to the thread affinity as long as the threads were not pinned to the same physical core.


for the copy (b ← a), scale (b ← αa), add (c ← a + b) and triad (c ← αa + b) test cases, respectively.
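In code, Eqn. 2 is a one-liner; a sketch, with S in bytes and Time(x) in seconds:

// Eqn. 2: 3*S bytes cross the memory interface (read A, read B, write B),
// since B is updated (beta != 0); 2^30 converts bytes to GiB.
double bandwidth_gib(double S_bytes, double time_s)
{
   return 3.0 * S_bytes / (1073741824.0 * time_s);
}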

4.1. Transposition Benchmark

To evaluate the performance of multidimensional transpositions across a range of possible use-cases, we designed a synthetic benchmark. The benchmark comprises 19 different transpositions,14 chosen so that no indices can be merged. For each tensor dimension (2D–6D), we included the inverse permutation (e.g., Bi3,i2,i1 ← Ai1,i2,i3, Bi4,i3,i2,i1 ← Ai1,i2,i3,i4, ...), and exactly one permutation for which the stride-1 index does not change. These two scenarios typically cover both ends of the spectrum, yielding the worst and the best performance, respectively. In the benchmark, each transposition is evaluated in three different configurations—for a total of 57 test cases: one where all indices are of roughly the same size, and two where the tensors have a ratio of 6 between the largest and smallest index. The desired volume of the tensors across all dimensions is roughly the same and can be chosen by the user; in our experiments, we fixed it to 200 MiB, which is bigger than any L3 cache in use today. The benchmark is publicly available at www.github.com/HPAC/TTC/benchmark.

4.2. TTC-generated code

We now present the performance of the fastest implementations—generated by TTC—for all transpositions in the benchmark. Furthermore, we analyze the influence of the individual optimizations (blocking, loop-reordering, software prefetching, and explicit vectorization) on the performance.

Fig. 6 illustrates the attained bandwidth and speedup across the benchmark for the Intel and AMD systems. The speedups are measured over a reference routine consisting of N perfectly nested loops annotated with both #pragma omp parallel for collapse(N-1) on the outermost loop, and #pragma omp simd on the innermost loop. Moreover, the loop order for the reference version is chosen such that the output tensor B is accessed in a perfectly linear fashion; this loop order reduces false sharing of cache lines between the threads. With this setup, the compiler is assisted as much as possible to yield a competitive routine.

Figs. 6a and 6c show the attained bandwidth of the TTC-generated solutions across the benchmark for the Intel and AMD system, respectively. The transpositions are classified in three categories—stride-1, inverse, and general—respectively denoting those permutations in which the first index does not change, the inverse permutations, and those transpositions which do not fall into either of the previous two categories. In addition to the bandwidth, these figures also report the STREAM-triad bandwidth (solid green line), as well as the bandwidth of a SAXPY (i.e., a single-precision vector-vector addition of the form y ← αx + y, with α ∈ R and x, y ∈ R^n; solid black line). The figures illustrate that TTC achieves a significant fraction of the SAXPY bandwidth on both architectures (the average across the entire benchmark is 91.68% and 78.30% for the Intel and AMD systems, respectively). With the exception of some performance outliers on the AMD system, the performance of TTC is stable across the entire benchmark.

It is interesting to note that on the AMD system, TTC attains much higher bandwidth than the STREAM-triad benchmark. This phenomenon is due to the fact that the STREAM benchmark does not account for the write-allocate traffic.

14 One 2D transposition, three 3D, and five each for 4D, 5D and 6D.



(Figure: four panels, (a) Intel: Bandwidth, (b) Intel: Speedup, (c) AMD: Bandwidth, (d) AMD: Speedup, plotting Bandwidth [GiB/s] and Speedup for the 57 test cases, with stride-1/general/inverse markers and SAXPY and STREAM Triad reference lines.)
Fig. 6: TTC. Bandwidth and speedup of TTC's fastest solution across the benchmark for the Intel and AMD system. The vertical lines identify the dimensionality of the tensors.

In terms of speedups over the reference implementation, TTC achieves considerable results on both systems (see Figs. 6b and 6d). When looking closely at the plots, it becomes apparent that the inverse permutations benefit the most from TTC, attaining speedups of up to 8.84× and 18.15× on the Intel and AMD system, respectively. While the speedups for the stride-1 transpositions are much smaller, they can still be as high as 1.66×. For general transpositions, the speedups range from 1.02× to 8.71×, and from 1.28× to 19.44×, for the Intel and AMD system, respectively.

To gain insights on where the performance gains come from, in Fig. 7 we report the speedup due to each of TTC's optimizations separately. For each test case from the benchmark, the speedup is measured as the bandwidth of the fastest candidate without the particular optimization over the fastest candidate with the particular optimization enabled, while all other optimizations are still enabled.15

15 Note that the remaining parameters of the optimal solutions are allowed to change (e.g., the fastest solution without software prefetching might require a blocking of 8×8 while a blocking of 16×16 is optimal for solutions with software prefetching enabled).


(Figure: four panels of Speedup (1.00–1.40) versus tensor dimension (2–6), each showing max, min and avg curves: (a) Blocking, (b) Software prefetching, (c) Loop-reordering, (d) Explicit vectorization.)
Fig. 7: Breakdown of the speedups for the Intel system.

Fig. 7a presents the speedup that can be gained over a fixed 8×8 blocking. The optimal blocking results in up to a 35% performance increase and motivates our search in this dimension. Software prefetching (Fig. 7b) also yields a noticeable speedup of up to 11%. In contrast to the high speedups gained by loop-reordering shown in Section 1, TTC only exhibits much smaller ones (see Fig. 7c). The reason for this behaviour is twofold: first, we chose a good loop order for the reference implementation (i.e., the loop order for which the output tensor B is accessed in a linear fashion); second, some of the drawbacks of a suboptimal loop order might be mitigated by the other optimizations—especially blocking. Even then, an additional search yields speedups of up to 22% over the reference loop order. Despite the fact that transpositions are memory bound, we see an appreciable speedup by implementing an in-register transpose via AVX intrinsics (see Fig. 7d). The speedup for vectorization is obtained by replacing our explicitly vectorized 8×8 micro-kernel with a scalar implementation (i.e., two perfectly nested loops with the loop trip counts being fixed to 8). While the reference implementation is also vectorized by the compiler, the compiler fails to find an in-register transpose implementation.

All in all, we see that each optimization has a positive effect on the attained bandwidth; the combination of all these optimizations results in significant speedups over modern compilers.

4.3. Reduction of the search space

We discuss the possibility of lowering the compilation time by reducing the search space, i.e., by identifying "universal" settings that yield nearly optimal performance.

Intuitively, the optimal prefetch distance (for software prefetching) should only depend on the latency to the main memory, and thus be independent of the actual


(Figure: Efficiency (0.90–1.00) versus prefetch distance (0–8), with min, max and avg curves.)
Fig. 8: Minimum, average and maximum efficiency for a fixed prefetch distance across the benchmark.

transposition (see [Lee et al. 2012] for details). This observation motivates us to seek a "universal" prefetch distance which would reduce TTC's search space.

To evaluate the performance of TTC for a fixed prefetch distance d, we introduce the concept of efficiency. Let Ct be the set of all candidates for the tensor transposition t, x a particular candidate implementation, and dx the prefetch distance used by candidate x; furthermore, let BW(x) be the bandwidth attained by candidate x, and BWt^max the maximum bandwidth among all candidates for transposition t. Then the efficiency E(d, t)—which quantifies the loss in performance one would experience if the prefetch distance were fixed to d—is defined as

    E(d, t) = max { BW(x) / BWt^max : x ∈ Ct, dx = d }.    (3)

The efficiency is bounded from above and below by 1.0 and 0.0, respectively—with 1.0 being the optimum. Fig. 8 presents the maximum, minimum and average efficiency across all the transpositions of the benchmark as a function of the prefetch distance. We notice that (1) there is at least one transposition within the benchmark for which the influence of software prefetching is negligible (see the leftmost datapoint of the max curve), (2) for each fixed prefetch distance d, there is at least one transposition for which d is suboptimal (see the min curve), (3) both the minimum and the average efficiency increase with d, and (4) once d is "large enough", the efficiency does not improve much.

Quantitatively, a prefetch distance greater than or equal to five increases the average and the minimum efficiency across the benchmark to more than 99% and roughly 98%, respectively. Hence, fixing d to any value between 5 and 8 is a good choice for the given system, effectively reducing the search space by a factor of 9, without introducing a performance penalty.
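Eqn. 3 translates directly into code; a sketch, assuming the measured bandwidth and prefetch distance of every candidate in Ct are at hand:

#include <algorithm>
#include <vector>

struct Candidate { int prefetchDistance; double bandwidth; };

// E(d, t): best bandwidth among candidates with prefetch distance d,
// relative to the best bandwidth over all candidates of transposition t.
double efficiency(const std::vector<Candidate> &Ct, int d)
{
   double bwMax = 0.0, bwD = 0.0;
   for (const Candidate &x : Ct) {
      bwMax = std::max(bwMax, x.bandwidth);
      if (x.prefetchDistance == d) bwD = std::max(bwD, x.bandwidth);
   }
   return bwD / bwMax; // in [0, 1]; 1.0 means no loss at distance d
}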

4.4. Quality of Heuristics
On our Intel system, TTC evaluated roughly 8 candidates per second across the whole benchmark (for tensors of size 200 MB); this includes all the necessary steps, from code generation to compilation and measurement. If a solution has to be generated in a short period of time (i.e., the search space needs to be pruned efficiently), the quality of the heuristics that choose a proper loop order and blocking becomes especially important.




Fig. 9: Speedup over the reference implementation with the number of TTC-generated candidates limited to 1, 10, 100 and ∞ (i.e., no limit).

Fig. 9 presents the speedups that TTC achieves if the user limits the number of generated candidates to 1, 10 and 100, and when no limit is given (i.e., ∞); the latter reflects the same results as in Fig. 6b and 6d. When the number of generated candidates is limited to one, no search takes place, i.e., the loop order and the blocking for the generated implementation are determined solely by our heuristics. Even in this extreme case, TTC still exhibits remarkable speedups over modern compilers. Specifically, with a search space of 1, 10 and 100 candidates, TTC achieves 94.58%, 97.35% and 99.10% of the performance of the unlimited search (averaged across the whole benchmark). In other words, one can rely on the heuristics for the most part and resort to search only in scenarios in which even the last few bits of performance are critical. In those cases, we observe that a search space of 100 candidates already yields results within 1% of those obtained with an exhaustive search.
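The candidate limit can be pictured as follows: rank all candidates by the heuristics, keep only the most promising N, and time just those. The sketch below uses our own naming and deliberately hides TTC's code-generation and timing machinery behind a caller-supplied callback.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// One candidate implementation; loop order, blocking and prefetch distance
// are omitted for brevity. 'heuristicScore' is the rank assigned by the
// heuristics (higher = more promising).
struct Candidate {
    double heuristicScore;
    double measuredGiBs = 0.0;
};

// Keep only the 'limit' highest-ranked candidates (e.g., 1, 10 or 100 as in
// Fig. 9) and measure just those. 'measure' stands for the generate-compile-
// run step, which took roughly 1/8 s per candidate on our Intel system.
Candidate limitedSearch(std::vector<Candidate> candidates, std::size_t limit,
                        const std::function<double(const Candidate&)>& measure) {
    std::sort(candidates.begin(), candidates.end(),
              [](const Candidate& a, const Candidate& b) {
                  return a.heuristicScore > b.heuristicScore;
              });
    if (candidates.size() > limit)
        candidates.resize(limit);  // limit == 1 means no search at all

    Candidate best = candidates.front();  // assumes a non-empty candidate set
    for (Candidate& c : candidates) {
        c.measuredGiBs = measure(c);
        if (c.measuredGiBs > best.measuredGiBs)
            best = c;
    }
    return best;
}
```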

5. CONCLUSION AND FUTURE WORK
We presented TTC, a compiler for multidimensional tensor transpositions. By deploying various optimization techniques, TTC significantly outperforms modern compilers, and achieves nearly optimal memory bandwidth. We investigated the source of the performance gain and illustrated the individual impact of blocking, software prefetching, explicit vectorization and loop-reordering. Furthermore, we showed that the heuristics used by TTC efficiently prune the search space, so that the remaining candidates are easily ranked via exhaustive search.
For the future, should compilation time become a concern, iterative compilation techniques should be considered [Kisuki et al. 2000; Knijnenburg et al. 2002]. While this current work focused on architectures using the AVX instruction set, TTC is designed to accommodate other instruction sets; in general, the effort to port TTC to a new architecture is related to optimizations such as explicit vectorization and software prefetching. As a next step, we plan to support the AVX512 instruction set (e.g., used by Intel's upcoming Knights Landing microarchitecture), ARM and IBM Power CPUs, as well as NVIDIA GPUs. Finally, we will be using TTC as a building block for our upcoming Tensor Contraction Compiler.


ACKNOWLEDGMENT
Financial support from the Deutsche Forschungsgemeinschaft (DFG) through grant GSC 111 is gratefully acknowledged.

REFERENCES
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, and others. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. (2015). https://www.tensorflow.org
Rodney J. Bartlett and Monika Musiał. 2007. Coupled-cluster theory in quantum chemistry. Reviews of Modern Physics 79, 1 (2007), 291–352.
Gerald Baumgartner, Alexander Auer, David E. Bernholdt, Alina Bibireata, Venkatesh Choppella, Daniel Cociorva, Xiaoyang Gao, Robert J. Harrison, So Hirata, Sriram Krishnamoorthy, and others. 2005. Synthesis of high-performance parallel programs for a class of ab initio quantum chemistry models. Proc. IEEE 93, 2 (2005), 276–292.
Siddhartha Chatterjee and S. Sen. 2000. Cache-efficient matrix transposition. (2000), 195–205. DOI:http://dx.doi.org/10.1109/HPCA.2000.824350
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitives for Deep Learning. CoRR abs/1410.0759 (2014). http://arxiv.org/abs/1410.0759
Edoardo Di Napoli, Diego Fabregat-Traver, Gregorio Quintana-Ortí, and Paolo Bientinesi. 2014. Towards an efficient use of the BLAS library for multilinear tensor contractions. Appl. Math. Comput. 235 (2014), 454–468. DOI:http://dx.doi.org/10.1016/j.amc.2014.02.051
Chris HQ Ding. 2001. An optimal index reshuffle algorithm for multidimensional arrays and its applications for parallel architectures. IEEE Transactions on Parallel and Distributed Systems 12, 3 (2001), 306–315.
Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and I. S. Duff. 1990. Algorithm 679: A Set of Level 3 Basic Linear Algebra Subprograms: Model Implementation and Test Programs. ACM Trans. Math. Softw. 16, 1 (March 1990), 18–28. DOI:http://dx.doi.org/10.1145/77626.77627
John Drake, Ian Foster, John Michalakes, Brian Toonen, and Patrick Worley. 1995. Design and performance of a scalable parallel community climate model. Parallel Comput. 21, 10 (1995), 1571–1591.
Hal Finkel. 2015. Optimizing for Blue Gene/Q. (2015). https://www.alcf.anl.gov/files/Optimizing for BGQ rev.pdf
Matteo Frigo and Steven G. Johnson. 2005. The Design and Implementation of FFTW3. Proc. IEEE 93, 2 (Feb 2005), 216–231. DOI:http://dx.doi.org/10.1109/JPROC.2004.840301
Kang Su Gatlin and Larry Carter. 1999. Memory hierarchy considerations for fast transpose and bit-reversals. In High-Performance Computer Architecture, 1999. Proceedings. Fifth International Symposium On. IEEE, 33–42.
Jürgen Gauss, John F. Stanton, and Rodney J. Bartlett. 1991. Coupled-cluster open-shell analytic gradients: Implementation of the direct product decomposition approach in energy gradient calculations. The Journal of Chemical Physics 95, 4 (1991), 2623–2638. DOI:http://dx.doi.org/10.1063/1.460915
Geoffrey C. Goldbogen. 1981. PRIM: A fast matrix transpose method. IEEE Trans. Software Eng. 7, 2 (1981), 255–257.
Jeff Hammond. 2009. Automatically tuned libraries for native-dimension tensor transpose and contraction algorithms. (2009). https://github.com/jeffhammond/spaghetty/blob/master/papers/old/spaghetty chapter v1.pdf
Robert J. Harrison, Gregory Beylkin, Florian A. Bischoff, Justus A. Calvin, George I. Fann, Jacob Fosso-Tande, Diego Galindo, Jeff R. Hammond, Rebecca Hartman-Baker, Judith C. Hill, Jun Jia, Jakob S. Kottmann, M.-J. Yvonne Ou, Laura E. Ratcliff, Matthew G. Reuter, Adam C. Richie-Halford, Nichols A. Romero, Hideo Sekino, William A. Shelton, Bryan E. Sundahl, W. Scott Thornton, Edward F. Valeev, Álvaro Vázquez-Mayagoitia, Nicholas Vence, and Yukina Yokoi. 2015. MADNESS: A Multiresolution, Adaptive Numerical Environment for Scientific Simulation. CoRR abs/1507.01888 (2015). http://arxiv.org/abs/1507.01888
Yun He and Chris HQ Ding. 2002. MPI and OpenMP paradigms on cluster of SMP architectures: the vacancy tracking algorithm for multi-dimensional array transposition. In Supercomputing, ACM/IEEE 2002 Conference. IEEE, 6–6.


So Hirata. 2003. Tensor Contraction Engine: Abstraction and Automated Parallel Implementation of Configuration-Interaction, Coupled-Cluster, and Many-Body Perturbation Theories. Journal of Physical Chemistry A 107 (2003), 9887–9897.
Intel Corporation. 2015. Intel 64 and IA-32 architectures optimization reference manual. (2015). http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
Jose L. Jodra, Ibai Gurrutxaga, and Javier Muguerza. 2015. Efficient 3D Transpositions in Graphics Processing Units. International Journal of Parallel Programming (2015), 1–16.
Toru Kisuki, Peter M. W. Knijnenburg, Michael F. P. O'Boyle, and Harry Wijshoff. 2000. Iterative compilation in program optimization. (2000), 35–44.
Peter M. W. Knijnenburg, Toru Kisuki, and Michael F. P. O'Boyle. 2002. Iterative compilation. In Embedded Processor Design Challenges. Springer, 171–187.
Tamara G. Kolda and Brett W. Bader. 2009. Tensor decompositions and applications. SIAM Review 51, 3 (2009), 455–500.
Karol Kowalski, Jeff R. Hammond, Wibe A. de Jong, Peng-Dong Fan, Marat Valiev, Dunyou Wang, and Niranjan Govind. 2011. Coupled Cluster Calculations for Large Molecular and Extended Systems. In Computational Methods for Large Systems: Electronic Structure Approaches for Biotechnology and Nanotechnology, Jeffrey R. Reimers (Ed.). Wiley.
Jaekyu Lee, Hyesoon Kim, and Richard W. Vuduc. 2012. When Prefetching Works, When It Doesn't, and Why. ACM Transactions on Architecture and Code Optimization 9, 1 (2012), Article 2.
Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. 2015. An Input-adaptive and In-place Approach to Dense Tensor-times-matrix Multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 76, 12 pages. DOI:http://dx.doi.org/10.1145/2807591.2807671
Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. Quintana-Ortí. 2015. Analytical Models for the BLIS Framework. ACM Trans. Math. Software (2015). Pending.
Qingda Lu, Sriram Krishnamoorthy, and P. Sadayappan. 2006. Combining analytical and empirical approaches in tuning matrix transposition. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques. ACM, 233–242.
Dmitry I. Lyakh. 2015. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. Computer Physics Communications 189 (2015), 84–91.
Igor L. Markov and Yaoyun Shi. 2008. Simulating quantum computation by contracting tensor networks. SIAM J. Comput. 38, 3 (2008), 963–981.
Gabriel Mateescu, Gregory H. Bauer, and Robert A. Fiedler. 2012. Optimizing matrix transposes using a POWER7 cache model and explicit prefetching. ACM SIGMETRICS Performance Evaluation Review 40, 2 (2012), 68–73.
John McCalpin and Mark Smotherman. 1995. Automatic benchmark generation for cache optimization of matrix operations. In Proceedings of the 33rd Annual Southeast Regional Conference. ACM, 195–204.
John D. McCalpin. 1995. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter (Dec. 1995), 19–25.
Dmitry Pekurovsky. 2012. P3DFFT: A framework for parallel computations of Fourier transforms in three dimensions. SIAM Journal on Scientific Computing 34, 4 (2012), C192–C209.
Robert N. C. Pfeifer, Jutho Haegeman, and Frank Verstraete. 2014. Faster identification of optimal contraction sequences for tensor networks. Physical Review E 90, 3 (2014), 033315.
Edgar Solomonik, Devin Matthews, Jeff R. Hammond, John F. Stanton, and James Demmel. 2014. A massively parallel tensor contraction framework for coupled-cluster computations. J. Parallel and Distrib. Comput. 74, 12 (2014), 3176–3190.
Marin van Heel. 1991. A fast algorithm for transposing large multidimensional image data sets. Ultramicroscopy 38, 1 (1991), 75–83.
Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. (2014). http://arxiv.org/abs/1412.7580
Andrey Vladimirov. 2013. Multithreaded transposition of square matrices with common code for Intel Xeon processors and Intel Xeon Phi coprocessors. (2013). https://www.researchgate.net/profile/Andrey Vladimirov/publication/258048110 Multithreaded Transposition of Square Matrices with Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors/links/00463526c2fa40a1f3000000.pdf


Lai Wei and John Mellor-Crummey. 2014. Autotuning Tensor Transposition. In Parallel & Distributed Processing Symposium Workshops (IPDPSW), IEEE International. IEEE, 342–351.
