Stripe: Tensor Compilation via the Nested Polyhedral Model

Tim Zerrell and Jeremy Bruestle (Intel)
[email protected], [email protected]

March 18, 2019

Abstract

Hardware architectures and machine learning (ML) libraries evolve rapidly. Traditional compilers often fail to generate high-performance code across the spectrum of new hardware offerings. To mitigate, engineers develop hand-tuned kernels for each ML library update and hardware upgrade. Unfortunately, this approach requires excessive engineering effort to scale or maintain with any degree of state-of-the-art performance. Here we present a Nested Polyhedral Model for representing highly parallelizable computations with limited dependencies between iterations. This model provides an underlying framework for an intermediate representation (IR) called Stripe, amenable to standard compiler techniques while naturally modeling key aspects of modern ML computing. Stripe represents parallelism, efficient memory layout, and multiple compute units at a level of abstraction amenable to automatic optimization. We describe how Stripe enables a compiler for ML in the style of LLVM that allows independent development of algorithms, optimizations, and hardware accelerators. We also discuss the design exploration advantages of Stripe over kernel libraries and schedule-based or schedule-space-based code generation.

1 Introduction

The industry is producing an explosion of varied and creative hardware accelerator architectures [6, 9, 14, 15, 22, 24, 36]. Designs tend to be optimized for specific goals, such as power efficiency or performance, and often breed even more specialized architectures for very targeted use cases, such as the MAERI [17]. To take advantage of these innovative designs, ML software must be optimized to target the specialized hardware.

1.1 Kernel Libraries

Algorithmic developments aimed at improving accuracy, training or inference performance, regularization, and more continue to progress rapidly [27, 35]. Nevertheless, these typically retain the same regularities that make specialized ML architectures feasible, and could in principle be efficiently run on ML accelerators. However, the traditional kernel library approach requires a kernel for each hardware-network architectural feature pair, creating a combinatorial explosion of optimization work that is infeasible with the rapid growth of both hardware and algorithm designs.

One way to achieve all these optimizations is to write an extensively customized kernel to account for each supported machine learning operation and each materially different input and output shape on each supported hardware platform. As new hardware architectures develop, new kernels must be written for all supported operations. If new operations are devised, they must be added to each hardware target. This coupling creates a maintenance nightmare; decoupling them via automatic code generation, where the hardware configuration is separate from both the optimization passes and the operations themselves, would be ideal. Unfortunately, general-purpose compilers are not aware of the regularities of ML workloads (as discussed in section 2), and so the required optimizations are intractable.

1.2 Compilers

LLVM [18] is a community-driven compiler infrastructure that demonstrates features we want in a compiler for ML workloads. Its suite of optimization and transformation passes applied to a consistent IR makes it possible to optimize workloads in a progressive and modular fashion.

Direct transformation of the IR allows for deeper changes than those available to a model maintaining some fixed structure along with a modifiable implementation or interpretation. Moreover, an excellent community of experts with varying priorities actively contributes to the ecosystem. Unfortunately, and despite these benefits, LLVM is still general purpose and cannot be directly used to compile high performance ML code.

Special-purpose optimization and compilation techniques have also been developed. Loop nest transformation and compilation algorithms [16, 20], including the development of the polyhedral model [1, 34], optimize constrained loop nests via tiling and data layout transformations. Frameworks for extracting fine-grained parallelism in traditional workloads and applying such polyhedral techniques, including Polly [13] and PLuTo [3], have proven beneficial. However, these techniques have not been sufficient to achieve peak performance for many ML workloads [2].

Various frameworks, including URUK [12], Halide [26], and Tiramisu [2], separate loop nest semantics from execution order via a scheduling language. TVM [4] also does this, building on Halide by creating a tensor expression language and adopting a "decoupled" scheduling approach that allows for hardware-specific optimizations. The result is a cleaner separation of expertise between network architecture and hardware design; see for example Liu et al. [21] on optimizing for CPUs in TVM. AutoTVM [5] introduces automatic selection of a schedule from a schedule search space using a deep learning approach with transfer learning. This means hardware-specific optimizations can be written with a focus on getting the right structure (whether to try tiling, whether to try different memory layouts (and which), whether to try tensorization, etc.) without needing to manually experiment with the exact parameters for these optimizations. These schedule spaces are still coupled to both the operation and the hardware architecture.

Relay [28] is an IR for tensor expressions used in the TVM stack. While its functionality has some overlap with Stripe (transformations enabling tensorization, for example), it is mostly a higher level IR than Stripe; many of its tasks are represented in Tile in the PlaidML stack (automatic differentiation, for example) or even at the graph level.

nGraph [8] provides optimization opportunities at the graph level, where the network-to-device compilation can be managed with a series of "subgraphs". Since graphs can be managed at this level in both a static and dynamic manner, the performance increase can be used to further accelerate training workloads, or (as is the more common use case for nGraph) to output inference computations in environments where low latency is important. nGraph may be used in conjunction with PlaidML (see section 3.4) to provide complementary graph optimizations.

Glow [29] offers graph compilation and does not generate code for operations like GEMMs or convolutions, instead relying on kernel libraries or accelerator-specific compilers.

Other machine learning domain-specific compilers include XLA [30], Diesel [10], DLVM [32], and TensorComprehensions [31].

1.3 Stripe

We propose a compiler structured along the same lines as LLVM: it lowers source code to an intermediate representation (IR) and selects and parameterizes a list of optimization passes from a common pool; these passes are then iteratively applied to the IR; and only after all have been applied is the IR code lowered to hardware-specific instructions. A key innovation of our proposed compiler is the IR, called Stripe, which abstracts to a granularity fine enough to represent the full new functionality available on ML accelerators and coarse enough to allow automatic compilation of high performance ML code.

Stripe is built to represent tensor operations via the Nested Polyhedral Model (Section 3.1). This model nests polyhedra in the sense that, for each point in a parent polyhedron, a child polyhedron is defined. This nesting naturally represents tiling, partitioning, tensorization, and other "blocking" operations. It also allows assignment of nested polyhedra to nested memory units, giving the compiler a way to match the compute structure to the caching structure for multilevel hardware topologies.

At the same time, when this hardware-based complexity is not needed, Stripe does not require it to be specified. Stripe code representing a single tensor operation can be represented as an unnested polyhedron, and a network can be represented as a list of polyhedra. This allows straightforward lowering to Stripe from a language that uses a syntax directly representing mathematical formulas for the tensor operations (PlaidML's Tile language, for example). The representation is then transformed through a series of optimization passes to divide and rewrite these basic polyhedra into nested polyhedra appropriate for maximizing performance on the hardware target.

We see several key advantages to Stripe.

Kernel Library:
    foreach HW Architecture
        foreach HW Version
            foreach Kernel
                foreach Input Shape
                    foreach Output Shape
                        write_kernel

Schedule Space Searching:
    foreach Kernel
        write_algorithm
        foreach HW Architecture
            write_schedule_space
            foreach HW Version
                foreach Input Shape
                    foreach Output Shape
                        select_schedule

Stripe:
    foreach Kernel
        write_algorithm
    foreach HW Architecture
        create_stripe_config
        foreach HW Version
            set_config_params

Figure 1: A comparison of the manual engineering needed under different code generation approaches. For a schedule search space approach like AutoTVM, autotuning is used to select the schedule; other scheduling approaches may instead manually write schedules for the various hardware versions and tensor shapes (and thus do not use a schedule search space in favor of manual schedule "selection"). For Stripe, note that hardware configuration is done independently of the kernels.

Foremost, Stripe removes the combinatorial explosion of engineering work from the interaction between growth in accelerators and growth in operations. The classic compiler approach of Stripe means that algorithms can be written on a per-operation basis and optimizations can be written on a per-architecture basis; notably, neither must be written based on both the operation and the hardware architecture. Even with a schedule-space autotuning approach like AutoTVM, schedule spaces must be written for each combination of operation type and architecture type. For kernel libraries, manually-engineered code must also include hardware and operation parameters (see Figure 1).

Stripe's compiler provides modular and extensible optimization passes, allowing novel optimizations without requiring redevelopment of existing optimizations. Stripe's optimization passes are generic and parameterized, enabling reuse across any hardware target for which the pass is beneficial. Stripe's nested polyhedral model naturally represents memory hierarchies of nested and potentially-heterogeneous depth, thereby supporting complex hardware topologies. The compilation model of Stripe doesn't require physical hardware or even a cycle-accurate model, just a selection of optimization passes with appropriate parameters; in contrast to autotuning approaches, this allows software-hardware codesign early in the development cycle and at relatively low cost.

2 Requirements of Current Machine Learning Execution

To successfully produce high-performance code, an ML compiler must first analyze, accurately, any defining features in the dataflow and then perform tractable optimizations to target complex hardware topologies based on those features. Most ML frameworks today that proffer state-of-the-art performance do not have a compiler that satisfies these requirements, and thus instead use expansive kernel libraries.

2.1 Data Use Analysis

Analyzing the performance of a machine learning workload with any degree of accuracy requires clear analysis of data usage. Particularly important are how much data is used (i.e., Is dataflow split into appropriately sized chunks for the memory units being used?) and which data is used (i.e., What produces and depends on this data? How much reuse is possible if the data is retained at various levels of the memory hierarchy?). Tracking details such as these in a general-purpose compiler can be extremely challenging, and even limited solutions are important research areas.

Machine learning workloads are typically highly structured in several ways that provide key clues for how our analysis should proceed. ML workloads have few control dependencies (exceptions are generally minimal and straightforward: reversing the non-padding portion of a sequence in a recurrent network may depend on the length of the sequence, for example). Thus, we can calculate, rather than estimate, what data will need to be used or reused. Moreover, the calculations necessary to track data dependency and aliasing for machine learning workloads are often reasonably straightforward and tractable.

The natural representation of data is typically a tensor with several dimensions of various sizes. The operations performed generally access input and output tensors in a highly regular manner; specifically, an iteration space of indexes can be defined, and all tensor accesses can be defined as affine polynomials in terms of indexes in the iteration space. Additionally, intra-operation dependencies most frequently take the form of commutative and associative aggregations, such as the sum of a multiply-accumulate or the max of a maxpool.
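For illustration, the sketch below (Python; the shapes, layout, and loop structure are our own example rather than anything prescribed by Stripe) writes a small "same"-padded convolution so that every tensor access is an affine function of the iteration indexes, which is exactly the regularity this analysis relies on:

    import numpy as np

    # Hypothetical "same"-padded 3x3 convolution; every array access below is an
    # affine polynomial of the iteration indexes (x, y, i, j, c, k).
    X, Y, C, K, R = 12, 16, 8, 16, 3          # spatial dims, in/out channels, kernel size
    I = np.random.rand(X, Y, C)               # input tensor
    F = np.random.rand(R, R, K, C)            # filter tensor
    O = np.zeros((X, Y, K))                   # output tensor

    for x in range(X):
        for y in range(Y):
            for i in range(R):
                for j in range(R):
                    for c in range(C):
                        for k in range(K):
                            xi, yj = x + i - 1, y + j - 1      # affine accesses with a -1 offset
                            if 0 <= xi < X and 0 <= yj < Y:    # boundary constraints on the index space
                                O[x, y, k] += I[xi, yj, c] * F[i, j, k, c]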

2.2 Complex Hardware Topologies

Appropriately distributing ML tasks to accelerators requires optimizations not typically required in the CPU or GPU cases, nor are they required at the same level of generality. A GPU may need explicit memory movement between internal compute units, though it is unlikely to need this across multiple levels of memory hierarchy or multiple memory partitions. A CPU might need vectorization to use vector instructions, but it is unlikely to have compute units that operate only on tensorized data. Partitioning work amongst multiple heterogeneous hardware units may also be necessary to appropriately distribute an ML workload.

Supporting even one accelerator will very likely require memory management at multiple levels and distribution of work to compute units with varied capabilities. With a kernel library, utilization of these features will be written directly into the kernel. With a compiler, optimizations appropriate to these features will be automatically generated. An optimization that decides which data should be moved from larger distant memory to smaller closer memory (at whatever level of the overall hierarchy these memory units reside) readily generalizes to multiple accelerator architectures. Similarly, an optimization that can distribute work to heterogeneous compute units of varied complexity will generalize to varied accelerator architectures.

A hardware runtime will still be necessary, even with a compiler. However, with the compiler performing general machine learning optimizations targeted to the hardware architecture, the runtime can be developed in a low-level, hardware-facing manner without requiring optimizations for specific ML tasks.

2.3 Tractable Optimizations

The patterns of control flow and data use in machine learning workloads make optimization easier, not harder. Many optimizations become tractable, and indeed necessary, to achieve state-of-the-art performance. In both programmer-managed and automatically-cached memory scenarios, data should be loaded in a manner that maximizes reuse and makes full use of available space without spilling; this usually involves division into multiple and distinct levels of memory. Often there are hardware-specific instructions requiring exact stencil sizes, especially on accelerators. Where multiple hardware units are available, the work must be appropriately balanced, even when the units provide heterogeneous functionality. Finally, complex scheduling problems arise from doing all of this in a context with deep chains of massive and mostly parallel operations.

Autotiling  Large tensors may need to be split into smaller tiles to optimize cache reuse. Autotiling must evaluate the performance of potential tilings and split loops into tiles accordingly. The autotiling pass drives many Stripe design choices and will be discussed in further detail in Section 3.3.

Microarchitectural Transposition  Advanced instructions or specialized compute units may require data in a specific layout. Code that could take advantage of these instructions or compute units if its data were transposed must be found, and the transposition performed.

Microarchitectural Stenciling  The microarchitecture may need a specific tile size (stencil), in addition to the required dimension-order for its data layout. Code that could use specialized instructions or compute units if the data matched a specific stencil must be found, and that data must be reshaped to the stencil.

Banking and Partitioning  It may be useful for multiple compute units to work in parallel on different portions of the same data. For operations that can be run in parallel in this way, the relevant tensors must be partitioned into different compute-unit-specific caches or into different banks to enable this parallel work without conflict.

Fusion  To maximize cache reuse, it may be better to perform multiple operations on only one or a few tiles of data before proceeding to other data. Code may include a series of loops that could potentially share the same outer loop and internally perform those operations in serial. The relative performance of such a fusion must be compared to other possible fusions (or no fusion at all); where a fusion is valuable, the code must be rewritten to a fused form.

Scalarization and Memory Localization  Transient intermediates produced in registers may not need to be stored into memory and reloaded into registers. Temporary memory may only be needed in inner portions of the memory hierarchy. Memory allocation must be pulled inside loops where legal and semantically equivalent, and unnecessary stores and loads must be found and eliminated.

Scheduling  Operations reading and writing logical tensor data must be rewritten to access physical device memory. This requires assigning physical memory locations for logical tensor data, scheduling data movement to and from the physical memories accessible to compute units, and reordering the operations to take advantage of data locality.

Separating Interior and Boundary Tiles  Some workloads do not evenly divide into tiles, or they might have special boundary conditions or other irregularities that do not affect most tiles, but that must be considered nonetheless. These irregularities are best handled separately from the general tiles.

3 Stripe Design & Implementation

The Stripe IR is designed to provide both a level and type of granularity appropriate to optimizing machine learning tasks. In Section 3.1 we discuss the Nested Polyhedral Model, which provides the theoretical underpinnings for Stripe. In Section 3.2 we describe the Stripe IR, discussing how it implements the Nested Polyhedral Model, and how this implementation enables important optimizations. In Section 3.3 we detail how autotiling is performed when compiling Stripe, and demonstrate how optimization passes function with Stripe.

3.1 Nested Polyhedral Model

3.1.1 The Polyhedral Model

Definition 1. An integer polyhedron P is the set of all x ∈ Q^n such that Ax + b ≥ 0 and Ax + b ∈ Z^m, where A ∈ Q^(m×n) and b ∈ Q^m.

Note that this definition is not equivalent to the definition sometimes used of an integer polyhedron as a set of x ∈ Z^n satisfying Ax + b ≥ 0 (e.g. in Bondhugula et al. [3]); instead, it is the intersection of a lattice with a real convex polyhedron. For convenience, we will use the term "polyhedron" to refer specifically to bounded integer polyhedra that are subsets of Z^n.
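For example, one small polyhedron in the sense of Definition 1 (a worked instance of ours, not drawn from the Stripe implementation) is the 4 × 3 grid of integer points below; because the rows of A are signed unit vectors, the condition Ax + b ∈ Z^4 forces x itself to be integral:

    A = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix},
    \qquad
    b = \begin{pmatrix} 0 \\ 3 \\ 0 \\ 2 \end{pmatrix},
    \qquad
    P = \{\, x \in \mathbb{Q}^2 : Ax + b \ge 0,\ Ax + b \in \mathbb{Z}^4 \,\}
      = \{0, 1, 2, 3\} \times \{0, 1, 2\}.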
The polyhedral model [1, 11, 12, 25, 34] is a model for performing iterative computations over an index space defined by a polyhedron, with dependencies between steps generally also defined. This paper will not go into detail on this model; an overview can be found in Girbal et al. [12]. Instead, we will develop a nested polyhedral model of iterative computation that most notably differs from the polyhedral model in its dependency structure.

3.1.2 Parallel Polyhedral Blocks

In the Nested Polyhedral Model, there are no dependencies between iterations, with the possible exception of reduction dependencies. This is specified more precisely in Definition 2.

Definition 2. A parallel polyhedral block (P, S_P, D, A_D) consists of a polyhedron P called the iteration space, a map S_P from points in P to lists of statements, a set of I/O buffers D, and a map A_D from buffers of D to associative and commutative operations called the aggregation operations, satisfying the following:

1. Statements in S_P may only read or write to buffers in D or to internally-scoped temporaries that are not shared between iterations. A single statement list S_p ∈ S_P may have arbitrary dependencies between its statements and is interpreted as running serially.

2. If the statements s_i ∈ S_P for iteration i ∈ P write to a buffer element b ∈ B ∈ D, no statements s_j ∈ S_P for j ∈ P, j ≠ i may read from this buffer element b.

3. When a buffer element b ∈ B ∈ D is written to by statements in statement lists S_i0, ..., S_in ∈ S_P for multiple index values i0, ..., in ∈ P, the value written to b is a_B(v_i0, ..., v_in), where v_i0, ..., v_in are the values for b computed by the statement lists S_i0, ..., S_in and a_B ∈ A_D is the aggregation operation associated with B. When a buffer element b is written to by the statements in statement list S_i for exactly one iteration i ∈ P, then the value v_i computed for element b by s_i is written to b, regardless of the aggregation operation.
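To make Definition 2 concrete, the following Python sketch (our illustration; the functional interface is an assumption and is not Stripe's representation) executes a parallel polyhedral block: iterations are independent, and writes that collide on the same buffer element are combined with the buffer's aggregation operation, so the result does not depend on iteration order.

    from itertools import product

    def run_parallel_polyhedral_block(ranges, constraints, statements, aggregate, init):
        """Execute a (simplified) parallel polyhedral block.

        ranges      -- dict of index name -> range size (rectilinear bounds)
        constraints -- list of predicates over the index dict (extra constraints)
        statements  -- function mapping an index dict to a list of (element, value) writes
        aggregate   -- associative, commutative combine function (e.g. addition or max)
        init        -- identity value for the aggregation
        """
        out = {}
        names = list(ranges)
        for point in product(*(range(ranges[n]) for n in names)):
            idx = dict(zip(names, point))
            if not all(c(idx) for c in constraints):
                continue                      # point lies outside the polyhedron
            for element, value in statements(idx):
                # Colliding writes are combined with the aggregation operation, so
                # any execution order (serial or parallel) yields the same result.
                out[element] = aggregate(out.get(element, init), value)
        return out

    # Usage sketch: sum aggregation over a 2-D iteration space, reducing over j.
    result = run_parallel_polyhedral_block(
        ranges={"i": 4, "j": 3},
        constraints=[lambda ix: ix["i"] + ix["j"] <= 5],
        statements=lambda ix: [((ix["i"],), float(ix["i"] * ix["j"]))],
        aggregate=lambda a, b: a + b,
        init=0.0,
    )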

Imposing these dependency restrictions makes it more straightforward to parallelize execution over different elements of the iteration space. The only necessary additions to the statements in S_P involve how to handle aggregation (i.e. adding aggregation operations and temporary storage for intermediate results—and even this may be unnecessary if, for example, the aggregations can be made atomic). At the same time, the statements within a block are semantically serial (although they may, of course, be executed in parallel if the compiler can determine that doing so is equivalent to serial execution), and thus within a block (intra-block) statements may have complex dependencies (except insofar as they modify the externally-visible buffers in D).

Note that as defined, a parallel polyhedral block need not demonstrate the regularities common to machine learning workloads. Such blocks can involve statements with complex control dependencies (they are restricted in what buffer elements they can read or write by other iterations); they make no restrictions requiring affine data access patterns; they can have statement lists that are altogether unrelated for different iteration indexes. This makes verifying that the dependency conditions are satisfied challenging, especially for optimization passes that automatically rewrite the parallel polyhedral block to code that must be proven semantically equivalent. Moreover, utilizing specialized hardware can be challenging. For example, if the statements differ between every iteration index, utilizing SIMD hardware effectively is essentially impossible. Additional restrictions Stripe makes to match ML workloads to hardware and to make execution efficient are discussed in Section 3.2.

3.1.3 Nested Polyhedral Model

The Nested Polyhedral Model is built from parallel polyhedral blocks by defining one or more statements of a parallel polyhedral block to be the execution of another parallel polyhedral block.

Ensuring that the dependency conditions of Definition 2 are satisfied will almost always require the inner parallel polyhedral block to depend on the iteration indexes of the outer block. Stripe accomplishes this by offsetting memory accesses in the inner block based on the iteration indexes of the outer block (as well as of the inner block). See Figure 2 for examples of the resulting access patterns; as illustrated, this readily represents "block" access patterns (such as those arising from vectorization, tensorization, tiling, and partitioning).

Figure 2: Two tilings of a tensor iterated over by nested polyhedral blocks. Points of the same color are iterated over in the same inner block. In the upper tiling, the inner-block access steps one unit horizontally or vertically as the appropriate index is incremented, while the outer-block access steps three units horizontally or two vertically. In the lower tiling, the outer-block access steps one unit as the appropriate index is incremented and the inner-block access steps in units of three or two. Either is readily expressed in the Nested Polyhedral Model, and as there are no conflicting accesses, no serial statements need be used. Thus, both are hierarchically parallelizable.

This nesting of parallel polyhedral blocks can be extended to as many levels as appropriate for the problem, creating a hierarchy of parallelizable code. Figure 3 illustrates what regions of a tensor might be accessed in a multilevel nest of parallel polyhedral blocks constructed from partitioning, tiling, and tensorization passes to target a hardware architecture.

Figure 3: The memory regions accessed by statements in the parallel polyhedral blocks at various levels in a Nested Polyhedral Model. Each column shows the memory accesses from a different nesting depth. The columns are labeled with hardware features that might be targeted by blocks at that level of nesting.
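Before turning to the Stripe IR itself, the small Python sketch below (ours; the 12 × 16 tensor, 3 × 2 tile shape, and row-major layout are assumptions chosen to echo the upper tiling of Figure 2) shows how the outer block's indexes offset the inner block's accesses, with every address an affine function of the outer and inner indexes:

    # Upper-tiling pattern of Figure 2 (assumed 12 x 16 tensor split into 3 x 2 tiles):
    # the outer block steps in units of the tile shape, the inner block steps by one,
    # and every flat address is an affine function of (xo, yo, xi, yi).
    H, W = 12, 16
    TILE_H, TILE_W = 3, 2

    def flat_offset(xo, yo, xi, yi, row_stride=W):
        x = TILE_H * xo + xi          # affine in the outer and inner indexes
        y = TILE_W * yo + yi
        return x * row_stride + y     # row-major layout

    visited = set()
    for xo in range(H // TILE_H):              # outer (tile) block
        for yo in range(W // TILE_W):
            for xi in range(TILE_H):           # inner (within-tile) block
                for yi in range(TILE_W):
                    visited.add(flat_offset(xo, yo, xi, yi))

    assert visited == set(range(H * W))        # every element is covered exactly once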

3.2 Structure of Stripe

Stripe represents parallel polyhedral blocks with the block structure. A Stripe block captures the polyhedral iteration space by specifying a list of index names, a range for each index, and a list of affine constraints on the indexes. There is a single statement list that does not vary between iteration indexes. The statements do access different buffer elements in different iterations, and statements that are themselves blocks may have their inner iteration space modified based on the outer iteration. With the restriction to a single statement list, assigning work to SIMD hardware becomes efficient.

The I/O buffers of a Stripe block are explicitly declared, along with an aggregation operation for each buffer. Stripe includes an assign aggregation operation that indicates it is illegal for values in the buffer to be written to by multiple iterations.

Buffer accesses in Stripe are affine functions of the iteration indexes, potentially including indexes of all parent blocks. This makes aliasing analysis much easier, which is critical for verifying that all properties of a parallel polyhedral block remain satisfied after an automatic rewrite. Analysis is also simplified by requiring any parent index used to be explicitly passed to the child block.
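As a simplified illustration of why affine accesses make such verification tractable (this is our sketch, not Stripe's actual analysis), the following computes the interval of flat offsets an affine access can touch and uses it to prove two accesses disjoint:

    def affine_range(coeffs, offset, index_ranges):
        """Min/max flat offset of an access  offset + sum(c * i)  with 0 <= i < range."""
        lo = hi = offset
        for c, r in zip(coeffs, index_ranges):
            lo += min(c * 0, c * (r - 1))
            hi += max(c * 0, c * (r - 1))
        return lo, hi

    def provably_disjoint(access_a, access_b, index_ranges):
        """Conservative check: disjoint offset intervals imply the accesses cannot alias."""
        a_lo, a_hi = affine_range(*access_a, index_ranges=index_ranges)
        b_lo, b_hi = affine_range(*access_b, index_ranges=index_ranges)
        return a_hi < b_lo or b_hi < a_lo

    # Two tiles of a row-major 12 x 16 buffer: rows 0-2 versus rows 3-5 (x in 0..2, y in 0..15).
    tile0 = ([16, 1], 0)        # offset 0  + 16*x + 1*y
    tile1 = ([16, 1], 48)       # offset 48 + 16*x + 1*y
    print(provably_disjoint(tile0, tile1, index_ranges=[3, 16]))   # True: they cannot overlap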
Stripe statements can be another block, an intrinsic, or a special function. An intrinsic works with scalar values: it can read or write a scalar from a buffer (using a buffer access that is an affine polynomial of index values as described above), or perform simple operations on scalars, such as addition or a trig function. Special functions perform complex operations on tensors that are inappropriate to represent as blocks of operations on scalars, e.g. scatter or gather.

Operations expressed as scalars in Stripe are not always performed by manipulating scalars at the hardware level (e.g. due to vectorization). For situations where blocks of scalar statements have appropriate semantics that translate in whole to hardware instructions, Stripe includes tags which signal to optimization passes and the lowerer that a chunk of code is intended to be lowered in a certain way. Tags are more general than just this use case: any element of Stripe code may be given an arbitrary set of strings which are its tags. These tags have no semantic meaning (in the sense that they do not change the expected program output), but instead provide additional information to Stripe optimization passes and the hardware abstraction layer. Other use cases for tags include storing results from analysis passes to avoid repeating the analysis in later passes where such recomputation may be expensive or challenging.

To clarify the memory access of a block, all buffers used in a block must be explicitly declared, and the scope of a buffer is limited to the block it is declared in. In particular, buffers are not in scope within child blocks unless explicitly passed to the child. Stripe uses refinements to declare passing a buffer to a child block. The refinement declares whether the child buffer is to be used for input, output, or both, and indicates what subregion of the parent block is represented—child buffers do not have to bring the entire parent buffer into scope in the child block. Typically they don't, which enables verification of parallelizability of the nested polyhedral structure. A refinement also describes the memory layout of the child buffer, indicating the size and stride (i.e. memory layout) of each dimension. Passing only these restricted views to inner blocks naturally represents memory and compute structures for optimizations like tiling and vectorization.

Refinements may also include the hardware location of the buffer: the name of the memory unit (e.g. "SRAM"), a bank number (if applicable) which may be determined from the iteration indexes if appropriate, and a memory address. Buffer locations are not required, and hardware-specific optimization passes will need to be run before it is possible to set buffer locations sensibly. Specifying buffer locations allows for more precise analysis of available resources and is a crucial step for devices with programmer-controlled memory.

Blocks may contain multiple statements, and these statements must be executed as if in serial. However, when the compiler can verify that parallel execution would not change the semantics, this parallel execution is allowed. A scheduling pass is used on multi-statement blocks to construct a directed acyclic graph of dependencies between the statements. Where applicable, information about the memory access patterns of statements (e.g. from child block refinements) is used to determine if statements are independent. This can be especially important for partitioning of work into heterogeneous units, where distinct block structures are needed for the different units.

Stripe allows arbitrary integer polyhedra to be used as the iteration spaces of blocks. However, its syntax encourages the use of rectilinear constraints by requiring a range to be specified for each index and optionally allowing additional non-rectilinear constraints. This structure models the almost-rectilinear nature of common operations like convolutions with boundary conditions. Maintaining as much rectilinearity of constraints as possible in the IR is valuable, as hardware targets often perform better on rectilinear iteration spaces but need to handle tasks that are not perfectly rectilinear.

One minor divergence of the implementation of Stripe from theoretical parallel polyhedral blocks as specified in Definition 2 is that aggregation operations may be only approximately associative and commutative (floating point addition is a common aggregation operation that only approximately has these properties, for example). In such situations, executing a Stripe program is ill-defined and nondeterministic; however, this nondeterminism typically leads to negligible errors in practice for the same reasons floating point errors are typically negligible. In situations where this nondeterminism cannot be safely ignored, fixed point or integer types may be used instead.

3.3 Autotiling

To illustrate how Stripe IR automates effective optimizations, let's consider one of the key optimization passes: autotiling. Machine learning operations are routinely too large and must be split into pieces ("tiles") that fit into local resources. The autotiling optimization pass determines the shape of these tiles that brings the overall operation's performance closest to the roofline [33] implied by the available compute and I/O bandwidth. Depending on the hardware target, several costs may need to be considered, including the amount of memory reuse, whether the tile shape evenly divides all dimensions of all tensors (and how large any overflow is), whether any reductions have been split to multiple tiles and the relative cost of computing those reductions later, and the interaction of the cache width with the layout of each tensor as restricted by the tile shape.

In architectures with automatic caching, this tiling optimization improves cache performance by selecting tile sizes where full tiles can fit into cache simultaneously, maximizing cache hits. In architectures requiring explicit memory transfers, tiling determines what data will be transferred, with the tile size ensuring that all data fits in the inner memory and that the inner memory is efficiently filled to maximize reuse. In architectures with queue-based multiprocessing, tiling breaks operations into workgroups effectively.

The autotiling optimization for Stripe explores a space of tile sizes using a cost function that models the potential performance impacts described above (an example of this is illustrated in Figure 4). Several constraints can exclude tile sizes from the space to be explored: for instance, the total memory used may not exceed the total available memory; also, if the operation is already applied to dimensioned blocks (from an earlier vectorization or tensorization pass, for example), then the tile size must be an even multiple of this size. Search-space heuristics, such as only considering power-of-2 dimensions to optionally improve compile performance, may also constrain the tile sizes considered.

The design of the Stripe IR makes it straightforward to rewrite blocks to introduce an intermediate block of the selected tile size (example code produced by such rewriting is provided in Figure 5). In the basic case, simply splitting the index ranges such that the inner iteration space shape matches the selected tile size, and the outer iteration space shape is the quotient of the original index ranges, and passing the tensors into the inner block with appropriate offsets will create an effective rewrite. The common complexities of tiling that arise in ML workloads are also readily represented:

• When different coordinates in an output tensor need to read from the same coordinates on an input tensor along large dimensions (e.g. for a non-pointwise convolution), the required iteration space will be polyhedral and not perfectly rectilinear. Constraints representing the boundary / "halo" conditions define such an iteration space.

• For regions that are already not perfectly rectilinear, the existing constraints can be pulled into the inner block to maintain the same polyhedral structure.

• When the optimal tile size does not evenly divide a dimension, round up the computed quotient for the outer block (causing an overflow). Then remove the overflow by adding a constraint based on both the outer and inner index value to not perform any calculations in the out-of-bounds overflow region that this introduced.
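The following Python sketch outlines such a tile-size search (our illustration; the cost function, capacity limit, and layout constants are stand-ins and are not Stripe's cost model or the exact model used in Figure 4):

    import math

    # Toy problem: 3x3 convolution over a 12 x 16 spatial grid,
    # 8 input channels (channel-contiguous), 16 output channels.
    X, Y, CI, CO, R = 12, 16, 8, 16, 3
    CACHE_LINE = 8          # elements per cache line (assumed)
    CAPACITY = 512          # max elements of input + output tile kept locally (assumed)

    def cost(tx, ty):
        """Approximate cache lines touched per tile, normalized by work per tile."""
        in_elems = (tx + R - 1) * (ty + R - 1) * CI           # input tile incl. halo
        out_elems = tx * ty * CO                               # output tile
        if in_elems + out_elems > CAPACITY:
            return None                                        # excluded from the search space
        lines = ((tx + R - 1) * (ty + R - 1) * math.ceil(CI / CACHE_LINE)
                 + tx * ty * math.ceil(CO / CACHE_LINE))
        macs = tx * ty * R * R * CI * CO                       # multiply-accumulates per tile
        return lines / macs

    candidates = [(tx, ty) for tx in range(1, X + 1) for ty in range(1, Y + 1)
                  if X % tx == 0 and Y % ty == 0]              # heuristic: evenly dividing tiles only
    best = min((c for c in candidates if cost(*c) is not None), key=lambda c: cost(*c))
    print(best, cost(*best))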

(a) Cost: 4.666    (b) Cost: 4.5

(c) Excluded from search space for requiring too many elements in memory for a single tile.    (d) Cost: 5.625.

Figure 4: Four different tilings along with associated costs. This example shows the input and output tensors of a 3 × 3 convolution (the weights tensor is not shown and all examples treat it as untiled). We use a hypothetical cost model of number of cache lines accessed, divided by the number of multiply-accumulate operations performed. Tiles on the inputs are shown including overflows; accesses to these elements are removed by constraints in execution but still increase the cost. Only the spatial dimensions are shown, but for the cost model we assume non-batched data with 8 input channels, 16 output channels, and a cache line size of 8 elements. We cap the total memory available for both the input and output tensor tiles at 512 elements.

    block []:1 (
        in I[0, 0, 0] i8(12, 16, 8):(128, 8, 1)
        in F[0, 0, 0, 0] i8(3, 3, 16, 8):(384, 128, 8, 1)
        out O[0, 0, 0]:assign i8(12, 16, 16):(256, 16, 1)
    ) {
        0:
        block [x:12, y:16, i:3, j:3, c:8, k:16] (
            -1 + x + i >= 0
            12 - x - i >= 0
            -1 + y + j >= 0
            16 - y - j >= 0
            in I[x+i-1, y+j-1, c] i8(1, 1, 1):(128, 8, 1)
            in F[i, j, k, c] i8(1, 1, 1, 1):(384, 128, 8, 1)
            out O[x, y, k]:add i8(1, 1, 1):(256, 16, 1)
        ) {
            0: $I = load(I)
            1: $F = load(F)
            2: $O = mul($I, $F)
            3: O = store($O)
        }
    }

(a) Before tiling

    block []:1 (
        in I[0, 0, 0] i8(12, 16, 8):(128, 8, 1)
        in F[0, 0, 0, 0] i8(3, 3, 16, 8):(384, 128, 8, 1)
        out O[0, 0, 0]:assign i8(12, 16, 16):(256, 16, 1)
    ) {
        0:
        block [x:4, y:4, i:1, j:1, c:1, k:1] (
            in I[3*x - 1, 4*y - 1, 0] i8(5, 6, 8):(128, 8, 1)
            in F[0, 0, 0, 0] i8(3, 3, 16, 8):(384, 128, 8, 1)
            out O[3*x, 4*y, 0]:add i8(3, 4, 16):(256, 16, 1)
        ) {
            0:
            block [x:3, y:4, i:3, j:3, c:8, k:16, xo=3*x, yo=4*y] (
                -1 + xo + x + i >= 0
                12 - xo - x - i >= 0
                -1 + yo + y + j >= 0
                16 - yo - y - j >= 0
                in I[x + i, y + j, c] i8(1, 1, 1):(128, 8, 1)
                in F[i, j, k, c] i8(1, 1, 1, 1):(384, 128, 8, 1)
                out O[x, y, k]:add i8(1, 1, 1):(256, 16, 1)
            ) {
                $I = load(I)
                $F = load(F)
                $O = mul($I, $F)
                O = store($O)
            }
        }
    }

(b) After tiling

Figure 5: Example Stripe code before and after the tiling pass shown in Figure 4b. Note how the iteration space is specified by giving a range for each variable and also specifying any additional non-rectilinear constraints. In Figure 5b, notice how the overlap between tiles manifests as the size of I on the middle (tiled) block being larger than the strides of the corresponding spatial indexes. On the innermost block of Figure 5b the values of the x and y indexes on the parent block are explicitly passed in so they may be used in the constraints on the child block.
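The index arithmetic behind this rewrite is small; the sketch below (Python, ours; it manipulates plain integers rather than Stripe IR) splits one index range into an outer/inner pair, rounding the outer range up when the tile does not divide evenly and emitting the constraint that masks out the overflow region described in the bullets above:

    import math

    def split_index(extent, tile):
        """Split a loop index of the given extent into (outer_range, inner_range, constraint).

        The constraint is a textual stand-in for the affine inequality Stripe would add
        when the rounded-up outer range introduces an out-of-bounds overflow region.
        """
        outer = math.ceil(extent / tile)                   # rounded-up quotient
        constraint = None
        if outer * tile != extent:                         # overflow exists
            # keep only points with tile*outer_idx + inner_idx < extent
            constraint = f"{extent - 1} - {tile}*xo - xi >= 0"
        return outer, tile, constraint

    print(split_index(16, 4))   # (4, 4, None): divides evenly, as in Figure 5
    print(split_index(12, 5))   # (3, 5, '11 - 5*xo - xi >= 0'): overflow masked by a constraint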

Stripe's nested block structure readily allows for multiple tiled layers. This is useful not only for cases like tensorization, as alluded to above, but also for more general use cases like hardware topologies with multiple layers of memory, or when generally partitioning work amongst multiple units.

3.4 Stripe in PlaidML

Stripe is the core IR in a larger PlaidML tensor compiler, as shown in Figure 6. PlaidML first lowers networks from a source like nGraph [8], Keras [7], or ONNX [23] into Tile, which is PlaidML's high-level IR representing ML operations in a form reminiscent of Einstein notation. Gradients are computed in Tile if desired, and this Tile code is lowered to Stripe in a general, hardware-agnostic form. Stripe code is then compiled via a series of optimization passes targeting the desired hardware, and the resultant code is lowered to a hardware abstraction layer, accelerator runtime, or other hardware-appropriate code.

Figure 6: The PlaidML internal stack

4 Future Work

The most crucial future work will be to verify the performance of a variety of networks compiled through Stripe on a variety of hardware targets. We are eager to share our approach with the broader community, and we believe that performance for various GPU targets with our pre-Stripe, fixed compilation pass technology demonstrates that Stripe is well-positioned to automatically generate high performance ML kernels. Nonetheless, producing benchmarks for full, modern networks on state-of-the-art hardware is critical. We are actively working on such benchmarks and will be publishing them.

We will also continue to release Stripe on the open source PlaidML GitHub repository. Most notably, while we have released preliminary versions of Stripe already, we do not yet have a release that uses Stripe as part of the main PlaidML compilation pipeline. Producing such a release will be key to making the Stripe IR useful to the open source community.

MLIR [19] is an upcoming compiler infrastructure providing an IR with multiple "dialects". This allows IRs of various forms to all be embedded as dialects within a broader MLIR infrastructure, improving inter-operability and optimization pass reuse. From our current understanding of MLIR we believe that both Stripe and MLIR would benefit from adding Stripe as an MLIR dialect. In particular, we expect this would provide better integration of Stripe with other stages of the machine learning code generation and execution pipeline, and would enable greater sharing of optimization passes between Stripe and other compilers.

We hope the extensible nature of Stripe's optimization passes will enable new optimization techniques expressed in the Nested Polyhedral Model to be added to Stripe.

5 Conclusions

In this paper, we introduced a domain-specific IR called Stripe that uses the Nested Polyhedral Model to enable automatic generation of machine learning kernels for a variety of hardware targets. We presented the mathematical underpinnings of the Nested Polyhedral Model, and discussed how it restricts legal schedules that model the extreme parallelism available in machine learning, and how it uses a nesting structure analogous to patterns common in loop nest optimizations and accelerator topologies. We described how a compiler based on Stripe enables powerful, extensible, and configurable optimizations to be developed independently of the machine learning operations and algorithms being optimized.

6 Acknowledgments

We would like to gratefully acknowledge the contributions and feedback of a number of people without whom this paper would not have been possible. Particular thanks to Leona Cook for editing, and thanks to Madhur Amilkanthwar, Priya Arora, Mikaël Bourges-Sévenier, Cormac Brick, Diego Caballero, Namrata Choudhury, Rob Earhart, Frank Laub, Alessandro Palla, Brian Retford, Mars Saxman, Yao Shi, and Matt Westervelt.

References

[1] J. M. Anderson and M. S. Lam. Global optimizations for parallelism and locality on scalable parallel machines. In Proceedings of the ACM SIGPLAN'93 Conference on Programming Language Design and Implementation, 1993.

[2] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman Amarasinghe. Tiramisu: A polyhedral compiler for expressing fast and portable code. In Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2019, pages 193–205, Piscataway, NJ, USA, 2019. IEEE Press.

[3] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Not., 43(6):101–113, June 2008.

[4] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q. Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: end-to-end optimization stack for deep learning. CoRR, abs/1802.04799, 2018.

[5] Tianqi Chen, Lianmin Zheng, Eddie Q. Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to optimize tensor programs. CoRR, abs/1805.08166, 2018.

[6] Yu-Hsin Chen, Joel S. Emer, and Vivienne Sze. Eyeriss v2: A flexible and high-performance accelerator for emerging deep neural networks. CoRR, abs/1807.07928, 2018.

[7] François Chollet et al. Keras. https://keras.io, 2015. [Online; accessed 8-March-2019].

[8] Scott Cyphers, Arjun K. Bansal, Anahita Bhiwandiwalla, Jayaram Bobba, Matthew Brookhart, Avijit Chakraborty, William Constable, Christian Convey, Leona Cook, Omar Kanawi, Robert Kimball, Jason Knight, Nikolay Korovaiko, Varun Kumar, Yixing Lao, Christopher R. Lishka, Jaikrishnan Menon, Jennifer Myers, Sandeep Aswath Narayana, Adam Procter, and Tristan J. Webb. Intel nGraph: an intermediate representation, compiler, and executor for deep learning. CoRR, abs/1801.08058, 2018.

[9] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam. ShiDianNao: shifting vision processing closer to the sensor. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), pages 92–104, June 2015.

[10] Venmugil Elango, Norm Rubin, Mahesh Ravishankar, Hariharan Sandanagobalane, and Vinod Grover. Diesel: DSL for linear algebra and neural net computations on GPUs. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2018, pages 42–51, New York, NY, USA, 2018. ACM.

[11] Paul Feautrier. Some efficient solutions to the affine scheduling problem. I. One-dimensional time. International Journal of Parallel Programming, 21(5):313–347, Oct 1992.

[12] Sylvain Girbal, Nicolas Vasilache, Cédric Bastoul, Albert Cohen, David Parello, Marc Sigler, and Olivier Temam. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. International Journal of Parallel Programming, 34(3):261–317, Jun 2006.

[13] Tobias Grosser, Armin Groesslinger, and Christian Lengauer. Polly - performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 22(04), 2012.

[14] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. EIE: Efficient inference engine on compressed deep neural network. SIGARCH Comput. Archit. News, 44(3):243–254, June 2016.

[15] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. SIGARCH Comput. Archit. News, 45(2):1–12, June 2017.

[16] Ken Kennedy and Kathryn S. McKinley. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In Proceedings of Languages and Compilers for Parallel Computing (LCPC), 1993.

[17] Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. MAERI: enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. SIGPLAN Not., 53(2):461–475, March 2018.

[18] Chris Lattner. LLVM: An Infrastructure for Multi-Stage Optimization. Master's thesis, Computer Science Dept., University of Illinois at Urbana-Champaign, Urbana, IL, Dec 2002. See https://llvm.org/.

[19] Chris Lattner, Jacques Pienaar, and everyone on the MLIR team. MLIR primer: A compiler infrastructure for the end of Moore's law. In Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization, Compilers for Machine Learning Workshop, C4ML 2019, 2019.

[20] A. W. Lim, G. I. Cheong, and M. S. Lam. An affine partitioning algorithm to maximize parallelism and minimize communication. In Proceedings of the 13th ACM SIGARCH International Conference on Supercomputing, 1999.

[21] Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang. Optimizing CNN model inference on CPUs. CoRR, abs/1809.02697, 2018.

[22] Bert Moons and Marian Verhelst. An energy-efficient precision-scalable convnet processor in 40-nm CMOS. J. Solid-State Circuits, 52(4):903–914, 2017.

[23] ONNX. Open neural network exchange. https://github.com/onnx/onnx. [Online; accessed 8-March-2019].

[24] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally. SCNN: An accelerator for compressed-sparse convolutional neural networks. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 27–40, June 2017.

[25] William Pugh. Uniform techniques for loop optimization. In Proceedings of the 5th International Conference on Supercomputing, ICS '91, pages 341–352, New York, NY, USA, 1991. ACM.

[26] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. SIGPLAN Not., 48(6):519–530, June 2013.

[27] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. CoRR, abs/1710.05941, 2017.

[28] Jared Roesch, Steven Lyubomirsky, Logan Weber, Josh Pollock, Marisa Kirisame, Tianqi Chen, and Zachary Tatlock. Relay: A new IR for machine learning frameworks. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2018, pages 58–68, New York, NY, USA, 2018. ACM.

[29] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov, James Hegeman, Roman Levenstein, Bert Maher, Nadathur Satish, Jakob Olesen, Jongsoo Park, Artem Rakhov, and Misha Smelyanskiy. Glow: Graph lowering compiler techniques for neural networks. CoRR, abs/1805.00907, 2018.

[30] The XLA Team. XLA - TensorFlow, compiled. https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html, March 2017. [Online; accessed 8-March-2019].

[31] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. CoRR, abs/1802.04730, 2018.

[32] Richard Wei, Vikram S. Adve, and Lane Schwartz. DLVM: A modern compiler infrastructure for deep learning systems. CoRR, abs/1711.03016, 2017.

[33] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM, 52(4):65–76, April 2009.

[34] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation, 1991.

[35] Yuxin Wu and Kaiming He. Group normalization. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision – ECCV 2018, pages 3–19, Cham, 2018. Springer International Publishing.

[36] S. Yin, P. Ouyang, S. Tang, F. Tu, X. Li, L. Liu, and S. Wei. A 1.06-to-5.09 TOPS/W reconfigurable hybrid-neural-network processor for deep learning applications. In 2017 Symposium on VLSI Circuits, pages C26–C27, June 2017.
