Stripe: Tensor Compilation via the Nested Polyhedral Model
Tim Zerrell and Jeremy Bruestle
[email protected], [email protected]
Intel

March 18, 2019
arXiv:1903.06498v1 [cs.DC] 14 Mar 2019

Abstract

Hardware architectures and machine learning (ML) libraries evolve rapidly. Traditional compilers often fail to generate high-performance code across the spectrum of new hardware offerings. To mitigate this, engineers develop hand-tuned kernels for each ML library update and hardware upgrade. Unfortunately, this approach requires excessive engineering effort to scale or maintain with any degree of state-of-the-art performance. Here we present a Nested Polyhedral Model for representing highly parallelizable computations with limited dependencies between iterations. This model provides an underlying framework for an intermediate representation (IR) called Stripe, amenable to standard compiler techniques while naturally modeling key aspects of modern ML computing. Stripe represents parallelism, efficient memory layout, and multiple compute units at a level of abstraction amenable to automatic optimization. We describe how Stripe enables a compiler for ML in the style of LLVM that allows independent development of algorithms, optimizations, and hardware accelerators. We also discuss the design exploration advantages of Stripe over kernel libraries and schedule-based or schedule-space-based code generation.

1 Introduction

The industry is producing an explosion of varied and creative hardware accelerator architectures [6, 9, 14, 15, 22, 24, 36]. Designs tend to be optimized for specific goals, such as power efficiency or performance, and often breed even more specialized architectures for very targeted use cases, such as the MAERI [17]. To take advantage of these innovative designs, ML software must be optimized to target the specialized hardware.

1.1 Kernel Libraries

Algorithmic developments aimed at improving accuracy, training or inference performance, regularization, and more continue to progress rapidly [27, 35]. Nevertheless, these typically retain the same regularities that make specialized ML architectures feasible, and could in principle be efficiently run on ML accelerators. However, the traditional kernel library approach requires a kernel for each hardware-network architectural feature pair, creating a combinatorial explosion of optimization work that is infeasible with the rapid growth of both hardware and algorithm designs.

One way to achieve all these optimizations is to write an extensively customized kernel to account for each supported machine learning operation and each materially different input and output shape on each supported hardware platform. As new hardware architectures develop, new kernels must be written for all supported operations. If new operations are devised, they must be added to each hardware target. This coupling creates a maintenance nightmare; decoupling them via automatic code generation, where the hardware configuration is separate from both the optimization passes and the operations themselves, would be ideal. Unfortunately, general-purpose compilers are not aware of the regularities of ML workloads (as discussed in section 2), and so the required optimizations are intractable.

1.2 Compilers

LLVM [18] is a community-driven compiler infrastructure that demonstrates features we want in a compiler for ML workloads. Its suite of optimization and transformation passes applied to a consistent IR makes it possible to optimize workloads in a progressive and modular fashion. Direct transformation of the IR allows for deeper changes than those available to a model maintaining some fixed structure along with a modifiable implementation or interpretation. Moreover, an excellent community of experts with varying priorities actively contributes to the ecosystem. Unfortunately, and despite these benefits, LLVM is still general purpose and cannot be directly used to compile high-performance ML code.

Special-purpose optimization and compilation techniques have also been developed. Loop nest transformation and compilation algorithms [16, 20], including the development of the polyhedral model [1, 34], optimize constrained loop nests via tiling and data layout transformations. Frameworks for extracting fine-grained parallelism in traditional workloads and applying such polyhedral techniques, including Polly [13] and PLuTo [3], have proven beneficial. However, these techniques have not been sufficient to achieve peak performance for many ML workloads [2].

Various frameworks, including URUK [12], Halide [26], and Tiramisu [2], separate loop nest semantics from execution order via a scheduling language. TVM [4] also does this, building on Halide by creating a tensor expression language and adopting a "decoupled" scheduling approach that allows for hardware-specific optimizations. The result is a cleaner separation of expertise between network architecture and hardware design; see for example Liu et al. [21] on optimizing for CPUs in TVM. AutoTVM [5] introduces automatic selection of a schedule from a schedule search space using a deep learning approach with transfer learning. This means hardware-specific optimizations can be written with a focus on getting the right structure (whether to try tiling, whether to try different memory layouts (and which), whether to try tensorization, etc.) without needing to manually experiment with the exact parameters for these optimizations. These schedule spaces are still coupled to both the operation and the hardware architecture.
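As a concrete illustration of the algorithm/schedule split described above, consider the following toy Python sketch (our own illustrative code, not Halide's or TVM's actual API). The matrix-multiply algorithm is written once; two interchangeable schedules, naive and tiled, control only execution order, which is the part a scheduling language exposes for per-target tuning.

    # A minimal sketch of algorithm/schedule decoupling in the style of
    # Halide/TVM. Illustrative Python only; no framework's real API.

    def matmul(A, B, K):
        """Algorithm: C[i, j] = sum_k A[i, k] * B[k, j] (semantics only)."""
        def value(i, j):
            return sum(A[i][k] * B[k][j] for k in range(K))
        return value

    def schedule_naive(value, M, N):
        """Schedule 1: plain row-major traversal."""
        return [[value(i, j) for j in range(N)] for i in range(M)]

    def schedule_tiled(value, M, N, tile=2):
        """Schedule 2: visit tiles first, e.g. to match a target's cache
        or register blocking. The algorithm definition is untouched."""
        C = [[0] * N for _ in range(M)]
        for io in range(0, M, tile):
            for jo in range(0, N, tile):
                for i in range(io, min(io + tile, M)):
                    for j in range(jo, min(jo + tile, N)):
                        C[i][j] = value(i, j)
        return C

    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    v = matmul(A, B, K=2)
    assert schedule_naive(v, 2, 2) == schedule_tiled(v, 2, 2)  # same semantics

The tile parameter here stands in for the kind of knob a schedule search space would expose; the coupling noted above is that the space of such knobs must still be written per operation and per hardware architecture.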
Relay [28] is an IR for tensor expressions used in the TVM stack. While its functionality has some overlap with Stripe (transformations enabling tensorization, for example), it is mostly a higher-level IR than Stripe; many of its tasks are represented in Tile in the PlaidML stack (automatic differentiation, for example) or even at the graph level.

nGraph [8] provides optimization opportunities at the graph level, where the network-to-device compilation can be managed with a series of "subgraphs". Since graphs can be managed at this level in both a static and a dynamic manner, the performance increase can be used to further accelerate training workloads, or (as is the more common use case for nGraph) to output inference computations in environments where low latency is important. nGraph may be used in conjunction with PlaidML (see section 3.4) to provide complementary graph optimizations.

Glow [29] offers graph compilation and does not generate code for operations like GEMMs or convolutions, instead relying on kernel libraries or accelerator-specific compilers.

Other machine learning domain-specific compilers include XLA [30], Diesel [10], DLVM [32], and TensorComprehensions [31].

1.3 Stripe

We propose a compiler structured along the same lines as LLVM: it lowers source code to an intermediate representation (IR) and selects and parameterizes a list of optimization passes from a common pool; these passes are then iteratively applied to the IR; and only after all have been applied is the IR code lowered to hardware-specific instructions. A key innovation of our proposed compiler is the IR, called Stripe, which abstracts to a granularity fine enough to represent the full new functionality available on ML accelerators and coarse enough to allow automatic compilation of high-performance ML code.

Stripe is built to represent tensor operations via the Nested Polyhedral Model (Section 3.1). This model nests polyhedra in the sense that, for each point in a parent polyhedron, a child polyhedron is defined. This nesting naturally represents tiling, partitioning, tensorization, and other "blocking" operations. It also allows assignment of nested polyhedra to nested memory units, giving the compiler a way to match the compute structure to the caching structure for multilevel hardware topologies.

At the same time, when this hardware-based complexity is not needed, Stripe does not require it to be specified. Stripe code representing a single tensor operation can be represented as an unnested polyhedron, and a network can be represented as a list of polyhedra. This allows straightforward lowering to Stripe from a language whose syntax directly represents mathematical formulas for the tensor operations (PlaidML's Tile language, for example). The representation is then transformed through a series of optimization passes that divide and rewrite these basic polyhedra into nested polyhedra appropriate for maximizing performance on the hardware target.
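To sketch this lowering path end to end, the toy Python below (illustrative only; not Stripe's or PlaidML's real data structures) treats a Tile-style matrix-multiply contraction as an unnested polyhedron, the set of integer points (i, j, k) in a box, and then shows a tiling pass rewriting it as a nested polyhedron: a parent polyhedron of tile origins, with a child polyhedron defined for each parent point.

    # Toy illustration of the nested polyhedral idea. A Tile-style
    # contraction such as
    #     C[i, j : M, N] = +(A[i, k] * B[k, j])
    # lowers to a single unnested polyhedron: all integer points
    # (i, j, k) with 0 <= i < M, 0 <= j < N, 0 <= k < K.

    import itertools

    def unnested(M, N, K):
        """One flat polyhedron covering the whole contraction."""
        return itertools.product(range(M), range(N), range(K))

    def tiled(M, N, K, ti, tj, tk):
        """Tiling pass: a parent polyhedron over tile origins; for each
        parent point, a child polyhedron of the points inside that tile.
        Children could be assigned to smaller, faster memories."""
        for io, jo, ko in itertools.product(range(0, M, ti),
                                            range(0, N, tj),
                                            range(0, K, tk)):  # parent
            yield from itertools.product(range(io, min(io + ti, M)),
                                         range(jo, min(jo + tj, N)),
                                         range(ko, min(ko + tk, K)))  # child

    def matmul(A, B, points):
        M, N = len(A), len(B[0])
        C = [[0] * N for _ in range(M)]
        for i, j, k in points:
            C[i][j] += A[i][k] * B[k][j]  # iteration order may differ,
        return C                          # but the point set is identical

    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    assert matmul(A, B, unnested(2, 2, 2)) == matmul(A, B, tiled(2, 2, 2, 1, 2, 2))

In Stripe proper, the choice of tile sizes and the assignment of each child polyhedron's data to a level of the memory hierarchy are made by hardware-configured optimization passes rather than written by hand.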
We see several key advantages to Stripe. Fore-

    Kernel Library:
        foreach HW Architecture
          foreach HW Version
            foreach Kernel
              foreach Input Shape
                foreach Output Shape
                  write_kernel

    Schedule Space Searching:
        foreach Kernel
          write_algorithm
          foreach HW Architecture
            write_schedule_space
            foreach HW Version
              foreach Input Shape
                foreach Output Shape
                  select_schedule

    Stripe:
        foreach Kernel
          write_algorithm
        foreach HW Architecture
          create_stripe_config
          foreach HW Version
            set_config_params

Figure 1: A comparison of the manual engineering needed under different code generation approaches. For a schedule search space approach like AutoTVM, autotuning is used to select the schedule; other scheduling approaches may instead manually write schedules for the various hardware versions and tensor shapes (and thus do not use a schedule search space in favor of manual schedule "selection"). For Stripe, note that hardware configuration is done independently of the
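To make the scaling contrast in Figure 1 concrete, the short sketch below counts the hand-written artifacts each approach implies, using hypothetical catalog sizes chosen purely for illustration; the formulas simply mirror the loop structures in the figure.

    # Hand-engineering counts implied by Figure 1, with hypothetical sizes.
    archs, versions, kernels, in_shapes, out_shapes = 4, 3, 50, 10, 10

    kernel_library = archs * versions * kernels * in_shapes * out_shapes
    schedule_space = kernels + kernels * archs  # algorithms + search spaces
    stripe         = kernels + archs            # algorithms + configs

    print(kernel_library)  # 60000 hand-tuned kernels
    print(schedule_space)  # 250 artifacts (schedule selection is automated)
    print(stripe)          # 54 (per-version set_config_params are parameter
                           # tweaks to an existing config, not new code)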