Triton: an Intermediate Language and Compiler for Tiled Neural Network Computations

Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations Philippe Tillet H. T. Kung David Cox Harvard University Harvard University Harvard University, IBM USA USA USA Abstract 1 Introduction The validation and deployment of novel research ideas in The recent resurgence of Deep Neural Networks (DNNs) the field of Deep Learning is often limited by the availability was largely enabled [24] by the widespread availability of of efficient compute kernels for certain basic primitives. In programmable, parallel computing devices. In particular, con- particular, operations that cannot leverage existing vendor tinuous improvements in the performance of many-core libraries (e.g., cuBLAS, cuDNN) are at risk of facing poor architectures (e.g., GPUs) have played a fundamental role, device utilization unless custom implementations are written by enabling researchers and engineers to explore an ever- by experts – usually at the expense of portability. For this growing variety of increasingly large models, using more reason, the development of new programming abstractions and more data. This effort was supported by a collection for specifying custom Deep Learning workloads at a minimal of vendor libraries (cuBLAS, cuDNN) aimed at bringing the performance cost has become crucial. latest hardware innovations to practitioners as quickly as We present Triton, a language and compiler centered around possible. Unfortunately, these libraries only support a re- the concept of tile, i.e., statically shaped multi-dimensional stricted set of tensor operations, leaving the implementation sub-arrays. Our approach revolves around (1) a C-based lan- of novel primitives to experts [13, 17, 25]. guage and an LLVM-based intermediate representation (IR) This observation has led to the development of various for expressing tensor programs in terms of operations on Domain-Specific Languages (DSLs) for DNNs, based on poly- parametric tile variables and (2) a set of novel tile-level opti- hedral machinery (e.g., Tensor Comprehensions [43]) and/or mization passes for compiling these programs into efficient loop synthesis techniques (e.g., Halide [37], TVM [10] and GPU code. We demonstrate how Triton can be used to build PlaidML[22]). But while these systems generally perform portable implementations of matrix multiplication and con- well for certain classes of problems such as depthwise-separable volution kernels on par with hand-tuned vendor libraries convolutions (e.g., MobileNet [20]), they are often much (cuBLAS / cuDNN), or for efficiently implementing recent slower than vendor libraries in practice (see, e.g., Figure1), research ideas such as shift convolutions. and lack the expressivity necessary to implement structured sparsity patterns [28, 31, 47] that cannot be directly specified CCS Concepts • Computing methodologies → Paral- using affine array indices in nested loops. lel computing methodologies. Keywords compiler; neural networks; GPU ACM Reference Format: Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: An Inter- mediate Language and Compiler for Tiled Neural Network Com- putations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL ’19), June 22, 2019, Phoenix, AZ, USA. ACM, New York, NY, USA, 100 10 pages. https://doi.org/10.1145/3315508.3329973 Permission to make digital or hard copies of all or part of this work for Performance (TFLOP/S) Rooﬂine Model personal or classroom use is granted without fee provided that copies cuBLAS 10.0 are not made or distributed for profit or commercial advantage and that Triton Auto-TVM copies bear this notice and the full citation on the first page. Copyrights Tensor Comprehensions 1 for components of this work owned by others than the author(s) must 10− PlaidML be honored. Abstracting with credit is permitted. To copy otherwise, or 101 102 republish, to post on servers or to redistribute to lists, requires prior specific Arithmetic Intensity (TFLOP/GB) permission and/or a fee. Request permissions from [email protected]. MAPL ’19, June 22, 2019, Phoenix, AZ, USA Figure 1. Performance of various implementations of © 2019 Copyright held by the owner/author(s). Publication rights licensed C = ABT vs. Roofline46 [ ] Model (NVIDIA GeForce to ACM. GTX1070), where A 2 R1760×1760 and B 2 RN ×1760 as N ACM ISBN 978-1-4503-6719-6/19/06...$15.00 https://doi.org/10.1145/3315508.3329973 modulates arithmetic intensity. MAPL ’19, June 22, 2019, Phoenix, AZ, USA Philippe Tillet, H. T. Kung, and David Cox These issues have often been addressed by the use of Listing 1. C = A × BT in Triton-C. Keywords specific to micro-kernels [11, 21] – i.e., hand-written tile-level intrin- Triton are shown in purple. sics – but this solution requires a lot of manual labor and // Tile shapes are parametric and can be optimized lacks portability. And while several high-level programming // by compilation backends abstractions for tiling have recently been proposed [23, 41], c o n s t tunable int TM = {16, 32, 64, 128} c o n s t tunable int TN = {16, 32, 64, 128} underlying compiler backends still lack support for tile-level c o n s t tunable int TK = {8, 16} operations and optimizations. To this end we present Tri- //C=A ∗ B.T 1 k e r n e l void matmul_nt(float ∗ a , float ∗ b ,float ∗ c , ton (Figure2), an open-source intermediate language and i n t M, in N, intK) compiler for specifying and compiling tile programs into { //1D tile of indices efficient GPU code. i n t rm[TM] = get_global_range(0); i n t rn[TN] = get_global_range(1); i n t rk[TK] = 0 ... TK; Interface to existing //2D tile of accumulators DSLs f l o a t C[TM, TN] = 0; (Not addressed Triton-C in this paper) //2D tile of pointers f l o a t ∗ pa[TM, TK] = a + rm[:, newaxis] + rk ∗ M; f l o a t ∗ pb[TN, TK] = b + rn[:, newaxis] + rk ∗ K; Triton-IR f o r(int k=K; k>= 0; k −= TK ) { bool check_k[TK] = rk < k; bool check_a[TM, TK] = (rm < M)[:, newaxis] && check_k; Triton-JIT bool check_b[TN, TK] = (rn < N)[:, newaxis] && check_k; Auto-Tuner // load tile operands f l o a t A[TM, TK] = check_a ? ∗ pa : 0 ; Machine Benchmark f l o a t B[TN, TK] = check_b ? ∗ pb : 0 ; -Independent // accumulate Passes C += dot(A, trans(B)); Machine // update pointers -Dependent pa = pa + TK∗M; Passes pb = pb + TK∗N; } Machine-Code // write −back accumulators f l o a t ∗ pc[TM, TN] = c + rm[:, newaxis] + rn ∗ M; bool check_c[TM,TN] = (rm < M)[:, newaxis] && (rn < N); @check_c ∗ pc = C ; Figure 2. Overview of Triton } The main contributions of this paper are summarized as any compilation target; (2) a set of tile-level machine- follows: dependent passes for generating efficient GPU-ready • Triton-C (Section3): A C-like language for expressing LLVM-IR; and (3) an auto-tuner that optimize any tensor programs in terms of parametric tile variables. meta-parameter associated with the above passes. The purpose of this language is to provide a stable in- • Numerical Experiments (Section6): A numerical terface for existing DNN transcompilers (e.g., PlaidML, evaluation of Triton that demonstrates its ability to Tensor Comprehensions) and programmers familiar (1) generate matrix multiplication implementations with CUDA. Listing1 shows the Triton-C source code on par with cuBLAS and up to 3× faster than alterna- associated with a simple matrix multiplication task. tives DSLs on recurrent and transformer neural net- • Triton-IR (Section4): An LLVM-based Intermediate works; (2) re-implement cuDNN’s IMPLICIT_GEMM algo- Representation (IR) that provides an environment suit- rithm for dense convolution without performance loss; able for tile-level program analysis, transformation and and (3) create efficient implementations of novel re- optimization. Listing5 shows the Triton-IR code for a search ideas such as shift-conv [47] modules. Rectified Linear Unit (ReLU) function. Here Triton-IR This paper will be prefaced by a brief analysis of the exist- programs are constructed directly from Triton-C dur- ing related literature (Section2) and concluded by a summary ing parsing, but automatic generation from embedded and directions of future work (Section7). DSLs or higher-level DNN compilers (e.g., TVM) could also be explored in the future. 2 Related Work • Triton-JIT (Section5): A Just-In-Time (JIT) compiler The existence of frameworks [1, 9, 36] and libraries for and code generation backend for compiling Triton-IR Deep Learning has been critical to the emergence of novel programs into efficient LLVM bitcode. This includes (1) neural network architectures and algorithms. But despite a set of tile-level, machine-independent passes aimed advances in analytical [5, 48] and empirical [6, 30] heuris- at simplifying input compute kernels independently of tics for linear algebra compilers, these software still invari- ably rely on hand-optimized sub-routines (e.g., cuBLAS and 1http://triton-lang.org cuDNN). This has led to development of various DSLs and Triton: An Intermediate Language for Tiled Neural Network Computations MAPL ’19, June 22, 2019, Phoenix, AZ, USA compilers for DNNs, generally based on one of three distinct Listing 3. Grammar extensions for Triton-C. We assume the approaches: existence of certain C constructs shown in blue. • Tensor-level IRs have been used by XLA [16] and // Broadcasting semantics Glow [38] to transform tensor programs into prede- slice: ':' | 'newaxis ' slice_list: slice| slice_list ',' slice fined LLVM-IR and CUDA-C operation templates (e.g., slice_expr : postfix_expr | expr '[' slice_list ']' tensor contractions, element-wise operations, etc.) us- // Range initialization constant_range: expr '... ' expr ing pattern-matching. // Intrinsics • The polyhedral model [18] has been used by Tensor global_range: 'get_global_range ''(' constant ')' dot: 'dot ''(' expr ',' expr ')' Comprehensions (TC) [43] and Diesel [14] to parame- trans: 'trans ''(' expr ',' expr ')' terize and automate the compilation of one or many intrinsic_expr: global_range| dot| trans // Predication DNN layers into LLVM-IR and CUDA-C programs.

Triton: an Intermediate Language and Compiler for Tiled Neural Network Computations

SOL: Effortless Device Support for AI Frameworks Without Source Code Changes

Training Professionals and Researchers on Future Intelligent On-Board Embedded Heterogeneous Processing

Plaidml Documentation

Advancing Compiler Optimizations for General-Purpose & Domain-Specific Parallel Architectures

Machine Learning Framework for Petrochemical Process Industry

Fast Geometric Learning with Symbolic Matrices

Benchmarking Contemporary Deep Learning Hardware and Frameworks: a Survey of Qualitative Metrics Wei Dai, Daniel Berleant

Benchmarking Contemporary Deep Learning Hardware And

Stripe: Tensor Compilation Via the Nested Polyhedral Model

Comparing the Costs of Abstraction for DL Frameworks

Towards Transparent Neural Network Acceleration

Package 'Keras'