Array Languages Make Neural Networks Fast

Artjoms Šinkarovs, Hans-Nikolai Vießmann, Sven-Bodo Scholz
Heriot-Watt University, Edinburgh, Scotland, UK

Abstract

Modern machine learning frameworks are complex: they are typically organised in multiple layers, each of which is written in a different language, and they depend on a number of external libraries, but at their core they mainly consist of tensor operations. As array-oriented languages provide perfect abstractions to implement tensor operations, we consider a minimalistic machine learning framework that is shallowly embedded in an array-oriented language and we study its productivity and performance. We do this by implementing a state of the art Convolutional Neural Network (CNN) and compare it against implementations in TensorFlow and PyTorch, two state of the art industrial-strength frameworks. It turns out that our implementation is 2 and 3 times faster, even after fine-tuning the TensorFlow and PyTorch implementations to our hardware, a 64-core GPU-accelerated machine. The size of all three CNN specifications is the same, about 150 lines of code. Our mini framework is 150 lines of highly reusable hardware-agnostic code that does not depend on external libraries. The compiler for a host array language automatically generates parallel code for a chosen architecture. The key to such a balance between performance and portability lies in the design of the array language; in particular, the ability to express rank-polymorphic operations concisely, yet being able to do optimisations across them. This design builds on very few assumptions, and it is readily transferable to other contexts, offering a clean approach to high-performance machine learning.
1 Introduction

With the increasing success of machine learning in various domains, scientists attempt to solve more and more complex problems using neural networks and deep learning. Increased complexity in the context of deep learning typically means more layers of neurons and larger training sets, all of which results in the necessity to process larger amounts of data. As a result, modern networks require advanced and powerful hardware: modern machine learning applications are envisioned to run on massively parallel high-throughput systems that may be equipped with GPUs, TPUs or even custom-built hardware.

Programming such complex systems is very challenging, specifically in an architecture-agnostic way. Therefore, there is a big demand for a system that abstracts away architectural details, letting the users focus on the machine learning algorithms. TensorFlow or PyTorch solve exactly that problem: they provide a convenient level of abstraction, offering a number of building blocks that machine learning scientists can use to specify their problems. Productivity of such a solution is quite high, as these frameworks are embedded into high-level languages such as Python or C++.

However, turning a framework-based specification into efficient code remains challenging. There is a huge semantic gap between the specification and the hardware, yet frameworks such as TensorFlow and PyTorch introduce many levels of abstraction that one needs to deal with. Typically there is a Python front-end and a core library in C++ that depends on numerous external libraries for linear algebra, tensor operations, and for GPUs and other specialised hardware. Such complexity makes it challenging to deliver excellent performance: optimisations across multiple layers of abstraction as well as across multiple external libraries inherently come with overheads.

The key question we are investigating is: can we identify a single layer of abstraction where, on the one hand, we can express the core building blocks and generate efficient parallel code and that, on the other hand, is high-level enough to be used as a front-end?

Based on the observation that neural networks can be concisely expressed as computations on high-ranked tensors, we look into using a shape-polymorphic array language à la APL [22] as the central layer of abstraction. While APL itself seems suitable in terms of expressiveness [45], the interpreter-based implementation of its operators, not surprisingly, does not readily provide parallel performance anywhere near that of TensorFlow or PyTorch.

However, over the last 50 years we have seen quite some research into the compilation of array languages into efficient parallel code [3, 5, 19, 35, 37]. These languages leverage whole-program optimisations and they offer decent levels of parallel performance. They also offer high program portability, as inputs are typically hardware-agnostic and all the decisions on optimisations and code generation are taken by the compiler. A user can influence these decisions by passing options, but no code modifications are required.

For the purposes of this paper we use SaC [35], a functional array language, as our implementation vehicle. We focus on a simple yet frequently benchmarked CNN for recognising handwritten characters. First we implement the building blocks that are required to define the chosen CNN in native SaC, and then we use these building blocks to define the network. We compare the resulting code size and performance against TensorFlow and PyTorch. We observe that the overall problem can be expressed concisely (300 lines of native SaC code that does not depend on any specialised numerical libraries like MKL, only on system libraries like libc or pthreads) and that, on a GPU-accelerated 64-core machine, our solution performs two and three times faster than the TensorFlow- and PyTorch-based implementations. The key aspect of such good performance is first-class support for multi-dimensional arrays in a functional setting, followed by a number of well-known code-generation techniques used by the chosen compiler.

This example suggests that, at least for this particular domain, the trade-off between conciseness, performance and development time is quite satisfying.

The individual contributions of the paper are:
• we make a case for using array languages to host a machine-learning framework,
• we provide a concise implementation of the CNN for hand-written image recognition in SaC without using any domain-specific libraries, and
• we present a performance evaluation of the CNN in SaC against multiple variants of the PyTorch- and TensorFlow-based versions of the algorithm on a high-performance cluster node.

The rest of the paper is organised as follows. In Section 2 we briefly introduce machine learning algorithms and state of the art frameworks. In Sections 2 and 3 we introduce the notion of functional arrays and describe our implementation of the CNN. Section 4 presents the performance and productivity evaluation. Section 5 reviews related work, and we conclude in Section 6.

2 Background

In the last decade machine learning attracted a lot of attention as it offered solutions to several practical problems that mainly have to do with automatic recognition of complex patterns: objects in images or videos, automatic text translation, recommendation systems, speech recognition, etc. Due to space limits we only focus on the computational aspects of machine learning algorithms and CNNs in particular. For an in-depth review refer to [20, 33].

All machine learning algorithms are based around the idea that we want to learn (through guessing) the function f that maps the input variables X to the output variables Y, i.e. Y = f X, in the best possible way, according to some cost function. After f is found for the existing sample, we would like to make predictions for new inputs.

Linear Regression

The simplest example of a machine learning algorithm is linear regression [11, 24]. It is probably one of the most well-understood algorithms in the area, yet it demonstrates fundamental principles that will also be used in CNNs. Given a set of n statistical units {yᵢ, xᵢ₁, …, xᵢₘ}, for i ∈ {1, …, n}, we assume that the relationship between the ys and the xs is linear, so that each yᵢ can be computed as yᵢ = β₀ + β₁xᵢ₁ + ⋯ + βₘxᵢₘ + εᵢ. This can be written in matrix form as y = Xβ + ε, where y = (y₁, …, yₙ)ᵀ is the vector of outputs, X is the n × (m+1) matrix whose i-th row is (1, xᵢ₁, …, xᵢₘ), β = (β₀, β₁, …, βₘ)ᵀ is the vector of parameters, and ε = (ε₁, …, εₙ)ᵀ is the vector of errors.

There exists a large number of methods to estimate or infer the parameters β and ε such that our model function "best" fits the data. For example, one commonly used method is linear least squares [24]. We assume that ε = 0 and the cost function that we want to minimise is Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)², where ŷ = Xβ. The attractiveness of this method lies in the existence of a closed solution for the parameter vector β, given by the formula β = (XᵀX)⁻¹Xᵀy.
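The closed form follows from setting the gradient of the least-squares cost to zero; as a brief reminder (our addition, standard linear algebra, not spelled out in the paper):

$$
\nabla_{\beta}\,\lVert y - X\beta\rVert^{2}
  = -2\,X^{\top}(y - X\beta) = 0
  \;\Longrightarrow\;
  X^{\top}X\,\beta = X^{\top}y
  \;\Longrightarrow\;
  \beta = (X^{\top}X)^{-1}X^{\top}y,
$$

provided that $X^{\top}X$ is invertible.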
This example demonstrates two important aspects. First, instead of searching through all the functions from X to Y, we restrict the general shape of that function and introduce a set of parameters (the β-s in our case). The search for a function reduces to the search for the parameters. Secondly, computationally, most of the involved operations can be reduced to linear algebra operations. This means that we will need a representation for vectors, matrices, tensors and common operations on them when implementing machine learning algorithms.

Neural Networks

Continuing on from linear regression, we can consider the function f : X → Y that we want to learn as a composition of functions gᵢ that can be further decomposed into smaller functions. Overall such a composition forms a graph (or network) connecting the inputs X with the outputs Y.

A typical function composition takes the form f x = A(Σᵢ wᵢ (gᵢ x)), where A is an activation function (usually it is chosen to be continuous and differentiable, e.g. sigmoid, hyperbolic tangent, etc.) and the wᵢ are so-called weights. These weights are the parameters of our approximation that we want to find, similarly to β in linear regression, so that our cost function is minimised.
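To connect this formula with the array code that follows later in the paper, here is a sketch (ours, not code from the paper) of a single neuron in SaC: a weighted sum of the inputs passed through a sigmoid activation. It assumes sum and exp from the SaC standard library, which are also used in Section 3.

float neuron(float[.] x, float[.] w, float b)
{
  /* weighted sum of the inputs plus a bias */
  z = sum({ iv -> w[iv] * x[iv] | iv < shape(x) }) + b;
  /* sigmoid activation A */
  return 1f / (1f + exp(-z));
}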
Usually, neural networks are designed in a way that offers slicing of the elementary functions gᵢ into layers, so that all the elements in a given layer can be computed independently. As a layer is an activation function of the weighted sum of other layers, most of the transitions in the network can be expressed as matrix or tensor operations.

Very often, due to the size and complexity of the network, the closed solution that finds optimal weights either does not exist or is very difficult to find. Therefore, weight prediction is usually performed in an iterative manner. In this case, the concept of backpropagation, a method to calculate the gradient of the objective function with respect to the weights, becomes of significant importance. On the one hand it provides a working solution that is straight-forward to compute: w := w − η∇F(w), where w are all the weights in the given network. In the cases when our objective function can be written as F = Σᵢ Fᵢ, the gradient descent step can be rewritten as w − η∇(Σᵢ Fᵢ) = w − η Σᵢ ∇Fᵢ. Furthermore, stochastic gradient descent [49] approximates the true gradient as w := w − η∇Fᵢ(w), which is typically more efficient. Intuitively, instead of processing a whole batch of items before updating the weights, we can update the weights after processing one individual item. Finally, with carefully chosen activation functions A, the computation of the backpropagation can be expressed as a composition of linear-algebraic operations.
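As an illustration of that last point (our addition, standard calculus): for the sigmoid activation used later in the paper, $\sigma(x) = 1/(1+e^{-x})$, the derivative can be expressed through the function value itself,

$$ \sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr), $$

so the gradients needed by backpropagation can be computed from values that the forward pass has already produced, using only element-wise multiplications and subtractions.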
Chosen Problem

CNNs [20, 33] are neural networks where at least one layer is computed as a convolution of the values from the previous layers. In this paper we will implement a CNN and use it to recognise hand-written digits. We base our implementation on Zhang's network design [50]. For training and recognition we rely on the widely used MNIST data set (see http://yann.lecun.com/exdb/mnist/) as input.

State of the Art Machine Learning Frameworks

The overall designs of state of the art machine learning frameworks such as TensorFlow [1], Caffe [23], CNTK [48], Torch [7], or PyTorch [29] are very similar. There is a core part written in C/C++ with the use of external libraries, and there is an interface part, usually a Python library. The core part contains highly-optimised kernels doing tensor operations, linear algebra operations, and convolutions, that are pre-optimised for the range of supported architectures. All these frameworks support computations on GPUs, multi-threaded and distributed executions. TensorFlow also supports custom hardware known as Tensor Processing Units (TPUs).

The main difference between the frameworks lies in the number of building blocks that they provide, which in turn influences the productivity of data scientists. For instance, Caffe and CNTK make it possible to specify networks via a configuration file, allowing users to avoid programming entirely. Differences in the underlying libraries (BLAS, tensor libraries, GPU libraries) and optimisation techniques (the XLA compiler, just-in-time compilation, kernel fusion) lead to runtime differences on the chosen hardware.

All frameworks have in common that they construct an internal representation of the dataflow graph of the network. This representation makes it possible to support automatic differentiation, which automates the computation of gradient descents. Furthermore, such dataflow graphs are being analysed in order to exploit the natural concurrency of the network, optimise the scheduling of multiple network nodes across the available devices or threads, etc. In TensorFlow and CNTK the graph is statically fixed, whereas in PyTorch the graph can change at runtime.

The Essence

The underlying linear algebra of CNNs suggests that any implementation is amenable to a formulation based on multi-dimensional arrays. Any declarative array language as powerful as APL, the Ψ-calculus, or SaC can be used to express tensor operations.

Conceptually, all that is needed is an abstraction for n-dimensional arrays, with three basic primitives: selection, shape-enquiry and some form of n-dimensional map functionality. In SaC [14, 35], arrays can be constructed by using square brackets:

  a = [1, 2, 3]
  b = [[1, 2], [3, 4], [5, 6]]
  c = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

It is assumed here that all arrays are rectangular, i.e. all nestings are homogeneous, and expressions like [[1, 2], [3]] are considered ill-formed. Each array has a shape, which is a vector (1-dimensional array) denoting the number of elements per axis. For the above examples, we have:

  shape(a) = [3]    shape(b) = [3, 2]    shape(c) = [2, 2, 2]

All expressions are considered arrays; empty arrays as well as scalar values also have shapes:

  shape([]) = [0]    shape([[]]) = [1, 0]    shape(42) = []

The shape of an empty vector is [0]; the shape of the 2D array containing one row that contains no elements is [1, 0]; and the shape of a scalar value is the empty vector. Selections have a C-like syntax a[iv], where iv is a vector of indices (a[i1, ..., in] is shorthand for a[[i1, ..., in]]), and obey the following two constraints:

1. shape(iv) ≤ shape(shape(a)): the length of the index vector can at most be as long as the array has axes, and
2. iv < shape(a), element-wise: the values of the index vector must be in range, i.e. element-wise less than the corresponding shape elements.

In case iv has maximal length, the corresponding scalar element in a is selected. Otherwise, the selection pertains to the first axes of a only and returns a sub-array whose shape corresponds to those components of the shape of a for which no indices were provided. In case iv is empty, the entire array is selected.
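To make these rules concrete, here are a few selections on the arrays a, b and c defined above (our illustration, following the rules just stated):

  b[1]     == [3, 4]             /* shorter index: selects a sub-array (a row) */
  b[1, 0]  == 3                  /* full-length index: selects a scalar        */
  c[0]     == [[1, 2], [3, 4]]   /* indexes the first axis only                */
  a[[]]    == [1, 2, 3]          /* empty index vector: the entire array       */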

Finally, SaC provides a data-parallel array constructor for n-dimensional arrays named with-loop. For the context of this paper, we use its shorthand notation that we call array comprehension. An n-dimensional array can be specified by an expression of the form:

  { iv -> expr | iv < shp }

where the shape of the result is determined by the value of shp, and each element is computed by evaluating the expression expr. SaC allows expr to evaluate to non-scalar arrays, provided that all these expressions are of identical shape. The shape of the overall result is then the concatenation of shp and shape(expr). For example, we have:

  { iv -> 1 | iv < [3] }       == [1, 1, 1]
  { iv -> [1, 2] | iv < [2] }  == [[1, 2], [1, 2]]

The index variable can be referred-to in the element expression, e.g. an expression of the form { iv -> a[iv]+1 | iv < shape(a) } computes an array that has the same shape as a given array a but whose elements have been incremented by one. This notation is an extended version of the set-expressions in [16]; it has been implemented in the latest version of the SaC compiler and will be available in the next release.
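As a further example (ours, not from the paper) of how the result shape is the concatenation of the iteration shape and the element shape, an outer product of two vectors can be written as a comprehension whose element expression is itself a vector; it relies on the overloaded scalar-times-array multiplication from the SaC standard library:

float[.,.] outer(float[.] xs, float[.] ys)
{
  /* xs[iv] is a scalar and ys is a vector, so every element of the
     result is a vector and the result shape is the concatenation of
     shape(xs) and shape(ys) */
  return { iv -> xs[iv] * ys | iv < shape(xs) };
}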
More on SaC

We capture the set of assumptions in SaC that enable a compiler to generate efficient code. Firstly, SaC is a first-order functional language. This means that all the functions are pure, and all data is immutable. Conceptually, every assignment copies its right-hand side and every function call copies its arguments. Such an assumption makes memory management completely transparent: there is no way to force a memory allocation, and there is no way to pass a pointer. The concept of pointers and references does not exist, as it would break the assumption about purity. This makes all the optimisations much simpler, as there is no need to solve the aliasing or ownership problems. At runtime we avoid copying data that can be shared with the help of reference counting.

Secondly, the SaC compiler has multiple backends for generating code for sequential, multi-threaded and CUDA architectures from a single specification. No user annotations are needed to indicate parallel regions, as the with-loop per se exposes parallelism. Given that every iteration can be run concurrently, the compiler chooses which array comprehension will be run in parallel and generates either a multi-threaded version of the code or a CUDA kernel.

Finally, SaC uses C-like syntax for functions and comes with a rich standard library of pre-defined array operators.

3 CNN

In this section we describe our implementation of a CNN, using [50] as a blueprint. It constitutes a typical CNN for recognising handwritten images of digits. Figure 1 shows the construction of the network, which starts from a 28 × 28 pixel image of a digit and produces a 10-element vector ŷ through a sequence of convolution and pooling layers. The vector ŷ contains the probabilities of the input actually depicting the digits 0–9.

Convolution

In the first layer C1, we compute six convolutions of the input image I with 5 × 5 matrices of weights k1,i, producing six 24 × 24 arrays. One such convolution can be implemented as:

float[*] conv(float[*] I, float[*] k)
{
  return { iv -> sum({ ov -> I[iv+ov] * k[ov]
                     | ov < shape(k) })
         | iv < shape(I) - shape(k) + 1 };
}

The type float[*] denotes an array of floating point numbers of arbitrary shape. For our image I of shape [28, 28] and any of the weights k1,i of shape [5, 5], the result is of shape [28, 28] − [5, 5] + 1 = [24, 24]. Each element at index position iv is computed as a sum of 5 × 5 elements in I multiplied with the corresponding weights in k.

Using conv we can define a function mconv to compute the six convolutions and to add the individual biases b1,i to each convolution (denoted as ⊕ in Figure 1):

float[*] mconv(float[*] I, float[*] k, float[*] b)
{
  return { i -> conv(I, k[i]) + b[i]
         | i < shape(b) };
}

This function is rank-polymorphic, and in the context of the C1 layer we chose to store all ks in a 3D array of shape [6, 5, 5]. One bias per convolution leads to the shape of b being [6]. For every index in b, mconv computes the convolution of I with k[i], adjusted by adding the bias b[i]. The expression k[i] selects a [5, 5] sub-array at the corresponding index, and '+ b[i]' adds the scalar to every element of the [24, 24] array, resulting in the overall shape [6, 24, 24].

The last step in C1 is the application of the sigmoid activation function to all values. We define it with the overloaded versions of the mathematical functions provided in the standard library of SaC:

float[*] sigmoid(float[*] in)
{
  return 1f / (1f + exp(-in));
}

This is a rank-polymorphic shape-preserving function, so its application to the result of mconv of shape [6, 24, 24] yields the desired result of the same shape.

Figure 1. CNN for digit recognition. The picture is taken from [50]. [Diagram of the network: input image I, convolution layer C1, pooling layer S1, convolution layer C2, pooling layer S2, fully connected layer FC, output ŷ.]

Consider now the convolution layer C2. If we choose to represent s as a 3D array of shape [6, 12, 12], our rank-polymorphic specification of mconv becomes immediately applicable here. Intuitively, if s is a single object, then all the left-hand sides of the arrows from S1 to C2 in Figure 1 will merge into a single point, similarly to the first convolution. Our new input to conv is of shape [6, 12, 12], so each of the weights k2,i should be of shape [6, 5, 5], producing a result of shape [1, 8, 8]. As we have 12 such weights and 12 biases, the application of mconv yields a result of shape [12, 1, 8, 8]. Note that the second element in the shape can be eradicated by a simple reshape, which does not alter the data representation in memory or its computational efficiency.

Applying the same reasoning to the FC layer, we conclude that we can use mconv again. Without additional reshapes, the shape of the layer S2 is [12, 1, 4, 4]. A fully connected layer is a convolution with a weight array whose shape is identical to the shape of the input array. Therefore, as we intend to compute ten weighted sums of all the elements, W now has shape [10, 12, 1, 4, 4]. This yields mconv returning a result of shape [10, 1, 1, 1, 1]. With these observations it becomes clear that the only parts of Figure 1 left to complete the implementation are the pooling layers.

Average Pooling

The pooling layer S1 can be constructed in a two-step process, similarly to the convolution layers. An average pooling of a single image can be implemented as:

float[.,.] avgpool(float[.,.] in)
{
  return { iv -> average({ ov -> in[iv*2+ov]
                         | ov < [2,2] })
         | iv < shape(in) / 2 };
}

We select sub-arrays of shape [2, 2] and compute their individual averages, resulting in a matrix with half as many rows and columns as the input. Based on this definition, a generic version that applies avgpool to the two innermost axes of an n-dimensional array can be expressed as:

float[*] avgpool(float[*] in)
{
  return { iv -> avgpool(in[iv])
         | iv < drop([-2], shape(in)) };
}

Note that for convenience we overload the name avgpool. With these few functions, we are now ready to define the whole network from Figure 1 as:

float[10,1,1,1,1] forward(float[28,28] I,
                          float[6,5,5] k1, float[6] b1,
                          float[12,6,5,5] k2, float[12] b2,
                          float[10,12,1,4,4] fc, float[10] b)
{
  c1 = sigmoid(mconv(I, k1, b1));
  s1 = avgpool(c1);
  c2 = sigmoid(mconv(s1, k2, b2));
  s2 = avgpool(c2);
  return sigmoid(mconv(s2, fc, b));
}

Explicit shapes in forward are given for documentation purposes; they can be replaced with more generic shapes in the case of more abstract networks.
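To summarise the shapes derived above in one place (our recap, written as a comment in the code style of the paper), the data flows through forward as follows:

/* Shape flow through forward for a single 28 x 28 input:
     I   : [28,28]
     c1  : [6,24,24]       mconv with k1:[6,5,5],       b1:[6],  then sigmoid
     s1  : [6,12,12]       avgpool over the two innermost axes
     c2  : [12,1,8,8]      mconv with k2:[12,6,5,5],    b2:[12], then sigmoid
     s2  : [12,1,4,4]      avgpool over the two innermost axes
     out : [10,1,1,1,1]    mconv with fc:[10,12,1,4,4], b:[10],  then sigmoid
*/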

The implementation so far suffices for using the network in forward mode, i.e. once suitable weights and biases are known, we can classify images. To adjust the weights, we use training inputs where we know the correct answer for every input image. The error in recognition is our cost function, which we minimise by using stochastic gradient descent to adjust the weights.

Backpropagating Convolution

Our loss function has the form ½ Σᵢ (yᵢ − f(xᵢ, w))², so its derivatives with respect to the weights wᵢ have the form (∂f/∂wᵢ)(x, w) · Σ (y − f(x, w)), according to the chain rule. That is, to adjust the weights, we multiply the error with the derivative of the network with respect to the weights. The linear nature of the convolution implies that the derivatives are constants, namely the input of the convolution itself. Consequently, we can approximate the error in the weights as a convolution of the input with the error:

float[*] backweights(float[*] d_out, float[*] in)
{
  return conv(in, d_out);
}

The resulting deltas can then be used to adjust the corresponding weights for the next forward run. To cater for the imprecision, this is done by applying a factor, usually referred to as the rate.

...
weights = weights - rate * backweights(d_out, in);
...

Similarly, we can approximate the error of the bias as a sum of the error, since the derivative of the bias is the constant 1:

float[*] backbias(float[*] d_out)
{
  return sum(d_out);
}

The trickiest bit of the implementation is the propagation of the error back to the inputs of the convolution, which we need to feed into the computation of the next backpropagation layer. Mathematically, the derivatives are simply the weights. The challenge arises from the fact that the outer elements of the result are influenced by fewer weights than the inner elements. This distinction between inner and outer elements can either be expressed by embedding the values of the array of errors into a larger array of zeros, or it requires the summations for the boundary elements to range over fewer products. Here, we opt for the latter approach:

float[*] backin(float[*] d_out, float[*] k, float[*] in)
{
  return { iv -> off = where(iv < shape(d_out),
                             0, iv - shape(d_out) + 1);
                 sum({ ov -> k[ov+off]
                             * d_out[iv-(ov+off)]
                     | ov < min(min(shape(k), iv+1),
                                shape(k)-off) })
         | iv < shape(in) };
}

While all inner elements of the result are computed by k[ov] * d_out[iv-ov] with ov ranging over the entire shape of the weights k, the boundary elements are only computed using those factors where iv-ov lies within the shape of d_out. For the boundary elements on the lower end it suffices to restrict the number of products that are being summed up. This is achieved by the expression min(shape(k), iv+1). For the elements on the higher end, we only include the products with the higher-indexed weights. This is achieved by computing an offset vector off. For each index beyond the shape of the error in d_out, the offset is set accordingly. This is implemented by using the standard function where(m, b, c), which for each iv chooses b[iv] if m[iv] is true and c[iv] otherwise. (The local binding to off is not the official SaC syntax, but we use it for the sake of readability. A semantically equivalent formulation of backin can be found at https://github.com/SacBase/CNN.) Finally, we use this offset vector in order to restrict the number of products for the upper boundaries as well. This is achieved by the outer minimum against shape(k)-off.

Backpropagating Average Pooling

Average pooling is usually back-propagated by evenly spreading out the error across the indices that we have averaged across in the forward mode. In our example, we can express this as:

float[.,.] backavgpool(float[.,.] d_out)
{
  return { iv -> d_out[iv/2] / 4f
         | iv < shape(d_out) * 2 };
}

float[*] backavgpool(float[*] d_out)
{
  return { iv -> backavgpool(d_out[iv])
         | iv < drop([-2], shape(d_out)) };
}

With these main building blocks, the back-propagation can be implemented in a way very similar to that of the forward function shown above. Details can be found at https://github.com/SacBase/CNN.

4 Evaluation

We now present an evaluation of our CNN implementation in SaC, comparing it to semantically identical implementations in TensorFlow and PyTorch (all three versions can be found in the supplementary material). We discuss programming productivity reflecting our implementation experience, then we present our experimental setup for runtime evaluation, and finally we present a performance analysis of the implementations.

4.1 Productivity

Productivity is hard to compare objectively, as the background in tool familiarity influences the experience. The tools that we are comparing are of a different nature: SaC is a general-purpose language, whereas TensorFlow and PyTorch are specifically designed for machine learning purposes, both highly performance-tuned and optimised for algorithms like the CNN. Neither of these tools executes the specification directly. Instead, the specification is analysed and translated into code that executes on parallel architectures.

All three specifications of the CNN in SaC, TensorFlow and PyTorch are very similar. In all systems about 150 lines of code are needed to specify the network and to orchestrate the reading of inputs and data initialisation. In all three versions, the programmer needs to understand the abstractions used; in TensorFlow and PyTorch, the programmer needs to learn the semantics of the available components; in SaC, the programmer needs to understand the building blocks that we described in Section 2. In SaC the backward propagation needs to be specified explicitly, while the framework tools support automatic differentiation.

Further experience is based on the fact that we also had to write the building blocks in SaC; if we assume that the machine learning framework is provided as an external tool, then the next point is not relevant. Otherwise, we found the conciseness of the building blocks very satisfying. The key components, which are described in Section 3, can be implemented in about 150 lines of code. For someone with reasonable familiarity with SaC, this can be achieved within a few hours, depending on the familiarity with the underlying algorithm.

Figure 2. CPU-only results for SaC versus the TensorFlow and PyTorch implementations using up to 64 Opteron cores, training 10 epochs on 10k images and classifying 10k images. (a) Speedups over the fastest sequential runtime TF-Cxx-MKL (higher is better); (b) wall-clock runtimes in seconds (lower is better). [Both panels plot SaC, TF-Py, TF-Py-MKL, TF-Cxx, TF-Cxx-MKL and PT-Py-MKL against the number of threads, 1 to 64.]

Overall, the time we spent on the SaC implementation was considerably smaller than the time we needed to understand sufficient details about the TensorFlow and PyTorch frameworks. Of course this experience is hard to generalise, but we are rather sure that in cases where the hardware architecture and the machine learning algorithm are fixed, figuring out the details of the frameworks and implementing the algorithm from scratch is very likely to take comparable time.

4.2 Setup

Our machine is equipped with 4 AMD Opteron 6376 CPUs (for a total of 64 cores) and an NVIDIA K20 GPU (CUDA driver version 410.79). We use GCC 7.2.0, sac2c 1.3.3, CUDA 10.0, Python 3.6.6, TensorFlow 1.12.0 and PyTorch 1.2.0 for all applications.

We compile both frameworks from sources to make sure that architecture-specific flags like -march=native -mtune=native are passed to the C/C++ compilers so that we get proper vectorisation and cost models. Secondly, TensorFlow and PyTorch can make use of the Intel MKL library [21] to accelerate linear algebra operations on Intel architectures and provide a significant speedup. It is not compiled in by default for TensorFlow, and though we could make our comparison without it, it would not be an honest comparison. Therefore we have verified that MKL is correctly included in both frameworks, which has made a noticeable runtime difference. To make a fair comparison, we use TensorFlow with and without Intel MKL [21] activated (indicated by the -MKL postfix), and we implement the CNN using both the Python and the C++ interface (indicated by the Cxx postfix). The latter is to check whether static compilation has an impact on performance. This gives us five framework-based implementations, plus the one in SaC. We run these on both the CPU and GPU, and measure the wall-clock runtime of the entire application.

With the non-MKL TensorFlow versions we set the number of threads via the session variables; for the MKL version, as the library uses OpenMP, setting the TensorFlow threads at the same time can quickly oversubscribe the system, so per our experiments the best TensorFlow-MKL runtime is achieved when the TensorFlow threads are set to 1 and OMP_NUM_THREADS is set to the desirable value. With SaC we control the number of threads by setting the -mt flag of the binary, which is automatically created by the compiler when using the multi-threaded backend.

The applications are run using the following parameters: 10 epochs, a batch size of 100 images, 10000 training images and labels, and 10000 test images and labels. The backpropagation has a learning rate factor of 0.05, and we do not use any momentum.

4.3 Results and Analysis

Figure 2a shows the speedup compared to the fastest sequential runtime and Figure 2b shows the runtimes in seconds. From Figure 2b we can see that SaC outperforms all the other frameworks by a noticeable factor (3× over the best parallel runtime, 6× over the best sequential runtime), even though it is noticeably slower on a single thread. From Figure 2a we see that none of the frameworks manage to achieve more than a 2× speedup. We observe that the TensorFlow applications using MKL are up to 2× faster. Additionally, using the C++ over the Python interface is faster by a constant factor. The measurements in Figure 2a show that for 10 threads we have speedups of up to a factor of 2 for the MKL-based applications, with the Python- and C++-only applications achieving a speedup factor of about 1.5.

Table 1. Best wallclock runtimes (seconds) on the given hardware for each framework.

  Framework      SaC                  TF-Py       TF-Py-MKL   TF-Cxx      TF-Cxx-MKL  PT-Py-MKL
  Configuration  Opteron, 50 threads  NVIDIA K20  NVIDIA K20  NVIDIA K20  NVIDIA K20  NVIDIA K20
  Runtime        7.8                  17.06       18.26       17.68       14.1        24.32

The SaC application has a speedup factor of close to 3. As the number of threads increases, we see only minor improvements in speedup for the TensorFlow and PyTorch applications, with a best speedup factor of 2.3. The SaC application on the other hand continues to scale, reaching a maximum speedup factor of 6.4 using 50 threads. From Figure 2b we can see this runtime plateauing more clearly. With MKL the runtimes are better than without for the TensorFlow and PyTorch applications, by almost a factor of 2. Additionally, using C++ instead of Python provides slightly better runtimes. We additionally re-ran the CNN using different batch sizes and observed no significant change in the previously observed scaling.

The speedup plateauing that we see for the TensorFlow and PyTorch applications relates to how nodes within the network graph are translated to threads. Some nodes have dependencies on the outputs of others, and so are scheduled differently to nodes that have no dependencies. This limits the degree to which work can be distributed across the threads, affecting the maximum amount of scaling possible. Changes to the design of the network, or using a completely different neural network, can lead to different degrees of scaling. As SaC only translates with-loops to threads, and with-loops are guaranteed to be side-effect free, this leads to better scaling.

With TensorFlow we have two levels of parallelism, as the MKL operations spawn their own threads independently of the TensorFlow scheduler. We tried different combinations of threading configurations looking for the best possible performance. Using the default configuration, where TensorFlow and MKL use all cores, leads to oversubscription and degraded performance. Using combinations of values that match the number of logical cores on the system, such as 16 MKL threads and 4 TensorFlow threads, did not lead to better performance compared to just setting the TensorFlow thread number to 1 and having MKL scale to all 64 logical cores. In any case we were not able to resolve the scaling plateau by this means.

Table 1 shows the best runtimes per application and the hardware configuration this was achieved on. The best runtime for SaC is 7.8 seconds using 50 threads. All other applications have their best runtime on the GPU, with the TensorFlow C++ MKL implementation having the best runtime at 14.1 seconds. The SaC application did not perform well on the GPU compared to the other applications, running 9× slower. A significant reason for this is the scheduling of communication and kernel launches, which are not effectively orchestrated together. For TensorFlow and PyTorch, multiple CUDA streams are used to interleave communication and kernel launches such that they achieve a high degree of latency hiding.

4.4 Source of Performance in SaC

The SaC compiler is rather sophisticated: it uses several hundred optimisations that run in a cycle, therefore explaining what exactly the compiler is doing to make the code run well is challenging. It would be good, though, to identify the key components, in order to potentially apply the demonstrated capabilities in other contexts, such as more mainstream functional languages.

After looking at intermediate states of the code, we identify the following necessary optimisations: folding [34], fusion [15], memory reuse [17], and statically-scheduled multi-threaded execution [14].

Folding

The main idea behind folding is the classical list equality map f ∘ map g = map (f ∘ g), which can eliminate the creation of intermediate arrays. In the set notation this equality looks like:

  a = { iv -> g iv | iv < u };
  b = { iv -> f a[K iv] | iv < u }
    ⇒  b = { iv -> f (g (K⁻¹ iv)) | iv < u }

When K is the identity, we get exactly the above equality. However, very often the Ks have computable inverses, in which case the transformation is applicable to a larger set of examples. For the CNN case, we can definitely merge together the addition of biases and the computation of the sigmoid functions in layers C1, C2 and FC.
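As a concrete sketch of that last remark (ours, not actual compiler output): the bias addition inside mconv and the element-wise sigmoid are written as two separate comprehensions, and folding lets the intermediate array disappear. Relying on conv from Section 3 and the overloaded arithmetic and exp of the SaC standard library, the effect corresponds to:

/* before folding: the [6,24,24] intermediate array c is materialised */
c = { i  -> conv(I, k[i]) + b[i]        | i  < shape(b) };
r = { iv -> 1f / (1f + exp(-c[iv]))     | iv < shape(c) };

/* after folding: the sigmoid is applied directly to each [24,24] slice */
r = { i  -> 1f / (1f + exp(-(conv(I, k[i]) + b[i]))) | i < shape(b) };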
Fusion

Here we apply a variant of the classical loop optimisation to array comprehensions. The optimisation combines the bodies of two consecutive loops with an identical iteration space. In the map-based analogy this would be:

  (map f a, map g a) = unzip (map (f △ g) a)    where (f △ g) x = (f x, g x)

Even though, intuitively, this transformation does not make much sense on lists, it helps to localise the applications of f and g and to share common subexpressions. In SaC this optimisation is quite a bit more sophisticated, as explicit indexing makes it possible to define fusions on individual partitions of the index spaces.

Memory Reuse

This memory analysis makes it possible to do array operations in place. For example, when we increment all the elements by a constant:

  b = { iv -> a[iv] + 1 | iv < shape(a) }

we can avoid allocating new memory and reuse a, if a is not used further in the program. The analysis becomes challenging when we consider reusing existing, but no longer referenced, arrays within the current scope. The analysis needs to handle conditionals within the set expression or the access patterns of candidate arrays.

Statically-Scheduled Parallelism

Finally, our array comprehensions are data-parallel by design. Therefore, it is relatively straight-forward to generate code that partitions the index space into chunks and runs each chunk in parallel. Unfortunately, there are a lot of small details that make it very hard to implement this efficiently [13]. First of all, one needs to choose the operations we want to run in parallel, and their granularity. Secondly, choosing a schedule even for a single array operation is challenging. Finally, thread synchronisation and memory management make a significant difference. By default we use static scheduling, a custom memory allocator, and for each operation we decide to run in parallel we try to choose the chunking that maximises the work each active thread is doing.

5 Related Work

5.1 Array Languages

Directly or indirectly, APL [22] has influenced all existing array languages. At its core, APL provides a set of operators with a number of rules on how they can be composed. All operators are either unary or binary, first- or second-order functions, expressed with a single symbol, which gives a lot of expressiveness. For example, all the building blocks of our CNN can be expressed in 10 lines of code [45]. APL is an untyped language, so all errors will occur at runtime only. It comes only with an interpreter, and all the operators are implemented as library functions, limiting cross-operator optimisations.

Other array languages can be roughly divided into three groups: direct descendants of APL, grandchildren and further relatives. Direct descendants are languages like J [38], K [47] or Nial [25]. They treat every object as an array (maybe except functions) and provide a large subset of APL operators. Typically, these languages come only with interpreters, which limits the optimisation space and performance. The grandchildren like SaC, Futhark [19], Remora [36] and Qube [41] are still array-oriented languages, but instead of providing built-in APL operators natively, they offer a few low-level constructs from which the operators can be implemented as library functions. All the mentioned languages are functional and come with compilers that are focused on generating high-performance code. All these languages have strong static type systems. Futhark and SaC are capable of generating GPU code automatically. Finally, further relatives like Matlab [40], Julia [4], and Python [43] with NumPy [27] have some notion of multi-dimensional arrays and a subset of APL operators, both of which are embedded in the context of a general-purpose language. These languages come with interpreters only and rarely provide exceptional levels of performance, yet they are very useful for prototyping.

5.2 Machine Learning DSLs

Machine learning DSLs provide a way to express neural networks using high-level specifications. Typically, the high-level specification is either handled by a machine learning framework, or transformed into lower-level code for performance reasons.

TypedFlow (https://github.com/GU-CLASP/TypedFlow) is embedded in Haskell and provides a number of dependently-typed primitives that can be used to define a network. This specification is later translated into TensorFlow calls. The approach provides type safety and powerful syntax but, performance-wise, it still relies on the underlying framework. The tensorflow-ocaml (https://github.com/LaurentMazare/tensorflow-ocaml) and ocaml-torch (https://github.com/LaurentMazare/ocaml-torch) libraries are similar wrappers for TensorFlow and PyTorch in OCaml.

DEFIne [9] mainly focuses on liberating data scientists from the necessity to deal with general-purpose languages, such as Python, when describing the networks. The proposed syntax focuses exclusively on the machine learning primitives, and the accompanying tools take care of performance and portability, still using state of the art machine learning frameworks as a backend.

DeepDSL [51], OptiML [39] and Latte [42] focus on optimisations that are specific to machine learning, such as kernel fusion and parallelisation. They generate code to C++ and CUDA, using highly-optimised libraries.

XLA [12] is a domain-specific compiler that focuses on accelerating linear algebra operations in machine-learning applications. The compiler is a part of the TensorFlow framework, and it works by analysing the dataflow graph of the network and turning it into fast machine code by fusing pipelined nodes, inferring tensor shapes and performing memory optimisations based on these data sizes.

Tensor Comprehensions [44] has a very similar idea: it is a DSL that is integrated into existing machine learning frameworks and it provides a common ground to implement machine learning operators for further cross-optimisation. A distinctive feature of this approach is the use of the polyhedral model to perform the actual fusion, blocking, non-trivial scheduling and parallelisation. In a way the approach is very similar to Halide [32], except that the domain is different and the number of optimisations is larger.

Diesel [10] is a standalone DSL from NVIDIA that also relies on the polyhedral framework to perform cross-operator optimisations and generate code for CUDA.

5.3 High-Performance Libraries

Most of the machine learning frameworks rely on highly-optimised libraries that implement tensor or linear algebra operations. Eigen [18] and ATen (available as a part of PyTorch at https://github.com/pytorch/pytorch/tree/master/aten) provide basic tensor operations and are used by TensorFlow and PyTorch, respectively. MKL [21] and OpenBLAS [28] implement high-performance Basic Linear Algebra Subroutines (BLAS) for CPUs. ATLAS [46], BTO [2] and SPIRAL [31] use automatic tuning to obtain the most efficient implementation of commonly used numerical algorithms on a chosen architecture. BLAS operations on CUDA are provided by CUBLAS [26]. cuDNN [6] implements basic deep learning operations on GPUs. NNPACK (https://github.com/Maratyszcza/NNPACK/) and PCL-DNN [8] implement deep learning primitives on CPUs.
6 Conclusions

This paper makes an argument for an alternative design of machine learning frameworks. Instead of using a large number of interconnected specialised libraries, we consider using one compilable array-oriented language to host both the framework and the specification of the actual networks. To justify the viability of the proposed approach, we implement a minimalistic framework in native SaC and use it to define a state of the art CNN. We compare its performance and expressiveness against TensorFlow and PyTorch.

Our solution is concise: about 150 lines of code to define the building blocks of the network, and another 150 lines to define the network itself, which is about the same amount as for the TensorFlow and PyTorch versions. The basic building blocks in SaC are rank-polymorphic functions that can be easily reused in other contexts. Rank polymorphism is key to expressiveness here. By the nature of layered neural networks, the notion of proximity arises naturally in multiple dimensions: neighbourhood of pixels, neighbouring neurons, connectivity between layers, convolutional weights, etc. In our simple example with 3 layers of neurons (C1, C2 and FC from Figure 1), we ended up with a single definition for multi-dimensional convolution which could serve all three layers with their varying ranks (3, 4, and 5). The rank-polymorphic nature allows for arbitrarily ranked tensors and thus can be used for arbitrarily nested networks. The same holds for the other operations such as the backward propagation.

Our performance experiments on a 64-core machine with a GPU show that the SaC implementation outperforms TensorFlow and PyTorch by a factor of 2 and 3 respectively, even though the chosen architecture should be ideally suitable for both frameworks. In particular the performance results came as a big surprise to us, given the stark difference in implementation efforts. As discussed in Section 4, it seems that the interplay of different tools in TensorFlow and PyTorch gets in the way of achieving excellent parallel performance, whereas the lean design in the array language setup enables better optimisations and ultimately better wall-clock runtimes.

When shifting from the domain-specific frameworks to array-oriented languages we lose the domain knowledge for optimisation on the one hand, while we gain a unified high-level representation on the other. At least for the example of CNNs, it seems that the former does not provide any advantages for the frameworks, whereas the latter clearly benefits the array language approach. Also, a unifying representation makes it very easy to extend a framework. There is no need to understand the complexities of the underlying design: a new building block in the form of a user-defined function will be immediately picked up by the compiler and included in the global program optimisations.

As for productivity and expressiveness, there is no doubt that right now Python is more advanced than any of the existing array languages, at least in the number of libraries provided by the community. At the same time, there seems to be no conceptual problem in bringing the same experience and functionality to the array language of choice. In the case of SaC two immediate problems will have to be solved: interactive behaviour and automatic differentiation.

Right now SaC is a compiled language and by default we do not get the same interactivity as with Python in the context of existing machine learning frameworks. However, there exists a jupyter-based frontend that mimics Python-like interactivity. Several other array languages are more interactive than SaC, and creating an interpreter for SaC is straight-forward. One can then envision using such an interpreter for quick prototyping and a compiled version of the same code for deployment.

Right now SaC does not support automatic differentiation, which is a very useful feature that can be found in most of the machine learning frameworks. Adding automatic differentiation to a compiler is a well-understood problem, as for example demonstrated by Stalin∇ [30], therefore bringing it to the context of an array language is a matter of implementation effort.

By no means do we suggest that existing frameworks can be readily replaced by array languages. However, a clean design that eliminates a number of abstraction layers, together with the fact that a prototypical research compiler can significantly outperform two industrial frameworks, suggests that the proposed approach is interesting enough to be further investigated.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, Savannah, GA, 265–283.
[2] Geoffrey Belter, E. R. Jessup, Ian Karlin, and Jeremy G. Siek. 2009. Automating the Generation of Composed Linear Algebra Kernels. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09). ACM, Article 59.
[3] Robert Bernecky. 1997. An Overview of the APEX Compiler. Technical Report 305/97. Department of Computer Science, University of Toronto.
[4] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. 2017. Julia: A Fresh Approach to Numerical Computing. SIAM Review 59, 1 (2017), 65–98.
[5] David Cann and John Feo. 1990. SISAL Versus FORTRAN: A Comparison Using the Livermore Loops. In Proceedings of the 1990 ACM/IEEE International Conference on Supercomputing (Supercomputing '90). IEEE Computer Society Press, 626–636.
[6] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitives for Deep Learning. CoRR abs/1410.0759.
[7] Ronan Collobert, Koray Kavukcuoglu, Clément Farabet, et al. 2011. Torch7: A Matlab-like Environment for Machine Learning. In BigLearn, NIPS Workshop.
[8] Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidyanathan, Srinivas Sridharan, Dhiraj D. Kalamkar, Bharat Kaul, and Pradeep Dubey. 2016. Distributed Deep Learning Using Synchronous Stochastic Gradient Descent. CoRR abs/1602.06709.
[9] Nina Dethlefs and Ken Hawick. 2017. DEFIne: A Fluent Interface DSL for Deep Learning Applications. In Proceedings of the 2nd International Workshop on Real World Domain Specific Languages (RWDSL '17). ACM, Article 3.
[10] Venmugil Elango, Norm Rubin, Mahesh Ravishankar, Hariharan Sandanagobalane, and Vinod Grover. 2018. Diesel: DSL for Linear Algebra and Neural Net Computations on GPUs. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL 2018). ACM, 42–51.
[11] Carl F. Gauss. 1809. Theoria motus corporum coelestium in sectionibus conicis solem ambientium. Sumtibus F. Perthes et I. H. Besser.
[12] Google. 2017. XLA Overview. https://www.tensorflow.org/xla/overview. [Accessed 2019/02].
[13] Stuart Gordon and Sven-Bodo Scholz. 2015. Dynamic Adaptation of Functional Runtime Systems Through External Control. In Proceedings of the 27th Symposium on the Implementation and Application of Functional Programming Languages (IFL '15). ACM, Article 10.
[14] Clemens Grelck. 2005. Shared Memory Multiprocessor Support for Functional Array Processing in SAC. Journal of Functional Programming 15, 3 (2005), 353–401.
[15] Clemens Grelck, Karsten Hinckfuß, and Sven-Bodo Scholz. 2006. With-Loop Fusion for Data Locality and Parallelism. In Implementation and Application of Functional Languages. Springer, 178–195.
[16] Clemens Grelck and Sven-Bodo Scholz. 2003. Axis Control in SAC. In Implementation of Functional Languages, 14th International Workshop (IFL '02), LNCS 2670. Springer, 182–198.
[17] Clemens Grelck and Kai Trojahner. 2004. Implicit Memory Management for SAC. In Implementation and Application of Functional Languages, 16th International Workshop (IFL '04). University of Kiel, 335–348. Technical Report 0408.
[18] Gaël Guennebaud, Benoît Jacob, et al. 2010. Eigen v3. http://eigen.tuxfamily.org.
[19] Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, and Cosmin E. Oancea. 2017. Futhark: Purely Functional GPU-Programming with Nested Parallelism and In-Place Array Updates. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). ACM, 556–571.
[20] Sakshi Indolia, Anil Kumar Goswami, S. P. Mishra, and Pooja Asopa. 2018. Conceptual Understanding of Convolutional Neural Network: A Deep Learning Approach. Procedia Computer Science 132 (2018), 679–688. International Conference on Computational Intelligence and Data Science.
[21] Intel. 2009. Intel Math Kernel Library. Reference Manual. Santa Clara, USA. ISBN 630813-054US.
[22] Kenneth E. Iverson. 1962. A Programming Language. John Wiley & Sons, Inc., New York, NY, USA.
[23] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (MM '14). ACM, 675–678.
[24] A. M. Legendre. 1805. Nouvelles méthodes pour la détermination des orbites des comètes. F. Didot.
[25] C. D. McCrosky, J. J. Glasgow, and M. A. Jenkins. 1984. Nial: A Candidate Language for Fifth Generation Computer Systems. In Proceedings of the 1984 Annual Conference of the ACM on The Fifth Generation Challenge (ACM '84). ACM, 157–166.
[26] Nvidia. 2016. CUBLAS Library User Guide (v8.0 ed.). Technical Report. http://docs.nvidia.com/cublas/index.html.
[27] Travis Oliphant. 2006. NumPy: A Guide to NumPy. http://www.numpy.org. [Accessed 2019/02].
[28] OpenBLAS. 2012. OpenBLAS Home Page. http://www.openblas.net. [Accessed 2019/02].
[29] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic Differentiation in PyTorch. In NIPS Autodiff Workshop (NIPS-W).
[30] Barak A. Pearlmutter and Jeffrey Mark Siskind. 2008. Reverse-Mode AD in a Functional Framework: Lambda the Ultimate Backpropagator. ACM Transactions on Programming Languages and Systems 30, 2, Article 7 (2008).
[31] Markus Püschel, José M. F. Moura, Bryan Singer, Jianxin Xiong, Jeremy Johnson, David Padua, Manuela Veloso, and Robert W. Johnson. 2004. Spiral: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms. The International Journal of High Performance Computing Applications 18, 1 (2004), 21–45.
[32] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). ACM, 519–530.
[33] Jürgen Schmidhuber. 2015. Deep Learning in Neural Networks: An Overview. Neural Networks 61 (2015), 85–117.
[34] Sven-Bodo Scholz. 1998. With-Loop-Folding in SAC: Condensing Consecutive Array Operations. In Implementation of Functional Languages, 9th International Workshop (IFL '97), LNCS 1467. Springer, 72–92.
[35] Sven-Bodo Scholz. 2003. Single Assignment C: Efficient Support for High-Level Array Operations in a Functional Setting. Journal of Functional Programming 13, 6 (2003), 1005–1059.
[36] Justin Slepak, Olin Shivers, and Panagiotis Manolios. 2014. An Array-Oriented Language with Static Rank Polymorphism. In Programming Languages and Systems. Springer, 27–46.
[37] Michel Steuwer, Toomas Remmelg, and Christophe Dubach. 2017. Lift: A Functional Data-Parallel IR for High-Performance GPU Code Generation. In CGO 2017. ACM, 74–85.
[38] Roger Stokes. 15 June 2015. Learning J. An Introduction to the J Programming Language. http://www.jsoftware.com/help/learning/contents.htm. [Accessed 2019/02].
[39] Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown, Hassan Chafi, Michael Wu, Anand R. Atreya, Kunle Olukotun, Tiark Rompf, and Martin Odersky. 2011. OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011).
[40] The Mathworks, Inc., Natick, MA. 1992. MATLAB Reference Guide.
[41] Kai Trojahner and Clemens Grelck. 2009. Dependently Typed Array Programs Don't Go Wrong. The Journal of Logic and Algebraic Programming 78, 7 (2009), 643–664.
[42] Leonard Truong, Rajkishore Barik, Ehsan Totoni, Hai Liu, Chick Markley, Armando Fox, and Tatiana Shpeisman. 2016. Latte: A Language, Compiler, and Runtime for Elegant and Efficient Deep Neural Networks. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '16). ACM, 209–223.
[43] G. van Rossum. 1995. Python Tutorial. Technical Report CS-R9526. Centrum voor Wiskunde en Informatica (CWI), Amsterdam.
[44] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. CoRR abs/1802.04730.
[45] Artjoms Šinkarovs, Robert Bernecky, and Sven-Bodo Scholz. 2019. Convolutional Neural Networks in APL. In Proceedings of the 6th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY 2019). ACM, 69–79.
[46] R. Clint Whaley and Jack J. Dongarra. 1998. Automatically Tuned Linear Algebra Software. In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (SC '98). IEEE Computer Society, 1–27.
[47] Arthur Whitney. 2001. K. http://archive.vector.org.uk/art10010830.
[48] Dong Yu, Adam Eversole, Mike Seltzer, Kaisheng Yao, Zhiheng Huang, Brian Guenter, Oleksii Kuchaiev, Yu Zhang, Frank Seide, Huaming Wang, et al. 2014. An Introduction to Computational Networks and the Computational Network Toolkit. Technical Report MSR-TR-2014-112.
[49] Tong Zhang. 2004. Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04). ACM.
[50] Zhifei Zhang. 2016. Derivation of Backpropagation in Convolutional Neural Network (CNN). Technical Report. University of Tennessee, Knoxville, TN.
[51] Tian Zhao and Xiaobing Huang. 2018. Design and Implementation of DeepDSL: A DSL for Deep Learning. Computer Languages, Systems & Structures 54 (2018), 39–70.
