The Functional Design of Scientific Computing Library

Liang Wang

University of Cambridge, UK

ABSTRACT in Python (such as Caffe [10] and TensorFlow [5]), they Owl is an emerging library developed in the OCaml lan- often provide Python bindings to take advantage of the guage for scientific and engineering computing. It focuses existing numerical infrastructure in Numpy and Scipy. on providing a comprehensive set of high-level numerical On the other hand, the supporting of basic scien- functions so that developers can quickly build up any data tific computing in OCaml is rather fragmented. There analytical applications. After over one-year intensive devel- have been some initial efforts (e.g., Lacaml [2], Oml [4], opment and continuous optimisation, Owl has evolved into a Pareto, and etc.), but their APIs are either too low-level powerful software system with competitive performance to to offer satisfying productivity, or the designs overly mainstream numerical libraries. Meanwhile, Owl’s overall focus on a specific problem domain. Moreover, incon- architecture remains simple and elegant, its small code base sistent data representation and careless use of abstract can be easily managed by a small group of developers. types make it difficult to pass data across different li- In this paper, we first present Owl’s design, core compo- braries. Consequently, developers often have to write nents, and its key functionality. We show that Owl benefits a significant amount of boilerplate code just to finish greatly from OCaml’s module system which not only allows rather trivial numerical tasks. As we can see, there is us to write concise generic code with superior performance, a severe lack of a general purpose numerical library in but also leads to a very unique design to enable parallel and OCaml ecosystem. We believe OCaml per se is a good distributed computing. OCaml’s static type checking signif- candidate for developing such a general purpose numer- icantly reduces potential bugs and accelerates the develop- ical library for two important reasons: 1) we can write ment cycle. We also share the knowledge and lessons learnt functional code as concise as that in Python with type- from building up a full-fledge system for scientific comput- safety; 2) OCaml code often has much superior perfor- ing with the functional programming community. mance comparing to dynamic languages such as Python and Julia. However, designing and developing a full-fledged nu- 1. INTRODUCTION merical library is a non-trivial task, despite that OCaml Thanks to the recent advances in machine learning has been widely used in system programming such as and deep neural networks, there is a huge demand on MirageOS [15]. The key difference between the two various numerical tools and libraries in order to facili- is obvious and interesting: system libraries provide a tate both academic researchers and industrial develop- lean set of APIs to abstract complex and heterogeneous ers to fast prototype and test their new ideas, then de- physical hardware, whilst numerical library offer a fat arXiv:1707.09616v3 [cs.MS] 28 Aug 2018 velop and deploy analytical applications at large scale. set of functions over a small set of abstract number Take deep neural network as an example, Google invests types. heavily in TensorFlow [5] while Facebook promotes their When Owl project started in 2016, we were imme- PyTorch [16]. Beyond these libraries focusing on one diately confronted by a series of fundamental questions specific numerical task, the interest on general purpose like: ”what should be the basic data types”, ”what should tools like Python and Julia [1] also grows fast. be the core data structures”, ”what modules should be Python has been one popular choice among devel- designed”, and etc. In the following development and opers for fast prototyping analytical applications, one performance optimisation, we also tackled many research important reason is because Scipy [12] and Numpy [3] and engineering challenges on a wide range of different two libraries, tightly integrated with other advanced topics such as software engineering, language design, functionality such as plotting, offer a powerful environ- system and network programming, and etc. ment which lets developers write very concise code to In this paper, we will introduce Owl1, a new general finish complicated numerical tasks. As a result, even for the frameworks which were not originally developed 1https://github.com/ryanrhymes/owl

1 purpose numerical library for scientific computing de- • Base is the basis of all other libraries in Owl. veloped in OCaml.2 We show that Owl benefits greatly Base defines core data structures, exceptions, and from OCaml’s module system which not only allows us part of numerical functions. Because it contains to write concise generic code with superior performance, pure OCaml code so the applications built atop but also leads to a very unique design to enable par- of Base can be safely compiled into native code, allel and distributed computing. OCaml’s static type bytecode, Javascript, even into unikernels. Fortu- checking significantly reduces potential bugs and accel- nately, majority of Owl’s advanced functions are erate the development cycle. We would like to share implemented in pure OCaml. the knowledge and lessons learnt from building up a full-fledge system for scientific computing with the func- • Owl is the backbone of the numerical subsystem. tional programming community. It depends on Base but replaces some pure OCaml functions with implementations (e.g. vectorised 2. ARCHITECTURE math functions inNdarray module). Mixing C code into the library limits the choice of backends (e.g. Owl is a complex library consisting of numerous func- browsers and MirageOS) but gives us significant tions (over 6500 by the end of 2017), we have strived for performance improvement when running applica- a modular design to make sure that the system is flex- tions on CPU. ible enough to be extended in future. In the following, we will present its architecture briefly. • Zoo [20] is designed for packaging and sharing 2.1 Hierarchy code snippets among users. This module targets small scripts and light numerical functions which may not be suitable for publishing on the heavier Owl Actor OPAM system. The code is distributed via gists on Composable Services Github, and Zoo is able to resolve the dependency GPGPU Dataset & NLP Neural Network Model Zoo and automatically pull in and cache the code.

Map-Reduce • Top is the Toplevel system of Owl. It automati- Classic Analytics Engine cally loads both Owl and Zoo, and installs multiple Algorithmic Linear Algebra Plot Differentiation Parameter pretty printers for various data types. Server Engine Maths & Stats Regression Optimisation The subsystem on the right is called Actor Sub- Peer-to-Peer Engine system [19] which extends Owl’s capability to parallel Core and distributed computing. The addition of Actor sub- Ndarray & CBLAS, LAPACKE Parallel & Synchronous system makes Owl fundamentally different from main- Matrix Interface Distributed Engine Parallel Machine stream numerical libraries such as Scipy and Julia. The core idea is to transform a user application from se- Figure 1: Two subsystems: Owl on the left handles quential execution mode into parallel mode (using var- numerical calculations while Actor on the right takes ious computation engines) with minimal efforts. The care of parallel and distributed computing. method we used is to compose two subsystems together with functors to generate the parallel version of the Figure 1 gives a bird view of Owl’s system architec- module defined in the numerical subsystem. We will ture, i.e. the two subsystems and their modules. The elaborate this idea and its design in detail in Section subsystem on the left part is Owl’s Numerical Sub- 5.1. system. The modules contained in this subsystem fall into three categories: (1) core modules contains basic 3. CORE DATA STRUCTURE data structures and foreign function interfaces to other N-dimensional array and matrix are the building blocks libraries (e.g., CBLAS and LAPACKE); (2) classic ana- of Owl library, their functionality are implemented in lytics contains basic mathematical and statistical func- Ndarray and Matrix modules respectively. Matrix is tions, linear algebra, regression, optimisation, plotting, a special case of n-dimensional array, and in fact many and etc.; (3) composable service includes more advanced functions in Matrix module call the corresponding func- numerical techniques such as deep neural network, nat- tions in Ndarray directly. ural language processing, data processing and service For n-dimensional array and matrix, Owl supports deployment tools. both dense and sparse data structures. The dense data The numerical subsystem is further organised in a structure is built atop of OCaml’s native Bigarray mod- stack of smaller libraries, as follows. ule hence it can be easily interfaced with other libraries 2Owl can be installed with OPAM ”opam install owl”. like BLAS and LAPACK. Owl also supports both single

2 and double precisions for both real and complex num- Algorithm 1 Get and Set a slice in an ndarray ber. Therefore, Owl essentially has covered all the nec- 1: let x = sequential [|10; 10; 10|];; essary number types in most common scientific com- 2: let a = x.${ [[0;4]; [6;-1]; [-1;0]] };; putations. The functions in Ndarray modules can be 3: x.${ [[0;4]; [6;-1]; [-1;0]] } <- zeros (shape a);; divided into following three groups. • The first group contains the vectorised mathemat- to the ndarrays with dimensionality greater than four. ical functions such as sin, cos, relu, and etc. Owl exploits the extended indexing operators recently • The second group contains the high-level function- introduced in OCaml 4.06 to provide a unified indexing ality to manipulate arrays and matrices, e.g., in- and slicing interface. dex, slice, tile, repeat, pad, and etc. • .%{ } get • The third group contains the linear algebra func- tions specifically for matrices. Almost all the lin- • .%{ }<- set ear algebra functions in Owl call directly the cor- responding functions in CBLAS and LAPACKE. • .${ } get_slice Aforementioned three groups of functions together • .${ }<- set_slice provide a strong support for developing high-level nu- • .{ } get_fancy merical functions. Especially the first two groups turn out to be extremely useful for writing machine learning • .{ }<- set_fancy and deep neural network applications. Function poly- morphism is achieved using GADT (Generalized alge- For example, the code snippet 1 demonstrates how to braic data type), therefore most functions in Generic use these operator to manipulate slices in an ndarray. module accept the input of four basic number types. The grammar for slicing unfortunately still seems heav- ier than that of the Bigarray’s native indexing. How- 3.1 Primitives ever, with the new proposal of the extended indexing Ndarray module has three primitives, i.e. map, fold, introduced in OCaml 4.083, the aforementioned slicing and scan. The latter two accept the axis parameter operator becomes much lighter as x.${[0;4]; [6;-1]; to specify the dimentionality along which the operation [-1;0]}, similarly for indexing grammar. shall be performed. Practically all the vectorised math functions can be implemented using these three primi- 3.3 Broadcasting operation tives, as follows. In numerical applications, we often need to operate ndarrays of different shapes together, e.g. adding a • map: ceil, sin, cos, relu ... vector to a matrix. Conceptually, this is achieved by • fold: sum, min, mean, std ... tiling the vector into a full matrix so that two operands have the same shape. In practice, tiling consumes more • scan: cumsum, cumprod, cummin ... memory and can cause non-negligible burdens on GC. Broadcasting represents the trade-off between memory Besides vectorised math, the three primitives also footprint and execution speed. accept user-defined functions so they are sufficient to Broadcasting does not require tiling and is implic- implement many other more complicated higher-order itly supported by majority of the binary math opera- functions. tors in Owl. If the two operands have different number 3.2 Indexing and slicing of dimensions, the one of less dimensions will first be expanded using efficient function. Then the Indexing and slicing are arguably the most important reshape broadcasting operation checks whether the each dimen- and fundamental functions in all numerical libraries. sion of two ndarrays complies with the following rules: For basic slicing, each dimension in the slice definition 1) both are equal; 2) either is one. must be defined in the format of [start:stop:step]. If two operands are of the same shape, the more effi- Owl provides two functions get_slice and set_slice cient functions (with less embedded loops) will be to retrieve and assign slice values respectively. If the for triggered to guarantee the best performance. slice definition cannot be expressed by the basic for- mat as [start:stop:step], Owl provides another two 3.4 Operator design functions get_fancy and set_fancy that accept more complicated definition. OCaml does not support operator overloading at the Because Owl’s Ndarray is based on Genarray, the moment which poses a big challenge when designing the indexing grammar like x.{i, j, k ...} only applies 3https://github.com/ocaml/ocaml/pull/1726

3 operators in Owl. All the operators are centrally defined tion phase whereas the dynamic graph is dynamically in a functor called Owl_operator, which allows us to constructed during the runtime. plug in different number types and avoids maintaining Both solutions have pros and cons. For static graph, redundant definitions in the same code base. because the structure is already known beforehand, it Basic operators does not allow interoperation on dif- is easier to perform various (graph) structural optimi- ferent number types. On one hand, it is desirable to sation and pre-allocate the memory during compilation use a set of unified and consistent operators so that we which leads to better runtime performance. However, can write concise code without inserting too much code it is less intuitive to introduce control flow such as loop merely for casting data types. On the other hand, we and if-else in static graphs, special nodes structure is might not be able to take fully advantage of the static often required. TensorFlow and Theano [6] are typical type checking since different number types are wrapped examples of those using static graph. into the unified type using different type constructors. For dynamic graph, the structure is constructed on We have been leaning more towards readability and the fly during the runtime. This certainly introduces convenience when designing Owl’s operators. However, runtime overhead and makes it difficult to optimise the this does not necessarily mean we have to sacrifice the graph structure. However, dynamic graph can be im- static type checking and performance in the library. plicitly supported by operator overloading and intro- The operators which allow interoperation on different ducing control flow is trivial since native language con- number types are implemented in a specific module called struct such as while, for, and if can be directly used in Ext. Each operator in Ext corresponds to several func- programming without any modification. PyTorch [16] tions (as well as same name operators) defined in the is the typical example building atop of dynamic graph. modules of specific types. It means we can always choose to call the functions of explicit type directly instead of 4.2 Lazy Evaluation and Dataflow using unified operators for performance and type check- Different from Haskell, OCaml eagerly evaluates the ing concerns. expressions. However, laziness can be very valuable Operators are not meant to replace those functions, in some numerical applications which involve complex which leaves us to decide how and when to use them. computation graphs and large data chunks such as ma- By the time of writing, Owl has implemented over 20 trices. Laziness can be exploited to potentially reduce common operators such as +, -, *, /, >, <, >., %, and unnecessary memory allocations, optimise computation etc. graph, incrementally update results, and etc. 3.5 Performance critical code Owl is able to simulate the lazy evaluation using its Lazy functor. Lazy functor is optimised for memory In Base library, all aforementioned basic operations allocation in numerical computing. To reuse the allo- such as slicing, broadcasting, and vectorised math oper- cated memory, Owl includes many inpure functions for ators have pure OCaml implementations. These OCaml in-place modification. E.g., Arr.sin is a pure function functions are replaced with sequential C implementa- and always returns a new ndarray whilst Arr.sin_ per- tion in the core library. Since version 0.3.7, Owl has forms in-place modification and overwrites the original included the initial support for OpenMP with a con- ndarray. However, these inpure functions should not be figurable compiling switch. If the OpenMP switch is called directly by a programmer rather than being used turned on in building, the sequential C code is further internally by Owl. replaced by the parallel OpenMP implementations then Lazy functor automatically tracks the reference num- compiled into the library. ber of each node in the computation graph and tries to reuse the allocated memory as much as possible, same 4. COMPUTATION GRAPH as the technique called linear types used by compilers. Computation graph plays a critical role in our sys- By so doing, Lazy module often brings us a huge ben- tem, e.g. algorithmic differentiation, lazy evaluation, efit in both performance and memory usage. We can and GPU computing modules all implicitly or explicitly safely construct very complicated computation graphs use computation graph to perform calculations. We will without worrying about exhausting the memory. Lazy present the design of these modules over this topic. functor offers the feature of self-adjusting computing, the graph can be incrementally evaluated if only part 4.1 Dynamic vs static graph of it needs to be updated. Computation graph has a close connection to dataflow programming [11]. There are two ways of implement- 4.3 Algorithmic Differentiation ing computation graphs, i.e. static graph and dynamic Atop of the core components, we have developed sev- graph. The fundamental difference is that the structure eral modules to extend Owl’s numerical capability. E.g., of static graph is already determined during compila- Maths module includes many basic and advanced math-

4 Algorithm 2 Build a Feedforward neural network

1: let network = input [| 784 |] Noop #4 2: |> linear 300 ∼act typ:Activation.Tanh F(5) 3: |> linear 100 ∼act typ:Activation.Softmax 4: in train network x y Const Sqrt_D Add_D_D #9 #10 #6 F(1) F(2.23607) F(10) ematical functions, whist Stats module provides vari- Mul_C_D Const Sin_D #8 #11 #5 ous statistical functions such as random number gen- F(2.23607) F(7) F(-0.544021) erators, hypothesis tests, and so on. The most im- portant extended module is Algodiff, which imple- Noop Div_D_C Mul_D_D #13 #7 #3 ments the algorithmic differentiation functionality [9]. Arr(3,4) F(0.319438) F(-2.72011) OwlˆaA˘Zs´ Algodiff module is based on the core nested automatic differentiation algorithm and differentiation Relu_D Add_D_D #12 #2 API of DiffSharp [7], which provides support for both Arr(3,4) F(-2.40067) forward and reverse differentiation and arbitrary higher- order derivatives. Mul_D_D #1 The Algodiff module essentially bridges the gap be- Arr(3,4) tween low-level numerical functions and high-level ad- vanced analytical tasks such as optimisation, regression, Sum_D #0 and machine learning. Owl’s algorithmic differentiation F(-17.4576) module implements both forward and backward mode (a.k.a reverse mode) and supports arbitrarily higher- order derivatives. Because Algodiff.Generic is imple- Figure 2: The computation graph generated by differ- mented as a functor that takes the core data structure entiating a simple math function in reverse mode. Ndarry module as inputs, it supports both single and double precisions. We will also introduce the support on complex numbers in future. module and Neural Network module because both are With such capability, developing advanced data ana- essentially mathematical optimisation problems. lytical applications becomes trivial. E.g., we can write The flexibility in optimisation engine leads to an ex- up the core back-propagation algorithm in neural net- tremely compact design and small code base. For a full- work from scratch in just 13 lines of OCaml code. In fact fledge deep neural network module, we only use about we have already included a specific module in Owl for 2500 LOC and its inference performance on CPU is developing applications of deep neural networks. The superior to specialised frameworks such as TenserFlow code snippet (2) implements a simple feedforward neu- and Caffee. The code snippet 3 is taken from Regres- ral network which can be used to recognise hand-writing sion module and implements Lasso regression. As we digits. Moreover, we can also use Owl to build more can see, the implementation is merely the configuration complicated neural networks with graph topology such of the optimisation by specifying loss function, gradient as Google’s Inception network. function, regularisation term, and etc, which similarly Owl is able to visualise the computation graph used applies to other regression models such as ols, ridge, in the algorithmic differentiation. The computation svm, and etc. graph contains the detailed information of operators and operands, including the type information and shape Algorithm 3 Implementation of lasso regression of the data. The Fig. 2 demonstrates the computa- 1: let lasso ?(i=false) ?(alpha=0.001) x y = tion graphs from a simple math function let f x y = 2: let params = Params.config Maths.((x * sin (x + x) + ( F 1. * sqrt x) / F 3: ∼batch:Full ∼loss:Quadratic ∼gradient:GD 7.) * (relu y) |> sum), whist Fig 3 shows the one 4: ∼regularisation:(L1norm alpha) from a VGG-like neural network. 5: ∼learning rate:(Adagrad 1.) 6: ∼stopping:(Const 1e-16) 1000. 4.4 Optimisation engine 7: in Algodiff module is able to provide the derivative, 8: linear reg i params x y Jacobian, and Hessian of a large range of functions, we exploits this power to further build the optimisation en- gine. The optimisation engine is light and highly config- urable, and also serves as the foundation of Regression 5. PARALLEL COMPUTING

5 Parallelism can take place at various levels, e.g. on multiple cores of the same CPU, or on multiple CPUs in a network, or even on a cluster of GPUs. OCaml official release only supports single threading model at

Const Noop #38 #39 Arr(5,32,32,3) Arr(1,1,1,3) the moment, and the work on Multicore OCaml [8] is

Mul_C_D Noop #37 #40 Arr(5,32,32,3) Arr(1,1,1,3) still ongoing in the Computer Lab in Cambridge. In the

Add_D_D Noop #36 #41 Arr(5,32,32,3) Arr(3,3,3,32) following, we will present how parallelism is achieved in

Conv2D_D_D Noop #35 #42 Arr(5,32,32,32) Arr(32) Owl to speed up numerical computations.

Add_D_D #34 Arr(5,32,32,32)

Noop Relu_D 5.1 Actor Subsystem #43 #33 Arr(3,3,32,32) Arr(5,32,32,32)

Conv2D_D_D #32 Arr(5,30,30,32) The design of distributed and parallel computing mod-

Add_D_D #31 ule essentially differentiates Owl from other mainstream Arr(5,30,30,32)

Relu_D #30 numerical libraries. For most libraries, the capability Arr(5,30,30,32)

Maxpool2D_D Const #29 #44 of distributed and parallel computing is often imple- Arr(5,15,15,32) Arr(5,15,15,32)

Const Mul_D_C mented as a third-party library, and the users have to #18 #28 F(1.11111) Arr(5,15,15,32)

Noop Mul_C_D deal with low-level message passing interfaces. How- #45 #27 Arr(3,3,32,64) Arr(5,15,15,32)

Noop Conv2D_D_D ever, Owl achieves such capability through its Actor #46 #26 Arr(64) Arr(5,15,15,64) subsystem. Add_D_D #25 Arr(5,15,15,64) The Actor subsystem implements three computation Noop Relu_D #47 #24 Arr(3,3,64,64) Arr(5,15,15,64) engines to handle both data-parallel and model-parallel Conv2D_D_D #23 Arr(5,13,13,64) computational jobs, as the following list shows. Add_D_D #22 Arr(5,13,13,64)

Relu_D #21 Arr(5,13,13,64) 1. Map-Reduce: implements APIs like map, re-

Maxpool2D_D Const #20 #48 Arr(5,6,6,64) Arr(5,6,6,64) duce, collect, and etc. The engine is for data-

Mul_D_C #19 Arr(5,6,6,64) parallel computational jobs.

Mul_C_D #17 Arr(5,6,6,64)

Reshape_D Noop #16 #49 2. Parameter Server: implements APIs like reg- Arr(5,2304) Arr(2304,512)

Dot_D_D Noop ister, push, pull, and etc. The engine is for #15 #50 Arr(5,512) Arr(1,512)

Add_D_D model-parallel jobs. #14 Arr(5,512)

Relu_D Noop #13 #51 Arr(5,512) Arr(512,10) 3. Peer-to-Peer: very similar to Parameter Server Dot_D_D Noop #12 #52 Arr(5,10) Arr(1,10) engine but without centralised server, each node

Add_D_D #11 Arr(5,10) in the system can serve part of the model.

Get_Row_D Const Get_Row_D Const Get_Row_D Const Get_Row_D Const Get_Row_D Const #10 #53 #70 #71 #76 #77 #64 #65 #58 #59 Arr(1,10) F(0.886806) Arr(1,10) F(0.917824) Arr(1,10) F(0.570415) Arr(1,10) F(0.800436) Arr(1,10) F(0.884163)

Sub_D_C Sub_D_C Sub_D_C Sub_D_C Sub_D_C #9 #69 #75 #63 #57 Arr(1,10) Arr(1,10) Arr(1,10) Arr(1,10) Arr(1,10) Owl’s numerical functionality is ”glued” with the Ac-

Exp_D Exp_D Exp_D Exp_D Exp_D #8 #68 #74 #62 #56 Arr(1,10) Arr(1,10) Arr(1,10) Arr(1,10) Arr(1,10) tor’s distribution functionality using a well-defined func-

Sum_D Sum_D Sum_D Sum_D Sum_D #54 #72 #78 #66 #60 F(5.65194) F(5.59524) F(6.80323) F(6.11484) F(5.48714) tion to achieve distributed analytics. The users do not

Div_D_D Div_D_D Div_D_D Div_D_D Div_D_D #7 #67 #73 #61 #55 need to deal with low-level message passing since this Arr(1,10) Arr(1,10) Arr(1,10) Arr(1,10) Arr(1,10)

Of_Rows_D #6 is already implemented in the engines. The users only Arr(5,10)

Log_D Const #5 #4 need to focus on the high-level application logic by se- Arr(5,10) Arr(5,10)

Mul_C_D lecting the proper engine to use. #3 Arr(5,10)

Sum_D Actor’s engine can be used to distributed and paral- #2 F(-12.0137)

Neg_D Const lelise both low-level and high level data structures. E.g., #1 #79 F(12.0137) F(5) the Line 1 and 2 in code4 generates a distributed Ndar- Div_D_C #0 F(2.40273) ray module by combining Owl’s native Ndarray module and Actor’s map-reduce engine. Figure 3: The computation graph generated from a VGG-like convolution neural network. Arrows repre- Algorithm 4 Implementation of lasso regression sent data flows and nodes represents operations on the data. Each node contains the information on data type, 1: module M1 = Owl parallel.Make shape of ndarray, type of operation, and a unique index. 2: (Dense.Ndarray.S) (Actor.Mapre) This information is valuable in debugging complicated 3: numerical computations. 4: module M2 = Owl neural parallel.Make 5: (Neural.S.Graph) (Actor.Param)

Similarly, this method also applies to advanced data structure like neural network. Line 4 and 5 generates a

6 Algorithm 5 Perform computation on GPU and CPU Algorithm 6 Mixing OpenCL code with OCaml code 1: let cpu compute x = 1: let add two array () = 2: mul x x |> sin |> cos |> tan |> neg 2: let code = ” 3: 3: kernel void add xy 4: let gpu compute x = 4: ( global float *a, global float *b) { 5: let open Owl opencl dense ndarray in 5: int gid = get global id(0); 6: let x = of ndarray x in 6: a[gid] = a[gid] + b[gid]; 7: let y = mul x x |> sin |> cos |> tan |> neg in 7: }” 8: to ndarray float32 y 8: in 9: Context.(add kernels default [|code|]); 10: let x = Dense.Ndarray.S.uniform [|20;10|] in new neural network module which enables distributed 11: let y = Dense.Ndarray.S.sequential [|20;10|] in training by combining Owl’s Neural module and Actor’s 12: Context.eval ∼param:[|F32 x; F32 y|] ”add xy” Parameter Server engine. 5.2 GPU Computing module, OpenCL module internally maintains a compu- Scientific computing involves intensive computations, tation graph for an expression. The expression is lazily and GPU has become an important option to accelerate evaluation to reduce the number of memory copying these computations by performing parallel computation between the host and GPU devices. of_ndarray con- on its massive cores. There are two popular options in verts a normal CPU ndarray to a GPU ndarray, whilst GPGPU computing: CUDA and OpenCL. CUDA is to_ndarray implicitly calls eval function and copies developed by Nvdia and specifically targets their own the result back to the host memory. hardware platform whereas OpenCL is a cross platform Considering a user may have implemented many func- solution and supported by multiple vendors. Owl cur- tions previously in OpenCL C code, Owl also provides rently supports OpenCL and CUDA support is included flexible mechanisms to allow the user to include the in our future plan. OpenCL code and run it directly on OCaml’s native Bigarray. As presented in code snippet 6, line 3 to 7 is GPU-based Ndarray an OCaml native string containing the OpenCL C code which adds up two arrays. These raw OpenCL code will be dynamically compiled during the runtime by calling Pre-compiled Kernels for Vectorised Maths add_kernels function then the compiled kernel is in- cluded in Owl’s kernel cache to avoid recompilation in future invocations. Map, Fold, Scan Primitives Line 12 invokes the OpenCL function by passing the its name to Context.eval and corresponding parame- ters. Note that the OpenCL function names must be Platform, Context, Device, Mem, Queue, Prog, Kernel unique since they are all maintained in a hash table. 5.3 OpenMP OCaml Interface to OpenCL Functions OpenMP uses shared memory multi-threading model to provide parallel computation. It is requires both Figure 4: Module stack of Owl’s OpenCL sublibrary. compiler support and linking to specific system libraries. OpenMP support is transparent to programmers, it can The OpenCL support is implemented as a sub-library, be enabled by turning on the corresponding compilation and Figure 4 presents its module stack. The bottom two switch in jbuild file. layers re-wrap OpenCL objects (e.g. Platform, Context, After enabling OpenMP, many vectorised math op- Device, and etc.) into modules using OCaml interface erators are replaced with the corresponding OpenMP to OpenCL C functions. The middle layer implements implementation in the compiling phase. Parallelism of- map, fold, and scan three primitive with which a com- fered by OpenMP is not a free lunch, the scheduling prehensive set of math functions (e.g. add, sub, mul, mechanism adds extra overhead to a computation task. sin, etc.) is implemented as default kernels. There ker- If the task per se is not computation heavy or the ndar- nels are pre-compiled every time when Owl library is ray is small, OpenMP often slows down the computa- loaded. tion. We therefore set a threshold on ndarray size be- In code snippet 5, the two functions run the same cal- low which OpenMP code is not triggered. This simple culation on CPU and GPU respectively. Similar to Lazy mechanism turns out to be very effective in practice.

7 6. INTERFACED LIBRARIES

0.5 0.5 0.5 Some functionality such as plotting and linear alge- z z z 0.0 0.0 0.0

-0.5 -0.5 -0.5 bra is included into the system by interfacing to other li- -1.0 -1.0 -1.0

12 2 012 0 1 2 012 0 1 2 braries. Rather than simply exposing the low-level func- -2-1 -2-1 -2-10 -1 0 1 y -2 -1 x y -2 -1 x y -2 x tions, we carefully design easy-to-use high-level APIs and this section will cover these modules.

0.5 0.5 0.5

z z z 0.0 0.0 0.0 6.1 Linear Algebra -0.5 -0.5 -0.5 -1.0 -1.0 -1.0

12 12 2 12 2 -10 -10 -10 -2 -1 0 1 2 -2 -1 0 1 -2 -1 0 1 Even though is no longer among the top choices y -2 x y -2 x y -2 x as a programming language, there is still a large body of Fortran numerical libraries whose performance still re- Figure 5: Subplots generated by Owl’s Plot module. main competitive even by today’s standard, e.g. BLAS and LAPACK. When designing the linear algebra module, we decide The Plot.output function on Line 16 evaluates all the to interface to CBLAS and LAPACKE (i.e. the cor- cached function closures on the canvas and output the responding C interface of BLAS and LAPACK) then final figure. further build higher-level APIs atop of the low-level Fortran functions. The high-level APIs hides many te- Algorithm 7 Using Plot module to make subplots dious tasks such as setting memory layout, allocating 1: let make subplots x y z = workspace, calculating strides, and etc. 2: let open Plot in One thing worth noting is that Owl only uses C layout 3: let h = create ∼m:2 ∼n:3 ”demo.pdf” in (i.e. row-based layout) to represent ndarray in the mem- 4: subplot h 0 0; ory. This decision is based on two considerations: first, 5: mesh ∼h ∼spec:[ ZLine XY ] x y z; using one layout significantly reduces the code base, we 6: subplot h 0 1; do not have to implement two separate versions of func- 7: mesh ∼h ∼spec:[ ZLine X ] x y z; tions for each ndarray operation in order to obtain the 8: subplot h 0 2; best performance. Second, using Fortran layout indi- 9: mesh ∼h ∼spec:[ ZLine Y ] x y z; cates indexing starts from 1 whereas OCaml’s native 10: subplot h 1 0; indexing (e.g. list and array) starts from 0. Mixing two 11: mesh ∼h ∼spec:[ ZLine Y; NoMagColor ] x y z; indexing styles turns out to be a source of bugs in our 12: subplot h 1 1; experience. 13: mesh ∼h ∼spec:[ ZLine Y; Contour ] x y z; 14: subplot h 1 2; 6.2 Plotting Functions 15: mesh ∼h ∼spec:[ ZLine XY; Curtain ] x y z; Plotting is an indispensable function in modern nu- 16: output h merical libraries. We build Plot module on top of PLplot [14] which is a powerful cross-platform plotting library. We introduce a named parameter called spec in each However PLPlot only provides very low-level functions plotting function so that the programmer can pass in to interact with its multiple plotting engines, eveb mak- a list of specification to further configure a plot. This ing a simple plot involves very lengthy and tedious con- gives Plot module the flexibility for future extension, the trol sequence. Using these low-level functions directly unknown or unsupported specification will be ignored requires developers to understand the mechanisms in by the plotting function. depth, which not only significantly reduces the produc- tivity but also is prone to errors. Inspired by Matlab, we implement Plot module to 7. PRELIMINARY EVALUATION provide developers a set of high-level APIs. The core Since Owl provides the underlying numerical func- plotting engine is very lightweight and only contains tions for actor, we have compared its performance with about 200 LOC. Its core design is to cache all the plot- several other mainstream tools. Even though Tensor- ting operations as a sequence of function closures and Flow is not a general purpose numerical library, it is execute them all when we output the figure. included due to relatively strong support for tensor op- The Figure 5 presents a subplot generated by the erations. The numerical library often contains thou- code snippet 7. The Plot.create function on the third sands of APIs, therefore we cannot provides the com- line returns a handle of a 2 × 3 subplot. Then the plete comparison due to the space limit. In this pa- following calls on Plot.subplot i j shifts the ”focus” per, we only select one function from each category and on each subplot with specified coordinates (i, j) to let present the results in table 1. Note all the evaluations the programmer perform further plotting operations. are performed CPU only.

8 Table 1: Performance comparison of core functions between different libraries. The time is measured in ms.

Operation Owl Scipy Julia TensorFlow slicing [*; 0:499] 0.80 (0.07) 0.94 (0.21) 0.78 (0.19) 1.04 (0.04) slicing [0:499; *] 0.65 (0.04) 0.79 (0.12) 0.70 (0.10) 0.84 (0.08) relu (map) 3.51 (0.18) 3.67 (0.24) 3.48 (0.15) 3.53 (0.09) sum (fold) 0.57 (0.04) 3.58 (0.18) 1.84 (0.08) 0.74 (0.06) cumsum (scan) 6.80 (0.21) 9.21 (0.31) 6.53 (0.94) 6.62 (0.84) x + y 3.83 (0.46) 5.82 (0.38) 4.39 (0.50) 4.02 (0.10) inv(x) 88.5 (2.23) 91.6 (3.38) 95.8 (1.70) 472 (10.1) iter 39.3 (1.31) 4378 (283) 538 (32.2) 13.2 (0.35)

We use Owl (version 0.3.0) in the evaluation and compare to Numpy (version 1.8.0rcl) and Julia (version 0.5.0) [1]. All evaluations are done on the same Mac- Book Air (1.6GHz Intel Core i5, 8GB memory). In the 1200 evaluation, we create a 1000×1000 random matrix (uni- Owl TensorFlow Caffe2 form distribution in [0, 1]). Each experiment is repeated 1000 100 times and both the mean and the standard devia- tion (in the parenthesis) are reported. The first ten 800 evaluations serve as warm-up phase and are excluded from calculation. This is especially important for Ju- 600 lia because its first run requires a significant amount of extra time for compilation. 400

Regarding the selected operations, we first choose Inference Time (ms) Time Inference slicing function since it is a fundamental operation of 200 n-dimensional array which serves as a core data struc- ture in all the numerical libraries. Second, many op- 0 erations on n-dimensional array can be categorised as LeNet-5 InceptionV3 VGG16 map, fold (or reduce), and scan three groups, hence we choose one function from each. Third, we include (a) Inference performance one binary math operator + for addition as well as inv for matrix inversion as basic linear algebra examples. In the last, we also include iter function for iterating all the elements which can be used to build many useful 380 operations. 370 For both slicing and vectorised mathematical opera- tions, we can see all the libraries achieve similar perfor- 360 mance and in most cases Owl is the best whereas Scipy is the worst. Owl and Scipy have slightly higher stan- 350 dard deviation which is the effects of GC. TensorFlow 340 offloads many functions to Eigen4, and it is interest- ing to notice that a library developed in C++ does not 330 necessarily indicate better performance. (iteration/ms) speed Training For inv, TensorFlow is significantly slower since it 320 uses its native implementation whereas the rest three Owl TensorFlow Caffe2 interfaces to Lapack(e). For iteration function, Ten- (b) Training performance sorFlow is much faster since the loop overhead is sig- nificantly smaller in C++ whereas the rest three are Figure 6: Performance comparison between different slower. However, Owl still possesses superior perfor- deep neural network frameworks. Owl is able to achieve mance to Julia and Scipy due to OCaml highly efficient superior performance in both training and inference. language runtime. Besides benchmarking the basic numerical functions, 4http://eigen.tuxfamily.org/

9 we also evaluate a set of more advanced examples and Implementation (OSDI 16), pages 265–283, report the results in Figure 6. In this set of evalua- Savannah, GA, 2016. USENIX Association. tions, we compare Owl to two popular deep neural net- [6] F. Bastien, P. Lamblin, . Pascanu, J. Bergstra, work frameworks Caffe2 (built on original Caffe [10]) I. Goodfellow, A. Bergeron, N. Bouchard, and TensorFlow [5]. . Warde-Farley, and Y. Bengio. Theano: new In Figure 6a, we evaluate the inference performance of features and speed improvements. arXiv preprint three different neural networks, i.e. LeNet-5, Google’s arXiv:1211.5590, 2012. Inception [18], and VGG16 [17]. Owl is able to achieve [7] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, the best performance in both Inception and VGG16. All and J. M. Siskind. Automatic differentiation in three frameworks have similar performance on LeNet, machine learning: a survey. Journal of Machine but the results are subject to higher variance due to Learning Research, 18(153):1–43, 2018. LeNet’s small size. As the neural network gets more [8] S. Dolan, L. White, K. Sivaramakrishnan, complicated, the performance gain of using Owl be- J. Yallop, and A. Madhavapeddy. Effective comes more noticeable. concurrency through algebraic effects. In OCaml In Figure 6b, we build and train a VGG16 network Workshop, 2015. using CIFAR [13] dataset. On all three frameworks, [9] A. Griewank and A. Walther. Evaluating we try to keep most experiment settings the same for derivatives: principles and techniques of fair comparison, e.g. all use Adagrad 0.005 as learning algorithmic differentiation. SIAM, 2008. rate. Due to the same setting, the convergence speed [10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, w.r.t number of iterations is the same on all frameworks, J. Long, R. Girshick, S. Guadarrama, and therefore we measure the time spent in each iteration as T. Darrell. Caffe: Convolutional architecture for its training speed. The evaluation shows the difference fast feature embedding. In Proceedings of the 22nd between these frameworks is not significant with Caffe2 ACM international conference on Multimedia, often has slightly better performance and TensorFlow pages 675–678. ACM, 2014. has the worse one. Facebook has devoted a lot of ef- [11] W. M. Johnston, J. R. P. Hanna, and R. J. Millar. forts in optimising Caffe2 for production use whereas Advances in dataflow programming languages. TensorFlow and Owl target remain higher flexibility as ACM Comput. Surv., 36(1):1–34, Mar. 2004. a framework for research purpose. [12] E. Jones, T. Oliphant, and P. Peterson. {SciPy}: open source scientific tools for {Python}. 2014. 8. CONCLUSION [13] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009. In this paper, we presented Owl - an OCaml library [14] M. LeBRUN and G. Furnish. The plplot plotting for scientific and engineering computing. We introduced library programmer’s reference manual. Institute Owl’s design principles, as well as its core components, for Fusion Studies, Univ. of Texas, Austin, TX key functionality, advanced algorithmic differentiation. (September 1994), 2005. We also provided several examples to show how to take [15] A. Madhavapeddy, R. Mortier, C. Rotsos, advantage of Owl to fast prototype new ideas. Our D. Scott, B. Singh, T. Gazagnaire, S. Smith, preliminary evaluation has shown that Owl has superior S. Hand, and J. Crowcroft. Unikernels: Library performance comparing to other mainstream numerical operating systems for the cloud. SIGPLAN Not., libraries, meanwhile its architecture remains simple and 48(4):461–472, Mar. 2013. the code case is small enough to be managed by a small [16] A. Paszke, S. Gross, S. Chintala, G. Chanan, group of developers. E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation 9. REFERENCES in pytorch. In NIPS-W, 2017. [17] K. Simonyan and A. Zisserman. Very deep [1] Julia language. https://julialang.org/, No convolutional networks for large-scale image date. Accessed January 20, 2017. recognition. arXiv preprint arXiv:1409.1556, 2014. [2] Lacaml. https://github.com/mmottl/lacaml, [18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and No date. Accessed January 20, 2017. Z. Wojna. Rethinking the inception architecture [3] Numpy. https://github.com/numpy/numpy, No for computer vision. In 2016 IEEE Conference on date. Accessed January 20, 2017. Computer Vision and Pattern Recognition [4] Oml. https://github.com/rleonid/oml, No (CVPR), pages 2818–2826, June 2016. date. Accessed January 20, 2017. [19] L. Wang, B. Catterall, and R. Mortier. [5] M. Abadi and et al. Tensorflow: A system for Probabilistic synchronous parallel. arXiv preprint large-scale machine learning. In 12th USENIX arXiv:1709.07772, 2017. Symposium on Operating Systems Design and

10 [20] J. Zhao, R. Mortier, J. Crowcroft, and L. Wang. Privacy-preserving machine learning based data analytics on edge devices. 2018.

11