Compact Native Code Generation for Dynamic Languages on Micro-core Architectures
Maurice Jamieson and Nick Brown ([email protected])
EPCC at the University of Edinburgh, Edinburgh, United Kingdom

arXiv:2102.02109v1 [cs.PL] 3 Feb 2021

Abstract

Micro-core architectures combine many simple, low memory, low power-consuming CPU cores onto a single chip. Potentially providing significant performance and low power consumption, this technology is not only of great interest in embedded, edge, and IoT uses, but also potentially as accelerators for data-center workloads. Due to the restricted nature of such CPUs, these architectures have traditionally been challenging to program, not least due to the very constrained amounts of memory (often around 32KB) and idiosyncrasies of the technology. However, more recently, dynamic languages such as Python have been ported to a number of micro-cores, but these are often delivered as interpreters, which have an associated performance limitation.

When targeting the four objectives of performance, unlimited code-size, portability between architectures, and maintaining the programmer productivity benefits of dynamic languages, the limited memory available means that classic techniques employed by dynamic language compilers, such as just-in-time (JIT) compilation, are simply not feasible. In this paper we describe the construction of a compilation approach for dynamic languages on micro-core architectures which aims to meet these four objectives, and use Python as a vehicle for exploring the application of this in replacing the existing micro-core interpreter. Our experiments focus on the metrics of performance, architecture portability, minimum memory size, and programmer productivity, comparing our approach against that of writing native C code. The outcome of this work is the identification of a series of techniques that are not only suitable for compiling Python code, but also applicable to a wide variety of dynamic languages on micro-cores.

Keywords: native code generation, Python, micro-core architectures, Epiphany, RISC-V, MicroBlaze, ARM, soft-cores

1 Introduction

Micro-core architectures combine many simple, low power CPU cores on a single processor package. Providing significant parallelism and performance at low power consumption, these architectures are not only of great interest in embedded, edge, and IoT applications, but also demonstrate potential in high performance workloads as many-core accelerators. From the Epiphany [7], to the PicoRV-32 [36], to the Pezy-SC2, this class of architecture represents a diverse set of technologies. However, the trait that all these share is that they typically provide a very constrained programming environment and complex, manually programmed memory hierarchies. In short, writing code for these architectures, typically in C with some bespoke support libraries, is time consuming and requires expertise. There are numerous reasons for this; for instance, not only must the programmer handle idiosyncrasies of the architectures themselves but also, for good performance, contend with fitting their data and code into the tiny amount (typically 32-64KB) of on-core fast scratchpad memory.

Little wonder, then, that there have been a number of attempts to provide programming technologies that increase the abstraction level for micro-cores. Some early successes involved a simple port of OpenCL [33] and OpenMP [15]; however, whilst this helped with the marshalling of overall control, the programmer still had to write low-level C code and handle many of the architectural complexities. More recently, a number of dynamic languages have been ported to these very constrained architectures. From an implementation of a Python interpreter [17], to [11] and [30], these are typically provided as interpreters, but the severe memory limits of these cores are a major limiting factor for such approaches.

It was our belief that a compiled, rather than interpreted, approach to these dynamic languages would offer a much better solution on the micro-cores. However, the very nature of the technology requires considerable thought about how best to construct an approach which can deliver the following objectives:

• Performance at or close to directly written C code
• The ability to handle code sizes of any arbitrary length
• Portability between different micro-core architectures
• Maintaining the programmer productivity benefits of dynamic languages

In this paper we describe our approach to the construction of a compilation approach for dynamic languages on micro-core architectures. The paper is organised as follows: in Section 2 we explore background and related work from a compilation perspective and why existing approaches did not match the objectives described above. Section 3 then describes our approach in detail, before applying this to an existing Python interpreter, called ePython, for micro-cores in Section 4. In Section 4.2 we not only explore the application of our approach to compiling Python for micro-cores, but also the performance and memory characteristics of our approach compared to writing directly in C. Section 5 then draws a number of conclusions and discusses further work.

2 Background and Related Work

Whilst implementations of dynamic languages, such as Python, for micro-core architectures greatly reduce the time and effort needed to develop applications in comparison to writing them using the provided C software development kits (SDKs), there remains the performance overhead of interpreting dynamic languages over compiled C binaries. There are two major approaches to accelerating Python codes: just-in-time (JIT) compilation of the bytecodes at runtime, and ahead-of-time (AOT) compilation before the code executes. In desktop and server environments, the Numba JIT compiler [27] accelerates Python applications by specifying a subset of the language that is compiled to native code for central processing units (CPUs) and graphics processing units (GPUs). Numba uses a Python function decorator, or directive, approach to annotating the code to compile to native code, defining the @jit decorator to target CPUs, and the @cuda.jit decorator to target GPUs and perform the necessary data transfer. Numba's JIT compiler can speed up codes by a factor of around 20 times that of the standard Python interpreter [31].

The Nuitka [22] and Cython [14] AOT Python compilers generate C source code that is compiled and linked against the CPython runtime libraries. Whilst, like Numba, they are highly compliant with, and produce binaries that are significantly faster than, the standard CPython interpreter, up to 30 times faster in the case of Cython [32], the overall binary size is large. For instance, if we take the seven line code example by [18] and compile it on x86 Linux, Nuitka produces a dynamically-linked binary that is 154KB, and Cython produces a much smaller binary at 43KB. However, when the required dynamic libraries are included, the overall size is significantly larger, at 29MB for the archive file produced by Nuitka using the --standalone […]

[…] for operations such as arithmetic calculations, binary operations and comparisons. Effectively, this method removes the overhead of the VM's dispatch loop, whilst leveraging the existing VM functions and capabilities. The viper emitter takes this approach further by also generating machine code instructions for operations, rather than calling the VM functions. As we would expect, this increases performance yet further, as arithmetic operations are performed inline rather than via a procedure call. This results in performance approximately 10 times that of the native emitter and around 24 times faster than the VM [21]. Whilst the MicroPython emitters are not JIT compilers, as the native code is generated before execution begins and the bytecode is not profiled to select compilation candidates, the code generation is performed on the microcontrollers themselves.

Our native code generation approach is similar to that of the MicroPython viper emitter in that native code is generated for all Python code, including operations such as arithmetic and comparisons. However, as will be discussed in Section 3, we generate C source code that is compiled to a native binary for download and execution on the micro-core device.

As well as the Nuitka and Cython AOT Python compilers, a number of existing programming languages have used this approach, including Eiffel [12], Haskell [34] and LOLCODE [29]. Whilst Haskell has deprecated the C backend in preference to one based on LLVM [2], the former is still beneficial for porting to a new platform as it produces vanilla code, requiring only the gcc, as and ld tools [1]. Likewise, C was chosen as the backend for ePython native code generation to enhance portability, particularly as a number of the target micro-core architectures, including the Adapteva Epiphany and Xilinx MicroBlaze, are not supported by LLVM. Furthermore, code generators, such as MicroPython's emitters, need to be specifically written to support new processor instruction set architectures (ISAs), and the C backend ensures that native code generation is immediately available on all micro-core platforms that ePython supports. Crucially, our approach generates high-level C source code that retains the overall structure of the Python source program rather than emitting machine code representations of the bytecode
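Numba's decorator-driven annotation described above can be sketched briefly. The `@jit` decorator with `nopython=True` is Numba's actual API; the `ImportError` fallback stub and the `dot` function are our own additions so the snippet also runs where Numba is not installed:

```python
# Sketch of Numba's decorator-based annotation (@jit is Numba's real API).
# The fallback below is a hypothetical no-op stub of ours, so the example
# still runs, uncompiled, on systems without Numba.
try:
    from numba import jit
except ImportError:
    def jit(nopython=True):
        def wrap(f):
            return f  # no-op: fall back to the standard interpreter
        return wrap

@jit(nopython=True)  # request native CPU code generation for this function
def dot(a, b):
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

The decorator leaves the rest of the program untouched, which is what makes this a per-function, opt-in acceleration model.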
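MicroPython selects its emitters per function with decorators; `@micropython.native` and `@micropython.viper` are MicroPython's real selectors, while the CPython stub class below is our own hypothetical shim added purely so the snippet can execute outside a microcontroller:

```python
# Per-function emitter selection in MicroPython. The decorators are real
# MicroPython APIs; the stub class is a hypothetical shim of ours so the
# code also runs under standard CPython (where no machine code is emitted).
try:
    import micropython
except ImportError:
    class micropython:
        @staticmethod
        def native(f):
            return f
        @staticmethod
        def viper(f):
            return f

@micropython.native  # bytecode ops become direct calls into VM functions
def add_native(a, b):
    return a + b

@micropython.viper   # typed-integer ops are inlined as machine instructions
def add_viper(a: int, b: int) -> int:
    return a + b

print(add_native(2, 3), add_viper(2, 3))
```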
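The idea of generating high-level C that retains the structure of the Python source, rather than machine code for each bytecode, can be illustrated with a toy translator for a tiny integer-only subset. This is our own illustrative sketch, not the ePython code generator itself:

```python
# Toy sketch (our invention, not ePython's generator): translate a tiny
# integer-only Python subset -- assignments, augmented assignments,
# for-range loops and return -- into structurally similar C source.
import ast

def emit_c(source, func_name):
    fn = next(n for n in ast.parse(source).body
              if isinstance(n, ast.FunctionDef) and n.name == func_name)
    args = ", ".join(f"int {a.arg}" for a in fn.args.args)
    lines = [f"int {fn.name}({args}) {{"]
    for stmt in fn.body:
        lines.extend("    " + l for l in emit_stmt(stmt))
    lines.append("}")
    return "\n".join(lines)

def emit_stmt(stmt):
    if isinstance(stmt, ast.Assign):          # first use declares an int
        return [f"int {stmt.targets[0].id} = {emit_expr(stmt.value)};"]
    if isinstance(stmt, ast.AugAssign):
        op = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}[type(stmt.op)]
        return [f"{stmt.target.id} {op}= {emit_expr(stmt.value)};"]
    if isinstance(stmt, ast.For):             # only 'for i in range(n)'
        i, n = stmt.target.id, emit_expr(stmt.iter.args[0])
        body = [l for s in stmt.body for l in emit_stmt(s)]
        return ([f"for (int {i} = 0; {i} < {n}; {i}++) {{"]
                + ["    " + l for l in body] + ["}"])
    if isinstance(stmt, ast.Return):
        return [f"return {emit_expr(stmt.value)};"]
    raise NotImplementedError(type(stmt).__name__)

def emit_expr(e):
    if isinstance(e, ast.Name):
        return e.id
    if isinstance(e, ast.Constant):
        return str(e.value)
    if isinstance(e, ast.BinOp):
        op = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}[type(e.op)]
        return f"({emit_expr(e.left)} {op} {emit_expr(e.right)})"
    raise NotImplementedError(type(e).__name__)

PY = """
def total(n):
    acc = 0
    for i in range(n):
        acc += i
    return acc
"""
print(emit_c(PY, "total"))
```

Because the output is plain C, any platform with a C toolchain can consume it, which mirrors the portability argument made above for choosing C over per-ISA code emitters.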