Institute of Computer Architecture and Computer Engineering, University of Stuttgart, Pfaffenwaldring 47, D-70569 Stuttgart

Diplomarbeit Nr. 2980

Retargeting a C compiler to the HAPRA/FAPRA architecture

Tilmann Scheller

Course of Study: Software Engineering

Examiner: Prof. Dr. Hans-Joachim Wunderlich

Supervisors: Dipl.-Ing. Christian Zöllin, Dipl.-Inform. Melanie Elm

Commenced: October 19, 2009

Completed: April 20, 2010

CR-Classification: C.0, C.4, D.3.4

Abstract

The HAPRA and FAPRA architectures are simple 32-bit RISC architectures which are used for educational purposes. Without a compiler for a high-level language, software for the HAPRA/FAPRA architecture needs to be written in assembly language. This is unfortunate since writing software in assembly language is time-consuming, error-prone and results in unportable software. The goal of this thesis is to develop a complete C-based toolchain for the HAPRA/FAPRA architecture, including an assembler, a C compiler and a port of a C standard library. The resulting toolchain is used to compare the different subsets of the HAPRA/FAPRA instruction set in terms of runtime performance and space efficiency. With the availability of a C compiler for the target architecture it is significantly easier to measure the impact of extensions or modifications of the existing instruction set. A wide spectrum of portable open source software exists today. The availability of a C-based toolchain for the HAPRA/FAPRA architecture enables, at one go, access to this large software stack. It is expected that this toolchain will be used to port a full operating system to the HAPRA/FAPRA architecture. The ability to run an entire operating system is also likely to be a great motivator for students designing their own custom implementations of the HAPRA/FAPRA architecture as part of their lab courses.

Contents

1 Introduction

2 Architecture & Compiler Fundamentals
  2.1 Instruction set architecture
    2.1.1 FAPRA architecture
    2.1.2 HAPRA architecture
    2.1.3 HASE/Angora
  2.2 Compiler pipeline
    2.2.1 Frontend
    2.2.2 Middle-end
    2.2.3 Backend
  2.3 LLVM Compiler Infrastructure
    2.3.1 Overview
    2.3.2 C frontend
    2.3.3 Intermediate representations
    2.3.4 Target-independent code generator

3 Implementation
  3.1 Application Binary Interface
  3.2 HAPRA/FAPRA backend
    3.2.1 Overview
    3.2.2 Machine description
  3.3 C frontend
  3.4 C standard library
  3.5 Assembler
  3.6 Simulator
  3.7 Linker

4 Results
  4.1 Example
  4.2 Measurements
    4.2.1 Instruction count
    4.2.2 Code size
  4.3 Simulator
  4.4 Experience
  4.5 Remarks

5 Conclusion & Future Work

A Appendix

List of Figures

1 The HASE GUI
2 The compiler pipeline
3 The LLVM architecture
4 The target-independent code generator architecture
5 The HAPRA/FAPRA ABI stack frame layout
6 The newlib architecture
7 The libcpu architecture
8 Comparing the number of executed instructions between HAPRA and FAPRA
9 Comparing the binary size of benchmarks compiled for HAPRA and FAPRA

List of Tables

1 The FAPRA instruction set architecture
2 The HAPRA instruction set architecture
3 Registers in the HAPRA/FAPRA ABI
4 Size and alignment of scalar data types on the FAPRA architecture with byte-addressing
5 Size and alignment of scalar data types on the FAPRA architecture with word-addressing and on the HAPRA architecture
6 The Stanford benchmark suite
7 The Computer Language Shootout benchmark suite
8 Instruction distribution when executing hello
9 Executing the Stanford benchmark suite on simulators
10 Executing the Computer Language Shootout benchmark suite on simulators

Listings

1 Instruction format description in TableGen syntax
2 Instruction description in TableGen syntax
3 C source code of Mandelbrot example
4 LLVM IR of Mandelbrot example
5 FAPRA assembly code of Mandelbrot example

1 Introduction

The RISC processor architectures introduced in the undergraduate and graduate lab courses of the institute use a rather simple, nevertheless complete, instruction set (see Tables 1 and 2). While the programming environment features an assembly simulation and debugging environment, larger software projects would benefit from the availability of a compiler for a high-level language. The goal of this thesis is to develop a complete C-based toolchain for the HAPRA/FAPRA architecture, including an assembler, a C compiler and a port of a C standard library.

A wide spectrum of portable open source software exists today, with the Linux kernel being undisputedly among the most popular open source projects. The availability of a C-based toolchain for the HAPRA/FAPRA architecture enables, at one go, access to a large software stack, which, once the initial work of porting the Linux kernel is done, can be run with minimal effort. Without a compiler for a high-level language, software for the HAPRA/FAPRA architecture needs to be written in assembly language. This is unfortunate since writing software in assembly language is time-consuming, error-prone and results in unportable software.

It is expected that this toolchain will be used to port the µClinux [uCl] kernel and userspace to the HAPRA/FAPRA architecture, making it possible to run a full operating system on the hardware. This has the nice side effect that students who design custom implementations of the HAPRA/FAPRA architecture as part of their lab courses get the ability to run a whole operating system on their hardware, which is expected to be a great motivator for them. With the availability of a C compiler for the target architecture it is significantly easier to compare the different subsets of the HAPRA/FAPRA instruction set in terms of runtime performance and space efficiency and to measure the effect of extensions or modifications of the existing instruction set.

This thesis is organized as follows: Chapter 2 describes the HAPRA/FAPRA architecture, the compiler pipeline and the compiler framework used in this thesis. Chapter 3 discusses the implementation of the toolchain for the HAPRA/FAPRA architecture. Chapter 4 presents the results of measurements, the experience gained during the implementation of the toolchain and suggestions on how to improve the HAPRA/FAPRA architecture. Chapter 5 draws a conclusion and gives an outlook on future work.

2 Architecture & Compiler Fundamentals

This chapter describes the HAPRA and FAPRA instruction set architectures (ISA), presents a brief overview of the compilation process and introduces the Low Level Virtual Machine (LLVM) Compiler Infrastructure.

2.1 Instruction set architecture

This section describes the HAPRA and FAPRA architectures and their respective assembly language development environments.

2.1.1 FAPRA architecture

The FAPRA architecture is a simple 32-bit RISC architecture which was designed for educational purposes. The instruction set encompasses 29 different instructions which can be divided into 5 instruction classes: load/store instructions, load immediate instructions, control flow instructions, arithmetic/logical instructions and comparison instructions. Table 1 shows the FAPRA ISA.

SIMD variants of the architecture also exist, which allow operations on vectors of two or four 32-bit integers, increasing the data parallelism accordingly. The memory transfer size is not widened in the SIMD variants and is always 32-bit; thus, loading an arbitrary 128-bit value from memory requires four load instructions and additional instructions to place the individual vector elements at the desired positions within the destination vector register.

As the native word size is 32-bit, there is a register file consisting of 32 registers with a width of 32-bit, except for the SIMD variants where the registers are 64-bit or 128-bit wide, forming a unified register file which contains both scalar and vector values. The program counter is stored in a dedicated register which can only be read/written implicitly, e.g. through certain control flow instructions. A special status register with flags for results of arithmetic/logical instructions does not exist; instead, comparison instructions write their results directly into their destination register. There are no native floating-point instructions; floating-point operations need to be implemented in software.

As usual for RISC architectures, instructions have a fixed width of 32-bit in order to simplify the decoding and prefetching hardware. There are two major instruction formats, the first one consisting of a 6-bit opcode field and three 5-bit fields, specifying two source registers and one destination register. The second format includes a 6-bit opcode field and two 5-bit fields which specify one source register and the destination register.

The difference is that there is a 16-bit immediate field instead of a field for a second source register. There is only one addressing mode, which calculates the effective memory address by adding a signed 16-bit immediate value to the value of a base register. Regarding memory addressing, there are two variations of the FAPRA architecture:

• Word-addressing: the smallest addressable unit is 32-bit.

• Byte-addressing: the smallest addressable unit is 8-bit.

Relative branch instructions have a signed 16-bit offset which is added to the current value of the program counter. This allows forward/backward branches with a distance of up to 8192 instructions when using byte-addressing and up to 32768 instructions when using word-addressing: since instructions are 4 bytes wide, the signed 16-bit byte displacement (±2^15 bytes) covers 2^15/4 = 8192 instructions, whereas a word displacement directly counts 32-bit instructions. Unusually, there are no native logical shift right and xor instructions.

Instruction  Opcode (31 downto 16)  15 downto 0         Meaning                                          Example
LD.W    01 0000 ddddd aaaaa   iiiii iiiii iiiiii   Rd[3:0] ← M(Ra[0] + SignExt(i))                  ld.w $d, i($a)
ST.W    01 0001 ddddd aaaaa   iiiii iiiii iiiiii   M(Ra[0] + SignExt(i)) ← Rd[0]                    st.w $d, i($a)
LD.B    01 1100 ddddd aaaaa   iiiii iiiii iiiiii   Rd[3:0] ← "000..." & M(Ra[0] + SignExt(i))       ld.b $d, i($a)
ST.B    01 1101 ddddd aaaaa   iiiii iiiii iiiiii   M(Ra[0] + SignExt(i)) ← Rd[0](7:0)               st.b $d, i($a)
LDIH    10 0000 ddddd -mmmm   iiiii iiiii iiiiii   ∀s ∉ m: Rd[s](31:16) ← i                         ldih $d[0,2], 0x0815
LDIL    10 0001 ddddd -mmmm   iiiii iiiii iiiiii   ∀s ∉ m: Rd[s](15:0) ← i                          ldil $d[1,3], 12
JMP     11 0000 ----- aaaaa   ------               pc ← Ra[0]                                       jmp $a
BRA     11 0100 ----- -----   iiiii iiiii iiiiii   pc ← pc + SignExt(i)                             bra label
BZ      11 0101 -mmmm aaaaa   iiiii iiiii iiiiii   pc ← pc + SignExt(i) if ∀s ∉ m: Ra[s] == 0       bz label, $a[0,2]
BNZ     11 0110 -mmmm aaaaa   iiiii iiiii iiiiii   pc ← pc + SignExt(i) if ∀s ∉ m: Ra[s] != 0       bnz label, $a[0,2]
NOP     11 0010 ----- -----   ------               no operation                                     nop
CALL    11 0011 ddddd aaaaa   ------               pc ← Ra[0], Rd ← pc + 4                          call $d, $a
BL      11 0111 ddddd -----   iiiii iiiii iiiiii   pc ← pc + SignExt(i), Rd ← pc + 4                bl $d, label
RFE     11 1111 ----- -----   ------               PC ← PCbackup                                    rfe
ADDI    00 1111 ddddd aaaaa   iiiii iiiii iiiiii   Rd ← Ra + SignExt(i)                             addi $d,$a,i
ADD     00 0000 ddddd aaaaa   bbbbb ------         Rd ← Ra + Rb                                     add $d,$a,$b
SUB     00 0001 ddddd aaaaa   bbbbb ------         Rd ← Ra - Rb                                     sub $d,$a,$b
AND     00 0010 ddddd aaaaa   bbbbb ------         Rd ← Ra & Rb                                     and $d,$a,$b
OR      00 0011 ddddd aaaaa   bbbbb ------         Rd ← Ra | Rb                                     or $d,$a,$b
NOT     00 0101 ddddd aaaaa   ------               Rd ← ~Ra                                         not $d,$a
SARI    00 1011 ddddd aaaaa   iiiii iiiii iiiiii   Rd ← Ra >> i, Rd(31:32-i) ← Ra(31); i < 0: sali  sari $d,$a,i
SAL     00 0110 ddddd aaaaa   bbbbb ------         Rd ← Ra << Rb                                    sal $d,$a,$b
SAR     00 0111 ddddd aaaaa   bbbbb ------         Rd ← Ra >> Rb, Rd(31:32-Rb) ← Ra(31)             sar $d,$a,$b
MUL     00 1000 ddddd aaaaa   bbbbb ------         ∀s ∈ {0,1,2,3}: Rd[s] ← Ra[s] * Rb[s]            mul $d,$a,$b
SWAP    00 1001 ddddd aaaaa   ------               Rd[0] ← Ra[1], Rd[1] ← Ra[0]                     swap $d,$a
PERM    00 1001 ddddd aaaaa   ------ llkkjjii      Rd ← (Ra[i], Ra[j], Ra[k], Ra[l])                perm $d,$a[i,j,k,l]
RDC8    00 1010 ddddd aaaaa   ------               Rd[0] ← Ra[3](7:0) & ... & Ra[0](7:0)            rdc8 $d,$a
TGE     00 1100 ddddd aaaaa   bbbbb ------         ∀s ∈ {0,1,2,3}: Rd[s] ← (Ra[s] ≥ Rb[s]) ? 0 : 1  tge $d,$a,$b

Table 1: The FAPRA instruction set architecture.

2.1.2 HAPRA architecture

The HAPRA architecture is a subset (17 instructions) of the FAPRA architecture; the key difference is that the HAPRA architecture uses word-addressing instead of byte-addressing. Table 2 shows the HAPRA ISA. The HAPRA architecture is scalar only; there are no SIMD variants.

There are no indexed addressing modes for load/store instructions and no immediate addressing modes for arithmetic/logical instructions. No relative branch instructions are available; branch target addresses need to be loaded explicitly into a register and used with an indirect branch instruction. Multiplication needs to be implemented in software as there is no native multiplication instruction. In contrast to the FAPRA architecture there are no native comparison instructions.

Since the HAPRA/FAPRA architecture is used mainly for educational purposes, namely teaching microprocessor design, and due to the high costs of ASIC design and manufacturing, implementations of the HAPRA/FAPRA architecture are generally realized with FPGAs.

Instruction  Opcode (31 downto 16)  15 downto 0         Meaning                       Example
LD      01 0000 ddddd aaaaa   ------               Rd ← M(Ra)                    ld $d, ($a)
ST      01 0001 ----- aaaaa   bbbbb ------         M(Ra) ← Rb                    st $b, ($a)
LDIH    10 0000 ddddd -----   iiiii iiiii iiiiii   Rd(31:16) ← IR(15:0)          ldih $d, 0x0815
LDIL    10 0001 ddddd -----   iiiii iiiii iiiiii   Rd(15:0) ← IR(15:0)           ldil $d, 12
JMP     11 0000 ----- aaaaa   ------               pc ← Ra                       jmp $a
JZ      11 0001 ----- aaaaa   bbbbb ------         pc ← Ra if Rb == 0            jz $a, $b
NOP     11 0010 ----- -----   ------               no operation                  nop
CALL    11 0011 ddddd aaaaa   ------               pc ← Ra, Rd ← pc + 1          call $d, $a
ADD     00 0000 ddddd aaaaa   bbbbb ------         Rd ← Ra + Rb                  add $d,$a,$b
SUB     00 0001 ddddd aaaaa   bbbbb ------         Rd ← Ra - Rb                  sub $d,$a,$b
AND     00 0010 ddddd aaaaa   bbbbb ------         Rd ← Ra & Rb                  and $d,$a,$b
OR      00 0011 ddddd aaaaa   bbbbb ------         Rd ← Ra | Rb                  or $d,$a,$b
CP      00 0100 ddddd aaaaa   ------               Rd ← Ra                       cp $d,$a
NOT     00 0101 ddddd aaaaa   ------               Rd ← ~Ra                      not $d,$a
SAL     00 0110 ddddd aaaaa   ------               Rd ← Ra << 1                  sal $d,$a
SAR     00 0111 ddddd aaaaa   ------               Rd ← Ra >> 1, Rd(31) ← Ra(31) sar $d,$a
RFE     11 0100 ----- -----   ------               PC ← PCbackup                 rfe

Table 2: The HAPRA instruction set architecture.
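As noted above, HAPRA has no native multiply, so the compiler must expand multiplications into calls to a runtime routine. The following is a minimal C sketch of such a shift-and-add helper; the name __mulsi3 follows the common libgcc/compiler-rt convention and is an assumption here, not taken from the thesis:

    #include <stdint.h>

    /* Hypothetical 32-bit software multiply: classic shift-and-add.
     * Conceptually this needs only operations HAPRA has natively:
     * add, and, 1-bit shifts and a zero test. */
    uint32_t __mulsi3(uint32_t a, uint32_t b) {
        uint32_t result = 0;
        while (b != 0) {
            if (b & 1)          /* lowest bit of b set?            */
                result += a;    /* accumulate shifted multiplicand */
            a <<= 1;            /* a = a * 2                       */
            b >>= 1;            /* move to the next bit of b       */
        }
        return result;
    }

Each loop iteration consumes one bit of b, so the routine runs in at most 32 iterations regardless of the operand values.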

2.1.3 HASE/Angora

The Hapra Assembler and Simulation Environment (HASE) and Angora are assembly language development environments for the HAPRA and FAPRA architectures. HASE is used for the HAPRA architecture, while Angora is used for the FAPRA architecture. Figure 1 shows the graphical user interface (GUI) of HASE.

Figure 1: The HASE GUI.

They offer an editor for HAPRA/FAPRA assembly language with syntax highlighting and an integrated assembler. There is also an integrated debugger which allows single stepping through the instructions of the debugged program, setting breakpoints at arbitrary locations in the program and inspection and modification of registers and memory. Programs can be debugged on an integrated software simulator with support for peripheral hardware like a VGA display, serial console and timers, or through an in-circuit debugger.

2.2 Compiler pipeline

A compiler is a program which translates from a source language to a target language. Typically a compiler translates from a high-level language to assembly language, relieving the software developer of all the low-level details of the target machine. If the compiler is written in the same language as its source language, the compiler can compile itself and becomes a self-hosting compiler. In the development of a compiler, the point where the compiler becomes self-hosting is an important milestone, because it is a significant test of the capabilities of the compiler. A compiler is typically composed of different components which are organized in a pipeline fashion. Figure 2 shows the compiler pipeline.

[Figure: C/C++, Ada and FORTRAN source code is processed by language-specific frontends into a common intermediate representation; the middle-end produces an optimized intermediate representation; x86, PowerPC and ARM backends emit the corresponding machine code.]

Figure 2: The compiler pipeline.

The input of the compiler is a source program in the source language. The output of the compiler is the source program translated into the target language, e.g. assembly language. In principle the whole process could be done by one single component. However, from an engineering perspective this is inefficient, as a new compiler would need to be developed for every combination of source language and target language. To reduce this effort, the whole process is often split into three separate components which are organized in a pipeline fashion: the frontend, the middle-end and the backend. During compilation a source program is passed through the whole pipeline. Inside the pipeline various analysis and synthesis steps are performed, with analyses being done mostly in the frontend and middle-end parts and synthesis being done in the backend. Well-defined intermediate representations (IR) which are capable of carrying all the semantics of the source program are used to pass the transformed source program between the different pipeline stages. Depending on the actual stage involved, those intermediate representations vary in terms of underlying data structure and abstraction level. A compiler typically uses several different intermediate representations.

2.2.1 Frontend

The first part in the compiler pipeline is the frontend, which reads the source program and eventually produces a translation of the source program into an intermediate representation for consumption by the middle-end. The frontend is specific to the source language, and compilers usually have multiple frontends, one for each source language they support. The frontend is split into different phases:

Lexical analysis (tokenization)

In a first step, the individual characters of the source file are read and organized into a stream of tokens.

Syntactical analysis (parsing)

Once the source file is represented as a stream of tokens, the parser tries to construct a syntax tree. If the source file is not valid according to the grammar of the programming language, the parser will not be able to construct a valid syntax tree for the given source file and will report a syntax error.

AST construction

Since the initial syntax tree contains many nodes which are important for the parser but are not needed for later stages in the compilation process, the initial syntax tree is transformed into an abstract syntax tree (AST), which gets rid of the superfluous nodes, resulting in a much more compact representation.

Semantic analysis (type checking)

During semantic analysis, which is performed on the AST, the compiler checks whether the source program conforms to the rules of the programming language, e.g. whether a variable is only used after it has been defined. Typically, rules which are difficult to model with a context-free grammar are checked in this phase [Iro83].
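For example, a C frontend accepts the following hypothetical function syntactically but rejects it during semantic analysis, because no grammar rule is violated while a semantic rule is:

    int f(void) {
        return x + 1;   /* semantic error: 'x' has not been declared */
    }
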

Translation to intermediate representation

In the last phase of the frontend, the AST is translated to the intermediate representation of the middle-end.

2.2.2 Middle-end

The middle-end takes a program translated to its intermediate representation and produces a semantically equivalent version which is superior to the previous version with regard to a certain optimization criterion, e.g. a program which is faster, smaller or consumes less power. In a first step, the program is analyzed with certain control flow and data flow analyses. The goal of these analyses is to obtain safe approximations about the executions of the program at compile time and to use these results to perform optimizing transformations. The middle-end does not take into account the target architecture for which code is generated; all optimizations in the middle-end are target-independent.

2.2.3 Backend

The backend is responsible for translating a program in the intermediate representation of the middle-end into semantically equivalent machine code for the target architecture. Backends are by definition target-dependent; however, compilers usually try to share as much common code between different backends as possible. The key tasks of a backend are:

Instruction selection

Instruction selection is the task of mapping an instruction of the intermediate representation to one or more native instructions. It is much more important for CISC architectures than for RISC architectures, because CISC architectures tend to have complex addressing modes, while RISC architectures usually only have a few simple addressing modes. Often an approach is used where the intermediate representation is tree-like and a pattern matcher is auto-generated from a machine description of the target architecture. This pattern matcher tries to cover the tree of IR instructions with as few native instructions as possible [AGT89]. Sometimes compilers also use directed acyclic graphs (DAGs) instead of trees; however, pattern matching on DAGs is much more difficult than pattern matching on trees. For architectures with very simple addressing modes, hand-written instruction selectors are also common.

Instruction scheduling

Instruction scheduling is the task of assigning an order of execution to the native instructions with some optimization goal, e.g. scheduling for latency minimization or throughput maximization. Scheduling is critical for the efficient use of hardware resources [BR91]. The scheduler is allowed to assign an arbitrary execution order to the instructions being scheduled, as long as all control and data dependencies are respected. The problem of optimal basic block scheduling is NP-hard [GJ+79].

Register allocation

During register allocation the decision is made whether a variable is kept in memory or in a register. In general, the register allocator tries to keep as many variables as possible in registers, because the access latency of registers is much smaller than the access latency of memory. However, since there are many more variables than registers in a typical program, keeping some variables in memory is usually unavoidable. In the context of register allocation, the act of storing a variable in memory instead of a register is called spilling.

A common approach is to keep all variables in virtual registers until register allocation. Register allocation maps the virtual registers to physical registers and inserts spill code whenever there is no physical register to map a virtual register to. For a variable which has been spilled to memory, every use of the variable needs a preceding load instruction, which loads the current value of the variable from the spill location in memory. Conversely, every definition of the variable needs a subsequent store instruction which stores the value of the variable to the spill location in memory. The problem of optimal register allocation is NP-complete; it was shown that register allocation is equivalent to the graph coloring problem [CAC+81].

More information about compiler construction can be found in [CT04], [LSUA07] and [Muc97].

2.3 LLVM Compiler Infrastructure

This section gives an introduction to the LLVM Compiler Infrastructure.

2.3.1 Overview

The LLVM compiler infrastructure [LA04][Lat02] is an open source effort to build a set of reusable components which can be used to build compilers for arbitrary programming languages. LLVM offers an aggressive optimizer which can perform both powerful intraprocedural and interprocedural analyses and transformations. It supports static and just-in-time compilation for a wide range of architectures (x86, PowerPC, ARM, Alpha, SPARC, MIPS, Cell SPU, System z and more). LLVM is a mature, production-quality compiler infrastructure used commercially by Apple, Cray, Intel, NVIDIA, AMD and others.

Figure 3 shows the LLVM architecture. LLVM follows the traditional separation of logical compilation phases into separate components (frontend, middle-end and backend). The source code of the application being compiled is passed to a language-specific frontend and translated into LLVM IR. Figure 3 shows the clang frontend, but frontends for other programming languages exist as well. The LLVM optimizer forms the middle-end and translates unoptimized LLVM IR as produced by a frontend into optimized LLVM IR. The optimized LLVM IR is passed to the last stage in the compiler pipeline, the LLVM code generator, which transforms it into native machine code for one of the target architectures supported by LLVM.

[Figure: C/C++ source code → Clang C/C++ frontend → LLVM IR → LLVM optimizer → optimized LLVM IR → LLVM code generator → machine code.]

Figure 3: The LLVM architecture.

2.3.2 C frontend

The clang subproject [cla] of LLVM aims to provide a frontend for the C, C++ and Objective-C programming languages. The combination of clang and LLVM offers a full C compiler.

2.3.3 Intermediate representations

LLVM IR

The LLVM intermediate representation is the core intermediate representation used within LLVM; it can be seen as the programming model of the Low Level Virtual Machine. It is the intermediate representation of the middle-end, thus all target-independent optimizations are performed on LLVM IR. LLVM IR is a RISC-like three-address code in static single assignment (SSA) form [CFR+91] with an infinite number of virtual registers. In contrast to machine code, LLVM IR carries explicit type information, which allows analyses and transformations that are not possible or much more difficult to do on machine code.

LLVM IR comes in one of the following three formats:

• Bitcode: a compact on-disk representation of LLVM IR, which is usually smaller than the equivalent machine code of the target architecture.

• In-memory representation: a set of data structures which represent LLVM IR in terms of C++ classes; this is the preferred representation for programmatic analysis and transformation of LLVM IR.

• Assembly language: a textual representation of LLVM IR. LLVM assembly language is easier for a human to read and modify than bitcode, so it is the primary representation used by LLVM developers.

LLVM IR is usually produced by the respective source language frontend. More information about LLVM IR can be found in [LA].

SelectionDAG IR

The SelectionDAG intermediate representation is one of the intermediate representations used in the backend of LLVM. In a first step, the LLVM IR received from the middle-end is expanded to SelectionDAG IR; the translation is done at the basic block level. The initial SelectionDAG usually contains both illegal types and operations. Illegal means the type or operation is not supported natively by the target architecture. For example, an architecture without a native multiplication instruction can declare the multiplication operation to be illegal. The backend writer decides how illegal operations are to be handled. An illegal operation can be promoted, expanded or custom lowered. Promotion means that a native instruction for a larger type will be used; expansion means that the operation will be split into native operations for a smaller type or will be turned into a library call. Custom lowering gives the backend writer full control over the handling of illegal operations, e.g. it allows inserting a custom sequence of operations. Between the different phases the DAG combiner is run, which tries to optimize the DAG.

Instruction selection is done through pattern matching on the DAG. The instruction selector tries to cover the DAG with as few native instructions

as possible. The machine description of the target architecture is used to generate the pattern matcher automatically. After instruction selection the DAG is legal and contains only native instructions. Now the scheduler assigns an order to the operations and the DAG is converted to machine instructions.

2.3.4 Target-independent code generator

[Figure: the code generation pipeline from LLVM IR to machine code: instruction selection, scheduling, SSA-based machine code optimizations, register allocation, prologue/epilogue code insertion, late machine code optimizations and code emission; each stage is marked as target-specific or target-independent.]

Figure 4: The target-independent code generator architecture.

The target-independent code generator is used by the different target-specific backends. It contains the parts of the backend which are generic enough to be shared among the different target-specific backends. A domain-specific language called TableGen is used to create a machine description of the target machine. The machine description encompasses the instruction set including the encoding of instructions, the register classes, the calling conventions, etc. C++ code is automatically generated from this machine description, e.g. the pattern matcher for instruction selection. The parts of the backend which are difficult to express with TableGen are implemented

with custom C++ code. Backends usually generate assembly code, but direct machine code emission is also possible, which is important for just-in-time compilation.

Figure 4 shows the structure of the target-independent code generator. Target-specific parts are indicated by a light gray background, while target-independent parts are shown with a white background. Instruction selection is performed by a pattern matcher which is automatically generated from the target-specific instruction descriptions in the machine description of the target architecture. Scheduling of target instructions is done by target-independent schedulers; LLVM offers various instruction schedulers and also allows the backend writer to provide a custom target-specific instruction scheduler if necessary. After instruction scheduling, SSA-based machine code optimizations are performed. In the next stage, register allocation, virtual registers are replaced with physical registers; different target-independent register allocators are available for selection. The prologue/epilogue code insertion stage, which is target-specific, inserts the code which deals with the call stack on function entry and exit. Optimizations which need to be performed on machine code that is very close to the final machine code being generated, e.g. spill code scheduling, are carried out in the late machine code optimization stage. The last stage is target-dependent and emits the generated code, either by creating machine code directly or by producing assembly code which is processed by an assembler afterwards. More information about the target-independent code generator can be found in [LWPL].

3 Implementation

This chapter describes the implementation of the HAPRA/FAPRA toolchain in detail. The following components had to be implemented, modified or ported:

• ABI
• HAPRA/FAPRA backend
• C frontend
• C standard library
• Assembler
• Simulator
• Linker

The HAPRA/FAPRA backend is presented in great detail because it is the central part of the HAPRA/FAPRA toolchain and required the biggest implementation effort of all toolchain components.

3.1 Application Binary Interface

An application binary interface (ABI) defines certain aspects of the generated machine code, e.g. the sizes and alignment of primitive types, register usage, calling conventions, the layout of stack frames on the call stack and more. This thesis defines an ABI for the HAPRA/FAPRA architecture for the C programming language. Since the binaries created by the compiler run directly on the hardware, without an operating system (OS) in between, the compiler generates statically linked absolute code which uses a flat 32-bit address space.

Table 3 shows the registers as defined by the ABI. Volatile registers are caller-saved, while non-volatile registers are callee-saved. The current stack pointer (SP) is always stored in R1. R0 serves as the link register (LR); it contains the return address in the calling function, to which the called function branches on exit. R3-R15 are used for parameter passing and return values. R16-R29 are used for local variables and are preserved across function calls. R30 contains the constant value zero, since zero is a frequently used value and keeping it in a reserved register reduces both execution time and code size. R31 is used for holding the target address of indirect branches which replace relative branches exceeding the maximum branch distance.

R0         LR, dedicated, volatile across function calls
R1         SP, dedicated
R2         Scratch register, volatile
R3 - R15   Registers for parameter passing/return values, volatile
R16 - R29  Non-volatile registers
R30        Contains zero constant, dedicated
R31        Reserved register for long jumps, dedicated

Table 3: Registers in the HAPRA/FAPRA ABI.

[Figure: stack frame layout, from high to low addresses: back chain, register argument save area, general register save area, local variable space, parameter list area, LR save word, back chain (the SP points to this word).]

Figure 5: The HAPRA/FAPRA ABI stack frame layout.

Figure 5 shows the layout of a stack frame according to the HAPRA/FAPRA ABI. The call stack grows downwards and the SP always points to the back chain word of the stack frame of the currently active function. The back chain word always points to the starting address of the previous stack frame and is set by the function prologue when the stack frame for the function is established. Since the LR is volatile across function calls, it needs to be saved in the LR save word of the stack frame before other functions are called in the currently active function. The parameter list area is used to pass arguments to functions in memory in case the argument registers are exhausted because there are more arguments than argument registers. The local variable space can be used for data which is created on function entry and removed on function exit, e.g. local variables. The general register save area is used for saving and restoring non-volatile registers which are used by a function; non-volatile registers are saved in the function prologue and restored in the function epilogue. The register argument save area is used by functions with a variable argument list: on entry to such a function, all argument registers are stored in the register argument save area, because subsequent function calls can destroy the contents of the argument registers.

Tables 4 and 5 show the size and alignment of scalar data types in the C programming language for byte- and word-addressing respectively. The sizeof column shows the result of applying the sizeof operator to the individual data types. The HAPRA/FAPRA architecture uses little-endian byte ordering.

On startup, initialization code is run which sets up the initial call stack in memory, initializes the SP register accordingly and branches to the entry point of the program, which is the main() function for programs written in the C programming language.
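A C-level sketch of this startup logic follows; the symbols _start and __stack_top are illustrative assumptions, and on real hardware the SP setup happens in a few assembly instructions before any C code can run:

    extern int main(void);

    /* Hypothetical entry point: the real startup code initializes
     * R1 (SP) from a linker-provided symbol such as __stack_top
     * (e.g. via ldih/ldil) and then branches to main(). */
    void _start(void) {
        /* SP setup happens in assembly before this point. */
        main();
        for (;;)
            ;   /* no operating system to return to */
    }
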

Type            ANSI C                       sizeof   Size (bits)   Alignment (bits)
Character       char                         1        8             8
                unsigned char, signed char   1        8             8
Integer         short, signed short          2        16            16
                unsigned short               2        16            16
                int, signed int, long int,   4        32            32
                signed long, enum
                unsigned int, unsigned long  4        32            32
Pointer         any-type *                   4        32            32
                any-type (*) ()
Floating Point  float                        4        32            32
                double                       8        64            64

Table 4: Size and alignment of scalar data types on the FAPRA architecture with byte-addressing.
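To make the two layouts concrete, consider the following hypothetical struct; the comments contrast Table 4 (byte-addressing) with Table 5 (word-addressing, shown next):

    /* Hypothetical struct contrasting the two data layouts. */
    struct pixel {
        char tag;    /* Table 4: 8 bits;  Table 5: one 32-bit word */
        int  value;  /* 32 bits in both variants                   */
    };
    /* Byte-addressing (Table 4): value sits at byte offset 4
     * (3 bytes of padding after tag), so sizeof(struct pixel) == 8.
     * Word-addressing (Table 5): every scalar occupies one word,
     * so value sits at offset 1 and sizeof(struct pixel) == 2. */
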

Type            ANSI C                       sizeof   Size (bits)   Alignment (bits)
Character       char                         1        32            32
                unsigned char, signed char   1        32            32
Integer         short, signed short          1        32            32
                unsigned short               1        32            32
                int, signed int, long int,   1        32            32
                signed long, enum
                unsigned int, unsigned long  1        32            32
Pointer         any-type *                   1        32            32
                any-type (*) ()
Floating Point  float                        1        32            32
                double                       2        64            64

Table 5: Size and alignment of scalar data types on the FAPRA architecture with word-addressing and on the HAPRA architecture.

3.2 HAPRA/FAPRA backend

This section discusses the implementation of the HAPRA/FAPRA backend.

3.2.1 Overview

The backend is the component of the compiler which translates the LLVM IR generated by the frontend to assembly language. LLVM offers a target-independent code generator, a framework which can be parameterized for a specific target architecture. A major part of the backend is a machine description of the target architecture, written in a domain-specific language. More information about the development of LLVM backends can be found in [WB]. The following parts need to be provided in order to develop a backend for a new target architecture:

Target machine

The target machine forms the central interface between the target-independent code generator and the target-specific parts of the backend being developed. It offers accessor methods for the target-specific components of the backend and allows the set of code generation passes to be modified, e.g. by adding a target-specific pass. The HAPRA/FAPRA backend adds a custom branch selection pass.

Data layout

The data layout of the target machine encodes important ABI information, namely the required and preferred alignment of LLVM IR data types and additional information concerning the data organization in memory, e.g. the endianness, the size of pointers and the alignment of stack objects on the target architecture. In the HAPRA/FAPRA backend the data layout contains the respective information from the HAPRA/FAPRA ABI.
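In an LLVM backend this information is expressed as a compact layout string. A plausible string for the byte-addressed FAPRA variant, matching Table 4, could look as follows (an illustrative assumption, not quoted from the backend source):

    e-p:32:32:32-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64

Here e declares little-endian byte ordering, p:32:32:32 declares 32-bit pointers with 32-bit alignment, and the remaining entries give size:ABI-alignment:preferred-alignment for each integer and floating-point type.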

Frame information

The frame information provides information about the layout of stack frames on the target machine. It stores the direction of stack growth, the alignment of the stack pointer, the offset of the locals area (the area where function data, e.g. local variables, can be stored) and, if required, the mapping of spill slots for callee-saved registers to their respective offsets on the call stack. In the HAPRA/FAPRA backend the frame information encodes the HAPRA/FAPRA ABI stack frame layout.

Register information

The register information allows the target-independent code generator to determine the properties of the individual physical registers, e.g. whether a specific physical register is reserved or whether it belongs to the callee-saved registers. Additionally, the target-specific function prologue/epilogue emission logic, which is responsible for setting up and tearing down stack frames on the call stack, is contained in the register information.

Register set

The register set describes the physical registers available on the target machine. Registers are grouped into register classes, and registers within a register class share the same properties. Multiple register files are usually modeled with distinct register classes, e.g. if a target architecture has a register file for integer values and a separate register file for floating-point values, then two different register classes are used. Since the HAPRA/FAPRA architecture only has a general purpose 32-bit register file, the register set is rather easy to model.

Instruction set

The instruction set of the target machine contains descriptions of all native instructions of the target machine and their operands. Operands cover the different addressing modes supported by the target machine, e.g. there are different operands for immediate addressing and memory addressing. Operands in registers refer directly to the physical registers defined in the register set. An instruction description carries the following attributes:

• Input and output operands, specifying the number and types of operands.

• A DAG which matches the native instruction. The pattern matcher will replace the given DAG with the native instruction during instruction selection.

• A list of physical registers implicitly defined and used by the instruction.

• An optional list of predicates which control whether the DAG pattern matches. Those predicates can be used to enforce additional constraints.

• Flags which describe the high-level semantics of the instruction, e.g. whether the instruction reads or writes memory or affects control flow in a certain way.

• An assembly language string for the instruction.

Instruction selection

Operations or types are illegal if they are not supported natively by the target architecture. For example, an architecture without a native multiplication instruction can declare the multiplication operation to be illegal. The backend specifies which types and operations are legal for the target and how illegal types and operations are handled. An illegal operation can be promoted, expanded or custom lowered. Promotion means that a native instruction for a larger type will be used; expansion means that the operation will be split into native operations for a smaller type or will be turned into a library call. Custom lowering gives the backend writer full control over the handling of illegal operations, e.g. it allows inserting a custom sequence of operations. The HAPRA/FAPRA backend handles 16-bit stores by decomposing them into two natively supported 8-bit stores. Similarly, 16-bit loads are handled by

performing two natively supported 8-bit loads and concatenating the loaded values accordingly. Select and conditional branch operations are handled by emitting a semantically equivalent sequence of native instructions. The same is done for logical shift right operations and the xor operation, as neither is supported natively. Instruction selection is performed through pattern matching on the DAG. The instruction selector tries to cover the DAG with as few native instructions as possible. The machine description is used to generate the pattern matcher automatically from the instruction descriptions.
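What these expansions compute can be sketched in C; this is a simplified model of the emitted instruction sequences, not the backend code itself (little-endian byte order per the ABI):

    #include <stdint.h>

    /* 16-bit store decomposed into two native 8-bit stores. */
    void store16(uint8_t *p, uint16_t v) {
        p[0] = (uint8_t)v;          /* low byte  */
        p[1] = (uint8_t)(v >> 8);   /* high byte */
    }

    /* 16-bit load: two native 8-bit loads plus concatenation. */
    uint16_t load16(const uint8_t *p) {
        return (uint16_t)(p[0] | (p[1] << 8));
    }

    /* xor synthesized from the native and/or/not instructions,
     * using the identity a ^ b == (a | b) & ~(a & b). */
    uint32_t xor32(uint32_t a, uint32_t b) {
        return (a | b) & ~(a & b);
    }
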

Register allocation

Multiple different register allocation algorithms are implemented within the LLVM compiler infrastructure. The implementations are target-independent and use the physical registers as defined by the register set. Spilling of registers to memory is handled by target-specific code which emits an appropriate instruction sequence for storing/loading a register to/from the respective spill location in memory.

Assembly printer

The assembly printer is a pass which emits assembly code for consumption by an external assembler. The assembly printer knows about the directives supported by the assembler for the target architecture. Parts of the assembly printer are auto-generated and use the assembly language string of the instruction descriptions.

Branch selector

The branch selector is a custom pass operating on machine instructions which identifies relative branches that exceed the maximum branch distance. It replaces those branch instructions with indirect branches, which do not have this limitation but are more expensive; the branch target address is loaded into the reserved register R31. With byte-addressing, the displacement of relative branches is limited to a distance of 8192 instructions forwards or backwards. In the pass, the basic blocks of the function are visited in topological order. Since a relative branch instruction is replaced with a sequence of instructions, the size of the function increases after every patched branch, which might push another relative branch beyond the maximum branch

distance. Thus, relative branches are checked and patched if necessary in an iterative process until a fixed point is reached.
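A C sketch of this fixed-point iteration follows; it is a simplified model with illustrative constants, not the actual LLVM pass, which operates on machine basic blocks:

    #include <stddef.h>

    /* Replacing an out-of-range relative branch with an indirect
     * branch (load target address into R31, then jmp) grows the
     * function, which can push other branches out of range, so
     * the pass iterates until a fixed point is reached. */
    enum { MAX_DIST = 8192, GROWTH = 2 /* assumed extra instructions */ };

    struct branch { long src, dst; int indirect; };

    void select_branches(struct branch *br, size_t n) {
        int changed;
        do {
            changed = 0;
            for (size_t i = 0; i < n; i++) {
                long dist = br[i].dst - br[i].src;
                if (!br[i].indirect && (dist > MAX_DIST || dist < -MAX_DIST)) {
                    br[i].indirect = 1;   /* patch to indirect branch */
                    changed = 1;
                    /* the replacement sequence shifts all later code */
                    for (size_t j = 0; j < n; j++) {
                        if (br[j].src > br[i].src) br[j].src += GROWTH;
                        if (br[j].dst > br[i].src) br[j].dst += GROWTH;
                    }
                }
            }
        } while (changed);
    }
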

3.2.2 Machine description

An important part of implementing a backend for a new target architecture is the creation of a machine description of the respective target. In order to simplify the creation of new backends, the goal is to describe as much as possible of the target machine with a custom domain-specific language. The domain-specific language offers language constructs which are better suited for this purpose than a regular general-purpose programming language and simplifies the often repetitive task of describing a target architecture.

Part of LLVM is a tool called TableGen which allows the developer to specify records of domain-specific information. The TableGen language has a strict syntax and a simple type system, but does not define the semantics of the description. The semantics of a TableGen record are defined by the different TableGen backends. The LLVM code generator is a major user of TableGen and supplies various different TableGen backends, e.g. for instruction descriptions or register descriptions. Those TableGen backends of the code generator create C++ code automatically from the TableGen records.

TableGen records come in two shapes: definitions and classes. Definitions are the concrete versions of records and are denoted by the keyword def. Classes are abstract records which are used to create other records (keyword class). To illustrate the syntax of TableGen, Listings 1 and 2 show two excerpts from the machine description of the HAPRA/FAPRA backend. Listing 1 shows how a specific instruction format is modeled with TableGen.

    class RRForm<bits<6> opcode, dag outs, dag ins, string asmstr,
                 list<dag> pattern> : FAPRAInst {
      bits<5> RD;
      bits<5> RA;
      bits<5> RB;

      let Pattern = pattern;

      let Inst{31-26} = opcode;
      let Inst{25-21} = RD;
      let Inst{20-16} = RA;
      let Inst{15-11} = RB;
    }

Listing 1: Instruction format description in TableGen syntax.

For every instruction format a respective class is created which inherits from the FAPRAInst base class. The class RRForm is shown, which defines the encoding of instructions with two source operands in registers (RA, RB) and one destination register (RD). In addition, the opcode of the instruction is specified. Since there are 32 general purpose registers, a register can be encoded in a 5-bit field. The opcode is specified in a 6-bit field.

Listing 2 shows the definition of the native add instruction. Concrete definitions of native instructions instantiate the class which defines their instruction format. The native add instruction uses RRForm and its definition is shown. The add instruction has an opcode of zero, takes the RA and RB registers as source operands and puts the result into the destination operand, which is register RD. Additionally, a pattern is provided for the SelectionDAG add operation, on which the native add instruction matches.

    def ADD : RRForm<0b000000,
                     (outs GPRC:$rD), (ins GPRC:$rA, GPRC:$rB),
                     "add $rD, $rA, $rB",
                     [(set GPRC:$rD, (add GPRC:$rA, GPRC:$rB))]>;

Listing 2: Instruction description in TableGen syntax.

More information about TableGen can be found in [Lat].

3.3 C frontend

Since the C programming language is highly target-dependent, the LLVM IR emitted by clang needs to be target-dependent as well. This was achieved by implementing support for the HAPRA/FAPRA ABI in clang, making sure proper LLVM IR is emitted. Other modifications to the frontend were not necessary.

3.4 C standard library

Newlib [new] was chosen as the C standard library of the HAPRA/FAPRA toolchain. It is a lightweight C standard library designed for embedded systems, particularly for embedded systems without an operating system. Newlib requires a small target-specific part, e.g. a function which sets up the runtime environment, an implementation of the setjmp/longjmp functions and implementations of a few other target-specific functions.

Figure 6 shows the newlib architecture. Applications written in the C programming language use the interface of the C standard library, which is implemented by newlib. Newlib itself is written in portable C code

and encapsulates the target-specific parts in a library called libgloss, which is part of newlib. Libgloss exposes a system call interface to newlib. It needs to be adapted to the target architecture, where it usually interfaces directly with the hardware. E.g. libgloss for the HAPRA/FAPRA architecture implements the write system call by writing to the serial console, which is accessible via memory-mapped I/O.
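A minimal sketch of such a write implementation follows; the MMIO address and register name are illustrative assumptions, and the actual libgloss port may differ:

    #include <stddef.h>

    /* Hypothetical memory-mapped transmit register of the serial
     * console; the address is assumed for illustration. */
    #define UART_TX (*(volatile unsigned char *)0xFF000000u)

    /* Sketch of a libgloss-style write(): each byte is pushed to
     * the serial console via memory-mapped I/O. */
    int write(int fd, const void *buf, size_t len) {
        const unsigned char *p = buf;
        (void)fd;   /* the console is the only output device */
        for (size_t i = 0; i < len; i++)
            UART_TX = p[i];
        return (int)len;
    }
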

[Figure: layering, from top to bottom: C Application, newlib, libgloss, Hardware.]

Figure 6: The newlib architecture.

3.5 Assembler

Since the HAPRA/FAPRA backend generates assembly code, an assembler is needed in order to obtain executable machine code. An assembler was developed with the LLVM machine code toolkit, using the machine description to generate large parts of the assembler automatically. This is a key advantage when adding new instructions to the HAPRA/FAPRA architecture, because they only need to be added in one place: the machine description.

3.6 Simulator

During the development of the compiler no real hardware was used; instead, the software simulator of Angora was used to run and debug binaries created by the compiler. While Angora is very useful for debugging miscompiled binaries since it offers an excellent debugging environment, it is less suited

for running automated regression tests due to the low performance of the built-in simulator. Significantly speeding up the Java-based interpreting simulator of Angora was not possible due to the lack of control over the machine code generated by the just-in-time compiler of the Java VM; e.g. it was not possible to force the Java VM to inline small methods in the critical loop of the interpreter.

A new interpreter for HAPRA/FAPRA machine code was therefore developed in C. The new C-based simulator offers superior performance over the simulator of Angora but is still limited by the fact that it is an interpreter, which needs to perform the whole fetch, decode and execute cycle for every machine instruction. The next logical step is to use binary translation from FAPRA/HAPRA machine code to x86-64 machine code instead of an interpreter. Another simulator was therefore developed which performs a combination of dynamic and static binary translation. This simulator is built using the libcpu framework [lib]. The libcpu framework is a toolkit for building arbitrary processor simulators which employ binary translation. Libcpu is based on the LLVM compiler infrastructure and supports a wide range of architectures (ARM, MIPS, PowerPC and more).
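The core of the interpreting simulator described above is a fetch/decode/execute loop. A minimal sketch using the instruction encoding from Listing 1 follows; the register file and memory handling are simplified assumptions:

    #include <stdint.h>

    #define OP(insn) ((insn) >> 26)           /* 6-bit opcode (bits 31-26) */
    #define RD(insn) (((insn) >> 21) & 31u)   /* destination (bits 25-21)  */
    #define RA(insn) (((insn) >> 16) & 31u)   /* source A (bits 20-16)     */
    #define RB(insn) (((insn) >> 11) & 31u)   /* source B (bits 15-11)     */

    enum { OP_ADD = 0x00, OP_SUB = 0x01 };    /* opcodes from Table 1 */

    uint32_t regs[32];

    /* One step of the byte-addressed FAPRA variant:
     * fetch, decode, execute. */
    uint32_t step(const uint32_t *mem, uint32_t pc) {
        uint32_t insn = mem[pc / 4];          /* fetch  */
        switch (OP(insn)) {                   /* decode */
        case OP_ADD:                          /* execute */
            regs[RD(insn)] = regs[RA(insn)] + regs[RB(insn)];
            break;
        case OP_SUB:
            regs[RD(insn)] = regs[RA(insn)] - regs[RB(insn)];
            break;
        /* ... the remaining instructions of Table 1 ... */
        }
        return pc + 4;                        /* next instruction */
    }

Paying this dispatch cost on every instruction is exactly what makes interpretation slow and what binary translation avoids.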

[Figure: FAPRA, ARM and MIPS machine code enters architecture-specific frontends which emit LLVM IR; the LLVM optimizer and code generator then produce x86-64 or PowerPC machine code.]

Figure 7: The libcpu architecture.

Figure 7 shows the architecture of libcpu. Frontends specific to the source architecture of the simulator translate machine code to LLVM IR. Then the LLVM IR is optimized by the LLVM optimizer. In a last step, the LLVM code generator is used to generate machine code for the target architecture,

based on the optimized LLVM IR. To add support for a new architecture to the libcpu framework, a frontend needs to be developed which translates from the source architecture to LLVM IR. libcpu then uses the LLVM just-in-time compiler to generate machine code for the target architecture from the LLVM IR produced by the frontend. A FAPRA frontend for libcpu was developed during this thesis. Since the code produced by the HAPRA/FAPRA C compiler is statically linked absolute code, it is rather easy to determine a large part of the source machine code before actual execution of the binary, making static binary translation feasible.

3.7 Linker

Since LLVM can store LLVM IR on disk in a compact file format (bitcode), it is possible to link object files containing LLVM IR with the LLVM linker and create assembly code from the resulting (linked) bitcode file. This approach is useful in the first stages of development, because it is possible to link different compilation units without having a native linker for the target architecture. However, this approach has a major drawback: the benefits of separate compilation are lost. Even for minor changes in one of the bitcode files, e.g. changing a single line of code in the C source file, the whole program including all library dependencies needs to be translated from LLVM IR to assembly language again. Another drawback is that the build system of the application needs to be modified to use the LLVM linker instead of the native linker, which can be very time-consuming if the application has a complex build system. And since the LLVM linker can only link bitcode files, it is not possible to link a binary against native code. On the other hand, linking bitcode has the big advantage of making link-time optimizations possible. Code generation from optimized bitcode is very efficient, e.g. compiling the whole of newlib takes about ten seconds on a modern machine, so the lack of separate compilation is negligible from a performance standpoint.

4 Results

This chapter presents results obtained by using the C compiler which was developed during this thesis. First, an example which illustrates the translation process from C source code to FAPRA assembly language is given. Then the results of the measurements on the HAPRA and FAPRA architectures are presented. In the last section of this chapter, the experience gained during the development of the C compiler is summarized briefly.

4.1 Example

This section presents an example program which is compiled for the FAPRA architecture. The intermediate steps of the translation are shown to illustrate the compilation process. First, the translation from C source code to LLVM IR is shown; in a second step, the translation from LLVM IR to FAPRA assembly code is shown.

Listing 3 shows the C source code of the example program. It is the inner loop of a program which computes the Mandelbrot set. The inner loop is nested into two other loops which traverse all the pixels of the generated image; those loops have been omitted from the listing. The variables fp_x and fp_y are the loop counters of the outer loops and contain the X and Y coordinate of the current pixel respectively. Complex numbers are represented by coordinates in the generated image. The loop starts at point (0, 0) and iterates by squaring the current value and adding the coordinate of the current pixel to the result. The iteration stops either when a certain threshold has been reached or when the maximum number of iterations has been exceeded. The number of iterations performed determines the color of the current pixel. As fixed-point numbers with an 8-bit fractional part are used, the results of multiplications are truncated accordingly.

    int sx = 0, sy = 0, x = 0, y = 0;
    for (iter = 0; (sx + sy) < fp_thresh && iter < 31; iter++) {
        sx = (x * x) >> 8;
        sy = (y * y) >> 8;
        y = ((y * x) >> 7) + fp_y;
        x = sx - sy + fp_x;
    }

Listing 3: C source code of Mandelbrot example.

Listing 4 shows the optimized LLVM IR generated from the C source code. The LLVM IR instructions are annotated with the respective C statements they are generated from. Except for the loop condition, only a few LLVM IR

43 instructions are generated for every C statement, resulting in a listing which is about two to three times longer than the original C source code listing. for.body15: ; preds = %for.body15, %bb.nph %y.049 = phi i32 [ 0, %bb.nph ], [ %add27, %for.body15 ] %x.048 = phi i32 [ 0, %bb.nph ], [ %add31, %for.body15 ] %0= phi i32 [ 0, %bb.nph ], [ %1, %for.body15 ]

// sx = ( x x ) >> 8 ; %mul = mul∗ i32 %x.048, %x.048 %shr = ashr i32 %mul, 8

// sy = ( y y ) >> 8 ; %mul20 = mul∗ i32 %y.049, %y.049 %shr21 = ashr i32 %mul20, 8

// i t e r++ %1= add nsw i32 %0, 1

// y = ( ( y x ) >> 7) %mul24 = mul∗ i32 %y.049, %x.048 %shr25 = ashr i32 %mul24, 7

// ( sx+sy ) < f p thresh && iter < 3 1 ; %add = add nsw i32 %shr21, %shr %cmp12 = icmp slt i32 %add, 1024 %cmp14 = icmp slt i32 %1, 31 %or.cond = and i1 %cmp12, %cmp14

// x = sx sy + f p x (actually (sx + fp x ) sy ) %tmp65 = add− i32 %shr, %tmp − %add31 = sub i32 %tmp65, %shr21

// y = y + f p y ; %add27 = add i32 %shr25 , %tmp70 br i1 %or.cond, label %for.body15, label %for.end Listing 4: LLVM IR of Mandelbrot example.

Listing 5 shows the result of translating the optimized LLVM IR into FAPRA assembly code. Like in the previous listing, the machine instructions are annotated with their respective LLVM IR instructions. Almost all LLVM IR instructions are directly mapped to a single machine code instruction, yielding very efficient code. A key difference to the previous listing is that virtual register references are now replaced with references to physical registers. The generated machine code is difficult to optimize further and comparable to optimized hand-written assembly code.

// %mul20 = mul i32 %y.049, %y.049
mul $6, $4, $4

// %mul = mul i32 %x.048, %x.048
mul $7, $5, $5

// %shr21 = ashr i32 %mul20, 8
sari $6, $6, 8

// %shr = ashr i32 %mul, 8
sari $7, $7, 8

// %1 = add nsw i32 %0, 1
addi $9, $3, 1

// %add = add nsw i32 %shr21, %shr
add $8, $6, $7

// %cmp12 = icmp slt i32 %add, 1024
tge $8, $28, $8
bz LBB1_6, $8

// %cmp14 = icmp slt i32 %1, 31
tge $3, $27, $3
bz LBB1_6, $3

// %mul24 = mul i32 %y.049, %x.048
mul $3, $4, $5

// %shr25 = ashr i32 %mul24, 7
sari $3, $3, 7

// %tmp65 = add i32 %shr, %tmp
add $4, $7, $21

// %add31 = sub i32 %tmp65, %shr21
sub $5, $4, $6

// %add27 = add i32 %shr25, %tmp70
add $4, $3, $22
and $3, $9, $9
bra LBB1_3

Listing 5: FAPRA assembly code of Mandelbrot example.

4.2 Measurements

The HAPRA/FAPRA C compiler was used to perform various measurements in order to compare the HAPRA and FAPRA architectures in terms of their effects on execution time and code size. Two synthetic benchmark suites were chosen for the measurements: the Stanford benchmarks [HN89] and the micro-benchmarks from the Computer Language Shootout [Sho] (see tables 6 and 7). Note that these measurements compare programs for different architectures which were all produced by a single compiler, not binaries produced by different compilers.

To verify that the compiler creates correct binaries for the source programs, the programs are additionally compiled with GCC for the x86-64 architecture and executed natively, and the results of these native runs are compared with the results of the simulated runs.

Two metrics are measured: the number of executed instructions and the code size of the generated binary. All programs are compiled with link-time optimizations turned on and are executed on a simulator which gathers profiling information such as the number of executed instructions. The measurements were made on an x86-64 Linux system (Intel Core 2 Quad Q9450 CPU, 8 GB of RAM) running Fedora 12; the simulator was compiled with GCC 4.4.2 with optimizations turned on (-O2 option).

Name        LOC  Description
Bubblesort  171  An array sorted using the bubblesort algorithm.
IntMM       159  Two 2-D integer matrices multiplied together.
Oscar       323  A floating-point Fast Fourier Transform program.
Perm        169  A tightly recursive permutation program.
Puzzle      225  A compute-bound program.
Queens      188  The eight queens chess problem solved 50 times.
Quicksort   174  An array sorted using the quicksort algorithm.
RealMM      161  Two 2-D floating-point matrices multiplied together.
Towers      218  The canonical Towers of Hanoi problem.
Treesort    187  An array sorted using the treesort algorithm.

Table 6: The Stanford benchmark suite.

Name        LOC  Description
ackermann    23  Ackermann's Function
ary3         45  Array Access
fib2         27  Fibonacci Numbers
hash        219  Hash (Associative Array) Access
heapsort     81  Heapsort Algorithm
hello        12  Hello World
lists       231  List Operations
matrix       71  Matrix Multiplication
methcall     94  Method Calls
nestedloop   31  Nested Loops
objinst     105  Object Instantiation
random       36  Random Number Generator
sieve        39  Sieve of Eratosthenes
strcat       36  String Concatenation

Table 7: The Computer Language Shootout benchmark suite.

4.2.1 Instruction count

Figure 8 shows the speedup of a FAPRA program versus a HAPRA program, assuming an instruction takes the same amount of time to execute on a FAPRA and a HAPRA processor. The speedup of FAPRA without the TGE instruction vs. HAPRA varies between a factor of 7.63 (Towers) and 24.16 (nestedloop). The speedup of FAPRA with the TGE instruction vs. HAPRA is in the range of 8.11 (Perm) to 24.29 (Bubblesort). The general observation is that the TGE instruction helps to improve performance, particularly for Bubblesort, Quicksort and Treesort, since sorting algorithms perform many comparisons. Puzzle also makes heavy use of comparison instructions and benefits greatly from the native comparison instruction. Benchmarks which show no significant speedup with the addition of the native comparison instruction either perform few comparison operations or perform unsigned comparisons, which do not benefit from the TGE instruction since it performs a signed comparison.

Figure 8: Comparing the number of executed instructions between HAPRA and FAPRA (speedup vs. HAPRA per benchmark, for FAPRA with and without the TGE instruction).

Table 8 shows the instruction distribution on HAPRA and on the word-addressing variant of FAPRA when executing a hello world program. Examining the instruction distribution gives insight into the sources of the execution overhead on the HAPRA architecture. The results clearly show that significantly more arithmetic shift right instructions are executed on HAPRA. This is because the HAPRA architecture only has a 1-bit arithmetic shift right instruction; on HAPRA, arithmetic right shifts by more than one bit are performed by a library function. The major part of the increase in the number of executed call instructions on HAPRA comes from calls to this library function. The increased number of executed subtraction instructions on HAPRA is due to the testing of the loop condition in the arithmetic shift right library function: as there are no native comparison instructions on HAPRA, a semantically equivalent sequence of native instructions is used, in this case a subtraction instruction which tests whether two values are equal. The loop of the library function itself drastically increases the number of executed jump instructions on HAPRA. Load immediate instructions also show an increase on HAPRA due to the lack of instructions with immediate addressing modes. In summary, the major execution overhead for the hello world program on HAPRA comes from the lack of a native arithmetic shift right instruction which can shift by a variable number of bits.
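To make the source of this overhead concrete, the following minimal C sketch shows what such a variable-shift library routine might look like; the routine is an illustrative assumption, not the actual code from the runtime library, and it assumes that >> on int is arithmetic, as it is on the target:

/* Variable arithmetic shift right built from HAPRA's 1-bit shift.
 * Each iteration costs one shift (SAR), one subtraction (SUB, whose
 * result also serves as the loop test, since HAPRA has no native
 * comparison instructions) and the loop's jump instructions (JZ/JMP),
 * matching the high SAR, SUB, JZ and JMP counts in table 8. */
int ashr_var(int value, int amount)
{
    while (amount != 0) {
        value >>= 1;   /* the 1-bit SAR instruction */
        amount -= 1;   /* SUB, result tested against zero */
    }
    return value;
}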

(a) HAPRA          (b) FAPRA
LD      206        LDW    131
ST      234        STW    168
LDIH   1564        LDIH   107
LDIL   1551        LDIL   107
JMP    5083        JMP      9
JZ     5178        BRA     40
CALL    170        BZ      24
ADD     529        BNZ    122
SUB    5115        CALL     9
AND     358        BL       0
OR      241        ADDI   167
NOT     272        ADD     76
SAL      19        SUB    240
SAR    4852        AND    266
Total 25371        OR     211
                   NOT    121
                   SARI   145
                   SAL      1
                   SAR      0
                   MUL      0
                   TGE     14
                   TSE      0
                   Total 1957

Table 8: Instruction distribution when executing hello.

4.2.2 Code size

Figure 9 shows the code size of various benchmarks compiled for HAPRA and FAPRA. Sizes vary between 76 KB (hello) and 148 KB (sieve) on FAPRA and between 180 KB (hello) and 304 KB (sieve) on HAPRA; the smallest and largest binaries thus belong to the same benchmarks on both architectures. The measured HAPRA binaries are between a factor of two and three larger than their FAPRA counterparts. The addition of a native comparison instruction on FAPRA has only a minor impact on code size. The variation in binary size across the benchmarks is rather low, as the binary size is dominated by code which is linked in from the C standard library and which is similar for most of the benchmarks.

Figure 9: Comparing the binary size of benchmarks compiled for HAPRA and FAPRA (size in KB per benchmark: HAPRA, FAPRA, and FAPRA with TGE).

4.3 Simulator

Tables 9 and 10 show the results of executing the benchmarks of the respective benchmark suites both on the interpreting simulator written in the C programming language and on the libcpu-based simulator, which employs static and dynamic binary translation. The speedup in tables 9 and 10 refers to the execution time only; it does not include the static translation time which the libcpu-based simulator requires on the first run. The binaries were compiled for the FAPRA target with byte-addressing and with link-time optimizations turned on.

The results show that a significant execution speedup can be obtained through binary translation, with a speedup of at least an order of magnitude on almost all benchmarks. The speedup obtained by the libcpu-based simulator is in the range of 3.45 (methcall) to 41.33 (ary3) across the executed benchmarks.

The libcpu-based simulator tries to statically translate as much code as possible. Since the benchmarks are relatively simple programs, most of the static translation time is spent on translating the library functions which are linked in to perform I/O. Since the static translation time is dominated by the library functions, the time taken for static translation is rather similar for all the benchmarks, except for methcall and objinst, which perform only very limited I/O and are thus linked against fewer library functions, leading to a lower static translation time.
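The magnitude of the speedup is plausible given the per-instruction overhead of interpretation. The following sketch (with a hypothetical instruction encoding and opcode set, not the actual simulator source) shows the fetch/decode/dispatch work an interpreter repeats for every executed instruction; a binary translator pays this cost only once per translated block and afterwards runs native host code:

#include <stdint.h>

enum { OP_ADD, OP_SUB, OP_BRA /* ... */ };

struct cpu {
    uint32_t pc;         /* word-addressed program counter */
    uint32_t regs[32];
    uint32_t *mem;
};

void interp_step(struct cpu *c)
{
    uint32_t insn = c->mem[c->pc];     /* fetch */
    uint32_t op = insn >> 26;          /* decode (hypothetical encoding) */
    uint32_t rd = (insn >> 21) & 31;
    uint32_t ra = (insn >> 16) & 31;
    uint32_t rb = (insn >> 11) & 31;

    switch (op) {                      /* dispatch */
    case OP_ADD: c->regs[rd] = c->regs[ra] + c->regs[rb]; c->pc++; break;
    case OP_SUB: c->regs[rd] = c->regs[ra] - c->regs[rb]; c->pc++; break;
    case OP_BRA: c->pc = insn & 0x3FFFFFF; break;
    default:     c->pc++; break;       /* remaining opcodes omitted */
    }
}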

Name        Static translation  Execution time  Execution time       Speedup
            time (s), libcpu    (ms), libcpu    (ms), C interpreter
Bubblesort  21                  112             1853                 16.54
IntMM       22                  12              169                  14.08
Oscar       33                  1834            9062                 4.94
Perm        25                  161             2134                 13.25
Puzzle      23                  537             11398                21.22
Queens      22                  97              1484                 15.29
Quicksort   24                  142             2072                 14.59
RealMM      22                  760             12140                15.97
Towers      26                  145             2143                 14.77

Table 9: Executing the Stanford benchmark suite on simulators.

Name        Static translation  Execution time  Execution time       Speedup
            time (s), libcpu    (ms), libcpu    (ms), C interpreter
ackermann   22                  30              445                  14.83
ary3        21                  545             22528                41.33
fib2        21                  579             10029                17.32
lists       23                  1799            37487                20.83
matrix      23                  1536            25162                16.38
methcall    9                   8217            28376                3.45
nestedloop  22                  65              925                  14.23
objinst     3                   1455            42160                28.97
random      29                  8536            286611               33.57
sieve       25                  2672            49838                18.65

Table 10: Executing the Computer Language Shootout benchmark suite on simulators.

4.4 Experience

During the development of the compiler many bugs were found and fixed in Angora. Since only rather small assembly language programs had been assembled and executed with Angora in the past, this was not surprising.

When link-time optimizations are used, it happens rather often that the branch displacement of relative branches is too small. This is because the optimizer performs a lot of inlining, which increases the code size and thus the distance between branches and their targets. Even for a simple hello world binary, several relative branches need to be replaced with indirect branches because their branch targets are too far away.

Since there is no native logical shift right instruction in the HAPRA/FAPRA instruction set, the logical shift right operation is illegal and is therefore custom lowered to an equivalent instruction sequence (see the sketch at the end of this section). However, the fact that the logical shift right operation is illegal triggered several bugs in the target-independent code generator, since the DAG combiner was (incorrectly) introducing logical shift right operations even though they are not legal.

Implementing support for word-addressing required significant effort because both clang and LLVM assumed that a character unit is 8 bits wide, which does not hold with word-addressing, where a character unit is 32 bits wide. Unfortunately, many different places in the source code were affected, touching basically the whole codebase (frontend, optimizer, backend). In total, more than a hundred different places needed to be adjusted accordingly.
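A minimal sketch of such a lowering in C terms follows; it is an illustration under the assumption that arithmetic shift right, shift left and the bitwise operations are available natively and that >> on int32_t is arithmetic, and the helper name is hypothetical:

#include <stdint.h>

/* Logical shift right synthesized from natively available operations;
 * assumes 0 < n < 32. */
uint32_t lower_lshr(uint32_t x, unsigned n)
{
    /* The arithmetic shift copies the sign bit into the upper n bits... */
    uint32_t shifted = (uint32_t)((int32_t)x >> n);
    /* ...so clear those bits again with a mask built from a left shift. */
    uint32_t mask = ~(0xFFFFFFFFu << (32 - n));
    return shifted & mask;
}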

4.5 Remarks

Interestingly, adding an ANDI and an ORI instruction had only a minor impact on execution time and code size, indicating that those instructions can be omitted.

It turned out that the TSE instruction is actually redundant and can be removed, since it is equal to TGE with switched operands (see the sketch at the end of this section). Still, the measurements clearly show the positive effect of a native comparison instruction.

Adding an unsigned comparison instruction would be useful: the measurements clearly show the positive impact of the (signed) TGE instruction, and a similar effect can be expected for an unsigned counterpart.

It was observed that keeping the constant value 0 in a fixed register greatly reduces the number of LDIH/LDIL instructions; the HAPRA/FAPRA ABI reserves a register for this purpose.

It would make sense to increase the branch displacement of relative branches, since there is still unused space left in the instruction word. This change is especially recommended as, in practice, the size of the branch displacement is rather often insufficient.

Given the low implementation effort and area requirements of a native logical shift right instruction and an XOR instruction, it would make sense to add those two instructions. The large overhead on HAPRA is mostly due to single-bit shifts and the lack of a native multiplication instruction.
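The TSE/TGE redundancy can be stated in two lines of C, assuming (as the names suggest) that TGE computes "greater than or equal" and TSE "smaller than or equal":

/* a <= b holds exactly when b >= a, so TSE is TGE with swapped operands. */
int tge(int a, int b) { return a >= b; }
int tse(int a, int b) { return tge(b, a); }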

5 Conclusion & Future Work

In this thesis, a C toolchain for the HAPRA/FAPRA architecture was developed, comprising an assembler and a C compiler; in addition, a C standard library was ported. The developed C compiler supports the HAPRA target, the word-oriented FAPRA target and the byte-oriented FAPRA target. As shown, the C compiler generates machine code of excellent quality. Additionally, an extremely fast simulator employing binary translation was developed.

With this toolchain it is now easy to compare and improve architectures: new instructions can be added with little effort, and measurements can be performed with a very fast simulator, resulting in short turnaround times.

Since current implementations of the HAPRA/FAPRA architecture do not have an MMU, it is not possible to port the Linux kernel, which requires an MMU in order to implement memory protection and virtual memory. However, a modified version of Linux called µClinux exists which can run on processors without an MMU.

Thus the next logical step would be to port µClinux to the HAPRA/FAPRA architecture. Even though it lacks features like memory protection and virtual memory, advanced features like dynamic linking are still possible. In general, applications which run on the Linux kernel can be ported to the µClinux kernel with minor effort. The µClinux distribution already contains a large set of ported software, about 280 different applications in total, including a lightweight C standard library called µClibc. Porting µClinux consists of two major parts. First, getting µClinux to compile with the C compiler; it is expected that this will expose some remaining bugs in the compiler, and since the kernel uses several GCC extensions, some of them may not be implemented in clang, requiring workarounds. Second, implementing the target-specific parts of µClinux for the HAPRA/FAPRA architecture; this includes aspects like memory layout, interrupt handling, system calls, DMA and others. In addition, drivers for the peripherals need to be developed.

In order to implement support for shared libraries in µClinux, a native linker would be beneficial.

A Appendix

List of abbreviations

ABI    Application Binary Interface
ASIC   Application-Specific Integrated Circuit
AST    Abstract Syntax Tree
CISC   Complex Instruction Set Computer
DAG    Directed Acyclic Graph
DMA    Direct Memory Access
FPGA   Field-Programmable Gate Array
GCC    GNU Compiler Collection
GUI    Graphical User Interface
HASE   Hapra Assembler and Simulation Environment
IR     Intermediate Representation
ISA    Instruction Set Architecture
JIT    Just-In-Time Compilation
LLVM   Low Level Virtual Machine
LR     Link Register
MMU    Memory Management Unit
OS     Operating System
PC     Program Counter
RISC   Reduced Instruction Set Computer
SIMD   Single Instruction, Multiple Data
SP     Stack Pointer
SSA    Static Single Assignment Form
VM     Virtual Machine

References

[AGT89] A.V. Aho, M. Ganapathi, and S.W.K. Tjiang. Code generation using tree matching and dynamic programming. ACM Transactions on Programming Languages and Systems (TOPLAS), 11(4):491–516, 1989.

[BR91] D. Bernstein and M. Rodeh. Global instruction scheduling for superscalar machines. ACM SIGPLAN Notices, 26(6):255, 1991.

[CAC+81] G.J. Chaitin, M.A. Auslander, A.K. Chandra, J. Cocke, M.E. Hopkins, and P.W. Markstein. Register allocation via coloring. Computer Languages, 6(1):47–57, 1981.

[CFR+91] R. Cytron, J. Ferrante, B.K. Rosen, M.N. Wegman, and F.K. Zadeck. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. ACM Transactions on Programming Languages and Systems (TOPLAS), 13(4):451–490, 1991.

[cla] clang. Accessed April 2010. http://clang.llvm.org.

[CT04] K.D. Cooper and L. Torczon. Engineering a Compiler. Elsevier, 2004.

[GJ+79] M.R. Garey, D.S. Johnson, et al. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, San Francisco, 1979.

[HN89] J. Hennessy and P. Nye. Stanford small benchmark suite, 1989.

[Iro83] E.T. Irons. A syntax directed compiler for ALGOL 60. Communications of the ACM, 26(1):14–16, 1983.

[LA] C. Lattner and V. Adve. LLVM Language Reference Manual. Accessed April 2010. http://llvm.org/docs/LangRef.html.

[LA04] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, page 75. IEEE Computer Society, 2004.

[Lat] C. Lattner. TableGen Fundamentals. Accessed April 2010. http://llvm.org/docs/TableGenFundamentals.html.

[Lat02] C.A. Lattner. LLVM: An infrastructure for multi-stage optimization. Master's thesis, Computer Science Dept., University of Illinois at Urbana-Champaign, Urbana, IL, December 2002.

[lib] libcpu. Accessed April 2010. http://www.libcpu.org/.

[LSUA07] M.S. Lam, R. Sethi, J.D. Ullman, and A. Aho. Compilers: Principles, Techniques and Tools. Addison-Wesley, 2007.

[LWPL] C. Lattner, B. Wendling, F.M.Q. Pereira, and J. Laskey. The LLVM Target-Independent Code Generator. Accessed April 2010. http://llvm.org/docs/CodeGenerator.html.

[Muc97] S.S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.

[new] newlib. Accessed April 2010. http://www.sourceware.org/newlib.

[Sho] Computer Language Shootout. Accessed April 2010. http://dada.perl.it/shootout/.

[uCl] uClinux. Accessed April 2010. http://www.uclinux.org.

[WB] M. Woo and M. Brukman. Writing an LLVM Compiler Backend. Accessed April 2010. http://llvm.org/docs/WritingAnLLVMBackend.html.

Declaration

All the work contained within this thesis, except where otherwise acknowledged, was solely the effort of the author. At no stage was any collaboration entered into with any other party.

(Tilmann Scheller)