Compiler construction 2015
x86: assembly for a real machine

Lecture 6
- x86 architecture
- Calling conventions
- Some x86 instructions
- From LLVM to assembler
- Instruction selection

High-level view of x86
- Not a stack machine; no direct correspondence to operand stacks.
- Arithmetic etc. is done with values in registers.
- Much more limited support for function calls; you need to handle return addresses, jumps, allocation of stack frames etc. yourself.
- Your code is assembled and run; no further optimization.
- CISC architecture with few registers. Straightforward code will run slowly.

x86 assembler, a first example

Javalette (or C):

    > more ex1.c
    int f (int x, int y) {
        int z = x + y;
        return z;
    }

This might be compiled to the NASM assembly code below.

Example explained

NASM code, commented:

    segment .text          ; code area
    global f               ; f has external scope
    f:                     ; entry point for f
        push dword ebp     ; save caller's fp
        mov ebp, esp       ; set our fp
        sub esp, 4         ; allocate space for z
        mov eax, [ebp+12]  ; move y to eax
        add eax, [ebp+8]   ; add x to eax
        mov [ebp-4], eax   ; move eax to z
        mov eax, [ebp-4]   ; return value to eax
        leave              ; restore caller's fp/sp
        ret                ; pop return addr, jump

Intel x86 architectures

Long history:
- 8086, 1978. First IBM PCs. 16-bit registers, real mode.
- 80286, 1982. AT, Windows. Protected mode.
- 80386, 1985. 32-bit registers, virtual memory.
- 80486, Pentium, Pentium II, III, IV, 1989 – 2003. Math coprocessor, pipelining, caches, SSE ...
- Intel Core 2, 2006. Multi-core.
- Core i3/i5/i7, 2009/10.

Backwards compatibility is important, leading to a large set of opcodes.

Which version should you target?

When speaking of the x86 architecture, one generally means the register/instruction set of the 80386 (with floating-point ops). You can compile code which would run on a 386, or you may use SSE2 operations for a more recent version.

Not only Intel offers x86 processors; AMD is also in the market.

x86 registers

General purpose registers (32-bit):
- EAX, EBX, ECX, EDX, EBP, ESP, ESI, EDI.
- Conventional use: EBP and ESP for frame pointer and stack pointer.

Segment registers:
- Legacy from the old segmented addressing architecture. Can be ignored in Javalette.

Floating-point registers:
- Eight 80-bit registers ST0 – ST7, organised as a stack.

Flag registers:
- Status registers with bits for results of comparisons, etc. We discuss these later.

Data area for parameters and local variables

Runtime stack:
- A contiguous memory area, growing from high addresses downwards.
- EBP contains the current base pointer (= frame pointer); ESP contains the current stack pointer.

AR layout illustrated:

    High address
      Caller's base pointer   \
      Local vars              /  caller's AR
      Parameters Pn ... P1
      Return address
      Caller's base pointer   <- EBP  \
      Local vars                      /  callee's AR
                              <- ESP
    (stack growth: downwards)

Note: We need to store the return address (the address of the instruction to jump to on return).

Calling convention

Caller, before call:
- Push params (in reverse order).
- Push return address.
- Jump to callee entry.

Code pattern:

    push dword paramn
    ...
    push dword param1
    call f

Callee, on entry:
- Push caller's base pointer.
- Update current base pointer.
- Allocate space for locals.

Code pattern:

    push dword ebp
    mov ebp, esp
    sub esp, localbytes

Callee, on exit:
- Restore base and stack pointer.
- Pop return address and jump.

Code pattern:

    leave
    ret

Caller, after call:
- Pop parameters.

Code pattern:

    add esp, parambytes

Parameters, local variables and return values

Parameters:
- In the callee code, integer parameter 1 has address ebp+8, parameter 2 ebp+12, etc. Double parameters require 8 bytes.
- Parameter values are accessed with indirect addressing: [ebp+8], etc.
- Here ebp+n means "(address stored in ebp) + n".

Local variables:
- The first local var is at address ebp-4, etc.
- Local vars are conventionally addressed relative to ebp, not esp.
- Again, refer to vars by indirect addressing: [ebp-4], etc.

Return values:
- Integer and boolean values are returned in eax, doubles in st0.
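To make the patterns concrete, here is a sketch of a complete call of the function f from the first example, following this convention (the literal arguments are ours, chosen only for illustration):

    push dword 4      ; second argument (y), pushed first
    push dword 3      ; first argument (x)
    call f            ; pushes return address and jumps to f
    add esp, 8        ; caller pops the two 4-byte parameters
                      ; the result f(3,4) is now in eax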

Register usage

Scratch registers (caller save):
- EAX, ECX and EDX must be saved by the caller before a call, if needed; they can be freely used by the callee.

Callee save registers:
- EBX, ESI, EDI, EBP, ESP.
- For EBP and ESP, this is handled in the code patterns.

Note: What we have described is one common calling convention for 32-bit x86, called cdecl. Other conventions exist, but we omit them.

Assemblers for x86

Several alternatives:
- Several assemblers for x86 exist, with different syntax.
- We will use NASM, the Netwide Assembler, which is available for several platforms.
- We also recommend Paul Carter's book and examples. Follow the link from the course web site.

Some syntax differences from the GNU assembler:
- GNU uses %eax etc. as register names.
- For two-argument instructions, the operands have the opposite order(!).
- Different syntax for indirect addressing.
- If you use gcc -S ex.c, you will get GNU syntax.
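For instance, one and the same instruction in the two syntaxes (our own illustration):

    NASM:  mov eax, [ebp+8]     ; load parameter 1 into eax
    GNU:   movl 8(%ebp), %eax   ; same instruction: % on register names,
                                ; reversed operands, 8(%ebp) for [ebp+8]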

Example: GNU syntax

First example, revisited:

    > gcc -c ex1.c
    > objdump -d ex1.o

    ex1.o:     file format elf32-i386

    Disassembly of section .text:

    00000000 <f>:
       0: 55         push %ebp
       1: 89 e5      mov  %esp,%ebp
       3: 8b 45 0c   mov  0xc(%ebp),%eax
       6: 03 45 08   add  0x8(%ebp),%eax
       9: c9         leave
       a: c3         ret

Integer arithmetic; two-address code

Addition, subtraction and multiplication:

    add dest, src    ; dest := dest + src
    sub dest, src    ; dest := dest - src
    imul dest, src   ; dest := dest * src

Operands can be values in registers or in memory; src can also be a literal.

Division – one-address code:

    idiv denom       ; (eax,edx) := ((edx:eax)/denom, (edx:eax)%denom)

- The numerator is the 64-bit value edx:eax (no other choices).
- Both div and mod are performed; the results end up in eax and edx, respectively.
- edx must be initialized before division; for signed division, cdq sign-extends eax into edx:eax.
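As a sketch of how division is set up in practice (our own example: numerator and divisor taken from the first two parameters; ecx chosen because it is a scratch register):

    mov eax, [ebp+8]    ; numerator into eax
    cdq                 ; sign-extend eax into edx:eax
    mov ecx, [ebp+12]   ; divisor must be in a register or in memory
    idiv ecx            ; quotient in eax, remainder in edx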

Example

Javalette program:

    int main () {
        printString "Input a number: ";
        int n = readInt();
        printInt (2*n);
        return 0;
    }

The above code could be translated as follows (slightly optimized to fit on a slide).

Code for main:

    push dword ebp
    mov ebp, esp
    push str1
    call printString
    add esp, 4
    call readInt
    imul eax, 2
    push eax
    call printInt
    add esp, 4
    mov eax, 0
    leave
    ret

Example, continued

Complete file:

    extern printString, printInt
    extern readInt
    segment .data
    str1 db "Input a number: "
    segment .text
    global main
    main:
        ; code from the previous slide

Comments:
- IO functions are external; we come back to that.
- The .data segment contains constants such as str1.
- The .text segment contains code.
- The global declaration gives main external scope (it can be called from code outside this file).

Floating-point arithmetic in x86

Moving numbers (selection):
- fld src: pushes the value in src on the fp stack.
- fild src: pushes the integer value in src on the fp stack.
- fstp dest: stores the top of the fp stack in dest and pops.
- src and dest can be an fp register or a memory reference.

Arithmetic (selection):
- fadd src: src is added to ST0.
- fadd to dest: ST0 is added to dest.
- faddp dest: ST0 is added to dest, then the fp stack is popped.
- Similar variants exist for fsub, fmul and fdiv.

Floating-point arithmetic in SSE2

New registers:
- 128-bit registers XMM0–XMM7 (later also XMM8–XMM15).
- Each can hold two double-precision floats or four single-precision floats.
- SIMD operations for arithmetic.

Arithmetic instructions:
- Two-address code: ADDSD, MULSD, etc.
- SSE2 fp code is similar to integer arithmetic.
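A minimal sketch of a function body computing x + y for two double parameters (our own example; doubles take 8 bytes, so the parameters sit at ebp+8 and ebp+16):

    ; x87 version; the result ends up in st0, as cdecl requires
    fld qword [ebp+8]     ; push x on the fp stack
    fadd qword [ebp+16]   ; st0 := x + y

    ; SSE2 version of the same addition
    movsd xmm0, [ebp+8]   ; xmm0 := x
    addsd xmm0, [ebp+16]  ; xmm0 := x + y (under cdecl, a move to st0
                          ; would still be needed to return the value)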

Control flow

Integer comparisons:

    cmp v1, v2

v1 - v2 is computed and bits in the flag registers are set:
- ZF is set iff the value is zero.
- OF is set iff the result overflows.
- SF is set iff the result is negative.

Branch instructions (selection):
- JZ lab branches if ZF is set.
- JL lab branches if SF ≠ OF, i.e. if v1 < v2 as signed values.
- Similarly for the other relations between v1 and v2.
- fcomi st0, src compares st0 and src and sets flags; it can be followed by branching as above (note that fcomi sets CF and ZF, so the unsigned branches jb, ja, je are the ones to use).

One more example

Javalette (or C):

    int sum(int n) {
        int res = 0;
        int i = 0;
        while (i < n) {
            res = res + i;
            i++;
        }
        return res;
    }

Naive assembler:

    sum:  push dword ebp
          mov ebp, esp
          sub esp, 8
          mov dword [ebp-4], 0   ; res := 0
          mov dword [ebp-8], 0   ; i := 0
          jmp L2
    L3:   mov eax, [ebp-8]
          add [ebp-4], eax       ; res := res + i
          inc dword [ebp-8]      ; i++
    L2:   mov eax, [ebp-8]
          cmp eax, [ebp+8]       ; i < n ?
          jl L3
          mov eax, [ebp-4]       ; return res
          leave
          ret

(NASM needs the dword size specifiers above, since neither operand fixes the operand size.)
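A minimal sketch of a double comparison x < y (our own example; the label Ltrue is assumed, and the fp stack should be popped again afterwards):

    fld qword [ebp+16]   ; push y
    fld qword [ebp+8]    ; push x; now st0 = x, st1 = y
    fcomi st0, st1       ; compare x with y; sets CF iff x < y
    jb Ltrue             ; branch iff x < y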

How to do an x86 backend

Starting point, two alternatives:
- From LLVM code (requires your basic backend to generate LLVM code as a data structure, not directly as strings). Will generate many local vars.
- From ASTs generated by the frontend (means a lot of code in common with the LLVM backend).

Variables:
In either case, your code will contain a lot of variables/virtual registers. Possible approaches:
- Treat these as local vars, storing to and fetching from the stack at each access. Gives really slow code.
- Do (linear scan) register allocation. Much better code.

Input and output

A simple proposal:
- Define printInt, readInt etc. in C. Then link this file together with your object files using gcc.
- Alternative: compile runtime.ll with llvm-as and llc to get runtime.s; this can be given to gcc as below.

Linux building:
- To assemble a NASM file to file.o: nasm -f elf file.asm
- To link: gcc file.o runtime.c
- The result is an executable a.out.

More info:
Paul Carter's book (link on the course web site) gives more info.
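Spelled out, the alternative route via runtime.ll might look like this (a sketch; the file names are examples only):

    llvm-as runtime.ll            # gives runtime.bc
    llc runtime.bc -o runtime.s   # native assembly for the runtime
    nasm -f elf file.asm          # assemble your backend's output to file.o
    gcc file.o runtime.s          # link; the result is a.out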

From LLVM to assembler

More complications:
So far, we have ignored some important concerns in code generation:
- The instruction set in real-world processors typically offers many different ways to achieve the same effect. Thus, when translating an IR program to native code, we must do instruction selection, i.e. choose between the available alternatives.
- Often an instruction sequence contains independent parts that can be executed in arbitrary order. Different orders may take very different time; thus the code generator must do instruction scheduling.

Both these tasks are complex, and they interact with register allocation.

Native code generation, revisited

Several stages:
- Instruction selection.
- Instruction scheduling.
- SSA-based optimizations.
- Register allocation.
- Prolog/epilog code (AR management).
- Late optimizations.
- Code emission.

Target-independent generation:
Much of this is done in target-independent ways, using general algorithms operating on target descriptions. In LLVM, these tasks are done by the native code generator llc and by the JIT in lli.

Instruction selection

Further observations:
- Instruction selection for RISC machines is generally simpler than for CISC machines.
- The number of translation possibilities grows (combinatorially) as one considers larger chunks of IR code for translation.

Pattern matching:
Instruction selection can be seen as a pattern matching problem: the native instructions are seen as patterns; instruction selection is the problem of covering the IR code by patterns.

Two approaches:
- Tree pattern matching: think of the IR code as a tree.
- Peephole matching: think of the IR code as a sequence.

Tree pattern matching, an example

a[i] := x as tree IR code (from Appel):

    MOVE(MEM(+(MEM(+(FP, CONST a)), *(TEMP i, CONST 4))),
         MEM(+(FP, CONST x)))

- a and x are local vars, i is in a register.
- a is a pointer to the first array element.

Algorithm outline:
- Represent native instructions as patterns, or tree fragments.
- Tile the IR tree using these patterns, so that all nodes in the tree are covered.
- Output the sequence of instructions corresponding to the tiling.

Two variants:
- Greedy algorithm (top down).
- Dynamic programming (bottom up), based on cost estimates.

A simple instruction set

Notes: We consider only arithmetic and memory instructions (no jumps!). Assume a special register r0, which is always 0.

    ADD    ri ← rj + rk
    MUL    ri ← rj ∗ rk
    SUB    ri ← rj − rk
    DIV    ri ← rj / rk
    ADDI   ri ← rj + c
    SUBI   ri ← rj − c
    LOAD   ri ← M[rj + c]
    STORE  M[rj + c] ← ri
    MOVEM  M[rj] ← M[ri]

Identifying patterns (incomplete)

[Tile diagrams omitted: each instruction corresponds to one or more tree patterns over +, ∗, CONST, MEM and MOVE nodes; e.g. ADDI matches +(e, CONST c), +(CONST c, e) and a bare CONST c, while LOAD matches MEM nodes with or without an address computation.]

Example done in class.
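Sketch of the example done in class: one possible tiling of the a[i] := x tree above, using this instruction set (assuming i sits in register ri; along the lines of Appel's treatment):

    LOAD  r1 ← M[fp + a]    ; MEM(+(FP, CONST a)): fetch the pointer a
    ADDI  r2 ← r0 + 4       ; CONST 4 (r0 is always 0)
    MUL   r2 ← ri ∗ r2      ; i * 4
    ADD   r1 ← r1 + r2      ; address of a[i]
    ADDI  r2 ← fp + x       ; address of the local x
    MOVEM M[r1] ← M[r2]     ; move x into a[i]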

Peephole matching

Recall: code improvement by local simplification of the code within a small sliding window of instructions.

Can be used for instruction selection:
- Often there is one further intermediate language between IR and native code; peephole simplification is done for that language.

Retargetable compilers:
- The instruction selection part of the compiler is generated from a description of the target instruction set (code generator generators).

Instruction scheduling, background

Simple-minded, old-fashioned view of a processor:
Fetch an instruction, decode it, fetch operands, perform the operation, store the result. Then fetch the next instruction, ...

Modern processors:
- Several instructions are under execution concurrently.
- The memory system causes delays, with operations waiting for data.
- Similar problems for results from arithmetic operations, which may take several cycles.

Consequence:
It is important to understand data dependencies and to order instructions advantageously.

Instruction scheduling, example

Example (from Cooper): w = w * 2 * x * y * z
- A memory op takes 3 cycles, a mult 2 cycles, an add one cycle.
- One instruction can be issued each cycle, if its data is available.

Schedule 1:

    r1 <- M [fp + @w]
    r1 <- r1 + r1
    r2 <- M [fp + @x]
    r1 <- r1 * r2
    r2 <- M [fp + @y]
    r1 <- r1 * r2
    r2 <- M [fp + @z]
    r1 <- r1 * r2
    M [fp + @w] <- r1

Schedule 2:

    r1 <- M [fp + @w]
    r2 <- M [fp + @x]
    r3 <- M [fp + @y]
    r1 <- r1 + r1
    r1 <- r1 * r2
    r2 <- M [fp + @z]
    r1 <- r1 * r3
    r1 <- r1 * r2
    M [fp + @w] <- r1

Instruction scheduling

Comments:
- The problem is NP-complete for realistic architectures.
- A common technique is list scheduling: a greedy algorithm for scheduling a basic block.
- It builds a graph describing the data dependencies between instructions, and schedules instructions from a ready list of instructions with available operands.

Interaction:
Despite the interaction between selection, scheduling and register allocation, these are typically handled independently (and in this order).