<<


8 General purpose machines

Section 5 introduced the idea of a general purpose piece of hardware on which can be imposed a specific function. Suppose we want to build a machine to evaluate an expression, like a + (b × c). We could build an adder for + and a multiplier for ×, and wire them together in a special purpose circuit, or we could try to build a programmable circuit. One strategy would be to build an FPGA-like circuit with lots of switches which could be configured like the special purpose circuit, and which would perform the evaluation in one cycle. An alternative would be to build a programmable circuit which can perform one (or more generally some constant number) of the operations in a single clock cycle, and to make it sequence through several steps to evaluate the expression. This machine needs some state to hold partial results from one cycle to the next.

8.1 Expression evaluation

First of all, let us make precise what an expression is, and what it is to evaluate it. For a small example, take expressions with values, made up of named variables and integer numerals, combined by a small range of binary operations:

type Value = Int
type Name  = String

data Expr = Num Value | Var Name | Bin Expr Op Expr
data Op   = Add | Sub | Mul | Div

The values of type Expr are things like

Bin (Var "a") Mul (Bin (Num 2) Add (Var "b"))

but for convenience we will assume a parsing function

Main> parse "a*2+b"
Bin (Var "a") Mul (Bin (Num 2) Add (Var "b"))

(Notice that the parser does not try to get operator precedence right; you have to use parentheses.) An expression has a value once values have been assigned to all its free variables. Suppose there is a Store indexed by Names giving the value of variables:

type Store = [(String, Value)]

then a straightforward evaluator can assign a value to an expression by recursion on the structure of the expression


eval :: Store → Expr → Value
eval store (Num n)      = n
eval store (Var v)      = store `at` v
eval store (Bin l op r) = operate op (eval store l) (eval store r)

operate Add a b = a + b
operate Sub a b = a − b
operate Mul a b = a × b
operate Div a b = a `div` b

We will build sequential machines whose correctness can be measured against the results of eval.
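For example, with a made-up store binding a to 3 and b to 4, one would expect a session like

Main> eval [("a",3),("b",4)] (parse "a*2+b")
18

since the parser groups the expression as a × (2 + b).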

8.2 Stack evaluation

One such machine has a stack of partial results. The idea is that we write a function

compileS :: Expr → [Instr]

which translates expressions

data Expr = Num Value | Var String | Bin Expr Op Expr

into a sequence of instructions

data Instr = PushN Value | PushV String | Do Op
data Op    = Add | Sub | Mul | Div ...

for a small machine

exec :: Store → [Instr] → Value

This factorisation of eval into a compiler which flattens the structure of the expression, and a machine which executes the instructions, is the essence of some portable implementations of languages like Java or Scala. The compiler translates the source language into a bytecode, and a virtual machine interprets that bytecode.10 The compiler translates the expression tree into reverse Polish notation:11

10The first bytecode compiler was possibly Martin Richards' BCPL compiler in the 1960s, which produced intermediate code for the O-machine. The O-code might be interpreted, or it might be translated into the native code of particular hardware. The translation into native code might be all-at-once, or it might be performed just in time as each bytecode instruction is executed. 11(Forward) Polish notation is attributed to Jan Łukasiewicz (1878–1956) in the 1920s, as a way of writing logical formulae without parentheses by putting the operators before the operands. RPN emerged in the 1950s and 1960s as an implementation technique.


compileS (Num n)      = [PushN n]
compileS (Var v)      = [PushV v]
compileS (Bin l op r) = compileS l ++ compileS r ++ [Do op]

What makes us happy to call the exec function a 'machine'? It had better be finite state, and its next state had better be given by a function of its present state. Here is the next state function:

type State = (Store, [Value])

step :: State → Instr → State
step (store, stack) (PushN n)       = (store, n : stack)
step (store, stack) (PushV v)       = (store, store `at` v : stack)
step (store, b : a : stack) (Do op) = (store, operate op a b : stack)

Executing a sequence of instructions will push a value onto the top of the stack:

exec store prog = answer (foldl step (store, [ ]) prog)
  where answer (store, [v]) = v

Correctness of this machine, that is

eval store expr = exec store (compileS expr)

can be proved by using the hypothesis

foldl step (store, stack) (compileS expr) = (store, eval store expr : stack)

in a proof by induction on the structure of expr.
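As a quick sanity check, the two routes to a value can be compared directly. This is a sketch, not part of the notes' own code; it assumes the definitions of parse, eval, compileS and exec above:

checkS :: Store -> String -> Bool
checkS store src = eval store e == exec store (compileS e)
  where e = parse src

-- e.g.  checkS [("a",7),("b",5)] "a*2+b"  should be True

A property like this, tried on a few stores and expressions, is a useful companion to the inductive proof.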

Exercises

8.1 Redefine eval and compileS as instances of the natural fold

foldE :: (Value → a) → (String → a) → (a → Op → a → a) → Expr → a
foldE num var bin = f
  where f (Num n)      = num n
        f (Var v)      = var v
        f (Bin l op r) = bin (f l) op (f r)

for Expr.

8.2 Complete the proof that

eval store expr = exec store (compileS expr)

You may find it useful to use that

foldl f e (xs ++ ys) = foldl f (foldl f e xs) ys


8.3 Modify the stack expression compiler to reduce the number of stack locations needed by evaluating the arguments of an operation in the right order. In the case of non-commutative operators you will also have to add instructions to the machine: either a Swap instruction which swaps the two values at the top of the stack, or reversed versions of each of the instructions corresponding to a non-commutative operator. What are the advantages and disadvantages of each of these two solutions?

8.3 Register machines

The stack machine is not obviously finite state; however, it will only use a finite amount of stack for any given (finite) expression. Moreover the pattern of access to the stack is predetermined, so it might be helpful to treat it as a list of registers, and work out which registers are accessed by each operation. This style of machine would have instructions

data Instr = Load Reg Value | Move Reg Reg | Do Op Reg Reg Reg

to load a constant, to copy one register into another, and to operate on the contents of two registers leaving the answer in a third (not all of which need to be distinct). The state of the state machine will be a mapping from registers to values

type State = [(Reg, Value)]

and the step function is straightforward

step :: State → Instr → State
step regs (Load d n)    = update regs d n
step regs (Move d s)    = update regs d (regs `at` s)
step regs (Do op d s t) = update regs d (operate op (regs `at` s) (regs `at` t))

Rather than identifying elements of the store by name at run time, we will allocate registers to hold the values of the named variables. The mapping from names to addresses is usually called the environment

type Env = [(String, Reg)]

and exists only at compile time. The compiler allocates registers from a (finite) list of free registers in a way that corresponds to the locations on the stack from the previous compiler, and the result is left in the first of those free registers.


compileR :: Env → [Reg] → Expr → [Instr]
compileR env (r : free) (Num n) = [Load r n]
compileR env (r : free) (Var v) = [Move r (env `at` v)]
compileR env (r0 : r1 : free) (Bin l op r) =
    compileR env (r0 : r1 : free) l ++
    compileR env (r1 : free) r ++
    [Do op r0 r0 r1]

The machine has to be modified to look for the answer in the right register.

exec :: State → Reg → [Instr] → Value
exec regs r prog = answer (foldl step regs prog)
  where answer regs = regs `at` r

The initial state of the machine must be a mapping from registers to values which matches the original store:

regs `at` (env `at` v) = store `at` v

For example

Main> env
[("a",10),("b",11),("c",12),("d",13)]
Main> compileR env [0..9] (parse "(a+b)-(c*d)")
[Move 0 10,Move 1 11,Do Add 0 0 1,
 Move 1 12,Move 2 13,Do Mul 1 1 2,
 Do Sub 0 0 1]
Main> regs
[(10,2),(11,4),(12,6),(13,8)]
Main> exec regs 0 (compileR env [0..9] (parse "(a+b)-(c*d)"))
-42
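The same comparison with eval can be made for the register machine. Here is a sketch, not in the notes' own code; it assumes env, regs and store are related as required above, and that registers 0 to 9 are free and disjoint from those named in env:

checkR :: Env -> State -> Store -> String -> Bool
checkR env regs store src =
    eval store e == exec regs 0 (compileR env [0..9] e)
  where e = parse src

-- e.g.  checkR env regs [("a",2),("b",4),("c",6),("d",8)] "(a+b)-(c*d)"
--       should be True for the env and regs shown in the session above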

Code for the register machine can be improved by eliminating unnecessary moves from register to register, and the same sort of depth minimisation as for the stack machine. Here is code from a modified compiler, which minimises register use. This compiler returns (as well as the code) the identity of the register containing the answer, a list of unused registers, and the maximum number of registers used.

Main> compileR' env [0..5] (parse "(a+b)-(c*d)")
(0,[Do Add 0 10 11,Do Mul 1 12 13,Do Sub 0 0 1],[1,2,3,4,5],2)
Main> exec regs 0 [Do Add 0 10 11,Do Mul 1 12 13,Do Sub 0 0 1]
-42


The beauty of the register machine code is that it is easy to see how to construct a state machine to execute it. On each cycle the machine must read (up to) two registers, combine them by a program-determined operation, and write the result into a register. This circuit

[Figure: a register bank with two read ports (raddr0/rdata0 and raddr1/rdata1) and one write port (waddr/wdata, enabled by we under the control of write), whose two outputs feed an ALU controlled by func; the ALU output returns to wdata.]

is known as a data path.

Exercises

8.4 Modify the register expression compiler to reduce the number of instructions by eliminating unnecessary Move instructions. The compiler should return not only the code, but also the identity of the register containing the answer.

8.5 Modify the register expression compiler to reduce the number of registers used by evaluating the arguments in the right order. Do you have to modify the machine?

8.4 Components of the data path

The simple data path consists of a bank of addressable registers, and an arithmetic-logic unit. The register bank contains k registers, each n bits wide. That needs nk flip-flops, each with a write enable signal. Logically each of these behaves like a flip-flop and a multiplexer, although remember that the implementation is likely to be smaller and involve a gated clock signal. Each register consists of n of these with a common write enable signal.


[Figure: a single register bit: a flip-flop clocked by φ whose input d comes from a multiplexer which, under the control of we, selects either the flip-flop's own output q or the new data; n of these share one we signal.]

The instructions write (at most) one register each, so it is enough to be able to select one of the registers, and conditionally update that. This could be done by a decoder along the following lines: in a clock cycle when write is true, the register selected by the (log₂ k bit) waddr is updated with wdata.

[Figure: k registers sharing the wdata input; a decoder driven by waddr and write enables exactly one of them; two multiplexers driven by raddr0 and raddr1 select which registers are presented on rdata0 and rdata1.]

Each instruction also needs to be able to read two registers. Logically, this means two independent multiplexers, each capable of selecting the output of any one of the registers. The ALU is a combinational circuit whose output is one of a number of functions of its inputs, according to the instruction being executed.


[Figure: an ALU: the inputs a and b feed a number of function units (+, −, ×, `div`, ...), and a multiplexer controlled by func selects which unit's output is delivered.]

Logically this could consist of a number of components, each computing one of the f potentially different functions, and an n bit f-way multiplexer selecting the right output. In practice there would probably be a great deal of sharing: addition and subtraction done by the same circuit with a control input, all the bit-by-bit logical operations done by a circuit with some control inputs, and so on.

[Figure: the data path again: the register bank's two read ports feed the ALU, controlled by func, and the ALU output returns to the write port, enabled by write.]

Physically, the components we have seen here would probably not be apparent. One of the characteristics of the data path is that connections tend to be between corresponding bits of components. The ith bit of each register is connected to each of the output multiplexers, the ith bit of each multiplexer goes to the ith bit input of each component of the ALU, the ith bit of the output from each component goes to the ith bit of the ALU multiplexer, and the ith bit of the output from that goes to the ith bit write input of each register. Only carry signals and control signals run across that structure. The first bit-slice implementation of a design like this was probably EDSAC 2, built at the University of Cambridge Mathematical Laboratory in 1956–8. The data path was assembled from a number of identical bit slices, each of which could be hauled out independently from the racking.

8.5 Program control

The control signals for the data path might be generated by a counter and a read only memory containing the instructions of the program:

[Figure: a program counter PC, incremented by 1 on each cycle, addresses a ROM whose output word supplies the control signals raddr0, raddr1, waddr, write and func to the data path.]

at least for straight line code of the form we have seen so far. A machine like this can only execute programs in which each instruction is executed once in (textual) sequence. More generally we will want to be able to repeat pieces of the code (loops), and to choose whether or not to execute some instructions according to values in the data (conditionals). These involve instructions which can modify the program counter as well as the general purpose registers, and instructions which do so conditionally on the data passing through the ALU.


Code for the abstract expressions

module Expr where

import Data.Char
import Numeric

type Value = Int
type Name = String

data Expr = Num Value | Var Name | Bin Expr Op Expr
  deriving Show

foldE :: (Value -> a) -> (String -> a) -> (a -> Op -> a -> a) -> Expr -> a
foldE num var bin = f
  where f (Num n)      = num n
        f (Var v)      = var v
        f (Bin l op r) = bin (f l) op (f r)

data Op = Add | Sub | Mul | Div
  deriving Show

eval :: Store -> Expr -> Value
eval store (Num n)      = n
eval store (Var v)      = store `at` v
eval store (Bin l op r) = operate op (eval store l) (eval store r)

operate Add a b = a + b
operate Sub a b = a - b
operate Mul a b = a * b
operate Div a b = a `div` b

type Store = [(String,Value)]

at :: Eq a => [(a,b)] -> a -> b
store `at` v = head [ e | (n,e) <- store, n==v ]

update :: Eq a => [(a,b)] -> a -> b -> [(a,b)]
update regs r v = (r,v):regs

parse :: String -> Expr
parse xs = check [ e | (e,[]) <- (whitespace >> expr) xs ]
  where check []      = error "syntax error"
        check [x]     = x
        check (_:_:_) = error "ambiguous"

expr = (pMap Bin atom && rator && expr) || atom

atom = ( balanced || pMap Num readDec || pMap Var readName) << whitespace

balanced = lit '(' << whitespace >> expr << lit ')'

rator = oneOf opMap << whitespace
  where oneOf = foldr1 (||)
        opMap = [ pMap (const f) (lit c) | (c,f) <- table ]
        table = [('+',Add), ('-',Sub), ('*',Mul), ('/',Div)]

readName xs | null ns   = []
            | otherwise = [ (ns,ys) ]
  where (ns,ys) = span isAlpha xs

lit c []     = []
lit c (x:xs) | x == c    = [ (c, xs) ]
             | otherwise = []

whitespace xs = [ span isSpace xs ]

pMap f p xs = [ (f x, ys) | (x,ys) <- p xs ]

(p || q) xs = p xs ++ q xs
(p && q) xs = [ (f x, zs) | (f,ys) <- p xs, (x,zs) <- q ys ]
p >> q = pMap (\ x y -> y) p && q
p << q = pMap (\ x y -> x) p && q


Code for the abstract register machine

import Expr

type Reg = Int

data Instr = Load Reg Value | Move Reg Reg | Do Op Reg Reg Reg
  deriving Show

type Env = [(String,Reg)]

compileR :: Env -> [Reg] -> Expr -> [Instr]
compileR env (r:free) (Num n) = [Load r n]
compileR env (r:free) (Var v) = [Move r (env `at` v)]
compileR env (r0:r1:free) (Bin l op r) =
    compileR env (r0:r1:free) l ++
    compileR env (r1:free) r ++
    [ Do op r0 r0 r1 ]

type State = [(Reg,Value)]

exec :: State -> Reg -> [Instr] -> Value
exec regs r prog = answer (foldl step regs prog)
  where answer regs = regs `at` r

step :: State -> Instr -> State
step regs (Load d n)    = update regs d n
step regs (Move d s)    = update regs d (regs `at` s)
step regs (Do op d s t) = update regs d v
  where v = operate op (regs `at` s) (regs `at` t)


Code for the abstract stack machine

module Stack where

import Expr

data Instr = PushN Value | PushV String | Do Op
  deriving Show

compileS :: Expr -> [Instr]
compileS (Num n)      = [PushN n]
compileS (Var v)      = [PushV v]
compileS (Bin l op r) = compileS l ++ compileS r ++ [Do op]

exec :: Store -> [Instr] -> Value
exec store prog = answer (foldl step (store,[]) prog)
  where answer (store, [v]) = v

type State = (Store, [Value])

step :: State -> Instr -> State
step (store, stack) (PushN n)   = (store, n:stack)
step (store, stack) (PushV v)   = (store, store `at` v:stack)
step (store, b:a:stack) (Do op) = (store, operate op a b:stack)

[Photographs: EDSAC in 1949 on the left, EDSAC 2 on the right.]

9 MIPS

MIPS here is a simplified architecture based on a real processor also called MIPS, which originally stood for "Microprocessor without Interlocked Pipeline Stages". The real MIPS was a pioneering RISC (reduced instruction set computer) processor architecture designed by John Hennessy at Stanford about 1980. Sophisticated implementations of the architecture can be made very efficient, and indeed are made and licensed by a spin-out company and are used in many applications in the real world. However, our interest in it will be as a particularly simple (but real) design for which we can explain (and simulate) a straightforward implementation.

9.1 Instructions

Every instruction on MIPS is a single 32 bit word, so stored in memory at a word address: an address which is a multiple of four. For example the word

00000000011001000001000000100000₂, or 00641020₁₆, or 6557728₁₀

happens to be interpreted as

   op | rs | rt | rd | sh | func
    0 |  3 |  4 |  2 |  0 |  32
    6 |  5 |  5 |  5 |  5 |   6   bits

which is written in assembly language as

add $2, $3, $4

Its effect is $2 := $3 + $4 (and pc := pc + 4).12 There is nothing magic about the format. There are five bits in each of the rs, rt and rd fields because they need to specify one of 32 registers. Those bits are bits 25-21, 20-16, 15-11 because those bits of an internal instruction register are wired to the address lines of the register block. The op field is bits 31-26, again because wires from those bits go to the decoding logic; the 0 means rd := rs ⊗ rt because the decoding logic translates the 0 into control signals that set the data path to do that; and the func field makes the ALU do addition because there is (in essence) an entry in a lookup table that maps 32 to the control signals for addition.

12These notes use x := E for change the value of x to that of E (evaluated before changing x); also simultaneous assignments such as x, y := y, x (which exchanges the values of x and y). It should be clear why x := x + 1 is preferable to x = x + 1.
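To make the relationship between bit positions and fields concrete, here is a small Haskell sketch (not part of the notes' own code) that picks apart an R-format instruction word using the layout above:

import Data.Bits ((.&.), shiftR)
import Data.Word (Word32)

-- op 31:26, rs 25:21, rt 20:16, rd 15:11, sh 10:6, func 5:0
fields :: Word32 -> (Word32, Word32, Word32, Word32, Word32, Word32)
fields w = ( w `shiftR` 26 .&. 0x3f
           , w `shiftR` 21 .&. 0x1f
           , w `shiftR` 16 .&. 0x1f
           , w `shiftR` 11 .&. 0x1f
           , w `shiftR`  6 .&. 0x1f
           , w             .&. 0x3f )

-- fields 0x00641020 == (0,3,4,2,0,32), that is add $2, $3, $4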


All these details are unimportant contingent consequences of a particular processor design: the important thing is the relationship between the wiring of the hardware and the interpretation of words of instructions. Try not to be confused by the difference between the order of the registers in the assembly language and the order of the fields in the binary format. The three arguments of the assembly language instruction match the order of the variables and the arguments in the corresponding assignment statement. The order of the fields in the binary format matches the wiring of the hardware. There is no reason for there to be any relationship between the layout of the assembly and binary formats.

9.2 Executing instructions

If pc contains 100₁₀ and location 100₁₀ contains 00641020₁₆ and $2, $3, $4 contain 31₁₀, 41₁₀, 59₁₀, the state will change so that pc contains 104₁₀ and location 100₁₀ still contains 00641020₁₆ and $2, $3, $4 contain 100₁₀, 41₁₀, 59₁₀. Then the processor goes on to execute the instruction in location 104₁₀.

If a, b and c are contained in registers $2, $3, $4 then

mul $6, $3, $4
add $5, $2, $6

computes d = a + (b × c) and stores it in $5 (having destroyed the value in $6).

9.3 Compiling

In a language like C, a statement assigning d := a + (b × c) might be written

d=a+b*c

in terms of some named variables. The compiler needs to understand the structure of the language, and the architecture of the processor, well enough to assign registers to hold the values of variables, and to generate the right sequence of instructions. In this case,13 perhaps we get

mul $6, $3, $4
add $5, $2, $6

13Of course, if you compile d = a+b*c using a compiler for a different machine architecture, it will probably translate into different assembly instructions. On the workstation in my office, which has an x86 family processor, the compiler produces

imull %ecx, %edx
leal (%rdx,%rax), %ebx

which means 'signed multiply (32 bit) register cx by (32 bit) register dx, truncate to 32 bits and store in register dx'; then 'calculate the address (load effective address) of a memory location whose address is the sum (64 bit) of register dx and (64 bit) ax, truncating the address to 32 bits, and store it in register bx'.


[Figure: the toolchain: prog.c (containing d := a + b × c) is translated by the compiler into prog.s (containing mul $6, $3, $4 and add $5, $2, $6); the assembler turns prog.s into prog.o; the linker combines prog.o with library code from libc.a to produce prog; and the loader places the code in memory.]

The assembler translates symbolic instructions such as add $5, $2, $6 into a numeric form, such as 00462820₁₆. It will also calculate lengths of jumps to symbolic labels and so on, and deal with reifying other symbolic values in the instructions. Any real program will also need some standard routines from a library, for example input and output functions, mathematical functions. The linker brings these in and joins them to the references to them in the program. Finally the loader copies the code into memory, the pc is set to point at the first instruction, and. . . . The link/load distinction is often unclear: some functions that are logically part of loading happen at link time, linking may even be delayed until after loading so that library code is only linked when it is needed.
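Returning to the assembler's translation into numeric form, the packing of an R-format instruction is just shifting and or-ing the fields into place. A sketch, again not the notes' own code:

import Data.Bits (shiftL, (.|.))
import Data.Word (Word32)

-- pack op, rs, rt, rd, sh, func into a 32-bit word, most significant field first
encodeR :: Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> Word32
encodeR op rs rt rd sh func =
    foldl (\acc (width, field) -> acc `shiftL` width .|. field) 0
          [(0,op), (5,rs), (5,rt), (5,rd), (5,sh), (6,func)]

-- encodeR 0 2 6 5 0 32 == 0x00462820, the numeric form of add $5, $2, $6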

9.4 Data memory

Registers are fast, but few. We cannot afford enough address bits to have many registers (nor can we put enough of them close enough to the heart of the processor to make them fast enough). Most of the state of a program is in slower, bulkier, cheaper memory usually 'outside' the processor. In general it will be highly structured; we will come back to that later. On RISC processors like MIPS, instructions either 'do something' or move data between registers and memory, but not both. MIPS has two14 instructions for memory access: lw (load word) and sw (store word).

lw $2, 2004

moves (copies) a word from memory at address 2004 into register $2, and

sw $2, 2012

copies the contents of register $2 into the word of memory at address 2012. If a, b, c, d are at consecutive locations (words: four byte-addresses apart) starting at 2000, the assignment d := a + b × c might compile to

lw  $2, 2004      -- $2 := b
lw  $3, 2008      -- $3 := c
mul $2, $2, $3    -- $2 := b × c
lw  $3, 2000      -- $3 := a
add $2, $3, $2    -- $2 := a + b × c
sw  $2, 2012      -- d := a + b × c

Other sequences are possible! Clearly there is an advantage to being able to reuse loaded values (or to keep them out of memory altogether). The lw and sw instructions have a format unlike add and mul, to accommodate the address:

              op | rs | rt | imm
lw $2, 2004   35 |  0 |  2 | 2004    I-format
               6 |  5 |  5 | 16 bits

For now, rs is zero, rt is the register written (there is no rd field) and the address comes from the bottom sixteen bits of the instruction. As before, the details of the format are contingent on the design of the processor. More significant is that the op field occupies the same bits as before. Instruction decoding identifies the format of the instruction from the value of op.

9.5 A simple implementation of MIPS

This implementation is based on the unrealistic assumption that an instruction can complete in a single clock cycle. The clock has to be so slow that this would have very poor performance. For other approaches see Hennessy and Patterson, or next year's Architecture course. The point of RISC is that it is easy to make things faster: in fact the right way of making more complex architectures faster is to implement them on a RISC machine and make the RISC machine go faster.

14The real MIPS has at least seven, but we need not worry about the others.


The presentation here involves adding hardware in steps to implement more and more instructions, starting with R-type arithmetic operations and building up through adding immediate constants, indexed addressing, branches, jumps and eventually subroutines. The floor plan will look something like:

[Figure: floor plan: a fetch unit feeding the data path (at A) and the control unit (at D), and passing the incremented program counter to the branch logic (at B); the control unit drives the data path (at F) and the branch logic (at G); the data path sends flags and computed addresses to the branch logic (at E), which returns the next instruction address to the fetch unit (at C).]

and we will gradually elaborate a simple (but slightly unrealistic) implementation of the processor by filling in the parts. The fetch unit gets a stream of instructions from memory and supplies them (at A and D) on consecutive clock cycles to the data path and the control unit. Each instruction specifies some registers to be operated on (at A) and some operation to be done (at D); the fetch unit also increments the program counter and passes this on to the branch logic (at B). The control unit translates the instructions into a large number of control signals which drive the data path (at F) to perform those operations. The branch logic receives control signals from the control unit (at G) and flags and computed addresses from the data path (at E) which might affect the location of the next instruction passed back to the fetch unit (at C).

9.6 Fetch unit (1st draft)

The first draft version of the fetch unit brings consecutive words from the instruction memory:

[Figure: the PC addresses the instruction memory, whose output word is supplied at A and D; an adder computes PC + 4 and passes it to the branch logic at B; the value arriving back at C becomes the next PC.]


On each clock cycle the word at the address in the program counter is read, and supplied to the data path and control unit as an instruction. On the next clock cycle the program counter will point at the next word, which is four away (because words are 32 bits on this machine, and consecutive memory addresses correspond to consecutive bytes of memory).

9.7 Data path (1st draft)

The first draft data path will execute a stream of R-format instructions:

[Figure: first draft data path: instruction bits 25:21 and 20:16 drive raddr0 and raddr1, and bits 15:11 drive waddr; the two register outputs feed the ALU, controlled by ALUFunc; the ALU output returns to wdata (written when RegWrite is true), and a zero flag is sent to the branch logic at E.]

On each cycle it reads registers rs and rt, operates on them according to the value of ALUFunc, and (assuming RegWrite is true) writes the result back to rd. It also generates a signal for the branch logic which indicates whether the result of an operation was zero.

9.8 Data path (2nd draft)

In order to implement instructions like addi, for example

                  op | rs | rt | imm
addi $2, $3, 42    8 |  3 |  2 |  42     I-format
                   6 |  5 |  5 | 16 bits

which adds $3 and the constant 42 and puts the answer in $2, we need a way of getting the immediate constant from the instruction to the ALU. (Details: the constant is 16 bits long, not 32 bits; and the destination register is not in bits 15:11.)


[Figure: second draft data path: a multiplexer controlled by RegDst chooses between instruction bits 15:11 and 20:16 as waddr; another multiplexer controlled by ALUSrc chooses between rdata1 and the sign-extended immediate field (bits 15:0) as the second ALU operand.]

When RegDst = 1 and ALUSrc = 0 this is the R-type data path. When RegDst = 0 the register to be written is selected by the rt field of the instruction; and if ALUSrc = 1 the second operand comes from sign extending the immediate field of the instruction. (It costs less to read a second register and discard the value at the ALUSrc multiplexer than it would to think about not reading the register.) (Details: there are other instructions on MIPS such as ori which do not sign extend the constant.)

9.9 Data path (3rd draft)

In order to implement the load word and store word instructions we need to implement indexed addressing. The full form of the (real hardware) load word instruction

lw $2, 100($3)

calculates an address by adding the contents of $3 and the constant 100; the contents of the word with that address are copied into $2. Think of programs that use arrays and other indexed structures.
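For instance (made-up numbers): if $3 holds 2000, the base address of an array of words, then lw $2, 100($3) reads the word at address 2000 + 100 = 2100, which is element 25 of the array (since each element is four bytes).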

[Figure: third draft data path: the ALU output supplies the address of the data memory; a multiplexer controlled by MemToReg chooses between the ALU output and the memory's rdata as the value written back to the registers; the memory has read and write enables driven by MemRead and MemWrite.]


The memory address we want is the value that would have been calculated by an addi instruction. When MemToReg = 0 we get the previous behaviour; if MemToReg = 1 the contents of a memory location are delivered to be written to a register. Of course, MemWrite = 0; we do not want to change memory. It is worth having a MemRead signal, because it turns out to be worth not troubling the data memory on other instructions: this avoids all sorts of activity in the memory to which we will return later. The store word instruction is much like load word, but copies a register to the memory. The addressing is just like load word, but the data to be written comes from rt (which is being read anyway!)

[Figure: the same data path with the rdata1 output also routed to the data memory's wdata input, so that a store writes the register read on the second port to the addressed memory location.]

and, of course, we set MemWrite = 1, and MemRead = 0 and RegWrite = 0.

9.10 Fetch unit and branch logic

A branch instruction conditionally selects between going on to the next instruction (in textual order) and jumping away to somewhere else in the program. For example

                     op | rs | rt | imm
beq $2, $3, label     4 |  2 |  3 | offset    I-format
                      6 |  5 |  5 | 16 bits

changes the pc to point at the instruction marked label, but only if $2 and $3 contain the same value. The value in the immediate constant field is the number of words (a quarter of the number of bytes) between 'here' (the instruction next after the branch, in textual sequence) and 'there' (the label). The assembler usually calculates this number. To implement the instruction we need two changes to the calculation of the next pc value:

[Figure: branch logic: one adder computes pc + 4; a second adder adds to this the sign-extended immediate constant multiplied by 4; a multiplexer controlled by Branch and the zero signal selects which of the two values becomes the next PC.]

The target of the branch is calculated from the incremented pc and the sign extended immediate constant. (This makes a branch with a zero immediate constant a non-jump.) This calculation may as well be made whether or not the branch target is needed. Secondly, the next value to be stored in the pc is selected according to the value of the zero signal from the ALU, and the Branch control signal. MIPS also has bne (branch if not equal) and 'unconditional branch' (sic) instructions, as well as several branch instructions that are conditional on other signals, such as those from the floating point arithmetic unit.
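To make the offset arithmetic concrete (a made-up instance): if a beq at address 100 is to branch to a label at address 120, the assembler puts (120 − 104)/4 = 4 in the immediate field, and the hardware computes the target as 104 + 4 × 4 = 120.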

9.11 Control signals

It remains to generate the various control signals, on the basis of which instruction is being executed.

[Figure: the opcode (instruction bits 31:26) addresses a control ROM which produces ALUSrc, MemRead, MemWrite, MemToReg, RegDst, RegWrite, Branch and ALUOp; ALUOp is combined with the func field (bits 5:0) to produce ALUFunc.]


The signals to be produced can just be tabulated by the value of the opcode, and the implementation of the circuit can be just a read only memory.

         op   ALUSrc  ALUOp  MemRead  MemWrite  MemToReg  RegDst  RegWrite  Branch
R-type    0      0     func     0        0         0        1        1        no
addi      8      1      +       0        0         0        0        1        no
lw       35      1      +       1        0         1        0        1        no
sw       43      1      +       0        1         X        X        0        no
beq       4      0      −       0        0         X        X        0        eq
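The table is small enough that it can also be read as a lookup function. Here is a Haskell sketch of it (not the notes' own code; the don't-care entries are arbitrarily taken as False, and ALUOp is left out):

data Ctl = Ctl { aluSrc, memRead, memWrite, memToReg,
                 regDst, regWrite, branch :: Bool }

control :: Int -> Ctl          -- indexed by the opcode
control 0  = Ctl False False False False True  True  False   -- R-type
control 8  = Ctl True  False False False False True  False   -- addi
control 35 = Ctl True  True  False True  False True  False   -- lw
control 43 = Ctl True  False True  False False False False   -- sw
control 4  = Ctl False False False False False False True    -- beq
-- (other opcodes omitted)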

9.12 Fetch unit and branch logic and jump logic

MIPS also provides a range of jump instructions, for example

             op | addr
j label       2 | target     J-format
              6 | 26 bits

sets the pc to a (constant) value. Again, the assembler is expected to calculate the target address from the label. (Detail: the pc is 32 bits long, the addr field is only 26 bits. Where do the other bits come from? Because instructions are on word boundaries, we can multiply by four, making 28 bits; and MIPS chooses to leave the remaining top four bits of the pc unchanged.) This needs another change to the hardware:

[Figure: the branch logic extended with a further multiplexer, controlled by a Jump signal, which can select the jump target (the 26-bit addr field multiplied by 4, together with the top bits of the incremented pc) as the next PC.]

and another column in the control ROM. The target address comes from bits 25:0 of the instruction.
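Again, a made-up instance of the arithmetic: if the pc holds 40000000₁₆ and the addr field of a j instruction is 0000123₁₆, the target is the top four bits of the pc (4₁₆) followed by 123₁₆ × 4 = 48C₁₆, that is 4000048C₁₆.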


There are also instructions

jal addr

‘jump and link’ which is a J-format instruction that jumps to the same place as the corresponding jump instruction, having simultaneously saved pc + 4 in register $31, and an R-format instruction

jr $r which loads the contents of $r into the pc – so that jr $31 will 'return' from a jal. (Both of these will need new hardware: exercises 9.1 and 9.2.) (Detail: the real MIPS also has a 'jump and link register' instruction which combines a 'jump register' with storing the return pc in a specified register.)

Exercises

9.1 The jal (jump and link) instruction is a J-format instruction which jumps to the target address, having first saved the address of the next instruction in register 31 ($ra). Outline the implementation of this instruction, together with any changes you have to make to the hardware of the MIPS described in the text.

9.2 The jr (jump register) instruction is an R-format instruction with opcode 0 and function 8 which jumps to the address stored in the register indicated by its rs field. Outline the implementation of this instruction, together with any changes you have to make to the hardware of the MIPS described in the text.

9.3 The jalr (jump and link register) instruction is an R-format instruction with opcode 0 and function 9 which jumps to the address stored in the register indicated by its rs field, having first saved the address of the next instruction in the register specified by its rd field. Outline the implementation of this instruction, together with any changes you have to make to the hardware of the MIPS described in the text.

9.4 MIPS also provides a pair of branch and link instructions, bltzal and bgtzal. These are similar to the bltz and bgtz instructions, which branch conditionally on rs being less than (or greater than) zero. The branch and link variants also store the program counter in register 31. What additional hardware do these instructions require?


9.13 Synthetic instructions

There are many special cases of instructions that we have already seen which justify being thought of as instructions in their own right. The assembler might as well translate

li $r, const   →   addi $r, $0, const
b label        →   beq $0, $0, label

There is also a range of things that take more than one instruction. For example, there is a (real hardware) instruction lui (load upper immediate) that puts its 16 bit constant into the top half of a register, and zeroes into the bottom half. For example, lui $r, 0x1234 puts 12340000₁₆ into $r. This together with another instruction can be used to synthesise a two-instruction sequence for li $r, const where const needs more than 16 bits (exercise 9.5). Synthetic instructions like this are known as pseudoinstructions. There is a convention that $1 is free for use by the assembler when it needs one to synthesise a multi-instruction pseudoinstruction. In fact the assembler allows you to refer to it as $at, for 'assembler temporary'. (This convention has no significance in the hardware.) In implementing li it is useful (but not necessary) to know that while addi sign extends its constant, ori (bit-by-bit or) does not.15 (Having instructions which do not sign extend the immediate constant does affect the hardware.) There is also a slt (set less than) instruction which sets rd to 1 if rs < rt and to 0 otherwise. Notice that this gives a stronger result than looking at the result of subtracting one register from the other, which may cause an overflow. The slt instruction can be used with beq and bne to synthesise branches on less-than-or-equal, less-than, greater-than, and greater-than-or-equal (exercise 9.6). There are, of course, also slti, sltu, and sltiu instructions; and pseudoinstructions for all the other 'set' instructions seq, sge, sgeu, sgt, sgtu, sle, sleu and sne can be synthesised from the set instructions and branches. Even some of the instructions already mentioned turn out not to be implemented instructions on a real MIPS processor. The multiplication instruction mult $rs, $rt on real hardware multiplies two 32 bit registers and puts the 64 bit product in two special registers hi and lo (not in the general purpose registers), and the real divide instruction stores the quotient and remainder in hi and lo. There are two further instructions mfhi and mflo to move values from the hi and lo registers into a general purpose register, so mul $rd, $rs, $rt expands into

mult $rs, $rt


mflo $rd

and muli $rd, $r, 4 translates into

addi $at, $zero, 4
mult $r, $at
mflo $rd

which is why a compiler should prefer to do that by sll $rd, $r, 2. The point here is not the details, but the idea of keeping the hardware small, while still providing a rich target language for the compiler writer. And we can always choose to add some of these instructions to later hardware.

15There is an instruction addiu, but misleadingly it also sign-extends its immediate constant. The difference between addi and addiu is that on real MIPS one causes an exception on overflow whereas the other does not.

9.14 MIPS is not the real MIPS

One of the attractions of RISC architectures like MIPS is that it is possible to devise much faster implementations than the one described here. In particular real MIPS chips will be executing several instructions at the same time. This has turned out to be so much the right thing to do that even non-RISC processors these days work by translating their instructions into component parts which can be treated as RISC instructions and executed on pipelines. A consequence of this pipelining is that many implementations of RISC machines will have decided to execute an instruction before discovering that a branch had been taken. One way of dealing with this is to guarantee to execute these instructions anyway. This is just the sort of complication which we want to avoid in these notes, so some of the things we have said about MIPS instructions are not true about the real machine. Other things we do not deal with here are exceptions (interrupts), and in particular the arithmetic overflow exception; and the handling of floating point by a coprocessor.


9.15 Summary of instructions mentioned

The details are not important: what you need is a feel for what would be a reasonable set of instructions for a machine like MIPS, and a convincing account of how the structure of the hardware and the function of the instructions correspond.

add $rd, $rs, $rt    rd := rs + rt
sub $rd, $rs, $rt    rd := rs − rt
mul $rd, $rs, $rt    rd := rs × rt
sll $rd, $rt, n      rd := rt << n

The registers have conventional names and uses:

$zero      0             always zero
$at        1             assembler temporary
$v0-$v1    2-3           return values from functions
$a0-$a3    4-7           first four arguments to a procedure
$t0-$t9    8-15,24-25    temporary registers (may be destroyed by calls)
$s0-$s7    16-23         saved temporaries (preserved by calls)
$k0,$k1    26,27         reserved for OS kernel
$gp        28            global pointer
$sp        29            stack pointer
$fp        30            frame pointer
$ra        31            return address

Again, the details do not matter!


Exercises

9.5 Show how to implement the following pseudoinstructions, where big requires more than sixteen bits, but −2¹⁵ ≤ small < 2¹⁵.

    Instruction           Effect
    move $5, $3           $5 := $3
    clear $5              $5 := 0
    li $5, small          $5 := small
    li $5, big            $5 := big
    lw $5, big($3)        $5 := m[$3 + big]
    addi $5, $3, big      $5 := $3 + big
    beq $5, small, L      if $5 = small goto L
    beq $5, big, L        if $5 = big goto L

9.6 Show how to implement the following pseudoinstructions.

    Instruction           Effect
    bz $5, L              if $5 = 0 goto L
    bnez $5, L            if $5 ≠ 0 goto L
    blt $5, $3, L         if $5 < $3 goto L
    ble $5, $3, L         if $5 ≤ $3 goto L
    bgt $5, $3, L         if $5 > $3 goto L
    bge $5, $3, L         if $5 ≥ $3 goto L

9.7 How far away can the target of a branch such as beq be? How might you implement beq $5, $3, l if label l is too far away?

9.8 Write a simple expression compiler (in the style of section 8.3) for MIPS. The environment should map names to absolute store locations, and the output should be text acceptable to the assembler. A real compiler would have to try to minimise memory accesses, and to manage expressions too big to fit into the available registers. Do not try to be at all clever about register allocation: use as many registers as you need, and let the compiler fail if it runs out of registers.


9.16 Single instruction machines

Everyone should meet a single instruction machine at least once in their career, if only to see the consequences of taking any methodology to extreme lengths. A single instruction machine (SIC) has only one instruction, which therefore needs no opcode. It has no registers (apart from the program counter), but the instruction has two memory addresses and an integer as arguments. On this particular example of the SIC the effect of the instruction a, b, c is to subtract the value of memory location b from memory location a, and if this makes the value of location a negative, jump to an instruction a distance c away.

a, b, c -- m[a] := m[a] - m[b]; if m[a] < 0 then pc := pc + c

For definiteness, assume that the pc has already been incremented before the conditional jump, and that instructions occupy consecutively addressed locations. Thus an instruction with third field 0 always executes the following instruction next, whatever the result of the test. The machine also requires that some memory locations contain known constants, say one contains the constant 1.

Exercises

9.9 Write sequences of instructions for SIC which have the effect of each of

1. x := x − y
2. x := 0
3. x := −y, leaving y unchanged
4. x := y, leaving y unchanged
5. x := x + 1
6. pc := l
7. x := x + y, leaving y unchanged
8. x := x × y, leaving y unchanged

In some cases you will need some other temporary location(s) for working. It will help to assume that you have an assembler that can resolve labels for jumps.

9.10 Write an assembler and emulator for SIC, and use it to count the number of instructions executed in running some programs.


10 Programming

Suppose we do not implement a multiplier in hardware, and as a result have no multiplication instructions. This program calculates a × b by repeated addition, provided b ≥ 0:

c := 0; d := 0;
while c ≠ b do
    d := d + a;
    c := c + 1
end

To check this: it maintains the invariant d = a × c ∧ 0 ≤ c ≤ b. This program takes a time proportional to the size of b; you can do the same in time O(log b) by successive halving of b. This linear-time program can be translated into

      clr  $4             -- c := 0
      clr  $5             -- d := 0
loop: beq  $4, $3, done   -- while c /= b do
      add  $5, $5, $2     --   d := d + a
      addi $4, $4, 1      --   c := c + 1
      b    loop           -- end
done:

These are not all physical instructions, though in this case there is one real instruction for each logical instruction.

 0: addi $4, $0, 0
 4: addi $5, $0, 0
 8: beq  $4, $3, L24    -- offset +3
12: add  $5, $5, $2
16: addi $4, $4, 1
20: beq  $0, $0, L8     -- offset -4

Notice that had we written the test as c < b and turned that into

loop: bge $4, $3, done   -- while c < b do

this would have translated into two real instructions and the offset for the other branch would have to change to compensate. Of course it is much easier to translate a program written in a structured language into an assembly code program than it is to reconstruct the intention

from an assembly code program. Whilst the structured program in the comments may well just about make it feasible to read small assembly code programs like this, it is well worth writing the comments first and then adding the code, rather than the other way around! This program accumulates the sum of the elements of an array:

i := 0; total := 0;
while i ≠ n do
    total := total + a[i];
    i := i + 1
end

This time the invariant is total = Σ0≤j<i a[j] ∧ 0 ≤ i ≤ n.

      clr  $2             -- i := 0
      clr  $3             -- total := 0
loop: beq  $2, $4, exit   -- while i /= n do
      sll  $5, $2, 2      --   $5 := 4i
      lw   $6, a($5)      --   $6 := a[i]
      add  $3, $3, $6     --   total := total + a[i]
      addi $2, $2, 1      --   i := i + 1
      b    loop           -- end
exit:

The shift left logical instruction makes $5 = 4i, and the lw $6, a($5) translates this into byte addresses at a, a + 4, a + 8, . . . a + 4i (on the assumption that the array is a sequence of word-sized values). Different code would be possible: for example keep both i and 4i in registers, incrementing both on each cycle; or keep 4i, and have 4n in a register for comparison with it. Different code might be necessary: for example if the array a were not at a known constant address, but was at an address computed and stored in a register (it might be an argument to a procedure).


10.1 Procedures and functions

A procedure, such as

proc gcd(val x, y : int): int
begin
    while x ≠ y do
        if x > y then x := x − y else y := y − x end
    end;
    return x
end

is conventionally (this is unimportant) called with its arguments in $a0, $a1, and so on, and returns its result in $v0 (and so on). If the code for the procedure is called by jumping to it with a jal instruction and the code ends by jumping to the address which that instruction leaves in $ra

      li  $a0, n1
      li  $a1, n2
      jal gcd              -- $v0 := gcd($a0, $a1)
next:

then the effect of the subroutine is to resume execution at the next instruction with the result in $v0. Here is possible code for the body of the procedure gcd:

gcd:  beq  $a0, $a1, done   -- while x /= y do
      ble  $a0, $a1, else   --   if x > y then
      sub  $a0, $a0, $a1    --     x := x - y
      b    gcd              --   else
else: sub  $a1, $a1, $a0    --     y := y - x
      b    gcd              --   end
done: move $v0, $a0         -- end
      jr   $ra              -- return x

The intended effect of the jal is $v0 := gcd($a0, $a1), but it unexpectedly destroys $a0 and $a1. It is in effect

$v0, $a0, $a1, $ra := gcd($a0, $a1), gcd($a0, $a1), gcd($a0, $a1), next


Again, by convention, this is fine.16 Suppose we had for some reason decided to write a recursive procedure

proc gcd(val x, y : int): int
begin
    if x ≠ y then
        if x > y then return gcd(x − y, y) else return gcd(x, y − x) end
    else
        return x
    end
end

this code would not be good enough

gcd:  beq  $a0, $a1, done   -- if x /= y then
      ble  $a0, $a1, else   --   if x > y then
      sub  $a0, $a0, $a1
      jal  gcd              --     return gcd(x-y,y)
      jr   $ra
else: sub  $a1, $a1, $a0    --   else
      jal  gcd              --     return gcd(x,y-x)
      jr   $ra
done: move $v0, $a0         -- else return x
      jr   $ra

because $ra is not preserved by a recursive call. Although the first jr to be executed returns to the right place, if there has been a recursive call then that place is one of the jr $ra instructions, and the program counter will stay pointing at that instruction.

16The MIPS convention is that only the eight $s registers and $gp, $fp and $sp are preserved. This is part of the RISC philosophy of keeping the mechanism cheap, and allowing a program to do more work only when necessary.


How might a procedure preserve some registers like $ra? They could be copied into a fixed place in memory, but that does not work for recursive procedures. We could use a stack: this one grows downwards from high addresses, and our convention is that $sp is the address of the used location with the lowest address, so that the next free location is at $sp − 4.

gcd:  subi $sp, $sp, 4
      sw   $ra, 0($sp)
      beq  $a0, $a1, done   -- if x /= y then
      ble  $a0, $a1, else   --   if x > y then
      sub  $a0, $a0, $a1
      jal  gcd              --     $v0 := gcd(x-y,y)
      b    exit
else: sub  $a1, $a1, $a0    --   else
      jal  gcd              --     $v0 := gcd(x,y-x)
      b    exit
done: move $v0, $a0         -- else $v0 := x
exit: lw   $ra, 0($sp)
      addi $sp, $sp, 4
      jr   $ra              -- return $v0

Here is another example:

proc fact(val n : int): int
begin
    if n = 0 then return 1 else return n × fact(n − 1) end
end

If no attempt is made to save registers

fact: bne  $a0, 0, else
then: li   $v0, 1
      jr   $ra
else: subi $a0, $a0, 1
      jal  fact
      mul  $v0, $v0, $a0
      jr   $ra

then not only does $ra contain the wrong return address, but by the time a mul instruction is executed, $a0 contains 0.


This time, both registers have to be preserved. Should it be the caller, or the called code that preserves the registers?

fact: subi $sp, $sp, 8
      sw   $ra, 4($sp)
      sw   $a0, 0($sp)
      bne  $a0, 0, else
then: li   $v0, 1
      addi $sp, $sp, 8
      jr   $ra
else: subi $a0, $a0, 1
      jal  fact
      lw   $a0, 0($sp)
      mul  $v0, $v0, $a0
      lw   $ra, 4($sp)
      addi $sp, $sp, 8
      jr   $ra

The layout of the stack is a matter of convention: any convention will do, so long as all the code is consistent about it. The same convention must apply both to the code of a procedure body and to every call of it. This is most obviously necessary when the procedure and its user are compiled separately, especially when they are compiled from programs written in different languages. MIPS convention also passes all parameters after the first four words on the stack.

Exercises

10.1 Rearrange the code for the while loop on page 93 so that it follows the pattern

    initialisation, and branch to test;
    loop: body of the loop
    test: test and conditional branch to loop

Compare the (static) number of instructions in the program, and the (dynamic) number of instructions executed if the loop body is executed n times.

10.2 Write pseudocode for the logarithmic time multiplication described on page 93 and translate this into assembly code.

10.3 Modify the code for the summing loop on page 94 as described in the text so that the indexing register contains 4i.

10.4 Modify the code for the summing loop on page 94 for the case that the address of the array a is not a constant, but is provided in a register.

10.5 Rearrange the code for the while-loop version of gcd to reduce the number of instructions executed.

10.6 Modify the code for fact to avoid saving registers in the branch of the conditional which does not perform a recursive call. What is the advantage of the modified code?

10.7 Suppose that our processor does not implement multiplication (as the emulator in the practical does not) and so does not have a mul instruction. Write a subroutine to implement multiplication, and rewrite fact to call this when it needs to perform a multiplication. Use the MIPS calling conventions: arguments are passed in $a registers, results returned in $v registers, and neither $a registers nor $t registers need be preserved by calls (but the $s registers do).

10.2 Evaluating conditional expressions

In a program that looks like

while a < b do ... end

it is natural to generate code that looks like

      slt $t0, $a, $b
      bz  $t0, end
      ...
end:

and regard the result of the slt as the value of the Boolean a < b. That would lead in the direction of translating

while a < b ∧ c < d do ... end

into

loop: slt $t0, $a, $b     -- $t0 := a < b
      slt $t1, $c, $d     -- $t1 := c < d
      and $t0, $t0, $t1   -- $t0 := (a < b) /\ (c < d)
      bz  $t0, end        -- while (a < b) /\ (c < d) do
      ...                 --   ...
      b   loop            -- end
end:

however this is not what you want in cases like


while 0 ≤ i < n ∧ a[i] ≠ x do ... end

where it may be obvious that the test is false because of a wildly unlikely value of i, but one that makes it unsafe to evaluate a[i]. More generally, the expressions may be expensive to evaluate, and it may even be significantly quicker to 'shortcut' the evaluation. Most programming languages provide a non-commutative logical and and a non-commutative or, like the && and || in Haskell, with exactly these semantics. An expression made of such ands and ors can be evaluated by generating jumps like

       slt $t, $a, $b
       bz  $t, false
       slt $t, $c, $d
false:

although in a context where the value is only for a subsequent conditional, for example the while loop above, the right code to generate here is

       slt $t, $a, $b
       bz  $t, end
       slt $t, $c, $d
       bz  $t, end
body:  ...
end:

Code like this, where the Boolean is only used for a conditional jump, can be generated in a systematic way for general Boolean expressions, for example

if (a = b) ∧ ((c = d) ∨ ¬e) then ... else ... end

translates into

       bne  $a, $b, else
       beq  $c, $d, then
       bnez $e, else
then:  ...
       b    end
else:  ...
end:

This code never constructs a value for (a = b) ∧ ((c = d) ∨ ¬e), but the way of evaluating this expression if the value is needed, for example in

x := (a = b) ∧ ((c = d) ∨ ¬e)

is to translate it into

if (a = b) ∧ ((c = d) ∨ ¬e) then x := true else x := false end

which corresponds to

       bne  $a, $b, else
       beq  $c, $d, then
       bnez $e, else
then:  li   $x, 1
       b    end
else:  li   $x, 0
end:

or possibly

       li   $x, 0
       bne  $a, $b, else
       beq  $c, $d, then
       bnez $e, else
then:  li   $x, 1
else:

Exercises

10.8 Write a function to calculate the quotient and remainder on dividing a (small) non-negative number by a (small) positive number; and then translate your function into MIPS code. Try a simple algorithm to begin with, and then one logarithmic in the quotient. The force of 'small' is that you need not worry about making your code work for the very largest representable positive numbers. What would your code do in case of extremely large values, in case of division by zero, in case of negative arguments?

10.9 Write a function to calculate the integer square root of its argument, and then translate your function into MIPS code.

10.10 Write a function to test whether its argument is prime, and then translate your function into MIPS code.

10.11 Write a function which takes n as an argument and returns the nth prime number; then translate your function into MIPS code.


10.12 Write a function which takes n as an argument and uses the sieve of Eratosthenes to find the nth prime number, storing the first n prime numbers in data memory.

10.3 Stack frames

The subroutines which have been coded in this chapter preserve some values on the stack on entry, and restore those values on exit. If a routine is going to call another, this will have to include the return address $ra. Local variables have mostly stayed in registers; they can be kept in $s registers. This is quite complicated enough for hand-compiled code, but is not enough for languages with statically nested procedure definitions. In Algol-like languages (like Pascal and its derivatives), just as in Haskell, the code of a procedure can refer to variables local to an enclosing procedure, for example

proc outer
    var x;
    proc inner
        var y;
    begin ... x := y ... end;
begin ... inner ... end

In the body of inner, the name y refers to a local variable, but x refers to a variable local to an enclosing scope. In general it is not possible to tell how far such non-local variables are from the top of stack.

proc outer
    var x;
    proc p begin ... x ... end;
    proc q begin ... p; ... q; ... end;
begin ... q ... end

Where q (conditionally) calls itself, the call which it makes to p can refer to an x which is at a distance from the top of stack that varies with the number of recursive active calls of q. The solution to this problem is to link the stack frames together in such a way as to make it possible to find non-local variables. The current stack frame is identified by a frame pointer (the $fp register is identified for this by the MIPS assembler). This is constant during the execution of a procedure, and so provides a means of referring to local variables by fixed offsets (as opposed to the varying offsets from $sp). Systematic compilation of procedures ensures that each stack frame contains, as well as the saved return address $ra, the saved value of fp (known as the dynamic link) with which to restore fp on exit, and in addition the value of fp that was current when the statically enclosing procedure was active (known as the static link). The compiler must ensure that the static link for a procedure is supplied as an argument to the code of the procedure. Non-local variables can be identified by their position in a stack frame and the position of that stack frame in the chain of frames created by the static links. In a call of p in the example, x is a variable at a known offset in the stack frame of outer, which must be the frame that is next down the static chain from the current frame. In languages like C it is not possible to nest the definitions of procedures in this way. In consequence all variables that are not local to a procedure are global, and so can be at known fixed addresses. Static links are not needed for such languages.
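To make the static chain concrete, consider a made-up trace of the second example above: outer calls q, q calls itself once, and the inner call of q calls p. The stack then holds a frame for outer, two frames for q and one for p. The dynamic link in p's frame points at the inner q frame (its caller), but p's static link, supplied at the call, points directly at outer's frame; so the code for p finds x with one load of the static link from its own frame and one load of x at its fixed offset from that address. A variable declared two levels out would need the static link to be followed twice.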


11 MIPS emulator

The MIPS emulator is a Haskell program which takes a MIPS assembly language program, expands the pseudoinstructions into real instructions, assembles them into a list of numbers representing the binary program, and then executes that program, one instruction at a time, showing the contents of the registers at each step.

11.1 Running the emulator

You can run the compiled emulator

/usr/local/practicals/digitalsystems/mips

with a file name as an argument

/usr/local/practicals/digitalsystems/mips/mips fact

The assembler looks for fact.mips, and if that does not work, fact. To save typing, I suggest you make a symbolic link to the directory containing the emulator:

ln -s /usr/local/practicals/digitalsystems/mips/mips mips

so that you can now refer to it as mips, and run the interpreter by typing

mips/mips fact

Here fact.mips is a program which contains the answer to exercise 10.7, and a short calling sequence which evaluates 4!

main:                   -- $v0 := factorial(4)
    li $a0, 4
    jal fact
    halt
fact:                   -- $v0 := factorial($a0)
    bne $a0, 1, else
    ...
    jr $ra
mult:                   -- $v0 := $a0 * $a1
    ...


The label main is by way of a comment: the emulator does not use it, but assembles the program to start at location zero, and then runs the code at location zero.

When assembling the code, the emulator expands pseudoinstructions and resolves labels for jumps and branches. It then produces a listing showing the real instructions in the program, with actual addresses:

L0:  addi $4, $0, 4
L4:  jal 3          -- L12
L8:  break 0
L12: addi $1, $0, 1
L16: bne $4, $1, 2  -- L28

Thus the li $a0, 4 becomes

L0:  addi $4, $0, 4

at location zero, and bne $a0, 1, else becomes

L12: addi $1, $0, 1
L16: bne $4, $1, 2  -- L28

at locations 12 and 16. Remember that (real) instructions are four bytes long.

The emulator now waits for a return to be typed at the keyboard, and after that it runs the program starting from location zero. Before running each instruction it shows the pc, the instruction fetched from that address about to be executed, and the contents of the registers. So the running of this program begins

 0: addi $4, $0, 4      [0,...,0]
 4: jal 3               [0,0,0,4,0,...,0]
12: addi $1, $0, 1      [0,0,0,4,0,...,0,8]
16: bne $4, $1, 2       [1,0,0,4,0,...,0,8]
28: addi $30, $30, -8   [1,0,0,4,0,...,0,8]
32: sw $31, 4($30)      [1,0,0,4,0,...,0,-8,8]

so that after executing the addi corresponding to li $a0, 4, register number 4 contains 4. The listing of the registers shows only registers from 1 to 31 (since register zero is always zero) and tries to omit a long run of zeroes to make it easier to see what is happening, so after adjusting the stack pointer you can see that $sp (which is $30) is −8, and $ra (which is $31) is 8. Eventually (after about ninety instructions) the program ends with

64: addi $30, $30, 8    [1,24,0,32,0,...,0,-8,8]
68: jr $31              [1,24,0,32,0,...,0,8]
 8: break 0             [1,24,0,32,0,...,0,8]

with the answer 24 in register $v0 (which is $2). The final instruction, break 0 (from the pseudoinstruction halt), terminates the emulator.

Suppose the program was not working; perhaps it was never reaching the halt instruction. You can put break instructions anywhere in the program and the emulator will stop there for you to study the state. For example, if we put a break just after the recursive call of the factorial subroutine

fact:                   -- $v0 := factorial($a0)
    bne $a0, 1, else
    ...
    subi $a0, $a0, 1
    jal fact
    break
    lw $a0, 0($sp)
    ...

then running the program gets you as far as the first execution of that break instruction

...
24: jr $31              [1,1,0,1,0,...,0,-24,48]
48: break 1             [1,1,0,1,0,...,0,-24,48]
Type return to continue

Typing return restarts execution and runs on to the next break point

24: jr $31              [1,1,0,1,0,...,0,-24,48]
48: break 1             [1,1,0,1,0,...,0,-24,48]
Type return to continue
52: lw $4, 0($30)       [1,1,0,1,0,...,0,-24,48]
...
72: jr $31              [1,2,0,4,0,...,0,-16,48]
48: break 1             [1,2,0,4,0,...,0,-16,48]
Type return to continue

and so on. (The break pseudoinstruction generates break 1. Argument zero is reserved for halt, but you can put break 2, break 3 and so on in different places if it helps.)

The mips program accepts several flags. You can get just the disassembled listing of the real instructions with the -l flag, so that

mips -l fact.mips > fact.listing

gives you a handy copy of the real instructions to help understand what is happening when the program is running. You can get just the output from running the program with the -r flag so that

./mips -r fact | wc -l

will produce a count of the number of instructions executed.17 The -s flag (single step) runs the program one instruction at a time, pausing before the next. Finally, the -b flag runs the program without showing every instruction: it shows only the break instructions, and the values in the registers at those points (without stopping the program there).

If the emulator is given a list of file names it assembles them together as one program, by concatenating the files. This makes it easier to keep common subroutines in separate files. However the labels in the concatenated program have to be distinct, so if you do this in any big way you will have to have a convention for naming the labels that are local to a file.

11.2 Differences between the emulator and the hardware

There are a number of ways in which the emulator does not try to be faithful to the real MIPS.

The data and instruction memories of the emulator are entirely unconnected. Writing to a location in the data memory which has the same address as an instruction will not change that instruction. This is known in the jargon as a Harvard architecture (after an early Harvard University machine which used paper tape for the program) in contrast to a von Neumann architecture which funnels all accesses to memory through a single memory. The real MIPS machine, like many modern machines, lies between the two, allowing simultaneous access to cached copies of program and data, but caching a single memory which contains the two. The ability of the von Neumann machine and the modified Harvard architecture to modify the program store from within a program is necessary, not least to enable one program to cause another to be run.

The emulator does not implement any exceptions, so there is no checking for arithmetic overflow; nor does it implement anything to do with floating point or any other co-processors. This also means that the implementation of the break instruction is nothing like the software exception in the real hardware.

Pipelined implementations of MIPS commit themselves to executing the next instruction before deciding whether or not to take a jump or branch. Accordingly the order of the instructions which they execute is not the same as in the emulator.

17Only the listing goes to the standard output: the prompts come to the standard error channel, and the input is expected from standard input.


Moreover in the pipelined processor return addresses are pc + 8 rather than pc + 4. The emulator is made to be easier to understand; the real pipelined processor to be fast! More recent MIPS instruction sets also include more instructions, such as conditional assignments to registers, which are designed to allow compilers to avoid branches, so as to improve the efficiency of instruction pipelines.

11.3 Opening the lid

The assembler and emulator are written in Haskell and the source modules are also in the practicals directory. This means that as well as running the compiled code you can interpret the components of the system with hugs or ghci. For example, you can see the implementation of pseudoinstructions

clpc214.cs.ox.ac.uk_# ghci Assembler.hs
GHCi, version 6.10.3: http://www.haskell.org/ghc/  :? for help
Prelude Assembler> assemble "sw 3, 4"
[(0,536936451),(4,-1409220604)]
Prelude Assembler> putStr(disassemble_program(assemble "sw 2, 3"))
L0: addi $1, $0, 2
L4: sw $1, 3($0)
Prelude Assembler>

If you are familiar with the Haskell IO monad you can also use the modules to construct a different interface to the emulator, by writing a replacement for the main function.

If you want to run ghc or ghci on your own program which uses the components of the emulator it is useful to know about the -i flag which adds a directory to the path that ghc uses to find imports. If you take a copy of MIPS.hs and modify it, you can run it with

ghci -imips MIPS.hs

with no space after the -i, and the name mips after it is the (symbolic link) name of the directory containing the emulator modules. If you want to compile the whole emulator for a new architecture, it should be sufficient to run

ghc -o mips --make MIPS.hs

which compiles the main body of the program from MIPS.hs, automatically compiles all modules on which it depends, and then links the result to make the executable mips.

11.4 Errors

The implementation, in particular of the assembler, is fairly unforgiving. Any small error is likely to cause the program to fall over, and the error message is not always helpful.

“Syntax error” usually means something which does not even look to the assembler as though it should be an instruction, although this might just be because of a stray punctuation mark.

“Invalid instruction” might mean an instruction with an invalid name, such as bang $1, $2, $3, but it might also mean an instruction with the wrong number of arguments such as add $1, $2 or the wrong kind of arguments such as addi $1, 4($sp).

“Unknown func field” means an instruction which the assembler has translated, but the emulator does not implement. Principal amongst these are the multiplication and division instructions, and the special move instructions that read and write the registers hi and lo. It also omits the exotic branch instructions which perform conditional subroutine calls, and implements as pseudoinstructions some other branches which are real instructions on real hardware.

Occasionally, things that ought to cause errors will not. This is usually because I have failed to make the program strict enough. For example, executing mflo $0 ought to cause an error because the mflo instruction is not implemented, but mflo $0 does not change anything in the real machine (it only writes register $0, which is constant zero) and the emulator blithely does nothing rather than flagging an error. You might say that the emulator implements those special cases of the unimplemented instructions which are actually no-ops. Beware that hugs does not properly implement strictness annotations, so if something appears to work in hugs but causes an error in ghci, it is ghci that is right.

This emulator has only been used for practicals for a short while, so you are about to do at least as much testing as I have been able to do. If you find any other things which are not implemented, or which appear to be wrongly implemented, do let me know and I will try to fix it.

Exercises

11.1 Design and implement additions to the emulator to implement the missing instructions.

11.2 The current implementation of the data cache is particularly inefficient: it keeps a record of every write that happened to a location, even when overwritten, and to read the cache has to search through writes in reverse chronological


order to find the most recent. Implement a more efficient representation for this cache.

11.3 Implement the two cache interfaces as they are intended to be: caches of a single shared memory. Add to the emulator the ability to predefine some of the contents of data memory, and to dump the final state.

11.5 MIPS instruction summary

The emulator implements the real MIPS instructions shown on page 111. The unsigned instructions addiu, addu and subu produce the same results as their signed forms, but guarantee not to cause overflow exceptions. (Since the emulator does not implement exceptions, these are implemented in the same way as the signed forms.) The unsigned comparison instructions sltiu and sltu are the same as the signed forms, except that they use unsigned comparison. The logical operators ∧, ∨, ¬ and ⊕ (exclusive or) operate in parallel on the corresponding bits of the arguments. The emulator uses the break n instruction of MIPS to pause the emulator (for small positive values of n) and to halt the emulator when n = 0. On the real machine it causes an exception.

The emulator does not implement the following instructions: div, divu, mult, multu, mfhi, mflo, mthi, mtlo, ll, sc, lb, lh, sb, sh, lbu, lhu.

The pseudoinstructions implemented by the assembler are shown on page 112. In addition there are unsigned versions of the order comparisons sgeu, sgtu, sleu, and the corresponding branches bgeu, bgtu, bleu, bltu. Constants are allowed instead of either register operand of the comparisons and branches. The source register of sw can also be replaced by a constant. The unsigned negu causes no overflow (and so is implemented as neg). The argument of break defaults to 1. The standard MIPS pseudoinstructions div, mulo, mul, rem and their unsigned forms divu, mulou, remu are translated, but the resulting instructions are not implemented by the emulator.


add $rd, $rs, $rt     rd := rs + rt
addi $rt, $rs, n      rt := rs + signex(n)
and $rd, $rs, $rt     rd := rs ∧ rt
andi $rt, $rs, n      rt := rs ∧ n
beq $rs, $rt, n       if rs = rt then pc := pc + signex(4n)
bgez $rs, n           if rs ≥ 0 then pc := pc + signex(4n)
bgezal $rs, n         if rs ≥ 0 then pc, ra := pc + signex(4n), pc
bgtz $rs, n           if rs > 0 then pc := pc + signex(4n)
blez $rs, n           if rs ≤ 0 then pc := pc + signex(4n)
bltz $rs, n           if rs < 0 then pc := pc + signex(4n)
bltzal $rs, n         if rs < 0 then pc, ra := pc + signex(4n), pc
bne $rs, $rt, n       if rs ≠ rt then pc := pc + signex(4n)
j n                   pc := 4n
jal n                 pc, ra := 4n, pc
jr $rs                pc := rs
lui $rt, n            rt := n << 16
lw $rt, n($rs)        rt := m[rs + signex(n)]
nor $rd, $rs, $rt     rd := ¬(rs ∨ rt)
or $rd, $rs, $rt      rd := rs ∨ rt
ori $rt, $rs, n       rt := rs ∨ n
sll $rd, $rt, n       rd := rt << n
srlv $rd, $rt, $rs    rd := rt >> rs
sub $rd, $rs, $rt     rd := rs − rt
sw $rt, n($rs)        m[rs + signex(n)] := rt
xor $rd, $rs, $rt     rd := rs ⊕ rt
xori $rt, $rs, n      rt := rs ⊕ n

Also unsigned forms sltiu and sltu.

MIPS instructions implemented by the emulator.


abs $rt, $rs          rt := |rs|
bal n                 pc, ra := pc + signex(4n), pc
bge $rs, $rt, n       if rs ≥ rt then pc := pc + signex(4n)
bgt $rs, $rt, n       if rs > rt then pc := pc + signex(4n)
b n                   pc := pc + signex(4n)
ble $rs, $rt, n       if rs ≤ rt then pc := pc + signex(4n)
blt $rs, $rt, n       if rs < rt then pc := pc + signex(4n)
bnez $rs, n           if rs ≠ 0 then pc := pc + signex(4n)
bz $rs, n             if rs = 0 then pc := pc + signex(4n)
clr $rt               rt := 0
la $rt, n($rs)        rt := rs + signex(n)
li $rt, n             rt := signex(n)
halt                  stop the emulator
move $rt, $rs         rt := rs
neg $rt, $rs          rt := −rs
nop                   no operation
not $rt, $rs          rt := ¬rs
rol $rt, $rs, n       rt := rs rotated left by n bits
ror $rt, $rs, n       rt := rs rotated right by n bits
seq $rd, $rs, $rt     rd := if rs = rt then 1 else 0
sge $rd, $rs, $rt     rd := if rs ≥ rt then 1 else 0
sgt $rd, $rs, $rt     rd := if rs > rt then 1 else 0
sle $rd, $rs, $rt     rd := if rs ≤ rt then 1 else 0
sne $rd, $rs, $rt     rd := if rs ≠ rt then 1 else 0
subi $rt, $rs, n      rt := rs − signex(n)

Also unsigned forms: bgeu, bgtu, bleu, bltu, negu, sgeu, sgtu, sleu.

Pseudoinstructions translated by the assembler.

$zero      0            zero
$at        1            assembler temporary
$v0-$v1    2-3          return values from functions
$a0-$a3    4-7          first four arguments to a procedure
$t0-$t9    8-15, 24-25  temporary registers (may be destroyed by calls)
$s0-$s7    16-23        saved temporaries (preserved by calls)
$k0, $k1   26, 27       reserved for OS kernel
$gp        28           global pointer
$sp        29           stack pointer
$fp        30           frame pointer
$ra        31           return address

Recognised register names

12 Some other architectures

It is perfectly possible to implement in hardware a stack machine along the lines of that on page 64. This one is loosely based on the transputer design from the 1980s. The machine has a program counter and a stack pointer, and a frame pointer. It will need a few instructions with no arguments, such as add which adds the top two elements of the stack:

add     m[sp + 4] := m[sp + 4] + m[sp]; sp := sp + 4

and a few instructions with one immediate argument, such as load immediate

li n    sp := sp − 4; m[sp] := n

We have adopted the same convention for the stack: it grows downwards, and sp points at the topmost location in use. One way of encoding these is to have every real instruction be one with an immediate constant parameter, and to have one of those be the operate instruction, opr, which implements the zero-argument instructions, all of which are pseudoinstructions. The equivalent of the MIPS load word instruction would be provided by the load (pseudo-) instruction

ld      m[sp] := m[m[sp]]

preceded by a calculation of the address. There being no general purpose registers, the equivalent of the $rs part of lw $rt, n($rs) would be a local variable stored at a known location on the stack. These would be read by load local instructions (and written by store local)

ldl n   sp := sp − 4; m[sp] := m[sp + n]
stl n   m[sp + n] := m[sp]; sp := sp + 4

Notice that every time the stack pointer moves, the index of a particular local variable on the stack changes. (Should it have been m[sp] := m[sp + n] or m[sp] := m[sp + 4 + n]?) A different design would use a frame pointer relative to which to refer to local variables. (So it would be m[sp] := m[fp + n].) This frame pointer would only be moved on entry to a new procedure. Hand-coding instructions for a machine with sp-relative addressing is a serious challenge, and it is much easier to leave this sort of thing to a compiler!

A natural way of providing branches on such a machine would be to provide a branch on zero instruction


bz l    if m[sp] = 0 then pc := pc + l; sp := sp + 4

or a branch on equal instruction

beq l   if m[sp] = m[sp + 4] then pc := pc + l; sp := sp + 8

Either will do, because

bz l   =  li 0; beq l
beq l  =  sub; bz l

Notice that on this machine it is not as cheap as on the register machine to implement bz as a pseudoinstruction using beq, because a zero has to be stored onto the stack every time the test is made.
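To make these descriptions concrete, here is a minimal Haskell sketch of such a stack machine: the state, and a step function for a few of the instructions above. The names, the representation of memory as a finite map, and the treatment of the program counter as an index into a list of instructions are all my own choices for illustration; they are not part of the MIPS emulator or of any real transputer.

import Data.Map (Map)
import qualified Data.Map as Map

type Addr  = Int
type Value = Int

-- The visible state of the machine: a program counter (an index into the
-- list of instructions), the stack pointer, and the data memory.
data State = State { pc :: Int, sp :: Addr, mem :: Map Addr Value }
  deriving Show

rd :: State -> Addr -> Value
rd s a = Map.findWithDefault 0 a (mem s)

wr :: State -> Addr -> Value -> Map Addr Value
wr s a v = Map.insert a v (mem s)

data Instr = Li Value | Ldl Int | Stl Int | Add | Bz Int

-- One step of the machine, following the definitions above read
-- sequentially; a taken branch skips l instructions beyond the next one.
step :: Instr -> State -> State
step (Li n)  s = let sp' = sp s - 4 in
                 s { pc = pc s + 1, sp = sp', mem = wr s sp' n }
step (Ldl n) s = let sp' = sp s - 4 in
                 s { pc = pc s + 1, sp = sp', mem = wr s sp' (rd s (sp' + n)) }
step (Stl n) s = s { pc = pc s + 1, sp = sp s + 4, mem = wr s (sp s + n) (rd s (sp s)) }
step Add     s = s { pc = pc s + 1, sp = sp s + 4,
                     mem = wr s (sp s + 4) (rd s (sp s + 4) + rd s (sp s)) }
step (Bz l)  s = s { pc = pc s + 1 + (if rd s (sp s) == 0 then l else 0), sp = sp s + 4 }

-- Run a program until the pc falls off the end.
run :: [Instr] -> State -> State
run prog s
  | pc s < 0 || pc s >= length prog = s
  | otherwise                       = run prog (step (prog !! pc s) s)

For example, run [Li 2, Li 3, Add, Stl 8] (State 0 100 Map.empty) pushes 2 and 3, adds them, and stores the 5 at address 104; notice how the location named by the offset in Stl depends on where sp happens to be by then.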

12.1 Longer arguments

Assuming that a stack machine instruction occupies a word, there cannot be room for a whole word of immediate constant in a li instruction. The MIPS solution to this problem was to have a load upper immediate. We could do the same, which would require there to be room for half a word of immediate constant. It would also be difficult to use the same trick to extend the immediate constant in instructions such as ldl.

A different solution to the problem is to construct the constant argument for each instruction not just from the immediate constant, but from that and a sequence of preceding prefix instructions. The idea is that there is another register op which is cleared to zero at the end of each normal instruction, and the actual effect of an instruction is to use that or-ed with the immediate constant as its operand:

opr n   operation (op ∨ n); op := 0
ldl n   sp := sp − 4; m[sp] := m[sp + (op ∨ n)]; op := 0
stl n   m[sp + (op ∨ n)] := m[sp]; sp := sp + 4; op := 0
li n    sp := sp − 4; m[sp] := op ∨ n; op := 0
bz l    if m[sp] = 0 then pc := pc + (op ∨ l); sp := sp + 4; op := 0

At least one of the instructions, however, is special in that it does not leave op at zero:

pfx n   op := (op ∨ n) << k

The constant k is the number of bits in the immediate constant field of the instruction. Imagine a machine with one byte long instructions, four bits of opcode and four bits of immediate constant. The effect of a sequence of n pfx instructions

pfx 1   -- op := 0x00010
pfx 2   -- op := 0x00120
pfx 3   -- op := 0x01230
li 4    -- li 0x01234; op := 0

would be to construct a constant argument with up to k(n + 1) bits. You could imagine an assembler implementing li 4660 as a pseudoinstruction by generating that sequence (4660 = 0x1234). This would work for all of the one-argument instructions, including the opr instructions, allowing for more than sixteen zero-argument pseudoinstructions. Of course, loading very big constants in this way is expensive, but very big constants are rarer than small constants. It is much the same trade-off as that of lui on MIPS, but more extreme.

The big difference is that negative constants with small absolute value would be very expensive. On a 64 bit machine with 4-bit immediate constants, you would need fifteen prefix instructions to implement ldl -1. The transputer (on which this stack machine is based) provided a second prefix instruction

nfx n   op := (¬(op ∨ n)) << k
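The way op accumulates is easy to model in a few lines of Haskell; this is only an illustration of the encoding just described (with a four bit immediate field), and the names are mine rather than anything from the transputer or the emulator.

import Data.Bits (shiftL, complement, (.|.))

-- The instructions that take part in building a constant.
data Prefixed = Pfx Int | Nfx Int | Li Int

k :: Int
k = 4    -- bits of immediate constant per instruction, as in the example

-- Fold a sequence of prefixes ending in li over the op register, which
-- starts at zero, and return the constant that the li pushes.
constant :: [Prefixed] -> Int
constant = go 0
  where
    go op (Pfx n : rest) = go ((op .|. n) `shiftL` k) rest
    go op (Nfx n : rest) = go (complement (op .|. n) `shiftL` k) rest
    go op [Li n]         = op .|. n
    go op _              = op    -- ill-formed sequences are not the point here

So constant [Pfx 1, Pfx 2, Pfx 3, Li 4] is 0x1234 (that is, 4660), and constant [Nfx 0, Li 15] is −1, the two-instruction encoding that nfx makes possible.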

Exercises

12.1 Suppose we did not implement load local as an instruction. Show how to use an instruction

sp      sp := sp − 4; m[sp] := sp

to load the local variable loaded by ldl n. What are the advantages and disadvantages of this scheme?

12.2 The conditional jump instruction on the transputer was, effectively

bz’ l   if m[sp] = 0 then pc := pc + (op ∨ l) else sp := sp + 4; op := 0

Why was this instruction designed to leave a value on the stack if branching? (Think of evaluating Boolean expressions.)

12.3 Outline an algorithm to generate a shortest prefix sequence for an instruction with an arbitrary argument. It might have the form of a definition of

expand op n
  | 0 ≤ n && n < 16 = generate op n
  | ...             = ...


Start with an assumption that you can generate prefix instructions with arbitrary arguments; then use a recursive call to expand to implement that.

12.2 Microcoded implementation

The hardware implementation of this machine cannot be as straightforward as that of MIPS. There is no register bank, but there have to be several accesses to data memory on each instruction. In particular there can be several reads, and a write which might be to one of the locations just read. Worse, there are reads and writes from addresses which are themselves read from memory. The most practical way to do this would be to make each instruction take several clock cycles, and arrange for different memory accesses to happen in different clock cycles. This would have been a traditional solution at the end of the last century.

A natural way of implementing this multi-cycle machine is to build a small programmable register machine, using its registers for the internal state of the stack machine: the stack pointer, the program counter, some values just read from memory, some values about to be written to memory and so on. The microprogram which this small machine runs is then an interpreter for the machine code of the stack machine. This is fixed, and stored internally to the processor, although there is no reason why a microprogrammable machine should not have a writable microcode store.

Microprogrammed designs make it relatively easy to implement quite complex instructions.18 For example block moves of memory can be encoded in a single instruction by putting a loop in the microcode. Short microcode subroutines can be used to implement complicated combinations of addressing modes involving address arithmetic and register updating like the adjustment of the stack pointer in push and pop instructions. A single instruction can be made to do an awful lot.

This can be an advantage if it happens that the combination of things possible in an instruction are often required to happen together: complex instructions can reduce the number of instructions needed to be executed in a program. However if it turns out that the power of a complicated instruction can rarely be harnessed, it is wasted. Worse than that, in contrast to the simpler RISC designs, it can be hard to find ways of making the hardware faster. The fastest implementations of complex instruction sets these days operate by translating the complex instructions into sequences of simpler components, and executing those quickly on parallel pipelines. The principal reason for keeping the complex instruction sets is code compatibility with earlier microprogrammed machines. There are also likely to be some code density advantages, but the complexity of the implementation hardly justifies them.

18The Cambridge University EDSAC 2 was probably the first microprogrammed machine.

12.3 Hardware stack machines

In a real machine, memory accesses are much more expensive than accesses to registers. The real problem with the machine that has been described so far is that there are several accesses to memory on most instructions, and many of the values stored are immediately re-read and never used again. A practical solution to this would be to cache the top of the stack in some registers.

One solution would be to provide a limited-depth stack in registers. The transputer implemented one of these: pushing a value onto the stack puts it into the top-of-stack register, moving everything else down by one place, rather like one of those spring-loaded plate stacks in a canteen servery. It is the compiler’s responsibility now to ensure that nothing is ever lost by being pushed out of the bottom of the stack. Any sequence of code which requires more than the finite depth of the hardware stack is translated into sequences of safe code using temporary variables. For example, the transputer stack was only three locations deep, so x := a + (b ∗ c) would work

ldl a
ldl b
ldl c
mul
add
stl x

(assuming all variables are in fixed local locations), but x := a − (b − (c − d)) would not, and would be translated into something like t := c − d; x := a − (b − t) or t := b − (c − d); x := a − t.
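One way for a compiler to see when such a rewrite is needed is to compute the stack depth that a straightforward compilation of an expression would use, evaluating the left operand first and leaving it on the stack while the right operand is evaluated. Here is a small Haskell sketch using the Expr type of section 8; the function is my own illustration, not part of the compiler in these notes.

-- Stack cells needed to evaluate an expression, left operand first.
depth :: Expr -> Int
depth (Num _)     = 1
depth (Var _)     = 1
depth (Bin l _ r) = max (depth l) (depth r + 1)

-- depth (parse "a+(b*c)")     == 3   fits a three-deep hardware stack
-- depth (parse "a-(b-(c-d))") == 4   needs a temporary variable

Whenever depth exceeds the hardware stack depth, the compiler splits the expression at a subexpression and stores its value in a temporary, exactly as in the rewriting above.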

12.4 Load/store register machine implementation

The register window on the stack need not have hardware to implement the stack. A stack-machine based compiler for a register machine like MIPS could translate programs into stack operations, then statically allocate registers to hold the temporary values near the top of the stack and memory locations to hold values further down, and translate operations on those into register instructions and loads and stores. In effect the compiler is statically translating the complex instructions into simpler ones, rather than the hardware having to do this at run time. It can also save some work by bundling together instructions that can be done at once: for example, several instructions which push values onto the stack can be combined into a sequence of store instructions and a single adjustment of the stack pointer.


Such a machine could look quite like MIPS, with a straightforward data path that has a sequence of blocks of hardware, each used in turn.

(Figure: the data path, in which the PC feeds the instruction cache, then the registers, then the data cache, and finally the registers again.)

The implementation described in these notes would be quite slow, because each instruction reads memory, reads registers, does an operation, perhaps reads data memory, and then perhaps writes a register. That would take quite a while, particularly the reads of external memory. However the straightforward sequence of events for each instruction means that as soon as one instruction had cleared one section of the data path, another one could follow along behind it. This technique is called pipelining. By putting extra registers in to hold the state of an instruction part way through the data path, five instructions could be executed at the same time, one at each stage of its progress down the pipeline.

(Figure: the same data path with pipeline registers between the stages.)

The pipelined implementation would, of course, take five clock cycles to complete an instruction, but one instruction would be completed on each clock cycle and the clock frequency would be anything up to five times higher. The result is a much faster implementation.

Of course it is not quite that simple: if an instruction reads a register before it is written by an earlier instruction that has not completed, the wrong value will be read. The simplest fix for this is to get the compiler to try to avoid this from happening by rearranging the sequence of the code. Having plenty of registers means that independent calculations can use independent sets of registers, and instructions operating on independent registers can be exchanged with each other. If the compiler guarantees not to produce any code which would require values to be ready before they are, the hardware can guarantee to execute compiled code without specifying what it would do in difficult cases. Sadly, this restriction is probably expecting too much of a compiler: often a compiler would find nothing to be done, and would have to issue no-op instructions to fill the gaps. It is also possible to add extra hardware to the data path either to fish out the value that will eventually be stored, if it is already available in the pipeline, or to hold up the reading instruction until the value does become available. In that way the code can run as quickly as it is safe for it to do. (A compiler can still improve code speed by reordering instructions to try to avoid pipeline stalling.)

Slightly more complicated is the idea of out of order execution. On an out-of-order processor, fetched instructions are not executed immediately if it is not safe to do so. They stay in a pool of pending instructions until their operands are ready, and only then are they dispatched for execution. Out-of-order machines can improve the rate of execution by dynamically allocating physical registers to hold each of a succession of values which are written into a logical (‘architectural’) register. This means that two consecutive calculations apparently using the same register can also be run concurrently.

Branches (and to a lesser extent jumps) also cause a problem: it might not be known whether a jump is taken until after the next instruction or two or three are already on their way down the pipeline. The simplest pipelined hardware implementation gives rise to slightly bizarre behaviour that some instructions after a branch are executed whether or not the branch is taken. If the branch is taken, control passes to the target from some point later in the program. Pipelined implementations of the real MIPS had one branch delay slot, and it was the compiler’s job to try to find something useful to do: something which logically belonged before the branch, but which could be executed after the test for whether the branch should be taken. Extra complications arise when an instruction in the branch delay slot does something awkward, like calling a subroutine or causing an interrupt. Out-of-order implementations of MIPS would be simpler if they did not have (for code compatibility) to implement the delay slots of earlier pipelined versions.

12.5 The shape of contemporary (and future) processors

Amongst the most common processors these days you can see many of the structures we have described. The x86 architecture is common partly because it remains compatible with architectures going back to 1978. It is horrendously complicated for the same reason. Nobody would start from here! Most x86 processors these days look something like the figure on page 120.

The x86 family is most decidedly a complex instruction set architecture. However modern implementations will translate almost all instructions into several RISC instructions for an internal RISC machine. The Core machine shown here fetches instructions from the Instruction Cache in big chunks, decoding up to eight at a time. This is a complicated business since x86 instructions vary in their size.

The x86 has a few fairly general purpose registers (they’re not all usable for everything) but the implementing RISC machine has many more. The next thing that happens to each instruction is that these registers are allocated to the succession of values which get stored in x86 registers, and the RISC instructions are rewritten to use those RISC registers. By now there are many instructions that are independent of each other and so can be executed in any order. Instructions reading a particular RISC register have to wait for that register to have been written, but the instructions operating on two different values occupying the same x86 register one after the other can be run interleaved with each other provided the two values are now in two different RISC registers.

The instructions are then allocated to one of several parallel pipelines according to what they need to do. The pipelines in general are capable of doing different things. For example some pipelines will do normal ALU operations, some will do floating point operations, some might do branch calculations, some might provide vector operations,19 and some will do data memory accesses. Instructions wait in a pool at the head of the pipelines for their arguments to be ready. Once their arguments are ready they can be executed, and with the best of luck at least one instruction will be ready for each pipeline in each clock cycle. When results emerge from the pipelines they are passed back to the pool of waiting instructions. There is no necessary ordering on instructions except that which is imposed by data dependencies. Accesses to data memory have to be made to happen in the right order, which is the function of the Memory Ordering Buffer.

The ‘real’ registers of the x86 machine, marked ‘Program Visible State’, written when all the parts of an x86 instruction have been completed, are relatively insignificant in the whole design.

The internal architecture here does have separate Instruction Cache and Data Cache. Having the two separate makes it possible, if necessary, to access both in each clock cycle. They each contain a local copy of the relevant parts of the level two cache, which itself contains a copy of a relevant part of the main memory outside the processor. Each cache reduces the number of accesses to the next level of memory outwards, improving performance. In general they also make wider accesses to the slower memories, reading and writing blocks, which again reduces the number of accesses and so improves performance.

In recent times, the trend has been for more of the same: deeper pipelines, from which an instruction takes more clock cycles to emerge. To balance this greater latency, because each stage of the pipeline is simpler, the clock can be made to go faster and the throughput goes up. However this trend is limited by the need to find more mutually independent things to do.

Currently the trend (which this department has been anticipating since I was a graduate student) is for several ‘cores’. Effectively each microprocessor contains several more-or-less independent processors sharing a common memory, probably by having accesses from the level two caches of each processor to a shared level three cache.

19SSE, streaming SIMD extension, is the Intel jargon for vector operations.

Programs running on each processor have to arrange not to interfere with each other by not sharing memory, and to co-operate by sharing memory in a disciplined way.


13 Floating point

Arithmetic on integers can be extended naturally to fixed precision rationals: arithmetic on millimetres amounts to arithmetic on fractions of a metre. However any increase in the fineness of the distinction between values is at the expense of the range of expressible values. Floating point numerals implement the idea of ‘scientific notation’ in binary. There used to be manufacturer-specific implementations which meant that calculations would give different results on different hardware, but there are now IEEE standards. In particular, an IEEE standard single-length floating point number is a 32 bit numeral divided into a sign bit, an 8 bit exponent field and a 23 bit significand

s (1 bit) | y (8 bits) | x (23 bits)

representing

(−1)^s × (1 + bin(x) × 2^−23) × 2^(bin(y)−127)

at least for 1 ≤ bin(y) ≤ 254 (excluding 0 and 255). For example

1.75₁₀ = 7/4 = 1.11₂ × 2^0    =   0 | 127₁₀ | 11000000000000000000000₂
2₁₀           = 1.0₂ × 2^1    =   0 | 128₁₀ | 00000000000000000000000₂
3.5₁₀ = 7/2  = 1.11₂ × 2^1    =   0 | 128₁₀ | 11000000000000000000000₂
5₁₀           = 1.01₂ × 2^2   =   0 | 129₁₀ | 01000000000000000000000₂

Although programmers casually talk about ‘real’ numbers, floating point numbers are rationals with a certain limited range of denominators. Just as with finite fractions, some numbers cannot be represented exactly; for example the single length number nearest to 1/3 is

0 | 125₁₀ | 01010101010101010101011₂ = 11184811/33554432 = ⌈2^25/3⌉ / 2^25

which is too big by 1/3 in the last place, that is 1/3 + 1/3 × 2^−23 × 2^(125−127) = 1/3 × (1 + 2^−25). But notice that amongst the numbers that are non-terminating binary fractions is 0.1₁₀, which has to be approximated by

0 | 123₁₀ | 10011001100110011001101₂ = 13421773/134217728 = ⌈2^27/10⌉ / 2^27
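As a check on the formula, here is a small Haskell function (nothing to do with the emulator) that decodes a normalised single-precision bit pattern; the reserved exponent fields 0 and 255 are deliberately not handled.

import Data.Bits (shiftR, (.&.))
import Data.Word (Word32)

-- Decode a single-precision pattern with 1 <= exponent field <= 254.
decode :: Word32 -> Double
decode w = sign * (1 + frac) * 2 ^^ (e - 127)
  where
    s    = fromIntegral (w `shiftR` 31)            :: Int
    e    = fromIntegral ((w `shiftR` 23) .&. 0xff) :: Int
    x    = fromIntegral (w .&. 0x7fffff)           :: Integer
    sign = if s == 0 then 1 else -1
    frac = fromIntegral x / 2 ^ 23

-- decode 0x3FE00000 == 1.75, the first example above.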


The largest number representable in this way is

+ | 254₁₀ | 11111111111111111111111₂ = 1.11111111111111111111111₂ × 2^(254−127) = (2 − 2^−23) × 2^127 = 2^128 − 2^104 ≈ 3.4 × 10^38

and the smallest positive number representable in this way

+ | 1 | 00000000000000000000000₂ = 1.0 × 2^(1−127) = 2^−126 ≈ 1.175 × 10^−38

There are 254 representable positive powers of two (with significand zero); the same number (2^23 − 1) of representable values are evenly spaced between them. Then there is a mirror image of this range for negative numbers. The smallest difference between values doubles at each power of two. But there is a big hole between 2^−126 and −2^−126; in particular zero is missing. In this gap are the denormalised numbers: numbers with exponent field zero represent

(−1)^s × (0 + bin(x) × 2^−23) × 2^−126

Note carefully: the final exponent in that number is −126, which is not 0 − 127. For example

+ | 0 | 10000000000000000000000₂ = 0.1₂ × 2^−126 = 2^−127

Notice that the numbers either side of 1 are

+ | 127₁₀ | 00000000000000000000001₂ = 1 + 2^−23
+ | 126₁₀ | 11111111111111111111111₂ = 1 − 2^−24

but that the numbers either side of 2^−126 are

+ | 1 | 00000000000000000000001₂ = (1 + 2^−23) × 2^−126
+ | 0 | 11111111111111111111111₂ = (1 − 2^−23) × 2^−126

and the numbers either side of zero are −2^−23 × 2^−126 and 2^−23 × 2^−126, so there is a gradual loss of relative precision from 2^−126 ≈ 1.175 × 10^−38 down to 2^−149 ≈ 1.4 × 10^−45, and of course zero is represented by exponent zero and significand zero (but there is a positive zero and a negative zero!)

Finally exponent 255 is reserved for exceptional values: ∞, −∞, and NaN (not a number) values used in a well-defined way to signify the results of calculations which go wrong.

The IEEE standard also specifies formats for other precisions of number:

            length   exponent   mantissa
half          16        5          10
single        32        8          23
double        64       11          52
quadruple    128       15         113

in each case with the smallest and largest exponents reserved, and the exponent biased by half the largest normal value.
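The pattern behind that table can be captured in a few Haskell definitions; this is a sketch for illustration (the function names are mine), computing the bias and the extreme positive values for a format with e exponent bits and m significand bits.

-- The exponent bias: half the largest exponent field value that denotes
-- a normalised number.
bias :: Int -> Int
bias e = 2 ^ (e - 1) - 1

-- Largest finite value, smallest normalised value and smallest
-- denormalised value for e exponent bits and m significand bits.
largest, smallestNormal, smallestDenormal :: Int -> Int -> Double
largest e m          = (2 - 2 ^^ negate m) * 2 ^^ (2 ^ e - 2 - bias e)
smallestNormal e _   = 2 ^^ (1 - bias e)
smallestDenormal e m = (2 ^^ negate m) * smallestNormal e m

-- largest 8 23 ≈ 3.4e38 and smallestNormal 8 23 ≈ 1.18e-38, as above;
-- largest 5 10 == 65504, the biggest half-precision number.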

13.1 Floating point addition

Addition or subtraction of floating point numbers can be done by aligning the two significands, adding them as if they were fixed point numbers, and fitting the answer back into the floating point format. For example, 1 3/8 + 7/8, in single precision,

0 | 127₁₀ | 01100000000000000000000₂  +

0 | 126₁₀ | 11000000000000000000000₂
=   { representation }
1.011₂ × 2^(127−127) + 1.11₂ × 2^(126−127)
=   { align significands }
1.011₂ × 2^(127−127) + 0.111₂ × 2^(127−127)
=   { fixed point addition }
10.010₂ × 2^(127−127)
=   { normalise }
1.001₂ × 2^(128−127)

= 0 | 128₁₀ | 00100000000000000000000₂

which is 2 1/4. Addition of numbers with different signs reduces to a fixed point subtraction, and subtraction of floating point numbers can be done by changing the sign of the second number and adding.


In general, in a format with d bits of significand field, we need an answer 1 ≤ x < 2 which we will round to 1 + d bits. If the exponents of the operands are very different the answer is essentially the same as the operand with the bigger exponent: this happens if the exponents differ by at least d + 2 because there is a guaranteed zero in the first bit to be discarded when adding (and two guaranteed ones in the first two bits to be discarded in case of subtractions). For exponents differing by less than this, we can do a (1 + 2d + 2) bit fixed point addition producing a (2 + 2d + 2) bit result. If the result is non-zero it is normalised and rounded to d bits (or fewer if the answer is in the denormalised range). In the case of subtractions, normalisation can involve a shift to the left by d places, for example (1 + 2^−d) − 1 = 1 × 2^−d. This is why floating point additions can cause a loss of significance.

The IEEE standard allows for rounding in various ways, but usually numbers are rounded to nearest, with ties broken by rounding to even (that is, a zero in the last place). Rounding can again produce a number that needs to be normalised a second time (but no more times than that). In fact, 1 + 2d + 2 bit adders are not needed. It suffices to have 1 + d + 2 + 1 bits (with another one to the left for carry out): 1 + d for the answer, two less significant digits to decide on the direction of rounding, and a final bit which is the logical or of all those bits that have been shifted to a lower significance.

13.2 Dire warning

Floating point arithmetic is not pleasant: for one thing, it is not associative. In single precision 1 + 2^−24 = 1, so (1 + 2^−24) + 2^−24 = 1 + 2^−24 = 1, but 1 + (2^−24 + 2^−24) = 1 + 2^−23 ≠ 1. More confusingly, −x is not the same as 0 − x; in particular the two are not the same even when x = 0. These properties are so deeply ingrained in the way you usually do arithmetic that you have to be very careful not to rely on them.
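You can watch the loss of associativity happen in a Haskell interpreter with the Float type (this is nothing to do with the emulator; it is just the IEEE single-precision arithmetic of the host):

tiny :: Float
tiny = 2 ** (-24)

leftSum, rightSum :: Float
leftSum  = (1 + tiny) + tiny   -- 1 + 2^-24 rounds back to 1, twice over
rightSum = 1 + (tiny + tiny)   -- 2^-24 + 2^-24 = 2^-23, and 1 + 2^-23 is representable

-- In ghci, leftSum == 1.0 is True while rightSum == 1.0 is False.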

13.3 Floating point multiplication and division

Multiplication is much more straightforward.

((1 + x) × 2^(a−k)) × ((1 + y) × 2^(b−k)) = ((1 + x) × (1 + y)) × 2^((a+b−k)−k)

The result may need renormalising because 1 ≤ (1 + x) × (1 + y) < 4 and the normalised value may require rounding.

One way of dealing with division would be to do a long division but it suffices to reduce it to a multiplication and a reciprocal operation: x/y = x × (1/y) and the reciprocal can be calculated by an iterative method such as the Newton-Raphson20 method for finding roots of differentiable functions.

(Figure: one Newton-Raphson step, in which the tangent to f at x_n crosses the axis at x_(n+1).)

Let f(x) = y − 1/x; then f has a root at x = 1/y. The Newton-Raphson iteration starts from an initial guess x_0 and calculates x_(i+1) = x_i − f(x_i)/f′(x_i) = (2 − y·x_i)·x_i. Given a good enough first guess, this sequence converges rapidly on 1/y so a small fixed number of iterations can be used. Some fast division methods (for example the Sweeney, Robertson, and Tocher algorithm used by the ) involve accumulating successive bits of the result by looking in a table indexed by bits of the remainder (calculated by multiply and add).
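The first few terms of the iteration are easy to compute in Haskell; this is only a sketch of the idea with an arbitrary starting guess, not a description of how any particular floating point unit chooses its first approximation.

-- Successive Newton-Raphson approximations to 1/y, starting from x0.
recips :: Double -> Double -> [Double]
recips y x0 = iterate (\x -> (2 - y * x) * x) x0

-- take 5 (recips 3 0.5) gives 0.5, 0.25, 0.3125, 0.33203125, ...
-- converging rapidly on 1/3.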

Exercises

13.1 The IEEE half-precision floating point representation uses a sign bit, a five bit exponent and a ten bit significand packed in a sixteen bit word.

1. What are the representations of 1/2, 7/8, 1, 1 1/2, 2 and 6 1/2?
2. What are the closest representable numbers to 1/5 and 3/5?
3. What are the smallest and largest positive numbers representable in normalised form?
4. What is the smallest positive denormalised number?

20Named for the Cambridge mathematician Joseph Raphson (approx 1648-1715), and Sir Isaac Newton (1642-1727) who had known about it earlier but had not published it.


5. What is the smallest positive ε for which 1 + ε is representable?

13.2 In the C programming language, single length floating point variables are of type float. Why do you imagine that the C program

main() {
    float x, q;
    x = 3.0;
    q = x/7.0;
    putchar(q == x/7.0 ? 't' : 'f');
}

prints an ‘f’? What do you think is C’s strategy for evaluating floating point valued expressions? (Hint: if the type of q is changed to double, it prints ‘t’.)

13.3 Use a Haskell interpreter (such as hugs or ghci) to evaluate the first few terms in the Newton-Raphson sequence for evaluating 1/3. Let the error in successive terms of the Newton-Raphson iteration be ε_i = 1/y − x_i. Show that ε_i converges to zero, for some range of initial guesses x_0; and explain why the number of correct digits in the terms of the sequence grows in the way that it does. Show how to get a first guess close enough for rapid convergence (try to deal with 1/2 ≤ y < 1 first, then consider other y) and estimate the number of iterations needed to get the right answer in single and in double precision.

13.4 Goldschmidt division21 is an iterative algorithm like Newton-Raphson. To divide n by d, where 1/2 < d ≤ 1, first construct the sequence d_0 = d, d_(i+1) = d_i(2 − d_i); from this construct the sequence n_0 = n, n_(i+1) = n_i(2 − d_i). Assuming exact arithmetic, show that n_i converges to the quotient n/d. How quickly does it converge? Which is the first term correct to b bits? Compare this with the long division algorithm. How would you use the algorithm to divide by a denominator not in that range?

13.5 The IBM variant of Goldschmidt division (exercise 13.4) also works for 1/2 < d ≤ 1. Construct the sequence p_0 = (1 − d), p_(i+1) = p_i^2, n_0 = n, n_(i+1) = n_i(1 + p_i). Assuming exact arithmetic, show that n_i = (n/d)(1 − p_i) and so that it converges to the quotient n/d. How quickly does it converge? Which is the first term correct to b bits?

21Robert Elliott Goldschmidt, in his MSc dissertation at MIT in 1964.


13.6 A certain kind of minifloat number has four bits of exponent and three bits of mantissa. If it follows the pattern of IEEE 754 numbers, exactly what range of numbers can be represented, to what relative precision? Suppose we want to represent integers to about one significant decimal digit of precision. How might these minifloat numerals best be used? How do they compare with the usual binary representation of integers?

13.7 Since floating-point addition is not associative the functions

foldr (+) 0 :: [Float] → Float
foldl (+) 0 :: [Float] → Float

are not equal. How might you get the best approximation to the sum of the first n terms of a series with terms of rapidly decreasing magnitude? Does it matter whether the terms all have the same sign, or alternate?


14 Structured data

What is an array? The MIPS code in section 10 used consecutive words of memory to contain an array of integers, indexed from zero. More generally, an array might have bigger (or smaller) elements, and might be indexed from something other than zero. If it is necessary to check (because the compiler cannot prove) that accesses are in range, we also need to know the upper bound of the index. The data, naturally, is usually not known to the compiler. The indexing calculation might be fixed at compile time: but then for an array parameter to a procedure it might not be. The type of element of an array is part of the type of the array, but whether or not the sizes and bounds are is a matter of language design.

Since the elements of the array have a fixed type, and so occupy a fixed amount of storage, it is possible to identify the location occupied by an element identified by a run-time variable index by doing an indexing calculation. In the original Pascal language the type of an array included its bounds and size, so a particular type of multidimensional array could be allocated a single flat area of memory with a fixed indexing calculation. Frustratingly, in this language it is not possible to write a procedure which can then be applied to different sized arrays. Extensions of Pascal such as Oberon allow open arrays of type array of T as parameters. If a procedure takes an open array parameter, the value passed at run time must contain the bounds as the indexing calculation (and bounds checking) depends on these values. The extra data is sometimes called the dope vector of the array.22 Since the code handling array parameters has to be able to manage arrays by dope vector, arrays might well be represented by a dope vector even when that dope vector is constant. Languages like C which do not perform bounds checking on arrays can avoid passing bounds information, but only by requiring the sizes of all dimensions except the first to be fixed.

Often, large objects such as arrays are not allocated on the stack. This is to keep the stack frames small, so that accesses to components of the stack frame can be close to a single pointer, and so cheap.

22‘Dope’ in the 20th century US slang sense of information not widely disseminated.
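To illustrate what a dope vector contains and what the indexing calculation looks like, here is a toy Haskell version for a row-major multidimensional array; the field names and the decision to store (low bound, extent) pairs are my own choices, not those of any particular compiler.

-- A dope vector for a row-major array: base address, element size in
-- bytes, and a (low bound, number of elements) pair for each dimension.
data Dope = Dope { base :: Int, elemSize :: Int, dims :: [(Int, Int)] }

-- The address of an element, with a bounds check on every subscript.
address :: Dope -> [Int] -> Int
address d subscripts
  | length subscripts /= length (dims d)     = error "wrong number of subscripts"
  | any outOfRange (zip subscripts (dims d)) = error "subscript out of range"
  | otherwise                                = base d + elemSize d * offset
  where
    outOfRange (i, (lo, ext)) = i < lo || i >= lo + ext
    offset = foldl (\acc (i, (lo, ext)) -> acc * ext + (i - lo)) 0
                   (zip subscripts (dims d))

-- address (Dope 1000 4 [(0,3), (0,5)]) [2,4] == 1000 + 4 * (2*5 + 4) == 1056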


14.1 Pointers

We have already seen how to keep a bounded stack in an array; here it is described at a higher level of abstraction:

var i : N, as : array n of T ;

proc empty
  begin i := 0 end;

proc push(val x : T ) — precondition: i < n
  begin as[i] := x; i := i + 1 end;

proc pop(var x : T ) — precondition: i > 0
  begin i := i − 1; x := as[i] end

The stack pointer in this implementation points at the next free location in the array, and the stack grows ‘upwards’.

(Diagram: the array as, with as[0] up to but not including as[i] in use, and the rest up to as[n] free.)

The variable i is used only to index as, and to calculate values of i, and the array is used only when indexed by i. A compiler might well translate this into machine code which does not do the indexing: on MIPS you might replace i by a variable containing the address of as[i]. The C programming language and its derivatives allow you to express this as a pointer to an object of the type of the elements of as.

var p : ref T , as : array n of T ;

proc empty
  begin p := addr(as[0]) end;

proc push(val x : T ) — precondition: p < addr(as[n])
  begin p↑ := x; p := p + 1 end;

proc pop(var x : T ) — precondition: p > addr(as[0])
  begin p := p − 1; x := p↑ end

In C the type ref T of pointers (references) to T is written ∗T , the addresses addr(as[0]) and addr(as[n]) of as[0] and as[n] would be written &(as[0]) and &(as[n]) (or more idiomatically, as and as + n) and the dereferencing operator p↑ is written ∗p. Be careful reading this code: the addition of a number to a pointer in C means adding on that number multiplied by the size of the referenced objects. So, in MIPS where the memory is byte addressed, if the array is an array of word-sized integers, p := p + 1 adds four onto p, and p := p − 1 takes four away.

(Diagram: p points somewhere within the array as, between addr(as[0]) and addr(as[n]).)

The p↑ notation comes from languages related to Pascal, although it usually appears as p^. In Pascal the type ref T is written pointer to T . However there is nothing equivalent to addr(as[i]); you are not in general allowed to make pointers to existing variables.

Here is an implementation, in the style of the bounded stack, of a queue as a circular buffer kept in an array.

var w : N, r : N, c : N, as : array n of T ; — invariant: (r + c) mod n = w

proc empty
  begin r, w, c := 0, 0, 0 end;

proc enqueue(val x : T ) — precondition: c < n
  begin as[w] := x; w, c := (w + 1) mod n, c + 1 end;

proc dequeue(var x : T ) — precondition: c > 0
  begin c := c − 1; x := as[r]; r := (r + 1) mod n end

This time w points at the next location to be written, r the next location to be read; and c is the number of queued items. You do not need all three of these variables, because of the invariant; however you cannot do with just r and w because c takes n + 1 different values, but (w − r) mod n takes only n. The symmetrical solution seems neater, even though it uses more variables.

If you wanted to rewrite this in terms of pointers, it would be straightforward except for the mod n operations. r := (r + 1) mod n has to turn into something like

r := r + 1; if r = addr(as[n]) then r := addr(as[0]) end

and you would probably keep a pointer to as[n] for efficiency.

Exercise

14.1 Translate the circular buffer code to use pointers.

14.2 Pointers to stack variables

There are several reasons for Pascal-like languages to forbid pointers to existing variables.

One is that it makes a second name for the variable, and breaks one of the fundamental properties of assignment which makes it easy to reason about programs. In a Pascal program you can (almost23) know that an assignment x := ... does not change y. In this respect writing C programs is rather like writing machine code programs: in isolation, assigning to the target of a pointer might change anything, just as a MIPS store word instruction can change anything which has its representation in the store (including the instructions of the program).

More significantly, pointers can be returned from procedures. This means that a pointer to a variable on the stack can be returned outside the scope of the variable. That is, it can be returned to code which runs when that variable is no longer on the stack.

proc dangling : ref T
  var x : T ;
  begin return addr(x) end

The value returned by a call of dangling is the address of a variable that was on the stack during the execution of the procedure, but which is not there any more. It is the address of a location in the unused part of the stack, which may or may not be being used for something else at any time in the future when the pointer is used.

14.3 Access vectors

Pointers offer an alternative way of representing multidimensional arrays. An n+1 dimensional array can be represented by a one dimensional array of pointers to n dimensional arrays. These arrays of pointers are known as access vectors, displays

23In Pascal a call to a procedure with a var parameter passes a pointer to the actual param- eter, so within the body of the procedure the formal parameter names the same variable as the actual parameter.

or Iliffe vectors.24 Access to an element of an n dimensional array represented in this way, rather than requiring n − 1 multiplications by elements of a dope vector, requires three multiplications by constants, and two pointer dereferencing steps. Some operations on access vector arrays are particularly cheap: for example, exchanging two rows of a two dimensional array becomes the swapping of two pointers. Irregular arrays, for example triangular arrays, can be implemented without wasting space. However, the pattern of accesses to memory can be much less predictable, reducing the ability of clever compilers to prefetch memory. In contemporary machines the processor is so much faster than the memory that the time taken to do address calculations can be hidden in the latency of the memory by predicting which locations will be needed next.
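In Haskell one can model an access-vector representation as an array of row arrays (a sketch using Data.Array; the type names are mine), which makes the cheap row exchange obvious:

import Data.Array

type Row    = Array Int Double
type Matrix = Array Int Row      -- the access vector: one entry per row

-- Exchanging two rows touches only the access vector, not the rows.
swapRows :: Int -> Int -> Matrix -> Matrix
swapRows i j m = m // [(i, m ! j), (j, m ! i)]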

14.4 Records: products and sums

Records make the opposite trade-off between run-time and compile-time flexibility. In order to allow the components of a structure to have different types, but still to identify the type of a selected component, it must be selected by a compile time constant. The code for accessing records and arrays will look much the same except that record indexes are constants known at compile time. Record types are product types: the values of a record type with fields of types T and U are pairs drawn from the Cartesian product T × U.

record fst : a; snd : b end

corresponds to the Haskell type

data Pair a b = Pair {fst :: a, snd :: b}

The record needs room for both an a and a b, at fixed offsets from the address of the record.

(Diagram: the record laid out in memory, the fst field of type a followed by the snd field of type b.)

The (disjoint) sum T + U is the type of what in Pascal is called a variant record, and in C a union. The Haskell type

data Either a b = Left a | Right b

24John Kenneth Iliffe, of MIT and ICL, was director of the ICL distributed array processor project.

is the sum of a and b. If we added selector functions

data Either a b = Left {left :: a} | Right {right :: b}

corresponds to

record
  case tag : (Left, Right) of
    Left : (left : a)
    Right : (right : b)
end

The record needs space for a tag, and at an offset from that a field big enough for the bigger of an a and a b; the left and right fields lie at the same offset from the address of the record.

[Diagram: a variant record laid out as a tag field followed by a single payload field; the left field (an a) and the right field (a b) occupy the same offset after the tag.]

Pascal allows you to omit the tag from the actual data structure; but then the program has to keep track of which variant of the data is stored in the record by some other means.
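The space calculations behind these layouts can be sketched in a few lines of Haskell. The type Ty and the sizes below are illustrative assumptions, not anything defined in the notes:

data Ty = IntTy | BoolTy | Prod Ty Ty | Sum Ty Ty

-- Illustrative sizes in bytes, with a one-byte tag for sums.
size :: Ty -> Int
size IntTy      = 4
size BoolTy     = 1
size (Prod t u) = size t + size u                -- room for both fields
size (Sum t u)  = 1 + max (size t) (size u)      -- a tag, then the larger variant

So size (Prod IntTy IntTy) is 8, but size (Sum IntTy IntTy) is only 5: the two variants of the sum share the space after the tag.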

14.5 Recursive types

Type definitions in Haskell can be recursive, for example

data Heap a = Fork (Heap a) a (Heap a) | Nil

for a binary heap. You have already seen the idea of storing such a heap in an array: the label of the root goes at a[1], and the labels of the children of the node at a[i] go at a[2i] and a[2i + 1]. That works well for heaps, because they are almost complete binary trees. For less complete trees, the size of the array needs to be much bigger than the size of the tree.

Suppose that the array as[1 .. n), that is as[i] for 1 ≤ i < n, contains the labels of a binary heap.

abs as i n | i ≥ n = Nil
           | i < n = Fork (abs as (2 ∗ i) n) (as ! i) (abs as (2 ∗ i + 1) n)

returns the Heap a that is rooted at i. (Operations on Haskell arrays are defined in the module Data.Array, and arrays are indexed by the (!) operator, so as ! (2 ∗ i) and so on.)

In general it would be better to have an array of records, with a field for the left and right subtrees, rather than using 2i and 2i + 1.

abs as i n | i ≥ n = Nil
           | i < n = Fork (abs as (left (as ! i)) n) (value (as ! i)) (abs as (right (as ! i)) n)

returns the Heap a rooted at i. Of course there is still the question of which part of the array to use when adding a new node to the tree, and how big an array to allow for a given tree. Most programming languages with pointers allow you to share a single anonymous (and otherwise inaccessible) array between all the structures in the program.

Languages like Pascal have a function new. A call of new(p) on a pointer p of type ref T allocates a new chunk of memory big enough to contain a T, and makes p point at it. Here is a Pascal-like record which can be used to represent a cell for a list like the Haskell list of type [T ]:

type R = record head : T ; tail : ref R end

[Diagram: a cell of type R, with a head field holding a T and a tail field holding a ref R.]

A pointer of type ref R represents a list; every non-empty list is represented by a pointer to a record of type R, and the empty list is represented by a special pointer nil which is guaranteed not to be equal to any pointer to a record. (The value nil is polymorphic and can be made to be a pointer of any type.) Consing a T onto such a list involves allocating a new cell.

[Diagram: a list xs whose cells hold xs0, xs1 and xs2, each tail pointing at the next cell and the last tail marking the end of the list.]

proc cons(x : T ; xs : ref R) : ref R
  var p : ref R;
  begin new(p); p↑.head, p↑.tail := x, xs; return p end

[Diagram: after p := cons(x, xs), the new cell holds x and its tail points at the cell holding xs0; the old list xs is unchanged.]
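The same cons cells can be modelled in Haskell (a sketch under our own naming: Ptr, Cell, hd, tl and cons are not part of the notes), with an IORef standing in for a Pascal pointer and Nothing standing in for nil:

import Data.IORef

type Ptr t  = Maybe (IORef (Cell t))        -- nil is Nothing
data Cell t = Cell { hd :: t, tl :: Ptr t }

-- cons allocates a fresh cell, as new(p) does in the Pascal version,
-- and makes its tail the existing list.
cons :: t -> Ptr t -> IO (Ptr t)
cons x xs = do
  p <- newIORef (Cell x xs)
  return (Just p)

Allocation is implicit in newIORef; in Haskell the reclamation of unreachable cells is left to a garbage collector of the kind described in section 14.7.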

Using this representation, we could implement a potentially unbounded variant of the bounded stack from page 131

var p : ref R;
proc empty begin p := nil end;
proc push(val x : T ) begin p := cons(x, p) end;
proc pop(var x : T ) — precondition: p ≠ nil
  begin x, p := p↑.head, p↑.tail end
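A Haskell model of the same stack (again a sketch: it repeats the Ptr and Cell types from the cons-cell sketch above, and newStack, push and pop are our names):

import Data.IORef

type Ptr t  = Maybe (IORef (Cell t))
data Cell t = Cell { hd :: t, tl :: Ptr t }

newStack :: IO (IORef (Ptr t))
newStack = newIORef Nothing                  -- empty: p := nil

push :: t -> IORef (Ptr t) -> IO ()
push x st = do
  p  <- readIORef st
  p' <- newIORef (Cell x p)                  -- cons a fresh cell on the front
  writeIORef st (Just p')

pop :: IORef (Ptr t) -> IO t                 -- precondition: the stack is not empty
pop st = do
  Just r    <- readIORef st
  Cell x xs <- readIORef r
  writeIORef st xs
  return x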

A program using this stack many times will consume more and more of the memory available for allocation by new. The cell occupied by a popped value becomes inaccessible, but the memory could be reused. One solution would be a procedure free which would make it available again.

proc pop(var x : T ) — precondition: p ≠ nil
  var q : ref R;
  begin x, q := p↑.head, p↑.tail; free(p); p := q end

Exercises

14.2 Define the abstraction function abs :: ref R → [T ] for lists implemented by cons cells.

14.3 Define a record type suitable for a pointer representation of values of type Heap T

data Heap a = Fork (Heap a) a (Heap a) | Nil

and define an abstraction function abs and a procedure fork so that

abs(fork(l, x, r)) = Fork (abs l) x (abs r)


14.6 Heap allocation

The support for new and free has to keep track of which part of the memory is available to be allocated. The very simplest implementation would keep a pointer (call it a heap pointer) to the first unallocated location, and implement new by moving this pointer on:

var hp : ref byte;
proc new(var p : ref R) begin p := hp; hp := hp + size(R) end;
proc free(val p : ref R) begin end

(There is something suspicious about the types of new and free: they have to work for pointers to any R, so they cannot in fact be defined in a language like Pascal.) This implementation of free does nothing to reclaim freed memory: long-running programs which allocate memory will eventually run out of heap, even though they never have much of it which has not been freed.

We have already seen memory allocation on a stack. This is a special case in which allocated memory is released in a strictly last-in first-out order, and it is enough to keep a pointer (the stack pointer) to the edge of the tide as it sweeps in and out. This works because there is a static pairing of allocation (at, say, entry to a procedure) and deallocation (at, say, exit from the same procedure). Stack allocation is particularly efficient, but also particularly confining.

The implementation of a stack using new and free is almost as easy to support: because we are implementing a stack, the calls of new and free will come in order. But, in general, a program may legitimately call free on any piece of allocated memory. As a result 'holes' can appear in the allocated memory, and for efficient use of memory these holes must become available for reallocation. A heap pointer is not enough.

If the memory is always allocated in chunks of the same size (suppose, as in Lisp, that the only allocated unit is the cons cell) then it is enough to keep a list (the free list) of these chunks, chained together by storing the address of the next in the first word of each. Either the whole heap memory is carved up into cells at the beginning, to make a free list; or the free list starts empty and you keep a heap pointer as well. Calls to new remove a cell from the free list and return it for use. If the free list becomes exhausted, a new cell is carved off the memory that has never yet been allocated. Calls to free push the released cell onto the free list.

More generally, memory may be allocated in different-sized chunks. Now allocation and release are harder: it would be possible to keep several free lists, by size of chunk, or to keep the free list sorted by size of chunk, but doing this might cause unnecessary fragmentation. It might be that when a large chunk is required there are no already-freed chunks of that size, but there are several contiguous small chunks which, if they were amalgamated, would together be big enough. For this reason, you would probably want to keep the free list in increasing address order. (Free chunks cannot overlap, so this is well defined.) When a chunk is freed, the right position for it in the free list is found, and if it is contiguous with a free chunk on either side (or both sides) they are amalgamated. Free chunks will have to record not only the address of the next free chunk, but also the size of the current chunk.

There is a small detail: how to deal with allocating and freeing chunks of memory smaller than needed to record both an address and a size.
This can be done by having a minimum allocation size (two words, say), on the grounds that relatively little memory is wasted if relatively few chunks smaller than that are allocated. Alternatively, a list of single words might be kept separately from the general free list.

Allocating a chunk of memory from a free list of variable-sized chunks needs a policy decision: which chunk to allocate? Should a scan be made through the list until a free chunk of exactly the right size, if there is one, is found? This would reduce the extent of fragmentation of free space, at the expense of searching the whole free list. Or should the chunk come out of the first big-enough hole? Secondly, which chunk should be allocated if there is no exact fit: should it be the smallest one that is too big, or should it be carved out of a larger one? Time efficiency and space efficiency are often in conflict.

Some designs allocate memory in sizes of powers of two of words: when there is not a small enough free chunk, a bigger one is split in half. Chunks are only ever amalgamated with their original pairs (known as buddy blocks). In some circumstances the waste of space (which is never more than 50%) is compensated for by the efficiency of managing the space.
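A free list of variable-sized chunks kept in increasing address order can be modelled abstractly in Haskell. This is a sketch only (addresses and sizes are plain Ints, and allocate and release are our names); it shows first-fit allocation and the amalgamation of contiguous chunks on freeing:

type Addr = Int
type Size = Int
type FreeList = [(Addr, Size)]     -- disjoint free chunks, sorted by address

-- First fit: take the first chunk that is big enough, returning any
-- remainder of it to the free list.
allocate :: Size -> FreeList -> Maybe (Addr, FreeList)
allocate n [] = Nothing
allocate n ((a, s) : rest)
  | s >= n    = Just (a, remainder)
  | otherwise = case allocate n rest of
                  Nothing       -> Nothing
                  Just (a', fl) -> Just (a', (a, s) : fl)
  where remainder | s == n    = rest
                  | otherwise = (a + n, s - n) : rest

-- Freeing re-inserts a chunk in address order, amalgamating it with a
-- contiguous neighbour on either side.
release :: (Addr, Size) -> FreeList -> FreeList
release c [] = [c]
release c@(a, s) (d@(b, t) : rest)
  | a + s == b = (a, s + t) : rest           -- abuts the chunk after it
  | b + t == a = release (b, t + s) rest     -- abuts the chunk before it
  | a < b      = c : d : rest
  | otherwise  = d : release c rest

A best-fit policy would instead scan the whole list for the smallest chunk that is big enough, trading time for less fragmentation.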

14.7 Garbage collection

In general it can be hard to tell at compile time when it is safe to free a chunk of memory. In a program which represents a data structure by a directed acyclic graph, a single record might be pointed at by several pointers. This sharing may change dynamically, and it might not be possible to tell, when the value of a pointer is changed, whether that was the last pointer to its target.

A solution to this problem is reference counting. Every allocated chunk of memory contains, in addition to the visible data, a count of the number of pointers by which it is accessible. The program has to be compiled so that this count is kept current. Every time a pointer variable is changed, and every time a pointer variable goes out of scope, the relevant reference counts must be updated. Explicit calls of free are not needed, but whenever a reference count is reduced to zero the

memory to which it is attached is automatically freed, having first reduced the reference counts of the targets of every pointer from that chunk of memory. This clearly requires that the compiler know what type of object is pointed at by every pointer. More subtly, it also requires that every pointer variable, and every pointer field in a new record, be initialised (to, probably, nil) when it is allocated.

Reference counting is fine for directed acyclic graphs, but fails for cyclic structures. If a structure contains a pointer or pointers to itself, it may become inaccessible without all the pointers to it having been removed. The solution to this problem is garbage collection, which was invented by John McCarthy.²⁵

In a garbage-collecting allocator, memory is allocated from a free list in the usual way, and no explicit freeing or reference counting is needed. When the free list is eventually exhausted, computation stops. A depth-first search is performed from every pointer on the stack, which visits all of the accessible chunks of memory, and marks them. Anything which is not marked is, by definition, free. A sweep through memory in address order can then identify all the free memory and construct a free list from which the required memory can then be allocated. In the course of that sweep, all the marks from the garbage collection can also be cleared in preparation for the next round of garbage collection.

In order to do the freeing it must be possible to tell from the allocated memory how big the chunks are. In order to identify all the pointers on the stack it must be possible to tell from the structure of the stack where the pointers are (and whether there are any in registers). In order to perform the depth-first search it must be possible to tell, either from the types of the starting pointers or directly from a tag on the records which are pointer targets, where the pointers in each record are. (Many of these things are easier in a homogeneous heap, like that of Lisp.)

As described, garbage collection causes an occasional but potentially substantial pause in the running of a program. The mark phase takes time roughly proportional to the amount of memory in active use; the sweep phase takes time roughly proportional to the size of the heap. It is possible to run incremental collectors, where the mark phase is atomic but the sweep phase is spread out over subsequent calls of new. It is also possible, with limits on the operations allowed in the program and at the expense of extra work, to run a collector concurrently with the program.
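The mark phase can be sketched as a depth-first search over an abstract picture of the heap. In this sketch (a model, not an implementation: Graph, Roots, mark and sweep are our names, and a chunk is represented only by the list of pointers it contains), mark computes the set of reachable addresses and sweep lists everything else as free:

import qualified Data.Map as Map
import qualified Data.Set as Set

type Addr  = Int
type Graph = Map.Map Addr [Addr]   -- the pointers stored in each allocated chunk
type Roots = [Addr]                -- pointers found on the stack (and in registers)

-- Depth-first search from the roots: everything visited is marked live.
mark :: Graph -> Roots -> Set.Set Addr
mark heap = foldl visit Set.empty
  where
    visit seen a
      | a `Set.member` seen = seen
      | otherwise           = foldl visit (Set.insert a seen)
                                    (Map.findWithDefault [] a heap)

-- Everything allocated but not marked is, by definition, free.
sweep :: Graph -> Set.Set Addr -> [Addr]
sweep heap live = [a | a <- Map.keys heap, not (a `Set.member` live)]

The sketch makes the requirements of the previous paragraph concrete: mark needs to find the pointers inside each chunk, and sweep needs to enumerate every allocated chunk.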

²⁵ John McCarthy (1927–2011) was a Stanford scientist who coined the term artificial intelligence. He designed Lisp while at MIT in 1958. Had the technique been invented on this side of the Atlantic, no doubt the refuse would be collected by a dustman.


14.8 Serialisation

Writing structured data to a file and reading them back in, or transmitting structured data over a network, requires the data to be represented in a linear form. This process is variously called serialisation, marshalling, or linearisation. Memory addresses in the representation of the structure make no sense when the data are moved out of the memory, or back into another memory, so pointers have to be dealt with specially.

Suppose that it is possible to represent all component types of a polynomial type T a b . . . in a way that makes it possible to detect the end of the representation of a value, for example if integers are represented by a known number of bits. Then values of the type T a b . . . can be represented by a preorder traversal which represents a value by a representation of its tag followed by the representations of its components. For example, for

data Heap a = Fork (Heap a) a (Heap a) | Nil

serial (Fork left value right) = 0 : serial left ++ serial value ++ serial right
serial Nil = [1]

The Haskell idiom would be to declare a class

class Serial a where
  serial :: a → [Bit]

then

instance Serial a ⇒ Serial (Heap a) where
  serial (Fork left value right) = 0 : serial left ++ serial value ++ serial right
  serial Nil = [1]

This is similar to the textual representation produced by the show function which Haskell provides when a structured type is declared with deriving Show. That creates a human-readable linear representation which includes the spellings of the names of the tags. The reconstruction of an object from its linear representation is similar to the Haskell parsing function read. (It is a special case where the first element(s) of the input unambiguously identify the structure of the parse.)

class Serial a where
  serial :: a → [Bit]
  build :: [Bit] → (a, [Bit])


Here the function build returns the first object in the bit sequence, and the remaining unused bit sequence; so

build (0 : xs) = (Fork left value right, rs)
  where (left, ps) = build xs
        (value, qs) = build ps
        (right, rs) = build qs
build (1 : xs) = (Nil, xs)

This representation generalises to finite objects of recursive types in languages with explicit pointers. A tree is represented in linear form by a traversal which records punctuation between the components. However, serialising a directed acyclic graph in this way would lose information about sharing: the representation would always be the same as that for a tree. Worse than that, the representation produced for a general finite graph might well be infinite, if the graph expands into an infinite tree.

To deal with this problem, a depth-first search of the (finite) graph can identify a finite tree to be represented. All nodes which are the targets of cross edges or back edges have to be substituted for by a new identifier. In this way the graph is broken into a finite sequence of finite trees, together with a mapping from identifiers to nodes. (The identifier could be a representation of a path in the tree, or some arbitrary label such as the discovery time in the depth-first search.) This can be linearised and then reconstructed, and a representation of the graph rebuilt from it by substituting pointers for the temporary identifiers.

More generally you might want to check that the type of a value being rebuilt from a linear representation is what you expected to read. That would require the linear representation to include a representation of the type, which raises the question of when two types are the same. Perhaps the easiest thing to do here is to say that types are equivalent if they are structured types made of the same operators applied to the same base types, for some small collection of operators (such as sums and products) and some small collection of base types. This is structural equality of types, and may not match the idea of type equality in a programming language. Putting the text of the names of types into the linear representation would allow checking for name equality of types, but again this might not be quite what you expect. Two different types with the same name can exist in different scopes even in the same program.
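A complete round trip can be checked in a few lines. The sketch below assumes type Bit = Int and adds an illustrative Serial Bool instance (neither is part of the notes); the Heap instance simply combines the serial and build equations given above:

type Bit = Int

class Serial a where
  serial :: a -> [Bit]
  build  :: [Bit] -> (a, [Bit])

data Heap a = Fork (Heap a) a (Heap a) | Nil deriving (Eq, Show)

instance Serial Bool where
  serial b         = [if b then 1 else 0]
  build (b : rest) = (b == 1, rest)

instance Serial a => Serial (Heap a) where
  serial (Fork l v r) = 0 : serial l ++ serial v ++ serial r
  serial Nil          = [1]
  build (0 : xs) = (Fork l v r, rs)
    where (l, ps) = build xs
          (v, qs) = build ps
          (r, rs) = build qs
  build (1 : xs) = (Nil, xs)

For any finite h :: Heap Bool, for example Fork Nil True (Fork Nil False Nil), build (serial h) should give back (h, []).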

Exercises

14.4 Write instances of Serial for Int and the type of (finite) lists, use them to turn a list of numbers into a bit string and back again. Explain the encoding

of the list. (You can assume that you know the number of bits in an Int, and build that into your code.)

14.5 Assuming that there are no pointers involved, how might you provide serial representations of the products (records) and sums (variant records) in a language like Oberon? What about arrays? Now, allow pointers to appear in these data structures, but consider only data structures that represent finite trees. How would you represent these?
