COMPUTER ARCHITECTURE s

Theory - Sum Up

{Part francesco LONGO Appendix: winMIPS64 Instruction Set

WinMIPS64 beq - branch if pair of registers are equal The following assembler directives are supported bne - branch if pair of registers are not equal .data - start of data segment beqz - branch if register is equal to zero .text - start of code segment bnez - branch if register is not equal to zero .code - start of code segment (same as .text) .org - start address j - jump to address .space - leave n empty bytes jr - jump to address in register .asciiz - enters zero terminated ascii string jal - jump and link to address (call subroutine) .ascii - enter ascii string jalr - jump and link to address in register (call subroutine) .align - align to n-byte boundary .word ,.. - enters word(s) of data (64-bits) dsll - shift left logical .byte ,.. - enter bytes dsrl - shift right logical .word32 ,.. - enters 32 bit number(s) dsra - shift right arithmetic .word16 ,.. - enters 16 bit number(s) dsllv - shift left logical by variable amount .double ,.. - enters floating-point number(s) dsrlv - shift right logical by variable amount dsrav - shift right arithmetic by variable amount where denotes a number like 24, denotes a string movz - move if register equals zero like "fred", and movn - move if register not equal to zero ,.. denotes numbers seperated by commas. nop - no operation and - logical and The following instructions are supported or - logical or lb - load byte xor - logical xor lbu - load byte unsigned slt - set if less than sb - store byte sltu - set if less than unsigned lh - load 16-bit half-word dadd - add integers lhu - load 16-bit half word unsigned daddu - add integers unsigned sh - store 16-bit half-word dsub - subtract integers lw - load 32-bit word dsubu - subtract integers unsigned lwu - load 32-bit word unsigned sw - store 32-bit word add.d - add floating-point ld - load 64-bit double-word sub.d - subtract floating-point sd - store 64-bit double-word mul.d - multiply floating-point l.d - load 64-bit floating-point div.d - divide floating-point s.d - store 64-bit floating-point mov.d - move floating-point halt - stops the program cvt.d.l - convert 64-bit integer to a double FP format cvt.l.d - convert double FP to a 64-bit integer format daddi - add immediate c.lt.d - set FP flag if less than daddui - add immediate unsigned c.le.d - set FP flag if less than or equal to andi - logical and immediate c.eq.d - set FP flag if equal to ori - logical or immediate bc1f - branch to address if FP flag is FALSE xori - exclusive or immediate bc1t - branch to address if FP flag is TRUE lui - load upper half of register immediate mtc1 - move data from integer register to FP register slti - set if less than or equal immediate mfc1 - move data from FP register to integer register sltiu - set if less than or equal immediate unsigned

Classic RISC - Wikipedia, the free encyclopedia 19/01/16 11:23 Classic RISC pipeline From Wikipedia, the free encyclopedia

In the history of computer hardware, some early reduced instruction set computer central processing units (RISC CPUs) used a very similar architectural solution, now called a classic RISC pipeline. Those CPUs were: MIPS, SPARC, , and later the notional CPU DLX invented for education.

Each of these classic scalar RISC designs fetched and attempted to execute one instruction per cycle. The main common concept of each design was a five-stage execution instruction pipeline. During operation, each pipeline stage would work on one instruction at a time. Each of these stages consisted of an initial set of flip- flops and which operated on the outputs of those flip-flops.

Contents

1 The classic five stage RISC pipeline

1.1 Instruction fetch

1.2 Instruction decode

1.3 Execute

1.4 Memory access

1.5 Writeback 2 Hazards

2.1 Structural hazards 2.2 Data hazards

2.2.1 Solution A. Bypassing

2.2.2 Solution B. Pipeline interlock

2.3 Control hazards

3 Exceptions

4 miss handling

5 References

The classic five stage RISC pipeline

Instruction fetch

https://en.wikipedia.org/wiki/Classic_RISC_pipeline Page 1 of 9 Classic RISC pipeline - Wikipedia, the free encyclopedia 19/01/16 11:23

The Instruction Cache on these machines had a latency of one cycle, meaning that if the instruction was in the cache, it would be ready on the next 1 2 3 4 5 clock cycle. During the Instruction Fetch stage, a 32-bit instruction was fetched +1 from the cache. +1 The Program , or PC, is a register + : responsible for holding the address of the current instruction. It feeds into the PC Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, predictor which then sends the Program ID = Instruction Decode, EX = Execute, MEM = Memory access, WB Counter (PC) to the Instruction Cache to = Register write back). The vertical axis is successive instructions; the read the current instruction. At the same horizontal axis is time. So in the green column, the earliest instruction time, the PC predictor predicts the is in WB stage, and the latest instruction is undergoing instruction address of the next instruction by fetch. incrementing the PC by 4 (all instructions were 4 bytes long). This prediction was always wrong in the case of a taken branch, jump, or exception (see delayed branches, below). Later machines would use more complicated and accurate algorithms (branch prediction and branch target prediction) to guess the next instruction address.

Instruction decode

Unlike earlier microcoded machines, the first RISC machines had no . Once fetched from the instruction cache, the instruction bits were shifted down the pipeline, so that simple combinational logic in each pipeline stage could produce the control signals for the directly from the instruction bits. As a result, very little decoding is done in the stage traditionally called the decode stage. A consequence of this lack of decoding meant however that more instruction bits had to be used specifying what the instruction should do (and also, what it should not), and that leaves less bits for things like register indices.

All MIPS, SPARC, and DLX instructions have at most two register inputs. During the decode stage, these two register names are identified within the instruction, and the two registers named are read from the register file. In the MIPS design, the register file had 32 entries.

At the same time the register file was read, instruction issue logic in this stage determined if the pipeline was ready to execute the instruction in this stage. If not, the issue logic would cause both the Instruction Fetch stage and the Decode stage to stall. On a stall cycle, the stages would prevent their initial flip-flops from accepting new bits.

If the instruction decoded was a branch or jump, the target address of the branch or jump was computed in parallel with reading the register file. The branch condition is computed after the register file is read, and if the branch is taken or if the instruction is a jump, the PC predictor in the first stage is assigned the branch target, rather than the incremented PC that has been computed. It should be noted that some architectures made use of the ALU in the Execute stage, at the cost of slightly decrease instruction throughput.

The decode stage ended up with quite a lot of hardware: the MIPS instruction set had the possibility of branching if two registers were equal, so a 32-bit-wide AND tree ran in series after the register file read, making a very long critical path through this stage. Also, the branch target computation generally required a

https://en.wikipedia.org/wiki/Classic_RISC_pipeline Page 2 of 9 Classic RISC pipeline - Wikipedia, the free encyclopedia 19/01/16 11:23

16 bit add and a 14 bit incrementer. Resolving the branch in the decode stage made it possible to have just a single-cycle branch mispredict penalty. Since branches were very often taken (and thus mispredicted), it was very important to keep this penalty low.

Execute

The Execute stage is where the actual computation occurs. Typically this stage consists of an Arithmetic and Logic Unit, and also a bit shifter. It may also include a multiple cycle multiplier and divider.

The Arithmetic and Logic Unit is responsible for performing boolean operations (and, or, not, nand, nor, xor, xnor) and also for performing integer addition and subtraction. Besides the result, the ALU typically provides status bits such as whether or not the result was 0, or if an overflow occurred.

The bit shifter is responsible for shift and rotations.

Instructions on these simple RISC machines can be divided into three latency classes according to the type of the operation:

Register-Register Operation (Single-cycle latency): Add, subtract, compare, and logical operations. During the execute stage, the two arguments were fed to a simple ALU, which generated the result by the end of the execute stage. Memory Reference (Two-cycle latency). All loads from memory. During the execute stage, the ALU added the two arguments (a register and a constant offset) to produce a virtual address by the end of the cycle. Multi-cycle Instructions (Many cycle latency). Integer multiply and divide and all floating-point operations. During the execute stage, the operands to these operations were fed to the multi-cycle multiply/divide unit. The rest of the pipeline was free to continue execution while the multiply/divide unit did its work. To avoid complicating the writeback stage and issue logic, multicycle instruction wrote their results to a separate set of registers.

Memory access

If data memory needs to be accessed, it is done so in this stage.

During this stage, single cycle latency instructions simply have their results forwarded to the next stage. This forwarding ensures that both single and two cycle instructions always write their results in the same stage of the pipeline, so that just one write port to the register file can be used, and it is always available.

For direct mapped and virtually tagged data caching, the simplest by far of the numerous data cache organizations, two SRAMs are used, one storing data and the other storing tags.

Writeback

During this stage, both single cycle and two cycle instructions write their results into the register file.

Hazards

https://en.wikipedia.org/wiki/Classic_RISC_pipeline Page 3 of 9 Classic RISC pipeline - Wikipedia, the free encyclopedia 19/01/16 11:23

Hennessy and Patterson coined the term hazard for situations in which instructions in a pipeline would produce wrong answers.

Structural hazards

Structural hazards occur when two instructions might attempt to use the same resources at the same time. Classic RISC pipelines avoided these hazards by replicating hardware. In particular, branch instructions could have used the ALU to compute the target address of the branch. If the ALU were used in the decode stage for that purpose, an ALU instruction followed by a branch would have seen both instructions attempt to use the ALU simultaneously. It is simple to resolve this conflict by designing a specialized branch target into the decode stage.

Data hazards

Data hazards occur when an instruction, scheduled blindly, would attempt to use data before the data is available in the register file.

In the classic RISC pipeline, Data hazards are avoided in one of two ways:

Solution A. Bypassing

Bypassing is also known as .

Suppose the CPU is executing the following piece of code:

SUB r3,r4 -> r10 ; Writes r3 - r4 to r10 AND r10,r3 -> r11 ; Writes r10 && r3 to r11

The instruction fetch and decode stages will send the second instruction one cycle after the first. They flow down the pipeline as shown in this diagram:

In a naïve pipeline, without hazard consideration, the data hazard progresses as follows:

In cycle 3, the SUB instruction calculates the new value for r10. In the same cycle, the AND operation is decoded, and the value of r10 is fetched from the register file. However, the SUB instruction has not yet written its result to r10. Write-back of this normally occurs in cycle 5 (green box). Therefore, the value read from the register file and passed to the ALU (in the Execute stage of the AND operation, red box) is incorrect.

https://en.wikipedia.org/wiki/Classic_RISC_pipeline Page 4 of 9 Clock Cycles Data Hazard - ftp.fmp#1 2 3 4 5 6 Sua ROM , Rt

Rm , RIO RL F M AND , , D E W

SUB

to F Ⱦ PC sanded . Instruction coche to Reaal Instruction Then PC -14 ( next instruction corresponding .

Ⱦ the D 2 name offre Rs , Rc are and read of input registers ( , ) identified from

| In MPS I have 32 regista file : Rd : . . R 31 )

Ⱦ Register file Reaal if there are problems Ⱦ stall Fond D

in If Instruction decoded is a- Branche or Lump Ⱦ Address Computed parallel reading register file

Branche condition computer after register file is read

E Ⱦ Actual computation occurs

dato . lo . M Ⱦ Only used if Memory needs le accessed ( in this cose value RIO is only forwarded to W stage )

Ⱦ value it is W In this stooge it is written the new of RIO ( new ready

AND

F Ⱦ Normally dono

. . Ⱦ lecause execution D- Trying to access RIO net Ready SUB needs to end its ! Classic RISC pipeline - Wikipedia, the free encyclopedia 19/01/16 11:23

Instead, we must pass the data that was computed by SUB back to the Execute stage (i.e. to the red circle in the diagram) of the AND operation before it is normally written-back. The solution to this problem is a pair of bypass . These multiplexers sit at the end of the decode stage, and their flopped outputs are the inputs to the ALU. Each selects between:

1. A register file read port (i.e. the output of the decode stage, as in the naïve pipeline): red arrow 2. The current register pipeline of the ALU (to bypass by one stage): blue arrow 3. The current register pipeline of the access stage (which is either a loaded value or a forwarded ALU result, this provides bypassing of two stages): purple arrow. Note that this requires the data to be passed backwards in time by one cycle. If this occurs, a bubble must be inserted to stall the AND operation until the data is ready.

Decode stage logic compares the registers written by instructions in the execute and access stages of the pipeline to the registers read by the instruction in the decode stage, and cause the multiplexers to select the most recent data. These bypass multiplexers make it possible for the pipeline to execute simple instructions with just the latency of the ALU, the multiplexer, and a flip-flop. Without the multiplexers, the latency of writing and then reading the register file would have to be included in the latency of these instructions.

Note that the data can only be passed forward in time - the data cannot be bypassed back to an earlier stage if it has not been processed yet. In the case above, the data is passed forward (by the time the AND is ready for the register in the ALU, the SUB has already computed it).

Solution B. Pipeline interlock

However, consider the following instructions:

LD adr -> r10 AND r10,r3 -> r11

The data read from the address adr will not be present in the data cache until after the Memory Access stage of the LD instruction. By this time, the AND instruction will already be through the ALU. To resolve this would require the data from memory to be passed backwards in time to the input to the ALU. This is not possible. The solution is to delay the AND instruction by one cycle. The data hazard is detected in the decode stage, and the fetch and decode stages are stalled - they are prevented from flopping their inputs and so stay in the same state for a cycle. The execute, access, and write-back stages downstream see an extra no- operation instruction (NOP) inserted between the LD and AND instructions.

https://en.wikipedia.org/wiki/Classic_RISC_pipeline Page 5 of 9 Classic RISC pipeline - Wikipedia, the free encyclopedia 19/01/16 11:23

This NOP is termed a pipeline bubble since it floats in the pipeline, like an air bubble, occupying resources but not producing useful results. The hardware to detect a data hazard and stall the pipeline until the hazard is cleared is called a pipeline interlock.

Bypassing backwards in time Problem resolved using a bubble

A pipeline interlock does not have to be used with any data forwarding, however. The first example of the SUB followed by AND and the second example of LD followed by AND can be solved by stalling the first stage by three cycles until write-back is achieved, and the data in the register file is correct, causing the correct register value to be fetched by the AND's Decode stage. This causes quite a performance hit, as the spends a lot of time processing nothing, but clock speeds can be increased as there is less forwarding logic to wait for.

This data hazard can be detected quite easily when the program's machine code is written by the compiler. The original Stanford RISC machine relied on the compiler to add the NOP instructions in this case, rather than having the circuitry to detect and (more taxingly) stall the first two pipeline stages. Hence the name MIPS: without Interlocked Pipeline Stages. It turned out that the extra NOP instructions added by the compiler expanded the program binaries enough that the instruction cache hit rate was reduced. The stall hardware, although expensive, was put back into later designs to improve instruction cache hit rate, at which point the acronym no longer made sense.

Control hazards

Control hazards are caused by conditional and unconditional branching. The classic RISC pipeline resolves branches in the decode stage, which means the branch resolution recurrence is two cycles long. There are three implications:

The branch resolution recurrence goes through quite a bit of circuitry: the instruction cache read, register file read, branch condition compute (which involves a 32-bit compare on the MIPS CPUs), and the next instruction address multiplexer. Because branch and jump targets are calculated in parallel to the register read, RISC ISAs typically do not have instructions which branch to a register+offset address. Jump to register is supported. On any branch taken, the instruction immediately after the branch is always fetched from the instruction cache. If this instruction is ignored, there is a one cycle per taken branch IPC penalty, which is quite large.

There are four schemes to solve this performance problem with branches:

Predict Not Taken: Always fetch the instruction after the branch from the instruction cache, but only https://en.wikipedia.org/wiki/Classic_RISC_pipeline Page 6 of 9 Classic RISC pipeline - Wikipedia, the free encyclopedia 19/01/16 11:23

execute it if the branch is not taken. If the branch is not taken, the pipeline stays full. If the branch is taken, the instruction is flushed (marked as if it were a NOP), and one cycle's opportunity to finish an instruction is lost. Branch Likely: Always fetch the instruction after the branch from the instruction cache, but only execute it if the branch was taken. The compiler can always fill the branch on such a branch, and since branches are more often taken than not, such branches have a smaller IPC penalty than the previous kind. Branch Delay Slot: Always fetch the instruction after the branch from the instruction cache, and always execute it, even if the branch is taken. Instead of taking an IPC penalty for some fraction of branches either taken (perhaps 60%) or not taken (perhaps 40%), branch delay slots take an IPC penalty for those branches into which the compiler could not schedule the branch delay slot. The SPARC, MIPS, and MC88K designers designed a branch delay slot into their ISAs. Branch Prediction: In parallel with fetching each instruction, guess if the instruction is a branch or jump, and if so, guess the target. On the cycle after a branch or jump, fetch the instruction at the guessed target. When the guess is wrong, flush the incorrectly fetched target.

Delayed branches were controversial, first, because their semantics is complicated. A delayed branch specifies that the jump to a new location happens after the next instruction. That next instruction is the one unavoidably loaded by the instruction cache after the branch.

Delayed branches have been criticized as a poor short-term choice in ISA design:

Compilers typically have some difficulty finding logically independent instructions to place after the branch (the instruction after the branch is called the delay slot), so that they must insert NOPs into the delay slots. Superscalar processors, which fetch multiple and must have some form of branch prediction, do not benefit from delayed branches. The Alpha ISA left out delayed branches, as it was intended for superscalar processors. The most serious drawback to delayed branches is the additional control complexity they entail. If the delay slot instruction takes an exception, the processor has to be restarted on the branch, rather than that next instruction. Exceptions then have essentially two addresses, the exception address and the restart address, and generating and distinguishing between the two correctly in all cases has been a source of bugs for later designs.

Exceptions

Suppose a 32-bit RISC is processing an ADD instruction which adds two large numbers together. Suppose the resulting number does not fit in 32 bits. What happens?

The simplest solution, provided by most architectures, is wrapping arithmetic. Numbers greater than the maximum possible encoded value have their most significant bits chopped off until they fit. In the usual integer number system, 3000000000+3000000000=6000000000. With unsigned 32 bit wrapping arithmetic, 3000000000+3000000000=1705032704 (6000000000 mod 2^32). This may not seem terribly useful. The largest benefit of wrapping arithmetic is that every operation has a well defined result.

But the programmer, especially if programming in a language supporting large integers (e.g. Lisp or Scheme), may not want wrapping arithmetic. Some architectures (e.g. MIPS), define special addition operations that branch to special locations on overflow, rather than wrapping the result. Software at the target

https://en.wikipedia.org/wiki/Classic_RISC_pipeline Page 7 of 9 Classic RISC pipeline - Wikipedia, the free encyclopedia 19/01/16 11:23 location is responsible for fixing the problem. This special branch is called an exception. Exceptions differ from regular branches in that the target address is not specified by the instruction itself, and the branch decision is dependent on the outcome of the instruction.

The most common kind of software-visible exception on one of the classic RISC machines is a TLB miss (see ).

Exceptions are different from branches and jumps, because those other control flow changes are resolved in the decode stage. Exceptions are resolved in the writeback stage. When an exception is detected, the following instructions (earlier in the pipeline) are marked as invalid, and as they flow to the end of the pipe their results are discarded. The is set to the address of a special exception handler, and special registers are written with the exception location and cause.

To make it easy (and fast) for the software to fix the problem and restart the program, the CPU must take a precise exception. A precise exception means that all instructions up to the excepting instruction have been executed, and the excepting instruction and everything afterwards have not been executed.

In order to take precise exceptions, the CPU must commit changes to the software visible state in the program order. This in-order commit happens very naturally in the classic RISC pipeline. Most instructions write their results to the register file in the writeback stage, and so those writes automatically happen in program order. Store instructions, however, write their results to the Store Data Queue in the access stage. If the store instruction takes an exception, the Store Data Queue entry is invalidated so that it is not written to the cache data SRAM later.

Cache miss handling

Occasionally, either the data or instruction cache will not have a datum or instruction required. In these cases, the CPU must suspend operation until the cache can be filled with the necessary data, and then must resume execution. The problem of filling the cache with the required data (and potentially writing back to memory the evicted cache line) is not specific to the pipeline organization, and is not discussed here.

There are two strategies to handle the suspend/resume problem. The first is a global stall signal. This signal, when activated, prevents instructions from advancing down the pipeline, generally by gating off the clock to the flip-flops at the start of each stage. The disadvantage of this strategy is that there are a large number of flip flops, so the global stall signal takes a long time to propagate. Since the machine generally has to stall in the same cycle that it identifies the condition requiring the stall, the stall signal becomes a speed-limiting critical path.

Another strategy to handle suspend/resume is to reuse the exception logic. The machine takes an exception on the offending instruction, and all further instructions are invalidated. When the cache has been filled with the necessary data, the instruction which caused the cache miss is restarted. To expedite data cache miss handling, the instruction can be restarted so that its access cycle happens one cycle after the data cache has been filled.

References

Computer Architecture, A Quantitative Approach (Fifth Edition) - Morgan Kaufmann, 2011 ISBN 978-0123838728 https://en.wikipedia.org/wiki/Classic_RISC_pipeline Page 8 of 9 Pipelined processors Example 1

E. Sanchez Computer Architectures Example 1

Considering the MIPS64 architecture presented in the following: – Integer ALU: 1 clock cycle – Data memory: 1 clock cycle – FP multiplier unit: pipelined 8 stages – FP arithmetic unit: pipelined 2 stages – FP divider unit: not pipelined unit, requiring 12 clock cycles – branch delay slot: 1 clock cycle (disabled) – forwarding is enabled.

1. write an assembly program able to sum 10 integer (64 bits) numbers previously saved in memory. Example 1 – [cont]

2. show the timing of the developed loop-based program and compute how many cycles does this program take to execute. 3. using the static scheduling technique, re-schedule the presented code in order to eliminate the most data hazards. ; 10 integers addition ;------.data values: .word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ;64-bit integers result: .space 8 .text MAIN: daddui R1,R0,10 ;R1 <-- 10 dadd R2,R0,R0 ;R2 <-- 0 POINTER REG dadd R3,R0,R0 ;R3 <-- 0 RESULT REG LOOP: ld R4,values(R2) ;GET A VALUE IN R4 dadd R3,R3,R4 ;R3 <-- R3 + R4 daddi R2,R2,8 ;R2 <-- R2 + 8 POINTER INCREMENT daddi R1,R1,-1 ;R1 <-- R1 - 1 DECREMENT COUNTER bnez R1,LOOP sd R3,result(R0) ;Result in MEM halt ;the end

Configure ; 10 integers addition Enable forwarding ;------ Enable Delay Slot .data values: .word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ;64-bit integers result: .space 8 .text , IF ID EX M WB MAIN: daddui R1,R0,10 ✓ dadd R2,R0,R0 IF ID EX M WB dadd R3,R0,R0 IF ID EX M LOOP: ld R4,values(R2) IF ID EX dadd R3,R3,R4 IF ID daddi R2,R2,8 IF daddi R1,R1,-1 bnez R1,LOOP sd R3,result(R0) halt

Configure ; 10 integers addition Enable forwarding ;------ Enable Delay Slot .data values: .word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ;64-bit integers result: .space 8

.text ✓ IF ID EX M WB MAIN: daddui R1,R0,10 ✓ dadd R2,R0,R0 IF ID EX M WB dadd R3,R0,R0 IF ID EX M WB in R value . access retrieve 4 LOOP: ld R4,values(R2) IF ID EX M Memory . Read Write ! Ⱦ : After IF ID S RAW dadd R3,R3,R4 - is not until Read value on R 4 but complete IF S daddi R2,R2,8 end M daddi R1,R1,-1 Fimianistage bnez R1,LOOP - stole sd R3,result(R0) halt

Configure ; 10 integers addition Enable forwarding ;------ Enable Delay Slot .data values: .word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ;64-bit integers result: .space 8 .text MAIN: daddui R1,R0,10 IF ID EX M WB dadd R2,R0,R0 IF ID EX M WB dadd R3,R0,R0 IF ID EX M WB LOOP: ld R4,values(R2) IF ID EX M WB dadd R3,R3,R4 IF ID S EX M WB daddi R2,R2,8 IF toS ID EX M t.im Ⱦ .ae:*:[: daddi R1,R1,-1 8 IF ID EX bnez R1,LOOP IF S RAW

↳ . Stalls sn decade I'll access to M but

end sd R3,result(R0) Is not ready until of halt premieres Execution

Configure ; 10 integers addition Enable forwarding ;------ Enable Delay Slot .data values: .word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ;64-bit integers result: .space 8 .text MAIN: daddui R1,R0,10 IF ID EX M WB dadd R2,R0,R0 IF ID EX M WB dadd R3,R0,R0 IF ID EX M WB LOOP: ld R4,values(R2) IF ID EX M WB dadd R3,R3,R4 IF ID S EX M WB daddi R2,R2,8 IF S ID EX M WB daddi R1,R1,-1 IF ID EX M WB ( IF S ID EX M WB bnez R1,LOOP sd R3,result(R0) IF . LOOP: ld R4,values(R2) Ⱦ IF ID EX M WB dadd R3,R3,R4 IF ID S EX M WB Way +1 anni LOOP Configure ; 10 integers addition Enable forwarding ;------ Enable Delay Slot .data values: .word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ;64-bit integers result: .space 8 .text MAIN: daddui R1,R0,10 IF ID EX M WB 5 dadd R2,R0,R0 IF ID EX M WB 1 dadd R3,R0,R0 IF ID EX M WB 1 LOOP: ld R4,values(R2) IF ID EX M WB 1 dadd R3,R3,R4 IF ID S EX M WB 2 daddi R2,R2,8 IF S ID EX M WB 1 daddi R1,R1,-1 IF ID EX M WB 1 bnez R1,LOOP IF S ID EX M WB 2 sd R3,result(R0) IF LOOP: ld R4,values(R2) IF ID EX M WB dadd R3,R3,R4 IF ID S EX M WB

Configure ; 10 integers addition Enable forwarding ;------ Enable Delay Slot .data values: .word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ;64-bit integers result: .space 8 .text MAIN: daddui R1,R0,10 IF ID EX M WB 5 dadd R2,R0,R0 7 IF ID EX M WB 1 dadd R3,R0,R0 IF ID EX M WB 1 LOOP: ld R4,values(R2) IF ID EX M WB 1 dadd R3,R3,R4 IF ID S EX M WB 2 IF S ID EX M WB 1 daddi R2,R2,8 8X9 daddi R1,R1,-1 IF ID EX M WB 1 bnez R1,LOOP IF S ID EX M WB 2 IF 1 sd R3,result(R0) - LOOP: ld R4,values(R2) IF ID EX M WB 81 dadd R3,R3,R4 IF ID S EX M WB

Configure ; 10 integers addition Enable forwarding ;------ Enable Delay Slot .data values: .word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ;64-bit integers result: .space 8 .text MAIN: daddui R1,R0,10 IF ID EX M WB dadd R2,R0,R0 IF ID EX M WB dadd R3,R0,R0 IF ID EX M WB LOOP: ld R4,values(R2) IF ID EX M WB 1 dadd R3,R3,R4 IF ID S EX M WB 2 IF S ID EX M WB 1 daddi R2,R2,8 8X1 daddi R1,R1,-1 IF ID EX M WB 1 bnez R1,LOOP IF S ID EX M WB 2 sd R3,result(R0) IF ID EX M WB 1 halt 1 IF ID EX M WB 1 0

Configure ; 10 integers addition Enable forwarding ;------ Enable Delay Slot .data values: .word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ;64-bit integers result: .space 8 .text MAIN: daddui R1,R0,10 5 1 7 dadd R2,R0,R0 2.10 dadd R3,R0,R0 1 LOOP: ld R4,values(R2) 1 dadd R3,R3,R4 2 20 RAW 1 daddi R2,R2,8 8X10 daddi R1,R1,-1 1 bnez R1,LOOP 2 sd R3,result(R0) 1 halt 1 1 88 3. using the static scheduling technique, re- ; 10 integers addition schedule the presented code in order to eliminate ;------the most data hazards. .data values: .word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ;64-bit integers result: .space 8 .text MAIN: daddui R1,R0,10 dadd R2,R0,R0 dadd R3,R0,R0 LOOP: ld R4,values(R2) dadd R3,R3,R4 daddi R2,R2,8 daddi R1,R1,-1 bnez R1,LOOP sd R3,result(R0) halt 3. using the static scheduling technique, re- ; 10 integers addition schedule the presented code in order to eliminate ;------the most data hazards. .data values: .word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ;64-bit integers result: .space 8 .text MAIN: daddui R1,R0,10 MAIN: daddui R1,R0,10 5 dadd R2,R0,R0 dadd R2,R0,R0 1 7 dadd R3,R0,R0 dadd R3,R0,R0 1 LOOP: ld R4,values(R2) LOOP: ld R4,values(R2) 1 dadd R3,R3,R4 daddi R1,R1,-1 1 1 daddi R2,R2,8 dadd R3,R3,R4 6X10 daddi R1,R1,-1 daddi R2,R2,8 1 bnez R1,LOOP bnez R1,LOOP 1 sd R3,result(R0) sd R3,result(R0) 1 halt halt 1 1 88 68 Pipelined processors Example 2

E. Sanchez Computer Architectures Example 2

Considering the following pseudo-code: for (i = 0; i < 100; i++) { Y[i] = X[i]2 + X[i] / Z[i] } Suppose that vectors X[i] and Z[i] contain 100 FP numbers, were previously saved in memory, and Z[i] ≠ 0. Assume the MIPS64 architecture presented below: – Integer ALU: 1 clock cycle – Data memory: 1 clock cycle – FP multiplier unit: pipelined 4 stages – FP arithmetic unit: pipelined 2 stages – FP divider unit: not pipelined unit, requiring 4 clock cycles – branch delay slot: 1 clock cycle (disabled) – forwarding is enabled.

Example 2 – [cont] 1. Write an assembly program for the MIPS64 architecture able to perform the previously presented pseudo-code 2. show the timing of the developed loop-based program and compute how many cycles does this program take to execute 3. using all the static optimization techniques, re-write the developed code in order to eliminate the most data hazards 4. show the timing development of the new optimized program and compute how many clock cycles does this program take to execute. 1. Write an assembly program for the MIPS64 architecture able to perform the previously presented pseudo-code

;*********** MIPS64 ************* ;------; Y = x^2 + x /z ; Z ≠ 0. ;------.data vetX: .double 2, 4, 6, 8… vetZ: .double 1, 2, 3, 4… vetY: .double 0, 0, 0, 0… 1) .text CC MAIN: daddui R1,R0,100 daddui R2,R0,vetX daddui R3,R0,vetZ daddui R4,R0,vetY loop: l.d F1,0(R2) l.d F2,0(R3) mul.d F3,F1,F1 div.d F4,F1,F2 add.d F5,F3,F4 s.d F5,0(R4) daddi R2,R2,8 daddi R3,R3,8 daddi R4,R4,8 daddi R1,R1,-1 bnez R1,loop halt 1) .text CC 2) MAIN: daddui R1,R0,100 5 daddui R2,R0,vetX 1 daddui R3,R0,vetZ 1 8 daddui R4,R0,vetY 1 8 loop: l.d F1,0(R2) 1 + l.d F2,0(R3) 1 mul.d F3,F1,F1 4 ______17 x 100 div.d F4,F1,F2 1 add.d F5,F3,F4 2 1708 s.d F5,0(R4) 1 daddi R2,R2,8 1 17 daddi R3,R3,8 1 daddi R4,R4,8 1 daddi R1,R1,-1 1 bnez R1,loop 2 halt 1 3)

.text cc MAIN: daddui R1,R0,100 MAIN: daddui R1,R0,25 daddui R2,R0,vetX daddui R2,R0,vetX daddui R3,R0,vetZ daddui R3,R0,vetZ daddui R4,R0,vetY daddui R4,R0,vetY loop: l.d F1,0(R2) loop: l.d F1,0(R2) l.d F2,0(R3) l.d F2,0(R3) mul.d F3,F1,F1 mul.d F3,F1,F1 div.d F4,F1,F2 div.d F4,F1,F2 add.d F5,F3,F4 daddi R2,R2,8 s.d F5,0(R4) add.d F5,F3,F4 daddi R2,R2,8 s.d F5,0(R4) daddi R3,R3,8 daddi R3,R3,8 daddi R4,R4,8 daddi R4,R4,8 daddi R1,R1,-1 bnez R1,loop halt 3)

CC

MAIN: daddui R1,R0,25 5 F D E M W

daddui R2,R0,vetX 1 F D E M W

daddui R3,R0,vetZ 1 F D E M W

daddui R4,R0,vetY 1 F D E M W loop: l.d F1,0(R2)

l.d F2,0(R3)

mul.d F3,F1,F1

div.d F4,F1,F2

daddi R2,R2,8

add.d F5,F3,F4

s.d F5,0(R4)

daddi R3,R3,8

daddi R4,R4,8

3)

CC

/ MAIN: daddui R1,R0,25 5 F D E M W

✓ daddui R2,R0,vetX 1 F D E M W

✓ daddui R3,R0,vetZ 1 F D E M W

✓ daddui R4,R0,vetY 1 F D E M W

✓ loop: l.d F1,0(R2) 1 F D E M W

✓ l.d F2,0(R3) 1 F D E M W

✓ mul.d F3,F1,F1 4 F D E E E E M W

div.d F4,F1,F2 1 F D E E E E M W

daddi R2,R2,8 0 F D E M W festooned

° add.d F5,F3,F4 2 F D S S E E M W

s.d F5,0(R4) 1 F S S D E S M W

daddi R3,R3,8 1 F D S E M W daddi R4,R4,8 1 ottusità i F S D E M W 12 3+)

CC l.d F1,0(R2) 1 4) MAIN: daddui R1,R0,25 5 l.d F2,0(R3) 1

daddui R2,R0,vetX 1 mul.d F3,F1,F1 4 daddui R3,R0,vetZ 1 8 div.d F4,F1,F2 1 daddui R4,R0,vetY 1 daddi R2,R2,8 0 12 loop: l.d F1,0(R2) 1 8 add.d F5,F3,F4 2 l.d F2,0(R3) 1 s.d F5,0(R4) 1 mul.d F3,F1,F1 4 daddi R3,R3,8 1 + div.d F4,F1,F2 1 daddi R4,R4,8 1 daddi R2,R2,8 0 12 51 x 25 add.d F5,F3,F4 2 ______s.d F5,0(R4) 1 l.d F1,0(R2) 1 daddi R3,R3,8 1 l.d F2,0(R3) 1 daddi R4,R4,8 1 mul.d F3,F1,F1 4 1283 div.d F4,F1,F2 1

l.d F1,0(R2) 1 daddi R2,R2,8 0

l.d F2,0(R3) 1 add.d F5,F3,F4 2 mul.d F3,F1,F1 4 15 s.d F5,0(R4) 1 div.d F4,F1,F2 1 daddi R3,R3,8 1 daddi R2,R2,8 0 12 daddi R1,R1,-1 1 add.d F5,F3,F4 2 daddi R4,R4,8 1 s.d F5,0(R4) 1 bnez R1,loop 1 daddi R3,R3,8 1 halt 1 daddi R4,R4,8 1 4+) branch delay slot enabled

cc cc MAIN: daddui R1,R0,25 MAIN: daddui R1,R0,25 daddui R2,R0,vetX daddui R2,R0,vetX daddui R3,R0,vetZ daddui R3,R0,vetZ daddui R4,R0,vetY daddui R4,R0,vetY loop: l.d F1,0(R2) loop: l.d F1,0(R2) l.d F2,0(R3) l.d F2,0(R3) mul.d F3,F1,F1 mul.d F3,F1,F1 div.d F4,F1,F2 div.d F4,F1,F2 daddi R2,R2,8 add.d F5,F3,F4 add.d F5,F3,F4 s.d F5,0(R4) s.d F5,0(R4) l.d F1,8(R2) daddi R3,R3,8 l.d F2,8(R3) daddi R4,R4,8 mul.d F3,F1,F1 div.d F4,F1,F2 add.d F5,F3,F4 s.d F5,8(R4) 4+) CC l.d F1,16(R2) 1 4+) MAIN: daddui R1,R0,25 5 l.d F2,16(R3) 1 daddui R2,R0,vetX 1 8 8 mul.d F3,F1,F1 4 10 daddui R3,R0,vetZ 1 + div.d F4,F1,F2 1 daddui R4,R0,vetY 1 44 x 25 add.d F5,F3,F4 2 loop: l.d F1,0(R2) 1 + s.d F5,16(R4) 1 l.d F2,0(R3) 1 1 mul.d F3,F1,F1 4 10 l.d F1,24(R2) 1 ______div.d F4,F1,F2 1 l.d F2,24(R3) 1 1109 add.d F5,F3,F4 2 mul.d F3,F1,F1 4

s.d F5,0(R4) 1 div.d F4,F1,F2 1 l.d F1,8(R2) 1 daddi R2,R2,32 0

l.d F2,8(R3) 1 add.d F5,F3,F4 2 14

mul.d F3,F1,F1 4 s.d F5,24(R4) 1 10 div.d F4,F1,F2 1 daddi R1,R1,-1 1 add.d F5,F3,F4 2 daddi R3,R3,32 1 s.d F5,8(R4) 1 bnez R1,loop 1 daddi R4,R4,32 1 Halt 1 1

Counter 1 2- bit Saturating

2 Correlation Predictor

:BPU Examples

ctian BPU Ⱦ Branche beati bit

- E. Sanchez

Politecnico di Torino Dipartimento di Automatica e Informatica BHT Example ↳ bancheEstereIlla

Considering a 2-bit saturating counter BHT of 1K entries; and assuming that the processor executes the following code fragment, determine the BHT final state and calculate the misprediction ratio for the presented case. for(i=1;i<=100;i++){ read_data(); if (aa==2) aa = 0; if (bb==2) bb = 0; if (aa == bb) {…} }

2 BHT Example

Considering a 2-bit saturating counter BHT of 1K entries; and assuming that the processor executes the following code fragment, determine the BHT final state and calculate

ad = the misprediction ratio for the presented case. RO un Rt- RZ =p L0: … i = Ri for(i=1;i<=100;i++){ DADDUI R3, R1, #-2 ; R3 = Rt -2 [BNEZ R3, L1 ; Branch tolti RI

read_data(); - Rt - DADD R1, R0, R0 | 0 if (aa==2) L1: DADDUI R3, R2, #-2 3 - RZ . 2

. aa = 0; [ Bramato . LZ if RHO - BNEZ R3, L2 , if (bb==2) . RZ = DADD R2, R0, R0 | 0

3=0 - RZ bb = 0; L2: DSUB R3, R1, R2 ) " - BNEZ R3, L3 ; Bowhtobif RTTRZ if (aa == bb) [ {…} … L3: … } :* DADDI R4, R4, #-1 ±::* . . ↳ .BNEZ R4, L0 … 3 BHT Example

Considering a 2-bit saturating counter BHT of 1K entries; and assuming that the processor executes the following code fragment, determine the BHT final state and calculate the misprediction ratio for the presented case. Case 1: -Program inputs for aa and bb are always different than 2.

4 ALL BRANCH TAKEN !

ST ITERATION NOT TAKEN [email protected] aa = 4 Iteration 1L bb = 3 address Instruction BHT Prediction misP. counter 0x0000 L0: … 0 NT … … 0 NT 0x0010 DADDUI R3, R1, #-2 0 NT 0x0014 BNEZ R3, L1 0 NT 0x0018 DADD R1, R0, R0 0 NT

0x001C L1: DADDUI R3, R2, #-2 0 NT 0x0020 BNEZ R3, L2 0 NT 0x0024 DADD R2, R0, R0 0 NT 0x0028 L2: DSUB R3, R1, R2 0 NT 0x002C BNEZ R3, L3 0 NT 0x0030 … 0 NT 0x0034 L3: … 0 NT 0x0038 DADDI R4, R4, #-1 0 NT 0x003C BNEZ R4, L0 0 NT 0x0040 … 0 NT 5 aa = 4 Iteration 1 bb = 3 address Instruction BHT Prediction misP. counter 0x0000 L0: … stanata0 NT … … 0 NT 0x0010 DADDUI R3, R1, #-2 0 NT 0x0014 BNEZ R3, L1 7- #ÈÈ\-1 NTȾ 1 0x0018 DADD R1, R0, R0 0 NT

0x001C L1: DADDUI R3, R2, #-2 0 NT 0x0020 BNEZ R3, L2 0 NT 0x0024 DADD R2, R0, R0 0 NT 0x0028 L2: DSUB R3, R1, R2 0 NT 0x002C BNEZ R3, L3 0 NT 0x0030 … 0 NT 0x0034 L3: … 0 NT 0x0038 DADDI R4, R4, #-1 0 NT 0x003C BNEZ R4, L0 0 NT 0x0040 … 0 NT 6 aa = 4 Iteration 1 bb = 3 address Instruction BHT Prediction misP. counter 0x0000 L0: … 0 NT … … 0 NT 0x0010 DADDUI R3, R1, #-2 0 NT 0x0014 BNEZ R3, L1 1 NT 1 0x0018 DADD R1, R0, R0 0 NT

0x001C L1: DADDUI R3, R2, #-2 0 NT 0x0020 BNEZ R3, L2 TAKENIT.gr#o1 NT 1 0x0024 DADD R2, R0, R0 0 NT 0x0028 L2: DSUB R3, R1, R2 0 NT 0x002C BNEZ R3, L3 0 NT 0x0030 … 0 NT 0x0034 L3: … 0 NT 0x0038 DADDI R4, R4, #-1 0 NT 0x003C BNEZ R4, L0 0 NT 0x0040 … 0 NT 7 aa = 4 Iteration 1 bb = 3 address Instruction BHT Prediction misP. counter 0x0000 L0: … 0 NT … … 0 NT 0x0010 DADDUI R3, R1, #-2 0 NT 0x0014 BNEZ R3, L1 1 NT 1 0x0018 DADD R1, R0, R0 0 NT

0x001C L1: DADDUI R3, R2, #-2 0 NT 0x0020 BNEZ R3, L2 1 NT 1 0x0024 DADD R2, R0, R0 0 NT 0x0028 L2: DSUB R3, R1, R2 0 NT 0x002C BNEZ R3, L3 TAKENIT.gr#o1 NT 1 0x0030 … 0 NT 0x0034 L3: … 0 NT 0x0038 DADDI R4, R4, #-1 0 NT 0x003C BNEZ R4, L0 1 NT 1 0x0040 … 0 NT 8 aa = 4 Iteration 2 bb = 3 address Instruction BHT Prediction misP. counter 0x0000 L0: … 0 NT … … 0 NT 0x0010 DADDUI R3, R1, #-2 0 NT 0x0014 BNEZ R3, L1 2 NT 2 0x0018 DADD R1, R0, R0 0 NT

0x001C L1: DADDUI R3, R2, #-2 0 NT 0x0020 BNEZ R3, L2 2 NT 2 0x0024 DADD R2, R0, R0 0 NT TAKENIT.gr#tAkenI$n$tAkenI$IgtAkenI$n$ 0x0028 L2: DSUB R3, R1, R2 0 NT 0x002C BNEZ R3, L3 2 NT 2 0x0030 … 0 NT 0x0034 L3: … 0 NT 0x0038 DADDI R4, R4, #-1 0 NT 0x003C BNEZ R4, L0 2 NT 2 0x0040 … 0 NT 9 aa = 4 Iteration 3 bb = 3 address Instruction BHT Prediction misP. counter 0x0000 L0: … 0 NT … … 0 NT 0x0010 DADDUI R3, R1, #-2 0 NT su 0x0014 BNEZ R3, L1 tue 3 T 2 0x0018 DADD R1, R0, R0 0 NT

some 0x001C L1: DADDUI R3, R2, #-2 0 NT 0x0020 BNEZ R3, L2 tue 3 T 2 0x0024 DADD R2, R0, R0 0 NT 0x0028 L2: DSUB R3, R1, R2 0 NT some 0x002C BNEZ R3, L3 takeFÈ\3 ȾT ȾÉ2 0x0030 … FÈ\0 NT É FÈ#ÈFÈ\ j 0x0034 L3: … 0 NT 0x0038 DADDI R4, R4, #-1 0 NT some 0x003C BNEZ R4, L0 take 3 T 2 0x0040 … 0 NT 10 aa = 4 Iteration 4 bb = 3 address Instruction BHT Prediction misP. counter 0x0000 L0: … 0 NT … … 0 NT 0x0010 DADDUI R3, R1, #-2 0 NT some 0x0014 BNEZ R3, L1 3 T 2 0x0018 DADD R1, R0, R0 0 NT

some 0x001C L1: DADDUI R3, R2, #-2 0 NT BNEZ R3, L2 TAKE 3 TȾ 2 0x0020 TAKEFd.joÌÒy Ó 0x0024 DADD R2, R0, R0 takeoff0 NT 0x0028 L2: DSUB R3, R1, R2 0 NT some 0x002C BNEZ R3, L3 3 T 2 0x0030 … 0 NT 0x0034 L3: … 0 NT 0x0038 DADDI R4, R4, #-1 0 NT some 0x003C BNEZ R4, L0 TAKE 3ÌÒy ȾT Ò 2 0x0040 … 0 NT 11 aa = 4 Iteration 100 bb = 3 address Instruction BHT Prediction misP. counter 0x0000 L0: … 0 NT … … 0 NT 0x0010 DADDUI R3, R1, #-2 0 NT 0x0014 BNEZ R3, L1 3 T 2 0x0018 DADD R1, R0, R0 0 NT

some 0x001C L1: DADDUI R3, R2, #-2 0 NT 0x0020 BNEZ R3, L2 TAKETAKEFd~gts.me3ÌÒy TȾ Ò 2 0x0024 DADD R2, R0, R0 0 NT .IO#jesome 0x0028 L2: DSUB R3, R1, R2 0 NT 0x002C BNEZ R3, L3 taken 3 T 2 0x0030 … 0 NT 0x0034 L3: … 0 NT 0x0038 DADDI R4, R4, #-1 0 NT

0x003C BNEZ R4, L0 not 2 T 3

= TAKEAÌÌGIdi 9 0x0040 … 0 NT mispredictad 12 BHT Example

Considering a 2-bit saturating counter BHT of 1K entries; and assuming that the processor executes the following code fragment, determine the BHT final state and calculate the misprediction ratio for the presented case. Case 1: -Program inputs for aa and bb are always different than 2.

Misprediton ratio = Number of mispredicted branches / Total branches

Misprediton ratio = 9 / 400 Misprediton ratio = 2.25%

13

BHT Example - Case 2

Considering a 2-bit saturating counter BHT of 1K entries; and assuming that the processor executes the following code fragment, determine the BHT final state and calculate the misprediction ratio for the presented case. Case 2: -Program inputs for aa and bb alternate values different than 2 and equal to 2.

14 aa = 4 Case 2 - Iteration 1 bb = 3 address Instruction BHT Prediction misP. counter 0x0000 L0: … 0 NT … … 0 NT 0x0010 DADDUI R3, R1, #-2 0 NT 0x0014 BNEZ R3, L1 0 NT 0x0018 DADD R1, R0, R0 0 NT

0x001C L1: DADDUI R3, R2, #-2 0 NT 0x0020 BNEZ R3, L2 0 NT 0x0024 DADD R2, R0, R0 0 NT 0x0028 L2: DSUB R3, R1, R2 0 NT 0x002C BNEZ R3, L3 0 NT 0x0030 … 0 NT 0x0034 L3: … 0 NT 0x0038 DADDI R4, R4, #-1 0 NT 0x003C BNEZ R4, L0 0 NT 0x0040 … 0 NT 15 IÌTI » È[ } Autun aa = 4 Case 2 - Iteration 1 bb = 3 address Instruction BHT Prediction misP. counter 0x0000 L0: … 0 NT … … 0 NT 0x0010 DADDUI R3, R1, #-2 0 NT 0x0014 BNEZ R3, L1 1 NT 1 0x0018 DADD R1, R0, R0 0 NT

0x001C L1: DADDUI R3, R2, #-2 0 NT 0x0020 BNEZ R3, L2 1 NT 1 DADD R2, R0, R0 tono0 NT 0x0024 0x0028 L2: DSUB R3, R1, R2 0 NT 0x002C BNEZ R3, L3 1 NT 1 0x0030 … 0 NT 0x0034 L3: … 0 NT 0x0038 DADDI R4, R4, #-1 0 NT 0x003C BNEZ R4, L0 ÷1 NT 1 0x0040 … 0 NT 16 QQ = bb = 2

All NOT TAKEN LAST ITERATION LAST is TAKEN EXEPT aa = 2 Case 2 - Iteration 2 bb = 2 address Instruction BHT Prediction misP. counter 0x0000 L0: … 0 NT … … 0 NT 0x0010 DADDUI R3, R1, #-2 0 NT

Ⱦ 0x0014 BNEZ R3, L1 0 NT 1 0x0018 DADD R1, R0, R0 0 NT

0x001C L1: DADDUI R3, R2, #-2 0 NT 0x0020 BNEZ R3, L2 0 NT 1 0x0024 DADD R2, R0, R0 0 NT 0x0028 L2: DSUB R3, R1, R2 : 0 NT 0x002C BNEZ R3, L3 0 NT 1 0x0030 … 0 NT 0x0034 L3: … -00 NT 0x0038 DADDI R4, R4, #-1 0 NT 0x003C BNEZ R4, L0 2 :NT 2 0x0040 … 0 NT 17 È[ } Autun aa = 4 Case 2 - Iteration 3 bb = 3 address Instruction BHT Prediction misP. counter 0x0000 L0: … 0 NT … … 0 NT 0x0010 DADDUI R3, R1, #-2 0 NT 0x0014 BNEZ R3, L1 1 NT 2 0x0018 DADD R1, R0, R0 0 NT

0x001C L1: DADDUI R3, R2, #-2 0 NT 0x0020 BNEZ R3, L2 1 NT 2 0x0024 DADD R2, R0, R0 0 NT 0x0028 L2: DSUB R3, R1, R2 0 NT 0x002C BNEZ R3, L3 1 NT 2 0x0030 … 0 NT 0x0034 L3: … é÷0 NT 0x0038 DADDI R4, R4, #-1 0 NT 0x003C BNEZ R4, L0 " 3÷ :T 2

0x0040 … 0 NT 18 QQ = bb = 2

All NOT TAKEN LAST ITERATION LAST is TAKEN EXEPT aa = 2 Case 2 - Iteration 4 bb = 2 address Instruction BHT Prediction misP. counter 0x0000 L0: … 0 NT … … 0 NT 0x0010 DADDUI R3, R1, #-2 0 NT 0x0014 BNEZ R3, L1 0 NT 2 0x0018 DADD R1, R0, R0 0 NT

0x001C L1: DADDUI R3, R2, #-2 0 NT 0x0020 BNEZ R3, L2 0 NT 2 0x0024 DADD R2, R0, R0 0 NT 0x0028 L2: DSUB R3, R1, R2 0 NT 0x002C BNEZ R3, L3 : 0 NT 2 … : 0 NT 0x0030 0x0034 L3: … 0 NT 0x0038 DADDI R4, R4, #-1 0 NT 0x003C BNEZ R4, L0 3 T 2 0x0040 … 0 NT 19 QQ = bb = 2

All NOT TAKEN

is TAKEN EXEPT LAST aa = 2 Case 2 - Iteration 100 bb = 2 address Instruction BHT Prediction misP. counter 0x0000 L0: … 0 NT … … 0 NT Half ore wrong 0x0010 DADDUI R3, R1, #-2 0 NT E.IE 0x0014 BNEZ R3, L1 0 NT 50 0x0018 DADD R1, R0, R0 0 NT

0x001C L1: DADDUI R3, R2, #-2 0 NT BNEZ R3, L2 : 0 NT 50 0x0020 0x0024 LASTHEH.tt#DADD R2, R0, R0 0 NT 0x0028 L2: DSUB R3, R1, R2 0 NT O 0x002C BNEZ R3, L3 ✓ 0 NT 50 0x0030 … 0 NT 0x0034 L3: … 0 NT 0x0038 DADDI R4, R4, #-1 0 NT

- 0x003C BNEZ R4, L0 is notte X 2 T 03

= 153

: Tot . mispredicteal 0x0040 … 0 NT 20 BHT Example - Case 2

Considering a 2-bit saturating counter BHT of 1K entries; and assuming that the processor executes the following code fragment, determine the BHT final state and calculate the misprediction ratio for the presented case. Case 2: -Program inputs for aa and bb alternate values different than 2 and equal to 2.

Misprediton ratio = Number of mispredicted branches / Total branches

Misprediton ratio = 153 / 400 Misprediton ratio = 38.25%

21 (2,2) Example

Considering a (2,2) correlating predictor of 1K entries; and assuming that the processor executes the following code fragment, determine the (2,2) final state and calculate the ad = misprediction ratio for the presented case. RO un Rt- RZ =p L0: … i = Ri for(i=1;i<=100;i++){ DADDUI R3, R1, #-2 ; R3 = Rt - 2 [BNEZ R3, L1 ; Branch tolti RI

read_data(); - Rt - DADD R1, R0, R0 | 0 if (aa==2) L1: DADDUI R3, R2, #-2 3 - RZ . 2

. aa = 0; [ Bramato . LZ if RHO - BNEZ R3, L2 , . RZ = if (bb==2) DADD R2, R0, R0 | 0

3 - Rt - RZ bb = 0; L2: DSUB R3, R1, R2 ) " - BNEZ R3, L3 ; Beowhtobif RTHRZ if (aa == bb) [ {…} … L3: … } DADDI R4, R4, #-1 :*±: . . :* ↳ .BNEZ R4, L0 … 22 (2,2) Example

Considering a (2,2) correlating predictor of 1K entries; and assuming that the processor executes the following code fragment, determine the (2,2) final state and calculate the misprediction ratio for the presented case. Case 1: -Program inputs for aa and bb alternate values different than 2 and equal to 2.

fftzmaabtat LAST Branch is even TAKEN but d-

NOT TAKEN LAST iteration is

iw://ite.my÷

EVEN Iteration Ⱦ Iteration

All BRANCH ALL BRANCH TAKEN NOT TAKEN 23 Ⱦ Branche Taken Ⱦ 00 01,10 11 Countee tt If is 22 = 4 values , , a # -

/ ahitattnsist . .tt?=JaIa7Ia==intuì÷:[ - Automa (2,2) - Iteration 1 {¥[; address Instruction 00 01 10 11 2-b sft reg misP. counter 0x0000 L0: … 0 0 0 0 00 … … 0 0 0 0 0x0010 DADDUI R3, R1, #-2 0 0 0 0

0x0014 BNEZ R3, L1 Ⱦ Rxndìaken 0 0 ÷0 0

0x0018 DADD R1, R0, R0 0 0 0 0

0x001C L1: DADDUI R3, R2, #-2 0 0 0 0 0x0020 BNEZ R3, L2 0 0 0 0 0x0024 DADD R2, R0, R0 0 0 0 0 0x0028 L2: DSUB R3, R1, R2 0 0 0 0 0x002C BNEZ R3, L3 0 0 0 0 0x0030 … 0 0 0 0 0x0034 L3: … 0 0 0 0 0x0038 DADDI R4, R4, #-1 0 0 0 0 0x003C BNEZ R4, L0 0 0 0 0 0x0040 … 0 0 0 0 24 ⇐÷E±÷ Otnir ... All Token aa = 3 (2,2) - Iteration 1 bb = 4 address Instruction 00 01 10 11 2-b sft reg misP. counter 0x0000 L0: … 0 0 0 0 00 … … 0 0 0 0 0x0010 DADDUI R3, R1, #-2 0 0 0 0

- -1 0x0014 BNEZ R3, L1 1 0 ÷0 0 01 III. ... 1 0x0018 DADD R1, R0, R0 0 0 0 0

0x001C L1: DADDUI R3, R2, #-2 0 0 0 0 TAKEN BNEZ R3, L2 0 0 0 ÷0 0x0020 0x0024 TAKENÉtienneDADD R2, R0, R0 0 0 0 0 0x0028 L2: DSUB R3, R1, R2 0 0 0 0

0x002C TAKEN BNEZ R3, L3 0 0 0 0 0x0030 … 0 0 0 0 0x0034 L3: … 0 0 0 0 0x0038 DADDI R4, R4,di #-1 0 0 0 0 0x003C TAKEN BNEZ R4, L0 0 0 0 0 0x0040 … 0 0 0 0 25 All Token aa = 3 (2,2) - Iteration 1 bb = 4 address Instruction 00 01 10 11 2-b sft reg misP. counter 0x0000 L0: … 0 0 0 0 00 … … 0 0 0 0 0x0010 DADDUI R3, R1, #-2 0 0 0 0 0x0014 BNEZ R3, L1 1 0 0 0 010 1 0x0018 DADD R1, R0, R0 0 0 0 0

0x001C L1: DADDUI R3, R2, #-2 0 0 0 0 0x0020 taken BNEZ R3, L2 0 1 0 0 11 ÷ :-# 1 0x0024 DADD R2, R0, R0 0 0 0 0 0x0028 L2: DSUB R3, R1, R2 0 0 0 0 0x002C TAKEN BNEZ R3, L3 0 0 0 0§ 0x0030 … 0 0 0 0 0x0034 L3: … 0 0 0 0 0x0038 DADDI¥-0 R4, R4, #-1 0 0 0 0 0x003C TAKEN BNEZ R4, L0 0 0 0 0 0x0040 … 0 0 0 0 26 All Token aa = 3 (2,2) - Iteration 1 bb = 4 address Instruction 00 01 10 11 2-b sft reg misP. counter 0x0000 L0: … 0 0 0 0 00 … … 0 0 0 0 0x0010 DADDUI R3, R1, #-2 0 0 0 0 0x0014 BNEZ R3, L1 1 0 0 0 01 1 0x0018 DADD R1, R0, R0 0 0 0 0

0x001C L1: DADDUI R3, R2, #-2 0 0 0 0 0x0020 BNEZ R3, L2 0 1 0 0 011 1 0x0024 DADD R2, R0, R0 0 0 0 0 0x0028 L2: DSUB R3, R1, R2 0 0 0 0

0x002C TAKEN BNEZ R3, L3 0 0 0 1 11 III. .no# 1 0x0030 … 0 0 0 0 0x0034 L3: … 0 0 0 0 0x0038 DADDI R4,ȼ R4, #-1 0 0 0 0 0x003C taken BNEZ R4, L0 #%0 0 %0 1 11 %:[ ne 1 0x0040 … 0 0 0 0 27 All NOT TAKEN EXEPT THE LAST ( aa = 2 (2,2) - Iteration 2 bb = 2 address Instruction 00 01 10 11 2-b sft reg misP. counter 0x0000 L0: … 0 0 0 0 11 … … 0 0 0 0 0x0010 DADDUI R3, R1, #-2 0 0 0 0 0x0014 ÷: BNEZ R3, L1 1 0 0 0 10 1 ÷: . 0x0018 DADD R1, R0, R0 0 0 0 0

0x001C L1: DADDUI R3, R2, #-2 0 0 0 0 BNEZ R3, L2 0 1 0 0 00 . 1 ¥ 0x0020 ÷ . Doesvùt stage 0x0024 DADD R2, R0, R0 0 - 0 0 0 0x0028 L2: DSUB R3, R1, R2 0 0 0 0 0x002C :-.¥ BNEZ R3, L3 0 0 0 1 00 . 1 :±:-. 0x0030taken … 0 0 0 0 0x0034 L3: … 0 0 0 0 0x0038 DADDI R4, R4, #-1 0 0 0 0 0x003C taken BNEZ÷% R4, L0 Qt÷161 0% 0 1 01 2 ¥ ȾȾ ÌIÌ Nott… 0 0 0 0 0x0040 ÷il % 28 All Token aa = 3 (2,2) - Iteration 3 bb = 4 address Instruction 00 01 10 11 2-b sft reg misP. counter 0x0000 L0: … 0 0 0 0 01O … … 0 0 0 0 0x0010 DADDUI R3, R1, #-2 0 0 0 0

" 0x0014 taken BNEZ R3, L1 1 01 0 0 11 2 ÌIÌ¥ 0x0018 DADD R1, R0, R0 0 0 0 0

0x001C L1: DADDUI R3, R2, #-2 0 0 0 0

' 0x0020 BNEZ R3, L2 0 1 0 1 11 2 t.IE 0x0024 DADD R2, R0, R0 0 0 0 0 0x0028 L2: DSUB R3, R1, R2 0 0 0 0

" ȼ ¥ 0x002C TAKEN BNEZ R3, L3 0 0 0 O2 ¢@11 2 ÌIGI 0x0030 … 0 0 0 0 0x0034 L3: … 0 0 0 0 0x0038 DADDI R4, R4, #-1 0 0 0 0 . 0x003C « BNEZ R4, L0 =1 0 0 2 11 3 :c÷ ¥ ÷ 0x0040 …tarì EÌÌ.0 0 0 0 . 29 All NOT TAKEN EXEPT THE LAST ( TAKE aa = 2 (2,2) - Iteration 4 bb = 2 address Instruction 00 01 10 11 2-b sft reg misP. counter 0x0000 L0: … 0 0 0 0 11 … … 0 0 0 0 0x0010 DADDUI R3, R1, #-2 0 0 0 0 L - 0x0014 BNEZ R3, L1 1 1 0 0 10 2 0x0018 DADD R1, R0, R0 0 0 0 0

0x001C L1: DADDUI R3, R2, #-2 0 0 0 0 0 Cayect 0x0020 BNEZ R3, L2 0 1 O0 1 00 2 Doeiùtsbo 0x0024 [email protected] R2, R0, R0 0 0 0 0 0x0028 L2: DSUB R3, R1, R2 0 0 0- 0 0 Cayect NOTTAKENȾ 0x002C BNEZ R3, L3% @0 0 0 2 00 2 Krishna 0x0030 Ⱦ… 0 0 0 0 0x0034 L3: … 0 0 0 0 0x0038 DADDI R4, R4, #-1 0 0 0- 0 " 0x003C TAKEN BNEZGf @R4,@ L0¢ :2 0 0 2 01 4 ÌIÈ¥ 0x0040 … 0 0 0 0 30 Token All aa = 3 (2,2) - Iteration 5 bb = 4 address Instruction 00 01 10 11 2-b sft reg misP. counter 0x0000 L0: … 0 0 0 0 010 … … 0 0 0 0 0x0010 DADDUI R3, R1, #-2 0 0 0 0 0x0014 BNEZ R3, L1 1 2 0 0 11 3 ÷¥= 0x0018 DADD R1, R0, R0 0 0 0 0

0x001C L1: DADDUI R3, R2, #-2 0 0 0 0

0x0020 BNEZ R3, L2 0 1 0 2 11 3 :#÷ . 0x0024 DADD R2, R0, R0 0 0 0 0 0x0028 L2: DSUB R3, R1, R2 0 0 0 0 0x002C BNEZ R3, L3 ok 0 -0 0 3 @11 2 correct 0x0030 … 0 0 ¢0 0 0x0034 L3: … 0 0 0 0 0x0038 DADDI R4, R4, #-1 0 0 0 0 ' ~ ÷§BNEZ÷ R4, L0 2 0 !0 3 11 4 correct 0x003C 0x0040 … 0 0 0 0 31 All NOT TAKEN EXEPT THE LAST ( TAKE aa = 2 (2,2) - Iteration 6 bb = 2 address Instruction 00 01 10 11 2-b sft reg misP. counter 0x0000 L0: … 0 0 0 0 11 … … 0 0 0 0 0x0010 DADDUI R3, R1, #-2 0 0 0 0 0x0014 BNEZ R3, L1 1 2 0 0 10 3 :-.÷ . 0x0018 DADD R1, R0, R0 0 0 0 0

0x001C L1: DADDUI R3, R2, #-2 0 0 0 0 0x0020 BNEZ R3, L2 0 1 0 2 00 " 3 ÷: . 0x0024 DADD R2, R0, R0 0 0 0 0 0x0028 L2: DSUB R3, R1, R2 0 0 0 0 carpa 0x002C BNEZ R3, L3 0 0 0 3 00 2 Doesvùtshge 0x0030 … -000 0 0 0 0x0034 L3: … ¥0Ⱦ0 0 0 0 0x0038 DADDI R4, R4, #-1 0÷ 0 -0 0 :O. 0x003C BNEZ R4, L0 3 0 :0 3 01 4 ! ÷±÷ :[Im 0x0040 … 0 0 0 0 32 All Token aa = 3 (2,2) - Iteration 7 bb = 4 address Instruction 00 01 10 11 2-b sft reg misP. counter 0x0000 L0: … 0 0 0 0 001 … … 0 0 0 0 0x0010 DADDUI R3, R1, #-2 0 0 0 0 0x0014 BNEZ R3, L1 1 3 0 0 11 . 3 :*: . 0x0018 DADD R1, R0, R0 0 0 0 0

0x001C L1: DADDUI R3, R2, #-2 0 0 0 0 0x0020 BNEZ R3, L2 0 1 0 3 11 3 :*: . 0x0024 DADD R2, R0, R0 0 0 0 0 0x0028 L2: DSUB R3, R1, R2 0 0 0 0 BNEZ R3, L3 0 0 0 3 11 2 0x002C :<: . 0x0030 … 0 0 0 0 0x0034 L3: … ¥-00 0 0 0 0x0038 DADDI R4, R4, #-1 0 0 0 0

=/ . ÷ ȼ 0x003C BNEZ R4, L0 Òó3 %0 0 3 11 4 ...÷ . 0x0040 … 0 0 0 0 33 All NOT TAKEN EXEPT THE LAST ( TAKEN ) aa = 2 (2,2) - Iteration 8 bb = 2 address Instruction 00 01 10 11 2-b sft reg misP. counter 0x0000 L0: … 0 0 0 0 O11 … … 0 0 0 0 0x0010 DADDUI R3, R1, #-2 0 0 0 0 0x0014 BNEZ R3, L1 1 3 0 0 10 o 3 Cayect ÷ Doeiùt stage 0x0018 DADD R1, R0, R0 0 0 0 0

0x001C L1: DADDUI R3, R2, #-2 0 0 0 0 0x0020 BNEZ R3, L2 0 1 0 3 00 : 3 ÷: . 0x0024 DADD R2, R0, R0 0 0 0 0 0x0028 L2: DSUB R3, R1, R2 0 0 0 0

o BNEZ R3, L3 =0 0 0 3 00 - 2 cara 0x002C Doesvùt stage … ¥400 0 0 0 0x0030 0x0034 L3: … 0 0 0 0 0x0038 DADDI R4, R4, IO#-1 0 0 -0 0 0x003C BNEZ R4, L0 3÷ %0 0 3 01 ⇒ 4 Iaea:[ 0x0040 … 0 0 0 0 34 All NOT TAKE aa = 2 (2,2) - Iteration 100 bb = 2 address Instruction 00 01 10 11 2-b sft reg misP. counter 0x0000 L0: … 0 0 0 0 11 … … 0 0 0 0 0x0010 DADDUI R3, R1, #-2 0 0 0 0

0x0014 BNEZ R3, L1 1 - 3 0 0 10 3 :*: . 0x0018 DADD R1, R0, R0 0 0 0 0

0x001C L1: DADDUI R3, R2, #-2 0 0 0 0 . 0x0020 BNEZ R3, L2 0 1 0 3 00 3 : 0x0024 .ie#/0.::::t:::n_:=*OIDADD R2, R0, R0 0 0 0 0 0x0028 L2: DSUB R3, R1, R2 0 0 0 0 0x002C BNEZ R3, L3 0 0 0 3 00 2 0x0030 … ȼ0 0 0 0 0x0034 L3: … 0 0 0 0 0x0038 DADDI R4, R4, #-1 0 0 0 0

. 0x003C BNEZ R4, L0 =/2 :0 0 3 00 5 :#÷ .

0x0040 … 0 0 0 0 Total mispredicted -13 35 (2,2) Example

Considering a (2,2) correlating predictor of 1K entries; and assuming that the processor executes the following code fragment, determine the (2,2) final state and calculate the misprediction ratio for the presented case. Case 1: -Program inputs for aa and bb alternate values different than 2 and equal to 2.

Misprediction ratio = Number of mispredicted branches / Total branches

Misprediton ratio = 13 / 400 Misprediton ratio = 3.25%

36

Questionzaffa 2 Considering the initial loop-based program, and assuming the following processor architecture for a superscalar MIPS64 processor implemented with multiple-issue and speculation: issue 2 instructions per clock cycle jump instructions require 1 issue handle 2 instructions commit per clock cycle timing facts for the following separate functional units: i. 1 Memory address 1 clock cycle ii. 1 Integer ALU 1 clock cycle iii. 1 Jump unit 1 clock cycle iv. 1 FP multiplier unit, which is pipelined: 8 stages v. 1 FP divider unit, which is not pipelined: 8 clock cycles vi. 1 FP Arithmetic unit, pipelined: 2 stages Branch prediction is always correct There are no cache misses There are 2 CDB (Common Data ).

Complete the table reported below showing the processor behavior for the 2 initial iterations of the loop.

# iteration Issue EXE MEM CDB x2 COMMIT x2 1 l.d f1,v1(r1) 1 l.d f2,v2(r1) 1 mul.d f5,f1,f2 1 s.d f5,v5(r1)

1 l.d f3,v3(r1) m 1 l.d f4,v4(r1) 1 div.d f6,f3,f4 1 s.d f6,v6(r1) 1 daddui r1,r1,8 1 daddi r2,r2,-1 1 bnez r2,loop 2 l.d f1,v1(r1) 2 l.d f2,v2(r1) 2 mul.d f5,f1,f2 2 s.d f5,v5(r1) 2 l.d f3,v3(r1) 2 l.d f4,v4(r1) 2 div.d f6,f3,f4 2 s.d f6,v6(r1) 2 daddui r1,r1,8 2 daddi r2,r2,-1 2 bnez r2,loop

Notes taken by Francesco Longo Slides provided by teachers' courses For more notes visit: www.france193.wordpress.com