
TDTS10

Erik Larsson, Department of Computer Science, Linköping University, Sweden

Outline

! Internal Structure of the CPU

[Figure: the CPU connected to an input device, main memory, an output device and secondary memory. The arrows are buses carrying data and control.]

Program execution

0000101110001011 = MOVE Y,R3
00001 - MOVE
01110001 - Address
011 - Reg3

(1) Get the instruction at 00001000
(2) Move the instruction 0000101110001011 to CPU
(3) Decode instruction
(4) Get the data at 01110001
(5) Store the data in register number 3

Registers

! Program Counter (PC): holds the address of the instruction to be fetched.
! Instruction Register (IR): holds the last instruction fetched.
! Memory Address Register (MAR): holds the address of a memory location that is to be read or written.
! Memory Buffer Register (MBR): holds the data to be written to memory or the data most recently read.
! Program Status Word (PSW): Condition Code Flags + other bits defining the status of the CPU (interrupt enabled/disabled, supervisor, etc.)
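The decode step (3) for the example word can be sketched in a few lines. The field widths (5-bit opcode, 8-bit address, 3-bit register) are read off the slide's example; the `OPCODES` table and the `decode` function are hypothetical names used only for illustration, and MOVE is the only encoding the slide gives.

```python
# Field widths taken from the slide's example word: 00001 | 01110001 | 011
OPCODE_BITS, ADDRESS_BITS, REGISTER_BITS = 5, 8, 3

# Hypothetical opcode table; only MOVE = 00001 appears on the slide.
OPCODES = {0b00001: "MOVE"}


def decode(word: str) -> dict:
    """Split a 16-bit instruction word (given as a bit string) into its fields."""
    assert len(word) == OPCODE_BITS + ADDRESS_BITS + REGISTER_BITS
    opcode = int(word[:OPCODE_BITS], 2)
    address = int(word[OPCODE_BITS:OPCODE_BITS + ADDRESS_BITS], 2)
    register = int(word[-REGISTER_BITS:], 2)
    return {"op": OPCODES.get(opcode, "?"), "address": address, "register": register}


fields = decode("0000101110001011")
print(fields)  # address is printed in decimal; 01110001 is 113
```

The address field 01110001 is the location of Y, and the register field 011 selects register 3.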

Registers

! The registers in the CPU - top level of the memory hierarchy
! User visible registers: can be accessed by assembly language programmers.
! Control and Status registers: used by the control unit to control the operation of the CPU; not directly accessible by the programmer.
! Many general purpose registers -> large number of bits for encoding register operands; specialization of registers reduces this need.
! Too few registers create problems for the programmer and lead to increased memory traffic.
! The number of general-purpose or data registers is often between 8 and 32.
! RISC processors often have a very large number of registers (~ 100)

Registers

! Some architectures provide a set of registers which can be used without restrictions as operands for any opcode and as address registers; these are the so-called general-purpose registers.
! Often the architecture creates a separation between:
  ! data registers: can be used to hold only data. Some architectures impose restrictions on the use of data registers: for example there can be disjoint sets of registers for integer and for floating point computation.
  ! address registers: registers used only for address representation and computation: base registers, index registers, stack pointer, etc. In some architectures address registers can be specialized for some of the previous functions.
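The remark that many general-purpose registers cost instruction bits can be made concrete. This is a minimal sketch assuming each register operand is encoded as a plain binary register number; the function name `register_field_bits` is illustrative only.

```python
def register_field_bits(num_registers: int) -> int:
    """Bits needed to name one of num_registers registers in an instruction."""
    if num_registers < 1:
        raise ValueError("need at least one register")
    # Smallest b with 2**b >= num_registers.
    return (num_registers - 1).bit_length()


for n in (8, 16, 32, 100):
    print(f"{n:3d} registers -> {register_field_bits(n)} bits per register operand")
```

With 8 registers an operand needs 3 bits (as in the 011 - Reg3 example), with 32 it needs 5, and with about 100 it needs 7. Splitting registers into specialized groups lets the opcode imply the group, so fewer bits are needed per operand.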

Some Examples of Register Organizations

! Z8000: 16 General purpose registers; no restrictions in use
! Intel 80X86, Pentium:
  ! 4 Data registers
  ! 4 Index & address registers
  ! 4 Base (segment) registers
  ! Some of the Address registers can also be used for general purpose
! PowerPC:
  ! 2 groups of General purpose registers, each of 32 registers; one group is for integer (fixed point) computation, the other one for floating point computation.

[Figure: CPU and main memory. Instr. 1, Instr. 2 and Instr. 3 are executed one after another, each as FI followed by EI.]

Each instruction takes T time; total time for 3 instructions is 3*T

Pipelining

[Figure: CPU and main memory. When the FI of one instruction is overlapped with the EI of the previous one, three instructions take 2*T instead of 3*T.]

FI: Fetch Instruction
DI: Decode Instruction
CO: Calculate operand
FO: Fetch Operand
EI: Execute Instruction
WO: Write Operand

Pipelining

Clock cycle   1   2   3   4   5   6   7   8   9   10  11  12
Instr. 1      FI  DI  CO  FO  EI  WO
Instr. 2          FI  DI  CO  FO  EI  WO
Instr. 3              FI  DI  CO  FO  EI  WO
Instr. 4                  FI  DI  CO  FO  EI  WO
Instr. 5                      FI  DI  CO  FO  EI  WO
Instr. 6                          FI  DI  CO  FO  EI  WO
Instr. 7                              FI  DI  CO  FO  EI  WO
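The gain from overlapping can be checked with a short calculation. This is a sketch under two assumptions that are not stated on the slides: every stage takes exactly one cycle and no hazards occur. The function names are illustrative.

```python
def sequential_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles when each instruction finishes all stages before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles for an ideal pipeline: one stage per cycle, no stalls."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles; each later one
    # completes one cycle after its predecessor.
    return n_stages + n_instructions - 1


# Two stages (FI, EI), three instructions: T = 2 cycles per instruction.
print(sequential_cycles(3, 2), pipelined_cycles(3, 2))   # 6 cycles = 3*T, 4 cycles = 2*T
# Six stages (FI DI CO FO EI WO), seven instructions.
print(sequential_cycles(7, 6), pipelined_cycles(7, 6))   # 42 cycles versus 12 cycles
```

The two-stage case reproduces the 3*T versus 2*T comparison, and the six-stage case matches the twelve clock cycles needed for seven instructions.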

Pipelining

! Instruction execution is complex and several operations are executed successively.
! This implies much hardware, but only one part of the hardware works at a given moment.
! In pipelining instructions are overlapped in execution but:
  ! no additional hardware
  ! different parts of the hardware work for different instructions
! The pipeline is similar to an assembly line:
  ! the work of an instruction is broken into smaller steps
  ! each step is a pipe stage
! The time required for moving an instruction from one stage to the next: a machine cycle (often this is one clock cycle). The execution of one instruction takes several machine cycles as it passes through the pipeline.

Pipelining

! After N-1 instructions, all N stages are working: now, the pipeline works providing maximal parallelism.
! Many stages provide better performance.
! However many stages:
  ! increases the overhead
  ! increases CPU complexity
  ! makes it difficult to keep pipeline full
! 80486 and Pentium:
  ! five-stage pipeline for integer instr.
  ! eight-stage pipeline for FP instr.
! PowerPC:
  ! four-stage pipeline for integer instr.
  ! six-stage pipeline for FP instr.

Program execution

0000101110001011 = MOVE Y,R3
00001 - MOVE
01110001 - Address
011 - Reg3

(1) - FI - Get the instruction at 00001000
(2) - FI - Move the instruction 0000101110001011
(3) - DI
(4) - FO - Get the data at 01110001
(5) - EI - WO - Store the data in register 3

Pipeline Hazards

! Structural hazards
! Data hazards
! Control hazards

Pipeline hazards prevent the next instruction from executing in the cycle in which it otherwise would. The instruction is said to be stalled. When an instruction is stalled, all instructions later in the pipeline than the stalled instruction are also stalled. Instructions earlier than the stalled one can continue. No new instructions are fetched during the stall.

Structural hazards

Structural hazards occur when a certain resource (memory, functional unit) is requested by more than one instruction at the same time.

Clock cycle     1   2   3   4   5   6   7   8   9   10  11  12
ADD R4, X       FI  DI  CO  FO  EI  WO
Instruction 2       FI  DI  CO  FO  EI  WO
Instruction 3           FI  DI  CO  FO  EI  WO
Instruction 4               FI  DI  CO  FO  EI  WO
Instruction 5                   FI  DI  CO  FO  EI  WO
Instruction 6                       FI  DI  CO  FO  EI  WO
Instruction 7                           FI  DI  CO  FO  EI  WO

Structural hazards

In cycle 4, ADD R4, X is in FO while Instruction 4 is in FI; the FI of Instruction 4 is repeated:

Clock cycle     1   2   3   4   5   6   7   8   9   10  11
ADD R4, X       FI  DI  CO  FO  EI  WO
Instruction 2       FI  DI  CO  FO  EI  WO
Instruction 3           FI  DI  CO  FO  EI  WO
Instruction 4               FI  FI  DI  CO  FO  EI  WO
Instruction 5                       FI  DI  CO  FO  EI  WO

Penalty: 1 cycle

Data hazards

Clock cycle             1   2   3   4   5   6   7   8
MUL R2,R3 // R2=R2*R3   FI  DI  CO  FO  EI  WO
ADD R1,R2 // R1=R1+R2       FI  DI  CO  FO  EI  WO
Instruction 3                   FI  DI  CO  FO  EI  WO

R2 needed before data is done!

Data hazards

MUL R2,R3 // R2=R2*R3
ADD R1,R2 // R1=R1+R2
Instruction 3

[Pipeline diagram: ADD R1,R2 is stalled until MUL R2,R3 has produced R2; Instruction 3 is stalled behind it.]

Penalty: 2 cycles

Data Hazards

Forwarding (bypassing) can handle some hazards. Skips WO.
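The stall length can be reproduced with a small timing model. The model is an assumption made for this sketch, not something the slides spell out: operands are read in FO, a result is usable in the cycle after the producer's WO, and with forwarding it is usable in the cycle after the producer's EI. The function name `stall_cycles` is illustrative.

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]


def stall_cycles(distance: int = 1, forwarding: bool = False) -> int:
    """Stall cycles for a consumer issued `distance` instructions after its producer.

    Assumed model: the consumer reads its operand in FO; the value becomes
    usable one cycle after the producer's WO (or EI when forwarding is used).
    """
    producer_start = 1                                   # producer's FI is in cycle 1
    ready_stage = "EI" if forwarding else "WO"
    produced_in = producer_start + STAGES.index(ready_stage)
    consumer_fo = producer_start + distance + STAGES.index("FO")
    return max(0, produced_in + 1 - consumer_fo)


print(stall_cycles(distance=1))                    # MUL R2,R3 directly followed by ADD R1,R2
print(stall_cycles(distance=1, forwarding=True))   # same pair with forwarding
print(stall_cycles(distance=3))                    # two unrelated instructions in between
```

Under these assumptions the MUL/ADD pair stalls for 2 cycles without forwarding and 1 cycle with it, and the hazard disappears once two independent instructions separate producer and consumer.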

Data hazards

MUL R2,R3 // R2=R2*R3
ADD R1,R2 // R1=R1+R2
Instruction 3

[Pipeline diagram: with forwarding, ADD R1,R2 is stalled for one cycle only; Instruction 3 is stalled behind it.]

Penalty: 1 cycle

Control hazards

BR - changes the value of the PC - makes the CPU execute instructions at another part of the program

instruction1
instruction2
BR target
instruction4
instruction5
instruction6
target: instruction7
instruction8

Control hazards

[Pipeline diagram: BR target passes through FI DI CO FO EI WO. instruction 4 is fetched behind it and abandoned once the target is known; instruction 7, at the target, is then fetched.]

Notes: Target known. Something is wrong. Control register + update PC!

Penalty: 3 cycles

Control Hazards

! Conditional branch

ADD R1,R2      R1 <- R1 + R2
BEZ TARGET     branch if zero
instruction i+1
------
TARGET ------

! Two alternatives
  ! branch is taken and
  ! branch is not taken.

How to solve conditional jumps?

Control hazards

Assumption: Branch is taken

[Pipeline diagram: ADD R1,R2, then BEZ TARGET, then the instruction at the target. Evaluation ok; target known. Something can be wrong.]

Penalty: 3 cycles

Control hazards

Assumption: Branch is not taken

[Pipeline diagram: ADD R1,R2, then BEZ TARGET, then Instruction i+1. Evaluation ok; target known. Something can be wrong.]

Penalty: 2 cycles

With conditional branch - penalty even if the branch has not been taken. This is because we have to wait until the branch condition is available.

Branch instructions represent a major problem in assuring an optimal flow through the pipeline. Several approaches have been taken for reducing branch penalties.

Reducing Pipeline Branch Penalties

! Branch instructions can dramatically affect pipeline performance.
! Some statistics:
  ! 20% - 35% of the instructions executed are branches
  ! ~ 65% of the branches actually take the branch
  ! Conditional branches are more frequent than unconditional ones
! Techniques:
  ! Delayed branch
  ! Branch prediction

Control operations (conditional and unconditional branch) are very frequent in current programs. Stopping the pipeline -> performance goes down.

Instruction Fetch Units and Instruction Queues

Most processors employ sophisticated fetch units that fetch instructions before they are needed and store them in a queue.

The fetch unit also has the ability to recognize branch instructions and to generate the target address. Thus, the penalty produced by unconditional branches can be drastically reduced: the fetch unit computes the target address and continues to fetch instructions from that address, which are sent to the queue. Thus, the rest of the pipeline gets a continuous stream of instructions, without stalling.

The rate at which instructions can be read (from the instruction cache) must be sufficiently high to avoid an empty queue.

With conditional branches penalties cannot be avoided. The branch condition, which usually depends on the result of the preceding instruction, has to be known in order to determine which instruction follows.

Delayed branching

! The idea is to let the CPU do some useful work during stalling
! The CPU always executes the instruction that follows after the branch and only then alters (if necessary) the sequence of execution.
! The instruction after the branch is said to be in the branch delay slot.
! For between 60% and 85% of branches, compilers find an instruction for the branch delay slot.

Assumption: Branch is taken

[Pipeline diagram: ADD R1,R2, then BEZ TARGET, then the instruction at the target.]

Penalty: 3 cycles

Assumption: Branch is not taken

[Pipeline diagram: ADD R1,R2, then BEZ TARGET, then Instruction i+1.]

Penalty: 2 cycles

Delayed branching

Assumption: Branch is taken

[Pipeline diagram: ADD R1,R2, then BEZ TARGET, then MUL R3,R4 in the delay slot, then the instruction at the target.]

Penalty: 2 cycles

Assumption: Branch is not taken

[Pipeline diagram: ADD R1,R2, then BEZ TARGET, then MUL R3,R4 in the delay slot, then Instruction i+1.]

Penalty: 1 cycle

Delayed branching

This is what the programmer has written:

MUL R3,R4
SUB #1,R2
ADD R1,R2
BEZ TAR
MOVE #10,R1
------
TAR ------

This is what the compiler (assembler) has produced and what actually will be executed:

SUB #1,R2
ADD R1,R2
BEZ TAR
MUL R3,R4
MOVE #10,R1
------
TAR ------

MUL will always be executed

Delayed branching

! What happens if the compiler cannot find an instruction to be moved into the branch delay slot?

MUL R2,R4
SUB #1,R2
ADD R1,R2
BEZ TAR
NOP
MOVE #10,R1
------
TAR ------

Branch Prediction

! In the last example we have considered that the branch will not be taken and we fetched the instruction following the branch; in the case the branch was taken the fetched instruction was discarded.
! As a result, we had a branch penalty of
  ! 1 - if the branch is not taken (prediction fulfilled)
  ! 2 - if the branch is taken (prediction not fulfilled)
! The reverse:
  ! 2 - if the branch is not taken (prediction not fulfilled)
  ! 1 - if the branch is taken (prediction fulfilled)
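The two cases can be compared by weighting the penalties with how often branches are taken. This is a sketch: the penalties 1 and 2 come from the slide above, the 65% figure is the statistic quoted earlier for taken branches, and treating that figure as the probability for every branch is a simplifying assumption. The function name is illustrative.

```python
def expected_penalty(p_taken: float, predict_taken: bool,
                     hit_penalty: int = 1, miss_penalty: int = 2) -> float:
    """Average penalty in cycles per branch for a fixed prediction."""
    p_hit = p_taken if predict_taken else 1.0 - p_taken
    return p_hit * hit_penalty + (1.0 - p_hit) * miss_penalty


P_TAKEN = 0.65  # "~ 65% of the branches actually take the branch"

print(f"predict not taken: {expected_penalty(P_TAKEN, predict_taken=False):.2f} cycles")
print(f"predict taken:     {expected_penalty(P_TAKEN, predict_taken=True):.2f} cycles")
```

Under these assumptions predicting "taken" costs about 1.35 cycles per branch against about 1.65 for "not taken", which is why the quality of the prediction matters.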

Branch prediction

! Correct branch prediction is very important and can produce substantial performance improvements.
! Speculative execution means that instructions are executed before the processor is certain that they are in the correct execution path.
! Branch prediction strategies:
  ! Static prediction
  ! Dynamic prediction

If it turns out that the prediction was correct, execution goes on without introducing any branch penalty. If, however, the prediction is not fulfilled, the instruction(s) started in advance and all their associated data must be purged and the state previous to their execution restored.

Notes: Pipeline can move on. Minimum of penalty. Page faults.

Static branch prediction

! Static prediction techniques do not take into consideration execution history.
! Static approaches:
  ! Predict never taken: assumes that the branch is not taken.
  ! Predict always taken: assumes that the branch is taken.
  ! Predict depending on the branch direction:
    ! predict branch taken for backward branches;
    ! predict branch not taken for forward branches.

Dynamic Branch Prediction

! Dynamic prediction techniques improve the accuracy of the prediction by recording the history of conditional branches.
! One bit is used in order to record the last execution
! The system predicts the same behavior as for the last time.

------
LOOP ------
------
BNZ LOOP
------

Static branch prediction

! From data collection, branch type instructions:
  ! 70% are branches
    ! Unconditional branches 40%
    ! Conditional branches 60%
  ! 10% loops
  ! 20% call/return

[Chart: branches 70%, loops 10%, procedure calls 20%]

! Prediction - branch taken
  ! Unconditional branches - 70*0.4 -> 28% correct
  ! Conditional branches - prediction correct in 60% - (70*0.6*0.6) -> 25% correct
  ! Loop - jump back 90% (10*0.9) -> 9% correct
  ! Call/return -> 20%
  ! Total: 82% overall correct prediction
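The 82% total can be recomputed from the shares on the slide. In this sketch the shares and the 90% loop figure are taken from the slide; the 0.6 hit rate for conditional branches is inferred from the slide's 25% result (70 * 0.6 * 0.6 = 25.2) and should be read as an assumption.

```python
# (share in % of all branch-type instructions, fraction predicted correctly)
mix = {
    "unconditional": (70 * 0.4, 1.0),   # always taken, so "predict taken" is right
    "conditional":   (70 * 0.6, 0.6),   # 0.6 inferred from the slide's 25% figure
    "loop":          (10, 0.9),         # jumps back 90% of the time
    "call/return":   (20, 1.0),
}

correct = {kind: share * rate for kind, (share, rate) in mix.items()}
total = sum(correct.values())

for kind, value in correct.items():
    print(f"{kind:14s} {value:5.1f}% correct")
print(f"{'total':14s} {total:5.1f}% correct")
```

The exact sum is 82.2%, which the slide rounds to 82%.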

Implementation of Branch Prediction

Branch History Table

Fields of a table entry: Valid bit | Branch instruction address | Target address | Prediction bits

Two-Bit Prediction Scheme

With a two-bit scheme predictions can be made depending on the last two instances of execution.

A typical scheme is to change the prediction only if there have been two incorrect predictions in a row.
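The difference between one and two prediction bits shows up on a loop branch such as BNZ LOOP. The sketch below uses a saturating counter, which is one common way to realize such a scheme and not necessarily the state diagram intended on the slide. With two bits and a counter in a strong state, two mispredictions in a row are needed before the prediction flips. The function name and the initial state are assumptions.

```python
def mispredictions(outcomes, bits: int) -> int:
    """Count mispredictions of a saturating-counter predictor with `bits` bits."""
    max_state = (1 << bits) - 1
    threshold = (max_state + 1) // 2   # states >= threshold predict "taken"
    state = max_state                  # assumption: start by strongly predicting taken
    misses = 0
    for taken in outcomes:
        if (state >= threshold) != taken:
            misses += 1
        # Move towards "taken" or "not taken", saturating at both ends.
        state = min(max_state, state + 1) if taken else max(0, state - 1)
    return misses


# A loop of four iterations, entered three times: the closing branch is
# taken three times and then falls through once.
pattern = [True, True, True, False] * 3

print("one bit :", mispredictions(pattern, bits=1))
print("two bits:", mispredictions(pattern, bits=2))
```

With one bit the predictor is wrong at every loop exit and again at the next loop entry (5 mispredictions here); with two bits only the loop exits are mispredicted (3).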

Branch History Table

History information can be used not only to predict the outcome of a conditional branch but also to avoid recalculation of the target address. Together with the bits used for prediction, the target address can be stored for later use in a branch history table.

! Address where to fetch from:
  ! If the branch instruction is not in the table, the next instruction (address PC+1) is to be fetched.
  ! If the branch instruction is in the table, first of all a prediction based on the prediction bits is made. Depending on the prediction outcome, the next instruction (address PC+1) or the instruction at the target address is to be fetched.
! Update entry: If the branch instruction has been in the table, the respective entry has to be updated to reflect the correct or incorrect prediction.
! Add new entry: If the branch instruction has not been in the table, it is added to the table with the corresponding information concerning branch outcome and target address. If needed, one of the existing table entries is discarded. Replacement algorithms similar to those for cache memories are used.
! Using dynamic branch prediction with history tables, up to 90% of predictions can be correct.
! Both Pentium and PowerPC 620 use speculative execution with dynamic branch prediction based on a branch history table.
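The lookup, update and replacement rules above can be sketched as a small table. Everything specific here is an assumption made for illustration: the class and method names, the two-bit saturating counter as prediction bits, the initial counter values, the table capacity and the FIFO replacement. Presence of an entry stands in for the valid bit.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    target: int    # stored target address
    counter: int   # prediction bits: 0..3, values >= 2 predict "taken"


class BranchHistoryTable:
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.entries = {}              # branch address -> Entry, oldest first

    def next_fetch_address(self, pc: int) -> int:
        """Address to fetch from after the instruction at pc."""
        entry = self.entries.get(pc)
        if entry is not None and entry.counter >= 2:
            return entry.target        # in the table and predicted taken
        return pc + 1                  # not in the table, or predicted not taken

    def update(self, pc: int, taken: bool, target: int) -> None:
        """Record the actual outcome of the branch at pc."""
        entry = self.entries.get(pc)
        if entry is None:
            if len(self.entries) >= self.capacity:
                del self.entries[next(iter(self.entries))]   # FIFO replacement
            self.entries[pc] = Entry(target, 2 if taken else 1)
        else:
            entry.counter = min(3, entry.counter + 1) if taken else max(0, entry.counter - 1)
            entry.target = target


bht = BranchHistoryTable()
print(bht.next_fetch_address(100))   # unknown branch: fetch PC+1
bht.update(100, taken=True, target=40)
print(bht.next_fetch_address(100))   # now predicted taken: fetch the stored target
```

Because the target address is stored next to the prediction bits, a predicted-taken branch can be followed without recalculating the target.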

Notes:
Internal - connects ALU, CU, registers
Registers - highest level in memory hierarchy
Register: some registers are visible; user-visible can be general purpose or specialized

Summary

! The main components of the CPU are: Control Unit, ALU and Register set.
! Instructions are executed by the CPU as a sequence of steps.
! Execution can be substantially accelerated by instruction pipelining.
! Keeping a pipeline at its maximal rate is prevented by hazards.
  ! Structural hazards are due to resource conflicts.
  ! Data hazards are produced by data dependencies between instructions.
  ! Control hazards are produced as a consequence of branch instructions.
! Branch instructions can dramatically affect pipeline performance. It is very important to reduce penalties.

Summary

! Instruction fetch units are able to recognize branch instructions and generate the target address. Fetching at a high rate from the instruction cache and keeping the instruction queue loaded, it is possible to reduce the penalty for unconditional branches to zero. For conditional branches this is not possible because we have to wait for the outcome of the decision.
! Delayed branching is a compiler based technique aimed to reduce the branch penalty by moving instructions into the branch delay slot.
! Efficient reduction of the branch penalty for conditional branches needs a clever branch prediction strategy. Static branch prediction does not take into consideration execution history. Dynamic branch prediction is based on a record of the history of a conditional branch. Branch history tables are used to store both information on the outcome of branches and the target address of the respective branch.

www.liu.se