Branch Prediction and Dynamic Scheduling in Superscalar

CS 211: Branch Prediction for Superscalar Processors

Flow Path Model of Superscalars
[Figure: flow path model. Stages: FETCH, DECODE, EXECUTE, COMMIT. Blocks: I-cache, Branch Predictor, Instruction Buffer, Integer / Floating-point / Media / Memory units, Reorder Buffer (ROB), Store Queue, D-cache. Flows: Instruction Flow, Register Data Flow, Memory Data Flow.]

Instruction Fetch Buffer
[Figure labels: Fetch Unit, Out-of-order Core, Instruction Flow Bandwidth.]
- The fetch buffer smooths out the rate mismatch between fetch and execution
  • Neither the fetch bandwidth nor the execution bandwidth is consistent
- Fetch bandwidth should be higher than execution bandwidth
  • We prefer to have a stockpile of instructions in the buffer to hide cache miss latencies
  • This requires both raw cache bandwidth and control flow speculation

Instruction Cache Basics: Spatial Locality and Fetch Bandwidth
[Figure: instruction cache array indexed by PC = ..xxRRRCC00, with a row decoder (rows 000 to 111) and a multiplexer (columns 00 to 11) selecting Inst0 to Inst3. Example: 4 instructions per cache line.]

Instruction Decoding Issues
Primary tasks:
- Identify individual instructions
- Determine instruction types
- Detect inter-instruction dependences
Two important factors:
- Instruction set architecture
- Width of parallel pipeline

Intel Pentium Pro Fetch/Decode Unit
[Figure: x86 macro-instruction bytes from the IFU enter a 16-byte instruction buffer (with next address calculation) and feed the uROM and Decoders 0, 1, and 2, plus branch address calculation. Output labels: 4 uops, 1 uop, 1 uop, into a uop queue (6). Up to 3 uops are issued to dispatch.]

Instruction Flow: Control Flow
Throughput of early stages places an upper bound on the performance of subsequent stages.
Program control flow is represented by a Control Flow Graph (CFG)
- Nodes represent basic blocks of code
  • A basic block is a sequence of instructions with no incoming or outgoing branches
- Edges represent transfer of control flow from one block to another

IBM's Experience on Pipelined Processors [Agerwala and Cocke 1987]
Code characteristics (dynamic):
- loads: 25%
- stores: 15%
- ALU/RR: 40%
- branches: 20%
  • 1/3 unconditional (always taken)
  • 1/3 conditional taken
  • 1/3 conditional not taken
  • unconditional: 100% schedulable
  • conditional: 50% schedulable

Control Flow Graph
Shows possible paths of control flow through basic blocks.
[Figure: CFG with nodes BB 1, BB 2, BB 3, BB 4, BB 5 for the code below.]

  BB 1
    main: addi r2, r0, A
          addi r3, r0, B
          addi r4, r0, C
          addi r5, r0, N
          add  r10, r0, r0
          bge  r10, r5, end
  BB 2
    loop: lw   r20, 0(r2)
          lw   r21, 0(r3)
          bge  r20, r21, T1
  BB 3
          sw   r21, 0(r4)
          b    T2
  BB 4
    T1:   sw   r20, 0(r4)
  BB 5
    T2:   addi r10, r10, 1
          addi r2, r2, 4
          addi r3, r3, 4
          addi r4, r4, 4
          blt  r10, r5, loop
    end:

Control Dependence
- Node X is control dependent on Node Y if the computation in Y determines whether X executes

Mapping CFG to Linear Instruction Sequence
[Figure: a CFG with blocks A, B, C, D and two possible linear layouts of those blocks in memory.]

CFG and Branches
Basic blocks and their constituent instructions must be stored in sequential locations in memory
- In mapping a CFG to linear consecutive memory locations, additional unconditional branches must be added
Encountering branches (conditional and unconditional) at run time induces deviations from the implied sequential control flow and consequent disruptions to sequential fetching of instructions
- These disruptions cause stalls in the instruction fetch (IF) stage and reduce overall IF bandwidth

Branch Types and Implementation
Types of branches
- Conditional or unconditional?
- Subroutine call (aka link): needs to save PC?
- How is the branch target computed?
  • Static targets, e.g. immediate, PC-relative
  • Dynamic targets, e.g. register indirect
Conditional branch architectures
- Condition code 'N-Z-C-V', e.g. PowerPC
- General purpose register, e.g. Alpha, MIPS
- Special purpose register, e.g. Power's loop count

Branch Actions
When branches occur, disruption to IF occurs
For unconditional branches
- Subsequent instructions cannot be fetched until the target address is determined
For conditional branches
- Machine must wait for resolution of the branch condition
- If the branch is taken, it must also wait until the target address is computed
Branch instructions are executed by the branch functional unit
- Note: cost in superscalar/ILP processors = width (parallelism) x stall cycles
- 3 stall cycles on a 4-wide machine = 12 lost instruction slots

Condition Resolution
[Figure: pipeline (Fetch, Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Branch, Execute, Finish, Completion Buffer, Complete, Store Buffer, Retire) annotated with where the condition resolves: CC reg., GP reg. value comparison.]

Target Address Generation
[Figure: the same pipeline annotated with where the target is available: PC-relative, register indirect, register indirect with offset.]

What's So Bad About Branches?
Performance penalties
- Use up execution resources
- Fragmentation of I-cache lines
- Disruption of sequential control flow
  • Need to determine branch direction (conditional branches)
  • Need to determine branch target
Robs instruction fetch bandwidth and ILP

Branch Penalties
When a branch occurs, two parts are needed:
- Branch target address (BTA) has to be computed
- Branch condition resolution
Addressing modes affect the BTA delay
- PC-relative: BTA can be generated during the fetch stage, for a 1-cycle penalty
- Register indirect: BTA is generated after the decode stage (to access the register), for a 2-cycle penalty
- Register indirect with offset: 3-cycle penalty
Condition resolution delay depends on the method
- If condition code registers are used, the penalty is 2 cycles
- If the ISA permits comparison of two registers, the result comes from the ALU output, for a 3-cycle penalty
The penalty is the maximum of the penalties for condition resolution and BTA

What to Do with Branches
To maximize sustained instruction fetch bandwidth, the number of stall cycles in the fetch stage must be minimized
The primary aim of instruction flow techniques (branch prediction) is to minimize stall cycles and/or make use of these cycles to do useful work
- Note that there must be a mechanism to validate the prediction and to safely recover from misprediction

Riseman and Foster's Study
7 benchmark programs on the CDC-3600
Assume an infinite machine:
- Infinite memory and instruction stack, register file, function units
Consider only true dependences, at the data-flow limit
If bounded to a single basic block, i.e. no bypassing of branches ⇒ maximum speedup is 1.72
Suppose one can bypass conditional branches and jumps (i.e. assume the actual branch path is always known, so that branches do not impede instruction execution)

  Br. bypassed:   0      1      2      8      32     128
  Max speedup:    1.72   2.72   3.62   7.21   24.4   51.2

Determining Branch Direction
Problem: cannot fetch subsequent instructions until the branch direction is determined
Minimize penalty
- Move the instruction that computes the branch condition away from the branch (ISA & compiler)
Make use of penalty
- Bias for not-taken
- Fill delay slots with useful/safe instructions (ISA & compiler)
- Follow both paths of execution (hardware)
- Predict branch direction (hardware)

Determining Branch Target
Problem: cannot fetch subsequent instructions until the branch target is determined
Minimize delay
- Generate the branch target early in the pipeline
Make use of delay
- Bias for not-taken
- Predict branch target
PC-relative vs. register indirect targets

Keys to Branch Prediction
Target address generation ⇒ target speculation
- Access register
  • PC, GP register, link register
- Perform calculation
  • +/- offset, auto incrementing/decrementing
Condition resolution ⇒ condition speculation
- Access register
  • Condition code register, data register, count register
- Perform calculation
  • Comparison of data register(s)

Branch Target Speculation: Branch Target Buffer
Use a branch target buffer (BTB) to store previous branch target addresses
The BTB is a small fully associative cache
- Accessed during instruction fetch using the PC
The BTB can have three fields
- Branch instruction address (BIA)
- Branch target address (BTA)
- History bits
When the PC matches a BIA, there is a hit in the BTB
- A hit in the BTB implies the instruction being fetched is a branch instruction
- The BTA field can be used to fetch the next instruction if that branch is predicted to be taken
- Note: the branch instruction is still fetched and executed for validation/recovery

Branch Condition Speculation
Biased for not taken
- Does not affect the instruction set architecture
- Not effective in loops
Software prediction
- Encode an extra bit in the branch instruction
  • Predict not taken: set bit to 0
  • Predict taken: set bit to 1
- Bit set by compiler or user; can use profiling
- Static prediction, same behavior every time
Prediction based on branch offsets
- Positive offset: predict not taken
- Negative offset: predict taken
Prediction based on history

History-Based Prediction
Make the prediction based on previous observation
- The guiding assumption is that historical information on the direction a branch took in previous executions gives helpful hints on the direction it will take in future executions
How much history? What prediction?
Finite state machine algorithm
- n state variables encode the direction taken by the last n executions of the branch
  • Each state represents a particular history pattern in terms of taken/not-taken (T/NT)
  • Output logic generates the prediction based on the history
- When the predicted branch is finally executed, use the actual outcome to transition to the next state
- Next-state logic: chain the state variables into a shift register

Branch Instruction Speculation
[Figure: the branch predictor (using a BTB) is indexed by the fetch PC and supplies a speculative target and a speculative condition. The prediction drives a mux that selects the nPC sent to the I-cache: either the speculative target or nPC(seq.) = PC+4, so nPC = BP(PC). The pipeline below (Fetch, Decode Buffer, Decode, Dispatch Buffer, Dispatch, ...) sends BTB updates (target address and history) back to the predictor.]

Branch Target Buffer (BTB)
A small "cache-like" memory in the instruction fetch stage
[Figure: the current PC is compared against BTB entries with fields Branch Inst. Address (tag), Branch History, Branch Target (Most Recent).]
Remembers
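The penalty rules from the branch penalties slide can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the cycle counts are the ones quoted in the slides, while the function names, the dictionary keys, and the zero-cycle entry for unconditional branches are assumptions made for illustration.

```python
# Stall penalties in cycles, as quoted in the slides.
BTA_PENALTY = {
    "pc_relative": 1,               # BTA generated during fetch
    "register_indirect": 2,         # BTA generated after decode
    "register_indirect_offset": 3,  # register access plus add
}
COND_PENALTY = {
    "unconditional": 0,     # assumption: no condition to resolve
    "condition_code": 2,    # condition code register
    "register_compare": 3,  # two-register compare through the ALU
}


def branch_penalty(target_mode, cond_mode):
    """Stall cycles: the max of the BTA and condition resolution penalties."""
    return max(BTA_PENALTY[target_mode], COND_PENALTY[cond_mode])


def lost_slots(width, stall_cycles):
    """Instruction slots lost on a machine of the given width."""
    return width * stall_cycles


if __name__ == "__main__":
    stalls = branch_penalty("pc_relative", "register_compare")
    print(stalls, lost_slots(4, stalls))
```

With a PC-relative target and a register compare, the condition dominates, so the branch stalls for 3 cycles, which on a 4-wide machine matches the 12 lost slots in the slides' example.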
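The BTB behaviour described in the slides (lookup by PC during fetch, three fields per entry, update after the branch executes) can be sketched as a small software model. Several details here are assumptions the slides do not state: a single history bit recording the last outcome, allocation only on taken branches, FIFO replacement, 4-byte instructions, and the class and method names themselves.

```python
from collections import OrderedDict


class BranchTargetBuffer:
    """Toy fully associative BTB: maps BIA to [BTA, last_taken]."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order gives FIFO eviction

    def next_pc(self, pc):
        """nPC = BP(PC): speculative target on a predicted-taken hit."""
        entry = self.entries.get(pc)
        if entry is not None and entry[1]:
            return entry[0]  # hit, and history predicts taken
        return pc + 4        # miss or predicted not taken: sequential fetch

    def update(self, pc, taken, target):
        """Called once the branch has executed, for validation/recovery."""
        if pc in self.entries:
            # Keep the old target if the branch fell through this time.
            bta = target if taken else self.entries[pc][0]
            self.entries[pc] = [bta, taken]
        elif taken:
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict the oldest entry
            self.entries[pc] = [target, True]


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(hex(btb.next_pc(0x100)))   # cold miss: sequential
    btb.update(0x100, True, 0x80)
    print(hex(btb.next_pc(0x100)))   # hit: speculative target
```

A dictionary stands in for the associative tag match; real hardware compares the PC against all BIA tags in parallel.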
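The finite state machine view of history-based prediction can be illustrated with a shift register of the last n outcomes. The sketch below is one possible instance under stated assumptions: the slides do not specify the output logic, so a majority vote (ties predict taken) is assumed here, as are the all-not-taken initial state and the names `HistoryPredictor` and `count_mispredictions`.

```python
class HistoryPredictor:
    """n state variables chained as a shift register of T/NT outcomes."""

    def __init__(self, n=2):
        self.n = n
        self.history = [False] * n  # oldest first; assumed initial state

    def predict(self):
        # Assumed output logic: majority vote, ties predict taken.
        return 2 * sum(self.history) >= self.n

    def update(self, taken):
        # Next-state logic: shift out the oldest outcome, shift in the newest.
        self.history = self.history[1:] + [taken]


def count_mispredictions(predictor, outcomes):
    """Predict each branch execution, then train on the actual outcome."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


if __name__ == "__main__":
    # A loop-closing branch: taken 9 times, then falls through once.
    loop = [True] * 9 + [False]
    p = HistoryPredictor(n=2)
    print(count_mispredictions(p, loop))  # first pass through the loop
    print(count_mispredictions(p, loop))  # second pass, predictor warmed up
```

For this loop pattern the model mispredicts on the first iteration and on the loop exit during the first pass, and only on the loop exit once warmed up, which is the case where a fixed not-taken bias does poorly.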
