ECE 486/586
Computer Architecture

Lecture # 16

Spring 2015

Portland State University

Lecture Topics

• Instruction-Level Parallelism
  – Dynamic Scheduling via Tomasulo Algorithm

Reference:
• Chapter 3: Section 3.4 and 3.5

Dynamic Scheduling

• Key Idea:
  – Hardware re-arranges the instruction execution to reduce stalls while maintaining data flow and exception behavior
• Advantages:
  – Code compiled with one pipeline in mind can be run efficiently on another pipeline => no need to recompile for a different microarchitecture
  – Handle dependences which are unknown at compile time
  – Enables the processor to tolerate unpredictable delays, such as cache misses
• Example:
      LW    R1, 0(R2)
      DADDU R3, R1, R4
      DSUBU R5, R2, R6
  – If LW misses in the cache, an in-order pipeline stalls (in-order issue and in-order execution)
  – Dynamic scheduling allows DSUBU to proceed (no dependence on LW)
• If we are to allow DSUBU to proceed, we need to separate issue into two parts:
  – Check for structural hazards
  – Await absence of any data hazards
• In-order issue but out-of-order execution (and completion)
• But out-of-order execution introduces the possibility of WAW & WAR hazards
• Previously we studied scoreboarding to allow out-of-order execution
• Today, we’ll look at Tomasulo’s algorithm (more sophisticated than scoreboard)

Tomasulo’s Algorithm

• Used in IBM 360/91
  – Employed in floating point unit
    • FP units were the major source of hazards at that time, since there were no caches
• Variations of Tomasulo algorithm are in use in modern processors
• Key common characteristics
  – Track instruction dependences to allow execution as soon as operands are available
  – Rename registers to avoid WAR and WAW hazards
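As a rough illustration of what "tracking instruction dependences" means for the LW / DADDU / DSUBU example, the sketch below finds the read-after-write (RAW) producers of each instruction. The tuple encoding of the program and the function name `raw_dependences` are assumptions made for this example only; real hardware tracks the same information with tags, not Python dictionaries.

```python
# Hypothetical encoding of the lecture's example: (opcode, destination, sources).
program = [
    ("LW",    "R1", ["R2"]),        # LW    R1, 0(R2)
    ("DADDU", "R3", ["R1", "R4"]),  # DADDU R3, R1, R4
    ("DSUBU", "R5", ["R2", "R6"]),  # DSUBU R5, R2, R6
]

def raw_dependences(instrs):
    """Return {consumer index: set of producer indices} for RAW dependences."""
    last_writer = {}  # register name -> index of the latest instruction writing it
    deps = {}
    for i, (_, dest, sources) in enumerate(instrs):
        # A source is a RAW dependence if an earlier instruction writes it.
        deps[i] = {last_writer[s] for s in sources if s in last_writer}
        last_writer[dest] = i
    return deps

deps = raw_dependences(program)
for i, (op, _, _) in enumerate(program):
    producers = sorted(program[j][0] for j in deps[i])
    print(op, "waits on", producers if producers else "nothing")
```

DADDU is the only instruction with a producer (LW), so DSUBU has nothing to wait for and can execute while the load is still outstanding.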

Tomasulo’s Algorithm

• Issue
  – Get next instruction from the head of the queue
  – If matching reservation station empty and operands available
    • Issue instruction to reservation station, indicating that all operands are available
  – If no empty reservation station, then structural hazard occurs
    • Stall instruction
  – If empty reservation station but operands not available, keep track of the functional units producing them
    • Issue instruction to reservation station, indicating FU that will provide operands
  – Effectively renames registers, eliminating WAR and WAW hazards
• Execute
  – If one or more operands unavailable, monitor CDB for them
  – When an operand becomes available, it is placed in corresponding reservation station
  – When all operands are available, operation can be executed at the functional unit
  – Several instructions could become ready in the same clock cycle for the same functional unit => Need a selection heuristic to choose which instruction to execute first
• Write Result
  – When result is available, broadcast it on the CDB
  – CDB communicates the result to the register file and all reservation stations
  – Stores write data to memory during this step

Load/Store Buffers

• Load Buffers
  – Hold components of effective address until it is computed
  – Track outstanding loads waiting on memory
  – Hold results of completed loads waiting on CDB
• Store Buffers
  – Hold components of effective address until it is computed
  – Hold destination addresses for stores waiting for data value
  – Hold address and value until memory is available

Load/Store Operations

• Proceed in two steps:
  – Compute effective address when base register is available
  – Effective address is placed in the load or store buffer
• Loads in load buffer execute as soon as memory is available
• Stores in store buffer must wait for value being stored
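The issue, execute, and write result steps of Tomasulo’s algorithm can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the IBM 360/91 design: the class `Station`, the station names `Add1`/`Add2`, the tag `Load1`, and all register values are hypothetical, timing is ignored, and the cache-missing LW is modelled only as a pending tag on R1.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Station:
    name: str                  # tag this station uses on the CDB
    busy: bool = False
    op: str = ""
    vj: Optional[int] = None   # operand values, once known
    vk: Optional[int] = None
    qj: Optional[str] = None   # tags of producing stations (None = value ready)
    qk: Optional[str] = None

regs = {"R1": 0, "R2": 100, "R3": 0, "R4": 7, "R5": 0, "R6": 30}
reg_stat = {"R1": "Load1"}     # assumption: LW R1 is still waiting on memory
stations = [Station("Add1"), Station("Add2")]

def issue(op, rd, rs, rt):
    """Issue step: return the station used, or None on a structural hazard."""
    free = next((s for s in stations if not s.busy), None)
    if free is None:
        return None            # no empty reservation station: stall
    free.busy, free.op = True, op
    # For each source, record either its value or the tag of its producer.
    free.qj = reg_stat.get(rs)
    free.vj = None if free.qj else regs[rs]
    free.qk = reg_stat.get(rt)
    free.vk = None if free.qk else regs[rt]
    reg_stat[rd] = free.name   # rename: rd now refers to this station's result
    return free

def ready(s):
    """Execute step precondition: all operands have arrived."""
    return s.busy and s.qj is None and s.qk is None

def execute(s):
    return s.vj + s.vk if s.op == "DADDU" else s.vj - s.vk

def write_result(tag, value):
    """Write result step: broadcast (tag, value) on the CDB."""
    for s in stations:
        if s.qj == tag:
            s.vj, s.qj = value, None
        if s.qk == tag:
            s.vk, s.qk = value, None
        if s.name == tag:
            s.busy = False     # the producing station is released
    for reg, producer in list(reg_stat.items()):
        if producer == tag:    # only the latest renamed writer updates the register
            regs[reg] = value
            del reg_stat[reg]

daddu = issue("DADDU", "R3", "R1", "R4")   # must wait for Load1
dsubu = issue("DSUBU", "R5", "R2", "R6")   # both operands already available
order = []
if ready(dsubu):                           # DSUBU executes ahead of DADDU
    write_result(dsubu.name, execute(dsubu))
    order.append("DSUBU")
write_result("Load1", 5)                   # the load finally returns the value 5
if ready(daddu):
    write_result(daddu.name, execute(daddu))
    order.append("DADDU")
print(order, regs["R5"], regs["R3"])
```

In this run DSUBU completes before DADDU even though it was issued later, which is the in-order issue, out-of-order execution behaviour described in the slides.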

Register Renaming via Tags

• Register names are effectively replaced with tags
  – Identify reservation station which will produce operand
  – Values broadcast on CDB include the tag
• Reservation stations with pending instructions awaiting operands monitor CDB for tags and values
• This allows direct communication of data between producer and consumer instructions => can read operands as soon as they become available
• It also enables WAR and WAW hazard avoidance (will discuss later)

Data Structures

• Reservation Station
  – Op: operation to perform on source operands
  – Qj, Qk: reservation stations that will produce source operands (0 => source operand already in Vj or Vk, or not required)
  – Vj, Vk: actual value of source operands if available
  – A: used for effective address calculation for loads/stores (initially holds immediate field of instruction, holds effective address once calculated)
  – Busy: indicates reservation station is busy and FU occupied
• Register Status:
  – Qi: reservation station that will write this register. If the value of Qi is blank, no currently active instruction is computing a result that is destined for this register

How Tomasulo differs from Scoreboard?

1) Data structures are distributed among reservation stations rather than centrally located
2) Common Data Bus (CDB) permits bypass rather than awaiting writes to register file
3) Scoreboard stalls instruction until all operands are available, while in Tomasulo’s algorithm instructions can read operands from CDB as they become available
   • This eliminates WAR hazards because an earlier instruction in a WAR dependence can read its operands directly from its producer, rather than reading them from the source register, which can be overwritten safely by the later instruction

4) Scoreboard stalls instructions to prevent WAW hazards, while Tomasulo’s algorithm effectively renames the output register and allows the instruction to proceed
   • Later instruction in a WAW dependence overwrites RegisterStatus[rd], even if an earlier instruction recorded its FU there
   • This is OK because subsequent instructions which needed it recorded the FU producing it
   • When FU of earlier instruction writes its result to CDB
     – Reservation stations needing it retrieve it
     – Register file not written, since RegisterStatus[rd] is no longer that FU
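The WAW and WAR cases above can be checked with a small model of the register status table. This is only a sketch under assumptions: the tags `Mult1`, `Add1`, `Add2`, the registers F2 and F6, the values, and the helper names are all hypothetical, and each reader is reduced to a single source operand.

```python
reg_file = {"F2": 3.0, "F6": 1.0}
reg_stat = {}    # register -> tag of the station that will write it (Qi)
waiting = {}     # reader name -> tag it is waiting for (Qj)
captured = {}    # reader name -> operand value it holds (Vj)

def issue_writer(tag, rd):
    # A later writer simply overwrites the entry: this is the renaming step.
    reg_stat[rd] = tag

def issue_reader(name, rs):
    if rs in reg_stat:
        waiting[name] = reg_stat[rs]   # remember the producer's tag
    else:
        captured[name] = reg_file[rs]  # copy the value at issue time

def broadcast(tag, value):
    # Readers waiting on this tag take the value straight from the CDB.
    for name, t in list(waiting.items()):
        if t == tag:
            captured[name] = value
            del waiting[name]
    # The register file is written only if this tag is still the latest writer.
    for reg, t in list(reg_stat.items()):
        if t == tag:
            reg_file[reg] = value
            del reg_stat[reg]

# WAW: Mult1 (earlier, slow) and Add1 (later, fast) both write F6.
issue_writer("Mult1", "F6")
issue_reader("consumer", "F6")   # needs Mult1's result, not Add1's
issue_writer("Add1", "F6")       # overwrites RegisterStatus[F6]
broadcast("Add1", 2.0)           # later instruction finishes first
broadcast("Mult1", 9.0)          # earlier result reaches only the consumer

# WAR: the reader copies F2 at issue, so a later write to F2 is harmless.
issue_reader("early_reader", "F2")
issue_writer("Add2", "F2")
broadcast("Add2", 8.0)

print(reg_file, captured)
```

After the run, F6 holds the later instruction's value while `consumer` still received the earlier one, and `early_reader` keeps the old F2 value even though F2 has since been overwritten.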