CSE 586 Computer Architecture
Lecture 4
Jean-Loup Baer
http://www.cs.washington.edu/education/courses/586/00sp

Highlights from last week

• ILP: where can the compiler optimize
  – Loop unrolling and software pipelining
  – (we’ll see predication today)
• ILP: Dynamic scheduling in a single issue machine
  – Scoreboard
  – Tomasulo’s algorithm

CSE 586 Spring 00

Highlights from last week (c’ed) – Scoreboard

• The scoreboard keeps a record of all data dependencies
• The scoreboard keeps a record of all functional unit occupancies
• The scoreboard decides if an instruction can be issued
• The scoreboard decides if an instruction can store its result
• Implementation-wise, the scoreboard keeps track of which registers are used as sources and destinations and which functional units use them

Highlights from last week (c’ed) – Tomasulo’s algorithm

• Decentralized control
• Use of reservation stations to buffer and/or rename registers (hence gets rid of WAW and WAR hazards)
• Results – and their names – are broadcast to reservation stations
• Instructions are issued in order but can be dispatched, executed and completed out-of-order
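The tag-based renaming in the Tomasulo bullets can be sketched in a few lines. This is a toy model under assumed names (an `RS` class with `vj`/`qj`-style fields, an `issue` and a `broadcast` helper), not the full algorithm: there is no functional-unit timing and no load/store handling.

```python
# Toy sketch: the register status table names the reservation station that
# will produce each register, so a later writer simply overwrites the name,
# which is how WAW and WAR hazards disappear.

class RS:
    def __init__(self, name):
        self.name = name          # tag broadcast on the result bus
        self.vj = self.vk = None  # operand values once available
        self.qj = self.qk = None  # producing-station tags still awaited

def issue(rs, regs, status, src1, src2, dest):
    for field, reg in (("j", src1), ("k", src2)):
        if status.get(reg):                   # operand still being computed
            setattr(rs, "q" + field, status[reg])
        else:                                 # operand ready in register file
            setattr(rs, "v" + field, regs[reg])
    status[dest] = rs.name                    # rename: dest now owned by rs

def broadcast(tag, value, stations, regs, status):
    # Common data bus: every waiting station (and the register file) snoops.
    for rs in stations:
        if rs.qj == tag: rs.vj, rs.qj = value, None
        if rs.qk == tag: rs.vk, rs.qk = value, None
    for reg, owner in list(status.items()):
        if owner == tag:
            regs[reg] = value
            del status[reg]
```

A second instruction that reads a pending result simply issues with a tag in place of a value and wakes up on the broadcast.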


Highlights from last week (c’ed)

• Register renaming: avoids WAW and WAR hazards
• Is performed at “decode” time to rename the result register
• Two basic implementation schemes
  – Have a separate physical register file
  – Use of reorder buffer and reservation stations (cf. Tomasulo algorithm extended implementation)
  – Often a mix of the two (cf. description of MIPS 10000 in Smith and Sohi)

Multiple Issue Alternatives

• Superscalar (hardware detects conflicts)
  – Statically scheduled (in order dispatch and hence execution; cf. DEC Alpha 21164)
  – Dynamically scheduled (in order issue, out of order dispatch and execution; cf. MIPS 10000, IBM Power PC 620 and Intel Pentium Pro)
• VLIW – EPIC (Explicitly Parallel Instruction Computing)
  – Compiler generates “bundles” of instructions that can be executed concurrently (cf. Intel Merced)
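A minimal sketch of the first scheme (separate physical register file). The structure is assumed for illustration: a map table from architectural to physical registers and a counter standing in for the free list. Note that sources are looked up before the destination is remapped.

```python
# Hedged sketch of rename-at-decode with a separate physical register file:
# each architectural destination gets a fresh physical register, so two
# writes to R1 (WAW) or a late read of R1 (WAR) can no longer conflict.

def rename(instrs, num_arch_regs=32):
    """instrs: list of (dest, src1, src2) architectural register numbers."""
    map_table = {r: r for r in range(num_arch_regs)}  # arch -> physical
    next_phys = num_arch_regs                         # free-list stand-in
    renamed = []
    for dest, s1, s2 in instrs:
        p1, p2 = map_table[s1], map_table[s2]         # read sources first
        map_table[dest] = next_phys                   # then allocate dest
        renamed.append((next_phys, p1, p2))
        next_phys += 1
    return renamed

# Two writes to R1: after renaming they target distinct physical registers,
# while the true (RAW) dependence through R1 is preserved via physical reg 32.
out = rename([(1, 2, 3), (4, 1, 5), (1, 6, 7)])
```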

Multiple Issue for Static/Dynamic Scheduling

• Issue in order
  – Otherwise bookkeeping too complex (the old “data flow” machines could issue any ready instruction in the whole program)
  – Check for structural hazards; if any, stall
• Dispatch for static scheduling
  – Check for data dependencies; stall adequately
  – Can take forwarding into account
• Dispatch for dynamic scheduling
  – Dispatch out of order (reservation stations, instruction window)
  – Requires possibility of dispatching concurrently dependent instructions (otherwise little benefit over static sched.)

Impact of Multiple Issue on IF

• IF: Need to fetch more than 1 instruction at a time
  – Simpler if instructions are of fixed length
  – In fact need to fetch as many instructions as the issue stage can handle in one cycle (otherwise the issue stage will stall)
• Simpler if restricted not to overlap I-cache lines
  – But with branch prediction and superblocks, this is not realistic, hence the introduction of (instruction) fetch buffers
  – Always attempt to keep at least as many instructions in the fetch buffer as can be issued in the next cycle (BTB’s help for that)
  – For example, have an 8-wide instruction buffer for a machine that can issue 4


Stalls at the IF Stage

• Instruction buffer is full
  – Most likely there are stalls in the stages downstream
• Branch misprediction
• Instructions are stored in several I-cache lines
  – In one cycle one I-cache line can be brought into the fetch buffer
  – A basic block might start in the middle (or end) of an I-cache line
  – Requires several cache lines to fill the buffer
  – The ID (issue-dispatch) stage will stall if not enough instructions in the fetch buffer
• Instruction cache miss

Sample of Current Micros

• Two instruction issue: Alpha 21064, Sparc 2, Pentium, Cyrix
• Three instruction issue: Pentium Pro (but 5 uops from IF/ID to EX; AMD has 4 uops)
• Four instruction issue: Alpha 21164, Alpha 21264, Power PC 620, Sun UltraSparc, HP PA-8000, MIPS R10000
• Many papers written in 1995-98 predicted 16-way issue by 2000. We are still at 4!


The Decode Stage (simple case: dual issue and static scheduling)

• ID = Issue + Dispatch
• Look for conflicts between the (say) 2 instructions
  – If one integer unit and one f-p unit, only check for structural hazard, i.e. the two instructions need the same f-u (easy to check with opcodes)
  – Slight difficulty for integer ops that are f-p load/store/move (potential multiple accesses to f-p register file; solution: provide additional ports)
  – RAW dependencies resolved as in single pipelines
  – Note that the load delay (assume 1 cycle) can now delay up to 3 instructions, i.e., 3 issue slots are lost

Decode in Simple Multiple Issue Case

• If instructions i and i+1 are fetched together and:
  – Instruction i stalls, instruction i+1 will stall
  – Instruction i is dispatched but instruction i+1 stalls (e.g., because of a structural hazard = need the same f-u), instruction i+2 will not advance to the issue stage. It will have to wait till both i and i+1 have been dispatched

Alpha 21064

[Figure: front-end pipeline S0–S3 (Fetch, Swap, Decode, Issue) common to all instructions, feeding the FP, load-store, and integer pipes. Once past the 4th stage, no stalls. In blue, a subset of the 38 bypasses (forwarding paths). Branch prediction happens during “swap”; structural and data hazards are checked during issue. Alpha 21064: 2-way issue; Alpha 21164: 4-way issue (more pipes).]

• IF – S0: Access I-cache
  – Prefetcher fetches 2 instructions (8 bytes) at a time
• Swap stage – S1:
  – Prefetcher contains branch prediction logic tested at this stage: 4-entry return stack; 1 bit/instruction in the I-cache + static prediction BTFNT
  – Initial decode yields 0, 1 or 2 instruction potential issue; align instructions depending on the functional unit they are headed for
• End of decode: S2
  – Check for WAW and WAR (my guess)


Alpha 21064 (c’ed)

• Instruction Issue: S3
  – Check for RAW; forwarding etc.
• Conditions for 2 instruction issue (S2 and S3)
  – The first instruction must be able to issue (in order execution)
  – Load/store can issue with an operate except stores cannot issue with an operate of different format (share the same result bus)
  – An integer op. can issue with a f-p op.
  – A branch can issue with a load/store/operate (but not with stores of the same format)

Alpha 21164

• Main differences with 21064 (besides caches)
  – Up to 4 instructions issued/cycle
  – Two integer units; two f-p units (one add, one multiply; divide can be concurrent with add)
  – Slightly different execution pipe organizations
• Still common trunk of 4 stages
  – S0: Access I-cache. The instructions are predecoded (determination of whether the instruction is a branch – used in S1 – and of the pipeline executing the instruction – used in S2)
  – S1: Branch prediction (2-bit saturating counters in the I-cache associated with each instruction). Buffer 4 instructions for the next stage


Alpha 21164 (c’ed)

• Still common trunk of 4 stages (c’ed)
  – S2: Slot-swap instructions so that they are headed for the right pipeline. If functional unit conflicts, stall previous stages until all four are gone
  – S3: Check for WAW hazards. Read integer register file. Stall if results are not ready.

Pentium

• Recall dual integer pipeline and a f-p pipeline
• Decode 2 consecutive instructions I1 and I2. If both are intended for the integer pipes, issue both iff
  – I1 and I2 are “simple” instructions (no …)
  – I1 is not a jump instruction
  – No WAR and WAW hazard between I1 and I2 (I1 precedes I2)
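The pairing test above can be written down directly. The instruction representation (dicts with `simple`/`jump`/`dest`/`srcs` fields) is an assumption for illustration; only the three conditions listed on the slide are checked.

```python
# Sketch of the Pentium dual-pipe pairing test as described above.
# Register names are illustrative.

def can_pair(i1, i2):
    """Each instruction: dict with 'simple' (bool), 'jump' (bool),
    'dest' (register name) and 'srcs' (list of register names)."""
    if not (i1["simple"] and i2["simple"]):
        return False                      # both must be simple
    if i1["jump"]:
        return False                      # I1 may not be a jump
    if i2["dest"] in i1["srcs"]:
        return False                      # WAR: I2 writes what I1 reads
    if i1["dest"] == i2["dest"]:
        return False                      # WAW between I1 and I2
    return True
```

When `can_pair` fails, only I1 issues and I2 retries with its successor the next cycle.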

CSE 586 Spring 00 17 CSE 586 Spring 00 18 The Decode Stage (dynamic scheduling) Stalls in Decode (issue/dispatch) Stage

• Decode means: • There can be several instructions ready to be dispatched in – Dispatch to either same cycle to same functional unit • A FIFO queue associated with each functional unit (not done any • There might not be enough bus/ports to forward values to more) all the reservation stations that need them in the same cycle • A centralized instruction window common to all functional units (Pentium Pro and Pentium III -- I think) • Reservation stations associated with functional units (MIPS 10000, AMD K5, IBM Power PC 620) – Rename registers (if supported by architecture) – Set up entry at tail of reorder buffer (if supported by architecture) – Issue operands, when ready, to functional unit


The Execute Stage

• Use of forwarding in the case of static scheduling
• Use of broadcast bus and reservation stations for dynamic scheduling
• We’ll talk at length about memory operations (load-store) when we study memory hierarchies

The Commit Step (in-order completion)

• Recall: need of a mechanism (reorder buffer) to:
  – “Complete” instructions in order. This commits the instruction. Since this is a multiple issue machine, it should be able to commit (retire) several instructions per cycle
  – Know when an instruction has completed non-speculatively, i.e., what to do with branches
  – Know whether the result of an instruction is correct, i.e., what to do with exceptions
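The in-order commit requirement can be sketched as follows; the ROB entry format and the commit width of 4 are illustrative assumptions.

```python
# Illustrative in-order retirement from a reorder buffer (ROB): only
# completed instructions at the head may commit, up to `width` per cycle.
from collections import deque

def commit(rob, width=4):
    """rob: deque of dicts with 'done' (bool) and 'result'. Returns the
    results committed this cycle, in program order."""
    retired = []
    while rob and rob[0]["done"] and len(retired) < width:
        retired.append(rob.popleft()["result"])
    return retired

rob = deque([{"done": True, "result": "i0"},
             {"done": True, "result": "i1"},
             {"done": False, "result": "i2"},   # still executing: blocks i3
             {"done": True, "result": "i3"}])
# commits i0 and i1 only; i3 must wait even though it finished early
```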


Power PC 620 (see figure in book)

• Issue stage. Up to 4 instructions issued/cycle except if structural hazards such as:
  – No reservation station available
  – No rename register available
  – Reorder buffer is full
  – Two operations (e.g., forwarding operands) for the same unit. Only one write port/set of reservation stations for each unit.
  – Miscellaneous structural hazards, e.g., too many concurrent reads to the register file
• First 3 due to “the program”, last 2 to the implementation

Pentium Pro

• Fetch-Decode unit
  – Transforms instructions into micro-operations (uops) and stores them in a global reservation table (instruction window). Does register renaming (RAT = register alias table)
• Dispatch (aka issue) unit
  – Issues uops to functional units that execute them and temporarily store the results (the reservation table is 5-ported, hence 5 uops can be issued concurrently)
• Retire unit
  – Commits the instructions in order (up to 3 commits/cycle)

The 3 units of the Pentium Pro are “independent” and communicate through the instruction pool

[Figure: Fetch/Decode unit, Dispatch/Execute unit, and Retire unit all connected to a shared instruction pool]

Impact on Branch Prediction and Completion

• When a conditional branch is decoded:
  – Save the current physical-logical mapping
  – Predict and proceed
• When branch is ready to commit (head of buffer)
  – If prediction correct, discard the saved mapping
  – If prediction incorrect
    • Flush all instructions following the mispredicted branch in the reorder buffer
    • Restore the mapping as it was before the branch as per the saved map
• Note that there have been proposals to execute both sides of a branch using register shadows – limited to one extra set of registers
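A sketch of the save/restore-the-mapping steps above, assuming a simple dict-based rename map and one checkpoint per in-flight branch (real designs bound the number of checkpoints).

```python
# Checkpoint the logical->physical map at each predicted branch; on a
# mispredict, discard the younger map and reinstate the checkpoint.

class Renamer:
    def __init__(self, mapping):
        self.mapping = dict(mapping)
        self.checkpoints = []            # one saved map per in-flight branch

    def on_branch(self):
        self.checkpoints.append(dict(self.mapping))  # save current mapping

    def on_commit_correct(self):
        self.checkpoints.pop(0)          # oldest branch resolved: discard

    def on_mispredict(self):
        # restore the map as it was before the mispredicted branch
        self.mapping = self.checkpoints.pop(0)
        self.checkpoints.clear()         # younger checkpoints are wrong-path

r = Renamer({"R1": 10})
r.on_branch()                # predict and proceed
r.mapping["R1"] = 11         # wrong-path instruction renamed R1
r.on_mispredict()            # flush: R1 maps to 10 again
```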


Exceptions

• Instructions carry their exception status
• When an instruction is ready to commit
  – No exception: proceed normally
  – Exception:
    • Flush (as in mispredicted branch)
    • Restore mapping (more difficult than with branches because the mapping is not saved at every instruction; this method can also be used for branches)

Limits to Hardware-based ILP

• Inherent lack of parallelism in programs
  – Partial remedy: loop unrolling and other compiler optimizations
  – Branch prediction to allow earlier issue and dispatch
• Complexity in hardware
  – Needs large bandwidth for instruction fetch (might need to fetch from more than one I-cache line in one cycle)
  – Requires large register bandwidth (multiported register files)
  – Forwarding/broadcast requires “long wires” (long wires are slow) as soon as there are many units


Limits to Hardware-based ILP (c’ed)

• Difficulties specific to the implementation
  – More possibilities of structural hazards (need to encode some priorities in case of conflict in resource allocations)
  – Parallel search in reservation stations, reorder buffer etc.
  – Additional state savings for branches (mappings), more complex updating of BPT’s and BTB’s
  – Keeping precise exceptions is more complex

A (naïve) Primer on VLIW - EPIC

• Disclaimer: Some of the next few slides are taken (and slightly edited) from an Intel-HP presentation (see Outline for a reference)
• VLIW is a direct descendant of horizontal microprogramming
  – Two commercially unsuccessful machines: Multiflow and Cydrome
• Compiler generates instructions that can execute together
  – Instructions executed in order and assumed to have a fixed latency
• Difficulties occur with:
  – Branch prediction -> Use of predication
  – Pointer-based computations -> Use cache hints and speculative loads
  – Unpredictable latencies (e.g., cache misses)

IA-64: Explicitly Parallel Architecture

[Figure: a 128-bit bundle = Instruction 2 | Instruction 1 | Instruction 0 (41 bits each) + Template (5 bits); example template: Memory (M), Memory (M), Integer (I) = (MMI)]

• IA-64 template specifies
  – The type of operation for each instruction: M=Memory, F=Floating-point, I=Integer, L=Long Immediate, B=Branch
    • MFI, MMI, MII, MLI, MIB, MMF, MFB, MMB, MBB, BBB
  – Intra-bundle relationship (M / MI or MI / I)
  – Inter-bundle relationship
• Most common combinations covered by templates
  – Headroom for additional templates
• Simplifies hardware requirements
• Scales compatibly to future generations

IA-64 Architecture: Explicit Parallelism

[Figure: original source code is compiled into parallel machine code; the compiler and the hardware’s multiple functional units share the scheduling work]

• IA-64 compiler views a wider scope: more efficient use of execution resources
• Fundamental design philosophy enables new levels of headroom; basis for increased parallelism
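The bundle layout can be decoded with a little bit arithmetic. The placement below (template in the low 5 bits, slot 0 next) matches common descriptions of the IA-64 format but is stated here as an assumption; the point is simply that 3 × 41 + 5 = 128.

```python
# Hedged sketch of pulling apart a 128-bit IA-64 bundle as the slide
# describes it: a 5-bit template plus three 41-bit instruction slots.

SLOT_MASK = (1 << 41) - 1

def split_bundle(bundle):
    """bundle: 128-bit integer. Returns (template, [slot0, slot1, slot2])."""
    template = bundle & 0x1F                  # low 5 bits
    slots = [(bundle >> (5 + 41 * i)) & SLOT_MASK for i in range(3)]
    return template, slots

def make_bundle(template, slots):
    b = template & 0x1F
    for i, s in enumerate(slots):
        b |= (s & SLOT_MASK) << (5 + 41 * i)
    return b

tmpl, slots = split_bundle(make_bundle(0b10001, [1, 2, 3]))
```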

Merced/Itanium implementation (?)

• Can execute 2 bundles (6 instructions) per cycle
• 10 stage pipeline
• 4 integer units (2 of them can handle load-store), 2 f-p units and 3 branch units
• Issue in order, execute in order but can complete out of order. Uses a (restricted) register scoreboard technique to resolve dependencies.

Merced/Itanium implementation (?)

• Predication reduces the number of branches and the number of mispredicts
• Nonetheless: sophisticated branch prediction
  – Compiler hints: the BPR instruction provides “easy” to predict branch addresses; reduces the number of entries in the BTB
  – Two-level hardware prediction SAs(4,2): 512-entry local history table, 4-way set-associative, indexing 128 PHTs – one per set – each with 16 entries of 2-bit saturating counters. Number of bubbles on a predicted taken branch: 2 or 3
  – And a 64-entry BTB (only 1 bubble)
  – Mispredicted branch penalty: 9 cycles
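A rough, scaled-down model of the two-level scheme: a per-branch local history register indexes a table of 2-bit saturating counters. The set-associative organization and the real table sizes of the SAs(4,2) predictor are omitted; sizes here are toy values.

```python
# Toy two-level local-history predictor in the spirit of SAs(4,2).

class TwoLevel:
    def __init__(self, hist_bits=4):
        self.hist_bits = hist_bits
        self.history = {}                      # per-branch local history
        self.pht = [2] * (1 << hist_bits)      # 2-bit counters, weakly taken

    def predict(self, pc):
        return self.pht[self.history.get(pc, 0)] >= 2

    def update(self, pc, taken):
        h = self.history.get(pc, 0)
        c = self.pht[h]
        self.pht[h] = min(3, c + 1) if taken else max(0, c - 1)
        mask = (1 << self.hist_bits) - 1
        self.history[pc] = ((h << 1) | int(taken)) & mask

bp = TwoLevel()
for _ in range(8):                  # branch alternates T, N, T, N ...
    bp.update(0x40, True); bp.update(0x40, False)
# after warm-up the alternating pattern is learned from the local history
```

A pure 2-bit counter would mispredict an alternating branch half the time; the history level is what lets it learn the pattern.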


Merced/Itanium implementation (?)

• There are “instruction queues” between the fetch unit and the execution units. Therefore branch bubbles can often be absorbed because of long latencies (and stalls) in the execute stages

IA-64 for High Performance

• The number of branches in large server apps overwhelms traditional processors
  – IA-64 predication removes branches, avoids mispredicts
• Environments with a large number of users require high performance
  – IA-64 uses speculation to reduce the impact of memory latency
  – 64-bit addressing enables systems with very large virtual and physical memory

CSE 586 Spring 00 35 CSE 586 Spring 00 36 Middle Tier Application Needs IA-64’s Large Register File Floating-Point Branch Predicate • Mid-tier applications (ERP, etc.) have diverse code requirements Integer Registers Registers Registers Registers – Integer code with many small loops 63 0 81 0 63 0 bit 0 0 0.0 BR0 – Significant call / return requirements (C++, Java) GR0 GR0 PR0 1 GR1 GR1 • IA-64’s unique register model supports these various BR7 PR1 requirements GR31 GR31 PR15 – Large register file provides significant resources for optimized GR32 GR32 performance PR16 – Rotating registers enables efficient loop execution PR63 GR127 GR127 – Register stack to handle call-intensive code NaT 32 Static 32 Static 16 Static

96 Stacked, Rotating 96 Rotating 48 Rotating

IA-64 resources enable optimization for a Large number of registers enables variety of application requirements flexibility and performance CSE 586 Spring 00 37 CSE 586 Spring 00 38

Software Pipelining via Rotating Registers

• Software pipelining improves performance by overlapping the execution of different software loops – execute more loops in the same amount of time
[Figure: sequential loop execution (loops A–D one after another) vs. software-pipelined loop execution (loops A–D overlapped in time)]
• Traditional architectures need complex software loop unrolling for pipelining
  – Results in code expansion --> increases cache misses --> reduces performance
• IA-64 utilizes rotating registers to achieve software pipelining
  – Avoids code expansion --> reduces cache misses --> higher performance

Traditional Register Models

[Figure: procedure A calls procedure B; the procedures share the register file, spilling to memory]
• Procedure A calls procedure B
• Procedures must share space in registers
• Performance penalty due to register save / restore
• Traditional register stacks eliminate the need for save / restore by reserving fixed blocks in registers
• However, fixed blocks waste resources
• I think that the “traditional register stack” model they refer to is the “register windows” model; IA-64 significantly improves upon this
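The rotating-register idea reduces to modular renaming. The sketch below uses IA-64-like numbers (rotating region starting at r32, 96 rotating registers) but simplifies the direction of rotation and ignores the register stack.

```python
# Minimal sketch of rotating-register renaming: a logical rotating
# register maps to physical (base + rrb) mod size, and the rotation base
# moves each iteration, so iteration i's "r32" is a different physical
# register from iteration i+1's -- no unrolling needed.

def phys_reg(logical, rrb, first=32, size=96):
    """Map a logical register in the rotating region [first, first+size)."""
    return first + (logical - first + rrb) % size

writes = []
rrb = 0
for iteration in range(3):
    writes.append(phys_reg(32, rrb))  # each iteration writes logical "r32"
    rrb = (rrb + 1) % 96              # rotate at the loop-closing branch
```

Because each iteration's writes land in fresh physical registers, values produced by overlapped iterations do not clobber each other, which is exactly what unrolling buys on a conventional register file, but without the code expansion.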


IA-64 Register Stack

[Figure: traditional register stacks reserve fixed-size blocks per procedure; the IA-64 register stack reserves variable-size blocks]
• Traditional stacks eliminate the need for save / restore by reserving fixed blocks in registers, but fixed blocks waste resources
• IA-64 is able to reserve variable block sizes
  – No wasted resources
• IA-64 combines high performance and high efficiency

IA-64 Floating-Point Architecture (82-bit floating-point numbers)

[Figure: 128-entry FP register file with multiple read ports and multiple write ports feeding several FMAC units (A * X + B)]
• 128 registers
  – Allows parallel execution of multiple floating-point operations
• Simultaneous Multiply-Accumulate (FMAC)
  – 3-input, 1-output operation: a * b + c = d
  – Shorter latency than independent multiply and add
  – Greater internal precision and single rounding error
• Resourced for scientific analysis and 3D graphics

IA-64: Next Generation Architecture

| Feature | Function | Benefit |
| Explicit parallelism: compiler / hardware synergy | Executes more instructions in the same amount of time | Maximizes headroom for the future |
| Register model: large register file, rotating registers, register stack engine | Able to optimize for scalar and object-oriented applications | World-class performance for complex applications |
| Floating-point architecture: extended precision, 128 registers, FMAC, SIMD | High performance 3D graphics and scientific analysis | Enables more complex scientific analysis calculations; faster digital content creation and rendering |
| Multimedia architecture: parallel arithmetic, parallel shift, data arrangement instructions | Improves calculation throughput for multimedia data | Efficient delivery of rich Web content |
| Memory management: 64-bit addressing, speculation, memory hierarchy control | Manages large amounts of memory, efficiently organizes data from / to memory | Increased architecture & system scalability |
| Compatibility: full binary compatibility with existing IA-32 instructions in hardware, PA-RISC through software translation | Existing software runs seamlessly | Preserves investment in existing software |

Predication Basic Idea

• Associate a Boolean condition (predicate) with the issue, execution, or commit of an instruction
  – The stage in which to test the predicate is an implementation choice
• If the predicate is true, the result of the instruction is kept
• If the predicate is false, the instruction is nullified
• Distinction between
  – Partial predication: only a few opcodes can be predicated
  – Full predication: every instruction is predicated

Predication Benefits

• Allows compiler to overlap the execution of independent control constructs w/o code explosion
• Allows compiler to reduce the frequency of branch instructions and, consequently, of branch mispredictions
• Reduces the number of branches to be tested in a given cycle
• Reduces the number of multiple execution paths and associated hardware costs (copies of register maps etc.)
• Allows code movement in superblocks

Predication Costs

• Increased fetch utilization
• Increased register consumption
• If predication is tested at commit time, increased functional-unit utilization
• With code movement, increased complexity of exception handling
  – For example, insert extra instructions for exception checking


Flavors of Predication Implementation

• Has its roots in vector machines like the CRAY-1
  – Creation of vector masks to control vector operations on an element-per-element basis
• Often (partial) predication is limited to conditional moves as, e.g., in the Alpha, MIPS 10000, Power PC, SPARC and the Pentium Pro
• Predication to nullify the next instruction as in HP PA-RISC
• Full predication: every instruction predicated as in IA-64
  – The guarded execution model, where a special instruction controls the conditional execution of several of the subsequent instructions, is similar

Partial Predication: Conditional Moves

• CMOV R1, R2, R3
  – Move R2 to R1 if R3 = 0
• Main compiler use: If (cond) S1 (with result in Rres)
  – (1) Compute result of S1 in Rs1
  – (2) Compute condition in Rcond
  – (3) CMOV Rres, Rs1, Rcond
• Increases register pressure (Rcond is a general register)
• No need (in this example) for branch prediction
• Very useful if the condition can be computed ahead of or, e.g., in parallel with, the result.
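The three-step recipe above, written out. `cmov` here follows the slide's semantics (move when the condition register is zero); materializing the condition as a subtraction is an illustrative choice that lets it be computed in parallel with the result.

```python
# If-conversion of "if (a == b) res = a + 1;" via a conditional move:
# no branch, hence nothing to predict.

def cmov(dest, src, cond):
    """CMOV dest, src, cond: keep src when cond == 0, else keep dest."""
    return src if cond == 0 else dest

def if_convert(a, b, res):
    rs1 = a + 1                    # (1) compute the result of S1 anyway
    rcond = a - b                  # (2) compute the condition (0 iff a == b)
    return cmov(res, rs1, rcond)   # (3) commit it only if the condition holds
```

The branch-free version always executes three instructions, but each is unconditional, so the schedule is fixed regardless of the data.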


• Select dest, src1, src2,cond • Define predicates with instructions of the form:

– Corresponds to C-like --- dest = ( (cond) ? src1 : src2) Pred_ Pout1 , Pout2,, src1, src2 (Pin) where – Note the destination register is always assigned a value – Pout1 and Pout2 are assigned values according to the comparison – Use in the Multiflow (first commercial VLIW machine) between src1 and src2 and the cmp “opcode” • Nullify – The predicate types are most often U (unconditional) and U its complement, and OR and OR – Any register-register instruction can nullify the next instruction, – The predicate define instruction can itself be predicated with the thus making it conditional value of Pin • There are definite rules for that, e.g., if Pin = 0, U and U are set to 0 independently of the result of the comparison and the OR predicates are not modified.
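A toy executable reading of those rules, covering the U pair and one OR-type predicate. The cmp "opcode" is fixed to equality and the two-target form is simplified to returned values; this is a sketch of the stated rules, not the full predicate-type taxonomy.

```python
# Toy semantics for the predicate-define instruction described above:
# with Pin = 0 the U pair is cleared and the OR predicate is untouched.

def pred_define(src1, src2, pin, p_or):
    """Returns (p_u, p_u_bar, p_or) for an equality comparison."""
    cmp_true = (src1 == src2)
    if pin == 0:
        return 0, 0, p_or          # U and U-bar forced to 0; OR unchanged
    p_u, p_u_bar = int(cmp_true), int(not cmp_true)
    if cmp_true:
        p_or = 1                   # OR type: only ever set, never cleared
    return p_u, p_u_bar, p_or
```

The OR type is what makes "join" points cheap: several paths can each try to set the same predicate without clobbering one another.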


Levels of Parallelism within a Single …

• ILP: smallest grain of parallelism
  – Resources are not that well utilized (far from ideal CPI)
  – Stalls on operations with long latencies (division, cache miss)
• Multiprogramming: several applications (or large sections of applications) running concurrently
  – O.S. directed activity
  – Change of application requires a context-switch (e.g., on a page fault)
• Multithreading
  – Main goal: tolerate latency of long operations without paying the price of a full context-switch

If-conversion

[Figure: if → then / else → join control-flow graph]
• The if condition will set p1 to U
• The then will be executed predicated on p1 (U)
• The else will be executed predicated on p1-bar (U-bar)
• The “join” will in general be predicated on some form of OR predicate


Multithreading

• The processor supports several instruction streams running “concurrently”
• Each instruction stream has its own context (process state)
  – Registers
  – PC, status register, special control registers etc.
• The multiple streams are multiplexed by the hardware on a set of common functional units

Fine Grain Multithreading

• Conceptually, at every cycle a new instruction stream dispatches an instruction to the functional units
• If enough instruction streams are present, long latencies can be hidden
  – For example, if 32 streams can dispatch an instruction, latencies of 32 cycles could be tolerated
• For a single application, requires highly sophisticated compiler technology
  – Discover many threads in a single application
• Basic idea behind Tera’s MTA
  – Burton Smith’s third such machine (he started in the late 70’s)
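A back-of-the-envelope model of the 32-streams/32-cycles claim: round-robin issue with a fixed latency per instruction. Everything here (one instruction per cycle, uniform latency, the function name) is a deliberate simplification for illustration.

```python
# With enough streams issuing round-robin, a long-latency operation
# finishes before its stream comes around again, so no cycle is wasted.

def cycles_lost(num_streams, latency, instrs_per_stream):
    """Round-robin issue, one instruction/cycle, fixed latency. Counts
    cycles where the selected stream must stall because its previous
    instruction is still in flight."""
    ready_at = [0] * num_streams          # cycle when each stream may issue
    remaining = [instrs_per_stream] * num_streams
    cycle = stalls = 0
    while any(remaining):
        s = cycle % num_streams           # round-robin selection
        if remaining[s]:
            if ready_at[s] > cycle:
                stalls += 1               # stream not ready: bubble
            else:
                remaining[s] -= 1
                ready_at[s] = cycle + latency
        cycle += 1
    return stalls
```

With 32 streams and a 32-cycle latency every stream is ready again exactly when its turn comes back; with only 4 streams the same latency produces bubbles.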

CSE 586 Spring 00 53 CSE 586 Spring 00 54 Tera’s MTA Tera’s MTA (c’ed)

• Each processor can execute • Since several streams belong to the same application, – 16 applications in parallel (multiprogramming) synchronization is very important (will be discussed alter – 128 streams in the quarter) • At every clock cycle, processor selects a ready stream and • Needs instructions – and compiler support – to allocate, issues an instruction from that stream activate, and deallocate streams • An instruction is a “LIW”: Memory, arithmetic, control • Compiler support: loop level parallelism and software • Several instructions of the same stream can be in flight pipelining simultaneously (ILP) • Hardware support: dynamic allocation of streams • Instructions of different streams can be in flight (depending on mix of applications etc.) simultaneously (multithreading)

CSE 586 Spring 00 55 CSE 586 Spring 00 56

Coarse Grain Multithreading

• Switch threads (contexts) only at certain events
  – Changing thread context takes a few (10–20?) cycles
  – Used for long memory latency operations, e.g., access to a remote memory in a shared-memory multiprocessor (100’s of cycles)
• Of course, context-switches occur when there are exceptions such as page faults
• Many fewer contexts needed than in fine-grain multithreading

Simultaneous Multithreading (SMT)

• Combines the advantages of ILP and fine grain multithreading
[Figure: issue-slot diagrams for ILP vs. SMT. With ILP alone, vertical waste (entirely empty issue cycles) is of the order of 60% of overall waste. With SMT, horizontal waste is still present but not as much, and vertical waste does not necessarily all disappear as this figure implies.]


SMT (a UW invention)

• Needs one context per thread
  – But fewer threads needed than in fine grain multithreading
• Can issue simultaneously from distinct threads in the same cycle
• Can share resources
  – For example: physical registers for renaming, caches, BPT etc.
• Future generation Alpha based on SMT
  – Dynamic scheduling

SMT (c’ed)

• Compared with an ILP superscalar of same issue width
  – Requires 5% more real estate
  – Slightly more complex to design (thread scheduling, identifying threads that raise exceptions etc.)
• Drawback (common to all wide-issue processors): centralized design
• Benefits
  – Increases throughput of applications running concurrently
  – No partitioning of many resources (in contrast with chip multiprocessors)

Trace Caches

• Filling up the instruction buffer of wide issue processors is a challenge (even more so in SMT)
• Instead of fetching from the I-cache, fetch from a trace cache
• The trace cache is a complementary instruction cache that stores sequences of instructions organized in dynamic program execution order
• Implemented in the forthcoming Intel Willamette (thanks Luni for the pointer) and some Sun Sparc architectures.
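The organization can be sketched as a lookup keyed by starting PC plus predicted branch outcomes; real trace caches bound trace length and handle partial matches, which this toy version omits.

```python
# Sketch of the trace-cache idea: index by (start PC, branch outcomes) and
# get back a block of instructions already in dynamic -- not static --
# program order, spanning taken branches in one access.

class TraceCache:
    def __init__(self):
        self.traces = {}

    def fill(self, start_pc, outcomes, instrs):
        # outcomes: tuple of taken/not-taken bits for branches in the trace
        self.traces[(start_pc, outcomes)] = instrs

    def fetch(self, start_pc, outcomes):
        # hit: one access delivers what a conventional I-cache would need
        # several non-contiguous line fetches to supply
        return self.traces.get((start_pc, outcomes))

tc = TraceCache()
tc.fill(0x100, (True,), ["ld", "add", "beq(T)", "mul"])
```

A miss (or a different predicted outcome) falls back to the ordinary I-cache while a new trace is built.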
