Pipelining I

University of Central Florida
EEL 4768: Computer Architecture
Predication
Prof. Mark Heinrich
School of Electrical Engineering and Computer Science
[email protected]

Announcements
• Lab 4 rubric in place and being actively graded
• Lab 5 due today
  • Lab sections as normally scheduled this week
• HW #4 is posted, due Tuesday, December 1st
• Today
  • Predication
  • Intel IA-64 Architecture

Increasing ILP
• We’ve seen several compiler techniques to increase ILP
  • Loop unrolling
  • Software pipelining
  • Trace scheduling
• But gains are limited if branch behavior is unknown
• Can we do something besides hardware speculation?
• Want a set of techniques that always increase ILP
  • Conditional moves
  • Predication

Conditional Instructions
• Have an instruction refer to a condition
  • If condition is TRUE, execute instruction
  • If condition is FALSE, instruction is a NOP
• Many architectures include conditional move instructions
  • Alpha, MIPS, PowerPC, SPARC
  • Pentium Pro (the only user instructions added between P5 and P6)
• Conditional move moves a value from one register to another
  • Only if the condition is TRUE
• Can use conditional moves to eliminate some branches completely

Conditional Move Example
• if (A==0) S=T;
• Assume r1, r2, r3 hold the latest values of A, S, T
• Code without conditional move:
      bnez  r1, L
      mov   r2, r3
   L:
• This entire fragment can be replaced with 1 instruction
  • cmovz r2, r3, r1 ; r2 = r3 only if r1==0
• Control dependence is converted into a data dependence
  • Dependence can be resolved later in the pipeline (WB)
  • Have to maintain the conditions!
  • Another register file port?

Using Conditional Instructions
• Can use conditional instructions to improve schedule
  • Schedule time-critical instructions into idle slots
      lw   r1,40(r2)     add r3,r4,r5
      nop                add r6,r3,r7
      beqz r10,L         nop
      lw   r8,20(r10)    nop
      lw   r9,0(r8)      nop
• 2nd load after the branch depends on the 1st; the branch protects the following load (don’t want to load if r10 is 0)
• Can imagine a conditional load instruction that loads unless a third operand is 0…

Using Conditional Instructions II
• Schedule with the conditional load:
      lw   r1,40(r2)         add r3,r4,r5
      lwc  r8,20(r10),r10    add r6,r3,r7
      beqz r10,L             nop
      lw   r9,0(r8)          nop
• Improves performance by eliminating the stall: the first load of the dependent pair moves above the branch into the idle slot
• If the compiler mispredicts the branch, the instruction will have no effect
  • Conditional load is speculative
  • Have to worry about exceptions
    – Above, incorrect execution may cause a fatal exception
  • In general, conditional instructions have restrictions

Predication
• Intel IA-64 architecture uses full predication
  • Every register-writing instruction can be conditional
  • Has 64 1-bit predicate registers
  • Estimates put the performance lost to branch mispredictions at 20-30%
  • Mahlke et al. claim predication can remove 50% of branches
    – And 40% of the mispredictions!
• Idea: since branches chop up ILP opportunity
  • Execute both sides of the branch!
  • Fill available slots with useful work
  • Tag each instruction with a predicate condition
  • Instruction commits only if predicate is true
  • Can combine predicates (make compound predicates)

Predication Example
[Figure: an if-then-else region. A cmp sets predicates P1 and P2; the then instructions are guarded by P1 and the else instructions by P2.]
• Removes branches
  • Executes multiple paths simultaneously
• Increases performance by exposing parallelism
  • Reduces critical path
  • Better utilization of wide machines

Combining Predicates
[Figure: nested compares. The first sets p1, p2; a second compare guarded by (p2) sets p3, p4; the nested paths (p3)… and (p4)… then correspond to p2&p3 and p2&p4.]
• Predicates can be combined to predicate many if-then-else levels
  • This is difficult to do with simple conditional instructions
  • Intel cites study
    – cmove: 39% more instructions, 23% worse performance

Limitations of Conditional Instructions
• Is predication really a good idea?
  • Open question (Intel hoped so)
• Potential disadvantages and limitations
  • Conditional instructions that are annulled (predicate is false) still take execution time and occupy a potential issue slot.
    There is “zero cost” only if the conditional instruction fills an otherwise idle slot
  • Need swanky compiler technology
  • Compound predicates must be computed (more slots)
  • Full benefits of predication still require speculation
  • Conditional instructions may have a speed penalty compared with unconditional instructions (lower clock rate?)
  • If branch prediction accuracy is high, can be better to be greedy and not execute both sides of every branch
  • Room for multithreading?

Intel/HP EPIC Architecture
• EPIC (Explicitly Parallel Instruction Computing)
  • Similar philosophy to VLIW
  • Tries to overcome VLIW limitations
• IA-64 is the ISA; Intel now calls it the Intel Itanium Architecture
  • As opposed to IA-32 (x86) in the Pentium line
  • HP, Compaq (Alpha), and SGI (MIPS) all signed on
• Itanium (Merced), and Itanium II (McKinley)
  • Itanium III (Madison), 2003: 410 million transistors!
  • Itanium IV (Montecito), 2006: dual-core CMP
  • Latest version from Intel: Montvale, 2007 (another in 2010: Tukwila)
• Intel produced x86-64 Xeon in 2004…

Itanium Characteristics
• Explicitly parallel
  • ILP encoded in machine binary
  • Compiler looks beyond basic block scheduling
• Advanced ILP techniques
  • Full predication
  • ISA support for control and data speculation
  • HW support for SW pipelining (modulo scheduling)
• IA-32 hardware support
• Massive hardware resources
  • Lots of FUs, tons of registers

Instruction Format
• 3 instructions packaged in 128-bit bundles
  • Instructions are 41 bits each
  • 5-bit template at the end of each bundle (3 × 41 + 5 = 128 bits)
• Template bits encode dependences
  • Both within (using stops) and between bundles
  • Also identify type of packet to help decode
  • More efficient and flexible than fixed VLIW format
• Each instruction
  • Has a 6-bit field specifying 1 of 64 predicate registers
  • Has a max of 3 7-bit register specifiers

Itanium Architecture Features
• 128 integer registers, 128 FP registers
  • 32 fixed, 96 “rotating”
  • Int registers are 65 bits! (NaT bit, later)
• 64 predicate registers
  • 16 fixed, 48 “rotating”
• Register stack on function calls
  • Similar to SPARC’s register windows
• Multi-way branches (more than 1 branch unit)
• Parallel compares (AND, OR, ANDOR)

Control Speculation
      Traditional Architectures      IA-64
                                     ld.s
      instr 1                        instr 1
      instr 2                        instr 2
      br      (Barrier)              br
      Load                           chk.s
      use                            use
• Allows elevation of load, even above a branch
• Improve memory latency by speculation at compile time
• Defer exceptions by setting NaT (bit 65)
  • ld.s sets NaT, chk.s checks it, branches to recovery code

Hoisting Uses as Well
      IA-64
      ld.s          ld.s
      instr 1       instr 1
      instr 2       instr 2
                    uses
      br            br                Recovery code
      chk.s         chk.s             ld
      use           (Home Block)      uses
                                      br home
• Uses of speculative loads can execute speculatively
  • Distinguishes speculation from simple prefetch
• NaT propagates down the dependent inst chain
  • Single chk.s instruction handles recovery

Data Speculation
      Traditional Architectures      IA-64
                                     ld8.a
      instr 1                        instr 1
      instr 2                        instr 2
      Store(*)  (Barrier)            st8
      Load(*)                        ld.c
      use                            use
• Compiler can hoist load prior to preceding, possibly conflicting store
• ALAT (Advanced Load Address Table) is used to check address of every store in-between

Hoisting Uses (again)
      ld8.a         ld8.a
      instr 1       instr 1
                    use
      instr 2       instr 2
      st8           st8               Recovery code
      ld.c          chk.a             ld8
      use                             uses
                                      br home
• Can also hoist uses past dependent store
• chk.a jumps to recovery code if mis-speculation

Rotating Registers
• Instructions contain “virtual” register number
  • Rotating Register Base (RRB) + VRN = PRN
  • Int and FP regs 32-127 can rotate (each has RRB)
• Predicate registers can also rotate
• Avoids code expansion from unrolling loops and code explosion from prologue/epilogue
  • Fewer cache misses
  • Can now SW pipeline small loops with unknown number of iterations (typical in integer code)

Itanium II Implementation (McKinley)
• 0.18 µm process, 220 million transistors
  • Intel announced 65nm process
• 10-stage in-order pipe, 1 GHz
  • Executes 2 bundles/CLK
  • 11 issue ports, 6 integer ALUs, 2 LD & 2 ST per cycle
• Caches
  • L1: 2 × 16KB, 1 cycle
  • L2: 256KB, 5 cycles
  • L3: 3MB, 12 cycles, on chip!
• Addressing
  • 50-bit PA, 64-bit VA

Itanium “Tukwila” (2008, 2009, 2010)
• Intel data from 4/2008:
• 4 cores
• 6MB $/core, 24MB $ total
• ~2.0 GHz
• 698 mm² in 65nm CMOS!!!!!
• 170W
• Over 2 billion transistors

Itanium Sales
[Figure: Itanium sales chart]

Itanium Reception
• EPIC Fail in the marketplace, initial delays and underperformance
• “Unobtanium”
• “The Itanium approach … was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write.” (Donald Knuth)
• The “Itanic”. Huge monetary investment, quick demise
• “…one of the great fiascos of the last 50 years” (John C. Dvorak)
  • “The Macintosh uses an experimental pointing device called a ‘mouse’. There is no evidence that people want to use these things. I don’t want one of these new fangled devices.”
• In 1987 IBM took its ball and went home, and developed the PS/2
  • The world did not follow and stuck with cheap PC clones
  • IBM never recovered its position in the PC industry…

AMD Hammer Family
• “The other white meat”
  • AMD calls architecture x86-64
  • Released in 2003
  • Why bother with all this EPIC stuff?
  • “Biggie” size it!
• Extend x86 registers to 64-bits
  • Automatic backward compatibility to IA-32
• Two instances
  • Clawhammer (256KB L2) and Sledgehammer (1MB)
  • Much like P4 (w/o HT) but with integrated MC and better MP support
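
Worked sketch for the Conditional Move Example slide. The slide replaces the bnez/mov fragment with a single cmovz. The Python model below is only an illustration of that equivalence: the function names cmovz, with_branch, and with_cmov are hypothetical, and Python's conditional expression stands in for the hardware select.

```python
def cmovz(rd, rs, rc):
    """Model of `cmovz rd, rs, rc`: new rd is rs when rc == 0, else rd is unchanged."""
    return rs if rc == 0 else rd


def with_branch(a, s, t):
    """if (A == 0) S = T;  written as bnez r1, L / mov r2, r3 / L:"""
    if a != 0:          # bnez r1, L : skip the move (control dependence)
        return s
    s = t               # mov r2, r3
    return s            # L:


def with_cmov(a, s, t):
    """Same statement as one conditional move (data dependence on a)."""
    return cmovz(s, t, a)
```

Both versions return the same S for every A; the second has no branch for the hardware to predict, only an extra source operand (the condition) to read.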
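
Worked sketch for the Predication slides. The idea of tagging each instruction with a predicate and committing only when it is true can be modeled in a few lines. This is a toy interpreter under stated assumptions, not IA-64 semantics: run_predicated and if_converted are hypothetical names, registers and predicates are plain dictionaries, and every instruction is counted as one issue slot whether or not it commits.

```python
def run_predicated(program, regs, preds):
    """Execute every instruction in order; commit a result only if its predicate is true.

    program is a list of (predicate_name, dest_register, operation) tuples.
    Returns the number of issue slots used, including annulled instructions.
    """
    issued = 0
    for pred, dest, op in program:
        issued += 1               # an annulled instruction still occupies a slot
        value = op(regs)          # both sides of the branch are evaluated
        if preds[pred]:
            regs[dest] = value    # commit only when the guarding predicate is true
    return issued


def if_converted(a, b, c):
    """if (a == 0) x = b + 1; else x = c - 1;  with the branch removed."""
    regs = {"a": a, "b": b, "c": c, "x": 0}
    preds = {}
    preds["p1"] = regs["a"] == 0            # cmp sets p1 ...
    preds["p2"] = not preds["p1"]           # ... and its complement p2
    program = [
        ("p1", "x", lambda r: r["b"] + 1),  # (p1) then-side
        ("p2", "x", lambda r: r["c"] - 1),  # (p2) else-side
    ]
    issued = run_predicated(program, regs, preds)
    return regs["x"], issued
```

For example, if_converted(0, 10, 20) commits only the then-side, yet two slots are issued. That is the cost named on the Limitations slide: an annulled instruction is free only when it fills a slot that would otherwise be idle.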
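
Worked sketch for the Instruction Format slide. The bundle arithmetic (3 × 41 + 5 = 128) can be checked by packing and unpacking a bundle as a Python integer. The layout is an assumption made for this sketch: the template is placed in the low-order 5 bits with the three slots above it, and the 6-bit predicate field is taken to be the low 6 bits of each slot. The helper names pack_bundle, unpack_bundle, and qualifying_predicate are hypothetical.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly 3 instruction slots")
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    bundle = template                       # assumed position: bits 0..4
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Inverse of pack_bundle: return (template, [slot0, slot1, slot2])."""
    template = bundle & TEMPLATE_MASK
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots


def qualifying_predicate(slot):
    """Predicate register number (0..63) guarding a slot; low 6 bits assumed."""
    return slot & 0x3F
```

The same arithmetic shows how tight a slot is: a 6-bit predicate field plus three 7-bit register specifiers use 27 of the 41 bits, leaving 14 for the opcode and everything else.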
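
Worked sketch for the Control Speculation and Hoisting Uses slides. The ld.s / chk.s pattern defers an exception by marking the result NaT, lets the NaT propagate through dependent instructions, and checks once in the home block. The model below is a simplification under stated assumptions: memory is a dictionary, a missing address stands in for a faulting load, and Reg, ld_s, add, chk_s, and lookup_plus_one are hypothetical names.

```python
class Reg:
    """A register value plus a NaT ("not a thing") bit."""

    def __init__(self, value=0, nat=False):
        self.value = value
        self.nat = nat


def ld_s(memory, addr):
    """Speculative load: a faulting address sets NaT instead of raising."""
    if addr in memory:
        return Reg(memory[addr])
    return Reg(0, nat=True)


def add(a, b):
    """Dependent instruction: NaT on either input propagates to the result."""
    return Reg(a.value + b.value, a.nat or b.nat)


def chk_s(reg, recovery):
    """Run the recovery code if the speculative chain faulted, else keep the result."""
    return recovery() if reg.nat else reg


def lookup_plus_one(memory, ptr):
    """if (ptr != 0) return memory[ptr] + 1;  with the load and its use hoisted."""
    loaded = ld_s(memory, ptr)        # hoisted above the branch
    result = add(loaded, Reg(1))      # the use is hoisted as well
    if ptr == 0:                      # the original branch
        return None                   # speculation was wasted, but harmless

    def recovery():
        # Non-speculative redo in the recovery block; a real fault surfaces here.
        return add(Reg(memory[ptr]), Reg(1))

    return chk_s(result, recovery).value
```

With memory = {8: 41}, a call with ptr 8 returns 42, and a call with ptr 0 returns None without ever raising, even though the hoisted load touched a bad address. A bad nonzero pointer raises only inside the recovery code, which is where the original program would have faulted.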
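
Worked sketch for the Rotating Registers slide. The mapping RRB + VRN = PRN can be illustrated with a small register file in which registers 32-127 rotate. This is a sketch under assumptions, not the architected behavior: the rotating region is fixed at 96 registers, the base is decremented by one per rotation with wraparound, and the class RotatingFile and function pipelined_plus_one are hypothetical. Plain if tests stand in for the rotating predicates that would guard the pipeline stages.

```python
FIRST_ROTATING = 32
NUM_ROTATING = 96          # registers 32..127


class RotatingFile:
    def __init__(self):
        self.phys = [0] * 128
        self.rrb = 0       # rotating register base

    def physical(self, vrn):
        """Map a virtual register number to a physical one (RRB + VRN, wrapped)."""
        if vrn < FIRST_ROTATING:
            return vrn     # fixed registers are not renamed
        return FIRST_ROTATING + (vrn - FIRST_ROTATING + self.rrb) % NUM_ROTATING

    def read(self, vrn):
        return self.phys[self.physical(vrn)]

    def write(self, vrn, value):
        self.phys[self.physical(vrn)] = value

    def rotate(self):
        """After a rotation, the value written as r32 is visible as r33."""
        self.rrb = (self.rrb - 1) % NUM_ROTATING


def pipelined_plus_one(xs):
    """Two-stage software pipeline: stage 1 loads x[i], stage 2 adds 1 to x[i-1]."""
    rf = RotatingFile()
    out = []
    for i in range(len(xs) + 1):         # one extra trip drains the pipeline
        if i >= 1:
            out.append(rf.read(33) + 1)  # stage 2 uses last iteration's load
        if i < len(xs):
            rf.write(32, xs[i])          # stage 1 always names r32
        rf.rotate()                      # the loop branch rotates the registers
    return out
```

The loop body always names r32 and r33, so the same code serves every iteration without unrolling or copying registers; for instance pipelined_plus_one([3, 5, 8]) yields [4, 6, 9].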
