Pipelining I

University of Central Florida
EEL 4768: Computer Architecture
Predication
Prof. Mark Heinrich
School of Electrical Engineering and Computer Science

Announcements
• Lab 4 rubric in place and being actively graded
• Lab 5 due today
  - Lab sections as normally scheduled this week
• HW #4 is posted, due Tuesday, December 1st
• Today
  - Predication
  - Intel IA-64 Architecture

Increasing ILP
• We've seen several compiler techniques to increase ILP
  - Loop unrolling
  - Software pipelining
  - Trace scheduling
• But gains are limited if branch behavior is unknown
• Can we do something besides hardware speculation?
• Want a set of techniques that always increase ILP
  - Conditional moves
  - Predication

Conditional Instructions
• Have an instruction refer to a condition
  - If the condition is TRUE, execute the instruction
  - If the condition is FALSE, the instruction is a NOP
• Many architectures include conditional move instructions
  - Alpha, MIPS, PowerPC, SPARC
  - Pentium Pro (the only user instructions added between P5 and P6)
• A conditional move copies a value from one register to another
  - Only if the condition is TRUE
• Can use conditional moves to eliminate simple branches completely

Conditional Move Example
• if (A==0) S=T;
• Assume r1, r2, r3 hold the latest values of A, S, T
• Code without conditional move:

        bnez r1, L
        mov  r2, r3
    L:

• This entire fragment can be replaced with 1 instruction
  - cmovz r2, r3, r1 ; move executed only if r1==0
• Control dependence is converted into a data dependence
  - Dependence can be resolved later in the pipeline (WB)
  - Have to maintain the conditions!
  - Another register file port?

Using Conditional Instructions
• Can use conditional instructions to improve the schedule
  - Schedule time-critical instructions into idle slots
• Two-issue schedule (one memory/branch slot, one ALU slot per cycle):

    lw   r1,40(r2)       add r3,r4,r5
    nop                  add r6,r3,r7
    beqz r10,L           nop
    lw   r8,20(r10)      nop
    lw   r9,0(r8)        nop

• The 2nd load after the branch depends on the 1st, and the branch protects the following load (we don't want to load if r10 is 0)
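As a small aside on the conditional move example earlier in these slides (if (A==0) S=T), the sketch below models the branch version and the cmovz version side by side in Python. The function names (branch_version, cmovz, cmov_version) are made up for illustration. The model only captures the architectural semantics, not pipeline behavior: the point to notice is that cmovz takes the old destination value as an input, which is the data dependence that replaces the control dependence.

```python
def branch_version(a, s, t):
    """if (A == 0) S = T, written with a branch."""
    if a == 0:        # bnez r1, L   (skip the move when r1 != 0)
        s = t         # mov  r2, r3
    return s          # L:


def cmovz(dest, src, cond):
    """Toy model of 'cmovz dest, src, cond'.

    The result is src when cond == 0; otherwise the old dest value is kept.
    The old dest is an explicit input, i.e. a data dependence.
    """
    return src if cond == 0 else dest


def cmov_version(a, s, t):
    """Same computation as branch_version, with no branch in the 'ISA' code."""
    return cmovz(s, t, a)   # cmovz r2, r3, r1


# Both versions agree for zero and nonzero conditions.
for a in (0, 1, -5):
    assert branch_version(a, 10, 20) == cmov_version(a, 10, 20)

print(cmov_version(0, 10, 20), cmov_version(7, 10, 20))  # prints: 20 10
```
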
• One can imagine a conditional load instruction that loads unless a third operand is 0

Using Conditional Instructions II
• Schedule with the conditional load:

    lw   r1,40(r2)         add r3,r4,r5
    lwc  r8,20(r10),r10    add r6,r3,r7
    beqz r10,L             nop
    lw   r9,0(r8)          nop

• Improves performance by eliminating the stall and scheduling the data dependence
• If the compiler mispredicts the branch, the instruction will have no effect
  - Conditional load is speculative
  - Have to worry about exceptions
    - Above, incorrect execution may cause a fatal exception
  - In general, conditional instructions have restrictions

Predication
• Intel IA-64 architecture uses full predication
  - Every register-writing instruction can be conditional
  - Has 64 1-bit predicate registers
  - Estimated loss of 20-30% of performance to branch mispredictions
  - Mahlke et al. claim predication can remove 50% of branches
    - And 40% of the mispredictions!
• Idea: since branches chop up ILP opportunity
  - Execute both sides of the branch!
  - Fill available slots with useful work
  - Tag each instruction with a predicate condition
  - Instruction commits only if its predicate is true
  - Can combine predicates (make compound predicates)

Predication Example
[Figure: an if-then-else (cmp, then, else) converted to straight-line code in which cmp sets predicates P1 and P2 and the then/else instructions are guarded by P1 and P2]
• Removes branches
  - Executes multiple paths simultaneously
• Increases performance by exposing parallelism
  - Reduces critical path
  - Better utilization of wide machines

Combining Predicates
[Figure: nested if-then-else; the first compare sets p1, p2; under (p2) a second compare sets p3, p4; the inner blocks are guarded by p2&p3 and p2&p4, written as (p3)… and (p4)…]
• Predicates can be combined to predicate many if-then-else levels
  - This is difficult to do with simple conditional instructions
  - Intel cites a study
    - cmove: 39% more instructions, 23% worse performance

Limitations of Conditional Instructions
• Is predication really a good idea?
  - Open question (Intel hoped so)
• Potential disadvantages and limitations
  - Conditional instructions that are annulled (predicate is false) still take execution time and occupy a potential issue slot.
    There is "zero cost" only if the annulled instruction fills an otherwise idle slot
  - Need swanky compiler technology
  - Compound predicates must be computed (more slots)
  - Full benefits of predication still require speculation
  - Conditional instructions may have a speed penalty compared with unconditional instructions (lower clock rate?)
  - If branch prediction accuracy is high, it can be better to be greedy and not execute both sides of every branch
  - Room for multithreading?

Intel/HP EPIC Architecture
• EPIC (Explicitly Parallel Instruction Computing)
  - Similar philosophy to VLIW
  - Tries to overcome VLIW limitations
• IA-64 is the ISA; Intel now calls it the Intel Itanium Architecture
  - As opposed to IA-32 (x86) in the Pentium line
  - HP, Compaq (Alpha), and SGI (MIPS) all signed on
• Itanium (Merced) and Itanium II (McKinley)
  - Itanium III (Madison), 2003: 410 million transistors!
  - Itanium IV (Montecito), 2006: dual-core CMP
  - Latest version from Intel: Montvale, 2007 (another in 2010: Tukwila)
• Intel produced an x86-64 Xeon in 2004…

Itanium Characteristics
• Explicitly parallel
  - ILP encoded in the machine binary
  - Compiler looks beyond basic block scheduling
• Advanced ILP techniques
  - Full predication
  - ISA support for control and data speculation
  - HW support for SW pipelining (modulo scheduling)
• IA-32 hardware support
• Massive hardware resources
  - Lots of FUs, tons of registers

Instruction Format
• 3 instructions packaged in 128-bit bundles
  - Instructions are 41 bits each
  - 5-bit template at the end of each bundle (3 × 41 + 5 = 128)
• Template bits encode dependences
  - Both within (using stops) and between bundles
  - Also identify the type of packet to help decode
  - More efficient and flexible than a fixed VLIW format
• Each instruction
  - Has a 6-bit field specifying 1 of 64 predicate registers
  - Has a max of 3 7-bit register specifiers

Itanium Architecture Features
• 128 integer registers, 128 FP registers
  - 32 fixed, 96 "rotating"
  - Int registers are 65 bits! (NaT bit, later)
• 64 predicate registers
  - 16 fixed, 48 "rotating"
• Register stack on function calls
  - Similar to SPARC's register windows
• Multi-way branches (more than 1 branch unit)
• Parallel compares (AND, OR, ANDOR)

Control Speculation

    Traditional architectures      IA-64
    instr 1                        ld.s
    instr 2                        instr 1
    br      (barrier)              instr 2
    load                           br
    use                            chk.s
                                   use

• Allows elevation of a load, even above a branch
• Hides memory latency by speculating at compile time
• Defers exceptions by setting NaT (bit 65)
  - ld.s sets NaT, chk.s checks it and branches to recovery code

Hoisting Uses as Well

    IA-64 (load hoisted)     IA-64 (uses hoisted too)     Recovery code
    ld.s                     ld.s                         ld
    instr 1                  instr 1                      uses
    instr 2                  instr 2                      br home
    br                       uses
    chk.s                    br
    use                      chk.s   (home block)

• Uses of speculative loads can execute speculatively
  - Distinguishes speculation from simple prefetch
• NaT propagates down the dependent instruction chain
  - A single chk.s instruction handles recovery

Data Speculation

    Traditional architectures      IA-64
    instr 1                        ld8.a
    instr 2                        instr 1
    store (*)   (barrier)          instr 2
    load (*)                       st8
    use                            ld.c
                                   use

• Compiler can hoist a load above a preceding, possibly conflicting store
• ALAT (Advanced Load Address Table) is used to check the address of every store in between

Hoisting Uses (again)

    IA-64 (load hoisted)     IA-64 (uses hoisted too)     Recovery code
    ld8.a                    ld8.a                        ld8
    instr 1                  instr 1                      uses
    instr 2                  use                          br home
    st8                      instr 2
    ld.c                     st8
    use                      chk.a

• Can also hoist uses past a dependent store
• chk.a jumps to recovery code on mis-speculation

Rotating Registers
• Instructions contain a "virtual" register number (VRN)
  - Rotating Register Base (RRB) + VRN = physical register number (PRN)
  - Int and FP regs 32-127 can rotate (each has an RRB)
• Predicate registers can also rotate
• Avoids code expansion from unrolling loops and code explosion from prologue/epilogue
  - Fewer cache misses
  - Can now SW pipeline small loops with an unknown number of iterations (typical in integer code)

Itanium II Implementation (McKinley)
• 0.18 µm process, 220 million transistors
  - Intel announced 65nm process
• 10-stage in-order pipe, 1 GHz
  - Executes 2 bundles/CLK
  - 11 issue ports, 6 integer ALUs, 2 LD & 2 ST per cycle
• Caches
  - L1: 2 × 16KB, 1 cycle
  - L2: 256KB, 5 cycles
  - L3: 3MB, 12 cycles, on chip!
• Addressing
  - 50-bit PA, 64-bit VA

Itanium "Tukwila" (2008, 2009, 2010)
• Intel data from 4/2008:
  - 4 cores
  - 6MB cache per core, 24MB cache total
  - ~2.0 GHz
  - 698 mm² in 65nm CMOS!
  - 170W
  - Over 2 billion transistors

Itanium Sales
[Figure: Itanium sales chart]

Itanium Reception
• "EPIC Fail" in the marketplace: initial delays and underperformance
• "Unobtanium"
• "The Itanium approach … was supposed to be so terrific – until it turned out that the wished-for compilers were basically impossible to write." – Donald Knuth
• The "Itanic": huge monetary investment, quick demise
• "…one of the great fiascos of the last 50 years" – John C. Dvorak
  - "The Macintosh uses an experimental pointing device called a 'mouse'. There is no evidence that people want to use these things. I don't want one of these new fangled devices."
• In 1987 IBM took its ball and went home, and developed the PS/2
  - The world did not follow and stuck with cheap PC clones
  - IBM never recovered its position in the PC industry…

AMD Hammer Family
• "The other white meat"
  - AMD calls the architecture x86-64
  - Released in 2003
  - Why bother with all this EPIC stuff?
  - "Biggie" size it!
• Extend x86 registers to 64 bits
  - Automatic backward compatibility with IA-32
• Two instances
  - Clawhammer (256KB L2) and Sledgehammer (1MB)
  - Much like P4 (w/o HT) but with integrated MC and better MP support
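Returning to the Control Speculation and Hoisting Uses slides, the following is a minimal Python sketch of how a deferred exception might flow through ld.s, a dependent use, and chk.s. It is a toy model under stated assumptions, not IA-64 semantics in full: memory is a dict, a "faulting" address is simply a missing key, and the names NAT, ld_s, add, chk_s, and run are invented for illustration.

```python
NAT = object()  # hypothetical stand-in for a register whose NaT bit is set


def ld_s(memory, addr):
    """Speculative load: a faulting address yields NaT instead of raising."""
    return memory.get(addr, NAT)


def add(x, y):
    """An instruction that consumes a NaT operand produces NaT (propagation)."""
    return NAT if x is NAT or y is NAT else x + y


def chk_s(value, recovery):
    """chk.s: if NaT is set, branch to recovery code; otherwise keep the value."""
    return recovery() if value is NAT else value


def run(memory, addr, take_path):
    r8 = ld_s(memory, addr)   # ld.s hoisted above the branch
    r9 = add(r8, 1)           # the use is hoisted as well
    if not take_path:         # br: the load's home block is never reached
        return None           # a pending NaT is simply never checked

    def recovery():
        # Recovery code: redo the load non-speculatively (a real fault
        # surfaces here as KeyError), then redo the dependent use.
        return memory[addr] + 1

    return chk_s(r9, recovery)  # one chk.s covers the whole dependent chain


mem = {100: 41}
print(run(mem, 100, True))    # 42: speculation succeeded
print(run(mem, 200, False))   # None: bad address, but no exception was raised
```

Only the third case, run(mem, 200, True), raises: the fault is reported when the home block is actually reached and the recovery code repeats the load, which is the deferral the NaT bit is there to provide.
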
