University of Central Florida

EEL 4768 Predication

Prof. Mark Heinrich
School of Electrical Engineering and Computer Science

Announcements

. Lab 4 rubric in place and being actively graded
. Lab 5 due today
  • Lab sections as normally scheduled this week
. HW #4 is posted, due Tuesday, December 1st

. Today
  • Predication
  • IA-64 Architecture

UCF EEL 4768 – Computer Architecture

Increasing ILP

. We’ve seen several compiler techniques to increase ILP
  • Loop unrolling
  • Software pipelining
  • Trace scheduling
. But gains are limited if branch behavior is unknown
. Can we do something besides hardware speculation?
. Want a set of techniques that always increase ILP
  • Conditional moves
  • Predication

Conditional Instructions

. Have an instruction refer to a condition
  • If condition is TRUE, execute instruction
  • If condition is FALSE, instruction is a NOP
. Many architectures include conditional move instructions
  • Alpha, MIPS, PowerPC, SPARC
  • Pentium Pro (only user insts added between ... and ...)
. Conditional move moves a value from one register to another
  • Only if the condition is TRUE
. Can use conditional moves to completely eliminate branches

Conditional Move Example

. if (A==0) S=T;
. Assume r1, r2, r3 hold the latest values of A, S, T…
. Code without conditional move:
        bnez r1, L
        mov  r2, r3
    L:
. This entire fragment can be replaced with 1 instruction
  • cmovz r2, r3, r1 ; move takes effect only if r1==0
. Control dependence is converted into a data dependence
  • Dependence can be resolved later in the pipeline (WB)
  • Have to maintain the conditions!
  • Another register file port?
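The conversion from a control dependence to a data dependence can be sketched in Python. This is only an illustrative model: the function names `with_branch` and `with_cmovz` and the dictionary used as a register file are assumptions made for the example, not part of any ISA.

```python
def with_branch(regs):
    """Model of: bnez r1, L / mov r2, r3 / L:"""
    regs = dict(regs)
    if regs["r1"] != 0:          # bnez r1, L: skip the move when A != 0
        return regs
    regs["r2"] = regs["r3"]      # mov r2, r3: S = T
    return regs


def with_cmovz(regs):
    """Model of: cmovz r2, r3, r1 (no branch; the condition is a data operand)."""
    regs = dict(regs)
    cond = regs["r1"] == 0
    # Old r2, r3 and r1 are all read; the condition selects the value written back.
    regs["r2"] = regs["r3"] if cond else regs["r2"]
    return regs


# Both versions agree whether or not A == 0.
for a in (0, 7):
    start = {"r1": a, "r2": 10, "r3": 20}
    assert with_branch(start) == with_cmovz(start)
```

In this model the conditional move reads three registers (r1, r3, and the old r2), which is one way to see why an extra register file read port may be needed.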

Using Conditional Instructions

. Can use conditional instructions to improve schedule
  • Schedule time-critical instructions into idle slots

    lw   r1,40(r2)      add r3,r4,r5
    nop                 add r6,r3,r7
    beqz r10,L          nop
    lw   r8,20(r10)     nop
    lw   r9,0(r8)       nop

. 2nd load after branch depends on 1st; branch protects following load (don’t want to load if r10 is 0)
. Can imagine a conditional load instruction that loads unless a third operand is 0…

Using Conditional Instructions II

. Schedule with the conditional load:

    lw   r1,40(r2)          add r3,r4,r5
    lwc  r8,20(r10),r10     add r6,r3,r7
    beqz r10,L              nop
    lw   r9,0(r8)           nop

. Improves performance by eliminating the stall and scheduling the data dependence
. If the compiler mispredicts the branch, the conditional instruction has no effect
  • Conditional load is speculative
  • Have to worry about exceptions
    – Above, incorrect execution may cause a fatal exception
  • In general, conditional instructions have restrictions
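A small Python model can show why the hoisted load has to be conditional. It is a sketch under stated assumptions: memory is a dictionary, `MemoryFault` is a hypothetical stand-in for a fatal exception, and `lwc` follows the slide's description (load unless the third operand is 0).

```python
class MemoryFault(Exception):
    """Hypothetical stand-in for a fatal memory exception."""


def lw(mem, addr):
    """Ordinary load: faults on an unmapped address."""
    if addr not in mem:
        raise MemoryFault(hex(addr))
    return mem[addr]


def lwc(mem, old_value, addr, cond):
    """Conditional load: behaves as a NOP when cond == 0."""
    if cond == 0:
        return old_value         # annulled: no memory access, no exception
    return lw(mem, addr)


def fragment(mem, r10, r8=0, r9=0):
    r8 = lwc(mem, r8, r10 + 20, r10)   # lwc r8,20(r10),r10 hoisted above the branch
    if r10 != 0:                       # beqz r10,L skips the dependent load
        r9 = lw(mem, r8)               # lw r9,0(r8)
    return r8, r9


mem = {0x100 + 20: 0x200, 0x200: 42}
print(fragment(mem, 0x100))   # pointer is valid: both loads execute
print(fragment(mem, 0))       # r10 == 0: the conditional load is annulled, no fault
```

Replacing `lwc` with a plain `lw` in `fragment` would raise `MemoryFault` for `r10 == 0` in this model, which is the exception problem the slide warns about.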

Predication

. Intel IA-64 architecture uses full predication
  • Every register-writing instruction can be conditional
  • Has 64 1-bit predicate registers
  • Estimates you lose 20-30% in branch mispredictions
  • Mahlke et al. claim predication can remove 50% of branches
    – And 40% of the mispredictions!
. Idea: since branches chop up ILP opportunity
  • Execute both sides of the branch!
  • Fill available slots with useful work
  • Tag each instruction with a predicate condition
  • Instruction commits only if predicate is true
  • Can combine predicates (make compound predicates)
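The "execute both sides, commit only if the predicate is true" idea can be sketched with a toy interpreter. Everything here is an assumption made for illustration: the tuple encoding `(predicate, op, dest, sources)`, the `OPS` table, and the use of `p0` as an always-true predicate. The program is the if-converted form of `if (a == b) x = y + z; else x = y - z;`.

```python
OPS = {"add": lambda x, y: x + y, "sub": lambda x, y: x - y}


def run_predicated(program, regs):
    """Execute every instruction; commit a result only if its predicate is true."""
    regs = dict(regs)
    preds = {"p0": True}                 # assumption: p0 is always true
    for qp, op, dst, srcs in program:
        if op == "cmp.eq":
            if preds[qp]:
                equal = regs[srcs[0]] == regs[srcs[1]]
                # One compare writes a predicate and its complement.
                preds[dst[0]], preds[dst[1]] = equal, not equal
        else:
            value = OPS[op](*(regs[s] for s in srcs))   # always computed
            if preds[qp]:                               # committed only if true
                regs[dst] = value
    return regs


IF_CONVERTED = [
    ("p0", "cmp.eq", ("p1", "p2"), ("a", "b")),
    ("p1", "add", "x", ("y", "z")),      # then side
    ("p2", "sub", "x", ("y", "z")),      # else side
]

print(run_predicated(IF_CONVERTED, {"a": 1, "b": 1, "y": 5, "z": 3})["x"])
print(run_predicated(IF_CONVERTED, {"a": 1, "b": 2, "y": 5, "z": 3})["x"])
```

No branch appears in `IF_CONVERTED`; both the add and the sub occupy an issue slot on every run, and only one of them commits.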

Predication Example

[Figure: if-then-else flow graph. A cmp sets predicates P1 and P2, which guard the then and else instructions.]

. Removes branches
  • Executes multiple paths simultaneously
. Increases performance by exposing parallelism
  • Reduces critical path
  • Better utilization of wide machines

Combining Predicates

[Figure: nested if-then-else. A compare sets p1, p2; a second compare guarded by (p2) sets p3, p4; the compound predicates p2&p3 and p2&p4 guard the inner paths (p3)… and (p4)…]

. Predicates can be combined to predicate many if-then-else levels
  • This is difficult to do with simple conditional instructions
  • Intel cites study
    – cmove: 39% more instructions, 23% worse performance
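Compound predicates for a two-level if-then-else can be sketched as follows. The predicate names mirror the slide (p1 to p4, p2&p3, p2&p4); which predicate guards which path, the conditions `a > 0` and `b > 0`, and the helper `select` are assumptions chosen for the example.

```python
def nested_branches(a, b):
    """Reference version with real branches."""
    if a > 0:
        return 1
    elif b > 0:
        return 2
    else:
        return 3


def select(pred, new, old):
    """Commit `new` only when the predicate is true; otherwise keep `old`."""
    return new if pred else old


def nested_predicated(a, b):
    """Branch-free version: every assignment is guarded by a predicate."""
    p1 = a > 0                  # p1, p2 set by the outer compare
    p2 = not p1
    p3 = b > 0                  # p3, p4 set by the inner compare (guarded by p2)
    p4 = not p3
    p2_and_p3 = p2 and p3       # compound predicates cost extra instructions
    p2_and_p4 = p2 and p4
    x = 0
    x = select(p1, 1, x)
    x = select(p2_and_p3, 2, x)
    x = select(p2_and_p4, 3, x)
    return x


# The two versions agree on every combination of outcomes.
for a in (-1, 1):
    for b in (-1, 1):
        assert nested_predicated(a, b) == nested_branches(a, b)
```

Exactly one of p1, p2&p3, and p2&p4 is true for any input, so exactly one guarded assignment commits.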

Limitations of Conditional Instructions

. Is predication really a good idea?
  • Open question (Intel hoped so)
. Potential disadvantages and limitations
  • Conditional instructions that are annulled (predicate is false) still take execution time and occupy a potential issue slot. There is “zero cost” only if the conditional instruction fills an otherwise idle slot
  • Need swanky compiler technology
  • Compound predicates must be computed (more slots)
  • Full benefits of predication still require speculation
  • Conditional instructions may have a speed penalty compared with unconditional insts (lower clock rate?)
  • If branch prediction accuracy is high, can be better to be greedy and not execute both sides of every branch
  • Room for multithreading?

Intel/HP EPIC Architecture

. EPIC (Explicitly Parallel Instruction Computing)
  • Similar philosophy to VLIW
  • Tries to overcome VLIW limitations
. IA-64 is the ISA, which Intel now calls Intel Architecture
  • As opposed to IA-32 in the Pentium line
  • HP, Compaq (Alpha), and SGI (MIPS) all signed on
. Itanium (Merced) and Itanium II (McKinley)
  • Itanium III (Madison), 2003: 410 million transistors!
  • Itanium IV, 2006: dual-core CMP
  • Latest version from Intel (Montvale), 2007 (another in 2010: Tukwila)
. Intel produced x86-64 in 2004…

Itanium Characteristics

. Explicitly parallel
  • ILP encoded in machine binary
  • Compiler looks beyond basic block scheduling
. Advanced ILP techniques
  • Full predication
  • ISA support for control and data speculation
  • HW support for SW pipelining (modulo scheduling)
. IA-32 hardware support
. Massive hardware resources
  • Lots of FUs, tons of registers

Instruction Format

. 3 instructions packaged in 128-bit bundles
  • Instructions are 41 bits each
  • 5-bit template at the end of each bundle
. Template bits encode dependences
  • Both within (using stops) and between bundles
  • Also identify type of packet to help decode
  • More efficient and flexible than fixed VLIW format
. Each instruction
  • Has a 6-bit field specifying 1 of 64 predicate registers
  • Has a max of 3 7-bit register specifiers
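The bundle arithmetic (3 × 41 + 5 = 128 bits) can be checked with a short packing sketch. The bit positions used here are assumptions for illustration only: the template is placed in the low-order 5 bits of the bundle and the predicate number in the low-order 6 bits of each instruction. The architecture manual is the authority on the real encoding.

```python
SLOT_BITS = 41        # bits per instruction (from the slide)
TEMPLATE_BITS = 5     # bits of template per bundle (from the slide)
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instructions into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly 3 instructions")
    bundle = template
    for i, insn in enumerate(slots):
        if not 0 <= insn <= SLOT_MASK:
            raise ValueError("instruction must fit in 41 bits")
        bundle |= insn << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Inverse of pack_bundle: return (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots


def predicate_field(insn):
    """Assumed position: the low 6 bits select 1 of 64 predicate registers."""
    return insn & 0x3F


bundle = pack_bundle(0x10, [1, 2, 3])
assert bundle < (1 << 128)
assert unpack_bundle(bundle) == (0x10, [1, 2, 3])
```

Three 7-bit register specifiers plus the 6-bit predicate field use 27 of the 41 instruction bits, leaving 14 bits for the opcode and other fields.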

Itanium Architecture Features

. 128 integer registers, 128 FP registers
  • 32 fixed, 96 “rotating”
  • Int registers are 65 bits! (NaT bit, later)
. 64 predicate registers
  • 16 fixed, 48 “rotating”
. Register stack on function calls
  • Similar to SPARC’s register windows
. Multi-way branches (more than 1 branch unit)
. Parallel compares (AND, OR, ANDOR)

Control Speculation

Traditional Architectures: instr 1 → instr 2 → . . . → br (barrier) → Load → use
IA-64: ld.s → instr 1 → instr 2 → br → chk.s → use

Allows elevation of load, even above a branch

. Improve memory latency by speculation at compile time
. Defer exceptions by setting NaT (bit 65)
  • ld.s sets NaT instead of raising the exception; chk.s checks it and branches to recovery code if it is set

Hoisting Uses as Well

IA-64, load hoisted: ld.s → instr 1 → instr 2 → br → chk.s → use
IA-64, uses hoisted too (Home Block): ld.s → instr 1 → instr 2 → uses → br → chk.s
Recovery code: ld → uses → br home

. Uses of speculative loads can execute speculatively
  • Distinguishes speculation from simple prefetch
. NaT propagates down the dependent inst chain
  • Single chk.s instruction handles recovery
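Deferred exceptions and NaT propagation can be modeled in a few lines of Python. This is a sketch, not the hardware mechanism: `NAT` is a hypothetical marker object standing in for a register whose NaT bit is set, memory is a dictionary, and a missing key plays the role of a faulting address.

```python
NAT = object()   # hypothetical marker: "this register has its NaT bit set"


def ld_s(mem, addr):
    """Speculative load: defer a fault by returning NaT instead of raising."""
    return mem.get(addr, NAT)


def add(x, y):
    """A dependent instruction: NaT in any source gives NaT in the result."""
    return NAT if x is NAT or y is NAT else x + y


def chk_s(value, recovery):
    """Run the recovery code only if the speculation failed."""
    return recovery() if value is NAT else value


def hoisted(mem, addr, path_taken):
    r1 = ld_s(mem, addr)        # load hoisted above the branch
    r2 = add(r1, 1)             # its use is hoisted too
    if not path_taken:          # br: the home block is never reached
        return None             # a deferred fault is silently dropped

    def recovery():
        # Non-speculative ld + uses, then "br home"; a real fault surfaces here.
        return mem[addr] + 1

    return chk_s(r2, recovery)  # one check covers the whole dependent chain


mem = {8: 41}
print(hoisted(mem, 8, True))      # speculation succeeded
print(hoisted(mem, 999, False))   # bad address, but the result was never needed
```

Calling `hoisted(mem, 999, True)` raises `KeyError` from the recovery code in this model, which corresponds to the exception being taken only when the load was actually required.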

Data Speculation

Traditional Architectures: instr 1 → instr 2 → . . . → Store (*) (barrier) → Load (*) → use
IA-64: ld8.a → instr 1 → instr 2 → st8 → ld.c → use

. Compiler can hoist load prior to preceding, possibly conflicting store
. ALAT (Advanced Load Address Table) is used to check address of every store in-between
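The ALAT check can be sketched as a small table keyed by target register. The class below is a simplified model built on assumptions: entries hold only an exact address (no access size or partial overlap), and the method names `ld_a`, `store`, and `ld_c` are chosen to echo the mnemonics on the slide.

```python
class ALAT:
    """Toy Advanced Load Address Table: target register -> load address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, reg, addr, mem):
        """Advanced load: perform the load early and record its address."""
        self.entries[reg] = addr
        return mem[addr]

    def store(self, addr, value, mem):
        """Every store checks the table and removes conflicting entries."""
        mem[addr] = value
        for reg in [r for r, a in self.entries.items() if a == addr]:
            del self.entries[reg]

    def ld_c(self, reg, addr, mem, current):
        """Check load: keep the early value if the entry survived, else reload."""
        if reg in self.entries:
            return current
        return mem[addr]


mem = {0x10: 1, 0x20: 2}
alat = ALAT()

r4 = alat.ld_a("r4", 0x10, mem)          # ld8.a hoisted above the store
alat.store(0x20, 9, mem)                 # st8 to a different address
r4 = alat.ld_c("r4", 0x10, mem, r4)      # entry still valid: no reload
assert r4 == 1

alat.store(0x10, 7, mem)                 # st8 that does conflict
r4 = alat.ld_c("r4", 0x10, mem, r4)      # entry gone: value is reloaded
assert r4 == 7
```

When uses of the load are hoisted as well, a reload alone is not enough; the dependent instructions must be re-executed, which is the job of recovery code reached through chk.a.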

Hoisting Uses (again)

Load hoisted: ld8.a → instr 1 → instr 2 → st8 → ld.c → use
Uses hoisted too: ld8.a → instr 1 → use → instr 2 → st8 → chk.a
Recovery code: ld8 → uses → br home

. Can also hoist uses past dependent store
. chk.a jumps to recovery code if mis-speculation

Rotating Registers

. Instructions contain “virtual” register number
  • Rotating Register Base (RRB) + VRN = PRN
  • Int and FP regs 32-127 can rotate (each has RRB)
. Predicate registers can also rotate
. Avoids code expansion from unrolling loops and code explosion from prologue/epilogue
  • Fewer cache misses
  • Can now SW pipeline small loops with unknown number of iterations (typical in integer code)
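The RRB + VRN = PRN mapping can be sketched as below. Two details are assumptions added for the example: the sum wraps around within the 96 rotating registers, and the RRB is decremented once per loop iteration. The function name `physical_reg` is likewise hypothetical.

```python
FIRST_ROTATING = 32    # registers 0-31 are fixed (from the slide)
NUM_ROTATING = 96      # registers 32-127 rotate (from the slide)


def physical_reg(vrn, rrb):
    """Map a virtual register number to a physical one for a given RRB."""
    if vrn < FIRST_ROTATING:
        return vrn     # fixed registers are never renamed
    # Assumption: the rotating region wraps around modulo its size.
    return FIRST_ROTATING + (vrn - FIRST_ROTATING + rrb) % NUM_ROTATING


# Software-pipelined loop: every iteration writes virtual r32 and reads
# virtual r33. With the RRB decremented each iteration, r33 names the
# physical register that r32 named one iteration earlier.
rrb = 0
for _ in range(5):
    written = physical_reg(32, rrb)
    rrb -= 1                                   # rotation at the loop branch
    assert physical_reg(33, rrb) == written
```

The loop body uses the same two virtual register numbers in every iteration, so no unrolled copies with renamed registers are needed.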

Itanium II Implementation (McKinley)

. 0.18 µm process, 220 million transistors
  • Intel announced 65nm process
. 10-stage in-order pipe, 1 GHz
  • Executes 2 bundles/CLK
  • 11 issue ports, 6 integer ALUs, 2 LD & 2 ST per cycle
. Caches
  • L1 – 2 × 16KB, 1 cycle
  • L2 – 256KB, 5 cycles
  • L3 – 3MB, 12 cycles, on chip!
. Addressing
  • 50-bit PA, 64-bit VA

Itanium “Tukwila” (2008, 2009, 2010)

. Intel data from 4/2008:
. 4 cores
. 6MB $/core, 24MB $ total
. ~2.0 GHz
. 698 mm² in 65nm CMOS!!!!!
. 170W
. Over 2 billion transistors

Itanium Sales

Itanium Reception

. EPIC Fail in the marketplace: initial delays and underperformance
. “Unobtanium”
. “The Itanium approach … was supposed to be so terrific – until it turned out that the wished-for compilers were basically impossible to write.” – Donald Knuth
. The “Itanic”: huge monetary investment, quick demise
. “…one of the great fiascos of the last 50 years” – John C. Dvorak
  • “The Macintosh uses an experimental pointing device called a ‘mouse’. There is no evidence that people want to use these things. I don’t want one of these new fangled devices.”
. In 1987 IBM took its ball and went home, and developed the PS/2
  • The world did not follow and stuck with cheap PC clones
  • IBM never recovered its position in the PC industry…

AMD Hammer Family

. “The other white meat”
  • AMD calls architecture x86-64
  • Released in 2003
  • Why bother with all this EPIC stuff?
  • “Biggie” size it!
. Extend x86 registers to 64 bits
  • Automatic backward compatibility to IA-32
. Two instances
  • Clawhammer (256KB L2) and Sledgehammer (1MB)
  • Much like P4 (w/o HT) but with integrated MC and better MP support
