Intel Itanium Architecture
Alex Crawford
Matt Ofalt

Brief History
● Merced - 2001
  ○ Slower than competing RISC and CISC processors
● McKinley (Itanium 2) - 2002
  ○ Fixed many of Merced's performance problems
● Montecito (Itanium 2 9000) - 2006
  ○ Dual-core, roughly doubled performance
● Tukwila (Itanium 9300) - 2010
  ○ Quad-core, memory error correction
  ○ Shares its chipset with Nehalem

Itanium Overview
● 64-bit (path, data, address space)
● Explicit instruction-level parallelism (VLIW)
  ○ Static "superscaling"
● Compiler
  ○ Predication
  ○ Speculation
  ○ Branch Prediction
● 128 integer registers, 128 FP registers
● 30 functional execution units

Compilers
● Very difficult to write
  ○ Predication
  ○ Speculation
  ○ Branch Prediction
● This is the reason the architecture is failing, but...
● Allows for huge improvements
● We like assembly better anyway, right?

IA-64 Instructions
● Issued in 128-bit "bundles"
● Three 41-bit instructions per bundle
● Template tells CPU which instructions execute in parallel
  ○ Not constrained to just one bundle (8 inst. in parallel)
● Six instruction types
  ○ A   Integer ALU        I/M unit
  ○ I   Non-ALU integer    I unit
  ○ M   Memory             M unit
  ○ F   Floating-point     F unit
  ○ B   Branch             B unit
  ○ X   Extended           I/B unit

Execution Units
● I-Unit
  ○ Integer arithmetic
  ○ Shift and add
  ○ Logical
● M-Unit
  ○ Load and Store
  ○ Basic integer ALU operations
● B-Unit
  ○ Branches
● F-Unit
  ○ Floating point

IA-64 Assembly
[(qp)] mnemonic[.comp] dest = src [;;] [// comment]

    (p0) cmp.eq p1,p2=5,r7    // conditional 5 == r7

qp       - qualifying predicate (a 1-bit predicate register)
mnemonic - name of instruction
comp     - instruction completer
dest     - one or more destination operands
src      - one or more source operands
;;       - instruction group stop
//       - comment

Assembly Example

    ld8 r2 = [r3]
    sub r4 = r10, r11 ;;    // end of first instruction group
    add r5 = r2, r6
    st8 [r4] = r7 ;;        // end of second instruction group
    add r2 = 1, r2 ;;       // immediate operand comes first
    st8 [r2] = r5

IA-64 Instruction Format
128-Bit Bundle

    | Instruction 1 | Instruction 2 | Instruction 3 | Template |
    | (41 bits)     | (41 bits)     | (41 bits)     | (5 bits) |

41-Bit Instruction

    | Major Opcode | Modifying Bits | GR3      | GR2      | GR1      | PR       |
    | (4 bits)     | (10 bits)      | (7 bits) | (7 bits) | (7 bits) | (6 bits) |

Template Field
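The field widths in the bundle and instruction figures add up exactly: 3 x 41 + 5 = 128 and 4 + 10 + 7 + 7 + 7 + 6 = 41. A minimal Python sketch of packing and unpacking those fields follows. It assumes the template sits in the least-significant 5 bits with the three slots above it, and that instruction fields run from PR in the low bits up to the major opcode; the slides do not state the bit order. The function names are hypothetical, chosen only for illustration.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Combine a 5-bit template and three 41-bit slots into a 128-bit integer."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    bundle = template & TEMPLATE_MASK
    for i, insn in enumerate(slots):
        # Assumed layout: slot i starts right after the template and earlier slots.
        bundle |= (insn & SLOT_MASK) << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Return (template, [slot, slot, slot]) from a 128-bit bundle value."""
    template = bundle & TEMPLATE_MASK
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots


def split_instruction(insn):
    """Split a 41-bit instruction into the fields named in the figure."""
    return {
        "pr": insn & 0x3F,                  # 6-bit predicate register
        "gr1": (insn >> 6) & 0x7F,          # 7-bit register field
        "gr2": (insn >> 13) & 0x7F,         # 7-bit register field
        "gr3": (insn >> 20) & 0x7F,         # 7-bit register field
        "modifiers": (insn >> 27) & 0x3FF,  # 10 modifying bits
        "opcode": (insn >> 37) & 0xF,       # 4-bit major opcode
    }


if __name__ == "__main__":
    bundle = pack_bundle(0b10000, [0x1, 0x2, 0x3])  # template 10000 is M I B
    print(unpack_bundle(bundle))
```

A 7-bit register field is what lets each instruction name any of the 128 registers, and the 6-bit predicate field any of 64 predicate registers. The template values themselves are listed in the table that follows.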
Template Slot 1 Slot 2 Slot 3 Template Slot 1 Slot 2 Slot 3
00000 M I I 01110 M M F
00001 M I I 01111 M M F
00010 M I I 10000 M I B
00011 M I I 10001 M I B
00100 M L X 10010 M B B
00101 M L X 10011 M B B
01000 M M I 10110 B B B
01001 M M I 10111 B B B
01010 M M I 11000 M M B
01011 M M I 11001 M M B
01100 M F I 11100 M F B
01101 M F I 11101 M F B

Branching on x86

    if (G_LIKELY(random() != 1))      call 8048440
    if (G_UNLIKELY(random() != 1))    call 8048440
Branching on IA-64
    // random() -> r14
    // not_ones -> r31
    // ones     -> r32

    if (random() != 1)           cmp.ne p1,p2=1,r14    // p1 = (r14 != 1), p2 = !p1
        not_ones++;         (p1) adds r31=1,r31
    else                    (p2) adds r32=1,r32
        ones++;
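The idea behind the predicated version can be modeled in a few lines of Python. This is only a sketch of the semantics, not of the hardware: one compare sets two complementary predicates, both increments are issued on every iteration, and only the one whose predicate is true changes its register. The function names are hypothetical; the variable names mirror the registers used on the slide.

```python
def count_with_branch(values):
    """Reference version: an ordinary if/else, as the C source is written."""
    not_ones = ones = 0
    for value in values:
        if value != 1:
            not_ones += 1
        else:
            ones += 1
    return not_ones, ones


def count_with_predicates(values):
    """Predicated version: no branch, both increments always issue."""
    r31 = r32 = 0                # r31 = not_ones, r32 = ones
    for r14 in values:
        p1 = r14 != 1            # the compare writes both predicates at once
        p2 = not p1
        r31 += int(p1)           # takes effect only when p1 is true
        r32 += int(p2)           # takes effect only when p2 is true
    return r31, r32


if __name__ == "__main__":
    sample = [1, 2, 1, 3, 1]
    print(count_with_branch(sample), count_with_predicates(sample))
```

Because neither path depends on a branch outcome, there is nothing to mispredict; the cost is that both instructions occupy issue slots every time.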
Data Speculation on IA-64
    ld8.a r6 = [r8] ;;        // advanced load
    // other stuff
    ld8.c.clr r6 = [r8]       // check load: reloads only if the address was written
    add r5 = r6, r7 ;;
    st8 [r18] = r5

Data Speculation on IA-64 (cont.)
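    ld8.a r6 = [r8] ;;        // advanced load
    // other stuff
    add r5 = r6, r7 ;;        // use the speculative value early
    // more stuff
    chk.a.clr r6, dirty       // jump to recovery code if the load was invalidated
origin:
    st8 [r18] = r5

dirty:
    ld8.a r6 = [r8] ;;        // recovery: redo the load and the dependent add
    add r5 = r6, r7 ;;
    br origin

Data Speculation on x86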
???

Rotating Register Stack

● r32-r127 can rotate ("register renaming")
● loop unrolling
● parameter passing
● overflows to memory

Performance
● Two bundles per cycle
  ○ Up to six instructions per cycle
  ○ Multiply-accumulate allows for 4 FLOPs per cycle
● Quad core
  ○ QPI (96 GiB/s)
  ○ Four memory controllers (34 GiB/s)
● Split L1 cache (16 KiB instruction, 16 KiB data)
● Unified L2 cache (256 KiB)
● Unified L3 cache (24 MiB)

Where do I buy one?
● $3,838 for the Tukwila 9350
● Servers in excess of $200,000
● Newegg doesn't have them

Emulation
● ski
  ○ ski - ncurses-based IA-64 simulator
  ○ xski - ski with a GUI
  ○ http://ski.sourceforge.net/
● cross compile
  ○ ia64-gcc
  ○ ia64-as (live on the edge)

Questions?