Intel Itanium Architecture
Alex Crawford, Matt Ofalt

Brief History

● Merced - 2001
  ○ Slower than competing RISC and CISC processors
● McKinley (Itanium 2) - 2002
  ○ Fixed many of the performance problems on Merced
● Montecito (Itanium 2 9000) - 2006
  ○ Dual-core, roughly doubled performance
● Tukwila (Itanium 9300) - 2010
  ○ Quad-core, memory error correction
  ○ Shares its chipset with Nehalem

Itanium Overview
● 64-bit (path, data, address space)
● Explicit instruction-level parallelism (VLIW)
  ○ Static "superscaling"
● Compiler
  ○ Predication
  ○ Speculation
  ○ Branch Prediction
● 128 integer registers, 128 FP registers
● 30 functional execution units

Compilers
● Very difficult to write
  ○ Predication
  ○ Speculation
  ○ Branch Prediction
● This is the reason the architecture is failing, but...
● Allows for huge improvements
● We like assembly better anyway, right?

IA-64 Instructions
● Issued in 128-bit "bundles"
● Three 41-bit instructions per bundle
● Template tells CPU which instructions execute in parallel
  ○ Not constrained to just one bundle (8 instructions in parallel)
● Six instruction types

  Type  Description      Execution unit
  A     Integer ALU      I/M unit
  I     Non-ALU integer  I unit
  M     Memory           M unit
  F     Floating-point   F unit
  B     Branch           B unit
  X     Extended         I/B unit

Execution Units
● I-Unit
  ○ Integer arithmetic
  ○ Shift and add
  ○ Logical
● M-Unit
  ○ Load and Store
  ○ Basic integer ALU operations
● B-Unit
  ○ Branches
● F-Unit
  ○ Floating point

IA-64 Assembly
    [(qp)] mnemonic[.comp] dest = src [;;] [// comment]
    (p0) cmp.eq p1,p2=5,r7    // conditional 5 == r7

    qp        qualifying predicate (1-bit predicate register)
    mnemonic  name of instruction
    comp      instruction completer
    dest      one or more destination operands
    src       one or more source operands
    ;;        instruction group stop
    //        comment

Assembly Example
    ld8 r2 = [r3]
    sub r4 = r10, r11 ;;
    add r5 = r2, r6
    st8 [r4] = r7 ;;
    add r2 = r2, 1 ;;
    st8 [r2] = r5

IA-64 Instruction Format
128-Bit Bundle

  Instruction 1  Instruction 2  Instruction 3  Template
  (41 bits)      (41 bits)      (41 bits)      (5 bits)

41-Bit Instruction

  Major Opcode  Modifying Bits  GR3       GR2       GR1       PR
  (4 bits)      (10 bits)       (7 bits)  (7 bits)  (7 bits)  (6 bits)

Template Field

  Template  Slot 1  Slot 2  Slot 3
  00000     M       I       I
  00001     M       I       I
  00010     M       I       I
  00011     M       I       I
  00100     M       L       X
  00101     M       L       X
  01000     M       M       I
  01001     M       M       I
  01010     M       M       I
  01011     M       M       I
  01100     M       F       I
  01101     M       F       I
  01110     M       M       F
  01111     M       M       F
  10000     M       I       B
  10001     M       I       B
  10010     M       B       B
  10011     M       B       B
  10110     B       B       B
  10111     B       B       B
  11000     M       M       B
  11001     M       M       B
  11100     M       F       B
  11101     M       F       B

Branching on x86
    if (G_LIKELY(random() != 1))
        printf("not one");

    call 8048440 <random@plt>
    cmp $0x1,%eax
    je 8048524 <main+0x20>
    mov $0x80485f0,%eax
    mov %eax,(%esp)
    call 8048410 <printf@plt>

    if (G_UNLIKELY(random() != 1))
        printf("not one");

    call 8048440 <random@plt>
    cmp $0x1,%eax
    jne 8048524 <main+0x1B>
    mov $0x0,%eax
    leave
    ret

Branching on IA-64
    // random() -> r14
    // not_ones -> r31
    // ones -> r32

    if (random() != 1)
        not_ones++;
    else
        ones++;

    cmp.ne p1,p2=1,r14 ;;    // p1 = (r14 != 1), p2 = (r14 == 1)
    (p1) adds r31=1,r31      // not_ones++
    (p2) adds r32=1,r32      // ones++

Data Speculation on IA-64
    ld8.a r6 = [r8] ;;
    // other stuff
    ld8.c r6 = [r8]
    add r5 = r6, r7 ;;
    st8 [r18] = r5

Data Speculation on IA-64 (cont.)
    ld8.a r6 = [r8]
    // other stuff
    ;;
    add r5 = r6, r7
    // more stuff
    chk.a r6, dirty
    origin:
    st8 [r18] = r5

    dirty:
    ld8.a r6 = [r8] ;;
    add r5 = r6, r7 ;;
    br origin

Data Speculation on x86
    ???

Rotating Register Stack
● r32-r127 can rotate ("register renaming")
● loop unrolling
● parameter passing
● overflows to memory

Performance
● Two bundles per cycle
  ○ Up to six instructions per cycle
  ○ Multiply-accumulate allows for 4 FLOPs per cycle
● Quad core
  ○ QPI (96 GiB/s)
  ○ Four memory controllers (34 GiB/s)
● Split L1 cache (16 KiB instruction, 16 KiB data)
● Unified L2 cache (256 KiB)
● Unified L3 cache (24 MiB)

Where do I buy one?
● $3,838 for the Tukwila 9350
● Servers in excess of $200,000
● Newegg doesn't have them

Emulation
● ski
  ○ ski - ncurses-based IA-64 simulator
  ○ xski - ski with a GUI
  ○ http://ski.sourceforge.net/
● cross compile
  ○ ia64-gcc
  ○ ia64-as (live on the edge)

Questions?
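
Appendix: decoding a bundle

The bundle and instruction layouts from the "IA-64 Instruction Format" and "Template Field" slides can be sketched in a few lines of Python. This is an illustrative sketch, not an assembler or disassembler. The function names (`pack_bundle`, `unpack_bundle`, `encode_slot`, `decode_slot`) are made up for this example, and the bit positions are an assumption the slides do not state: the template is taken to sit in the lowest 5 bits with the three slots above it, and the instruction fields are taken to run from the major opcode in the highest bits down to the predicate register in the lowest 6 bits.

```python
# Sketch only. Assumed bit positions: template in bits 4..0,
# slot i in the 41 bits starting at bit 5 + 41 * i.
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1

# Unit types per slot for the template values listed on the Template Field slide.
# Values that do not appear on the slide are left out on purpose.
TEMPLATES = {
    0b00000: "MII", 0b00001: "MII", 0b00010: "MII", 0b00011: "MII",
    0b00100: "MLX", 0b00101: "MLX",
    0b01000: "MMI", 0b01001: "MMI", 0b01010: "MMI", 0b01011: "MMI",
    0b01100: "MFI", 0b01101: "MFI", 0b01110: "MMF", 0b01111: "MMF",
    0b10000: "MIB", 0b10001: "MIB", 0b10010: "MBB", 0b10011: "MBB",
    0b10110: "BBB", 0b10111: "BBB", 0b11000: "MMB", 0b11001: "MMB",
    0b11100: "MFB", 0b11101: "MFB",
}


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three slots")
    bundle = template
    for i, insn in enumerate(slots):
        if not 0 <= insn <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        bundle |= insn << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit integer into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots


def encode_slot(opcode, modifiers, gr3, gr2, gr1, pr):
    """Build a 41-bit instruction from the six fields on the slide (4+10+7+7+7+6)."""
    return ((opcode & 0xF) << 37 | (modifiers & 0x3FF) << 27 |
            (gr3 & 0x7F) << 20 | (gr2 & 0x7F) << 13 |
            (gr1 & 0x7F) << 6 | (pr & 0x3F))


def decode_slot(insn):
    """Split a 41-bit instruction back into its six fields."""
    return {
        "opcode": (insn >> 37) & 0xF,
        "modifiers": (insn >> 27) & 0x3FF,
        "gr3": (insn >> 20) & 0x7F,
        "gr2": (insn >> 13) & 0x7F,
        "gr1": (insn >> 6) & 0x7F,
        "pr": insn & 0x3F,
    }


if __name__ == "__main__":
    # Field values here are arbitrary, chosen only to exercise the round trip.
    insn = encode_slot(opcode=8, modifiers=0, gr3=6, gr2=2, gr1=5, pr=0)
    bundle = pack_bundle(0b00000, [insn, 0, 0])
    template, slots = unpack_bundle(bundle)
    print(TEMPLATES[template], decode_slot(slots[0]))
```

The field widths add up as the slides say: 4 + 10 + 7 + 7 + 7 + 6 = 41 bits per instruction, and 5 + 3 × 41 = 128 bits per bundle, which is why a Python integer is enough to hold a whole bundle here.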
