
Intel Architecture
Alex Crawford, Matt Ofalt

Brief History

● Merced (Itanium) - 2001
  ○ Slower than competing RISC and CISC processors
● McKinley (Itanium 2) - 2002
  ○ Fixed many of the performance problems on Merced
● Montecito (Itanium 2 9000) - 2006
  ○ Dual-core, roughly doubled performance
● Tukwila (Itanium 9300) - 2010
  ○ Quad-core, memory error correction
  ○ Shares its chipset and interconnect (QPI) with Nehalem

Itanium Overview

● 64-bit (path, data, address space)
● Explicit instruction-level parallelism (VLIW)
  ○ Static "superscaling"
● Predication, speculation, and branch prediction
● 128 integer registers, 128 FP registers
● 30 functional execution units

● Very difficult to write
  ○ Predication
  ○ Speculation
  ○ Branch Prediction
● This is the reason the architecture is failing, but...
● Allows for huge improvements
● We like assembly better anyway, right?

IA-64 Instructions

● Issued in 128-bit "bundles"
● Three 41-bit instructions per bundle
● Template tells CPU which instructions execute in parallel
  ○ Not constrained to just one bundle (8 inst. in parallel)
● Six instruction types

  Type  Description       Execution unit
  A     Integer ALU       I/M unit
  I     Non-ALU integer   I unit
  M     Memory            M unit
  F     Floating-point    F unit
  B     Branch            B unit
  X     Extended          I/B unit

Execution Units

● I-Unit
  ○ Integer arithmetic
  ○ Shift and add
  ○ Logical
● M-Unit
  ○ Load and Store
  ○ Basic integer ALU operations
● B-Unit
  ○ Branches
● F-Unit
  ○ Floating point

IA-64 Assembly

[(qp)] mnemonic[.comp] dest = src [;;] [// comment]

(p0) cmp.eq p1,p2=5,r7    // compare: is 5 == r7?

(qp)      - qualifying predicate, a 1-bit predicate register
mnemonic  - name of instruction
comp      - instruction completer
dest      - one or more destination operands
src       - one or more source operands
;;        - stop: ends an instruction group
//        - comment

Assembly Example

  ld8 r2 = [r3]
  sub r4 = r10, r11 ;;
  add r5 = r2, r6
  st8 [r4] = r7 ;;
  add r2 = 1, r2 ;;
  st8 [r2] = r5

IA-64 Instruction Format

128-Bit Bundle

  Instruction 1   Instruction 2   Instruction 3   Template
  (41 bits)       (41 bits)       (41 bits)       (5 bits)
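
The bundle layout above (3 × 41 + 5 = 128 bits) can be sketched as plain bit arithmetic. This is a minimal Python model, not an assembler: the helper names `pack_bundle` and `unpack_bundle` are hypothetical, and the sketch assumes the template sits in the five least-significant bits with the three slots stacked above it, which the diagram itself does not state.

```python
SLOT_BITS = 41        # width of one instruction slot
TEMPLATE_BITS = 5     # width of the template field
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into a 128-bit integer.

    Assumption: template in bits 0-4, slots in the bits above it.
    """
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three slots")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, [slot, slot, slot])."""
    template = bundle & TEMPLATE_MASK
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    ]
    return template, slots
```

For example, `unpack_bundle(pack_bundle(0b10000, [1, 2, 3]))` gives back the template and the three slot values unchanged.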

41-Bit Instruction

  Major Opcode   Modifying Bits   GR3        GR2        GR1        PR
  (4 bits)       (10 bits)        (7 bits)   (7 bits)   (7 bits)   (6 bits)

Template Field

  Template  Slot 1  Slot 2  Slot 3      Template  Slot 1  Slot 2  Slot 3
  00000     M       I       I           01110     M       M       F
  00001     M       I       I           01111     M       M       F
  00010     M       I       I           10000     M       I       B
  00011     M       I       I           10001     M       I       B
  00100     M       L       X           10010     M       B       B
  00101     M       L       X           10011     M       B       B
  01000     M       M       I           10110     B       B       B
  01001     M       M       I           10111     B       B       B
  01010     M       M       I           11000     M       M       B
  01011     M       M       I           11001     M       M       B
  01100     M       F       I           11100     M       F       B
  01101     M       F       I           11101     M       F       B

Branching on x86

  if (G_LIKELY(random() != 1))
      printf("not one");

  call 8048440
  cmp  $0x1,%eax
  je   8048524
  mov  $0x80485f0,%eax
  mov  %eax,(%esp)
  call 8048410

  if (G_UNLIKELY(random() != 1))
      printf("not one");

  call 8048440
  cmp  $0x1,%eax
  jne  8048524
  mov  $0x0,%eax
  leave
  ret

Branching on IA-64

  // random() -> r14
  // not_ones -> r31
  // ones     -> r32
  if (random() != 1)
      not_ones++;
  else
      ones++;

  cmp.eq p1,p2=1,r14 ;;     // p1 = (r14 == 1), p2 = (r14 != 1)
  (p2) adds r31=1,r31       // not_ones++
  (p1) adds r32=1,r32       // ones++
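
The if/else above compiles to straight-line code because a compare writes two complementary predicates and each add commits only when its predicate is true. Below is a minimal Python model of that idea, assuming nothing about real hardware timing. The names `cmp_eq`, `predicated_add`, and `count` are hypothetical, and the predicates are called `p_eq` and `p_ne` here rather than p1 and p2.

```python
def cmp_eq(a, b):
    """Model of a compare: returns two complementary predicates (p_eq, p_ne)."""
    p_eq = a == b
    return p_eq, not p_eq


def predicated_add(pred, dest, imm):
    """Model of a predicated add: the new value commits only if pred is true."""
    return dest + imm if pred else dest


def count(values):
    """Count values that are not 1 and values that are 1, with no branch on the data."""
    not_ones = 0
    ones = 0
    for r14 in values:
        p_eq, p_ne = cmp_eq(1, r14)
        # Both adds are issued for every value; exactly one of them commits.
        not_ones = predicated_add(p_ne, not_ones, 1)
        ones = predicated_add(p_eq, ones, 1)
    return not_ones, ones
```

Calling `count([1, 2, 3, 1, 1])` returns `(2, 3)`: two values differ from 1 and three equal it.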

Data Speculation on IA-64

  ld8.a r6 = [r8] ;;
  // other stuff
  ld8.c r6 = [r8]
  add r5 = r6, r7 ;;
  st8 [r18] = r5

Data Speculation on IA-64 (cont.)
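
The ld8.a / ld8.c pair works because the processor remembers which addresses were loaded in advance (in a structure called the Advanced Load Address Table, or ALAT) and forgets an entry when a store hits the same address. The Python sketch below models only that bookkeeping. The class `Machine` and its method names are hypothetical, and it matches addresses exactly, which is a simplification of what real hardware does.

```python
class Machine:
    """Toy model of advanced loads, stores, and check loads."""

    def __init__(self, memory):
        self.mem = dict(memory)   # address -> value
        self.regs = {}            # register name -> value
        self.alat = {}            # register name -> address loaded in advance

    def ld_a(self, reg, addr):
        """Advanced load: load early and record the address for later checking."""
        self.regs[reg] = self.mem[addr]
        self.alat[reg] = addr

    def st(self, addr, value):
        """Store: write memory and invalidate advanced loads from that address."""
        self.mem[addr] = value
        stale = [reg for reg, tracked in self.alat.items() if tracked == addr]
        for reg in stale:
            del self.alat[reg]

    def ld_c(self, reg, addr):
        """Check load: reload only if the advanced load was invalidated.

        Returns True when a reload was needed.
        """
        if self.alat.get(reg) == addr:
            return False
        self.regs[reg] = self.mem[addr]
        return True
```

If no store touches the address between `ld_a` and `ld_c`, the check is a no-op and the early value is used. If a store does alias, `ld_c` repeats the load, so the register ends up with the stored value.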

  ld8.a r6 = [r8]
  // other stuff
  ;;
  add r5 = r6, r7
  // more stuff
  chk.a r6, dirty
origin:
  st8 [r18] = r5

dirty:
  ld8.a r6 = [r8] ;;
  add r5 = r6, r7 ;;
  br origin

Data Speculation on x86

???

Rotating Register Stack

● r32-r127 can rotate
● loop unrolling
● parameter passing
● overflows to memory

Performance

● Two bundles per cycle
  ○ Up to six instructions
  ○ Multiply-accumulate allows for 4 FLOPs per cycle
● Quad core
  ○ QPI (96 GiB/s)
  ○ Four memory controllers (34 GiB/s)
● Split L1 cache (16 KiB instruction, 16 KiB data)
● Unified L2 cache (256 KiB)
● Unified L3 cache (24 MiB)

Where do I buy one?

● $3,838 for the Tukwila 9350
● Servers in excess of $200,000
● Newegg doesn't have them

Emulation

● ski
  ○ ski - ncurses-based IA-64 simulator
  ○ xski - ski with a GUI
  ○ http://ski.sourceforge.net/
● cross compile
  ○ ia64-gcc
  ○ ia64-as (live on the edge)

Questions?