
Intel® Itanium™ Architecture

Refresh on some features relevant for developers

Heinz Bast [[email protected]]
March 2005

Itanium® Processor Architecture: Selected Features

▪ 64-bit Addressing, Flat Memory Model
▪ Instruction Level Parallelism (6-way)
▪ Large Register Files
▪ Automatic Register Stack Engine
▪ Predication
▪ Software Pipelining Support
▪ Register Rotation
▪ Loop Control Hardware
▪ Sophisticated Branch Architecture
▪ Control & Data Speculation
▪ Powerful 64-bit Integer Architecture
▪ Advanced 82-bit Floating Point Architecture
▪ Multimedia Support (MMX™ Technology)

Copyright © 2001, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners

Traditional Architectures: Limited Parallelism

[Figure: original source code is compiled into sequential machine code; the hardware then parallelizes that code across multiple functional units. Execution units are available but used inefficiently.]

Today's Processors are often 60% Idle

Intel® Itanium® Architecture: Explicit Parallelism

[Figure: the compiler turns the original source code into parallel machine code, which the hardware runs on multiple functional units. The compiler views a wider scope, so the Itanium architecture makes more efficient use of execution resources.]

Increases Parallel Execution

EPIC Instruction Parallelism

Source Code

Instruction Groups (series of bundles)
▪ No RAW or WAW dependencies
▪ Issued in parallel depending on resources

Instruction Bundles (3 Instructions)
▪ 3 instructions + template
▪ 3 x 41 bits + 5 bits = 128 bits

Up to 6 instructions executed per clock

Instruction Level Parallelism

▪ Instruction Groups
  • No RAW or WAW dependencies
  • Delimited by 'stops' in assembly code
  • Instructions in groups issued in parallel, depending on available resources.

    instr 1      // 1st. group
    instr 2;;    // 1st. group
    instr 3      // 2nd. group
    instr 4      // 2nd. group

▪ Instruction Bundles
  • 3 instructions and 1 template in 128-bit bundle
  • Instruction dependencies by using 'stops'
  • Instruction groups can span multiple bundles

    { .mii
      ld4 r28=[r8]    // load
      add r9=2,r1     // Int op.
      add r30=1,r1    // Int op.
    }

128 bits (bundle):

    Instruction 2    Instruction 1    Instruction 0    template
    41 bits          41 bits          41 bits          5 bits

    Memory (M)       Memory (M)       Integer (I)      (MMI)

Flexible Issue Capability
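The bundle layout above (3 x 41 bits + 5 bits = 128 bits) can be sketched in a few lines of Python. This is only an illustration of the bit arithmetic: the function names `pack_bundle` and `unpack_bundle` are hypothetical, and the slot values used are placeholders, not real instruction encodings. It assumes the template occupies the lowest five bits, with slot 0 directly above it.

```python
SLOT_BITS = 41       # width of one instruction slot
TEMPLATE_BITS = 5    # width of the template field


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
    # Template in bits 0-4, slot 0 in bits 5-45, slot 1 in 46-86, slot 2 in 87-127.
    return (template
            | slot0 << TEMPLATE_BITS
            | slot1 << (TEMPLATE_BITS + SLOT_BITS)
            | slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS))


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    slot_mask = (1 << SLOT_BITS) - 1
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple((bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
                  for i in range(3))
    return template, slots
```

Packing the largest possible template and slot values fills exactly 128 bits, which is a quick check of the arithmetic on the slide.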

Large Register Set

    Register file     Registers     Width           Fixed values              Partition
    General           GR0-GR127     64-bit + NaT    GR0 = 0                   32 static, 96 stacked (from GR32)
    Floating-point    FR0-FR127     82-bit          FR0 = +0.0, FR1 = +1.0    32 static, 96 rotating (from FR32)
    Predicate         PR0-PR63      1-bit           PR0 = 1                   16 static, 48 rotating (from PR16)
    Branch            BR0-BR7       64-bit
    Application       AR0-AR127     64-bit

Predication

▪ Predicate registers activate/inactivate instructions
▪ Predicate registers are set by compare instructions
  • Example: cmp.eq p1, p2 = r2, r3
▪ (Almost) all instructions can be predicated

    (p1) ldfd f32=[r32],8
    (p2) fmpy.d f36=f6,f36

▪ Predication:
  • eliminates branching in if/else logic blocks
  • creates larger code blocks for optimization
  • simplifies start up/shutdown of pipelined loops

Predication

▪ Code Example: absolute difference of two numbers

C Code:

    if (r2 >= r3)
        r4 = r2 - r3;
    else
        r4 = r3 - r2;

Non-Predicated Pseudo Code:

         cmpGE r2, r3
         jump_zero P2
    P1:  sub r4 = r2, r3
         jump end
    P2:  sub r4 = r3, r2
    end: ...

Predicated Assembly Code:

         cmp.ge p1,p2 = r2,r3 ;;
    (p1) sub r4 = r2,r3
    (p2) sub r4 = r3,r2

Predication Removes Branches, Enables Parallel Execution
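To make the predicated version concrete, here is a small Python model of the absolute-difference example. It is a sketch, not an emulator: `cmp_ge`, `predicated`, and `abs_diff` are hypothetical helper names, and the model only captures the idea that both `sub` instructions are issued while the qualifying predicate decides which one takes effect.

```python
def cmp_ge(a, b):
    """Model of cmp.ge p1,p2 = a,b: returns the complementary predicate pair."""
    p1 = a >= b
    return p1, not p1


def predicated(pred, op, old):
    """Apply op() only when the qualifying predicate is true; otherwise keep old."""
    return op() if pred else old


def abs_diff(r2, r3):
    p1, p2 = cmp_ge(r2, r3)
    r4 = None
    # Both instructions are issued; there is no branch in the data flow.
    r4 = predicated(p1, lambda: r2 - r3, r4)   # (p1) sub r4 = r2,r3
    r4 = predicated(p2, lambda: r3 - r2, r4)   # (p2) sub r4 = r3,r2
    return r4
```

Because `p1` and `p2` are complementary, exactly one of the two subtractions writes `r4`, which is why the hardware can issue them in the same instruction group.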

Register Stack Engine

The traditional use of a procedure stack in memory for procedure call management demands a large overhead.

The Intel® Itanium™ processor family uses the general register stack for procedure call management, thus eliminating the frequent memory accesses.

Register Stack

▪ GRs 0-31 are global to all procedures
▪ Stacked registers begin at GR32 and are local to each procedure
▪ Each procedure's register stack frame varies from 0 to 96 registers
▪ Only GRs implement a register stack
  • The FRs, PRs, and BRs are global to all procedures
▪ Register Stack Engine (RSE)
  • Upon stack overflow/underflow, registers are saved/restored to/from a backing store transparently

[Figure: general register file with 32 global registers (GR0-31) and 96 stacked registers (GR32-127); the stack frames of PROC A and PROC B overlap.]

Optimizes the Call/Return Mechanism
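The overlapping frames can be modelled with a base pointer into a pool of physical registers. The sketch below is a simplification under stated assumptions: the class `RegisterStack` and its methods are hypothetical names, the physical pool is unbounded (so the spill and fill work of the real RSE never happens), and `alloc` takes only a local and an output count.

```python
class RegisterStack:
    """Toy model of stacked general registers (r32 and up) with overlapping frames."""

    def __init__(self):
        self.phys = {}      # physical slot -> value (unbounded in this model)
        self.base = 0       # physical slot that the current frame sees as r32
        self.size = 0       # number of registers in the current frame
        self.outputs = 0    # how many of them are output registers
        self.saved = []     # frame state of the callers

    def alloc(self, n_local, n_out):
        """Resize the current frame; locals include any incoming arguments."""
        if n_local + n_out > 96:
            raise ValueError("a frame holds at most 96 stacked registers")
        self.size = n_local + n_out
        self.outputs = n_out

    def _slot(self, reg):
        if not 32 <= reg < 32 + self.size:
            raise IndexError(f"r{reg} is outside the current frame")
        return self.base + reg - 32

    def write(self, reg, value):
        self.phys[self._slot(reg)] = value

    def read(self, reg):
        return self.phys.get(self._slot(reg), 0)

    def call(self):
        """Enter a callee: its r32 is the caller's first output register."""
        self.saved.append((self.base, self.size, self.outputs))
        self.base += self.size - self.outputs
        self.size = self.outputs    # the new frame holds only the caller's outputs
        self.outputs = 0

    def ret(self):
        """Return: restore the caller's frame."""
        self.base, self.size, self.outputs = self.saved.pop()
```

For example, a caller with 14 locals and 7 outputs writes an argument to r46; after `call()` the callee reads the same value from r32 without any memory traffic.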

Register Stack Engine at Work:

▪ Call changes frame to contain only the caller's output
▪ Alloc instr. sets the frame region to the desired size
  • Three architecture parameters: local, output, and rotating
▪ Return restores the stack frame of the caller

[Figure: register frames across Call, Alloc, and Ret. PROC A: local (inputs) registers from GR32, outputs GR46-52. After the call, PROC B's frame contains only A's outputs, renumbered from GR32 (virtual). After alloc, PROC B has local (inputs) registers from GR32 and outputs GR48-56. After ret, PROC A's frame with outputs GR46-52 is restored.]

Improved execution speed in OO languages, e.g. Java, C++

Register rotation

▪ Example: 8 general registers rotating, counted loop (br.ctop)

    Register    Before    After br.ctop taken
    gr32        123       29
    gr33        8189      123
    gr34        0         8189
    gr35        99        0
    gr36        abc       99
    gr37        9adc      abc
    gr38        beef      9adc
    gr39        29        beef      (wraparound: gr39 to gr32)
    gr40        4567      4567
    gr41        818       818
    ...

▪ Floating point and predicate registers
  • Always rotate the same set of registers
    – FR 32-127
  • Rotate in the same direction as general registers
    – highest rotates to lowest register number
    – all other values rotate towards larger register numbers
  • Rotate at the same time as general registers (at the modulo-scheduled loop instruction)
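The before/after columns of the rotation example can be reproduced with a one-line list rotation. This is a sketch of the visible effect only; the function name `rotate` is hypothetical, the register contents are kept as strings exactly as they appear in the figure, and moving data is used here purely for illustration (hardware can achieve the same effect by renaming registers instead of copying values).

```python
def rotate(regs):
    """One rotation: each value moves to the next larger register number,
    and the value in the highest rotating register wraps to the lowest."""
    return regs[-1:] + regs[:-1]


# Contents of gr32..gr39 before the loop branch, as in the example.
before = ["123", "8189", "0", "99", "abc", "9adc", "beef", "29"]
after = rotate(before)
```

Rotating eight times brings every value back to its starting register, which is what lets a loop body refer to fixed register names while each iteration sees fresh registers.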

Software Pipelining

[Figure: sequential loop vs. software-pipelined loop, with load, compute, and store stages plotted against time.]

▪ Traditional architectures use loop unrolling
  • Results in code expansion and increased cache misses
▪ Itanium™ Software Pipelining uses rotating registers
  • Allows overlapping execution of multiple loop instances

Itanium™ provides direct support for Software Pipelining

Software pipelined Loop

▪ Consider code:

    for (i = 0; i < n; i++)
        y[i] = a * x[i];

  Pseudo Code:

    loop: ldfd   x[i]
          fmpy.d y[i] = a, x[i]
          stfd   y[i]
          br.ctop loop

▪ Assume
  • Instruction Latencies:
    – ldfd (fp load) 4 cycles*
    – fmpy.d (fp mul) 2 cycles*
    – stfd (fp store) 1 cycle*
    – br.ctop (branch counted loop top) 1 cycle*
  • ldfd, fmpy.d, stfd and br can be issued in the same instruction group (only w/o RAW or WAW dependencies)

*Cycle counts for demonstration purposes only.

Software pipelined loop

For n = 8:

    Cycle    Load       Multiply             Store       Phase
    1        ld x[1]                                     Prolog
    2        ld x[2]                                     Prolog
    3        ld x[3]                                     Prolog
    4        ld x[4]                                     Prolog
    5        ld x[5]    fmpy y[1]=a,x[1]                 Prolog
    6        ld x[6]    fmpy y[2]=a,x[2]                 Prolog
    7        ld x[7]    fmpy y[3]=a,x[3]     stf y[1]    Kernel
    8        ld x[8]    fmpy y[4]=a,x[4]     stf y[2]    Kernel
    9                   fmpy y[5]=a,x[5]     stf y[3]    Epilog
    10                  fmpy y[6]=a,x[6]     stf y[4]    Epilog
    11                  fmpy y[7]=a,x[7]     stf y[5]    Epilog
    12                  fmpy y[8]=a,x[8]     stf y[6]    Epilog
    13                                       stf y[7]    Epilog
    14                                       stf y[8]    Epilog

* Cycle counts for demonstration purposes only. In this example, one iteration takes 7 cycles
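The schedule for n = 8 follows mechanically from the example latencies: a new iteration starts every cycle, the multiply runs 4 cycles after the load, and the store 2 cycles after the multiply. A minimal sketch that regenerates the table (the names `pipelined_schedule` and `STAGES` are hypothetical, and the offsets are the demonstration values from the slides, not real hardware latencies):

```python
def pipelined_schedule(n, stage_offsets):
    """Return {cycle: [ops]} when a new loop iteration starts every cycle.

    stage_offsets lists (name, offset) pairs; iteration i runs the stage
    with the given offset in cycle i + offset.
    """
    last_cycle = n + max(offset for _, offset in stage_offsets)
    schedule = {}
    for cycle in range(1, last_cycle + 1):
        ops = []
        for name, offset in stage_offsets:
            i = cycle - offset          # iteration handled by this stage now
            if 1 <= i <= n:
                ops.append(f"{name}[{i}]")
        schedule[cycle] = ops
    return schedule


# Load in cycle i, multiply 4 cycles later, store 2 cycles after that.
STAGES = [("ld x", 0), ("fmpy y", 4), ("stf y", 6)]
sched = pipelined_schedule(8, STAGES)
```

With eight iterations the pipelined loop finishes in 14 cycles, compared with 8 x 7 = 56 cycles if each 7-cycle iteration ran to completion before the next one started.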

Software pipelined loop

Loop body:

    (p16) ldf  f32,[r32],8
    (p20) fmpy f36=f6,f36
    (p22) stf  [r33],f38,8
          br.ctop

Remember our example latencies:

    ldf     4 cycles
    fmpy    2 cycles
    stf     1 cycle
    br      1 cycle

Rotating predicate registers p16-p22 on each pass through the loop (data rotates through floating point registers f32-f38 in the same way):

    Pass    Operations                                            p16 p17 p18 p19 p20 p21 p22
    1       ldf f32, x[1]                                          1   0   0   0   0   0   0
    2       ldf f32, x[2]                                          1   1   0   0   0   0   0
    3       ldf f32, x[3]                                          1   1   1   0   0   0   0
    4       ldf f32, x[4]                                          1   1   1   1   0   0   0
    5       ldf f32, x[5]   fmpy y[1]=a, x[1]                      1   1   1   1   1   0   0
    6       ldf f32, x[6]   fmpy y[2]=a, x[2]                      1   1   1   1   1   1   0
    7       ldf f32, x[7]   fmpy y[3]=a, x[3]   + stf y[1]         1   1   1   1   1   1   1
    8       ldf f32, x[8]   fmpy y[4]=a, x[4]   + stf y[2]         1   1   1   1   1   1   1
    9                       fmpy y[5]=a, x[5]   + stf y[3]         0   1   1   1   1   1   1
    10                      fmpy y[6]=a, x[6]   + stf y[4]         0   0   1   1   1   1   1
    11                      fmpy y[7]=a, x[7]   + stf y[5]         0   0   0   1   1   1   1
    12                      fmpy y[8]=a, x[8]   + stf y[6]         0   0   0   0   1   1   1
    13                                            stf y[7]         0   0   0   0   0   1   1
    14                                            stf y[8]         0   0   0   0   0   0   1

Software pipelined loop

▪ Actual code example:

    // Initialization
    mov pr.rot=0                    // Clear all rotating predicates
    cmp.eq p16,p0=r0,r0             // Set p16=1
    mov ar.lc=7                     // Set loop counter to n-1
    mov ar.ec=7                     // Set epilog counter to # of stages
    ...
    loop:
    { .mfi
      (p16) ldfd f32=[r32],8        // Stage 1: Load x
      (p20) fmpy.d f36=f6,f36       // Stage 5: y=a*x
      nop.i 0
    }
    { .mfb
      (p22) stfd [r33]=f38,8        // Stage 7: Store y
      nop.f 0
      br.ctop.sptk.few loop         // Branch back
    }
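How the loop counter (ar.lc), the epilog counter (ar.ec), and the rotating predicates interact can be sketched as a small state machine. This is a simplified model under stated assumptions: `run_ctop_loop` is a hypothetical name, only the seven predicates p16-p22 used by this loop are tracked (not the full rotating predicate set), and the corner case of entering the loop with ec = 0 is not modelled faithfully.

```python
def run_ctop_loop(lc, ec, stages=7):
    """Return the predicates (p16, p17, ...) seen on each pass through the loop body."""
    preds = [1] + [0] * (stages - 1)      # cmp.eq p16,p0=r0,r0 sets p16 = 1
    history = []
    while True:
        history.append(tuple(preds))      # the loop body runs under these predicates
        # br.ctop: pick the value rotated into p16 and decide whether to branch back.
        if lc > 0:                        # kernel: start another iteration
            lc -= 1
            new_p16, taken = 1, True
        elif ec > 1:                      # epilog: drain the pipeline
            ec -= 1
            new_p16, taken = 0, True
        else:                             # last epilog pass: fall through
            ec = max(ec - 1, 0)
            new_p16, taken = 0, False
        preds = [new_p16] + preds[:-1]    # rotation: p16 -> p17, p17 -> p18, ...
        if not taken:
            return history


history = run_ctop_loop(lc=7, ec=7)
```

With ar.lc = 7 and ar.ec = 7 the model makes 14 passes, and the recorded predicates match the predicate table for n = 8: eight passes with p16 set, then six passes that only drain the remaining stages.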

Control & Data Speculation

    instr 1                      instr 1
    instr 2                      instr 2
    ...                          ...
    br         <- Barrier        st [?]     <- Barrier
    ld r1=                       ld r1=
    use =r1                      use =r1

▪ Control Speculation moves loads above branches / calls
▪ Data Speculation moves loads above possibly conflicting stores

Speculation reduces the impact of memory latency
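Data speculation, as summarized above, hoists a load above a store that might write the same address and checks later whether the early value is still valid. On Itanium this is done with an advanced load (ld.a), a check load (ld.c), and a table of advanced load addresses (the ALAT). The sketch below models that idea only: the class `Alat` and its method names are hypothetical, memory is a plain dictionary, and table capacity, partial address overlap, and access sizes are ignored.

```python
class Alat:
    """Toy model of an advanced load address table."""

    def __init__(self, memory):
        self.memory = memory      # dict: address -> value
        self.entries = {}         # target register -> address of its advanced load

    def ld_a(self, reg, addr):
        """Advanced load: load early and remember which address was used."""
        self.entries[reg] = addr
        return self.memory[addr]

    def st(self, addr, value):
        """Store: write memory and drop advanced-load entries for that address."""
        self.memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def ld_c(self, reg, addr, speculative_value):
        """Check load: keep the early value if its entry survived, else reload."""
        if self.entries.get(reg) == addr:
            return speculative_value
        return self.memory[addr]
```

If the intervening store hits a different address, the check costs nothing and the early value is used; if it hits the same address, the check load fetches the updated value.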

Control Speculation

    Traditional Architecture        Itanium®
                                    ld.s r1=       <- Detect exception
    instr 1                         instr 1
    instr 2                         instr 2           (propagate)
    ...                             ...
    br         <- Barrier           br
    ld r1=                          chk.s r1       <- Deliver exception
    use =r1                         use =r1

▪ Control Speculation moves loads above branches
  • Detected exception indicated using NaT bit / NaTVal
▪ Check raises detected exceptions

0 clk checks: dependent uses issued in parallel with check
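The deferred-exception mechanism can be sketched with a sentinel standing in for a register whose NaT bit is set. The names `NAT`, `ld_s`, `chk_s`, and `add` are hypothetical, memory is a dictionary in which a missing key plays the role of a faulting address, and the recovery code is passed in as a callable.

```python
NAT = object()   # stands in for a register whose NaT bit is set


def ld_s(memory, addr):
    """Speculative load: on a fault, defer the exception by returning NaT."""
    try:
        return memory[addr]
    except KeyError:
        return NAT


def add(a, b):
    """Dependent computation: a NaT operand propagates to the result."""
    return NAT if a is NAT or b is NAT else a + b


def chk_s(value, recovery):
    """Check: run the recovery code only if the speculative value is NaT."""
    return recovery() if value is NAT else value
```

In the common case the load succeeds, the check passes, and the work done on the speculative value is kept; only a deferred fault sends execution to the recovery path.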

Barrier bridged, memory latency masked

Data Speculation

    Traditional Arch         Itanium™ Processor         Itanium™ Processor
                             (data speculation)         (speculative use)
                             ld.a r1=                   ld.a r1=
    instr 1                  instr 1                    instr 1
    instr 2                  instr 2                    use =r1
    ...                      ...                        instr 2
    st [?]    <- Barrier     st [?]                     ...
    ld r1=                   ld.c r1                    st [?]
    use =r1                  use =r1                    chk.a r1

    Recovery code (reached from chk.a):
        ld r1=
        use =r1
        br

▪ Data Speculation moves loads above possibly conflicting stores
  • Keeps track of load addresses used in advance (ALAT)
▪ Advanced-loaded data can be used speculatively

Data and Control Speculations can be combined

Itanium® HW Data Types

64-bit Integer

2x32-bit SIMD Integer

4x16-bit SIMD Integer

8x8-bit SIMD Integer

64-bit DP F.P.

2x32-bit SP F.P.

SIMD - Integer

▪ Exploits data parallelism with SIMD (Single Instruction Multiple Data)
▪ Performance boost for audio, video, imaging etc. functions
▪ GRs treated as 8x8, 4x16, or 2x32 bit elements
▪ Several instruction types
  • Addition and subtraction, multiply
  • Pack/Unpack
  • Left shift, signed/unsigned right shift
▪ Compatible with Intel MMX Technology

64 bits, as 8x8, 4x16, or 2x32 bit elements:

      a3       a2       a1       a0
    + b3       b2       b1       b0
    = a3+b3    a2+b2    a1+b1    a0+b0

Available through Compiler Intrinsics
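The lane-wise addition in the figure can be modelled on a 64-bit integer. This sketch shows the wraparound (modular) variant only; the function name `padd` and its `lane_bits` parameter are illustrative choices, and saturating forms are not modelled.

```python
def padd(a, b, lane_bits=16):
    """Lane-wise wraparound add of two 64-bit values split into 64 / lane_bits lanes."""
    if lane_bits not in (8, 16, 32):
        raise ValueError("lanes are 8, 16, or 32 bits wide")
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 64, lane_bits):
        lane = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane & mask) << shift   # drop the carry so lanes stay independent
    return result
```

For instance, `padd(0x0001000200030004, 0x0010002000300040)` adds four 16-bit lanes at once, and an overflow in one lane does not carry into its neighbour.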

Floating-Point Architecture

▪ 128 Floating Point registers (82 bit)
  • Single, double, double-extended data types
▪ Full IEEE.754 compliance
▪ Arithmetic
  • FMA – Multiply-Add instruction f = a * b + c
  • SW Divide / Sqrt, provide high throughput, take advantage of wide FP machine
  • Max, Min instructions for Floating-point
▪ Data transfer
  • load, store, GR ⇔ FR conversion; load pair to double data

2 independent FP Units
Up to 4 DP FP operations per clock
Up to 4 DP FP operands loaded per clock
Excellent /3D Apps Performance

Floating Point Features

▪ Native 82-bit hardware provides support for multiple numeric models
▪ 2 pipelined FMACs deliver 4 EP / DP FLOPs/cycle
▪ Performance for security, efficient use of hardware: Integer mul-add, s/w divide
▪ Balanced with plenty of operand bandwidth from registers / memory

[Figure: a 128-entry 82-bit register file (even/odd banks) supplies 6 x 82-bit operands and receives 2 x 82-bit results, with 2 stores/clk. The L2 cache delivers 4 DP ops/clk (2 x ldf-pair); the L3 cache (up to 9MB) delivers 2 DP ops/clk.]

SIMD - Floating-Point

▪ Exploits data parallelism with SIMD (Single Instruction Multiple Data)
▪ Up to 2x performance boost
▪ FP Registers treated as two 32 bit single precision elements
  • Full IEEE.754 compliance
  • Availability of fast divide (non IEEE)
▪ Compatible with Streaming SIMD Extensions (SSE)

64 bits, as 2x32 bit SP FP elements:

      a1       a0
    + b1       b0
    = a1+b1    a0+b0

Enables World Class 3D Graphics Performance

Some Floating Point Latencies

    Operation                  Latency
    FP Load (L2 Cache hit)     6
    FMAC, FMISC                4
    FP -> Int (getf)           5
    Int -> FP (setf)           6
    Fcmp to branch             2
    Fcmp to qual pred          2

L1D Cache (Integer-only)

• High Performance 16GB/s, 2 ld AND 2 st ports
  – Write Through – all stores are pushed to the L2
  – FP loads force miss, FP stores invalidate
  – True dual-ported read access – no load conflicts
  – pseudo-dual store port write access
  – 2 store coalescing buffers/port hold data until L1D update
  – Store to load forwarding

One clock data cache provides a significant performance benefit

L2 and L3 Cache

– L2 256KB, 32 GB/s, 5-7 clk
– Data array is banked - 16 banks of 16KB each
– Non-blocking / out-of-order
– L2 queue (32 entries) - holds all in-flight load/stores
– out-of-order service - smooths over load/store/bank conflicts, fills
– Can issue/retire 4 stores/loads per clock
– Can bypass L2 queue (5,7,9 clk bypass) if
  • no address or bank conflicts in same issue group
  • no prior ops in L2 queue want access to L2 data arrays
– Up to 9MB L3, 32 GB/s, 12-13 clk cache on die!!
– Single ported – full cache line transfers

TLBs

▪ 2-level TLB hierarchy
  • DTC/ITC (32/32 entry, fully associative, 0.5 clk)
    – Small fast translation caches tied to L1D/L1I
    – Key to achieving very fast 1-clk L1D, L1I cache accesses
  • DTLB/ITLB (128/128 entry, fully associative, 1 clk)
    – All architected page sizes (4K to 4GB)
    – Supports up to 64/64 ITR/DTRs
    – TLB miss starts hardware page walker

Small fast TLBs enable low latency caches

Itanium® 2 Caches

                             L1I            L1D            L2                L3
    Size                     16K            16K            256K              1.5/3/6M/9M on die
    Line Size                64B            64B            128B              128B
    Ways                     4              4              8                 12
    Replacement              LRU            NRU            NRU               NRU
    Latency (load to use)    I-Fetch: 1     INT: 1         INT: 5, FP: 6     12
    Write Policy             -              WT (RA)        WB (WA + RA)      WB (WA)
    Bandwidth                R: 32 GB/s     R: 16 GB/s     R: 32 GB/s        R: 32 GB/s
                                            W: 16 GB/s     W: 32 GB/s        W: 32 GB/s

All caches are pipelined, and non-blocking

Itanium® 2 Pipelines

    FPU:   FP1  FP2  FP3  FP4  WB
    Core:  IPG  ROT  EXP  REN  REG  EXE  DET  WB
    L2:    L2N  L2I  L2A  L2M  L2D  L2C  L2W

    IPG        IP Generate, L1I Cache and TLB access
    ROT        Instruction Rotate and Buffer
    EXP        Expand, Port Assignment and Routing
    REN        Integer and FP Register Rename
    REG        Integer and FP Register read
    EXE        ALU Execute(6), L1D Cache and TLB access + L2 Cache Tag Access
    DET        Exception Detect, Branch Correction
    WB         Writeback, Integer Register update
    FP1-WB     FP FMAC + reg write
    L2N-L2I    L2 Queue Nominate/Issue
    L2A-W      L2 Access, Rotate, Correct, Write

Short 8-stage in-order main pipeline
– In-order issue, out-of-order completion
– Reduced branch misprediction penalties

Pipelines are designed for very low latency

Functional Units

Itanium 2 Processor
▪ Integer
  • 6 ALUs
  • 6 Multi-Media
▪ Memory Ports
  • 2 Load, 4 FP Load
  • 2 Store
▪ Floating Point
  • 2 MACs
▪ Branch Ports
  • 3 Branches

[Figure legend: Integer, FP - 64/82bit, FP - 32bit (SIMDFP), Multimedia, Load/Store, Branch.]

Intel, Itanium, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.

Dispersal Matrix – Itanium® 2 Processor

Rows give the first bundle template; the marks run across the second bundle templates MII MLI MMI MFI MMF MIB MBB BBB MMB MFB.

    MII     * * * * * X X * X
    MLI     * * * X * X * X * X
    MMI     * * * * * * * X * *
    MFI     * X * X * X X X * X
    MMF     * * * * * * * X * *
    MIB§    * X * X * X X * X
    MBB
    BBB
    MMB§    * * * * * * * * *
    MFB§    X X * X * X X * X

    §  Hint in first bundle
    *  Possible Itanium 2 processor full issue
    X  Possible Itanium 2 and Itanium® processor full issue

Architectural Improvements Allow Faster Execution Through More Issued Instructions/Cycle


TLB Lookup

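[Figure: TLB lookup. The 64-bit virtual address (bits 63-0) consists of a 3-bit virtual region number (vrn) in the top bits, a virtual page number (vpn), and an offset. The vrn selects one of the region registers rr0-rr7, which supplies a 24-bit region ID. Region ID and vpn are used to search the Translation Lookaside Buffer (TLB) and, via a hash, the Virtual Hash Page Table (VHPT). A matching entry holds region ID, key, virtual page number, rights, and physical page number (ppn). The key is searched in the protection key registers (pkr0, pkr1, ...), which hold key and rights. The ppn combined with the offset forms the 64-bit physical address.]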
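The first steps of the lookup, splitting the virtual address and selecting a region register, can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the TLB is a plain dictionary keyed by (region ID, vpn), the 16 KB page size (`page_bits=14`) is just one possible choice among the architected sizes, and the protection key check is left out.

```python
VRN_SHIFT = 61   # the virtual region number occupies bits 63:61


def split_virtual_address(va, page_bits=14):
    """Split a 64-bit virtual address into (vrn, vpn, offset)."""
    if not 0 <= va < (1 << 64):
        raise ValueError("virtual address must fit in 64 bits")
    vrn = va >> VRN_SHIFT
    vpn = (va & ((1 << VRN_SHIFT) - 1)) >> page_bits
    offset = va & ((1 << page_bits) - 1)
    return vrn, vpn, offset


def translate(va, region_regs, tlb, page_bits=14):
    """Look up (region ID, vpn) in a dictionary TLB; KeyError models a TLB miss."""
    vrn, vpn, offset = split_virtual_address(va, page_bits)
    region_id = region_regs[vrn]          # one of rr0..rr7
    ppn = tlb[(region_id, vpn)]
    return (ppn << page_bits) | offset
```

Because the region ID is part of the lookup key, two address spaces with different region IDs can share the TLB without flushing it on a switch.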

VHPT Walk

[Figure: VHPT walk. The 64-bit virtual address is split into virtual region number (vrn, 3 bits) and virtual page number (vpn); the vrn selects a region register (rr0-rr7) holding a 24-bit region ID. Region ID and vpn are hashed to search the Translation Lookaside Buffer (TLB) and Virtual Hash Page Table (VHPT) for an entry with region ID, key, virtual page number, rights, and physical page number (ppn). The key is checked against the protection key registers (pkr0, pkr1, ...); ppn and offset form the physical address.]

Performance Monitoring Unit

▪ Performance Monitoring
  • Capabilities are implementation specific
  • Minimum of 4 PMC/PMD guaranteed
  • Allows user-level counter enabling/disabling
  • Counters can be enabled, disabled, read or written
  • Allows generic counter overflow interrupt handling and context of performance counters
▪ Interval Timer Counter (ITC)
  • Runs at the frequency of the CPU
  • Can be read by Applications
  • Operating Systems can secure it

Itanium® 2 Block Diagram

[Figure: block diagram. Front end: L1 instruction cache and fetch/pre-fetch engine with ITLB (ECC protected), branch prediction, IA-32 decode and control, and an instruction queue of 8 bundles. Issue: 11 issue ports (B B B M M M M I I F F) and the register stack engine / re-mapping. Registers: branch & predicate registers, 128 integer registers, 128 FP registers. Execution: branch units, integer and MM units, quad-port L1 data cache with ALAT and DTLB, and floating point units, together with scoreboard, predicate, NaT, and exception logic. Memory: quad-port L2 cache, L3 cache, and bus controller, with ECC on the caches.]

IA-32 App Support on Itanium®

[Figure: "Today": IA-32 code runs on IA-32 H/W and IPF code on native H/W. "w/ IA-32 EL": IA-32 code runs through IA-32 EL on native H/W (or on IA-32 H/W), and IPF code on native H/W.]

IA-32 EL enables increased utilization of key Itanium® architecture features

• IA-32 applications are supported on all Itanium® processor offerings
  – Before: IA-32 hardware-based approach
  – Since 2004: IA-32 Execution Layer enhances 32-bit application support
• IA-32 Execution Layer – What is it?
  – A software binary that ships with the OS; initiated by the OS when an IA-32 app is launched
  – IA-32 EL then translates the IA-32 app into a native Itanium®-based app
• Benefits of IA-32 EL
  – Enables the use of new IA-32 instructions (e.g. SSE2)
  – Can accelerate IA-32 applications on Itanium®-based systems
• Availability
  – Available for Windows today (SP2 or patch from www.microsoft.com)
  – Also available for Red Hat; for SUSE in a major release
  – Available for SGI too (ProPack 3.0)

Some notes on Intel® Itanium® Architecture Application Programming Model

Introduction to Basic Itanium® Processor Instructions

▪ High-level introduction to basic instruction set
▪ Examples, not comprehensive
▪ Important to be able to read, understand and to evaluate quality of compiler generated code
▪ Helpful to understand debugger and performance analyzer displays
▪ No requirement to do assembly programming

In fact, hand coding for Itanium processors is discouraged!

Basic Instruction Format

    [(qp)] mnemonic[.comp] dest = srcs

where:
    qp       = qualifying predicate.
    mnemonic = unique name identifying the instruction.
    comp     = one or more completers.
    dest     = destination operand(s).
    srcs     = source operand(s).

Example:

    (p10) ld4.s r31 = [r3]

Instruction Groups

▪ Instruction group dependency violations

                    PERMITTED                     NOT PERMITTED§
                    within an instr group         within an instr group
    Register        Write-After-Read (WAR)        Read-After-Write (RAW)§§
                                                  Write-After-Write (WAW)§§
    Memory/ALAT     RAW, WAW, WAR

§ Not permitted is used in the context that it will result in undefined behavior.
§§ See notes for RAW and WAW exceptions to these rules.

    mov r31 = ip            mov r31 = ip
    ld r2 = [r3]            ld r2 = [r3];;      <- Instr group boundary
    st [r4] = r2            st [r4] = r2
    add r5 = r6, r7         add r5 = r6, r7
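The register rules in the table can be checked mechanically. The sketch below scans one instruction group in program order and reports register RAW and WAW conflicts; `group_violations` is a hypothetical helper, instructions are reduced to sets of destination and source register names, and the architectural exceptions mentioned in the footnote are not modelled.

```python
def group_violations(group):
    """Find register RAW/WAW conflicts inside one instruction group.

    group is a list of (dest_regs, src_regs) pairs in program order.
    Returns a list of (kind, register, instruction_index) tuples.
    """
    written = set()
    violations = []
    for index, (dests, srcs) in enumerate(group):
        for reg in sorted(srcs):
            if reg in written:              # reads a register written earlier in the group
                violations.append(("RAW", reg, index))
        for reg in sorted(dests):
            if reg in written:              # writes a register written earlier in the group
                violations.append(("WAW", reg, index))
        written.update(dests)
    return violations


# The four-instruction example: mov, ld, st, add.
code = [
    ({"r31"}, {"ip"}),          # mov r31 = ip
    ({"r2"}, {"r3"}),           # ld  r2 = [r3]
    (set(), {"r4", "r2"}),      # st  [r4] = r2
    ({"r5"}, {"r6", "r7"}),     # add r5 = r6, r7
]
```

Treating all four instructions as one group flags the store's read of r2, while splitting the sequence after the load, as the stop (`;;`) does, leaves both groups clean.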

Assembly Language Features

▪ Bundling: One example

    { .mii                  <- explicit bundling directive with template selection
      mov r31 = ip
    }
    { .mmi
      ld4 r2 = [r3];;
      st4 [r4] = r2
      add r5 = r6, r7
    }

Resulting bundles:

                slot 2             slot 1           slot 0          tmpl
    Bundle 1    nop.i 0            mov r31 = ip     nop.m 0         00 (mii)
    Bundle 2    add r5 = r6, r7    st [r4] = r2     ld r2 = [r3]    0A (m_mi)

Assembly Language Features

▪ Same code sequence but different bundling

    { .mfi                  <- explicit bundling directive with template selection
      mov r31 = ip
    }
    { .mmi
      ld4 r2 = [r3];;
      st4 [r4] = r2
      add r5 = r6, r7
    }

Resulting bundles (unused slots are filled with nops):

                slot 2             slot 1           slot 0          tmpl
    Bundle 1    mov r31 = ip       nop.f 0          nop.m 0         0C (mfi)
    Bundle 2    add r5 = r6, r7    st [r4] = r2     ld r2 = [r3]    0A (m_mi)

Assembly Language Features

▪ The very same action can be implemented by multiple, different bundle combinations in general
▪ This is key to Itanium® architecture performance.
▪ It is the job of the compiler/programmer to find the optimal combinations of bundles for maximum performance!

Register Symbol Names

▪ General Registers
    fixed regs              r0-r31
    stacked regs            r32-r127
    input regs              in0-in95
    local regs              loc0-loc95
    output regs             out0-out95
    global pointer (r1)     gp
    return value regs       ret0-ret3
    stack pointer (r12)     sp
▪ Floating-Point Registers
    floating-point regs     f0-f127
    argument register       farg0-farg7
    return value regs       fret0-fret7
▪ Predicate Registers
    predicate regs          p0-p63
    all predicates          pr
    rotating regs           pr.rot
▪ Branch Registers
    branch regs             b0-b7
    return pointer (b0)     rp
▪ Application Registers
    app regs                ar0-ar127
    kernel regs             ar.k0-ar.k7
    RSE control reg         ar.rsc
    backing store ptr       ar.bsp
    BSP memory ptr          ar.bspstore
    RSE NaT collection      ar.rnat
    user NaT collection     ar.unat
    FP status reg           ar.fpsr
    prev frame state        ar.pfs
    loop counter            ar.lc
    epilog counter          ar.ec
▪ Others
    user mask               psr.um

Italics on the original slide mark alias or alternate names, for example gp for r1, sp for r12, and rp for b0.

Compare Instructions Set Predicates

Mnemonic         Operation
cmp/cmp4         GR compare (64-bit/32-bit)
tbit/tnat        Test GR bit / test NaT bit
fcmp             FR compare
fclass           FR class
frcpa/frsqrta    FP reciprocal approx / FP reciprocal square root approx

Code example (test if bit is zero; p1 and p2 are the target predicate registers):

    tbit.z p1,p2 = r3, 6
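The effect of tbit.z on its two target predicates can be sketched in C. This is a simplified model under stated assumptions: it ignores the NaT bit of the source register and the qualifying predicate, and the names `pred_pair` and `tbit_z` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* The two target predicates written by one compare-type instruction. */
typedef struct { bool p1, p2; } pred_pair;

/* Model of "tbit.z p1,p2 = gr, pos": p1 is set when the selected bit is zero,
   p2 receives the complement. NaT handling is deliberately left out. */
static pred_pair tbit_z(uint64_t gr, unsigned pos)
{
    bool bit = ((gr >> (pos & 63)) & 1) != 0;
    pred_pair r = { !bit, bit };
    return r;
}
```

Calling `tbit_z(r3, 6)` with bit 6 of `r3` clear yields `p1` true and `p2` false, so code guarded by p1 runs and code guarded by p2 is skipped.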

In the figure, bit 6 of GR3 (bit positions 63..0) is 0, so the "== 0?" test is true: PR1 = 1 and PR2 = 0.

Memory Access Instructions

[Figure: the Itanium® processor exchanges data with memory through load, store, and semaphore operations.]

• Memory addressing (64-bit address)
  • The address is specified by the contents of a general register.
  • Load/store can also specify a base-address register update.
• Byte ordering
  • Big-endian or little-endian data access is determined by the UM.be bit, a user-readable bit in the Processor Status Register (PSR).
• Data elements should be stored (aligned) on natural boundaries
  • Misaligned references can cause a hardware fault.
  • 10-byte floats are stored on a 16-byte boundary.
  • Instruction bundles are 16 bytes long and always aligned on 16-byte boundaries.
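The alignment rules above can be checked with a few lines of C. This is a small sketch, assuming that the natural boundary of an element is its size rounded up to the next power of two (which gives 16 for a 10-byte float); the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Natural boundary of an element: its size rounded up to a power of two
   (1, 2, 4, 8 stay as they are; a 10-byte float gets a 16-byte boundary). */
static size_t natural_boundary(size_t size)
{
    size_t b = 1;
    while (b < size)
        b <<= 1;
    return b;
}

/* True if addr is a multiple of boundary; boundary must be a power of two. */
static bool is_aligned(uint64_t addr, size_t boundary)
{
    return (addr & (uint64_t)(boundary - 1)) == 0;
}
```

A bundle address, for example, always satisfies `is_aligned(addr, 16)`.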

Floating-Point Data Move

Type                Operation          Mnemonic
FP memory access    Load to FR         ldfs, ldf8, ldfd, ldfe, ldf.fill
                    Load pair to FR    ldfps, ldfp8, ldfpd
                    Store from FR      stfs, stf8, stfd, stfe, stf.spill
FR to/from GR       GR to FR           setf.s, setf.d, setf.exp, setf.sig
                    FR to GR           getf.s, getf.d, getf.exp, getf.sig

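Note: This is a subset of FP instructions.

Code example:

    setf.exp f2 = r31

Transfers GR31 to the exponent and sign of FR2: bits 16..0 of GR31 go to the exponent field of FR2 (bits 80..64), bit 17 goes to the sign (bit 81), and the significand (bits 63..0) is set to 1000...000.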
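The field mapping of setf.exp can be written out in C. This is a minimal sketch, assuming the 82-bit register format shown in the figure (sign in bit 81, 17-bit exponent in bits 80..64, 64-bit significand); the struct `fr_t` and the function `setf_exp` are hypothetical names that model the register as three separate fields.

```c
#include <stdint.h>

/* Illustrative model of an 82-bit floating-point register. */
typedef struct {
    unsigned sign;         /* bit 81 */
    uint32_t exponent;     /* bits 80..64 (17 bits) */
    uint64_t significand;  /* bits 63..0 */
} fr_t;

/* Model of "setf.exp fr = gr": GR bits 16..0 become the exponent,
   GR bit 17 becomes the sign, and the significand is set to 1000...000. */
static fr_t setf_exp(uint64_t gr)
{
    fr_t f;
    f.exponent = (uint32_t)(gr & 0x1FFFF);
    f.sign = (unsigned)((gr >> 17) & 1);
    f.significand = UINT64_C(0x8000000000000000);
    return f;
}
```

All GR bits above bit 17 are ignored by this model.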

Data Speculation

• Code example: using data speculation

Without advanced load:

    st4 [r2] = r31
    ld4 r4 = [r3] ;;
    add r6 = r4,r5 ;;
    sxt4 r7 = r6

Advanced load with ld.c:

    ld4.a r4 = [r3] ;;
    ...
    st4 [r2] = r31
    ld4.c.clr r4 = [r3]
    add r6 = r4,r5 ;;
    sxt4 r7 = r6

Advanced load with chk.a:

    ld4.a r4 = [r3] ;;
    ...
    add r6 = r4,r5 ;;
    ...
    st4 [r2] = r31
    chk.a.clr r4, recv
    back:
    sxt4 r7 = r6

    recv:
    ld4 r4 = [r3] ;;
    add r6 = r4, r5
    (p0) br.cond.sptk back

Use ld.c if only the load is speculated, chk.a if the load and its uses are speculated.
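The ld.c variant can be modeled in C to show why the result stays correct when the store aliases the advanced load. This is a deliberately simplified sketch: it assumes a one-entry table keyed by address only, whereas the real ALAT is indexed by target register and compares address ranges. All names (`alat_t`, `ld_a`, `st`, `ld_c_clr`, `with_ld_c`) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One-entry model of the advanced load address table (ALAT). */
typedef struct { bool valid; const int32_t *addr; } alat_t;

/* ld4.a: load early and record the address in the table. */
static int32_t ld_a(alat_t *alat, const int32_t *addr)
{
    alat->valid = true;
    alat->addr = addr;
    return *addr;
}

/* st4: store, and invalidate the table entry if the address matches. */
static void st(alat_t *alat, int32_t *addr, int32_t value)
{
    *addr = value;
    if (alat->valid && alat->addr == addr)
        alat->valid = false;
}

/* ld4.c.clr: keep the speculative value if the entry survived,
   otherwise reload from memory; the entry is cleared either way. */
static int32_t ld_c_clr(alat_t *alat, const int32_t *addr, int32_t speculative)
{
    int32_t v = (alat->valid && alat->addr == addr) ? speculative : *addr;
    alat->valid = false;
    return v;
}

/* The ld.c sequence from the slide; p3 and p2 stand in for [r3] and [r2]. */
static int64_t with_ld_c(const int32_t *p3, int32_t *p2, int32_t r31, int32_t r5)
{
    alat_t alat = { false, NULL };
    int32_t r4 = ld_a(&alat, p3);    /* ld4.a r4 = [r3]     */
    st(&alat, p2, r31);              /* st4 [r2] = r31      */
    r4 = ld_c_clr(&alat, p3, r4);    /* ld4.c.clr r4 = [r3] */
    int64_t r6 = (int64_t)r4 + r5;   /* add r6 = r4, r5     */
    return (int64_t)(int32_t)r6;     /* sxt4 r7 = r6        */
}
```

When `p2` and `p3` point to different locations the speculative value is used unchanged; when they point to the same location the check load picks up the stored value.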

Software Pipelined Loops

    int csum(double a[], double b[], double c[], double x, int max)
    for (i=0; i

Intel® Itanium® processor optimized assembler:

    .b1_2:
    { .mmi
        (p16) ldfd f32=[r34],8
        (p16) ldfd f37=[r33],8
        nop.i 0 ;;
    }
    { .mfb
        (p24) stfd [r32]=f46,8
        (p20) fma.d f42=f8,f36,f41
        br.ctop.sptk .b1_2 ;;
    }

Itanium® 2 processor optimized assembler:

    .b1_2:
    { .mfi
        (p16) ldfd f32=[r34],8
        (p22) fma.d f46=f8,f38,f45
        nop.i 0
    }
    { .mmb
        (p16) ldfd f39=[r33],8
        (p26) stfd [r32]=f50,8
        br.ctop.sptk .b1_2 ;;
    }

L2 latency: Itanium: 8 cycles, Itanium 2: 6 cycles.

The Itanium® 2 processor can do 4 FP load/store operations per cycle, the Itanium® processor only 2 (double performance for large max).

Cache Latency in Software Pipelined Loops

Integer (L1 cache) and floating point (L2 cache) versions of the loop:

    for (i=0; i

Itanium® 2 processor optimized assembler, floating point (L2 cache):

    .b1_3:
    { .mfb
        (p16) ldfs f32=[r32],4
        (p22) fma.s f39=f43,f1,f38
        br.ctop.sptk .b1_3 ;;
    }

Itanium® 2 processor optimized assembler, integer (L1 cache):

    .b1_2:
    { .mib
        (p16) ld4 r32=[r2],4
        (p17) add r34=r35,r33
        br.ctop.sptk .b1_2 ;;
    }
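The stage predicates in these kernels follow from the load latency. A short C sketch of that arithmetic: the number of stages between a load guarded by p16 and its first use is the latency divided by the cycles per kernel iteration, rounded up. The function names are hypothetical, the 8-cycle and 6-cycle L2 figures come from the previous slide, and the 1-cycle L1 latency is an assumption that matches the (p17) predicate in the integer kernel.

```c
/* Stages between a load and its first use:
   ceil(latency / ii), where ii is the number of cycles per kernel iteration. */
static int stages_between(int latency_cycles, int ii_cycles)
{
    return (latency_cycles + ii_cycles - 1) / ii_cycles;
}

/* Stage predicate of the consumer when the load is guarded by p16. */
static int consumer_predicate(int latency_cycles, int ii_cycles)
{
    return 16 + stages_between(latency_cycles, ii_cycles);
}
```

With one cycle per iteration, a latency of 6 gives p22 and a latency of 1 gives p17. For the two-cycle Itanium kernel of the previous slide, a latency of 8 gives p20.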


Summary

• EPIC requires a sophisticated compiler
  • No OOO (out-of-order execution) help as for IA-32 (Pentium® 4, Xeon™)
• SWP (software pipelining) is the most relevant feature for performance
  • Think of it as running loop iterations in parallel
• Some understanding of Itanium® assembler can help to understand the features used in your code
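The idea of running loop iterations in parallel can be sketched in C. This is a simplified model of a two-stage software-pipelined sum, assuming a loop controlled like br.ctop with a loop counter and an epilog counter. Register rotation is imitated by copying values, and the function `pipelined_sum` and its variable names are hypothetical.

```c
#include <stdbool.h>

/* Sum n ints with a two-stage software pipeline:
   stage 1 (p16) loads a[i]; stage 2 (p17) adds the value loaded one iteration earlier.
   Both stages execute in the same pass of the kernel, on different iterations. */
static long pipelined_sum(const int *a, int n)
{
    if (n <= 0)
        return 0;

    bool p16 = true, p17 = false;  /* stage predicates */
    int r32 = 0, r33 = 0;          /* r32 becomes r33 after rotation */
    int lc = n - 1;                /* loop counter, like ar.lc */
    int ec = 2;                    /* epilog counter, like ar.ec: number of stages */
    int i = 0;
    long sum = 0;
    bool taken;

    do {
        if (p17) sum += r33;       /* stage 2: consume the previous load */
        if (p16) r32 = a[i++];     /* stage 1: issue the next load */

        /* Loop branch in the style of br.ctop. */
        bool next_p16;
        if (lc != 0)     { lc--;   next_p16 = true;  taken = true;  }  /* kernel */
        else if (ec > 1) { ec--;   next_p16 = false; taken = true;  }  /* epilog: drain */
        else             { ec = 0; next_p16 = false; taken = false; }  /* exit */

        /* Rotation: each predicate and register moves to the next stage. */
        p17 = p16;
        p16 = next_p16;
        r33 = r32;
    } while (taken);

    return sum;
}
```

The loop body runs n + 1 times for n elements: the first pass only loads (prolog), the last pass only adds (epilog), and every pass in between does both.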
