IA-64 and Itanium(TM) Processor Architecture Overview
Intel® Itanium™ Architecture
Refresh on some features relevant for developers
Heinz Bast [email protected]
March 2005

Copyright © 2001, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners

Itanium® Processor Architecture: Selected Features

• 64-bit Addressing, Flat Memory Model
• Instruction Level Parallelism (6-way)
• Large Register Files
• Automatic Register Stack Engine
• Predication
• Software Pipelining Support
  – Register Rotation
  – Loop Control Hardware
• Sophisticated Branch Architecture
• Control & Data Speculation
• Powerful 64-bit Integer Architecture
• Advanced 82-bit Floating Point Architecture
• Multimedia Support (MMX™ Technology)

Traditional Architectures: Limited Parallelism

[Figure: Original source code -> Compile -> Sequential machine code -> Hardware (parallelized code, multiple functional units). Execution units are available but used inefficiently.]

Today's Processors are often 60% Idle

Intel® Itanium® Architecture: Explicit Parallelism

[Figure: Original source code -> Compile (Itanium architecture compiler, views wider scope) -> Parallel machine code -> Hardware (multiple functional units). More efficient use of execution resources.]

Increases Parallel Execution
EPIC Instruction Parallelism

Source Code
  -> Instruction Groups (series of bundles)
     • No RAW or WAW dependencies
     • Issued in parallel depending on resources
  -> Instruction Bundles (3 instructions)
     • 3 instructions + template
     • 3 x 41 bits + 5 bits = 128 bits

Up to 6 instructions executed per clock

Instruction Level Parallelism

Instruction Groups
• No RAW or WAW dependencies
• Delimited by 'stops' in assembly code
• Instructions in groups issued in parallel, depending on available resources

    instr 1     // 1st group
    instr 2 ;;  // 1st group
    instr 3     // 2nd group
    instr 4     // 2nd group

Instruction Bundles
• 3 instructions and 1 template in a 128-bit bundle
• Instruction dependencies are marked by using 'stops'
• Instruction groups can span multiple bundles

    { .mii
      ld4 r28=[r8]    // load
      add r9=2,r1     // Int op.
      add r30=1,r1    // Int op.
    }

128 bits (bundle):

    Instruction 2 | Instruction 1 | Instruction 0 | template
    41 bits       | 41 bits       | 41 bits       | 5 bits
    Memory (M)    | Memory (M)    | Integer (I)   | (MMI)

Flexible Issue Capability

Large Register Set

    Register file         Width          Registers    Partitioning             Fixed values
    General (GR)          64-bit + NaT   GR0-GR127    32 static, 96 stacked    GR0 = 0
    Floating-point (FR)   82-bit         FR0-FR127    32 static, 96 rotating   FR0 = +0.0, FR1 = +1.0
    Predicate (PR)        1-bit          PR0-PR63     16 static, 48 rotating   PR0 = 1
    Branch (BR)           64-bit         BR0-BR7
    Application (AR)      64-bit         AR0-AR127
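Returning to the bundle format described above (3 x 41 bits + 5 bits = 128 bits), the following is a minimal C sketch of packing and unpacking such a bundle. It assumes the template occupies the least-significant 5 bits and Instruction 0 sits directly above it, matching the right-to-left drawing on the slide. The names `bundle128`, `decoded_bundle`, `encode_bundle` and `decode_bundle` are hypothetical and exist only for this illustration; the slot contents are treated as opaque 41-bit values, not as real instruction encodings.

```c
#include <assert.h>
#include <stdint.h>

/* One 128-bit bundle held as two 64-bit halves (lo = bits 0..63). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} bundle128;

typedef struct {
    unsigned template_field; /* 5 bits */
    uint64_t slot[3];        /* 41 bits each, slot[0] = Instruction 0 */
} decoded_bundle;

#define SLOT_MASK ((UINT64_C(1) << 41) - 1)

/* Pack a template and three 41-bit slots into 128 bits. */
static bundle128 encode_bundle(unsigned tmpl, uint64_t s0, uint64_t s1, uint64_t s2)
{
    bundle128 b;
    assert(tmpl < 32); /* the template must fit in 5 bits */
    s0 &= SLOT_MASK;
    s1 &= SLOT_MASK;
    s2 &= SLOT_MASK;
    b.lo = (uint64_t)tmpl | (s0 << 5) | (s1 << 46); /* bits 0..63   */
    b.hi = (s1 >> 18) | (s2 << 23);                 /* bits 64..127 */
    return b;
}

/* Split a 128-bit bundle back into template and slots. */
static decoded_bundle decode_bundle(bundle128 b)
{
    decoded_bundle d;
    d.template_field = (unsigned)(b.lo & 0x1F);            /* bits 0..4    */
    d.slot[0] = (b.lo >> 5) & SLOT_MASK;                   /* bits 5..45   */
    d.slot[1] = ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK; /* bits 46..86  */
    d.slot[2] = (b.hi >> 23) & SLOT_MASK;                  /* bits 87..127 */
    return d;
}
```

Slot 1 straddles the two 64-bit halves (18 bits in `lo`, 23 bits in `hi`), which is why its extraction combines a right shift of `lo` with a left shift of `hi`.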
Predication

Predicate registers activate or deactivate instructions
Predicate registers are set by compare instructions
• Example: cmp.eq p1, p2 = r2, r3
(Almost) all instructions can be predicated

    (p1) ldfd f32=[r32],8
    (p2) fmpy.d f36=f6,f36

Predication:
• eliminates branching in if/else logic blocks
• creates larger code blocks for optimization
• simplifies start up/shutdown of pipelined loops

Predication
Code Example: absolute difference of two numbers

C Code

    if (r2 >= r3)
        r4 = r2 - r3;
    else
        r4 = r3 - r2;

Non-Predicated Pseudo Code

         cmpGE r2, r3
         jump_zero P2
    P1:  sub r4 = r2, r3
         jump end
    P2:  sub r4 = r3, r2
    end: ...

Predicated Assembly Code

    cmp.ge p1,p2 = r2,r3 ;;
    (p1) sub r4 = r2,r3
    (p2) sub r4 = r3,r2

Predication Removes Branches, Enables Parallel Execution

Register Stack Engine

The traditional use of a procedure stack in memory for procedure call management demands a large overhead.

The Intel® Itanium™ processor family uses the general register stack for procedure call management, thus eliminating the frequent memory accesses.
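The absolute-difference example from the predication slides can be mimicked in portable C. This is only a sketch of the idea: one compare produces two complementary predicates, both subtractions are computed, and the predicate decides which result is kept. The function names are hypothetical, the arithmetic is done on unsigned values to avoid signed overflow, and a C compiler is free to generate branches anyway, so this shows the data flow rather than guaranteed machine code.

```c
#include <assert.h>
#include <stdint.h>

/* Branching form, following the C code on the slide. */
static uint64_t abs_diff_branch(int64_t r2, int64_t r3)
{
    if (r2 >= r3)
        return (uint64_t)r2 - (uint64_t)r3;
    else
        return (uint64_t)r3 - (uint64_t)r2;
}

/* Predicated form: mirrors
 *     cmp.ge p1,p2 = r2,r3 ;;
 *     (p1) sub r4 = r2,r3
 *     (p2) sub r4 = r3,r2
 */
static uint64_t abs_diff_predicated(int64_t r2, int64_t r3)
{
    int p1 = (r2 >= r3); /* predicate for the "then" path */
    int p2 = !p1;        /* complementary predicate       */
    assert(p1 != p2);    /* exactly one of them is set    */

    /* Both subtractions are computed unconditionally. */
    uint64_t t1 = (uint64_t)r2 - (uint64_t)r3;
    uint64_t t2 = (uint64_t)r3 - (uint64_t)r2;

    /* Turn each predicate into an all-ones or all-zeros mask. */
    uint64_t m1 = (uint64_t)0 - (uint64_t)p1;
    uint64_t m2 = (uint64_t)0 - (uint64_t)p2;

    /* Only the result whose predicate is 1 reaches r4. */
    return (t1 & m1) | (t2 & m2);
}
```

Because the two subtractions do not depend on each other, they can sit in the same instruction group, which is the point the slide makes about enabling parallel execution.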
Register Stack

GRs 0-31 are global to all procedures
Stacked registers begin at GR32 and are local to each procedure
Each procedure's register stack frame varies from 0 to 96 registers
Only GRs implement a register stack
• The FRs, PRs, and BRs are global to all procedures
Register Stack Engine (RSE)
• Upon stack overflow/underflow, registers are saved/restored to/from a backing store transparently

[Figure: GR0-GR31: 32 global registers. GR32-GR127: 96 stacked registers, holding the overlapping frames of PROC A and PROC B.]

Optimizes the Call/Return Mechanism

Register Stack Engine at Work

Call changes the frame to contain only the caller's output
Alloc instr. sets the frame region to the desired size
• Three architecture parameters: local, output, and rotating
Return restores the stack frame of the caller

[Figure: frames across Call, Alloc and Ret. PROC A: Local (Inputs) from 32, Outputs from 46 to 52. After Call, PROC B has a virtual frame containing only A's outputs, starting at 32. After Alloc, PROC B: Local (Inputs) from 32, Outputs from 48 to 56. After Ret, PROC A's frame is restored.]

Improved execution speed in OO languages, e.g. Java, C++
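The Call / Alloc / Return sequence above can be modelled in a few lines of C. This is a deliberately simplified sketch, not the hardware mechanism: it tracks only a frame base, the size of the local region and the total frame size, omits the rotating parameter, and ignores overflow of the 96 physical stacked registers (the case the RSE handles by spilling to the backing store). All type and function names are hypothetical, and the register counts in the usage note are illustrative.

```c
#include <assert.h>

#define FIRST_STACKED 32 /* stacked registers begin at GR32   */
#define MAX_FRAME     96 /* a frame holds 0 to 96 registers   */

typedef struct {
    int base; /* slot in the stacked area that this procedure sees as GR32 */
    int sol;  /* size of the local region (inputs + locals)               */
    int sof;  /* size of the whole frame (locals + outputs)               */
} frame;

/* alloc: set the current frame to the desired size. */
static void frame_alloc(frame *f, int locals, int outputs)
{
    assert(locals >= 0 && outputs >= 0 && locals + outputs <= MAX_FRAME);
    f->sol = locals;
    f->sof = locals + outputs;
}

/* call: the new frame contains only the caller's outputs.
 * Returns the caller's frame so that it can be restored later. */
static frame frame_call(frame *f)
{
    frame saved = *f;
    f->base = saved.base + saved.sol; /* caller's first output becomes GR32 */
    f->sof  = saved.sof - saved.sol;  /* only the outputs are visible       */
    f->sol  = 0;
    return saved;
}

/* return: restore the stack frame of the caller. */
static void frame_return(frame *f, frame saved)
{
    *f = saved;
}

/* Translate a virtual register number (GR32 and up) to a stacked-area slot. */
static int physical_slot(const frame *f, int gr)
{
    assert(gr >= FIRST_STACKED && gr < FIRST_STACKED + f->sof);
    return f->base + (gr - FIRST_STACKED);
}
```

As a usage example, a caller that allocates 14 locals and 7 outputs sees its first output as GR46; after `frame_call`, the callee addresses the same physical slot as GR32, so arguments are passed without copying or touching memory.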
Register rotation

Example: 8 general registers rotating, counted loop (br.ctop)

    Register   Before   After br.ctop taken
    gr32       123      29     (wraparound from gr39)
    gr33       8189     123
    gr34       0        8189
    gr35       99       0
    gr36       abc      99
    gr37       9ad6     abc
    gr38       beef     9adc
    gr39       29       beef
    gr40       4567     4567
    gr41       818      818
    ...        ...      ...

Floating point and predicate registers
• Always rotate the same set of registers
  – FR 32-127
• Rotate in the same direction as general registers
  – highest rotates to lowest register number
  – all other values rotate towards larger register numbers
• Rotate at the same time as general registers (at the modulo-scheduled loop instruction)

Software Pipelining

[Figure: Sequential Loop vs. Software-Pipelined Loop; stages load, compute, store; axis: Time]

Traditional architectures use loop unrolling
• Results in code expansion and increased cache misses
Itanium™ Software Pipelining uses rotating registers
• Allows overlapping execution of multiple loop instances

Itanium™ provides direct support for Software Pipelining

Software pipelined Loop

Consider

C code:

    for (i = 0; i < n; i++)
        y[i] = a * x[i];

Pseudo Code:

    loop: ldfd   x[i]
          fmpy.d y[i] = a, x[i]
          stfd   y[i]
          br.ctop loop

Assume (*cycle counts for demonstration purposes only)
• Instruction Latencies:
  – ldfd (fp load) 4 cycles*
  – fmpy.d (fp mul) 2 cycles*
  – stfd (fp store) 1 cycle*
  – br.ctop (branch counted loop top) 1 cycle*
• ldfd, fmpy.d, stfd and br can be issued in the same instruction group (only w/o RAW or WAW dependencies)

Software pipelined loop

For n = 8

Prolog
    Cycle 1: ld x[1]
    Cycle 2: ld x[2]
    Cycle 3: ld x[3]
    Cycle 4: ld x[4]
    Cycle 5: ld x[5]    fmpy y[1]=a,x[1]
    Cycle