
Intel® Itanium® 2 Processor Architecture

www.intel.com/software/college

Intel, Itanium, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names may be claimed as the property of others. Copyright © 2004 Intel Corporation. All rights reserved.

Itanium® Processor Architecture: Selected Key Features

• 64-bit Addressing, Flat Memory Model
• Instruction Level Parallelism (6-way)
• Large Register Files
• Automatic Register Stack Engine
• Predication
• Software Pipelining Support (Register Rotation + Loop Control Hardware)
• Sophisticated Branch Architecture
• Control & Data Speculation
• Powerful 64-bit Integer Architecture
• Advanced 82-bit Floating Point Architecture

Traditional Architectures: Limited Parallelism

[Figure: Original Source Code → Compile → Sequential Machine Code → Hardware → parallelized code → multiple functional units. Execution units are available but used inefficiently.]

Today's Processors are often 60% Idle

Intel® Itanium® Architecture: Explicit Parallelism

[Figure: Original Source Code → Compile (Compiler) → Parallel Machine Code → Hardware → multiple functional units. The compiler views a wider scope; the Itanium architecture makes more efficient use of execution resources.]

Increases Parallel Execution

EPIC Instruction Parallelism

Source Code is compiled into:

• Instruction Groups (series of bundles)
  – No RAW or WAW dependencies
  – Issued in parallel depending on resources
• Instruction Bundles (3 instructions)
  – 3 instructions + template
  – 3 x 41 bits + 5 bits = 128 bits

Up to 6 instructions executed per clock
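The grouping rule above (no RAW or WAW dependencies inside a group) can be sketched in a few lines of Python. This is a simplified model, not how a real compiler or assembler places stops: it tracks register names only, ignores memory dependencies, resource limits, and the architecture's special-case exceptions, and the last two instructions in `program` are hypothetical additions for illustration.

```python
def split_into_groups(instrs):
    """Greedy stop placement over (text, dest, sources) tuples.

    A new group starts when an instruction reads (RAW) or writes (WAW)
    a register that was already written in the current group.
    """
    groups, current, written = [], [], set()
    for text, dest, srcs in instrs:
        raw = any(s in written for s in srcs)
        waw = dest is not None and dest in written
        if current and (raw or waw):
            groups.append(current)          # this is where ';;' would go
            current, written = [], set()
        current.append(text)
        if dest is not None:
            written.add(dest)
    if current:
        groups.append(current)
    return groups


program = [
    ("ld4 r28=[r8]", "r28", ["r8"]),
    ("add r9=2,r1", "r9", ["r1"]),
    ("add r30=1,r1", "r30", ["r1"]),
    ("add r10=r28,r9", "r10", ["r28", "r9"]),   # hypothetical: RAW on r28, r9
    ("sub r10=r30,r1", "r10", ["r30", "r1"]),   # hypothetical: WAW on r10
]
groups = split_into_groups(program)
for number, group in enumerate(groups, start=1):
    print(f"group {number}: {group}")
```

The first three instructions are independent and share one group; each of the two hypothetical instructions forces a stop.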

Instruction Level Parallelism

• Instruction Groups
  – No RAW or WAW dependencies
  – Delimited by 'stops' in assembly code
  – Instructions in groups issued in parallel, depending on available resources

      instr 1      // 1st group
      instr 2;;    // 1st group
      instr 3      // 2nd group
      instr 4      // 2nd group

• Instruction Bundles
  – 3 instructions and 1 template in 128-bit bundle
  – Instruction dependencies by using 'stops'
  – Instruction groups can span multiple bundles

      { .mii
        ld4 r28=[r8]     // load
        add r9=2,r1      // Int op.
        add r30=1,r1     // Int op.
      }

128 bits (bundle):

      | Instruction 2 | Instruction 1 | Instruction 0 | template |
      |    41 bits    |    41 bits    |    41 bits    |  5 bits  |

      Example template (MMI): Memory (M), Memory (M), Integer (I)

Flexible Issue Capability
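The bundle layout above (template in the low 5 bits, then three 41-bit slots) can be checked with a small packing sketch. The function names are made up for this illustration, and the template and slot values are arbitrary placeholders rather than real instruction encodings.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, (slot0, slot1, slot2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & slot_mask
        for index in range(3)
    )
    return template, slots


# Placeholder values only; these are not real encodings.
bundle = pack_bundle(0x08, 0x111, 0x222, 0x333)
print(f"{bundle:032x}")
print(unpack_bundle(bundle))
```

Filling every field with its maximum value gives exactly 2^128 - 1, which confirms the 3 x 41 + 5 = 128 arithmetic.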

Large Register Set

• General Registers (64-bit, plus NaT bit): GR0 to GR127
  – GR0 = 0
  – GR0 to GR31: 32 Static
  – GR32 to GR127: 96 Stacked
• Floating-point Registers (82-bit): FR0 to FR127
  – FR0 = +0.0, FR1 = +1.0
  – FR0 to FR31: 32 Static
  – FR32 to FR127: 96 Rotating
• Predicate Registers: PR0 to PR63
  – PR0 = 1
  – PR0 to PR15: 16 Static
  – PR16 to PR63: 48 Rotating
• Branch Registers (64-bit): BR0 to BR7
• Application Registers (64-bit): AR0 to AR127

Predication

• Predicate registers activate/inactivate instructions
• Predicate registers are set by compare instructions
  – Example: cmp.eq p1, p2 = r2, r3
• (Almost) all instructions can be predicated

      (p1) ldfd f32=[r32],8
      (p2) fmpy.d f36=f6,f36

• Predication:
  – eliminates branching in if/else logic blocks
  – creates larger code blocks for optimization
  – simplifies start up/shutdown of pipelined loops

Predication

• Code Example: absolute difference of two numbers

C Code:

      if (r2 >= r3)
          r4 = r2 - r3;
      else
          r4 = r3 - r2;

Non-Predicated Pseudo Code:

           cmpGE r2, r3
           jump_zero P2
      P1:  sub r4 = r2, r3
           jump end
      P2:  sub r4 = r3, r2
      end: ...

Predicated Assembly Code:

      cmp.ge p1,p2 = r2,r3 ;;
      (p1) sub r4 = r2,r3
      (p2) sub r4 = r3,r2

Predication Removes Branches, Enables Parallel Execution
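As a rough illustration of the predicated version, the sketch below models the compare as setting two complementary predicates and lets both subtractions execute, with only the one whose predicate is 1 contributing to r4. It is a Python analogy for the idea, not a model of the hardware; the function names are made up for this example.

```python
def abs_diff_branching(r2, r3):
    """Mirrors the C code: one of two paths is taken."""
    if r2 >= r3:
        return r2 - r3
    return r3 - r2


def abs_diff_predicated(r2, r3):
    """Mirrors the predicated assembly: no branch, both subs are issued."""
    p1 = int(r2 >= r3)      # cmp.ge p1,p2 = r2,r3
    p2 = 1 - p1             # p2 is the complement of p1
    sub_a = r2 - r3         # (p1) sub r4 = r2,r3
    sub_b = r3 - r2         # (p2) sub r4 = r3,r2
    # Only the result guarded by a true predicate is committed to r4.
    return p1 * sub_a + p2 * sub_b


for a, b in [(5, 3), (3, 5), (7, 7)]:
    print(a, b, abs_diff_branching(a, b), abs_diff_predicated(a, b))
```

Because the two subtractions do not depend on each other, they can sit in the same instruction group after the compare.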

Register Stack

• General registers (GRs) 0-31 are global to all procedures.
• Stacked registers begin at GR32 and are local to each procedure.
• Each procedure's register stack frame varies from 0 to 96 registers.
• Only GRs implement a register stack.
  – The FRs, PRs, and BRs are global to all procedures.
• Register stack engine (RSE)
  – Upon stack overflow/underflow, a backing store transparently saves or restores the registers.

[Figure: GR0 to GR31 are the 32 global registers; GR32 to GR127 are the 96 stacked registers, holding the overlapping frames of PROC A and PROC B.]

Optimizes the Call/Return Mechanism

Register Stack

• Call changes the frame to contain only the caller's output.
• Alloc sets the frame region to the desired size.
  – There are three architecture parameters: local, output, and rotating.
• Return restores the stack frame of the caller.

[Figure: four frames in sequence. PROC A: Local (Inputs) from 32, Outputs from 46 to 52. After Call, PROC B: only A's Outputs, renumbered to start at 32 (virtual). After Alloc, PROC B: Local (Inputs) from 32, Outputs from 48 to 56. After Ret, PROC A: Local (Inputs) from 32, Outputs from 46 to 52.]

Avoid Register Spill/Fill Upon Procedure Call/Return

Alloc Semantics

      alloc r1 = ar.pfs, i, l, o, r

      r1   gets copy of AR.PFS
      i    size of input
      l    size of local
      o    size of output
      r    size of rotating

where:
• New stack frame of size (i + l + o) is allocated on the general register stack.
• Previous function state (PFS) register is copied to register specified by r1.
• This instruction may also resize the current stack frame.

Application Programming Model: Register Stack

                                                             CFM          PFM
      Instruction  Stacked registers                         sol  sof     sol  sof
      (initial)    A frame: Input A | Local A | Output A     14   21      xx   xx
                   (boundaries at 32, 46, 52)
      call         B frame: Output B1                        0    7       14   21
                   (boundaries at 32, 38)
      alloc        B frame: Input B | Loc B | Out B2         15   19      14   21
                   (boundaries at 32, 47, 50)
      return       A frame: Input A | Local A | Output A     14   21      14   21
                   (boundaries at 32, 46, 52)
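The sol/sof values in the table above can be reproduced with a small model of the current frame marker (CFM) and previous frame marker (PFM). This is a sketch under simplifying assumptions: only one level of call is tracked, the rotating-size parameter is ignored, and the split of B's frame into 7 inputs, 8 locals, and 4 outputs is one choice that is consistent with the table, not something the slide states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    sol: int  # size of locals (inputs + locals)
    sof: int  # size of frame (inputs + locals + outputs)


class RegisterStack:
    def __init__(self, cfm):
        self.cfm = cfm
        self.pfm = None   # shown as "xx" in the table: undefined before a call
        self.base = 0     # offset of the physical register currently named GR32

    def call(self):
        """The callee frame starts as just the caller's outputs."""
        self.pfm = self.cfm
        self.base += self.cfm.sol
        self.cfm = Frame(sol=0, sof=self.cfm.sof - self.cfm.sol)

    def alloc(self, i, l, o):
        """Resize the current frame; returns the saved frame marker (copy of AR.PFS)."""
        self.cfm = Frame(sol=i + l, sof=i + l + o)
        return self.pfm

    def ret(self):
        """Restore the caller's frame from the previous frame marker."""
        self.cfm = self.pfm
        self.base -= self.cfm.sol


stack = RegisterStack(Frame(sol=14, sof=21))    # PROC A
stack.call()
after_call = stack.cfm
stack.alloc(i=7, l=8, o=4)                      # assumed split for PROC B
after_alloc = stack.cfm
stack.ret()
after_ret = stack.cfm
print(after_call, after_alloc, after_ret)
```

With sof = 21 the frame covers GR32 to GR52, and with sol = 14 the outputs start at GR46, matching the boundaries listed for A. Nested calls would need each procedure to keep its own copy of the saved frame marker, which is what the r1 operand of alloc is for.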

Application Programming Model: Register Stack Engine

[Figure: the Register Stack next to Memory (Backing Store), with the RSE moving registers between them. Register stack: PROC C (Current), PROC B (Dirty), PROC A (Clean). Backing store: PROC A and PROC A's ancestors. Pointers: BSP and BSPSTORE. Axes: higher register addresses, higher memory addresses.]

Register rotation

• Example: 8 general registers rotating, counted loop (br.ctop)

      Register   Before   After br.ctop taken
      gr32       123      29      (wraparound from gr39)
      gr33       8189     123
      gr34       0        8189
      gr35       99       0
      gr36       abc      99
      gr37       9ad6     abc
      gr38       beef     9ad6
      gr39       29       beef
      gr40       4567     4567    (not rotating)
      gr41       818      818     (not rotating)
      ...

• Floating point and predicate registers
  – Always rotate the same set of registers
    – FR 32-127
  – Rotate in the same direction as general registers
    – highest rotates to lowest register number
    – all other values rotate towards larger register numbers
  – Rotate at the same time as general registers (at the modulo-scheduled loop instruction)
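The before/after table above can be reproduced by modeling rotation as renaming rather than copying: a rotating register base (called `rrb` here) is decremented, so each value appears one register number higher and the highest wraps to the lowest. The class below is an illustrative sketch with made-up names, covering only the 8 rotating registers from the example.

```python
class RotatingRegisters:
    """Rotating region starting at register number `first` (a simplified model)."""

    def __init__(self, values, first=32):
        self.physical = list(values)
        self.first = first
        self.rrb = 0  # rotating register base

    def _index(self, reg):
        return (reg - self.first + self.rrb) % len(self.physical)

    def read(self, reg):
        return self.physical[self._index(reg)]

    def write(self, reg, value):
        self.physical[self._index(reg)] = value

    def rotate(self):
        """What a taken br.ctop does to register names: decrement the base."""
        self.rrb = (self.rrb - 1) % len(self.physical)


# Values from the example, kept as strings because the slide mixes notations.
regs = RotatingRegisters(["123", "8189", "0", "99", "abc", "9ad6", "beef", "29"])
before = [regs.read(r) for r in range(32, 40)]
regs.rotate()
after = [regs.read(r) for r in range(32, 40)]
print(before)
print(after)
```

No data moves during `rotate`; only the mapping from register number to storage changes, which is why rotation can happen as a side effect of the loop branch.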

Software Pipelining

[Figure: two timelines. Sequential Loop: load, compute, store run one after another for each iteration. Software-Pipelined Loop: the load, compute, and store of successive iterations overlap in time.]

• Traditional architectures use loop unrolling
  – Results in code expansion and increased cache misses
• Itanium™ Software Pipelining uses rotating registers
  – Allows overlapping execution of multiple loop instances

Itanium™ provides direct support for Software Pipelining

Software pipelined Loop

• Consider

  C code:

      for (i = 0; i < n; i++)
          y[i] = a * x[i];

  Pseudo Code:

      loop: ldfd x[i]
            fmpy.d y[i] = a, x[i]
            stfd y[i]
            br.ctop loop

• Assume
  – Instruction Latencies (cycle counts for demonstration purposes only):

      ldfd (fp load)                       4 cycles
      fmpy.d (fp mul)                      2 cycles
      stfd (fp store)                      1 cycle
      br.ctop (branch counted loop top)    1 cycle

  – ldfd, fmpy.d, stfd and br can be issued in the same instruction group (only w/o RAW or WAW dependencies)

Software pipelined loop

For n = 8:

      Cycle     Load       Multiply            Store       Phase
      Cycle 1:  ld x[1]                                    Prolog
      Cycle 2:  ld x[2]                                    Prolog
      Cycle 3:  ld x[3]                                    Prolog
      Cycle 4:  ld x[4]                                    Prolog
      Cycle 5:  ld x[5]    fmpy y[1]=a,x[1]                Prolog
      Cycle 6:  ld x[6]    fmpy y[2]=a,x[2]                Prolog
      Cycle 7:  ld x[7]    fmpy y[3]=a,x[3]    stf y[1]    Kernel
      Cycle 8:  ld x[8]    fmpy y[4]=a,x[4]    stf y[2]    Kernel
      Cycle 9:             fmpy y[5]=a,x[5]    stf y[3]    Epilog
      Cycle 10:            fmpy y[6]=a,x[6]    stf y[4]    Epilog
      Cycle 11:            fmpy y[7]=a,x[7]    stf y[5]    Epilog
      Cycle 12:            fmpy y[8]=a,x[8]    stf y[6]    Epilog
      Cycle 13:                                stf y[7]    Epilog
      Cycle 14:                                stf y[8]    Epilog

* Cycle counts for demonstration purposes only.

In this example, one iteration takes 7 cycles
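The 14-cycle schedule above follows directly from the demonstration latencies: a new load is issued every cycle, the multiply for an element issues 4 cycles after its load, and the store 2 cycles after the multiply. The sketch below regenerates the schedule from those numbers; the function name and output format are made up for this illustration.

```python
LD_LATENCY = 4    # demonstration latency of the fp load
FMPY_LATENCY = 2  # demonstration latency of the fp multiply
STF_LATENCY = 1   # demonstration latency of the fp store


def pipelined_schedule(n):
    """Return {cycle: [operations issued in that cycle]} for n iterations."""
    fmpy_offset = LD_LATENCY
    stf_offset = LD_LATENCY + FMPY_LATENCY
    schedule = {cycle: [] for cycle in range(1, n + stf_offset + 1)}
    for i in range(1, n + 1):
        schedule[i].append(f"ld x[{i}]")
    for i in range(1, n + 1):
        schedule[i + fmpy_offset].append(f"fmpy y[{i}]=a,x[{i}]")
    for i in range(1, n + 1):
        schedule[i + stf_offset].append(f"stf y[{i}]")
    return schedule


schedule = pipelined_schedule(8)
for cycle, ops in schedule.items():
    print(f"Cycle {cycle}: " + "  ".join(ops))

one_iteration = LD_LATENCY + FMPY_LATENCY + STF_LATENCY
sequential_cycles = 8 * one_iteration
print(len(schedule), "cycles pipelined vs", sequential_cycles, "sequential")
```

Under these assumed latencies the pipelined loop finishes 8 iterations in 14 cycles, where running the 7-cycle iterations back to back would take 56.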

Software pipelined loop

Loop body:

      (p16) ldf f32,[r32],8
      (p20) fmpy f36=f6,f36
      (p22) stf [r33],f38,8
            br.ctop

Remember our example latencies: ldf 4 cycles, fmpy 2 cycles, stf 1 cycle, br 1 cycle.

The values move through floating point registers f32 to f38 while the stage predicates move through predicate registers p16 to p22:

      Operations issued                                    p16 p17 p18 p19 p20 p21 p22
      ldf f32, x[1]                                         1   0   0   0   0   0   0
      ldf f32, x[2]                                         1   1   0   0   0   0   0
      ldf f32, x[3]                                         1   1   1   0   0   0   0
      ldf f32, x[4]                                         1   1   1   1   0   0   0
      ldf f32, x[5]   fmpy y[1]=a, x[1]                     1   1   1   1   1   0   0
      ldf f32, x[6]   fmpy y[2]=a, x[2]                     1   1   1   1   1   1   0
      ldf f32, x[7]   fmpy y[3]=a, x[3]   stf y[1]          1   1   1   1   1   1   1
      ldf f32, x[8]   fmpy y[4]=a, x[4]   stf y[2]          1   1   1   1   1   1   1
                      fmpy y[5]=a, x[5]   stf y[3]          0   1   1   1   1   1   1
                      fmpy y[6]=a, x[6]   stf y[4]          0   0   1   1   1   1   1
                      fmpy y[7]=a, x[7]   stf y[5]          0   0   0   1   1   1   1
                      fmpy y[8]=a, x[8]   stf y[6]          0   0   0   0   1   1   1
                                          stf y[7]          0   0   0   0   0   1   1
                                          stf y[8]          0   0   0   0   0   0   1

Software pipelined loop

• Actual code example:

      // Initialization
      mov pr.rot=0                      // Clear all rotating predicates
      cmp.eq p16,p0=r0,r0               // Set p16=1
      mov ar.lc=7                       // Set loop counter to n-1
      mov ar.ec=7                       // Set epilog counter to # of stages
      ...
      loop:
      { .mfi
        (p16) ldfd f32=[r32],8          // Stage 1: Load x
        (p20) fmpy.d f36=f6,f36         // Stage 5: y=a*x
              nop.i 0
      }
      { .mfb
        (p22) stfd [r33]=f38,8          // Stage 7: Store y
              nop.f 0
              br.ctop.sptk.few loop     // Branch back
      }
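How the loop counter (ar.lc) and epilog counter (ar.ec) drive the stage predicates can be sketched as follows. This is a simplified model of br.ctop written for this walkthrough: it tracks only the window p16 to p22 used by the example, and treats each pass through the loop body as one row of predicate values.

```python
def run_ctop_loop(lc, ec, n_pred=7):
    """Return the predicate values (p16, p17, ...) seen on each loop pass."""
    preds = [1] + [0] * (n_pred - 1)   # p16 preset to 1, the rest cleared
    rows = []
    while True:
        rows.append(tuple(preds))
        if lc > 0:
            # Kernel/prolog: count down LC, rotate a 1 into p16, branch taken.
            lc -= 1
            preds = [1] + preds[:-1]
        elif ec > 1:
            # Epilog: count down EC, rotate a 0 into p16, branch still taken.
            ec -= 1
            preds = [0] + preds[:-1]
        else:
            # Final pass: rotate once more if EC is 1, then fall through.
            if ec == 1:
                ec = 0
                preds = [0] + preds[:-1]
            break
    return rows


rows = run_ctop_loop(lc=7, ec=7)
for number, row in enumerate(rows, start=1):
    print(f"pass {number:2d}:", *row)
```

With lc = 7 and ec = 7 this gives 14 passes: 8 passes with p16 set (one per load) followed by 6 passes that drain the remaining stages, which is the pattern of ones and zeros in the predicate table.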

Control & Data Speculation

[Figure: two code sequences. Left: instr 1; instr 2; ...; br (Barrier); ld r1=; use =r1. Right: instr 1; instr 2; ...; st [?] (Barrier); ld r1=; use =r1.]

• Control Speculation moves loads above branches / calls
• Data Speculation moves loads above possibly conflicting stores

Speculation reduces the impact of memory latency

Control Speculation

      Traditional Architecture        Itanium®
                                      ld.s r1=       (detect exception)
      instr 1                         instr 1
      instr 2                         instr 2        (exception propagates)
      ...                             ...
      br          (Barrier)           br
      ld r1=                          chk.s r1       (deliver exception)
      use =r1                         use =r1

• Control Speculation moves loads above branches
  – Detected exception indicated using NaT bit / NaTVal
• Check raises detected exceptions

0 clk checks: dependent uses issued in parallel with check
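The ld.s / chk.s pairing above can be imitated in Python with a sentinel that plays the role of the NaT bit. This is an analogy only: a missing dictionary key stands in for a faulting load, `KeyError` stands in for the delivered exception, and all function names are invented for the sketch.

```python
NAT = object()  # stand-in for the NaT bit / NaTVal


def ld_s(memory, addr):
    """Speculative load: a fault is deferred by returning NAT instead of raising."""
    return memory.get(addr, NAT)


def add(a, b):
    """A dependent use: NAT propagates instead of faulting."""
    return NAT if a is NAT or b is NAT else a + b


def chk_s(value, recovery):
    """Check: run recovery code only if a deferred exception is pending."""
    return recovery() if value is NAT else value


def load_plus_one(memory, addr, branch_taken):
    r1 = ld_s(memory, addr)      # load hoisted above the branch
    r2 = add(r1, 1)              # speculative use of r1
    if not branch_taken:
        return 0                 # result never needed, so no exception is delivered

    def recovery():
        # Redo the load non-speculatively; a bad address now faults for real.
        return memory[addr] + 1

    return chk_s(r2, recovery)


print(load_plus_one({8: 41}, 8, branch_taken=True))
print(load_plus_one({}, 8, branch_taken=False))
```

The second call shows the benefit: the hoisted load hit a bad address, but because the guarded path was never taken, nothing is raised.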

Data Speculation

[Figure: three code sequences.
  Traditional Arch: instr 1; instr 2; ...; st [?] (Barrier); ld r1=; use =r1
  Itanium™ Processor: ld.a r1=; instr 1; instr 2; ...; st [?]; ld.c r1; use =r1
  Itanium™ Processor with speculative use: ld.a r1=; instr 1; use =r1; instr 2; ...; st [?]; chk.a r1, with recovery code (ld r1=; use =r1; br)]

• Data Speculation moves loads above possibly conflicting stores
  – Keeps track of load addresses used in advance (ALAT)
• Advanced-loaded data can be used speculatively

Data and Control Speculations can be combined

Floating-Point Architecture

• 128 Floating Point registers (82 bit)
  – Single, double, double-extended data types
• Full IEEE 754 compliance
• Arithmetic
  – FMA: Multiply-Add instruction f = a * b + c
  – SW Divide / Sqrt provide high throughput, take advantage of wide FP machine
  – Max, Min instructions for Floating-point
• Data
  – load, store, GR ⇔ FR conversion; load pair to double data

• 2 independent FP Units
• Up to 4 DP FP operations per clock
• Up to 4 DP FP operands loaded per clock

Excellent Workstation/3D Apps Performance

Floating Point Architecture

• Native 82-bit hardware provides support for multiple numeric models
• 2 pipelined FMACs deliver 4 EP / DP FLOPs/cycle
• Performance for security, efficient use of hardware: Integer mul-add, s/w divide
• Balanced with plenty of operand bandwidth from registers / memory

[Figure: data path from the L3 Cache (up to 9MB) through the L2 Cache to the 128 entry 82-bit RF (even / odd). Labels: 2 DP Ops/clk; 4 DP Ops/clk (2 x Fld-pair); 2 stores/clk; 6 x 82-bit operands; 2 x 82-bit results.]

Itanium® 2 Caches

                             L1I           L1D            L2              L3
      Size                   16K           16K            256K            1.5/3/6M/9M on die
      Line Size              64B           64B            128B            128B
      Ways                   4             4              8               12
      Replacement            LRU           NRU            NRU             NRU
      Latency (load to use)  I-Fetch: 1    INT: 1         INT: 5, FP: 6   12/13
      Write Policy           -             WT (RA)        WB (WA + RA)    WB (WA)
      Bandwidth              R: 32 GB/s    R: 16 GB/s     R: 32 GB/s      R: 32 GB/s
                                           W: 16 GB/s     W: 32 GB/s      W: 32 GB/s

All caches are pipelined, and non-blocking
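As a small worked example, the number of sets in each cache follows from size / (line size x ways). The sketch below takes the table's figures at face value and assumes K and M mean 1024 and 1024 x 1024 bytes; only the 3M L3 size is used, since the slide lists a single associativity for several L3 sizes and that pairing is an assumption here.

```python
KB = 1024
MB = 1024 * KB


def num_sets(size_bytes, line_bytes, ways):
    """Sets in a set-associative cache: size / (line size * ways)."""
    sets, remainder = divmod(size_bytes, line_bytes * ways)
    if remainder:
        raise ValueError("size is not a whole number of sets")
    return sets


# (size, line size, ways) as listed in the cache table
caches = {
    "L1I": (16 * KB, 64, 4),
    "L1D": (16 * KB, 64, 4),
    "L2": (256 * KB, 128, 8),
    "L3 (3M)": (3 * MB, 128, 12),   # assumed pairing of size and ways
}
sets = {name: num_sets(*geometry) for name, geometry in caches.items()}
for name, count in sets.items():
    print(f"{name}: {count} sets")
```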

Itanium® 2 Pipelines

      FPU:   FP1  FP2  FP3  FP4  WB
      Core:  IPG  ROT  EXP  REN  REG  EXE  DET  WB
      L2:    L2N  L2I  L2A  L2M  L2D  L2C  L2W

      IPG       IP Generate, L1I Cache and TLB access
      ROT       Instruction Rotate and Buffer
      EXP       Expand, Port Assignment and Routing
      REN       Integer and FP Register Rename
      REG       Integer and FP read
      EXE       ALU Execute (6), L1D Cache and TLB access + L2 Cache Tag Access
      DET       Exception Detect, Branch Correction
      WB        Writeback, Integer Register update
      FP1-WB    FP FMAC + reg write
      L2N-L2I   L2 Queue Nominate/Issue
      L2A-W     L2 Access, Rotate, Correct, Write

• Short 8-stage in-order main pipeline
  – In-order issue, out-of-order completion
  – Reduced branch misprediction penalties

Pipelines are designed for very low latency

Functional Units

Itanium 2 Processor:

• Integer
  – 6 ALUs
  – 6 Multi-Media
• Memory Ports
  – 2 Load, 4 FP Load
  – 2 Store
• Floating Point
  – 2 MACs
• Branch Ports
  – 3 Branches

[Figure legend: Integer; FP - 64/82bit; FP - 32bit (SIMD FP); Multimedia; Load/Store; Branch]

Itanium® 2 Block Diagram

[Figure: block diagram with the following blocks.
  Front end: L1 Instruction Cache and Fetch/Pre-fetch Engine with ITLB (ECC); Branch Prediction; IA-32 Decode and Control; Instruction Queue (8 bundles)
  Issue: 11 Issue Ports (B B B M M M M I I F F)
  Register Stack Engine / Re-Mapping
  Registers: Branch & Predicate Registers; 128 Integer Registers; 128 FP Registers
  Execution: Branch Units; Integer and MM Units; Floating Point Units; Scoreboard, Predicate, NaTs, Exceptions
  Memory: Quad-Port L1 Data Cache and DTLB (ECC); ALAT; Quad-Port L2 Cache (ECC); L3 Cache (ECC); Controller (ECC)]

Intel Enterprise Micro-Architectures

                            Intel® Xeon® Processor          Intel® Itanium® 2
                            w/ 64-bit Extensions            Processor 9M
      Memory Addressing     64 GB                           1024 TB
      System Bus            6.4 GB/s                        6.4 GB/s
      On-die Cache          1 MB                            9 MB
      HyperThreading        Hyper-Threading® Technology     -
      Issue Ports           Up to 6                         11
      Architectural         16 GPR, 16 XMM, 8 FP            128 GPR, 128 FP
      Registers                                             + 64 Predicate Registers*
      Execution Units       2 Integer, 1 FP/MMX/SSE,        6 Integer, 2 FP, 3 Branch,
                            1 FP Mem, 2 Memory, 1 SIMD      2 Load and 2 Store
      Core Frequency        3.8 GHz                         1.6 GHz
      Instructions / Clk    3 Instructions / Cycle          6 Instructions / Cycle

* Intel’s EPIC technology includes 64 single-bit predicate registers to accelerate loop unrolling and branch-intensive code execution

IA-32 App Support on Itanium®

[Figure: two software stacks. Today: IA-32 code runs on IA-32 H/W, IPF code runs on Native H/W. With IA-32 EL: IA-32 code runs on IA-32 EL (on Native H/W) or on IA-32 H/W, IPF code runs on Native H/W. IA-32 EL enables increased utilization of key Itanium® architecture features.]

• IA-32 applications are supported on all Itanium® processor offerings
  – Before: IA-32 hardware-based approach
  – Since 2004: IA-32 Execution Layer enhances 32-bit application support
• IA-32 Execution Layer: what is it?
  – A software binary that ships with the OS; initiated by the OS when an IA-32 app is launched
  – IA-32 EL then translates the IA-32 app into a native Itanium®-based app
• Benefits of IA-32 EL
  – Enables the use of new IA-32 instructions (e.g. SSE2)
  – Can accelerate IA-32 applications on Itanium®-based systems
• Available for Windows today (SP2 or patch from www.microsoft.com)
• For Red Hat available too; for SUSE in major release
• Available for SGI Altix too (ProPack 3.0)

A very brief Overview on Montecito


Building on Itanium® 2

• Capability
  – Same capable micro-architecture supports familiar optimizations
  – Same system interface allows for leveraged investment
• Performance
  – Continue and expand Itanium 2 performance trends
  – Introduce new performance opportunities
• Efficiency
  – 2 cores, 2 threads, 26.5MByte of cache, and 1.72 billion transistors at 100W
• Flexibility
  – Dynamic power and frequency management
  – RAS+M

Parallelism

Instruction Level Parallelism (ILP)
    Non-dependent instructions from a single thread

Thread Level Parallelism (TLP)
    Non-dependent instructions from multiple threads

Improving ILP

• Improving the execution
  – Additional shifter and popcount
  – More efficient speculation recovery
  – New instructions
• Improving the cache hierarchy
  – Split the L2 cache
    – Dedicated 1 MByte L2 instruction cache
    – Effectively grows L2 data cache
  – Larger L3 cache
    – Grow L3 to 12 MByte (per core)
    – Maintain L3 latency of Itanium® 2 processor 6M and 9M
  – Queues and Control
    – Additional L3 and L2 victim buffers
    – More efficient control of queues

[Figure: Montecito core block diagram. L1I Cache (16KB), Branch Prediction, Instruction TLB; issue ports B B B I I M M M M F F; Register Stack Engine / Re-name; Branch & Predicate Registers, Integer Registers, Floating Point Registers; Branch Unit, Integer Unit, Memory/Integer, Floating Point Unit; L1D Cache (16KB), Data TLB, ALAT; L2D Cache (256KB), L2I Cache (1MB); Queues/Control; L3 Cache (12MB); System Interface.]

Exploiting TLP

• Interleave orthogonal threads to hide memory latency
  – Multiple cores in each socket
  – Multiple threads in each core
• Multi-threading
  – Dynamically allocate resources based on most efficient use
  – Long latency events determine if a thread can effectively use execution resources
  – Some resources are shared, but a thread is given exclusive access for a time period
  – Some resources are competitively shared

[Figure: Montecito core block diagram. L1I Cache (16KB), Branch Prediction, Instruction TLB; issue ports B B B I I M M M M F F; Register Stack Engine / Re-name; Branch & Predicate Registers, Integer Registers, Floating Point Registers; Branch Unit, Integer Unit, Memory/Integer, Floating Point Unit; L1D Cache (16KB), Data TLB, ALAT; L2D Cache (256KB), L2I Cache (1MB); Queues/Control; L3 Cache (12MB); System Interface.]

Montecito – 4 contexts, 1 socket

[Figure: two copies of the core block diagram side by side in one socket. Each core has its own L1I Cache (16KB), Branch Prediction, Instruction TLB, issue ports (B B B I I M M M M F F), Register Stack Engine / Re-name, register files, Branch / Integer / Memory/Integer / Floating Point units, L1D Cache (16KB), Data TLB, ALAT, L2D Cache (256KB), L2I Cache (1MB), Queues/Control, L3 Cache (12MB), and Synchronizer.]

Montecito Multi-threading

Serial Execution:

      Ai | Idle | Ai+1 | Bi | Idle | Bi+1

Montecito Multi-threaded Execution:

      Thread A:  Ai | Idle | Ai+1
      Thread B:       Bi   | Bi+1

Multi-threading decreases stalls and increases performance

Dynamic Thread Switching

• Optimal
  – Determine when execution is stalled for long latency operations
• Practical
  – Predict that a long latency event will stall execution
  – Hysteresis to avoid needless switches
• hint@pause gives software control

An effective solution allowing streaming and access clumping

Summary

• Itanium architecture provides a leap in performance
  – Large number of integer and floating-point registers
  – Parallel/flexible instruction issue
  – Uses predication, speculation and advanced branch architecture
  – Synergy between the hardware and compiler to exploit parallelism
• The Itanium architecture operating environment
  – Supports full compatibility
  – Itanium architecture system environment (IA-32 and Itanium architecture applications)

Performance, Compatibility, Scalability
