Intel Itanium 2 Processor Architecture

www.intel.com/software/college

Intel, Itanium, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names may be claimed as the property of others. Copyright © 2004 Intel Corporation. All rights reserved.

Itanium® Processor Architecture: Selected Key Features
• 64-bit Addressing, Flat Memory Model
• Instruction Level Parallelism (6-way)
• Large Register Files
• Automatic Register Stack Engine
• Predication
• Software Pipelining Support (Register Rotation + Loop Control Hardware)
• Sophisticated Branch Architecture
• Control & Data Speculation
• Powerful 64-bit Integer Architecture
• Advanced 82-bit Floating Point Architecture

Traditional Architectures: Limited Parallelism
[Figure: original source code is compiled into sequential machine code; the hardware extracts parallelized code for its multiple functional units. Execution units are available but used inefficiently.]
Today's Processors are often 60% Idle

Intel® Itanium® Architecture: Explicit Parallelism
[Figure: original source code is compiled into parallel machine code for hardware with multiple functional units. The Itanium architecture compiler views a wider scope, giving more efficient use of execution resources.]
Increases Parallel Execution
EPIC Instruction Parallelism
[Figure: source code is compiled into instruction groups, which are carried in instruction bundles.]
• Instruction groups (series of bundles)
  – No RAW or WAW dependencies
  – Issued in parallel depending on resources
• Instruction bundles (3 instructions)
  – 3 instructions + template
  – 3 x 41 bits + 5 bits = 128 bits
• Up to 6 instructions executed per clock

Instruction Level Parallelism
• Instruction Groups
  – No RAW or WAW dependencies
  – Delimited by 'stops' in assembly code
  – Instructions in groups issued in parallel, depending on available resources.

    instr 1      // 1st. group
    instr 2;;    // 1st. group
    instr 3      // 2nd. group
    instr 4      // 2nd. group

• Instruction Bundles
  – 3 instructions and 1 template in 128-bit bundle
  – Instruction dependencies are expressed by using 'stops'
  – Instruction groups can span multiple bundles

    { .mii
      ld4 r28=[r8]     // load
      add r9=2,r1      // Int op.
      add r30=1,r1     // Int op.
    }

    128 bits (bundle):
      Instruction 2   Instruction 1   Instruction 0   template
      41 bits         41 bits         41 bits         5 bits
    Example template (MMI): Memory (M), Memory (M), Integer (I)

Flexible Issue Capability

Large Register Set

    Register file         Width          Range        Partition
    General (GR)          64-bit + NaT   GR0-GR127    32 static (GR0-GR31), 96 stacked (GR32-GR127)
    Floating-point (FR)   82-bit         FR0-FR127    32 static (FR0-FR31), 96 rotating (FR32-FR127)
    Predicate (PR)        -              PR0-PR63     16 static (PR0-PR15), 48 rotating (PR16-PR63)
    Branch (BR)           64-bit         BR0-BR7
    Application (AR)      64-bit         AR0-AR127

    Fixed values: GR0 = 0, FR0 = +0.0, FR1 = +1.0, PR0 = 1
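The 128-bit bundle layout from the Instruction Level Parallelism slide (three 41-bit instruction slots plus a 5-bit template, 3 x 41 + 5 = 128) can be illustrated with a small Python sketch. This is only a model of the bit arithmetic, assuming the template occupies the low 5 bits followed by Instruction 0, 1, and 2, as the slide's figure suggests. The function names and the example slot and template values are hypothetical and do not correspond to real instruction encodings.

```python
SLOT_BITS = 41        # width of one instruction slot
TEMPLATE_BITS = 5     # width of the template field
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction slot must fit in 41 bits")
    # Assumed layout: template in bits 0-4, then slots 0, 1, 2 above it.
    return (template
            | (slot0 << TEMPLATE_BITS)
            | (slot1 << (TEMPLATE_BITS + SLOT_BITS))
            | (slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS)))


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = bundle & TEMPLATE_MASK
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    )
    return template, slots


# Placeholder values for illustration only, not real opcodes.
example = pack_bundle(0x08, 0x111, 0x222, 0x333)
```

Packing all-ones fields fills exactly 128 bits, which is a quick way to check that the field widths add up as the slide states.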
Predication
• Predicate registers activate/inactivate instructions
• Predicate Registers are set by Compare Instructions
  – Example: cmp.eq p1, p2 = r2, r3
• (Almost) all instructions can be predicated

    (p1) ldfd f32=[r32],8
    (p2) fmpy.d f36=f6,f36

• Predication:
  – eliminates branching in if/else logic blocks
  – creates larger code blocks for optimization
  – simplifies start up/shutdown of pipelined loops

Predication
• Code Example: absolute difference of two numbers

C Code:

    if (r2 >= r3)
        r4 = r2 - r3;
    else
        r4 = r3 - r2;

Non-Predicated Pseudo Code:

         cmpGE r2, r3
         jump_zero P2
    P1:  sub r4 = r2, r3
         jump end
    P2:  sub r4 = r3, r2
    end: ...

Predicated Assembly Code:

    cmp.ge p1,p2 = r2,r3 ;;
    (p1) sub r4 = r2,r3
    (p2) sub r4 = r3,r2

Predication Removes Branches, Enables Parallel Execution

Register Stack
• General registers (GRs) 0-31 are global to all procedures.
• Stacked registers begin at GR32 and are local to each procedure.
• Each procedure's register stack frame varies from 0 to 96 registers.
• Only GRs implement a register stack.
  – The FRs, PRs, and BRs are global to all procedures.
[Figure: GR0-GR31 are global; GR32-GR127 are the 96 stacked registers, in which the frames of PROC A and PROC B overlap.]
• Register stack engine (RSE)
  – Upon stack overflow/underflow, the RSE transparently saves registers to, or restores them from, a backing store.
Optimizes the Call/Return Mechanism

Introduction: Register Stack
• Call changes the frame to contain only the caller's output.
• Alloc sets the frame region to the desired size.
  – There are three architecture parameters: local, output, and rotating.
• Return restores the stack frame of the caller.
[Figure: PROC A's frame (local/inputs from GR32, outputs from GR46 to GR52); after Call, PROC B's virtual frame consists of the caller's outputs starting at GR32; after Alloc, PROC B's frame has local/inputs from GR32 and outputs from GR48 to GR56; after Ret, PROC A's frame is restored.]
Avoid Register Spill/Fill Upon Procedure Call/Return

Alloc Semantics

    alloc r1 = ar.pfs, i, l, o, r

    r1   gets copy of AR.PFS
    i    size of input
    l    size of local
    o    size of output
    r    size of rotating

where:
• New stack frame of size (i + l + o) is allocated on the general register stack.
• Previous function state (PFS) register is copied to register specified by r1.
• This instruction may also resize the current stack frame.

Application Programming Model: Register Stack

    Instruction   Frame     Stacked registers              GR markers    CFM sol  CFM sof  PFM sol  PFM sof
    (initial)     A frame   Input A, Local A, Output A     32, 46, 52    14       21       xx       xx
    call          B frame   Output B1                      32, 38        0        7        14       21
    alloc         B frame   Input B, Loc B, Out B2         32, 47, 50    15       19       14       21
    return        A frame   Input A, Local A, Output A     32, 46, 52    14       21       14       21

Note: This is an animated slide; please view in slide show.

Application Programming Model: Register Stack Engine
[Figure: the register stack holds PROC C (current), PROC B (dirty), and PROC A (clean); the RSE moves registers between the register stack and the backing store in memory, which holds PROC A and PROC A's ancestors. The BSP and BSPSTORE pointers are marked, and the axes run toward higher register addresses and higher memory addresses.]
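The call / alloc / return sequence in the register stack table above can be traced with a short Python sketch. It is a simplified model, not a description of the hardware: it tracks only the sol (size of locals) and sof (size of frame) fields of the current and previous frame markers, ignores rotation and the RSE backing store, and handles a single level of call. The class and method names are hypothetical, and the split of B's 15 local registers into i = 7 inputs and l = 8 locals is an assumption, since the slide gives only the total.

```python
from dataclasses import dataclass
from typing import Optional

FIRST_STACKED = 32   # stacked registers begin at GR32
MAX_FRAME = 96       # a frame holds at most 96 registers


@dataclass(frozen=True)
class FrameMarker:
    sol: int  # size of locals (inputs + locals)
    sof: int  # size of frame (inputs + locals + outputs)

    def markers(self):
        """Return (first register, first output register, last register)."""
        return (FIRST_STACKED,
                FIRST_STACKED + self.sol,
                FIRST_STACKED + self.sof - 1)


class RegisterStackModel:
    """Tracks the current (CFM) and previous (PFM) frame markers only."""

    def __init__(self, sol, sof):
        self.cfm = FrameMarker(sol, sof)
        self.pfm: Optional[FrameMarker] = None

    def call(self):
        # The callee's frame initially contains only the caller's outputs.
        caller = self.cfm
        self.pfm = caller
        self.cfm = FrameMarker(sol=0, sof=caller.sof - caller.sol)

    def alloc(self, i, l, o, r=0):
        # r (rotating size) is accepted but not modelled in this sketch.
        size = i + l + o
        if size > MAX_FRAME:
            raise ValueError("frame cannot exceed 96 registers")
        self.cfm = FrameMarker(sol=i + l, sof=size)
        return self.pfm  # stands in for the copy of AR.PFS placed in r1

    def ret(self):
        if self.pfm is None:
            raise RuntimeError("no caller frame to restore")
        self.cfm = self.pfm  # the caller's frame is restored


# Trace of the slide: A has sol=14, sof=21; B allocates 15 locals + 4 outputs.
stack = RegisterStackModel(sol=14, sof=21)
frame_a = stack.cfm.markers()
stack.call()
frame_b_after_call = stack.cfm.markers()
saved_pfs = stack.alloc(i=7, l=8, o=4)
frame_b_after_alloc = stack.cfm.markers()
stack.ret()
```

Under these assumptions the model reproduces the slide's register markers: 32, 46, 52 for A's frame, 32 and 38 for B after the call, and 32, 47, 50 for B after alloc.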
Register rotation
• Example: 8 general registers rotating,
• Floating point and predicate registers
