IAIA--6464 ArchitectureArchitecture
A High-Performance Computing Architecture
Stefan.Lamberts@intel.com Internet Solutions Group EMEA Technical Marketing Overview Overview July 2000 AgendaAgenda
Motivation and IA-64 feature overview IA-64 features • EPIC • Data types, memory and registers • Register stack • Predication and parallel compares • Software pipelining and register rotation • Control & data speculation • Branch architecture • Integer architecture • Floating point architecture Itanium™ processor overview Itanium™ processor based systems overview Operating systems, tools and programming
Copyright © 2000, Intel Corporation. All rights reserved. 2 *Other brands and names are the property of their respective owners IAIA--64:64: ExtendingExtending thethe IntelIntel®® ArchitectureArchitecture Designed for High Performance Computing • Scientific • Technical & Engineering • Business New EPIC Technology IA-64 Architecture uses EPIC Itanium™ processor is the first implementation of IA-64
Copyright © 2000, Intel Corporation. All rights reserved. 3 *Other brands and names are the property of their respective owners PerformancePerformance LimitersLimiters Parallelism not fully utilized • Existing architectures cannot exploit sufficient parallelism in integer code to feed a wide in-order implementation Branches • Even with perfect branch prediction, small basic blocks of code do not fully utilize machine width Procedure Calls • Software modularity is becoming standard resulting call/return overhead Memory latency and address space • Increasing relative to processor cycle time (larger cache miss penalties) and limited address space
IA-64 overcomes these limitations, IA-64 IA-64 overcomes these limitations, and more !
Copyright © 2000, Intel Corporation. All rights reserved. 4 *Other brands and names are the property of their respective owners IAIA--6464 ArchitecturalArchitectural FeaturesFeatures 64-bit Address Flat Memory Model Explicit Parallel Instruction Computing Parallelism Explicit Parallel Instruction Computing Large Register Files Automatic Register Stack Engine Predication Branches Software Pipelining Support Register Rotation Sophisticated Branch Architecture Procedure Calls Loop Control Hardware Control & Data Speculation Cache Control Memory Powerful Integer Architecture Latency Advanced Floating Point Architecture Multimedia Support (MMX™ Technology)
IA-64 It’s more than just 64 bits … Copyright © 2000, Intel Corporation. All rights reserved. 5 *Other brands and names are the property of their respective owners NextNext GenerationGeneration ArchitectureArchitecture EPICEPIC DesignDesign PhilosophyPhilosophy Maximize performance via EPIC hardware & software synergy Advanced features VLIW OOO / SS enhance instruction level parallelism RISC • Predication, Speculation, … Massive hardware CISC resources for parallel Performance Performance execution Time BeyondBeyond TraditionalTraditional ArchitecturesArchitectures
Copyright © 2000, Intel Corporation. All rights reserved. 6 *Other brands and names are the property of their respective owners AgendaAgenda
Motivation and IA-64 feature overview IA-64 features • EPIC • Data types, memory and registers • Register stack • Predication and parallel compares • Software pipelining and register rotation • Control & data speculation • Branch architecture • Integer architecture • Floating point architecture Itanium™ processor overview Itanium™ processor based systems overview Operating systems, tools and programming
Copyright © 2000, Intel Corporation. All rights reserved. 7 *Other brands and names are the property of their respective owners EPICEPIC InstructionInstruction ParallelismParallelism
Source Code
Instruction Groups No RAW or WAW (series of bundles) dependencies Issued in parallel depending on resources
Instruction 3 instructions + Bundles template (3 Instructions) 3 x 41 bits + 5 bits = 128 bits
Up to 6 instructions executed per clock
Copyright © 2000, Intel Corporation. All rights reserved. 8 *Other brands and names are the property of their respective owners CompilerCompiler (HW)(HW) DataData TypesTypes
64-bit Integer
2x32-bit SIMD Integer
4x16-bit SIMD Integer
8x8-bit SIMD Integer
64-bit DP F.P.
2x32-bit SIMD SP-F.P.
IA-64 All common data types are supported
Copyright © 2000, Intel Corporation. All rights reserved. 9 *Other brands and names are the property of their respective owners 6464--bitbit MemoryMemory AccessAccess 18 BILLION Giga Bytes accessible • 264 == 18,446,744,073,709,551,616 Byte addressable access with 64-bit pointers • 64-bit virtual address space • HW support for 32-bit pointers Access granularity and alignmentalignment • 1,2,4,8,10,16 bytes • Alignment on naturally aligned boundaries is recommended • Instructions are always 16-byte aligned Support for both Big and Little endian byte order Memory hierarchy control
2.1 GB/s front-side bus
Byte Addressable 64-bit IA-64 Byte Addressable 64-bit Virtual Address-Space Copyright © 2000, Intel Corporation. All rights reserved. 10 *Other brands and names are the property of their respective owners MemoryMemory HierarchyHierarchy ControlControl Software can explicitly control memory accesses • Specify levels of the memory hierarchy affected by the access • Allocation and Flush resolution is at least 32-bytes Allocation (Prefetch) • Allocation implies bringing the data close to the CPU • Allocation hints indicate at which level allocation takes place • Used in load, store, and explicit pre-fetch instructions De-allocation and Flush • Invalidates the addressed line in all levels of cache hierarchy • Write data back to memory if necessary
Three levels of cache (full speed L2 cache, 2/4MB L3-cache) & Atomic operation support
IA-64 Control over Cache (De)Allocation
Copyright © 2000, Intel Corporation. All rights reserved. 11 *Other brands and names are the property of their respective owners MemoryMemory AccessAccess OrderingOrdering
Explicit control • Memory Fence mf - ensures all prior memory operations are seen prior to all future memory operations • Acquire Load ld.acq - ensure I am seen prior to all future memory operations • Release store st.rel - ensure that all prior memory operations are seen prior to me • Synchronize instruction caches sync.i - Ensure all instruction caches have seen all prior flush cache instructions Implicit - applicable to semaphore instructions • xchg Exchange mem and General Register (GR) • cmpxchg Conditional exchange of mem and GR • fetchadd Add immediate to memory Strong ordering model is compatible with IA-32 Ordering
Copyright © 2000, Intel Corporation. All rights reserved. 12 *Other brands and names are the property of their respective owners LargeLarge RegisterRegister SetSet Predicate Integer Registers FP Registers Branch Registers Registers NaT 63 0 63 0 81 0 63 bit 0 GR0 0 FR0 + 0.0 BR0 GR1 FR1 + 1.0 PR0 1 BR7 PR1 GR31 FR31 GR32 FR32 PR15 PR16
GR127 FR127 PR63
32 Static 32 Static 16 Static
96 Framed, Rotating 96 Rotating 48 Rotating
Remove Resource IA-64 Remove Resource Bottlenecks
Copyright © 2000, Intel Corporation. All rights reserved. 13 *Other brands and names are the property of their respective owners ContextContext SwitchSwitch
“Normal”“Normal” – full context switch saves all registers – GR and FR “Lazy”“Lazy” – saves only GR registers or specific range (GR0-31, GR32-GR127) “Fast”“Fast” – doesn’t save any registers and uses 16 separate banked “shadow” registers ‘(OS only, GR16’-GR31’) – e.g. interrupt and exception handling
Copyright © 2000, Intel Corporation. All rights reserved. 14 *Other brands and names are the property of their respective owners RegisterRegister StackStack GRs 0-31 are global to all procedures Stacked registers begin at GR32 and are local to each procedure 127 Each procedure’s registerregister stackstack frame varies from 0 to 96 registers 96 Stacked Only GRs implement a register stack • The FRs, PRs, and BRs are global to all 32 procedures 31 Register Stack Engine (RSE) 32 Global • Upon stack overflow/underflow, registers 0 are saved/restored to/from a backing store transparently
Optimized CALL/RETURN IA-64 Optimized CALL/RETURN and Parameter Passing
Copyright © 2000, Intel Corporation. All rights reserved. 15 *Other brands and names are the property of their respective owners RegisterRegister StackStack inin WorkWork Call changes frame to contain only the caller’s output Alloc instr. sets the frame region to the desired size • Three architecture parameters: local, output, and rotating Return restores the stack frame of the caller 56 Outputs 48 Virtual Local 52 52 Outputs Outputs (Inputs) Outputs 46 46 32 32
Local Local
(Inputs) (Inputs) 32 Call Alloc Ret 32 PROC APROC B PROC B PROC A
Avoids Register Spill/Fill IA-64 Avoids Register Spill/Fill among Procedure Calls
Copyright © 2000, Intel Corporation. All rights reserved. 16 *Other brands and names are the property of their respective owners RegisterRegister StackStack EngineEngine
allocate 127 Stack frame Stack frame Stack frame 56 D E F Outputs release 48 Stack frame Local C 52 (Inputs) Stack frame 46 Stack frame 32 B Logical save Stack frame A Stack frame Stack frame 32 B A restore 31 Global Register 0 Physical IA-64 GR (Integer) Registers only
Copyright © 2000, Intel Corporation. All rights reserved. 17 *Other brands and names are the property of their respective owners Predication:Predication: ControlControl FlowFlow toto DataData FlowFlow
Traditional Arch. IA-64
cmp a, b if cmp a, b p1,p2 jump.eq p2 X=1 p1 X=2 Load X=1 else Conditional execution based on jump qualifying predicate X=2 then 64 predicate registers Can be combined with logical Load operations
Removes/Reduces Branches and IA-64 Removes/Reduces Branches and Enables Parallel Execution
Copyright © 2000, Intel Corporation. All rights reserved. 18 *Other brands and names are the property of their respective owners PredicationPredication ...... UnpredictableUnpredictable branchesbranches removedremoved – Misprediction penalties eliminated BasicBasic blockblock sizesize increasesincreases – Compiler has a larger scope to find ILP ILPILP withinwithin thethe basicbasic blockblock increasesincreases – Both “then” and “else” executed in parallel WiderWider machinesmachines areare betterbetter utilizedutilized
Predication Enables and IA-64 Predication Enables and Enhances ILP
Copyright © 2000, Intel Corporation. All rights reserved. 19 *Other brands and names are the property of their respective owners ParallelParallel ComparesCompares
Three new types of compares: • AND: both target predicates set FALSE if compare is false • OR: both target predicates set TRUE if compare is true • ANDOR: if true, stets one TRUE, set other FALSE
A A B C
B D
C IA-64 Reduces Critical Path D
Copyright © 2000, Intel Corporation. All rights reserved. 20 *Other brands and names are the property of their respective owners SoftwareSoftware PipeliningPipelining Sequential Loop Software-Pipelined Loop load compute store Time Time Time Time
Traditional architectures use loop unrolling – Results in code expansion and increased cache misses IA-64 Software Pipelining uses rotating registers – Allows overlapping execution of multiple loop instances Predication controls the pipeline stages Provides Direct Support for IA-64 Provides Direct Support for Software Pipelining Copyright © 2000, Intel Corporation. All rights reserved. 21 *Other brands and names are the property of their respective owners SoftwareSoftware PipeliningPipelining
IAIA--6464 featuresfeatures thatthat makemake thisthis possiblepossible – Full predication to define pipeline stages – Special branch handling features – Loop branches – Special loop registers (LC, EC) – Register rotation: removes loop copy overhead – Predicate rotation: removes prologue & epilog TraditionalTraditional architecturesarchitectures useuse looploop unrollingunrolling – High overhead: extra code for loop body, prologue and epilog – Consumes a large number of registers
Copyright © 2000, Intel Corporation. All rights reserved. 22 *Other brands and names are the property of their respective owners SoftwareSoftware PipeliningPipelining
p16 p17 p18 p19 1 0 0 0
1 1 0 0 Prologue 1 1 1 0 1 1 1 1 Kernel 1 1 1 1 0 1 1 1 Loop Iteration Loop Loop Iteration Loop 0 0 1 1 Epilogue 0 0 0 1 load compute store
Copyright © 2000, Intel Corporation. All rights reserved. 23 *Other brands and names are the property of their respective owners SoftwareSoftware PipeliningPipelining
Prologue IA-64 Features • Rotating Registers • Loop branches Kernel • Full predication • Rotating Predicates
Kernel-only code using stage predicates Epilogue
SWP Applicable to many IA-64 SWP Applicable to many Loops in IA-64
Copyright © 2000, Intel Corporation. All rights reserved. 24 *Other brands and names are the property of their respective owners RegisterRegister RotationRotation GR32-127 and FR32-127 can rotate (specified range) Separate rotating register base for each set (GR, FR) Loop branches decrement all register rotating bases (RRB) Instructions contain a “virtual” register number • physical register # = RRB + virtual register #
0 1 2 3 4 RRB
36 32
35 32
34 32
32
Phys. Register Phys. 33 Phys. Register Phys. 33
32 32
Predicate register range also rotates.
Copyright © 2000, Intel Corporation. All rights reserved. 25 *Other brands and names are the property of their respective owners ControlControl && DataData SpeculationSpeculation
instr. 1 instr. 1 instr. 2 instr. 2 branch Barrier st[?] Barrier
ld r1= ld r1= use = r1 use = r1
Control Speculation Data Speculation moves loads above moves loads above branches / calls possibly conflicting stores Speculation reduces the impact IA-64 Speculation reduces the impact of memory latency
Copyright © 2000, Intel Corporation. All rights reserved. 26 *Other brands and names are the property of their respective owners ControlControl SpeculationSpeculation IA-64 Traditional Arch. ld.s r1= Detect exception P r o instr. 1 instr. 1 p a g instr. 2 instr. 2 a t e e branch branch x Barrier c e p t iio n
ld r1= chk.s r1 Deliver exception use = r1 use = r1
Control Speculation moves loads above branches • Detected exception indicated using NaT bit / NaTVal Check raises detecteddetected exceptionsexceptions Branch barrier broken to minimize memory latency
Copyright © 2000, Intel Corporation. All rights reserved. 27 *Other brands and names are the property of their respective owners HoistingHoisting UsesUses IA-64 ld.s r1= Traditional Arch. instr. 1 instr. 1 use = r1 Speculative use instr. 2 instr. 2 branch Barrier branch
Recovery code ld r1= chk.s r1 ld r1= use = r1 use = r1 branch All computation instructions propagate NaTs to reduce number of checks to allow single check on results Compares also propagates when writing predicates
Copyright © 2000, Intel Corporation. All rights reserved. 28 *Other brands and names are the property of their respective owners DataData SpeculationSpeculation IA-64 Traditional Arch. ld.a r1= instr. 1 instr. 1 instr. 2 instr. 2 st[?] Barrier st[?]
ld r1= ld.c r1 use = r1 use = r1
Data Speculation moves loads above possiblypossibly conflicting stores • Keeps track of load addresses used in advance (ALAT) Advanced-loaded data can be used speculatively
Copyright © 2000, Intel Corporation. All rights reserved. 29 *Other brands and names are the property of their respective owners HoistingHoisting UsesUses IA-64 ld.a r1= Traditional Arch. instr. 1 instr. 1 use = r1 Speculative use instr. 2 instr. 2 st[?] Barrier st[?]
Recovery code ld r1= chk.a r1 ld r1= use = r1 use = r1 branch
Data and Control Speculation IA-64 Data and Control Speculation Can be combined
Copyright © 2000, Intel Corporation. All rights reserved. 30 *Other brands and names are the property of their respective owners AdvancedAdvanced LoadLoad AddressAddress TableTable -- ALATALAT ld.ald.a insertsinserts entriesentries ConflictingConflicting storesstores removeremove entriesentries •• also ld.c.clr, chk.a.clr PresencePresence ofof entryentry indicatesindicates successsuccess •• chk.a branches when no entry is found ld.a reg# = reg# addr reg# addr st[addr] reg# addr chk.a reg# ? : : : : reg# addr
Copyright © 2000, Intel Corporation. All rights reserved. 31 *Other brands and names are the property of their respective owners BranchBranch ArchitectureArchitecture
Branch types Traditional Architecture:
• IP-offset branches (21-bit disp.) compute0 • Indirect branches via 8 branch registers compute1 • HW-supported counted loop control instr. compute2 compare (a==b) Branch Predict hints B: branch_if_eq Target • Advance information on downstream ... branches and branch conditions Target: • Branch hints can be static or dynamic IA-64 Architecture: Multi-way branches hint B,Target (early hint) • Bundle 1-3 branches in a bundle compute0 compute1 • Allow multiple bundles to participate compute2 compare(a==b) Aggressive branch prediction B: branch_if_eq Target Decoupled front end with code prefetch, IA-64 Decoupled front end with code prefetch, ... Branch hints reduce misprediction Target: and overhead
Copyright © 2000, Intel Corporation. All rights reserved. 32 *Other brands and names are the property of their respective owners IntegerInteger ArchitectureArchitecture 128 general registers (64 bit; 1s+63i) Full 64-bit support (as well as 8-16-32-bit) XMA: Integer Multiply-Add instruction (l = i * j + k) Integer multiply is executed in the floating-point unit Data transfer – load, store, GR FR conversion SIMD Integer operations Divide / remainder deferred to software – Based on floating-point operations – High throughput achieved via pipelining
Up to 4 Integer/ALU operations per clock
Excellent Server & Security IA-64 Excellent Server & Security Application Performance
Copyright © 2000, Intel Corporation. All rights reserved. 33 *Other brands and names are the property of their respective owners IAIA--6464 SIMDSIMD -- IntegerInteger Exploits data parallelism with SIMD (Single Instruction Multiple Data) Performance boost for audio, 64 bits video, imaging, streaming etc. 8x8, 4x16, or 2x32 functions GRs treated as 8x8, 4x16, or 2x32 a3 a2 a1 a0 bit elements + Several instruction types b3 b2 b1 b0 • Addition and subtraction, multiply • Pack/Unpack a3+b3 a2+b2 a1+b1 a0+b0 • Left shift, signed/unsigned right shift Compatible with Intel® MMX Technology
IA-64 Performance Boost for all Data Parallel Apps
Copyright © 2000, Intel Corporation. All rights reserved. 34 *Other brands and names are the property of their respective owners FloatingFloating--PointPoint ArchitectureArchitecture Fused Multiply Add Operation – An efficient core computation unit – Greater precision, faster than independent multiply and add Abundant Register resources – 128 registers (32 static, 96 rotating) High Precision Data computations – 82-bit unified internal format for all data types – Full IEEE.754 support Software divide/square-root – High throughput achieved via pipelining Wide (Speculative) Memory Access – Dual Load-Pair support
– Address memory latency 2 independent FP Units – Pre-fetch support Up to 4 DP FP operations per clock Excellent Workstation & HPC IA-64 Excellent Workstation & HPC Up to 4 DP FP operands loaded Application Performance per clock (from L2 cache) Copyright © 2000, Intel Corporation. All rights reserved. 35 *Other brands and names are the property of their respective owners IAIA--6464 SIMDSIMD –– F.P.F.P. Exploits data parallelism with SIMD (Single Instruction Multiple Data) Up to 2x performance boost F.P. Registers treated as two 32 bit 64 bits single precision elements 2x32 bit SP FP elements • Full IEEE.752 compliance • Availability of fast divide (non IEEE) a1 a0 Compatible with Intel® Streaming + SIMD Extensions (SSE) b1 b0
Up to 8 SP FP operations per clock a1+b1 a0+b0 Enables World Class 3D IA-64 Enables World Class 3D Graphics Performance
Copyright © 2000, Intel Corporation. All rights reserved. 36 *Other brands and names are the property of their respective owners FloatingFloating--PointPoint StatusStatus RegisterRegister
FPSR rv sf3 sf2 sf1 sf0 traps 6 13 13 13 13 6 Four Sets for Parallelism & Speculation
Contains dynamic control/status for FP operations Trap/Fault disable bits • trap disables for IEEE exception events • trap disable “D” for denormal operand exception 4 separate status fields 4 computational env. • Each field specifies precision/rounding mode, Trap disables, flush to zero, widest range exponent • Each field reports sticky exception flags
Copyright © 2000, Intel Corporation. All rights reserved. 37 *Other brands and names are the property of their respective owners IntelIntel®® ItaniumItanium™™ ProcessorProcessor
IA-64 starts with Itanium processor Platform with Intel® 460GX chipset Solid progress following first silicon • More than 4 OS running today • Demonstrated real IA-64 Windows 2000 and Linux applications on real hardware • Engineering samples shipping to OEMs, IHVs and ISVs Comprehensive validation underway
Leading-Edge Implementation of IA-64 For World-Class Performance 320M transistors: 25M in CPU, 295M in L3 cache More and better Capacity & Capability
Copyright © 2000, Intel Corporation. All rights reserved. 38 *Other brands and names are the property of their respective owners ExtendingExtending IntelIntel®® ArchitectureArchitecture IA-64 Extends IA Headroom, Madison* ExtendsExtends IAIA Headroom,Headroom, IA-64 perf ScalabilityScalability andand AvailabilityAvailability Deerfield* forfor thethe MostMost IA-64 price/perf DemandingDemanding EnvironmentsEnvironments McKinley*
IA-32 * Intel code names
Future IA-32
Foster* System Performance Performance System System Performance Performance System OutstandingOutstanding PerformancePerformance forfor 3232 BitBit VolumeVolume AppsApps
‘99 ‘00 ‘01 ‘02 .25µ .18µ .13µ
All dates specified are target dates provided for planning purposes only and are subject to change. Copyright © 2000, Intel Corporation. All rights reserved. 39 *Other brands and names are the property of their respective owners ItaniumItanium™™ ProcessorProcessor BlockBlock DiagramDiagram
Copyright © 2000, Intel Corporation. All rights reserved. 40 *Other brands and names are the property of their respective owners ItaniumItanium™™ ProcessorProcessor FeaturesFeatures UpUp toto 66 instructionsinstructions issuedissued perper clockclock 99 instructioninstruction issueissue portsports 22 floatingfloating pointpoint unitsunits 44 integerinteger unitsunits 33 branchbranch unitsunits 33 levelslevels ofof cachecache atat fullfull speedspeed L1L1 andand L2L2 onon--chip,chip, L3L3 (2/4(2/4 MB)MB) onon cartridgecartridge 1010--stagestage inin--orderorder pipelinepipeline
Copyright © 2000, Intel Corporation. All rights reserved. 41 *Other brands and names are the property of their respective owners ItaniumItanium™™ ProcessorProcessor MemoryMemory HierarchyHierarchy Memory L1 Caches (on-chip) Itanium™ Processor • Data Cache Int 2 x 8 bytes/clk – 4-way, 32 byte cache lines FP 2 x 16 bytes/clk – FP loads bypassed to L2
• Instruction Cache L2 L1 D 2 x 8 bytes/clkReg. File – 4-way, 32 byte cache lines 32 bytes/clk L2 Cache (on-chip) L1 I • Unified instr. & data cache – 6-way, 64 byte cache lines 16 bytes/clk L3 Cache (on cartridge) L3 (2/4 MB) • Full speed unified 2.1GB/s – 2/4 MBytes Itanium™ Processor Cartridge – 4-way, 64 byte cache lines Memory • Frontside Bus Non-Blocking Caches – 2.1 GBytes/sec
Copyright © 2000, Intel Corporation. All rights reserved. 42 *Other brands and names are the property of their respective owners ItaniumItanium™™ ProcessorProcessor PipelinePipeline
6-wide EPIC hardware under compiler control – Parallel hardware and control for predication & speculation – Efficient mechanism for enabling register stacking & rotation – Software-enhanced branch prediction 10-stage in-order pipeline designed for: – Single cycle ALU (4 ALUs globally bypassed) – Low latency from data cache Dynamic support for run-time optimization – Decoupled front end with prefetch to hide fetch latency – Non-blocking caches, register scoreboard to hide load latency – Aggressive branch prediction to reduce branch penalty
Copyright © 2000, Intel Corporation. All rights reserved. 43 *Other brands and names are the property of their respective owners 1010 StageStage InIn--OrderOrder CoreCore PipelinePipeline
Front-End Execution • Pre-fetch/fetch up to 6 • 4 single cycle ALUs, 2 instructions/cycle ld/str • Hierarchy of branch • Predicate delivery and prediction branch • Decoupling buffer • NAT / Exception / Retirement
Word-line Exception Rotate Expand Rename Decode Execute Detect IPG FET ROT EXP REN WLD REG EXE DET WRB
Inst. Pointer Fetch Register Write-back Generation Read
Instruction Delivery Operand Delivery • Dispersal of up to 6 • Register read + bypasses instructions on 9 ports • Register scoreboard • Register remapping • Predicated dependencies • Register stack engine
Copyright © 2000, Intel Corporation. All rights reserved. 44 *Other brands and names are the property of their respective owners SelectedSelected InstructionInstruction LatenciesLatencies Instruction Description Latency Class (Cycles) FMAC Floating arithmetic 5 FMISC Floating min, max, … 5 IALU Integer ALU 1 FLD,FLDP FP load (L2 hit) 9 FP load (L3 hit) 24 LD Integer load except ld.c (L1 hit) 2 Integer load except ld.c (L2 hit) 6 Integer load except ld.c (L3 hit) 21
Copyright © 2000, Intel Corporation. All rights reserved. 45 *Other brands and names are the property of their respective owners ItaniumItanium™™ ProcessorProcessor BasedBased PlatformPlatform FeaturesFeatures 1-4 Itanium processor SMP system High clock speed target 800 MHz 6-way instruction issue and execution Up to 64 GB SDRAM memory (460GX) 4.2 GB/s memory bandwidth (peak) 2.1 GB/s system bus 2.1 GB/s I/O bandwidth (peak) 1.0 GB/s AGPAGP ProPro graphics bus 3.2 GFLOPS DP-F.P. peak perf. (6.4 in SP)
SHV Workstation platform: 2-way SHV Server platform: 4-way
Copyright © 2000, Intel Corporation. All rights reserved. 46 *Other brands and names are the property of their respective owners ServerServer PlatformPlatform
528MB/s Address Bus 4x PCI 64/66 Data Bus WXB
WXB 528MB/s SAC SDC WXB
PCI 64/66 PXB Addr/Ctrl Data FWH 64 SDRAM DIMMs (64GB max.) 528MB/s AUDIO LAN PCI 64/66 DMI Intel® 460GX chipset 264MB/s
PCI 64/33
Copyright © 2000, Intel Corporation. All rights reserved. 47 *Other brands and names are the property of their respective owners WorkstationWorkstation PlatformPlatform
Address Bus
Data Bus AGP-4x 1GB/s GXB
SAC SDC WXB
PCI 64/66 PXB Addr/Ctrl Data FWH 16 SDRAM DIMMs (16GB max.) 528MB/s AUDIO LAN PCI 64/66 DMI Intel® 460GX chipset 264MB/s
PCI 64/33
Copyright © 2000, Intel Corporation. All rights reserved. 48 *Other brands and names are the property of their respective owners HPCHPC withwith IntelIntel®® Architecture:Architecture: FromFrom TopTop toto BottomBottom Servers SMP cc:NUMA re tu MPP c e iit h Clusters c r WS A WS n o SMP m m Workstations o Workstations C 4 -6 A II & y Desktops iit 2 iill -3 b IA lla I a c Mobile S
Copyright © 2000, Intel Corporation. All rights reserved. 49 *Other brands and names are the property of their respective owners ItaniumItanium™™ ProcessorProcessor BasedBased SystemSystem DesignsDesigns
256-way cc:NUMA 1-8 way SMP
32 way SMP
16-way cc:NUMA
Copyright © 2000, Intel Corporation. All rights reserved. 50 *Other brands and names are the property of their respective owners HPCHPC MarketMarket SegmentSegment isis ChangingChanging Open Industry Standards using Building Blocks – WindowsNT*, Linux* – OpenMP* – MPI*, PVM*
Proprietary Solutions
Copyright © 2000, Intel Corporation. All rights reserved. 51 *Other brands and names are the property of their respective owners IAIA--6464 OperatingOperating SystemsSystems
HP-UX* Monterey Unix* Win64
Modesto*
Trillian OSVs on track for Itanium™ processor
Copyright © 2000, Intel Corporation. All rights reserved. 52 *Other brands and names are the property of their respective owners IAIA--6464 Linux*Linux* (Trillian*(Trillian* Project)Project)
Team includes VA Linux, IBM*,IBM*, Intel,Intel, HP*,HP*, SGI*,SGI*, Cygnus*, CERN*, Red Hat*, SuSE*, TurboLinux*, and Caldera* Running applications • Demonstrated on Itanium™ processor system at IDF (8/99) • Major applications ported to date include Apache* and Sendmail • Development version release available • Full development OS releases from distributors available Open source OS and compilers available http:/www.linuxia64.org
Copyright © 2000, Intel Corporation. All rights reserved. 53 *Other brands and names are the property of their respective owners C/C++DataC/C++Data ModelsModels OSOS ImplementsImplements thethe DataData ModelsModels
ILP32 ILP32 LP64 P64 – int, long and ptr are 32 bits size size size – int, long and ptr are 32 bits (bits) (bits) (bits) – Used by 32-bit OSs int 32 32 32 LP64 long 32 64 32 – int is 32 bits pointer 32 64 64 – long and pointer are 64 bits – Used by 64-bit UNIX OSs P64 (or LLP64) – int and long are 32 bits; pointer is 64 bits – Used by Win64*Win64* and Modesto*
* Third party names and brands are the property of their respective owners Copyright © 2000, Intel Corporation. All rights reserved. 54 *Other brands and names are the property of their respective owners IntelIntel®® CompilerCompiler forfor IAIA--6464 C++ FORTRAN 90 Front End Front End
Profile Guidance Loop Unrolling (PGOPTI) Register Variable Detection Machine instruction lowering Inter-procedural Software Pipelining Constant Prop. Optimizer (IPO) (rotating registers, loop branches) Convert Opt.
Predication (parallel compares) High-Level Strength Reduction Global scheduling Optimizer (HLO) Copy Propagation (control and data speculation, Reassociation multi-way branches) Machine Redundancy Elim. Block ordering (branch hints) Independent Dead Store Elim. Global register allocation Optimizer (IL0) (register stack, ALAT, UNAT) Dead Code Elim. Function splitting Code Generator (I cache and TLB locality) (ECG)
Copyright © 2000, Intel Corporation. All rights reserved. 55 *Other brands and names are the property of their respective owners IAIA--6464 HPCHPC CompilersCompilers && ToolsTools C/C++,C/C++, FXX,FXX, Java,Java, …… OpenMPOpenMP MPI,MPI, PVMPVM PerformancePerformance LibrariesLibraries VtuneVtune,, …… GNU
® Visual Fortran
MKL
Copyright © 2000, Intel Corporation. All rights reserved. 56 *Other brands and names are the property of their respective owners IAIA--6464 ApplicationApplication BenefitsBenefits
OutstandingOutstanding PerformancePerformance RemovesRemoves performanceperformance bottlenecksbottlenecks Large register files High Parallelism Predication SW pipelining support Memory latency hiding 6464--bitsbits allowsallows biggerbigger addressaddress spacespace IEEEIEEE--accurateaccurate floatingfloating pointpoint AbilityAbility toto runrun IAIA--3232 applicationsapplications
Copyright © 2000, Intel Corporation. All rights reserved. 57 *Other brands and names are the property of their respective owners IAIA--6464 UserUser BenefitsBenefits
Big in-memory data structures and DB Large file system and data files Efficient large integer calculations Fast 64-bit F.P. calculations Fast Security processing More and faster transactions More services Higher throughput Improved availability and manageability
Copyright © 2000, Intel Corporation. All rights reserved. 58 *Other brands and names are the property of their respective owners Intel:Intel: MoreMore ThanThan JustJust MicroprocessorsMicroprocessors
Copyright © 2000, Intel Corporation. All rights reserved. 59 *Other brands and names are the property of their respective owners IAIA--6464 :: TheThe FutureFuture forfor HPCHPC
http://developer.intel.com/
http://developer.intel.com/design/ia64/ http://developer.intel.com/technology/itj/q41999.htm GlossaryGlossary GlossaryGlossary
ALAT (Advanced Load Address Table) - cache used for data speculation which stores the most recent advanced load addresses ALoad/Acheck - advanced load/check (Data Speculation) Basic Block - code which is between two branches; if one instruction in the block of code executes, then all instructions in that block will also execute Control Speculation - the execution of an operation before the branch which guards it; used to hide memory latency Data Speculation - the execution of a memory load prior to a store that precedes it, and that may potentially alias it; used to hide memory latency
Copyright © 2000, Intel Corporation. All rights reserved. 62 *Other brands and names are the property of their respective owners GlossaryGlossary
IA-32 - the name for Intel’s current ISA (32-bit and 16-bit) IA-32 System Environment - the system environment of an IA- 64 processor as defined by the Pentium processor and Pentium Pro processor IA-64 – Intel® 64-bit Architecture is composed of the 64-bit ISA and IA-32; IA-64 integrates the two into a single architectural definition IA-64 Firmware - the Processor Abstraction Layer and the System Abstraction Layer IA-64 System Environment - IA-64 operating system with privileged resources along with capability to support the execution of existing IA-32 applications Instruction Set Architecture (ISA) - defines application level resources which include: user-level instructions, addressing modes, segmentation, and user visible register files
Copyright © 2000, Intel Corporation. All rights reserved. 63 *Other brands and names are the property of their respective owners GlossaryGlossary NaT bit/NaT Value (Not a Thing) - used with control speculation to indicate that a number stored in a general or floating-point register is not valid Predication - the conditional execution of an instruction; used to remove branches from code Processor Abstraction Layer (PAL) - the IA-64 firmware layer which abstracts IA-64 processor features that are implementation dependent Sload/SCheck - speculative load/check (control speculation) System Abstraction Layer (SAL) - the IA-64 firmware layer which abstracts IA-64 system features that are implementation dependent System Environment - defines processor specific operating system resources which include: exception and interruption handling, virtual and physical memory management, system register state, and privileged instructions
Copyright © 2000, Intel Corporation. All rights reserved. 64 *Other brands and names are the property of their respective owners BackupBackup SWSW PipelinedPipelined LoopLoop ExampleExample DAXPYDAXPY innerinner looploop :: dy[i]dy[i] == dy[i]dy[i] ++ (da(da ** dx[i])dx[i]) – 2 loads, 1 fma, 1 store / iteration MachineMachine assumptionsassumptions – can do 2 loads, 1 store, 1 fma, 1 br / cycle – load latency of 2 clocks – fma latency of 1 clock (not realistic, but good for example) SpecialSpecial RegistersRegisters – LC: Loop Counter
Copyright © 2000, Intel Corporation. All rights reserved. 66 *Other brands and names are the property of their respective owners Example:Example: PipelinePipeline
Each column represents 1 source iteration
load dx,dy
dy + da * dx store dy
Copyright © 2000, Intel Corporation. All rights reserved. 67 *Other brands and names are the property of their respective owners ExampleExample CodeCode
.rotf dx[3], dy[3], tmp[2]
mov ar.lc = 3 // #iterations-1 mov ar.ec = 4 // #stages mov pr.rot = 0x10000 ;; looptop: (p16) ldfd dx[0] = [dxsp],8 (p16) ldfd dy[0] = [dysp],8 (p18) fma.d tmp[0] = da, dx[2], dy[2] (p19) stfd [dydp] = tmp[1],8 br.ctop looptop ;;
Copyright © 2000, Intel Corporation. All rights reserved. 68 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution
Execution Sequence ...
19: ? (p19) 18: ? (p18) 17: ? 16: ? (p16) 63: ? (p63) . RRB=0 LC=? EC=? BeforeBeforeBefore InitializationInitializationInitialization Copyright © 2000, Intel Corporation. All rights reserved. 69 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution
Execution Sequence ... (p16) ldx (p16) ldy (p18) fma (p19) st
19: 0 (p19) 18: 0 (p18) 17: 0 16: 1 (p16) 63: 0 (p63) . RRB=0 LC=3 EC=4 InitializationInitializationInitialization Copyright © 2000, Intel Corporation. All rights reserved. 70 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution
Execution Sequence ... (p16) ldx (p16) ldy (p18) fma (p19) st
19: 0 (p19) 18: 0 (p18) 17: 0 16: 1 (p16) 63: 0 (p63) . RRB=0 LC=3 EC=4 PrologueProloguePrologue Copyright © 2000, Intel Corporation. All rights reserved. 71 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution
Execution Sequence (p16) ld (p16) ld (p18) fma (p19) st ...... x y (p16) ldx (p16) ldy (p18) fma (p19) st
18:19:19: 000 (p19) 17:18:18: 000 (p18) 16:17:17: 100 63:16:16: 111 (p16) 1 62:63:63: 010 (p63) ...... LC=3 EC=4 RRB=-1RRB=0 LC=2 EC=4 BranchBranchBranch 111 Copyright © 2000, Intel Corporation. All rights reserved. 72 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution
Execution Sequence (p16) ld (p16) ld (p18) fma (p19) st ...... x y (p16) ldx (p16) ldy (p18) fma (p19) st
(p16) ldx (p16) ldy (p18) fma (p19) st 17:18:18: 00 (p19) 16:17:17: 100 (p18) 63:16:16: 11 62:63:63: 11 (p16) 1 61:62:62: 010 (p63) ...... RRB=-1RRB=-2 LC=2LC=1 EC=4 BranchBranchBranch 222 Copyright © 2000, Intel Corporation. All rights reserved. 73 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution
Execution Sequence ...... (p16) ldx (p16) ldy (p18) fma (p19) st (p16) ldx (p16) ldy (p18) fma (p19) st
(p16) ldx (p16) ldy (p18) fma (p19) st (p16) ld (p16) ld (p18) fma (p19) st 17:16:17: 010 (p19) x y 16:63:16: 11 (p18) 63:62:63: 11 62:61:62: 11 (p16) 1 61:60:61: 010 (p63) .... RRB=-3RRB=-2 LC=1LC=0 EC=4 BranchBranchBranch 333 Copyright © 2000, Intel Corporation. All rights reserved. 74 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution
Execution Sequence (p16) ld (p16) ld (p18) fma (p19) st ...... x y (p16) ldx (p16) ldy (p18) fma (p19) st
(p16) ldx (p16) ldy (p18) fma (p19) st (p16) ld (p16) ld (p18) fma (p19) st 63:16: 11 (p19) x y (p16) ldx (p16) ldy (p18) fma (p19) st 62:63: 11 (p18) 61:62: 11 60:61: 01 (p16) 0 59:60: 00 (p63) .... RRB=-3RRB=-4 LC=0 EC=4EC=3 BranchBranchBranch 444 Copyright © 2000, Intel Corporation. All rights reserved. 75 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution
Execution Sequence (p16) ld (p16) ld (p18) fma (p19) st ...... x y (p16) ldx (p16) ldy (p18) fma (p19) st
(p16) ldx (p16) ldy (p18) fma (p19) st (p16) ld (p16) ld (p18) fma (p19) st 62:63: 11 (p19) x y (p16) ldx (p16) ldy (p18) fma (p19) st 61:62: 11 (p18) (p16) ldx (p16) ldy (p18) fma (p19) st 60:61: 01 59:60: 00 (p16) 0 58:59: 00 (p63) .... RRB=-5RRB=-4 LC=0 EC=3EC=2 BranchBranchBranch 555 Copyright © 2000, Intel Corporation. All rights reserved. 76 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution
Execution Sequence ...... (p16) ldx (p16) ldy (p18) fma (p19) st (p16) ldx (p16) ldy (p18) fma (p19) st
(p16) ldx (p16) ldy (p18) fma (p19) st (p16) ld (p16) ld (p18) fma (p19) st 61:62: 11 (p19) x y (p16) ldx (p16) ldy (p18) fma (p19) st 60:61: 01 (p18) (p16) ldx (p16) ldy (p18) fma (p19) st 59:60: 0 59:60: 0 (p16) ldx (p16) ldy (p18) fma (p19) st 58:59: 00 (p16) 0 57:58: 00 (p63) .... RRB=-5RRB=-6 LC=0LC=0 EC=2 EC=1 BranchBranchBranch 666 Copyright © 2000, Intel Corporation. All rights reserved. 77 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution
Execution Sequence ...... (p16) ldx (p16) ldy (p18) fma (p19) st (p16) ldx (p16) ldy (p18) fma (p19) st
(p16) ldx (p16) ldy (p18) fma (p19) st (p16) ld (p16) ld (p18) fma (p19) st 60:61: 01 (p19) x y (p16) ldx (p16) ldy (p18) fma (p19) st 59:60: 00 (p18) (p16) ldx (p16) ldy (p18) fma (p19) st 58:59: 0 58:59: 0 (p16) ldx (p16) ldy (p18) fma (p19) st 57:58: 00 (p16) fall through 0 56:57: 00 (p63) .... RRB=-7RRB=-6 LC=0 EC=1EC=0 BranchBranchBranch 777 Copyright © 2000, Intel Corporation. All rights reserved. 78 *Other brands and names are the property of their respective owners