<<

IAIA--6464 ArchitectureArchitecture

A High-Performance Computing Architecture

Stefan.Lamberts@.com Internet Solutions Group EMEA Technical Marketing Overview Overview July 2000 AgendaAgenda

 Motivation and IA-64 feature overview  IA-64 features • EPIC • Data types, memory and registers • Register stack • Predication and parallel compares • Software pipelining and register rotation • Control & data speculation • Branch architecture • Integer architecture • Floating point architecture  overview  Itanium™ processor based systems overview  Operating systems, tools and programming

Copyright © 2000, Intel Corporation. All rights reserved. 2 *Other brands and names are the property of their respective owners IAIA--64:64: ExtendingExtending thethe IntelIntel®® ArchitectureArchitecture  Designed for High Performance Computing • Scientific • Technical & Engineering • Business  New EPIC Technology  IA-64 Architecture uses EPIC  Itanium™ processor is the first implementation of IA-64

Copyright © 2000, Intel Corporation. All rights reserved. 3 *Other brands and names are the property of their respective owners PerformancePerformance LimitersLimiters  Parallelism not fully utilized • Existing architectures cannot exploit sufficient parallelism in integer code to feed a wide in-order implementation  Branches • Even with perfect branch prediction, small basic blocks of code do not fully utilize machine width  Procedure Calls • Software modularity is becoming standard resulting call/return overhead  Memory latency and address space • Increasing relative to processor cycle time (larger miss penalties) and limited address space

IA-64 overcomes these limitations, IA-64 IA-64 overcomes these limitations, and more !

Copyright © 2000, Intel Corporation. All rights reserved. 4 *Other brands and names are the property of their respective owners IAIA--6464 ArchitecturalArchitectural FeaturesFeatures  64-bit Address Flat Memory Model  Explicit Parallel Instruction Computing Parallelism Explicit Parallel Instruction Computing  Large Register Files  Automatic Register Stack Engine  Predication Branches  Software Pipelining Support  Register Rotation  Sophisticated Branch Architecture Procedure  Calls Loop Control Hardware  Control & Data Speculation  Cache Control Memory  Powerful Integer Architecture Latency  Advanced Floating Point Architecture  Multimedia Support (MMX™ Technology)

IA-64 It’s more than just 64 bits … Copyright © 2000, Intel Corporation. All rights reserved. 5 *Other brands and names are the property of their respective owners NextNext GenerationGeneration ArchitectureArchitecture EPICEPIC DesignDesign PhilosophyPhilosophy  Maximize performance via EPIC hardware & software synergy  Advanced features VLIW OOO / SS enhance instruction level parallelism RISC • Predication, Speculation, …  Massive hardware CISC resources for parallel Performance Performance execution Time BeyondBeyond TraditionalTraditional ArchitecturesArchitectures

Copyright © 2000, Intel Corporation. All rights reserved. 6 *Other brands and names are the property of their respective owners AgendaAgenda

 Motivation and IA-64 feature overview  IA-64 features • EPIC • Data types, memory and registers • Register stack • Predication and parallel compares • Software pipelining and register rotation • Control & data speculation • Branch architecture • Integer architecture • Floating point architecture  Itanium™ processor overview  Itanium™ processor based systems overview  Operating systems, tools and programming

Copyright © 2000, Intel Corporation. All rights reserved. 7 *Other brands and names are the property of their respective owners EPICEPIC InstructionInstruction ParallelismParallelism

Source Code

Instruction Groups  No RAW or WAW (series of bundles) dependencies  Issued in parallel depending on resources

Instruction  3 instructions + Bundles template  (3 Instructions) 3 x 41 bits + 5 bits = 128 bits

Up to 6 instructions executed per clock

Copyright © 2000, Intel Corporation. All rights reserved. 8 *Other brands and names are the property of their respective owners CompilerCompiler (HW)(HW) DataData TypesTypes

64-bit Integer

2x32-bit SIMD Integer

4x16-bit SIMD Integer

8x8-bit SIMD Integer

64-bit DP F.P.

2x32-bit SIMD SP-F.P.

IA-64 All common data types are supported

Copyright © 2000, Intel Corporation. All rights reserved. 9 *Other brands and names are the property of their respective owners 6464--bitbit MemoryMemory AccessAccess  18 BILLION Giga accessible • 264 == 18,446,744,073,709,551,616  addressable access with 64-bit pointers • 64-bit virtual address space • HW support for 32-bit pointers  Access granularity and alignmentalignment • 1,2,4,8,10,16 bytes • Alignment on naturally aligned boundaries is recommended • Instructions are always 16-byte aligned  Support for both Big and Little endian byte order  control

2.1 GB/s front-side

Byte Addressable 64-bit IA-64 Byte Addressable 64-bit Virtual Address-Space Copyright © 2000, Intel Corporation. All rights reserved. 10 *Other brands and names are the property of their respective owners MemoryMemory HierarchyHierarchy ControlControl  Software can explicitly control memory accesses • Specify levels of the memory hierarchy affected by the access • Allocation and Flush resolution is at least 32-bytes  Allocation (Prefetch) • Allocation implies bringing the data close to the CPU • Allocation hints indicate at which level allocation takes place • Used in load, store, and explicit pre-fetch instructions  De-allocation and Flush • Invalidates the addressed line in all levels of • Write data back to memory if necessary

Three levels of cache (full speed L2 cache, 2/4MB L3-cache) & Atomic operation support

IA-64 Control over Cache (De)Allocation

Copyright © 2000, Intel Corporation. All rights reserved. 11 *Other brands and names are the property of their respective owners MemoryMemory AccessAccess OrderingOrdering

 Explicit control • Memory Fence mf - ensures all prior memory operations are seen prior to all future memory operations • Acquire Load ld.acq - ensure I am seen prior to all future memory operations • Release store st.rel - ensure that all prior memory operations are seen prior to me • Synchronize instruction caches sync.i - Ensure all instruction caches have seen all prior flush cache instructions  Implicit - applicable to semaphore instructions • xchg Exchange mem and General Register (GR) • cmpxchg Conditional exchange of mem and GR • fetchadd Add immediate to memory  Strong ordering model is compatible with IA-32 Ordering

Copyright © 2000, Intel Corporation. All rights reserved. 12 *Other brands and names are the property of their respective owners LargeLarge RegisterRegister SetSet Predicate Integer Registers FP Registers Branch Registers Registers NaT 63 0 63 0 81 0 63 bit 0 GR0 0 FR0 + 0.0 BR0 GR1 FR1 + 1.0 PR0 1 BR7 PR1 GR31 FR31 GR32 FR32 PR15 PR16

GR127 FR127 PR63

32 Static 32 Static 16 Static

96 Framed, Rotating 96 Rotating 48 Rotating

Remove Resource IA-64 Remove Resource Bottlenecks

Copyright © 2000, Intel Corporation. All rights reserved. 13 *Other brands and names are the property of their respective owners ContextContext SwitchSwitch

 “Normal”“Normal” – full context saves all registers – GR and FR  “Lazy”“Lazy” – saves only GR registers or specific range (GR0-31, GR32-GR127)  “Fast”“Fast” – doesn’t save any registers and uses 16 separate banked “shadow” registers ‘(OS only, GR16’-GR31’) – e.g. interrupt and exception handling

Copyright © 2000, Intel Corporation. All rights reserved. 14 *Other brands and names are the property of their respective owners RegisterRegister StackStack  GRs 0-31 are global to all procedures  Stacked registers begin at GR32 and are local to each procedure 127  Each procedure’s registerregister stackstack frame varies from 0 to 96 registers 96 Stacked  Only GRs implement a register stack • The FRs, PRs, and BRs are global to all 32 procedures 31  Register Stack Engine (RSE) 32 Global • Upon stack overflow/underflow, registers 0 are saved/restored to/from a backing store transparently

Optimized CALL/RETURN IA-64 Optimized CALL/RETURN and Parameter Passing

Copyright © 2000, Intel Corporation. All rights reserved. 15 *Other brands and names are the property of their respective owners RegisterRegister StackStack inin WorkWork  Call changes frame to contain only the caller’s output  Alloc instr. sets the frame region to the desired size • Three architecture parameters: local, output, and rotating  Return restores the stack frame of the caller 56 Outputs 48 Virtual Local 52 52 Outputs Outputs (Inputs) Outputs 46 46 32 32

Local Local

(Inputs) (Inputs) 32 Call Alloc Ret 32 PROC APROC B PROC B PROC A

Avoids Register Spill/Fill IA-64 Avoids Register Spill/Fill among Procedure Calls

Copyright © 2000, Intel Corporation. All rights reserved. 16 *Other brands and names are the property of their respective owners RegisterRegister StackStack EngineEngine

allocate 127 Stack frame Stack frame Stack frame 56 D E F Outputs release 48 Stack frame Local C 52 (Inputs) Stack frame 46 Stack frame 32 B Logical save Stack frame A Stack frame Stack frame 32 B A restore 31 Global Register 0 Physical IA-64 GR (Integer) Registers only

Copyright © 2000, Intel Corporation. All rights reserved. 17 *Other brands and names are the property of their respective owners Predication:Predication: ControlControl FlowFlow toto DataData FlowFlow

Traditional Arch. IA-64

cmp a, b if cmp a, b  p1,p2 jump.eq p2 X=1 p1 X=2 Load X=1 else  Conditional execution based on jump qualifying predicate X=2 then  64 predicate registers  Can be combined with logical Load operations

Removes/Reduces Branches and IA-64 Removes/Reduces Branches and Enables Parallel Execution

Copyright © 2000, Intel Corporation. All rights reserved. 18 *Other brands and names are the property of their respective owners PredicationPredication ......  UnpredictableUnpredictable branchesbranches removedremoved – Misprediction penalties eliminated  BasicBasic blockblock sizesize increasesincreases – has a larger scope to find ILP  ILPILP withinwithin thethe basicbasic blockblock increasesincreases – Both “then” and “else” executed in parallel  WiderWider machinesmachines areare betterbetter utilizedutilized

Predication Enables and IA-64 Predication Enables and Enhances ILP

Copyright © 2000, Intel Corporation. All rights reserved. 19 *Other brands and names are the property of their respective owners ParallelParallel ComparesCompares

 Three new types of compares: • AND: both target predicates set FALSE if compare is false • OR: both target predicates set TRUE if compare is true • ANDOR: if true, stets one TRUE, set other FALSE

A A B C

B D

C IA-64 Reduces Critical Path D

Copyright © 2000, Intel Corporation. All rights reserved. 20 *Other brands and names are the property of their respective owners SoftwareSoftware PipeliningPipelining Sequential Loop Software-Pipelined Loop load compute store Time Time Time Time

 Traditional architectures use – Results in code expansion and increased cache misses  IA-64 Software Pipelining uses rotating registers – Allows overlapping execution of multiple loop instances  Predication controls the pipeline stages Provides Direct Support for IA-64 Provides Direct Support for Software Pipelining Copyright © 2000, Intel Corporation. All rights reserved. 21 *Other brands and names are the property of their respective owners SoftwareSoftware PipeliningPipelining

 IAIA--6464 featuresfeatures thatthat makemake thisthis possiblepossible – Full predication to define pipeline stages – Special branch handling features – Loop branches – Special loop registers (LC, EC) – Register rotation: removes loop copy overhead – Predicate rotation: removes prologue & epilog  TraditionalTraditional architecturesarchitectures useuse looploop unrollingunrolling – High overhead: extra code for loop body, prologue and epilog – Consumes a large number of registers

Copyright © 2000, Intel Corporation. All rights reserved. 22 *Other brands and names are the property of their respective owners SoftwareSoftware PipeliningPipelining

p16 p17 p18 p19 1 0 0 0

1 1 0 0 Prologue 1 1 1 0 1 1 1 1 Kernel 1 1 1 1 0 1 1 1 Loop Iteration Loop Loop Iteration Loop 0 0 1 1 Epilogue 0 0 0 1 load compute store

Copyright © 2000, Intel Corporation. All rights reserved. 23 *Other brands and names are the property of their respective owners SoftwareSoftware PipeliningPipelining

Prologue IA-64 Features • Rotating Registers • Loop branches Kernel • Full predication • Rotating Predicates

Kernel-only code using stage predicates Epilogue

SWP Applicable to many IA-64 SWP Applicable to many Loops in IA-64

Copyright © 2000, Intel Corporation. All rights reserved. 24 *Other brands and names are the property of their respective owners RegisterRegister RotationRotation  GR32-127 and FR32-127 can rotate (specified range)  Separate rotating register base for each set (GR, FR)  Loop branches decrement all register rotating bases (RRB)  Instructions contain a “virtual” register number • physical register # = RRB + virtual register #

0 1 2 3 4 RRB

36 32

35 32

34 32

32

Phys. Register Phys. 33 Phys. Register Phys. 33

32 32

Predicate register range also rotates.

Copyright © 2000, Intel Corporation. All rights reserved. 25 *Other brands and names are the property of their respective owners ControlControl && DataData SpeculationSpeculation

instr. 1 instr. 1 instr. 2 instr. 2 branch Barrier st[?] Barrier

ld r1= ld r1= use = r1 use = r1

Control Speculation Data Speculation moves loads above moves loads above branches / calls possibly conflicting stores Speculation reduces the impact IA-64 Speculation reduces the impact of memory latency

Copyright © 2000, Intel Corporation. All rights reserved. 26 *Other brands and names are the property of their respective owners ControlControl SpeculationSpeculation IA-64 Traditional Arch. ld.s r1= Detect exception P r o instr. 1 instr. 1 p a g instr. 2 instr. 2 a t e e branch branch x Barrier c e p t iio n

ld r1= chk.s r1 Deliver exception use = r1 use = r1

 Control Speculation moves loads above branches • Detected exception indicated using NaT bit / NaTVal  Check raises detecteddetected exceptionsexceptions  Branch barrier broken to minimize memory latency

Copyright © 2000, Intel Corporation. All rights reserved. 27 *Other brands and names are the property of their respective owners HoistingHoisting UsesUses IA-64 ld.s r1= Traditional Arch. instr. 1 instr. 1 use = r1 Speculative use instr. 2 instr. 2 branch Barrier branch

Recovery code ld r1= chk.s r1 ld r1= use = r1 use = r1 branch  All computation instructions propagate NaTs to reduce number of checks to allow single check on results  Compares also propagates when writing predicates

Copyright © 2000, Intel Corporation. All rights reserved. 28 *Other brands and names are the property of their respective owners DataData SpeculationSpeculation IA-64 Traditional Arch. ld.a r1= instr. 1 instr. 1 instr. 2 instr. 2 st[?] Barrier st[?]

ld r1= ld.c r1 use = r1 use = r1

 Data Speculation moves loads above possiblypossibly conflicting stores • Keeps track of load addresses used in advance (ALAT)  Advanced-loaded data can be used speculatively

Copyright © 2000, Intel Corporation. All rights reserved. 29 *Other brands and names are the property of their respective owners HoistingHoisting UsesUses IA-64 ld.a r1= Traditional Arch. instr. 1 instr. 1 use = r1 Speculative use instr. 2 instr. 2 st[?] Barrier st[?]

Recovery code ld r1= chk.a r1 ld r1= use = r1 use = r1 branch

Data and Control Speculation IA-64 Data and Control Speculation Can be combined

Copyright © 2000, Intel Corporation. All rights reserved. 30 *Other brands and names are the property of their respective owners AdvancedAdvanced LoadLoad AddressAddress TableTable -- ALATALAT  ld.ald.a insertsinserts entriesentries  ConflictingConflicting storesstores removeremove entriesentries •• also ld.c.clr, chk.a.clr  PresencePresence ofof entryentry indicatesindicates successsuccess •• chk.a branches when no entry is found ld.a reg# = reg# addr reg# addr st[addr] reg# addr chk.a reg# ? : : : : reg# addr

Copyright © 2000, Intel Corporation. All rights reserved. 31 *Other brands and names are the property of their respective owners BranchBranch ArchitectureArchitecture

 Branch types Traditional Architecture:

• IP-offset branches (21-bit disp.) compute0 • Indirect branches via 8 branch registers compute1 • HW-supported counted loop control instr. compute2 compare (a==b)  Branch Predict hints B: branch_if_eq  Target • Advance information on downstream ... branches and branch conditions Target: • Branch hints can be static or dynamic IA-64 Architecture:  Multi-way branches hint B,Target (early hint) • Bundle 1-3 branches in a bundle compute0 compute1 • Allow multiple bundles to participate compute2 compare(a==b) Aggressive branch prediction B: branch_if_eq  Target Decoupled front end with code prefetch, IA-64 Decoupled front end with code prefetch, ... Branch hints reduce misprediction Target: and overhead

Copyright © 2000, Intel Corporation. All rights reserved. 32 *Other brands and names are the property of their respective owners IntegerInteger ArchitectureArchitecture  128 general registers (64 bit; 1s+63i)  Full 64-bit support (as well as 8-16-32-bit)  XMA: Integer Multiply-Add instruction (l = i * j + k)  Integer multiply is executed in the floating-point unit  Data – load, store, GR  FR conversion  SIMD Integer operations  Divide / remainder deferred to software – Based on floating-point operations – High throughput achieved via pipelining

Up to 4 Integer/ALU operations per clock

Excellent & Security IA-64 Excellent Server & Security Application Performance

Copyright © 2000, Intel Corporation. All rights reserved. 33 *Other brands and names are the property of their respective owners IAIA--6464 SIMDSIMD -- IntegerInteger  Exploits with SIMD (Single Instruction Multiple Data)  Performance boost for audio, 64 bits video, imaging, streaming etc. 8x8, 4x16, or 2x32 functions  GRs treated as 8x8, 4x16, or 2x32 a3 a2 a1 a0 bit elements +  Several instruction types b3 b2 b1 b0 • Addition and subtraction, multiply • Pack/Unpack a3+b3 a2+b2 a1+b1 a0+b0 • Left shift, signed/unsigned right shift  Compatible with Intel® MMX Technology

IA-64 Performance Boost for all Data Parallel Apps

Copyright © 2000, Intel Corporation. All rights reserved. 34 *Other brands and names are the property of their respective owners FloatingFloating--PointPoint ArchitectureArchitecture  Fused Multiply Add Operation – An efficient core computation unit – Greater precision, faster than independent multiply and add  Abundant Register resources – 128 registers (32 static, 96 rotating)  High Precision Data computations – 82-bit unified internal format for all data types – Full IEEE.754 support  Software divide/square-root – High throughput achieved via pipelining  Wide (Speculative) Memory Access – Dual Load-Pair support

– Address memory latency 2 independent FP Units – Pre-fetch support Up to 4 DP FP operations per clock Excellent & HPC IA-64 Excellent Workstation & HPC Up to 4 DP FP operands loaded Application Performance per clock (from L2 cache) Copyright © 2000, Intel Corporation. All rights reserved. 35 *Other brands and names are the property of their respective owners IAIA--6464 SIMDSIMD –– F.P.F.P.  Exploits data parallelism with SIMD (Single Instruction Multiple Data)  Up to 2x performance boost  F.P. Registers treated as two 32 bit 64 bits single precision elements 2x32 bit SP FP elements • Full IEEE.752 compliance • Availability of fast divide (non IEEE) a1 a0  Compatible with Intel® Streaming + SIMD Extensions (SSE) b1 b0

Up to 8 SP FP operations per clock a1+b1 a0+b0 Enables World Class 3D IA-64 Enables World Class 3D Graphics Performance

Copyright © 2000, Intel Corporation. All rights reserved. 36 *Other brands and names are the property of their respective owners FloatingFloating--PointPoint StatusStatus RegisterRegister

FPSR rv sf3 sf2 sf1 sf0 traps 6 13 13 13 13 6 Four Sets for Parallelism & Speculation

 Contains dynamic control/status for FP operations  Trap/Fault disable bits • trap disables for IEEE exception events • trap disable “D” for denormal operand exception  4 separate status fields  4 computational env. • Each field specifies precision/rounding mode, Trap disables, flush to zero, widest range exponent • Each field reports sticky exception flags

Copyright © 2000, Intel Corporation. All rights reserved. 37 *Other brands and names are the property of their respective owners IntelIntel®® ItaniumItanium™™ ProcessorProcessor

 IA-64 starts with Itanium processor  Platform with Intel® 460GX  Solid progress following first silicon • More than 4 OS running today • Demonstrated real IA-64 and applications on real hardware • Engineering samples shipping to OEMs, IHVs and ISVs  Comprehensive validation underway

Leading-Edge Implementation of IA-64 For World-Class Performance 320M transistors: 25M in CPU, 295M in L3 cache More and better Capacity & Capability

Copyright © 2000, Intel Corporation. All rights reserved. 38 *Other brands and names are the property of their respective owners ExtendingExtending IntelIntel®® ArchitectureArchitecture IA-64 Extends IA Headroom, Madison* ExtendsExtends IAIA Headroom,Headroom, IA-64 perf ScalabilityScalability andand AvailabilityAvailability Deerfield* forfor thethe MostMost IA-64 price/perf DemandingDemanding EnvironmentsEnvironments McKinley*

IA-32 * Intel code names

Future IA-32

Foster* System Performance Performance System System Performance Performance System OutstandingOutstanding PerformancePerformance forfor 3232 BitBit VolumeVolume AppsApps

‘99 ‘00 ‘01 ‘02 .25µ .18µ .13µ

All dates specified are target dates provided for planning purposes only and are subject to change. Copyright © 2000, Intel Corporation. All rights reserved. 39 *Other brands and names are the property of their respective owners ItaniumItanium™™ ProcessorProcessor BlockBlock DiagramDiagram

Copyright © 2000, Intel Corporation. All rights reserved. 40 *Other brands and names are the property of their respective owners ItaniumItanium™™ ProcessorProcessor FeaturesFeatures  UpUp toto 66 instructionsinstructions issuedissued perper clockclock  99 instructioninstruction issueissue portsports  22 floatingfloating pointpoint unitsunits  44 integerinteger unitsunits  33 branchbranch unitsunits  33 levelslevels ofof cachecache atat fullfull speedspeed  L1L1 andand L2L2 onon--chip,chip, L3L3 (2/4(2/4 MB)MB) onon cartridgecartridge  1010--stagestage inin--orderorder pipelinepipeline

Copyright © 2000, Intel Corporation. All rights reserved. 41 *Other brands and names are the property of their respective owners ItaniumItanium™™ ProcessorProcessor MemoryMemory HierarchyHierarchy Memory  L1 Caches (on-chip) Itanium™ Processor • Data Cache Int 2 x 8 bytes/clk – 4-way, 32 byte cache lines FP 2 x 16 bytes/clk – FP loads bypassed to L2

• Instruction Cache L2 L1 D 2 x 8 bytes/clkReg. File – 4-way, 32 byte cache lines 32 bytes/clk  L2 Cache (on-chip) L1 I • Unified instr. & data cache – 6-way, 64 byte cache lines 16 bytes/clk  L3 Cache (on cartridge) L3 (2/4 MB) • Full speed unified 2.1GB/s – 2/4 MBytes Itanium™ Processor Cartridge – 4-way, 64 byte cache lines  Memory • Frontside Bus Non-Blocking Caches – 2.1 GBytes/sec

Copyright © 2000, Intel Corporation. All rights reserved. 42 *Other brands and names are the property of their respective owners ItaniumItanium™™ ProcessorProcessor PipelinePipeline

 6-wide EPIC hardware under compiler control – Parallel hardware and control for predication & speculation – Efficient mechanism for enabling register stacking & rotation – Software-enhanced branch prediction  10-stage in-order pipeline designed for: – Single cycle ALU (4 ALUs globally bypassed) – Low latency from data cache  Dynamic support for run-time optimization – Decoupled front end with prefetch to hide fetch latency – Non-blocking caches, register scoreboard to hide load latency – Aggressive branch prediction to reduce branch penalty

Copyright © 2000, Intel Corporation. All rights reserved. 43 *Other brands and names are the property of their respective owners 1010 StageStage InIn--OrderOrder CoreCore PipelinePipeline

Front-End Execution • Pre-fetch/fetch up to 6 • 4 single cycle ALUs, 2 instructions/cycle ld/str • Hierarchy of branch • Predicate delivery and prediction branch • Decoupling buffer • NAT / Exception / Retirement

Word-line Exception Rotate Expand Rename Decode Execute Detect IPG FET ROT EXP REN WLD REG EXE DET WRB

Inst. Pointer Fetch Register Write-back Generation Read

Instruction Delivery Operand Delivery • Dispersal of up to 6 • Register read + bypasses instructions on 9 ports • Register scoreboard • Register remapping • Predicated dependencies • Register stack engine

Copyright © 2000, Intel Corporation. All rights reserved. 44 *Other brands and names are the property of their respective owners SelectedSelected InstructionInstruction LatenciesLatencies Instruction Description Latency Class (Cycles) FMAC Floating arithmetic 5 FMISC Floating min, max, … 5 IALU Integer ALU 1 FLD,FLDP FP load (L2 hit) 9 FP load (L3 hit) 24 LD Integer load except ld.c (L1 hit) 2 Integer load except ld.c (L2 hit) 6 Integer load except ld.c (L3 hit) 21

Copyright © 2000, Intel Corporation. All rights reserved. 45 *Other brands and names are the property of their respective owners ItaniumItanium™™ ProcessorProcessor BasedBased PlatformPlatform FeaturesFeatures  1-4 Itanium processor SMP system  High clock speed target 800 MHz  6-way instruction issue and execution  Up to 64 GB SDRAM memory (460GX)  4.2 GB/s memory bandwidth (peak)  2.1 GB/s system bus  2.1 GB/s I/O bandwidth (peak)  1.0 GB/s AGPAGP ProPro graphics bus  3.2 GFLOPS DP-F.P. peak perf. (6.4 in SP)

SHV Workstation platform: 2-way SHV Server platform: 4-way

Copyright © 2000, Intel Corporation. All rights reserved. 46 *Other brands and names are the property of their respective owners ServerServer PlatformPlatform

528MB/s Address Bus 4x PCI 64/66 Data Bus WXB

WXB 528MB/s SAC SDC WXB

PCI 64/66 PXB Addr/Ctrl Data FWH 64 SDRAM (64GB max.) 528MB/s AUDIO LAN PCI 64/66 DMI Intel® 460GX chipset 264MB/s

PCI 64/33

Copyright © 2000, Intel Corporation. All rights reserved. 47 *Other brands and names are the property of their respective owners WorkstationWorkstation PlatformPlatform

Address Bus

Data Bus AGP-4x 1GB/s GXB

SAC SDC WXB

PCI 64/66 PXB Addr/Ctrl Data FWH 16 SDRAM DIMMs (16GB max.) 528MB/s AUDIO LAN PCI 64/66 DMI Intel® 460GX chipset 264MB/s

PCI 64/33

Copyright © 2000, Intel Corporation. All rights reserved. 48 *Other brands and names are the property of their respective owners HPCHPC withwith IntelIntel®® Architecture:Architecture: FromFrom TopTop toto BottomBottom Servers SMP cc:NUMA re tu MPP c e iit h Clusters c r WS A WS n o SMP m m o Workstations C 4 -6 A II & y Desktops iit 2 iill -3 b IA lla I a c Mobile S

Copyright © 2000, Intel Corporation. All rights reserved. 49 *Other brands and names are the property of their respective owners ItaniumItanium™™ ProcessorProcessor BasedBased SystemSystem DesignsDesigns

256-way cc:NUMA 1-8 way SMP

32 way SMP

16-way cc:NUMA

Copyright © 2000, Intel Corporation. All rights reserved. 50 *Other brands and names are the property of their respective owners HPCHPC MarketMarket SegmentSegment isis ChangingChanging Open Industry Standards using Building Blocks – WindowsNT*, Linux* – OpenMP* – MPI*, PVM*

Proprietary Solutions

Copyright © 2000, Intel Corporation. All rights reserved. 51 *Other brands and names are the property of their respective owners IAIA--6464 OperatingOperating SystemsSystems

HP-UX* Monterey * Win64

Modesto*

Trillian OSVs on track for Itanium™ processor

Copyright © 2000, Intel Corporation. All rights reserved. 52 *Other brands and names are the property of their respective owners IAIA--6464 Linux*Linux* (Trillian*(Trillian* Project)Project)

 Team includes VA Linux, IBM*,IBM*, Intel,Intel, HP*,HP*, SGI*,SGI*, Cygnus*, CERN*, *, SuSE*, TurboLinux*, and Caldera*  Running applications • Demonstrated on Itanium™ processor system at IDF (8/99) • Major applications ported to date include Apache* and Sendmail • Development version release available • Full development OS releases from distributors available  Open source OS and available  http:/www.linuxia64.org

Copyright © 2000, Intel Corporation. All rights reserved. 53 *Other brands and names are the property of their respective owners C/C++DataC/C++Data ModelsModels OSOS ImplementsImplements thethe DataData ModelsModels

ILP32 ILP32 LP64 P64 – int, long and ptr are 32 bits size size size – int, long and ptr are 32 bits (bits) (bits) (bits) – Used by 32-bit OSs int 32 32 32 LP64 long 32 64 32 – int is 32 bits pointer 32 64 64 – long and pointer are 64 bits – Used by 64-bit UNIX OSs P64 (or LLP64) – int and long are 32 bits; pointer is 64 bits – Used by Win64*Win64* and Modesto*

* Third party names and brands are the property of their respective owners Copyright © 2000, Intel Corporation. All rights reserved. 54 *Other brands and names are the property of their respective owners IntelIntel®® CompilerCompiler forfor IAIA--6464 C++ FORTRAN 90 Front End Front End

Profile Guidance Loop Unrolling (PGOPTI) Register Variable Detection Machine instruction lowering Inter-procedural Software Pipelining Constant Prop. Optimizer (IPO) (rotating registers, loop branches) Convert Opt.

Predication (parallel compares) High-Level Global scheduling Optimizer (HLO) Copy Propagation (control and data speculation, Reassociation multi-way branches) Machine Redundancy Elim. Block ordering (branch hints) Independent Elim. Global Optimizer (IL0) (register stack, ALAT, UNAT) Dead Code Elim. Function splitting Code Generator (I cache and TLB locality) (ECG)

Copyright © 2000, Intel Corporation. All rights reserved. 55 *Other brands and names are the property of their respective owners IAIA--6464 HPCHPC CompilersCompilers && ToolsTools  C/C++,C/C++, FXX,FXX, Java,Java, ……  OpenMPOpenMP  MPI,MPI, PVMPVM  PerformancePerformance LibrariesLibraries  VtuneVtune,, …… GNU

® Visual Fortran

MKL

Copyright © 2000, Intel Corporation. All rights reserved. 56 *Other brands and names are the property of their respective owners IAIA--6464 ApplicationApplication BenefitsBenefits

OutstandingOutstanding PerformancePerformance  RemovesRemoves performanceperformance bottlenecksbottlenecks  Large register files  High Parallelism  Predication  SW pipelining support  Memory latency hiding  6464--bitsbits allowsallows biggerbigger addressaddress spacespace  IEEEIEEE--accurateaccurate floatingfloating pointpoint  AbilityAbility toto runrun IAIA--3232 applicationsapplications

Copyright © 2000, Intel Corporation. All rights reserved. 57 *Other brands and names are the property of their respective owners IAIA--6464 UserUser BenefitsBenefits

 Big in-memory data structures and DB  Large file system and data files  Efficient large integer calculations  Fast 64-bit F.P. calculations  Fast Security processing  More and faster transactions  More services  Higher throughput  Improved availability and manageability

Copyright © 2000, Intel Corporation. All rights reserved. 58 *Other brands and names are the property of their respective owners Intel:Intel: MoreMore ThanThan JustJust MicroprocessorsMicroprocessors

Copyright © 2000, Intel Corporation. All rights reserved. 59 *Other brands and names are the property of their respective owners IAIA--6464 :: TheThe FutureFuture forfor HPCHPC

http://developer.intel.com/

http://developer.intel.com/design/ia64/ http://developer.intel.com/technology/itj/q41999.htm GlossaryGlossary GlossaryGlossary

 ALAT (Advanced Load Address Table) - cache used for data speculation which stores the most recent advanced load addresses  ALoad/Acheck - advanced load/check (Data Speculation)  Basic Block - code which is between two branches; if one instruction in the block of code executes, then all instructions in that block will also execute  Control Speculation - the execution of an operation before the branch which guards it; used to hide memory latency  Data Speculation - the execution of a memory load prior to a store that precedes it, and that may potentially alias it; used to hide memory latency

Copyright © 2000, Intel Corporation. All rights reserved. 62 *Other brands and names are the property of their respective owners GlossaryGlossary

 IA-32 - the name for Intel’s current ISA (32-bit and 16-bit)  IA-32 System Environment - the system environment of an IA- 64 processor as defined by the  processor and Pentium Pro processor  IA-64 – Intel® 64-bit Architecture is composed of the 64-bit ISA and IA-32; IA-64 integrates the two into a single architectural definition  IA-64 - the Processor Abstraction Layer and the System Abstraction Layer  IA-64 System Environment - IA-64 with privileged resources along with capability to support the execution of existing IA-32 applications  Instruction Set Architecture (ISA) - defines application level resources which include: user-level instructions, addressing modes, segmentation, and user visible register files

Copyright © 2000, Intel Corporation. All rights reserved. 63 *Other brands and names are the property of their respective owners GlossaryGlossary  NaT bit/NaT Value (Not a Thing) - used with control speculation to indicate that a number stored in a general or floating-point register is not valid  Predication - the conditional execution of an instruction; used to remove branches from code  Processor Abstraction Layer (PAL) - the IA-64 firmware layer which abstracts IA-64 processor features that are implementation dependent  Sload/SCheck - speculative load/check (control speculation)  System Abstraction Layer (SAL) - the IA-64 firmware layer which abstracts IA-64 system features that are implementation dependent  System Environment - defines processor specific operating system resources which include: exception and interruption handling, virtual and physical memory management, system register state, and privileged instructions

Copyright © 2000, Intel Corporation. All rights reserved. 64 *Other brands and names are the property of their respective owners BackupBackup SWSW PipelinedPipelined LoopLoop ExampleExample  DAXPYDAXPY innerinner looploop :: dy[i]dy[i] == dy[i]dy[i] ++ (da(da ** dx[i])dx[i]) – 2 loads, 1 fma, 1 store / iteration  MachineMachine assumptionsassumptions – can do 2 loads, 1 store, 1 fma, 1 br / cycle – load latency of 2 clocks – fma latency of 1 clock (not realistic, but good for example)  SpecialSpecial RegistersRegisters – LC: Loop

Copyright © 2000, Intel Corporation. All rights reserved. 66 *Other brands and names are the property of their respective owners Example:Example: PipelinePipeline

 Each column represents 1 source iteration

load dx,dy

dy + da * dx store dy

Copyright © 2000, Intel Corporation. All rights reserved. 67 *Other brands and names are the property of their respective owners ExampleExample CodeCode

.rotf dx[3], dy[3], tmp[2]

mov ar.lc = 3 // #iterations-1 mov ar.ec = 4 // #stages mov pr.rot = 0x10000 ;; looptop: (p16) ldfd dx[0] = [dxsp],8 (p16) ldfd dy[0] = [dysp],8 (p18) fma.d tmp[0] = da, dx[2], dy[2] (p19) stfd [dydp] = tmp[1],8 br.ctop looptop ;;

Copyright © 2000, Intel Corporation. All rights reserved. 68 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution

Execution Sequence ...

19: ? (p19) 18: ? (p18) 17: ? 16: ? (p16) 63: ? (p63) . RRB=0 LC=? EC=? BeforeBeforeBefore InitializationInitializationInitialization Copyright © 2000, Intel Corporation. All rights reserved. 69 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution

Execution Sequence ... (p16) ldx (p16) ldy (p18) fma (p19) st

19: 0 (p19) 18: 0 (p18) 17: 0 16: 1 (p16) 63: 0 (p63) . RRB=0 LC=3 EC=4 InitializationInitializationInitialization Copyright © 2000, Intel Corporation. All rights reserved. 70 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution

Execution Sequence ... (p16) ldx (p16) ldy (p18) fma (p19) st

19: 0 (p19) 18: 0 (p18) 17: 0 16: 1 (p16) 63: 0 (p63) . RRB=0 LC=3 EC=4 PrologueProloguePrologue Copyright © 2000, Intel Corporation. All rights reserved. 71 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution

Execution Sequence (p16) ld (p16) ld (p18) fma (p19) st ...... x y (p16) ldx (p16) ldy (p18) fma (p19) st

18:19:19: 000 (p19) 17:18:18: 000 (p18) 16:17:17: 100 63:16:16: 111 (p16) 1 62:63:63: 010 (p63) ...... LC=3 EC=4 RRB=-1RRB=0 LC=2 EC=4 BranchBranchBranch 111 Copyright © 2000, Intel Corporation. All rights reserved. 72 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution

Execution Sequence (p16) ld (p16) ld (p18) fma (p19) st ...... x y (p16) ldx (p16) ldy (p18) fma (p19) st

(p16) ldx (p16) ldy (p18) fma (p19) st 17:18:18: 00 (p19) 16:17:17: 100 (p18) 63:16:16: 11 62:63:63: 11 (p16) 1 61:62:62: 010 (p63) ...... RRB=-1RRB=-2 LC=2LC=1 EC=4 BranchBranchBranch 222 Copyright © 2000, Intel Corporation. All rights reserved. 73 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution

Execution Sequence ...... (p16) ldx (p16) ldy (p18) fma (p19) st (p16) ldx (p16) ldy (p18) fma (p19) st

(p16) ldx (p16) ldy (p18) fma (p19) st (p16) ld (p16) ld (p18) fma (p19) st 17:16:17: 010 (p19) x y 16:63:16: 11 (p18) 63:62:63: 11 62:61:62: 11 (p16) 1 61:60:61: 010 (p63) .... RRB=-3RRB=-2 LC=1LC=0 EC=4 BranchBranchBranch 333 Copyright © 2000, Intel Corporation. All rights reserved. 74 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution

Execution Sequence (p16) ld (p16) ld (p18) fma (p19) st ...... x y (p16) ldx (p16) ldy (p18) fma (p19) st

(p16) ldx (p16) ldy (p18) fma (p19) st (p16) ld (p16) ld (p18) fma (p19) st 63:16: 11 (p19) x y (p16) ldx (p16) ldy (p18) fma (p19) st 62:63: 11 (p18) 61:62: 11 60:61: 01 (p16) 0 59:60: 00 (p63) .... RRB=-3RRB=-4 LC=0 EC=4EC=3 BranchBranchBranch 444 Copyright © 2000, Intel Corporation. All rights reserved. 75 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution

Execution Sequence (p16) ld (p16) ld (p18) fma (p19) st ...... x y (p16) ldx (p16) ldy (p18) fma (p19) st

(p16) ldx (p16) ldy (p18) fma (p19) st (p16) ld (p16) ld (p18) fma (p19) st 62:63: 11 (p19) x y (p16) ldx (p16) ldy (p18) fma (p19) st 61:62: 11 (p18) (p16) ldx (p16) ldy (p18) fma (p19) st 60:61: 01 59:60: 00 (p16) 0 58:59: 00 (p63) .... RRB=-5RRB=-4 LC=0 EC=3EC=2 BranchBranchBranch 555 Copyright © 2000, Intel Corporation. All rights reserved. 76 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution

Execution Sequence ...... (p16) ldx (p16) ldy (p18) fma (p19) st (p16) ldx (p16) ldy (p18) fma (p19) st

(p16) ldx (p16) ldy (p18) fma (p19) st (p16) ld (p16) ld (p18) fma (p19) st 61:62: 11 (p19) x y (p16) ldx (p16) ldy (p18) fma (p19) st 60:61: 01 (p18) (p16) ldx (p16) ldy (p18) fma (p19) st 59:60: 0 59:60: 0 (p16) ldx (p16) ldy (p18) fma (p19) st 58:59: 00 (p16) 0 57:58: 00 (p63) .... RRB=-5RRB=-6 LC=0LC=0 EC=2 EC=1 BranchBranchBranch 666 Copyright © 2000, Intel Corporation. All rights reserved. 77 *Other brands and names are the property of their respective owners LoopLoop ExecutionExecution

Execution Sequence ...... (p16) ldx (p16) ldy (p18) fma (p19) st (p16) ldx (p16) ldy (p18) fma (p19) st

(p16) ldx (p16) ldy (p18) fma (p19) st (p16) ld (p16) ld (p18) fma (p19) st 60:61: 01 (p19) x y (p16) ldx (p16) ldy (p18) fma (p19) st 59:60: 00 (p18) (p16) ldx (p16) ldy (p18) fma (p19) st 58:59: 0 58:59: 0 (p16) ldx (p16) ldy (p18) fma (p19) st 57:58: 00 (p16) fall through 0 56:57: 00 (p63) .... RRB=-7RRB=-6 LC=0 EC=1EC=0 BranchBranchBranch 777 Copyright © 2000, Intel Corporation. All rights reserved. 78 *Other brands and names are the property of their respective owners