<<

Overview ® Itanium™ Processor Microarchitecture Overview

Harsh Sharangpani Principal and IA-64 Microarchitecture Manager Intel Corporation

®

Microprocessor Forum October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Unveiling the Intel® Itanium™

l Leading-edge implementation of IA-64 architecture for world-class performance

l New capabilities for systems that fuel the Economy

l Strong progress on initial

®

Microprocessor Forum 2 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Itanium™ Processor Goals

l World-class performance on high-end applications – High performance for commercial servers

-level floating point for technical

l Large memory management with 64-bit addressing

l Robust support for mission critical environments – Enhanced error correction, detection & containment

l Full IA-32 instruction set compatibility in hardware

l Deliver across a broad range of industry requirements – Flexible for a variety of OEM designs and operating systems

DeliverDeliver world-classworld-class performanceperformance andand featuresfeatures forfor serversservers && workstationsworkstations andand emergingemerging internetinternet applicationsapplications

®

Microprocessor Forum 3 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview EPIC Design Philosophy

ì Maximize performance via EPICEPIC hardware & synergy ì Advanced features enhance instruction level parallelism ìPredication, Speculation, ... ì Massive hardware resources for parallel execution VLIW OOO / SuperScalar ì High performance EPIC building block RISC

CISC Performance Time AchievingAchieving performanceperformance atat thethe mostmost

® fundamentalfundamental levellevel Microprocessor Forum 4 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Itanium™ EPIC Design Maximizes SW-HW Synergy

Architecture Features programmed by : Explicit Register Branch Parallelism Stack Predication Hints & Rotation Data & Control Speculation

Micro-architecture Features in hardware:

Fetch Issue Register Control Parallel Resources MemoryMemory Handling & Dependencies Bypasses SubsystemHints 4 Integer + Fast, Simple 6-Issue 4 MMX Units 128 GR & Instruction 128 FR, 2 FMACs Three Register (4 for SSE) levels of & Branch Remap cache: & Predictors 2 LD/ST units L1, L2, L3 Stack Engine 32 entry ALAT

® Speculation Deferral Management Microprocessor Forum 5 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Breakthrough Levels of Parallelism

M F I M F I 6 instructions provides 12 parallel ops/ (SP: 20 parallel ops/clock) •Load 4 DP (8 SP) 2 ALU ops for digital content creation ops via 2 ld-pair & scientific •2 ALU oper 4 DP FLOPS (post incr) (8 SP FLOPS) 6 instructions M I B M I B provides 8 parallel ops / clock for enterprise & Internet applications 2 Loads + 1 Branch Hint + 2 ALU ops 2 ALU ops 1 Branch instr (post incr) ItaniumItanium parallelismparallelism™™ delivers deliversthanthan anyany greatergreater contemporarycontemporary instructioninstruction processorprocessor levellevel ®

Microprocessor Forum 6 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Highlights of the Itanium™

l 6-Wide EPIC hardware under precise compiler control – Parallel hardware and control for predication & speculation – Efficient mechanism for enabling register stacking & rotation – Software-enhanced branch prediction l 10-stage in-order pipeline with cycle time designed for: – Single cycle ALU (4 ALUs globally bypassed) – Low latency from data cache l Dynamic support for run-time optimization – Decoupled front end with prefetch to hide fetch latency – Aggressive branch prediction to reduce branch penalty – Non-blocking caches and register scoreboard to hide load latency

Parallel,Parallel, deep,deep, andand dynamicdynamic pipelinepipeline

® designeddesigned forfor maximummaximum throughputthroughput Microprocessor Forum 7 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview 10 Stage In-Order Core Pipeline

Front End Execution •Pre-fetch/Fetch of up • 4 single cycle ALUs, 2 ld/str to 6 instructions/cycle • Advanced load control •Hierarchy of branch • Predicate delivery & branch predictors • Nat/Exception//Retirement •Decoupling buffer

WORD-LINE EXPAND RENAME DECODE REGISTER READ

IPG FET ROT EXP RENWLD REG EXE DET WRB

INST POINTER FETCH ROTATE EXECUTEEXCEPTION WRITE-BACK GENERATION DETECT

Instruction Delivery Operand Delivery •Dispersal of up to 6 • Reg read + Bypasses instructions on 9 ports • Register scoreboard •Reg. remapping • Predicated •Reg. stack engine dependencies

®

Microprocessor Forum 8 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Frontend: Prefetch & Fetch IPG FET ROT l SW-triggered prefetch loads target code early using br hints l Streaming prefetch of large blocks via hint on branch l Early prefetch of small blocks via BRP instruction l I-Fetch of 32 /clock feeds an 8-bundle decoupling buffer l Buffer allows front-end to fetch even when back-end is stalled l Hides instruction cache misses and branch bubbles

8 bundle buffer IP

MUX I-Cache & ITLB Fetch bubble Feed

Branch Predictor Structures & Resteers

IPG FET ROT EXP AggressiveAggressive instructioninstruction fetchfetch hardwarehardware toto feedfeed aa ® highlyhighly parallel,parallel, highhigh performanceperformance machinemachine Microprocessor Forum 9 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Front End: Branch Prediction IPG FET ROT l Branch hints combine with predictor hierarchy to improve branch prediction: four progressive resteers l 4 TARs programmed by “importance” hints l 512-entry 2-level predictor provides dynamic direction prediction l 64-entry BTAC contains footprint of upcoming branch targets (programmed by branch hints, and allocated dynamically)

IP I-Cache & ITLB 8 bundle buffer MUX Loop Exit Return Stack Buffer Corrector

Target Adaptive Br Target Branch Branch Address 2-Level Address Address Address Registers Predictor Cache Calc 1 Calc 2

IPG FET ROT EXP IntelligentIntelligent branchbranch predictionprediction improvesimproves ® performanceperformance acrossacross allall workloadsworkloads Microprocessor Forum 10 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview

Instruction Delivery: Dispersal EXP l Stop bits eliminate dependency checking l Templates simplify routing l 1st available dispersal from 6 syllables to 9 issue ports – Keep issuing until stop bit, resource oversubscription, or asymmetry

M0 S0 M1 S1 S2 I0 I1 Dispersal Network F0 F1 S3 S4 S5 B0 B1 B2 ROT EXP

®AchievesAchieves highlyhighly parallelparallel executionexecution withwith simplesimple hardwarehardware Microprocessor Forum 11 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview

Instruction Delivery: Stacking REN l Massive 128 accommodates multiple variable sized procedures via stacking l Eliminates most register spill / fill at procedure interfaces l Achieved transparently to the compiler l Using register remapping via parallel adders l Stack engine performs the few required spill/fills Stall Stack Engine M0 Spill/Fill Injection M1 Integer, FP, I0 & Predicate I1 Renamers F0 EXP F1 REN WLD UniqueUnique registerregister modelmodel enablesenables fasterfaster ® executionexecution ofof object-orientedobject-oriented codecode Microprocessor Forum 12 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview

Operand Delivery WLD REG l Multiported register file + mux hierarchy delivers operands in REG l Unique “Delayed Stall” mechanism used for register dependencies l Avoids pipeline flush or replay on unavailable data l Stall computed in REG, but core pipeline stalls in EXE l Special Operand Latch Manipulation (OLM) captures data returns into operand latches, to mimic register file read Bypass Muxes 128 Entry Integer ALUs Src Register File 8R / 6W WLD REG EXE Src Src Dependency Control OLM comparators Scoreboard Comparators Delayed Stall Dst Preds AvoidsAvoids pipelinepipeline flushflush toto enableenable aa moremore ® effective,effective, higherhigher throughputthroughput pipelinepipeline Microprocessor Forum 13 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Predicate Delivery EXE l All instructions read operands and execute l Canceled at retirement if predicates off l Predicates generated in EXE (by cmps), delivered in DET, & feed into: retirement, branch execution and dependency detection l Smart control network cancels false stalls on predicated dependencies l Dependency detection for cancelled producer/consumer (REG)

REG EXE DET Bypass Muxes Predicate To Dependency Detect (x6) Register To Branch Execution (x3) File Read To Retirement (x6) I-Cmps

F-Cmps

HigherHigher performanceperformance throughthrough removalremoval ofof branchbranch

® penaltiespenalties inin serverserver andand workstationworkstation applicationsapplications Microprocessor Forum 14 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Parallel Branch Execution DET WRB l Speculation + predication result in clusters of branches l Execution of 3 branches/clock optimizes for clustered branches l Branch execution in DET allows cmps-->branches in same issue group

REG EXE DET WRB 3 BR Read Pred. Delivery Retirement of branch bundle IP relative Address Direction Resteer Validation Validation IP

Most recent branch prediction info

ParallelParallel branchbranch hardwarehardware extendsextends performanceperformance

®

Microprocessor Forum benefitsbenefits ofof EPICEPIC15 technologytechnology October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Speculation Hardware DET WRB l Control Speculation support requires minimal hardware l Computed memory exception delivered with data as tokens (NaTs) l NaTs propagate through subsequent executions like source data l Data Speculation enabled efficiently via ALAT structure l 32 outstanding advanced loads l Indexed by reg-ids, keeps partial physical address tag l 0 clk checks: dependent use can be issued in parallel with check

EXE DET Physical WRB Address 32-entry Exception

Address TLB Logic ALAT Adv Ld Status & Memory Spec. Ld. Status (NaT) Subsystem

Check Instruction Check Exception

® EfficientEfficient eliminationelimination ofof memorymemory bottlenecksbottlenecks Microprocessor Forum 16 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Floating Point Features l Native 82-bit hardware provides support for multiple numeric models l 2 pipelined FMACs deliver 4 EP / DP FLOPs/cycle l Performance for security and 3-D graphics l 2 Additional single-precision FMACs for 8 SP FLOPs/cycle (SIMD) l Efficient use of hardware: Integer multiply-add and s/w divide l Balanced with plenty of operand bandwidth from registers / memory 2 stores/clk 6 x 82-bit operands

even 4Mbyte 128 entry L3 L2 82-bit Cache Cache odd RF

2 DP 4 DP Ops/clk Ops/clk (2 x Fld-pair) 2 x 82-bit results ItaniumItanium ® ™™ processorprocessor deliversdelivers industry-leadingindustry-leading Microprocessor Forum floatingfloating pointpoint17 performanceperformance October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Reliability & Availability Features

l Extensive Parity/ECC coverage on processor and – L3 MESI bits sparsely encoded to protect the M-state

– Frontside bus uses special ECC encoding for consecutive 4-bit errors

ITLB Front Side Bus Back Side Bus Data Data L1I L1I L2 L3 Data Data L1D L2 Tag L3 Tag Front Side Bus Back Side Bus Command/Address DTLB Command/Address Command/Address

1x ECC Correction, 2 x ECC detection Parity coverage w/ enhanced MCA

ComprehensiveComprehensive integrityintegrity forfor high-endhigh-end applicationsapplications ®

Microprocessor Forum 18 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Enhanced Machine Check Architecture

Error Type Signaling Example Benefit

Enhanced Corrected by CPU; 1xECC L2 data Reliability & current continues CMCI Availability CONTINUE Corrected by ; Enhanced current process CMCI I-cache parity Reliability & continues Availability

Affected process Enhanced RECOVER terminated by f/w to OS; LMCA Poisoned data OS is stable Availability

Error is contained, System Bus Enhanced CONTAIN GMCA affected node is taken Address parity Reliability off-line

ItaniumItanium mission-criticalmission-critical™™ reliabilityreliability processorprocessor requiredrequired deliversdelivers byby thethe E-businessE-business ®

Microprocessor Forum 19 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview IA-32 Compatibility

l Itanium™ directly executes IA-32 binary code – Shared caches & execution core increases area efficiency – Dynamic scheduler optimizes performance on legacy binaries

l Seamless Architecture allows full Itanium performance on IA-32 system functions

Compatibility IA-32 Dynamic IA-32 Retirement & Fetch & Decode Scheduler Exceptions

Shared Shared I-Cache Execution Core

Full,Full, efficientefficient IA-32IA-32 instructioninstruction

® compatibilitycompatibility inin hardwarehardware Microprocessor Forum 20 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Intel® Itanium™ Processor Block Diagram

L1 Instruction Cache and ECC ITLB Fetch/Pre-fetch Engine IA-32 Branch Instruction Decode Prediction 8 bundles Queue and Control 9 Issue Ports B B B M M I I F F

Register Stack Engine / Re-Mapping

Branch & Predicate Registers 128 Integer Registers 128 FP Registers L2 Cache L2 Cache L3 Cache Branch Integer Dual-Port L3 Cache

, Exceptions and , Exceptions Units and L1 MM Units Data Floating ALAT ALAT NaTs NaTs , , Cache Point Scoreboard, Predicate Scoreboard, Scoreboard, Predicate Scoreboard, and Units DTLB DTLB SIMD ECC SIMD ECC FMAC FMAC

ECC Bus Controller ECC ECC ECC

®

Microprocessor Forum 21 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Itanium™ Processor Status

l Solid progress in weeks following Itanium™ first silicon – More than 4 operating systems running today – Demonstrated 64-bit and running apps – Initial engineering samples shipped to OEMs l Comprehensive functional validation underway – Thorough pre-silicon functional testing included OS kernel on Itanium logic model – Testing including 7 OS’s & many key enterprise and scientific apps – Multiple Intel and OEM test platform configurations (from 2 - 64 processors)

l Planned steps to production in mid 2000 – Completion of functional testing phase through end of 1999 – Performance testing/tuning accelerates in 1H’00 – Broad prototype system deployment by Intel and OEM’s early 2000

®

Microprocessor Forum 22 October 5-6, 1999 Itanium™ Processor Microarchitecture Overview Intel® Itanium™ Processor Summary

l High performance leading-edge design – EPIC provides a breakthrough in hardware/software synergy – Predication, speculation, register stacking, & large L3 for High-End servers – Supercomputer-level GFLOPs performance for technical workstations – 64-bit memory addressability for large data sets l Mission-critical reliability and availability – Machine check implementation maximizes error containment and correction – Comprehensive data integrity for e-Business, Internet and enterprise servers l Full IA-32 instruction level compatibility in hardware l Strong Itanium™ silicon progress and industry support

®

Microprocessor Forum 23 October 5-6, 1999