Intel® Itanium™ Processor Microarchitecture Overview
Harsh Sharangpani
Principal Engineer and IA-64 Microarchitecture Manager
Intel Corporation
Microprocessor Forum, October 5-6, 1999

Unveiling the Intel® Itanium™ Processor Design
• Leading-edge implementation of IA-64 architecture for world-class performance
• New capabilities for systems that fuel the Internet Economy
• Strong progress on initial silicon

Itanium™ Processor Goals
• World-class performance on high-end applications
  – High performance for commercial servers
  – Supercomputer-level floating point for technical workstations
• Large memory management with 64-bit addressing
• Robust support for mission critical environments
  – Enhanced error correction, detection & containment
• Full IA-32 instruction set compatibility in hardware
• Deliver across a broad range of industry requirements
  – Flexible for a variety of OEM designs and operating systems
Deliver world-class performance and features for servers & workstations and emerging internet applications

EPIC Design Philosophy
• Maximize performance via EPIC hardware & software synergy
• Advanced features enhance instruction level parallelism
  – Predication, Speculation, ...
• Massive hardware resources for parallel execution
• High performance EPIC building block
[Figure: performance versus time; curve labels: CISC, RISC, OOO / SuperScalar, VLIW]
Achieving performance at the most fundamental level

Itanium™ EPIC Design Maximizes SW-HW Synergy
Architecture features programmed by compiler:
• Register Stack & Rotation
• Predication
• Branch Hints
• Explicit Parallelism
• Data & Control Speculation
• Memory Hints
Micro-architecture features in hardware:
• Fetch: Instruction Cache & Branch Predictors
• Issue: Fast, Simple 6-Issue
• Register Handling: 128 GR & 128 FR, Register Remap & Stack Engine
• Control: Bypasses & Dependencies
• Parallel Resources: 4 Integer + 4 MMX Units, 2 FMACs (4 for SSE), 2 LD/ST units, 32 entry ALAT
• Memory Subsystem: Three levels of cache: L1, L2, L3
• Speculation Deferral Management

Breakthrough Levels of Parallelism
• M F I + M F I: 6 instructions provide 12 parallel ops/clock (SP: 20 parallel ops/clock) for digital content creation & scientific computing
  – Load 4 DP (8 SP) ops via 2 ld-pair
  – 2 ALU ops (post incr)
  – 4 DP FLOPS (8 SP FLOPS)
  – 2 ALU ops
• M I B + M I B: 6 instructions provide 8 parallel ops/clock for enterprise & Internet applications
  – 2 Loads + 2 ALU ops (post incr)
  – 2 ALU ops
  – 1 Branch Hint + 1 Branch instr
Itanium™ delivers greater instruction level parallelism than any contemporary processor

Highlights of the Itanium™ Pipeline
• 6-Wide EPIC hardware under precise compiler control
  – Parallel hardware and control for predication & speculation
  – Efficient mechanism for enabling register stacking & rotation
  – Software-enhanced branch prediction
• 10-stage in-order pipeline with cycle time designed for:
  – Single cycle ALU (4 ALUs globally bypassed)
  – Low latency from data cache
• Dynamic support for run-time optimization
  – Decoupled front end with prefetch to hide fetch latency
  – Aggressive branch prediction to reduce branch penalty
  – Non-blocking caches and register scoreboard to hide load latency
Parallel, deep, and dynamic pipeline designed for maximum throughput

10 Stage In-Order Core Pipeline
Stage  Name                      Group
IPG    Inst Pointer Generation   Front End
FET    Fetch                     Front End
ROT    Rotate                    Front End
EXP    Expand                    Instruction Delivery
REN    Rename                    Instruction Delivery
WLD    Word-Line Decode          Operand Delivery
REG    Register Read             Operand Delivery
EXE    Execute                   Execution
DET    Exception Detect          Execution
WRB    Write-Back                Execution
Front End
• Pre-fetch/Fetch of up to 6 instructions/cycle
• Hierarchy of branch predictors
• Decoupling buffer
Instruction Delivery
• Dispersal of up to 6 instructions on 9 ports
• Reg. remapping
• Reg. stack engine
Operand Delivery
• Reg read + Bypasses
• Register scoreboard
• Predicated dependencies
Execution
• 4 single cycle ALUs, 2 ld/str
• Advanced load control
• Predicate delivery & branch
• NaT/Exception/Retirement

Frontend: Prefetch & Fetch (stages: IPG, FET, ROT)
• SW-triggered prefetch loads target code early using br hints
• Streaming prefetch of large blocks via hint on branch
• Early prefetch of small blocks via BRP instruction
• I-Fetch of 32 Bytes/clock feeds an 8-bundle decoupling buffer
• Buffer allows front-end to fetch even when back-end is stalled
• Hides instruction cache misses and branch bubbles
[Figure: front-end block diagram across IPG, FET, ROT, EXP; labels: IP MUX, I-Cache & ITLB, 8 bundle buffer, Fetch, Feed, bubble, Branch Predictor Structures & Resteers]
Aggressive instruction fetch hardware to feed a highly parallel, high performance machine

Front End: Branch Prediction (stages: IPG, FET, ROT)
• Branch hints combine with predictor hierarchy to improve branch prediction: four progressive resteers
• 4 TARs programmed by “importance” hints
• 512-entry 2-level predictor provides dynamic direction prediction
• 64-entry BTAC contains footprint of upcoming branch targets (programmed by branch hints, and allocated dynamically)
[Figure: branch prediction block diagram across IPG, FET, ROT, EXP; labels: IP MUX, I-Cache & ITLB, 8 bundle buffer, Return Stack Buffer, Loop Exit Corrector, Target Address Registers, Adaptive 2-Level Predictor, Br Target Address Cache, Branch Address Calc 1, Branch Address Calc 2]
Intelligent branch prediction improves performance across all workloads

Instruction Delivery: Dispersal (stage: EXP)
• Stop bits eliminate dependency checking
• Templates simplify routing
• 1st available dispersal from 6 syllables to 9 issue ports
  – Keep issuing until stop bit, resource oversubscription, or asymmetry
[Figure: dispersal network across ROT, EXP; syllables S0 to S5 routed to issue ports M0, M1, I0, I1, F0, F1, B0, B1, B2]
Achieves highly parallel execution with simple hardware

Instruction Delivery: Stacking (stage: REN)
• Massive 128 register file accommodates multiple variable sized procedures via stacking
• Eliminates most register spill / fill at procedure interfaces
• Achieved transparently to the compiler
• Using register remapping via parallel adders
• Stack engine performs the few required spill/fills
[Figure: renaming datapath across EXP, REN, WLD; labels: Stack Engine, Stall, Spill/Fill Injection, Integer, FP, & Predicate Renamers, ports M0, M1, I0, I1, F0, F1]
Unique register model enables faster execution of object-oriented code

Operand Delivery (stages: WLD, REG)
• Multiported register file + mux hierarchy delivers operands in REG
• Unique “Delayed Stall” mechanism used for register dependencies
• Avoids pipeline flush or replay on unavailable data
• Stall computed in REG, but core pipeline stalls in EXE
• Special Operand Latch Manipulation (OLM) captures data returns into operand latches, to mimic register file read
[Figure: operand delivery datapath across WLD, REG, EXE; labels: 128 Entry Integer Register File 8R / 6W, Bypass Muxes, ALUs, Src, Dst, Preds, Dependency Control, Scoreboard, Comparators, OLM comparators, Delayed Stall]
Avoids pipeline flush to enable a more effective, higher throughput pipeline

Predicate Delivery (stage: EXE)
• All instructions read operands and execute
• Canceled at retirement if predicates off
• Predicates generated in EXE (by cmps), delivered in DET, & feed into: retirement, branch execution and dependency detection
• Smart control network cancels false stalls on predicated dependencies
• Dependency detection for cancelled producer/consumer (REG)
[Figure: predicate datapath across REG, EXE, DET; labels: Predicate Register File Read, Bypass Muxes, I-Cmps, F-Cmps, To Dependency Detect (x6), To Branch Execution (x3), To Retirement (x6)]
Higher performance through removal of branch penalties in server and workstation applications

Parallel Branch Execution (stages: DET, WRB)
• Speculation + predication result in clusters of branches
• Execution of 3 branches/clock optimizes for clustered branches
• Branch execution in DET allows cmps --> branches in same issue group
[Figure: branch execution datapath across REG, EXE, DET, WRB; labels: Read 3 BR, Pred. Delivery, Retirement of branch bundle, IP relative, Address Validation, Direction Validation, Resteer IP, Most recent branch prediction info]
Parallel branch hardware extends performance benefits of EPIC technology

Speculation Hardware (stages: DET, WRB)
• Control Speculation support requires minimal hardware
• Computed memory exception delivered with data as tokens (NaTs)
• NaTs propagate through subsequent executions like source data
• Data Speculation enabled efficiently via ALAT structure
• 32 outstanding advanced loads
• Indexed by reg-ids, keeps partial physical address tag
• 0 clk checks: dependent use can be issued in parallel with check
[Figure: speculation datapath across EXE, DET, WRB; labels: Address, TLB, Physical Address, 32-entry ALAT, Adv Ld Status, Exception Logic, Check & Exception, Memory Subsystem, Spec. Ld. Status (NaT), Check Instruction]
Efficient elimination of memory
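
The ALAT bullets on the Speculation Hardware slide (32 outstanding advanced loads, indexed by register id, partial physical address tag) can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the hardware design: the class name `ToyALAT`, the 12-bit tag width, the oldest-first eviction when the table is full, and the omission of access sizes are all hypothetical choices made for illustration.

```python
from collections import OrderedDict

ALAT_ENTRIES = 32   # from the slide: 32 outstanding advanced loads
TAG_BITS = 12       # assumption: the slide gives no width for the partial tag


class ToyALAT:
    """Toy model of an advanced load address table (illustrative only)."""

    def __init__(self, entries=ALAT_ENTRIES, tag_bits=TAG_BITS):
        self.entries = entries
        self.tag_mask = (1 << tag_bits) - 1
        self.table = OrderedDict()  # register id -> partial address tag

    def advanced_load(self, reg, addr):
        """Record an advanced load whose target register is `reg`."""
        self.table.pop(reg, None)            # at most one entry per register id
        if len(self.table) >= self.entries:
            self.table.popitem(last=False)   # assumption: evict the oldest entry
        self.table[reg] = addr & self.tag_mask

    def store(self, addr):
        """A store removes every entry whose partial tag matches its address."""
        tag = addr & self.tag_mask
        for reg in [r for r, t in self.table.items() if t == tag]:
            del self.table[reg]

    def check(self, reg):
        """Return True if the advanced load into `reg` is still valid."""
        return reg in self.table


alat = ToyALAT()
alat.advanced_load(reg=40, addr=0x1000)
alat.advanced_load(reg=41, addr=0x2008)
alat.store(0x2008)       # conflicting store to the address loaded into reg 41
print(alat.check(40))    # True: no matching store, speculation holds
print(alat.check(41))    # False: entry removed, the load must be redone
alat.store(0x5000)       # different address, same low 12 bits as 0x1000
print(alat.check(40))    # False: a partial tag can report a false conflict
```

The last three lines show the trade-off of a partial tag in this model: a store to an unrelated address that happens to share the tag bits removes the entry, which costs a needless recovery but never lets a real conflict go undetected.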