Intel® Itanium™ Processor Microarchitecture Overview

Harsh Sharangpani
Principal Engineer and IA-64 Microarchitecture Manager, Intel Corporation
Microprocessor Forum, October 5-6, 1999

Unveiling the Intel® Itanium™ Processor Design
  • Leading-edge implementation of IA-64 architecture for world-class performance
  • New capabilities for systems that fuel the Internet Economy
  • Strong progress on initial silicon

Itanium™ Processor Goals
  • World-class performance on high-end applications
    – High performance for commercial servers
    – Supercomputer-level floating point for technical workstations
  • Large memory management with 64-bit addressing
  • Robust support for mission critical environments
    – Enhanced error correction, detection & containment
  • Full IA-32 instruction set compatibility in hardware
  • Deliver across a broad range of industry requirements
    – Flexible for a variety of OEM designs and operating systems
Deliver world-class performance and features for servers & workstations and emerging internet applications

EPIC Design Philosophy
  • Maximize performance via EPIC hardware & software synergy
  • Advanced features enhance instruction level parallelism
    – Predication, Speculation, ...
  • Massive hardware resources for parallel execution
  • High performance EPIC building block
[Chart: Performance vs. Time; curve labels: CISC, RISC, OOO / SuperScalar, VLIW, EPIC]
Achieving performance at the most fundamental level

Itanium™ EPIC Design Maximizes SW-HW Synergy
Architecture features programmed by compiler:
  • Register Stack & Rotation
  • Predication
  • Branch Hints
  • Explicit Parallelism
  • Data & Control Speculation
  • Memory Hints
Micro-architecture features in hardware:
  • Fetch: Instruction Cache & Branch Predictors
  • Issue: Fast, Simple 6-Issue
  • Register Handling: 128 GR & 128 FR, Register Remap & Stack Engine
  • Control: Bypasses & Dependencies
  • Parallel Resources: 4 Integer + 4 MMX Units, 2 FMACs (4 for SSE), 2 LD/ST units, 32 entry ALAT, Speculation Deferral Management
  • Memory Subsystem: Three levels of cache: L1, L2, L3

Breakthrough Levels of Parallelism
  • M F I + M F I: 6 instructions provide 12 parallel ops/clock (SP: 20 parallel ops/clock) for digital content creation & scientific computing
    – Load 4 DP (8 SP) ops via 2 ld-pair
    – 2 ALU ops (post incr)
    – 4 DP FLOPS (8 SP FLOPS)
    – 2 ALU ops
  • M I B + M I B: 6 instructions provide 8 parallel ops/clock for enterprise & Internet applications
    – 2 Loads + 2 ALU ops (post incr)
    – 2 ALU ops
    – 1 Branch Hint + 1 Branch instr
Itanium™ delivers greater instruction level parallelism than any contemporary processor

Highlights of the Itanium™ Pipeline
  • 6-Wide EPIC hardware under precise compiler control
    – Parallel hardware and control for predication & speculation
    – Efficient mechanism for enabling register stacking & rotation
    – Software-enhanced branch prediction
  • 10-stage in-order pipeline with cycle time designed for:
    – Single cycle ALU (4 ALUs globally bypassed)
    – Low latency from data cache
  • Dynamic support for run-time optimization
    – Decoupled front end with prefetch to hide fetch latency
    – Aggressive branch prediction to reduce branch penalty
    – Non-blocking caches and register scoreboard to hide load latency
Parallel, deep, and dynamic pipeline designed for maximum throughput

10 Stage In-Order Core Pipeline
Stages: IPG (Instruction Pointer Generation), FET (Fetch), ROT (Rotate), EXP (Expand), REN (Rename), WLD (Word-Line Decode), REG (Register Read), EXE (Execute), DET (Exception Detect), WRB (Write-Back)
Front End (IPG, FET, ROT)
  • Pre-fetch/Fetch of up to 6 instructions/cycle
  • Hierarchy of branch predictors
  • Decoupling buffer
Instruction Delivery (EXP, REN)
  • Dispersal of up to 6 instructions on 9 ports
  • Reg. remapping
  • Reg. stack engine
Operand Delivery (WLD, REG)
  • Reg read + Bypasses
  • Register scoreboard
  • Predicated dependencies
Execution (EXE, DET, WRB)
  • 4 single cycle ALUs, 2 ld/str
  • Advanced load control
  • Predicate delivery & branch
  • NaT/Exception/Retirement

Frontend: Prefetch & Fetch (IPG, FET, ROT)
  • SW-triggered prefetch loads target code early using br hints
  • Streaming prefetch of large blocks via hint on branch
  • Early prefetch of small blocks via BRP instruction
  • I-Fetch of 32 Bytes/clock feeds an 8-bundle decoupling buffer
  • Buffer allows front-end to fetch even when back-end is stalled
  • Hides instruction cache misses and branch bubbles
[Diagram: IP MUX, I-Cache & ITLB, 8 bundle buffer, Branch Predictor Structures & Resteers; stages IPG, FET, ROT, EXP]
Aggressive instruction fetch hardware to feed a highly parallel, high performance machine

Front End: Branch Prediction (IPG, FET, ROT)
  • Branch hints combine with predictor hierarchy to improve branch prediction: four progressive resteers
  • 4 TARs programmed by “importance” hints
  • 512-entry 2-level predictor provides dynamic direction prediction
  • 64-entry BTAC contains footprint of upcoming branch targets (programmed by branch hints, and allocated dynamically)
[Diagram: IP MUX, I-Cache & ITLB, 8 bundle buffer, Target Address Registers, Adaptive 2-Level Predictor, Br Target Address Cache, Return Stack Buffer, Loop Exit Corrector, Branch Address Calc 1, Branch Address Calc 2; stages IPG, FET, ROT, EXP]
Intelligent branch prediction improves performance across all workloads

Instruction Delivery: Dispersal (EXP)
  • Stop bits eliminate dependency checking
  • Templates simplify routing
  • 1st available dispersal from 6 syllables to 9 issue ports
    – Keep issuing until stop bit, resource oversubscription, or asymmetry
[Diagram: Dispersal Network routing syllables S0-S5 to issue ports M0, M1, I0, I1, F0, F1, B0, B1, B2; stages ROT, EXP]
Achieves highly parallel execution with simple hardware

Instruction Delivery: Stacking (REN)
  • Massive 128 register file accommodates multiple variable sized procedures via stacking
  • Eliminates most register spill / fill at procedure interfaces
  • Achieved transparently to the compiler
  • Using register remapping via parallel adders
  • Stack engine performs the few required spill/fills
[Diagram: Stack Engine with Stall and Spill/Fill Injection; Integer, FP, & Predicate Renamers on ports M0, M1, I0, I1, F0, F1; stages EXP, REN, WLD]
Unique register model enables faster execution of object-oriented code

Operand Delivery (WLD, REG)
  • Multiported register file + mux hierarchy delivers operands in REG
  • Unique “Delayed Stall” mechanism used for register dependencies
  • Avoids pipeline flush or replay on unavailable data
  • Stall computed in REG, but core pipeline stalls in EXE
  • Special Operand Latch Manipulation (OLM) captures data returns into operand latches, to mimic register file read
[Diagram: 128 Entry Integer Register File (8R / 6W), Bypass Muxes, ALUs, Dependency Control, Scoreboard Comparators, OLM comparators, Delayed Stall; signals Src, Dst, Preds; stages WLD, REG, EXE]
Avoids pipeline flush to enable a more effective, higher throughput pipeline

Predicate Delivery (EXE)
  • All instructions read operands and execute
  • Canceled at retirement if predicates off
  • Predicates generated in EXE (by cmps), delivered in DET, & feed into: retirement, branch execution and dependency detection
  • Smart control network cancels false stalls on predicated dependencies
  • Dependency detection for cancelled producer/consumer (REG)
[Diagram: Predicate Register File Read, Bypass Muxes, I-Cmps, F-Cmps; outputs To Dependency Detect (x6), To Branch Execution (x3), To Retirement (x6); stages REG, EXE, DET]
Higher performance through removal of branch penalties in server and workstation applications

Parallel Branch Execution (DET, WRB)
  • Speculation + predication result in clusters of branches
  • Execution of 3 branches/clock optimizes for clustered branches
  • Branch execution in DET allows cmps-->branches in same issue group
[Diagram: Read 3 BR, Pred. Delivery, IP relative Address, Direction Validation, Address Validation, Resteer IP, Retirement of branch bundle, Most recent branch prediction info; stages REG, EXE, DET, WRB]
Parallel branch hardware extends performance benefits of EPIC technology

Speculation Hardware (DET, WRB)
  • Control Speculation support requires minimal hardware
  • Computed memory exception delivered with data as tokens (NaTs)
  • NaTs propagate through subsequent executions like source data
  • Data Speculation enabled efficiently via ALAT structure
  • 32 outstanding advanced loads
  • Indexed by reg-ids, keeps partial physical address tag
  • 0 clk checks: dependent use can be issued in parallel with check
[Diagram: TLB, Physical Address, 32-entry ALAT, Adv Ld Status, Exception Logic, Memory Subsystem, Spec. Ld. Status (NaT), Check Instruction; stages EXE, DET, WRB]
Efficient elimination of memory
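
The per-clock totals quoted on the "Breakthrough Levels of Parallelism" slide can be reproduced by adding up the operations the slide lists for each six-instruction issue group. The sketch below does only that arithmetic; the dictionary names, keys, and the helper `ops_per_clock` are illustrative and do not come from the deck.

```python
# Operation counts per clock, copied from the slide's two example issue groups.
# Names and keys are hypothetical labels chosen for this sketch.
MFI_MFI_DOUBLE = {
    "operands loaded via 2 ld-pair": 4,
    "post-increment ALU ops": 2,
    "floating-point ops": 4,
    "integer ALU ops": 2,
}
MFI_MFI_SINGLE = {
    "operands loaded via 2 ld-pair": 8,
    "post-increment ALU ops": 2,
    "floating-point ops": 8,
    "integer ALU ops": 2,
}
MIB_MIB = {
    "loads": 2,
    "post-increment ALU ops": 2,
    "integer ALU ops": 2,
    "branch hints": 1,
    "branches": 1,
}


def ops_per_clock(group):
    """Sum the parallel operations listed for one issue group."""
    return sum(group.values())


print(ops_per_clock(MFI_MFI_DOUBLE))  # 12
print(ops_per_clock(MFI_MFI_SINGLE))  # 20
print(ops_per_clock(MIB_MIB))         # 8
```

The sums match the slide's figures of 12 (double precision), 20 (single precision), and 8 parallel ops per clock.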
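
The "Speculation Hardware" slide says a computed memory exception travels with the data as a NaT token and propagates through later operations like source data. A toy software model of that idea follows. It is a sketch under assumptions, not the hardware design: memory is a Python dict, a missing address stands in for a faulting access, and the names `Reg`, `speculative_load`, `add`, and `check` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    """A register value plus its NaT ("Not a Thing") bit."""
    value: int = 0
    nat: bool = False


def speculative_load(memory, address):
    """Return the data, or a NaT token instead of raising on a bad address."""
    if address not in memory:  # stand-in for a deferred memory exception
        return Reg(nat=True)
    return Reg(memory[address])


def add(a, b):
    """Any NaT source makes the result NaT, so the token flows with the data."""
    if a.nat or b.nat:
        return Reg(nat=True)
    return Reg(a.value + b.value)


def check(reg):
    """Return True when the speculative chain failed and recovery is needed."""
    return reg.nat


memory = {0x100: 40}
good = add(speculative_load(memory, 0x100), Reg(2))
bad = add(speculative_load(memory, 0x999), Reg(2))
print(good.value, check(good))  # 42 False
print(check(bad))               # True
```

The point of the model is that no exception is raised at the load itself; the failure is only observed where the program chooses to check.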
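
The slide describes the ALAT as tracking 32 outstanding advanced loads, indexed by register id and holding a partial physical address tag. The sketch below models that bookkeeping in software. Several details are assumptions not stated in the deck: the 12-bit tag width, the evict-oldest replacement policy, and ignoring access size. The rule that a store removes matching entries reflects general IA-64 data-speculation behavior rather than anything on the slide, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class Alat:
    """Toy advanced load address table: register id -> partial address tag."""

    def __init__(self, entries=32, tag_bits=12):  # tag_bits is an assumption
        self.entries = entries
        self.tag_mask = (1 << tag_bits) - 1
        self.table = OrderedDict()

    def advanced_load(self, reg_id, phys_addr):
        """Record an advanced load targeting reg_id."""
        self.table.pop(reg_id, None)
        if len(self.table) >= self.entries:
            self.table.popitem(last=False)  # assumed policy: evict oldest
        self.table[reg_id] = phys_addr & self.tag_mask

    def store(self, phys_addr):
        """A store removes every entry whose partial tag matches."""
        tag = phys_addr & self.tag_mask
        for reg_id in [r for r, t in self.table.items() if t == tag]:
            del self.table[reg_id]

    def check(self, reg_id):
        """True if the advanced load is still valid (no conflicting store seen)."""
        return reg_id in self.table


alat = Alat()
alat.advanced_load(32, 0x1008)
alat.store(0x2010)     # different tag: entry survives
print(alat.check(32))  # True
alat.store(0x5008)     # same low 12 bits as 0x1008: conservative false conflict
print(alat.check(32))  # False
```

Because only part of the address is compared, a store to an unrelated address with the same low bits also invalidates the entry. That costs an unnecessary recovery but never lets a stale value through.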
