Netburst Microarchitecture Overview

The Intel® Pentium® 4 Processor

DougDoug CarmeanCarmean PrincipalPrincipal ArchitectArchitect IntelIntel ArchitectureArchitecture GroupGroup

SpringSpring 20022002

Intel®

z ReviewReview z PipelinePipeline DepthDepth z ExecutionExecution TraceTrace CacheCache z DataData SpeculationSpeculation z SpecSpec PerformancePerformance z SummarySummary

Intel®

Copyright © 2002 Intel Corporation. PDX Information in this document is provided in connection with Intel® products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel’s Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel® products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. This document contains information on products in the design phase of development. Do not finalize a design with this information. Revised information will be published when the product is available. Verify with your local sales office that you have the latest datasheet before finalizing a design. Intel processors may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel, Pentium, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and foreign countries. Copyright © (2001) Intel Corporation. Intel®

BasicBasic P6P6 PipelinePipeline

1 2 3 4 5 6 7 8 9 10 Fetch Fetch Decode Decode Decode Rename ROB Rd Rdy/Sch Dispatch Exec

BasicBasic PentiumPentium® 44 ProcessorProcessor PipelinePipeline

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 TC Nxt IP TC Fetch Drive Alloc Rename Que Sch Sch Sch Disp Disp RF RF Ex Flgs Br Ck Drive

Deeper Pipelines enable higher frequency and performance Intel®

Today 2.2GHz 20 1.8GHz Netburst Micro-Architecture Intro 1.5GHz 1.2GHz 10 P6 Micro-Architecture Frequency Frequency

120% 2 MB 100% 1 MB

80% 512 KB

60% 256 KB 40%

Performance Improvement Performance 20%

0% 10 15 20 25 30 Pipeline Depth Intel®

z IncreasesIncreases complexitycomplexity ––HarderHarder toto balancebalance ––MoreMore challengeschallenges toto architectarchitect aroundaround ––MoreMore algorithmsalgorithms ––GreaterGreater validationvalidation efforteffort ––NeedNeed toto pipelinepipeline thethe wireswires

OverallOverall EngineeringEngineering EffortEffort IncreasesIncreases QuicklyQuickly

asas PipelinePipeline depthdepth increasesincreases Intel®

z HighHigh bandwidthbandwidth frontfront endend z LowLow latencylatency corecore z LowerLower memorymemory latencylatency HighHighHigh BandwidthBandwidthBandwidth FrontFrontFront EndEndEnd

Intel®

z BranchBranch predictionprediction isis moremore importantimportant ––SoSo wewe improvedimproved itit z NeedNeed greatergreater uopuop bandwidthbandwidth ––BranchesBranches constantlyconstantly changechange thethe flowflow ––NeedNeed toto decodedecode moremore instructionsinstructions inin parallelparallel

Intel®

Dynamic BTB Instruction TLB System 64-bit wide 4K Entries Bus Instruction Decoder

Bus Execution Trace Micro Execution Trace Trace Cache BTB Interface Instruction Cache 12K uopsµops 512 Entries Unit Sequencer Allocator / Register Renamer Quad Memory µop Queue Integer/Floating Point µop Queue Pumped Integer Schedulers Floating Point Schedulers 3.2 GB/s Memory Slow Fast Int Fast Int FP Move FP Gen

Integer Register File / Bypass Network FP Register / Bypass

L2 Slow ALU 2xAGU 2xALU 2xALU Ld/St FP Fmul Fmul Fmul Cache Complex Simple Simple FAdd FAdd FAdd Address Move FAdd Instr. Instr. Instr. unit 64GB/s 256-bit wide L1 Data Cache 8Kbyte 4-way Intel®

1 cmp 2 br -> T1 .. ... (unused code) TraceTrace CacheCache DeliveryDelivery T1: 3 sub 4 br -> T2 1 cmp 2 br T1 3 T1: sub .. ... (unused code) 4 br T2 5 mov 6 sub T2: 5 mov 7 br T3 8 T3:add 9 sub 6 sub 7 br -> T3 10 mul 11 cmp 12 br T4 .. ... (unused code) T3: 8 add 9 sub 10 mul 11 cmp 12 br -> T4

Intel®

P6P6 MicroarchitectureMicroarchitecture TraceTrace CacheCache DeliveryDelivery

1 cmp 2 br T1 1 cmp 2 br T1 3 T1: sub 4 br T2 5 mov 6 sub 3 T1: sub 4 br T2 7 br T3 8 T3:add 9 sub 10 mul 11 cmp 12 br T4 5 mov 6 sub 7 br T3

8 T3:add 9 sub 10 mul 11 cmp 12 br T4 BW = 1.5 uops/ns BW = 6 uops/ns

Intel®

Way 0 Way 1 Way 2 Way 3 Set 0 Instruction Set 1 head Pointer 0x0900 Set 2 body 1 Set 3 body 2 Set 4 tail

head cmp, br T1, T1:sub, br T2, mov, sub body 1 br T3, T3:add, sub, mul, cmp, br T4

body 2 T4:add, sub, mov, add, add, mov

tail add, sub, mov, add, add, mov

Intel®

z ProgramsPrograms thatthat modifymodify thethe instructioninstruction streamstream thatthat isis beingbeing executedexecuted z VeryVery commoncommon inin Java*Java* codecode fromfrom JITsJITs z RequiresRequires hardwarehardware mechanismsmechanisms toto maintainmaintain consistencyconsistency

z TheThe hardwarehardware needsneeds toto handlehandle twotwo basicbasic cases:cases: ––StoresStores thatthat writewrite toto instructionsinstructions inin thethe TraceTrace CacheCache ––InstructionInstruction fetchesfetches thatthat hithit pendingpending storesstores – Speculative – Committed

Intel®

“in use” bits addr addr

Instruction TLB (128 entries)

Execution Core Trace Cache addr

Data TLB Store’s Physical Address

Intel®

addr addr

Instruction Instruction Pointer TLB (128 entries)

Committed Write Store addr Combining Buffer Please Speculative Buffer Re-Fetch Please Flush Pipeline Please Re-Fetch Execution Core

Intel®

z ProvidesProvides higherhigher bandwidthbandwidth forfor higherhigher frequencyfrequency corecore z ReducesReduces fetchfetch latencylatency z RequiresRequires newnew fundamentallyfundamentally newnew algorithmsalgorithms

Intel®

z HighHigh bandwidthbandwidth frontfront endend z LowLow latencylatency corecore z LowerLower memorymemory latencylatency LowLowLow LatencyLatencyLatency CoreCoreCore

Intel®

z UseUse datadata beforebefore wewe areare suresure itit isis validvalid – Lowers effective LD latency – Fast ALUs in Pentium 4 want fast LDs – Ratio of LD latency to ADD latency is important if 1 in 5 uops is a LD z AsAs pipelinespipelines getget deeper,deeper, datadata speculationspeculation getsgets moremore importantimportant – Number of cycles saved /w data speculation increases as pipeline depth increases

Intel®

Dynamic BTB Instruction TLB System 64-bit wide 4K Entries Bus Instruction Decoder

Bus Micro Execution Trace Trace Cache BTB Interface Instruction Cache 12K µops 512 Entries Unit Sequencer Allocator / Register Renamer Quad Memory µop Queue Integer/Floating Point µop Queue Pumped Integer Schedulers Floating Point Schedulers 3.2 GB/s Memory Slow Fast Int Fast Int FP Move FP Gen

Integer Register File / Bypass Network FP Register / Bypass

L2 Slow ALU 2xAGU 2xALU 2xALU Ld/St FP Fmul Fmul Fmul Cache Complex Simple Simple FAdd FAdd FAdd Address Move FAdd Instr. Instr. Instr. unit 64GB/s 256-bit wide L1 Data Cache 8Kbyte 4-way L1 Data Cache 8Kbyte 4-way Intel®

z P6:P6: ––33 clocksclocks @@ 1GHz1GHz

3ns z P4P:P4P: ––22 clocksclocks @@ 2GHz2GHz

1ns

LowerLowerLower LatencyLatencyLatency isisis HigherHigherHigher PerformancePerformancePerformance Intel®

Pipeline Stages

VA tag 15:0 data VA Data 31:16 Fast SB 2x Clock

VA TLB TAG 31:0

Slow SB

replay replay replay Intel®

VA tag 15:0 data VA Data 31:16 Fast SB 2x Clock

VA TLB TAG 31:0

Slow SB

Way Predictor (Tag array) Data array

Way Hit

… select … (Replay)

10:6 15:11 19:16 Intel®

z TwoTwo componentscomponents toto aa store:store: – STA: address computation – STD: data piece z HybridHybrid uOPuOP – Single uOP in the front, back ends – Two uOPs in the middle

Intel®

Ld EAX <- AddrA STA AddrB, STD DataB store

z IfIf thethe storestore isis olderolder ––AndAnd AddrAAddrA == AddrBAddrB ––ThenThen thethe loadload mustmust getget DataBDataB z DependenciesDependencies cancan notnot bebe resolvedresolved untiluntil executionexecution

Intel®

predictor

LD A LD A LD A

STA B STD B STA B STD B STA B STD B

Loads wait for: Loads wait for: Predict All older STAs AND All older STAs Specific older STAs All older STDs Specific older STDs No Recovery STD Recovery Complex Recovery K7?K7? P4P4 EV8EV8 Intel®

LD1 XOR z STST writeswrites newnew valuevalue STA, STD ML=1 z LD2LD2 forwardsforwards ? z BranchBranch resolvesresolves LD2 basedbased onon newnew valuevalue

BR JMP if ML=0

Intel®

LD1 XOR LD misses, replays

STA, STD ML=1 STA dependent, replays

LD2 LD hits, gets old value

BR mispredicts BR JMP if ML=0

Intel®

LD1 XOR z IfIf LD2LD2 dependsdepends onon thethe STA,STA, theythey areare usuallyusually partpart ofof thethe samesame STA, STD dataflowdataflow graphgraph z IfIf thethe STASTA replays,replays, thethe LD2 LDLD usuallyusually hashas anan addressaddress dependencedependence

Intel®

z Normally,Normally, aggressivelyaggressively scheduleschedule z IfIf aa largelarge numbernumber ofof problemsproblems occuroccur enterenter CautiousCautious ModeMode z InIn CautiousCautious Mode,Mode, branchesbranches waitwait forfor datadata toto bebe nonnon--speculativespeculative ––IncreasesIncreases branchbranch mispredictionmisprediction latencylatency ––CompletelyCompletely eliminateseliminates problemsproblems

Intel®

z A simple state machine cleans up the outliers – Out of 2200 traces, 3 traces speedup >20% – The other traces are unaffected – Average performance improvement < 0.1%

Intel®

z HighHigh bandwidthbandwidth frontfront endend z LowLow latencylatency corecore z LowerLower memorymemory latencylatency

LowerLowerLower MemoryMemoryMemory LatencyLatencyLatency

Intel®

z AsAs frequencyfrequency increases,increases, itit isis importantimportant toto improveimprove thethe performanceperformance ofof thethe memorymemory subsystemsubsystem z DataData PrefetchPrefetch LogicLogic ––WatchesWatches processorprocessor memorymemory traffictraffic ––LooksLooks forfor patternspatterns ––InitiatesInitiates accessesaccesses

Intel®

Dynamic BTB Instruction TLB System 64-bit wide 4K Entries Bus Instruction Decoder

Bus Micro Execution Trace Trace Cache BTB Interface Instruction Cache 12K µops 512 Entries Unit Sequencer Allocator / Register Renamer QuadData Memory µop Queue Integer/Floating Point µop Queue PumpePrefetchd Integer Schedulers Floating Point Schedulers 3.2Logic GB/s Memory Slow Fast Int Fast Int FP Move FP Gen

Integer Register File / Bypass Network FP Register / Bypass

L2 Slow ALU 2xAGU 2xALU 2xALU Ld/St FP Fmul Fmul Fmul Cache Complex Simple Simple FAdd FAdd FAdd Address Move FAdd Instr. Instr. Instr. unit 64GB/s 256-bit wide L1 Data Cache 8Kbyte 4-way Intel®

Instruction 64 Buffers

Instruction Fetch L2 Data Advanced Bus Prefetch Transfer Queue Logic Cache

L1 Data Cache 256

Prefetch logic first checks L2 cache and then fetches lines from memory that miss L2 cache. Intel®

z WatchesWatches forfor streamingstreaming memorymemory accessaccess patternspatterns – Can track 8 multiple independent streams – Loads, Stores or Instruction – Forward or Backward z AnalysisAnalysis onon 3232 bytebyte cachecache lineline granularitygranularity z LooksLooks forfor “mostly”“mostly” completecomplete streams:streams: – Access to cache lines 1,2,3,4,5,6 will prefetch – Access to cache lines 1,2, 4,5,6 will prefetch – 1, ,3, , ,6, , ,9 will not prefetch

Intel®

z HighHigh bandwidthbandwidth frontfront endend z LowLow latencylatency corecore z LowerLower memorymemory latencylatency

Intel®