The Intel® Pentium® 4 Processor
DougDoug CarmeanCarmean PrincipalPrincipal ArchitectArchitect IntelIntel ArchitectureArchitecture GroupGroup
SpringSpring 20022002
Intel®
Copyright © 2002 Intel Corporation. PDX AgendaAgenda
z ReviewReview z PipelinePipeline DepthDepth z ExecutionExecution TraceTrace CacheCache z DataData SpeculationSpeculation z SpecSpec PerformancePerformance z SummarySummary
Intel®
Copyright © 2002 Intel Corporation. PDX Information in this document is provided in connection with Intel® products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel’s Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel® products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. This document contains information on products in the design phase of development. Do not finalize a design with this information. Revised information will be published when the product is available. Verify with your local sales office that you have the latest datasheet before finalizing a design. Intel processors may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel, Pentium, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and foreign countries. Copyright © (2001) Intel Corporation. Intel®
Copyright © 2002 Intel Corporation. PDX IntelIntel® NetburstNetburstTM MicroMicro--architecturearchitecture vsvs P6P6
BasicBasic P6P6 PipelinePipeline
1 2 3 4 5 6 7 8 9 10 Fetch Fetch Decode Decode Decode Rename ROB Rd Rdy/Sch Dispatch Exec
BasicBasic PentiumPentium® 44 ProcessorProcessor PipelinePipeline
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 TC Nxt IP TC Fetch Drive Alloc Rename Que Sch Sch Sch Disp Disp RF RF Ex Flgs Br Ck Drive
Deeper Pipelines enable higher frequency and performance Intel®
Copyright © 2002 Intel Corporation. PDX HyperHyper PipelinedPipelined TechnologyTechnology
Today 2.2GHz 20 1.8GHz Netburst Micro-Architecture Intro 1.5GHz 1.2GHz 10 P6 Micro-Architecture Frequency Frequency
233MHz 166MHz 5 60MHz P5 Micro-Architecture Introduction Time Intel® Copyright © 2002 Intel Corporation. PDX DeeperDeeper PipelinesPipelines areare BetterBetter
120% 2 MB 100% 1 MB
80% 512 KB
60% 256 KB 40%
Performance Improvement Performance 20%
0% 10 15 20 25 30 Pipeline Depth Intel®
Source: Average of 2000 application segments from performance simulations Copyright © 2002 Intel Corporation. PDX WhyWhy notnot deeperdeeper pipelines?pipelines?
z IncreasesIncreases complexitycomplexity ––HarderHarder toto balancebalance ––MoreMore challengeschallenges toto architectarchitect aroundaround ––MoreMore algorithmsalgorithms ––GreaterGreater validationvalidation efforteffort ––NeedNeed toto pipelinepipeline thethe wireswires
OverallOverall EngineeringEngineering EffortEffort IncreasesIncreases QuicklyQuickly
asas PipelinePipeline depthdepth increasesincreases Intel®
Copyright © 2002 Intel Corporation. PDX PerformancePerformance
z HighHigh bandwidthbandwidth frontfront endend z LowLow latencylatency corecore z LowerLower memorymemory latencylatency HighHighHigh BandwidthBandwidthBandwidth FrontFrontFront EndEndEnd
Intel®
Copyright © 2002 Intel Corporation. PDX HigherHigher FrequencyFrequency increasesincreases requirementsrequirements ofof frontfront endend
z BranchBranch predictionprediction isis moremore importantimportant ––SoSo wewe improvedimproved itit z NeedNeed greatergreater uopuop bandwidthbandwidth ––BranchesBranches constantlyconstantly changechange thethe flowflow ––NeedNeed toto decodedecode moremore instructionsinstructions inin parallelparallel
Intel®
Copyright © 2002 Intel Corporation. PDX BlockBlock DiagramDiagram
Dynamic BTB Instruction TLB System 64-bit wide 4K Entries Bus Instruction Decoder
Bus Execution Trace Micro Execution Trace Trace Cache BTB Interface Instruction Cache 12K uopsµops 512 Entries Unit Sequencer Allocator / Register Renamer Quad Memory µop Queue Integer/Floating Point µop Queue Pumped Integer Schedulers Floating Point Schedulers 3.2 GB/s Memory Slow Fast Int Fast Int FP Move FP Gen
Integer Register File / Bypass Network FP Register / Bypass
L2 Slow ALU 2xAGU 2xALU 2xALU Ld/St FP Fmul Fmul Fmul Cache Complex Simple Simple FAdd FAdd FAdd Address Move FAdd Instr. Instr. Instr. unit 64GB/s 256-bit wide L1 Data Cache 8Kbyte 4-way Intel®
Copyright © 2002 Intel Corporation. PDX Execution Trace Cache
1 cmp 2 br -> T1 .. ... (unused code) TraceTrace CacheCache DeliveryDelivery T1: 3 sub 4 br -> T2 1 cmp 2 br T1 3 T1: sub .. ... (unused code) 4 br T2 5 mov 6 sub T2: 5 mov 7 br T3 8 T3:add 9 sub 6 sub 7 br -> T3 10 mul 11 cmp 12 br T4 .. ... (unused code) T3: 8 add 9 sub 10 mul 11 cmp 12 br -> T4
Intel®
Copyright © 2002 Intel Corporation. PDX ExecutionExecution TraceTrace CacheCache
P6P6 MicroarchitectureMicroarchitecture TraceTrace CacheCache DeliveryDelivery
1 cmp 2 br T1 1 cmp 2 br T1 3 T1: sub 4 br T2 5 mov 6 sub 3 T1: sub 4 br T2 7 br T3 8 T3:add 9 sub 10 mul 11 cmp 12 br T4 5 mov 6 sub 7 br T3
8 T3:add 9 sub 10 mul 11 cmp 12 br T4 BW = 1.5 uops/ns BW = 6 uops/ns
Intel®
Copyright © 2002 Intel Corporation. PDX InsideInside thethe ExecutionExecution TraceTrace CacheCache
Way 0 Way 1 Way 2 Way 3 Set 0 Instruction Set 1 head Pointer 0x0900 Set 2 body 1 Set 3 body 2 Set 4 tail
head cmp, br T1, T1:sub, br T2, mov, sub body 1 br T3, T3:add, sub, mul, cmp, br T4
body 2 T4:add, sub, mov, add, add, mov
tail add, sub, mov, add, add, mov
Intel®
Copyright © 2002 Intel Corporation. PDX SelfSelf ModifyingModifying CodeCode
z ProgramsPrograms thatthat modifymodify thethe instructioninstruction streamstream thatthat isis beingbeing executedexecuted z VeryVery commoncommon inin Java*Java* codecode fromfrom JITsJITs z RequiresRequires hardwarehardware mechanismsmechanisms toto maintainmaintain consistencyconsistency
Intel® *Other names and brands may be claimed as the property of others. Copyright © 2002 Intel Corporation. PDX SelfSelf ModifyingModifying CodeCode
z TheThe hardwarehardware needsneeds toto handlehandle twotwo basicbasic cases:cases: ––StoresStores thatthat writewrite toto instructionsinstructions inin thethe TraceTrace CacheCache ––InstructionInstruction fetchesfetches thatthat hithit pendingpending storesstores – Speculative – Committed
Intel®
Copyright © 2002 Intel Corporation. PDX CaseCase 1:1: StoresStores toto cachedcached instructionsinstructions
“in use” bits addr addr
Instruction TLB (128 entries)
Execution Core Trace Cache addr
Data TLB Store’s Physical Address
Intel®
Copyright © 2002 Intel Corporation. PDX CaseCase 2:2: FetchesFetches toto pendingpending storesstores
addr addr
Instruction Instruction Pointer TLB (128 entries)
Committed Write Store addr Combining Buffer Please Speculative Buffer Re-Fetch Please Flush Pipeline Please Re-Fetch Execution Core
Intel®
Copyright © 2002 Intel Corporation. PDX ExecutionExecution TraceTrace CacheCache
z ProvidesProvides higherhigher bandwidthbandwidth forfor higherhigher frequencyfrequency corecore z ReducesReduces fetchfetch latencylatency z RequiresRequires newnew fundamentallyfundamentally newnew algorithmsalgorithms
Intel®
Copyright © 2002 Intel Corporation. PDX PerformancePerformance
z HighHigh bandwidthbandwidth frontfront endend z LowLow latencylatency corecore z LowerLower memorymemory latencylatency LowLowLow LatencyLatencyLatency CoreCoreCore
Intel®
Copyright © 2002 Intel Corporation. PDX DataData SpeculationSpeculation
z UseUse datadata beforebefore wewe areare suresure itit isis validvalid – Lowers effective LD latency – Fast ALUs in Pentium 4 want fast LDs – Ratio of LD latency to ADD latency is important if 1 in 5 uops is a LD z AsAs pipelinespipelines getget deeper,deeper, datadata speculationspeculation getsgets moremore importantimportant – Number of cycles saved /w data speculation increases as pipeline depth increases
Intel®
Copyright © 2002 Intel Corporation. PDX L1L1 DataData CacheCache
Dynamic BTB Instruction TLB System 64-bit wide 4K Entries Bus Instruction Decoder
Bus Micro Execution Trace Trace Cache BTB Interface Instruction Cache 12K µops 512 Entries Unit Sequencer Allocator / Register Renamer Quad Memory µop Queue Integer/Floating Point µop Queue Pumped Integer Schedulers Floating Point Schedulers 3.2 GB/s Memory Slow Fast Int Fast Int FP Move FP Gen
Integer Register File / Bypass Network FP Register / Bypass
L2 Slow ALU 2xAGU 2xALU 2xALU Ld/St FP Fmul Fmul Fmul Cache Complex Simple Simple FAdd FAdd FAdd Address Move FAdd Instr. Instr. Instr. unit 64GB/s 256-bit wide L1 Data Cache 8Kbyte 4-way L1 Data Cache 8Kbyte 4-way Intel®
Copyright © 2002 Intel Corporation. PDX L1L1 CacheCache isis >3x>3x FasterFaster
z P6:P6: ––33 clocksclocks @@ 1GHz1GHz
3ns z P4P:P4P: ––22 clocksclocks @@ 2GHz2GHz
1ns
LowerLowerLower LatencyLatencyLatency isisis HigherHigherHigher PerformancePerformancePerformance Intel®
Copyright © 2002 Intel Corporation. PDX L1L1 DataData CacheCache
Pipeline Stages
VA tag 15:0 data VA Data 31:16 Fast SB 2x Clock
VA TLB TAG 31:0
Slow SB
replay replay replay Intel®
Copyright © 2002 Intel Corporation. PDX L1L1 DataData CacheCache
VA tag 15:0 data VA Data 31:16 Fast SB 2x Clock
VA TLB TAG 31:0
Slow SB
Way Predictor (Tag array) Data array
Way Hit
… select … (Replay)
10:6 15:11 19:16 Intel®
Copyright © 2002 Intel Corporation. PDX AA DigressionDigression onon StoresStores
z TwoTwo componentscomponents toto aa store:store: – STA: address computation – STD: data piece z HybridHybrid uOPuOP – Single uOP in the front, back ends – Two uOPs in the middle
Intel®
Copyright © 2002 Intel Corporation. PDX MemoryMemory DisambiguationDisambiguation
Ld EAX <- AddrA STA AddrB, STD DataB store
z IfIf thethe storestore isis olderolder ––AndAnd AddrAAddrA == AddrBAddrB ––ThenThen thethe loadload mustmust getget DataBDataB z DependenciesDependencies cancan notnot bebe resolvedresolved untiluntil executionexecution
Intel®
Copyright © 2002 Intel Corporation. PDX MemoryMemory DisambiguationDisambiguation Option 1 Option 2 Option 3
predictor
LD A LD A LD A
STA B STD B STA B STD B STA B STD B
Loads wait for: Loads wait for: Predict All older STAs AND All older STAs Specific older STAs All older STDs Specific older STDs No Recovery STD Recovery Complex Recovery K7?K7? P4P4 EV8EV8 Intel®
Copyright © 2002 Intel Corporation. PDX ExampleExample ML=0
LD1 XOR z STST writeswrites newnew valuevalue STA, STD ML=1 z LD2LD2 forwardsforwards ? z BranchBranch resolvesresolves LD2 basedbased onon newnew valuevalue
BR JMP if ML=0
Intel®
Copyright © 2002 Intel Corporation. PDX ExampleExample ML=0
LD1 XOR LD misses, replays
STA, STD ML=1 STA dependent, replays
?
LD2 LD hits, gets old value
BR mispredicts BR JMP if ML=0
Intel®
Copyright © 2002 Intel Corporation. PDX ExampleExample
LD1 XOR z IfIf LD2LD2 dependsdepends onon thethe STA,STA, theythey areare usuallyusually partpart ofof thethe samesame STA, STD dataflowdataflow graphgraph z IfIf thethe STASTA replays,replays, thethe LD2 LDLD usuallyusually hashas anan addressaddress dependencedependence
BR
Intel®
Copyright © 2002 Intel Corporation. PDX CautiousCautious ModeMode
z Normally,Normally, aggressivelyaggressively scheduleschedule z IfIf aa largelarge numbernumber ofof problemsproblems occuroccur enterenter CautiousCautious ModeMode z InIn CautiousCautious Mode,Mode, branchesbranches waitwait forfor datadata toto bebe nonnon--speculativespeculative ––IncreasesIncreases branchbranch mispredictionmisprediction latencylatency ––CompletelyCompletely eliminateseliminates problemsproblems
Intel®
Copyright © 2002 Intel Corporation. PDX CautiousCautious Mode:Mode: ImplementationImplementation
z A simple state machine cleans up the outliers – Out of 2200 traces, 3 traces speedup >20% – The other traces are unaffected – Average performance improvement < 0.1%
Intel®
Copyright © 2002 Intel Corporation. PDX PerformancePerformance
z HighHigh bandwidthbandwidth frontfront endend z LowLow latencylatency corecore z LowerLower memorymemory latencylatency
LowerLowerLower MemoryMemoryMemory LatencyLatencyLatency
Intel®
Copyright © 2002 Intel Corporation. PDX ReducingReducing LatencyLatency
z AsAs frequencyfrequency increases,increases, itit isis importantimportant toto improveimprove thethe performanceperformance ofof thethe memorymemory subsystemsubsystem z DataData PrefetchPrefetch LogicLogic ––WatchesWatches processorprocessor memorymemory traffictraffic ––LooksLooks forfor patternspatterns ––InitiatesInitiates accessesaccesses
Intel®
Copyright © 2002 Intel Corporation. PDX DataData PrefetchPrefetch LogicLogic
Dynamic BTB Instruction TLB System 64-bit wide 4K Entries Bus Instruction Decoder
Bus Micro Execution Trace Trace Cache BTB Interface Instruction Cache 12K µops 512 Entries Unit Sequencer Allocator / Register Renamer QuadData Memory µop Queue Integer/Floating Point µop Queue PumpePrefetchd Integer Schedulers Floating Point Schedulers 3.2Logic GB/s Memory Slow Fast Int Fast Int FP Move FP Gen
Integer Register File / Bypass Network FP Register / Bypass
L2 Slow ALU 2xAGU 2xALU 2xALU Ld/St FP Fmul Fmul Fmul Cache Complex Simple Simple FAdd FAdd FAdd Address Move FAdd Instr. Instr. Instr. unit 64GB/s 256-bit wide L1 Data Cache 8Kbyte 4-way Intel®
Copyright © 2002 Intel Corporation. PDX DataData PrefetchPrefetch LogicLogic
Instruction 64 Buffers
Instruction Fetch L2 Data Advanced Bus Prefetch Transfer Queue Logic Cache
L1 Data Cache 256
Prefetch logic first checks L2 cache and then fetches lines from memory that miss L2 cache. Intel®
Copyright © 2002 Intel Corporation. PDX DataData PrefetchPrefetch LogicLogic
z WatchesWatches forfor streamingstreaming memorymemory accessaccess patternspatterns – Can track 8 multiple independent streams – Loads, Stores or Instruction – Forward or Backward z AnalysisAnalysis onon 3232 bytebyte cachecache lineline granularitygranularity z LooksLooks forfor “mostly”“mostly” completecomplete streams:streams: – Access to cache lines 1,2,3,4,5,6 will prefetch – Access to cache lines 1,2, 4,5,6 will prefetch – 1, ,3, , ,6, , ,9 will not prefetch
Intel®
Copyright © 2002 Intel Corporation. PDX PerformancePerformance
z HighHigh bandwidthbandwidth frontfront endend z LowLow latencylatency corecore z LowerLower memorymemory latencylatency
Intel®
Copyright © 2002 Intel Corporation. PDX Intel®
Copyright © 2002 Intel Corporation. PDX