The ® ® 4 Processor

DougDoug CarmeanCarmean PrincipalPrincipal ArchitectArchitect IntelIntel ArchitectureArchitecture GroupGroup

SpringSpring 20022002

Intel®

Copyright © 2002 Intel Corporation. PDX AgendaAgenda

z ReviewReview z PipelinePipeline DepthDepth z ExecutionExecution TraceTrace CacheCache z DataData SpeculationSpeculation z SpecSpec PerformancePerformance z SummarySummary

Intel®

Copyright © 2002 Intel Corporation. PDX Information in this document is provided in connection with Intel® products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel’s Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel® products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. This document contains information on products in the design phase of development. Do not finalize a design with this information. Revised information will be published when the product is available. Verify with your local sales office that you have the latest datasheet before finalizing a design. Intel processors may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel, Pentium, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and foreign countries. Copyright © (2001) Intel Corporation. Intel®

Copyright © 2002 Intel Corporation. PDX IntelIntel® NetburstNetburstTM MicroMicro--architecturearchitecture vsvs P6P6

BasicBasic P6P6 PipelinePipeline

1 2 3 4 5 6 7 8 9 10 Fetch Fetch Decode Decode Decode Rename ROB Rd Rdy/Sch Dispatch Exec

BasicBasic PentiumPentium® 44 ProcessorProcessor PipelinePipeline

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 TC Nxt IP TC Fetch Drive Alloc Rename Que Sch Sch Sch Disp Disp RF RF Ex Flgs Br Ck Drive

Deeper Pipelines enable higher frequency and performance Intel®

Copyright © 2002 Intel Corporation. PDX HyperHyper PipelinedPipelined TechnologyTechnology

Today 2.2GHz 20 1.8GHz Netburst Micro-Architecture Intro 1.5GHz 1.2GHz 10 Micro-Architecture Frequency Frequency

233MHz 166MHz 5 60MHz Micro-Architecture Introduction Time Intel® Copyright © 2002 Intel Corporation. PDX DeeperDeeper PipelinesPipelines areare BetterBetter

120% 2 MB 100% 1 MB

80% 512 KB

60% 256 KB 40%

Performance Improvement Performance 20%

0% 10 15 20 25 30 Pipeline Depth Intel®

Source: Average of 2000 application segments from performance simulations Copyright © 2002 Intel Corporation. PDX WhyWhy notnot deeperdeeper pipelines?pipelines?

z IncreasesIncreases complexitycomplexity ––HarderHarder toto balancebalance ––MoreMore challengeschallenges toto architectarchitect aroundaround ––MoreMore algorithmsalgorithms ––GreaterGreater validationvalidation efforteffort ––NeedNeed toto pipelinepipeline thethe wireswires

OverallOverall EngineeringEngineering EffortEffort IncreasesIncreases QuicklyQuickly

asas PipelinePipeline depthdepth increasesincreases Intel®

Copyright © 2002 Intel Corporation. PDX PerformancePerformance

z HighHigh bandwidthbandwidth frontfront endend z LowLow latencylatency corecore z LowerLower memorymemory latencylatency HighHighHigh BandwidthBandwidthBandwidth FrontFrontFront EndEndEnd

Intel®

Copyright © 2002 Intel Corporation. PDX HigherHigher FrequencyFrequency increasesincreases requirementsrequirements ofof frontfront endend

z BranchBranch predictionprediction isis moremore importantimportant ––SoSo wewe improvedimproved itit z NeedNeed greatergreater uopuop bandwidthbandwidth ––BranchesBranches constantlyconstantly changechange thethe flowflow ––NeedNeed toto decodedecode moremore instructionsinstructions inin parallelparallel

Intel®

Copyright © 2002 Intel Corporation. PDX BlockBlock DiagramDiagram

Dynamic BTB Instruction TLB System 64-bit wide 4K Entries Instruction Decoder

Bus Execution Trace Micro Execution Trace Trace Cache BTB Interface Instruction Cache 12K uopsµops 512 Entries Unit Sequencer Allocator / Register Renamer Quad Memory µop Queue Integer/Floating Point µop Queue Pumped Integer Schedulers Floating Point Schedulers 3.2 GB/s Memory Slow Fast Int Fast Int FP Move FP Gen

Integer Register File / Bypass Network FP Register / Bypass

L2 Slow ALU 2xAGU 2xALU 2xALU Ld/St FP Fmul Fmul Fmul Cache Complex Simple Simple FAdd FAdd FAdd Address Move FAdd Instr. Instr. Instr. unit 64GB/s 256-bit wide L1 Data Cache 8Kbyte 4-way Intel®

Copyright © 2002 Intel Corporation. PDX Execution Trace Cache

1 cmp 2 br -> T1 .. ... (unused code) TraceTrace CacheCache DeliveryDelivery T1: 3 sub 4 br -> T2 1 cmp 2 br T1 3 T1: sub .. ... (unused code) 4 br T2 5 mov 6 sub T2: 5 mov 7 br T3 8 T3:add 9 sub 6 sub 7 br -> T3 10 mul 11 cmp 12 br T4 .. ... (unused code) T3: 8 add 9 sub 10 mul 11 cmp 12 br -> T4

Intel®

Copyright © 2002 Intel Corporation. PDX ExecutionExecution TraceTrace CacheCache

P6P6 MicroarchitectureMicroarchitecture TraceTrace CacheCache DeliveryDelivery

1 cmp 2 br T1 1 cmp 2 br T1 3 T1: sub 4 br T2 5 mov 6 sub 3 T1: sub 4 br T2 7 br T3 8 T3:add 9 sub 10 mul 11 cmp 12 br T4 5 mov 6 sub 7 br T3

8 T3:add 9 sub 10 mul 11 cmp 12 br T4 BW = 1.5 uops/ns BW = 6 uops/ns

Intel®

Copyright © 2002 Intel Corporation. PDX InsideInside thethe ExecutionExecution TraceTrace CacheCache

Way 0 Way 1 Way 2 Way 3 Set 0 Instruction Set 1 head Pointer 0x0900 Set 2 body 1 Set 3 body 2 Set 4 tail

head cmp, br T1, T1:sub, br T2, mov, sub body 1 br T3, T3:add, sub, mul, cmp, br T4

body 2 T4:add, sub, mov, add, add, mov

tail add, sub, mov, add, add, mov

Intel®

Copyright © 2002 Intel Corporation. PDX SelfSelf ModifyingModifying CodeCode

z ProgramsPrograms thatthat modifymodify thethe instructioninstruction streamstream thatthat isis beingbeing executedexecuted z VeryVery commoncommon inin Java*Java* codecode fromfrom JITsJITs z RequiresRequires hardwarehardware mechanismsmechanisms toto maintainmaintain consistencyconsistency

Intel® *Other names and brands may be claimed as the property of others. Copyright © 2002 Intel Corporation. PDX SelfSelf ModifyingModifying CodeCode

z TheThe hardwarehardware needsneeds toto handlehandle twotwo basicbasic cases:cases: ––StoresStores thatthat writewrite toto instructionsinstructions inin thethe TraceTrace CacheCache ––InstructionInstruction fetchesfetches thatthat hithit pendingpending storesstores – Speculative – Committed

Intel®

Copyright © 2002 Intel Corporation. PDX CaseCase 1:1: StoresStores toto cachedcached instructionsinstructions

“in use” bits addr addr

Instruction TLB (128 entries)

Execution Core Trace Cache addr

Data TLB Store’s Physical Address

Intel®

Copyright © 2002 Intel Corporation. PDX CaseCase 2:2: FetchesFetches toto pendingpending storesstores

addr addr

Instruction Instruction Pointer TLB (128 entries)

Committed Write Store addr Combining Buffer Please Speculative Buffer Re-Fetch Please Flush Pipeline Please Re-Fetch Execution Core

Intel®

Copyright © 2002 Intel Corporation. PDX ExecutionExecution TraceTrace CacheCache

z ProvidesProvides higherhigher bandwidthbandwidth forfor higherhigher frequencyfrequency corecore z ReducesReduces fetchfetch latencylatency z RequiresRequires newnew fundamentallyfundamentally newnew algorithmsalgorithms

Intel®

Copyright © 2002 Intel Corporation. PDX PerformancePerformance

z HighHigh bandwidthbandwidth frontfront endend z LowLow latencylatency corecore z LowerLower memorymemory latencylatency LowLowLow LatencyLatencyLatency CoreCoreCore

Intel®

Copyright © 2002 Intel Corporation. PDX DataData SpeculationSpeculation

z UseUse datadata beforebefore wewe areare suresure itit isis validvalid – Lowers effective LD latency – Fast ALUs in want fast LDs – Ratio of LD latency to ADD latency is important if 1 in 5 uops is a LD z AsAs pipelinespipelines getget deeper,deeper, datadata speculationspeculation getsgets moremore importantimportant – Number of cycles saved /w data speculation increases as pipeline depth increases

Intel®

Copyright © 2002 Intel Corporation. PDX L1L1 DataData CacheCache

Dynamic BTB Instruction TLB System 64-bit wide 4K Entries Bus Instruction Decoder

Bus Micro Execution Trace Trace Cache BTB Interface Instruction Cache 12K µops 512 Entries Unit Sequencer Allocator / Register Renamer Quad Memory µop Queue Integer/Floating Point µop Queue Pumped Integer Schedulers Floating Point Schedulers 3.2 GB/s Memory Slow Fast Int Fast Int FP Move FP Gen

Integer Register File / Bypass Network FP Register / Bypass

L2 Slow ALU 2xAGU 2xALU 2xALU Ld/St FP Fmul Fmul Fmul Cache Complex Simple Simple FAdd FAdd FAdd Address Move FAdd Instr. Instr. Instr. unit 64GB/s 256-bit wide L1 Data Cache 8Kbyte 4-way L1 Data Cache 8Kbyte 4-way Intel®

Copyright © 2002 Intel Corporation. PDX L1L1 CacheCache isis >3x>3x FasterFaster

z P6:P6: ––33 clocksclocks @@ 1GHz1GHz

3ns z P4P:P4P: ––22 clocksclocks @@ 2GHz2GHz

1ns

LowerLowerLower LatencyLatencyLatency isisis HigherHigherHigher PerformancePerformancePerformance Intel®

Copyright © 2002 Intel Corporation. PDX L1L1 DataData CacheCache

Pipeline Stages

VA tag 15:0 data VA Data 31:16 Fast SB 2x Clock

VA TLB TAG 31:0

Slow SB

replay replay replay Intel®

Copyright © 2002 Intel Corporation. PDX L1L1 DataData CacheCache

VA tag 15:0 data VA Data 31:16 Fast SB 2x Clock

VA TLB TAG 31:0

Slow SB

Way Predictor (Tag array) Data array

Way Hit

… select … (Replay)

10:6 15:11 19:16 Intel®

Copyright © 2002 Intel Corporation. PDX AA DigressionDigression onon StoresStores

z TwoTwo componentscomponents toto aa store:store: – STA: address computation – STD: data piece z HybridHybrid uOPuOP – Single uOP in the front, back ends – Two uOPs in the middle

Intel®

Copyright © 2002 Intel Corporation. PDX MemoryMemory DisambiguationDisambiguation

Ld EAX <- AddrA STA AddrB, STD DataB store

z IfIf thethe storestore isis olderolder ––AndAnd AddrAAddrA == AddrBAddrB ––ThenThen thethe loadload mustmust getget DataBDataB z DependenciesDependencies cancan notnot bebe resolvedresolved untiluntil executionexecution

Intel®

Copyright © 2002 Intel Corporation. PDX MemoryMemory DisambiguationDisambiguation Option 1 Option 2 Option 3

predictor

LD A LD A LD A

STA B STD B STA B STD B STA B STD B

Loads wait for: Loads wait for: Predict All older STAs AND All older STAs Specific older STAs All older STDs Specific older STDs No Recovery STD Recovery Complex Recovery K7?K7? P4P4 EV8EV8 Intel®

Copyright © 2002 Intel Corporation. PDX ExampleExample ML=0

LD1 XOR z STST writeswrites newnew valuevalue STA, STD ML=1 z LD2LD2 forwardsforwards ? z BranchBranch resolvesresolves LD2 basedbased onon newnew valuevalue

BR JMP if ML=0

Intel®

Copyright © 2002 Intel Corporation. PDX ExampleExample ML=0

LD1 XOR LD misses, replays

STA, STD ML=1 STA dependent, replays

?

LD2 LD hits, gets old value

BR mispredicts BR JMP if ML=0

Intel®

Copyright © 2002 Intel Corporation. PDX ExampleExample

LD1 XOR z IfIf LD2LD2 dependsdepends onon thethe STA,STA, theythey areare usuallyusually partpart ofof thethe samesame STA, STD dataflowdataflow graphgraph z IfIf thethe STASTA replays,replays, thethe LD2 LDLD usuallyusually hashas anan addressaddress dependencedependence

BR

Intel®

Copyright © 2002 Intel Corporation. PDX CautiousCautious ModeMode

z Normally,Normally, aggressivelyaggressively scheduleschedule z IfIf aa largelarge numbernumber ofof problemsproblems occuroccur enterenter CautiousCautious ModeMode z InIn CautiousCautious Mode,Mode, branchesbranches waitwait forfor datadata toto bebe nonnon--speculativespeculative ––IncreasesIncreases branchbranch mispredictionmisprediction latencylatency ––CompletelyCompletely eliminateseliminates problemsproblems

Intel®

Copyright © 2002 Intel Corporation. PDX CautiousCautious Mode:Mode: ImplementationImplementation

z A simple state machine cleans up the outliers – Out of 2200 traces, 3 traces speedup >20% – The other traces are unaffected – Average performance improvement < 0.1%

Intel®

Copyright © 2002 Intel Corporation. PDX PerformancePerformance

z HighHigh bandwidthbandwidth frontfront endend z LowLow latencylatency corecore z LowerLower memorymemory latencylatency

LowerLowerLower MemoryMemoryMemory LatencyLatencyLatency

Intel®

Copyright © 2002 Intel Corporation. PDX ReducingReducing LatencyLatency

z AsAs frequencyfrequency increases,increases, itit isis importantimportant toto improveimprove thethe performanceperformance ofof thethe memorymemory subsystemsubsystem z DataData PrefetchPrefetch LogicLogic ––WatchesWatches processorprocessor memorymemory traffictraffic ––LooksLooks forfor patternspatterns ––InitiatesInitiates accessesaccesses

Intel®

Copyright © 2002 Intel Corporation. PDX DataData PrefetchPrefetch LogicLogic

Dynamic BTB Instruction TLB System 64-bit wide 4K Entries Bus Instruction Decoder

Bus Micro Execution Trace Trace Cache BTB Interface Instruction Cache 12K µops 512 Entries Unit Sequencer Allocator / Register Renamer QuadData Memory µop Queue Integer/Floating Point µop Queue PumpePrefetchd Integer Schedulers Floating Point Schedulers 3.2Logic GB/s Memory Slow Fast Int Fast Int FP Move FP Gen

Integer Register File / Bypass Network FP Register / Bypass

L2 Slow ALU 2xAGU 2xALU 2xALU Ld/St FP Fmul Fmul Fmul Cache Complex Simple Simple FAdd FAdd FAdd Address Move FAdd Instr. Instr. Instr. unit 64GB/s 256-bit wide L1 Data Cache 8Kbyte 4-way Intel®

Copyright © 2002 Intel Corporation. PDX DataData PrefetchPrefetch LogicLogic

Instruction 64 Buffers

Instruction Fetch L2 Data Advanced Bus Prefetch Transfer Queue Logic Cache

L1 Data Cache 256

Prefetch logic first checks L2 cache and then fetches lines from memory that miss L2 cache. Intel®

Copyright © 2002 Intel Corporation. PDX DataData PrefetchPrefetch LogicLogic

z WatchesWatches forfor streamingstreaming memorymemory accessaccess patternspatterns – Can track 8 multiple independent streams – Loads, Stores or Instruction – Forward or Backward z AnalysisAnalysis onon 3232 bytebyte cachecache lineline granularitygranularity z LooksLooks forfor “mostly”“mostly” completecomplete streams:streams: – Access to cache lines 1,2,3,4,5,6 will prefetch – Access to cache lines 1,2, 4,5,6 will prefetch – 1, ,3, , ,6, , ,9 will not prefetch

Intel®

Copyright © 2002 Intel Corporation. PDX PerformancePerformance

z HighHigh bandwidthbandwidth frontfront endend z LowLow latencylatency corecore z LowerLower memorymemory latencylatency

Intel®

Copyright © 2002 Intel Corporation. PDX Intel®

Copyright © 2002 Intel Corporation. PDX