New Intel® Core™ Microarchitecture
March 8, 2006
Stephen L. Smith, Vice President, Digital Enterprise Group
Bob Valentine, Architect, Intel Architecture Group

Agenda
• Multi-core Update and New Microarchitecture Level Set
• New Intel® Core™ Microarchitecture
• Wrap Up
Intel Multi-core Roadmap – Updates since Fall IDF

Ramping Multi-core Everywhere
All products and dates are preliminary and subject to change without notice.

Refresher: What is Multi-Core?
Two or more independent execution cores in the same processor

Specific implementations will vary over time, driven by product implementation and manufacturing efficiencies
• Best mix of product architecture and volume mfg capabilities
  – Architecture: Shared Caches vs. Independent Caches
  – Mfg capabilities: volume packaging technology
• Designed to deliver performance, OEM and end user experience

Single die (monolithic) based processor
• Example: 90nm Pentium® D Processor (Smithfield): Core0, Core1, Front Side Bus
• Example: Intel Core™ Duo Processor (Yonah): Core0, Core1, Front Side Bus
Multi-Chip Processor
• Example: 65nm Pentium D Processor (Presler): Core0, Core1, Front Side Bus
*Not representative of actual die photos or relative size

Intel® Core™ Micro-architecture
*Not representative of actual die photo or relative size

Intel Multi-core Roadmap
Intel® Core™ Microarchitecture Based Platforms

Segment | Platform | 2006 | 2007
MP Servers | Caneland Platform (2007) | | Tigerton (QC) (2007)
DP Servers / DP Workstation | Bensley Platform (Q2’06) / Glidewell Platform (Q2’06) | Woodcrest (Q3’06) | Clovertown (QC) (Q1’07)
UP Servers / UP Workstation | Kaylo Platform (Q3’06) / Wyloway Platform (Q3’06) | Conroe (Q3’06) | Kentsfield (QC) (Q1’07)
Desktop - Home | Bridge Creek Platform (Mid’06) | Conroe (Q3’06) | Kentsfield (QC) (Q1’07)
Desktop - Office | Averill Platform (Mid’06) | Conroe (Q3’06) |
Mobile Client | Napa Platform (Q1’06) | Merom (2H’06) |

QC refers to Quad-Core.
Note: only Intel® Core™ microarchitecture based processors listed.
All products and dates are preliminary and subject to change without notice.

Intel® Core™ Microarchitecture Performance
Delivering both industry leading performance and performance/watt
• Conroe: >40% improvement in performance¹ and >40% reduction in power²
  – As compared to today’s high-end Pentium® D processor 950 (formerly Presler)
• Woodcrest: >80% improvement in performance¹ and >35% reduction in power²
  – As compared to today’s high-end Dual-Core Intel® Xeon® processor 2.8GHz (formerly Paxville DP)
• Merom: Extends the already significant performance and performance/watt leadership delivered with today's Intel® Core™ Duo processor with greater than 20% additional performance¹ improvement
  – As compared to today’s high-end Intel® Core™ Duo processor (formerly Yonah)

¹ Estimated SPECint*_rate_base2000
² Expected reduction in TDP

Agenda
• Multi-core Update and New Micro-architecture level set
• New Intel® Core™ Microarchitecture
• Summary
Inside the Intel® Core™ Microarchitecture

Agenda
– Multi-core Update and New Micro-architecture level set
– New Intel® Core™ Microarchitecture
– Intel Microarchitecture History
– Intel® Core™ Microarchitecture Design Goals and Roadmap
– Processor Architecture 101
– Intel® Core™ Microarchitecture
– Software Implications
– Wrap Up

Microarchitecture History

New Microarchitecture Coming in 2006
Intel® Core™ Microarchitecture: Design Goals
• Deliver world class performance combined with superior energy/power efficiency
  – Existing and emerging applications and uses
  – Greater performance and performance/watt
  – Optimized for Intel Multi-core platforms
• Deliver single foundation for optimized processors across each segment and power envelope
  – Optimized for mobile, desktop and server segments

Driving Performance and Performance/Watt Leadership
Processor Architecture 101

Delivered Performance = Frequency * Instructions Per Cycle (IPC)
Power ∝ Cdynamic * V * V * Frequency

Goal is higher performance and lower power
• Cdynamic is roughly a product of area and activity: “how many bits” * “how much do they toggle”
• Frequency is proportional to voltage, so frequency reduction coupled with voltage reduction results in cubic reduction in power.
• Higher IPC usually results in wider data paths and/or more speculation, directly increasing Cdynamic
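The cubic relationship can be checked with a few lines of arithmetic. The sketch below is only an illustration of the slide's model (Power ∝ Cdynamic * V * V * Frequency, with voltage assumed to scale linearly with frequency); the function names and the 0.8 scaling factor are hypothetical and not taken from the presentation.

```python
def relative_power(freq_scale, cdyn_scale=1.0):
    """Relative dynamic power under Power ∝ Cdynamic * V * V * Frequency.

    Assumption (from the slide): voltage scales in proportion to frequency,
    so scaling frequency by s also scales V by s.
    """
    voltage_scale = freq_scale
    return cdyn_scale * voltage_scale * voltage_scale * freq_scale


def relative_performance(freq_scale, ipc_scale=1.0):
    """Relative performance under Delivered Performance = Frequency * IPC."""
    return freq_scale * ipc_scale


if __name__ == "__main__":
    s = 0.8  # hypothetical 20% frequency (and voltage) reduction
    print(f"performance x{relative_performance(s):.3f}, power x{relative_power(s):.3f}")
```

Under these assumptions a 20% frequency reduction costs 20% of the performance but brings power down to roughly half, which is why raising IPC without raising Cdynamic is attractive.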
Intel® Core™ Microarchitecture: Block Diagram Walkthrough

[Block diagram: Instruction Fetch and PreDecode; Instruction Queue; Decode with uCode ROM; Rename/Alloc; Retirement Unit (ReOrder Buffer); Schedulers; execution units (three ALUs, Branch, FAdd, FMul, MMX/SSE, FPmove, Load, Store); L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; up to 10.4 Gb/s FSB]

• In order: instruction fetch, instruction decode, micro-op rename, micro-op allocate
• Out of order: micro-op schedule, micro-op execute
• Out of order: memory pipelines; memory order unit maintains architectural ordering requirements
• In order: micro-op retirement, fault handling; Retirement Unit maintains illusion of in order instruction retirement

• Wide Dynamic Execution
• Advanced Digital Media Boost
• Smart Memory Access
• Advanced Smart Cache
• Intelligent Power Capability

New, State-of-the-Art, Microarchitecture

Wide Dynamic Execution
• Start with Instruction Fetch: four(+) instructions / cycle
  – >33% increase over other x86 processors
• Instructions converted to micro-ops (uops)
  – ~1 uop per x86 instruction
Micro-op Reduction (recall Processor 101)

Delivered Performance = Frequency * Instructions Per Cycle (IPC)
Power = Cdynamic * V * V * Frequency

Fewer uops per instruction allows IPC to be increased while lowering Cdynamic (fewer bits and less toggling)
Techniques for Micro-op Reduction

• ESP Tracker (Extended Stack Pointer)
  – Execute Stack Pointer updates in dedicated hardware
  – Intel® Core™ microarchitecture increases BW by 33%*
• Micro-Op Micro-Fusion
  – Single uop representation of “multi-uop” instruction
  – Intel® Core™ microarchitecture increases # of instructions*
• Macro-Fusion
  – New technique in Intel® Core™ microarchitecture (more on next pages)

* Techniques pioneered on Intel® Pentium® M processors
New: Macro-Fusion
• Represent common x86 instruction pairs in single micro-op
  – CMP or TEST + Conditional Branch (Jcc)
• Enhanced Arithmetic Logic Unit (ALU) for macro-fusion
  – Single dispatch: efficiency
  – Single cycle execution: performance
Without Macro-Fusion
Instruction Queue (oldest first): load eax, [mem1]; cmp eax, [mem2]; jne targ; store [mem3], ebx; inc ecx
• Read four instructions from Instruction Queue
• Each instruction gets decoded into separate uops
• Decoders dec0 to dec3:
  – Cycle 1: load eax, [mem1]; cmp eax, [mem2]; jne targ; store [mem3], ebx
  – Cycle 2: inc ecx

With Intel’s New Macro-Fusion
Instruction Queue (oldest first): load eax, [mem1]; cmp eax, [mem2]; jne targ; store [mem3], ebx; inc ecx
• Read five instructions from Instruction Queue
• Send fusable pair to single decoder
• Single uop represents two instructions
• Decoders dec0 to dec3:
  – Cycle 1: load eax, [mem1]; cmpjne eax, [mem2], targ; store [mem3], ebx; inc ecx
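The decode example on these two slides can be replayed with a toy model. This is a minimal sketch, assuming four decoders that each accept one slot per cycle and a fusion rule limited to the pair named earlier (CMP or TEST followed by a conditional branch). The helper names are hypothetical, the `load`/`store` mnemonics are the slide's pseudo-instructions, and real hardware has further restrictions that are not modelled here.

```python
FUSABLE_FIRST = {"cmp", "test"}  # first halves of the pairs named on the slide


def mnemonic(instr):
    """Return the lower-cased mnemonic of an instruction string."""
    return instr.split()[0].lower()


def is_conditional_branch(instr):
    """Treat any j* mnemonic except the unconditional jmp as a Jcc."""
    m = mnemonic(instr)
    return m.startswith("j") and m != "jmp"


def macro_fuse(instrs):
    """Merge each CMP/TEST + Jcc pair into a single decode slot."""
    slots, i = [], 0
    while i < len(instrs):
        if (mnemonic(instrs[i]) in FUSABLE_FIRST
                and i + 1 < len(instrs)
                and is_conditional_branch(instrs[i + 1])):
            slots.append(instrs[i] + " ; " + instrs[i + 1])
            i += 2
        else:
            slots.append(instrs[i])
            i += 1
    return slots


def decode_cycles(slots, decoders=4):
    """Cycles needed when each decoder takes one slot per cycle (ceiling division)."""
    return -(-len(slots) // decoders)


queue = [
    "load eax, [mem1]",
    "cmp eax, [mem2]",
    "jne targ",
    "store [mem3], ebx",
    "inc ecx",
]

if __name__ == "__main__":
    print("without fusion:", decode_cycles(queue), "cycles")
    print("with fusion:   ", decode_cycles(macro_fuse(queue)), "cycles")
```

In this model the five instructions need two decode cycles without fusion and one with it, matching the Cycle 1 / Cycle 2 picture on the slides.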
Macro-Fusion (cont)
[Diagram: Scheduler holding "cmpjne eax, mem2, targ" feeds Execution and Branch Eval; flags and target go to Write back]
• Lower latency
• Increased bandwidth
• “Virtually” increase storage

Macro-fusion makes the machine behave as if it is wider and deeper, without the additional cost
Enabling Greater Performance & Efficiency

Wide Dynamic Execution
• 4 wide rename
• 4 wide micro-op execution
• 4 wide retire
• Deeper out of order storage
• 32 discontiguous micro-ops considered for dispatch per cycle

33% Wider Than Previous Generation

Advanced Digital Media Boost
• 128-bit packed Multiply, plus
• 128-bit packed Add, plus
• 128-bit packed Load, plus
• 128-bit packed Store, plus
• (how about a CMPJCC)
2x Compute Throughput / Clock

Advanced Digital Media Boost
Let's scale a vector: B[i] := A[i] * C
Assume both microarchitectures have a 128-bit path from L1 to the processor.

[Animation: vector A streams through the multiplier into vector B, comparing an existing processor with Intel® Core™ uarch Advanced Digital Media Boost]

Existing Processor
• Multiply can’t keep up with load bandwidth
• Existing implementations eventually stall the load pipe waiting for multiplier

Intel® Core™ uarch Advanced Digital Media Boost
• Multiplier operates on all data
• ...handles all the memory data
• Load pipe is free to advance
• ...keeps pipeline free for computations
• ...maintains 2X throughput compared to prior implementations
• 8 Single Precision Flops/cycle
• 4 Double Precision Flops/cycle
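The flops/cycle figures follow from the 128-bit datapath width. The sketch below assumes one 128-bit packed multiply and one 128-bit packed add can start in the same cycle, as the "packed Multiply plus packed Add" list suggests; the function name and parameters are illustrative only.

```python
def flops_per_cycle(vector_bits, element_bits, packed_ops_per_cycle):
    """Peak floating-point operations per cycle for packed (SIMD) execution.

    Each packed operation processes vector_bits / element_bits elements.
    """
    lanes = vector_bits // element_bits
    return lanes * packed_ops_per_cycle


# Assumption: one packed multiply plus one packed add issue per cycle.
single_precision = flops_per_cycle(128, 32, packed_ops_per_cycle=2)
double_precision = flops_per_cycle(128, 64, packed_ops_per_cycle=2)

if __name__ == "__main__":
    print("single precision flops/cycle:", single_precision)
    print("double precision flops/cycle:", double_precision)
```

Halving `vector_bits` to 64 in the same formula gives half the throughput per clock, which is one way to read the 2x comparison with prior implementations.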
Leading Compute Density
2x Compute Throughput / Clock

Smart Memory Access
• Memory Disambiguation
• Improved Prefetchers

Hiding Latency to Memory Subsystem

Smart Memory Access – Goal
[Diagram: System Bus; Smart-Shared L2 Cache; CORE1 and CORE2, each with an L1 Data Cache]
WHEN Ensure data can be used as early as possible
WHERE Ensure user of data has it as close as possible
Hiding Latency to Memory Subsystem

Without Memory Disambiguation
[Diagram: program order Store1 Y, Load2 Y, Store3 W, Load4 X; memory holds Data W, Data X, Data Y, Data Z; execution order Store1, Load2, Store3, Load4]
• Load4 must WAIT until previous stores complete before it can execute

Subsequent Loads Must Wait

Smart Memory Access: With Intel’s New Memory Disambiguation
[Diagram: same program; execution order Load4, Store1, Load2, Store3]
• Loads can decouple from Stores
• Load4 can get its data FIRST

Solving the Problem of When
Smart Memory Access: Memory Disambiguation
• Memory Disambiguation predictor
  – Loads that are predicted NOT to forward from preceding store are allowed to schedule as early as possible, increasing the performance of OOO memory pipelines
• Disambiguated loads checked at retirement
  – Extension to existing coherency mechanism
  – Invisible to software and system
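One way to picture such a predictor is a small table of confidence counters keyed by the load's instruction address. The sketch below is a hypothetical toy, not a description of Intel's implementation: the class name, the threshold of 3, and the reset-on-conflict policy are all assumptions chosen for illustration.

```python
class DisambiguationPredictor:
    """Toy predictor: per-load confidence that the load does not depend on an older store."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # hypothetical confidence needed before hoisting
        self.confidence = {}        # load address -> saturating counter

    def predict_no_forward(self, load_pc):
        """True if the load may be scheduled ahead of older, unresolved stores."""
        return self.confidence.get(load_pc, 0) >= self.threshold

    def retire(self, load_pc, forwarded_from_store):
        """Update at retirement, once the real dependence is known."""
        if forwarded_from_store:
            # The load did need data from an older store: drop all confidence.
            self.confidence[load_pc] = 0
        else:
            current = self.confidence.get(load_pc, 0)
            self.confidence[load_pc] = min(self.threshold, current + 1)


if __name__ == "__main__":
    predictor = DisambiguationPredictor()
    load_pc = 0x401000  # hypothetical load address
    for _ in range(3):
        predictor.retire(load_pc, forwarded_from_store=False)
    print("hoist load early:", predictor.predict_no_forward(load_pc))
```

The retirement-time update mirrors the slide's point that disambiguated loads are checked at retirement; a wrong early guess would additionally require re-executing the load, which this toy does not model.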
Hiding Latency to Memory Subsystem

Smart Memory Access: Prefetchers
[Animation: Load1 (oldest) through Load4 (youngest) accessing the L1 Data Cache, the Shared L2 Data Cache, and memory]
• Memory is too far away
• Caches are closer when they have the data
• Prefetchers detect applications' data reference patterns
• And bring the data closer to data consumer
Solving the Problem of Where
Minimizing Memory Latency

Prefetchers and Multi-Core
[Block diagram as in the walkthrough, with the 2M/4M shared L2 Cache and up to 10.4 Gb/s FSB]
• Three Individual Prefetchers per Core
• Two L2 Prefetchers dynamically shared

Smart Memory Access
• 8 Prefetchers per two-core processor
  – 2 data and 1 instruction prefetcher per core, able to handle multiple simultaneous patterns
  – 2 prefetchers in the L2 cache, tracking multiple patterns per core
• Prefetchers monitor demand traffic and regulate “aggression”
• Implementation “knobs” allow platform and segment specific settings tailored to applications and usage models
Data Is Where You Need It, When You Need It
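Detecting a data reference pattern can be illustrated with a simple stride detector. This is a generic textbook-style sketch, not the prefetcher design from the presentation: the class name, the single-stream tracking, and the `degree` parameter (standing in for how aggressive the prefetcher is) are hypothetical.

```python
class StridePrefetcher:
    """Toy single-stream stride detector."""

    def __init__(self, degree=2):
        self.degree = degree      # how many addresses ahead to prefetch (assumed knob)
        self.last_addr = None
        self.last_stride = None

    def observe(self, addr):
        """Record a demand access and return the addresses to prefetch (possibly none)."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Prefetch only when the same non-zero stride is seen twice in a row.
            if stride != 0 and stride == self.last_stride:
                prefetches = [addr + stride * k for k in range(1, self.degree + 1)]
            self.last_stride = stride
        self.last_addr = addr
        return prefetches


if __name__ == "__main__":
    prefetcher = StridePrefetcher()
    for address in (0, 64, 128, 192):
        print(address, "->", prefetcher.observe(address))
```

Lowering `degree` when demand traffic is heavy would be one simple way to regulate aggression in a model like this.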
Advanced Smart Cache: Multi-core Optimized
[Diagram: Core 1 and Core 2 sharing one L2 Cache]
All the Smart Cache benefits:
• L2 can adapt to each core’s load
• Fast data sharing
• No replicated data
Plus:
• 2X BW to L1 caches

Shared & Multi-Core Optimized, with 2x Bandwidth
Advanced Smart Cache: Dynamic Cache Allocation
[Diagram: Independent Cache (today): Core1 and Core2, each with its own L2 Cache. Advanced Smart Cache: Core1 and Core2 sharing one L2 Cache]
Shared Cache adapts to mismatched loads. Independent Cache can thrash heavy app even when other cache is under-utilized
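The thrashing effect can be reproduced with a toy cache model. The sketch below assumes fully associative LRU caches measured in lines, a heavy core cycling through three lines, and a light core reusing one line; the capacities (2 lines per independent cache, 4 lines shared) and all names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)    # hit: mark as most recently used
            return
        self.misses += 1
        self.lines[line] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line


def run(rounds=30):
    """Return (independent_misses, shared_misses) for mismatched loads."""
    independent = {"core1": LRUCache(2), "core2": LRUCache(2)}
    shared = LRUCache(4)
    for i in range(rounds):
        # core1 is the heavy app (3-line working set); core2 reuses one line.
        for core, line in (("core1", i % 3), ("core2", 0)):
            independent[core].access((core, line))
            shared.access((core, line))
    total_independent = independent["core1"].misses + independent["core2"].misses
    return total_independent, shared.misses


if __name__ == "__main__":
    independent_misses, shared_misses = run()
    print("independent caches:", independent_misses, "misses")
    print("shared cache:      ", shared_misses, "misses")
```

With the same total capacity, the heavy core misses on every access in its private cache while the light core's cache sits mostly empty; the shared cache only takes the initial cold misses.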
Advanced Smart Cache: Efficient Data Sharing
[Diagram: Independent Cache: Core1 and Core2, each with its own L2 Cache, above the FSB and main memory. Advanced Smart Cache: Core1 and Core2 with one shared L2 Cache, above the FSB and main memory]

2X L2 to L1 Bandwidth

Intelligent Power Capability
Extending the power management architecture
• Intel® Pentium® M processor innovated a new power management architecture
• Intel® Core™ Duo extended the Pentium® M processor capability to multi-core
New Power Features within each processor core
• Ultra fine-grained power control
• Split Busses
• Platformization of Power Management Architecture
Enhancing Energy Efficiency

Intelligent Power Capability: Ultra Fine Grained Power Control
• Even during periods of high performance execution, many parts of the chip core can be shut off.
• Example could be a SW memory initialization executing from front end with IQ operating as loop cache.

Intelligent Power Capability: Split Busses (core power feature)
• Many buses are sized for worst-case data
  – x86 instruction of 15 bytes
  – ALU can write-back 128 bits
• By splitting buses to deal with varying data widths, we can gain the performance benefit of bus width while maintaining Cdynamic closer to thinner buses
Improved Energy Efficiency

Platformization of Power Management Architecture
• Integrating best features from Server and Mobile products
• Exposing more to the system
• PSI-2: Power Status Indicator (Mobile)
• DTS: Digital Thermal Sensors
• PECI: Platform Environment Control Interface

Power Status Indicator (Mobile)
• Processor communicates power consumption to external platform components
  – Optimization of voltage regulator efficiency
  – Load line and power delivery efficiency
[Diagram: PSI-2 / VID signals between the processor and the VR]
Enabling Efficient Processor and Platform Thermal Control…

DTS – Digital Thermal Sensor
• Several thermal sensors are located within the Processor to cover all possible hot spots
• Dedicated logic scans the thermal sensors and measures the maximum temperature on the die at any given time
• Accurately reporting Processor temperature enables advanced thermal control schemes
[Diagram: Core 1 and Core 2, each with LPF and DTS Logic; DTS control and status]

Platform Environment Control Interface (PECI)
• Processor provides its temperature reading over a multi-drop single-wire bus, allowing efficient platform thermal control
[Diagram: PROC #1, PROC #2, PROC #3 connected over PECI to a Manager controlling the Processor Fan, Auxiliary Fan, Chassis Fan 1, and Chassis Fan 2]
Microarchitecture Feature Summary

• Wide Dynamic Execution: 33% wider pipes (4 vs. 3) and greater efficiency
• Advanced Digital Media Boost: 2x compute throughput / clock
• Smart Memory Access: Minimizing latency – Data Where & When needed
• Advanced Smart Cache: Multi-Core optimized, shared with 2x bandwidth
• Intelligent Power Capability: Improved energy efficient performance

New, State-of-the-Art, Microarchitecture
Intel® Core™ Microarchitecture and Software
• Software consistency across application space
  – Wide Dynamic Execution will provide generic performance gains
  – Smart Memory Access targets memory intensive apps
  – Advanced Digital Media Boost provides a leap in capability for media and floating point apps
  – Multi-Core and Advanced Smart Cache further improve the growing number of multi-threaded applications
• Software consistency across market segments
  – New apps and optimizations can target single microarchitecture

Immediate Performance Increase Across Applications and Segments

Summary
Continuing to drive aggressive multi-core ramp
– Dual-core ramp in 2006, quad-core starts in early 2007

Intel® Core™ microarchitecture delivers leading performance and performance/watt
– Conroe: >40% performance increase¹ / >40% less power
– Woodcrest: >80% performance increase¹ / >35% less power
– Mobile: Extending leadership delivered with Intel® Core™ Duo with >20% performance increase¹

On track for product introductions starting in Q3’06
– Based upon new Intel® Core™ microarchitecture

¹ Estimated SPECint* rate