New Intel® Core™ Microarchitecture

March 8, 2006

Stephen L. Smith, Vice President, Digital Enterprise Group
Bob Valentine, Architect, Architecture Group

Agenda

• Multi-core Update and New Level Set
• New Intel® Core™ Microarchitecture
• Wrap Up

Intel Multi-core Roadmap – Updates since Fall IDF

Ramping Multi-core Everywhere

All products and dates are preliminary and subject to change without notice.

Refresher: What is Multi-Core?

Two or more independent execution cores in the same processor

Specific implementations will vary over time, driven by product implementation and manufacturing efficiencies
• Best mix of product architecture and volume manufacturing capabilities
– Architecture: shared caches vs. independent caches
– Manufacturing capabilities: volume packaging technology
• Designed to deliver performance, OEM and end user experience

[Figure: three dual-core examples, each with Core0 and Core1 on a Front Side Bus. Single die (monolithic) based processors: 90nm Pentium® D Processor (Smithfield) and Intel® Core™ Duo Processor (Yonah). Multi-chip processor: 65nm Pentium® D Processor (Presler).]

*Not representative of actual die photos or relative size

Intel® Core™ Micro-architecture

Intel Multi-core Roadmap

Intel® Core™ Microarchitecture Based Platforms

Each entry lists the segment, the platform, and the processors for 2006 and 2007.
• MP Servers: Caneland Platform (2007). 2007: Tigerton (QC) (2007)
• DP Servers / DP Workstation: Bensley Platform (Q2’06) / Glidewell Platform (Q2’06). 2006: Woodcrest (Q3’06). 2007: Clovertown (QC) (Q1’07)
• UP Servers / UP Workstation: Kaylo Platform (Q3’06) / Wyloway Platform (Q3’06). 2006: (Q3’06). 2007: Kentsfield (QC) (Q1’07)
• Desktop - Home: Bridge Creek Platform (Mid’06). 2006: Conroe (Q3’06). 2007: Kentsfield (QC) (Q1’07)
• Desktop - Office: Averill Platform (Mid’06). 2006: Conroe (Q3’06)
• Mobile Client: Napa Platform (Q1’06). 2006: (2H’06)

All products and dates are preliminary and subject to change without notice.
Note: only Intel® Core™ microarchitecture based processors listed. QC refers to Quad-Core.

Intel® Core™ Microarchitecture Performance

Delivering both industry leading performance and performance/watt
• Conroe: >40% improvement in performance (1) and >40% reduction in power (2)
– As compared to today’s high-end Pentium® D processor 950 (formerly Presler)
• Woodcrest: >80% improvement in performance (1) and >35% reduction in power (2)
– As compared to today’s high-end Dual-Core Intel® Xeon® processor 2.8GHz (formerly Paxville DP)
• Merom: extends the already significant performance and performance/watt leadership delivered with today's Intel® Core™ Duo processor, with greater than 20% additional performance (1) improvement
– As compared to today’s high-end Intel® Core™ Duo processor (formerly Yonah)

(1) Estimated SPECint*_rate_base2000
(2) Expected reduction in TDP

Agenda

• Multi-core Update and New Micro-architecture level set
• New Intel® Core™ Microarchitecture
• Summary

Inside the Intel® Core™ Microarchitecture

Agenda

– Multi-core Update and New Micro-architecture level set
– New Intel® Core™ Microarchitecture
– Intel Microarchitecture History
– Intel® Core™ Microarchitecture Design Goals and Roadmap
– Processor Architecture 101
– Intel® Core™ Microarchitecture
– Software Implications
– Wrap Up

Microarchitecture History

New Microarchitecture Coming in 2006


Intel® Core™ Microarchitecture: Design Goals
• Deliver world class performance combined with superior energy/power efficiency
– Existing and emerging applications and uses
– Greater performance and performance/watt
– Optimized for Intel Multi-core platforms
• Deliver single foundation for optimized processors across each segment and power envelope
– Optimized for mobile, desktop and server segments

Driving Performance and Performance/Watt Leadership

Processor Architecture 101

Delivered Performance = Frequency * Instructions Per Cycle (IPC)

Goal is higher performance and lower power

Power ∝ Cdynamic * V * V * Frequency

Cdynamic is roughly the product of area and activity: “how many bits” * “how much do they toggle”

Frequency is proportional to voltage, so frequency reduction coupled with voltage reduction results in a cubic reduction in power.
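The cubic relationship can be checked with a few lines of arithmetic. This is a minimal sketch of the slide's proportionality (Power ∝ Cdynamic * V * V * Frequency), assuming voltage and frequency are scaled by the same factor; the function names `relative_power` and `scaled_power` are made up for illustration and the result is a ratio, not a wattage.

```python
def relative_power(c_dynamic: float, voltage: float, frequency: float) -> float:
    """Dynamic power up to a constant factor: Cdynamic * V * V * Frequency."""
    return c_dynamic * voltage * voltage * frequency


def scaled_power(scale: float) -> float:
    """Power relative to baseline when V and f are both multiplied by `scale`.

    Assumption for illustration: voltage tracks frequency one-to-one.
    """
    baseline = relative_power(1.0, 1.0, 1.0)
    return relative_power(1.0, scale, scale) / baseline


if __name__ == "__main__":
    for scale in (1.0, 0.9, 0.8, 0.5):
        pct = scaled_power(scale) * 100
        print(f"V and f at {scale * 100:.0f}% of baseline: power at {pct:.1f}%")
```

Under these assumptions, running at 80% of the baseline voltage and frequency leaves roughly half the dynamic power, since 0.8 cubed is about 0.51.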

Higher IPC usually results in wider data paths and/or more speculation, directly increasing Cdynamic.


Intel® Core™ Microarchitecture: Block Diagram Walkthrough

[Block diagram, repeated on the following slides: Instruction Fetch and PreDecode; Instruction Queue; Decode with uCode ROM; Rename/Alloc; Retirement Unit (ReOrder Buffer); Schedulers; execution units ALU/Branch, ALU/FAdd and ALU/FMul, each with MMX/SSE and FPmove; Load; Store; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB at up to 10.4 Gb/s. Numeric labels 5, 4 and 4 appear on the pipeline.]

In order:
• Instruction fetch
• Instruction decode
• Micro-op rename
• Micro-op allocate

Out of order:
• Micro-op schedule
• Micro-op execute

Out of order memory pipelines:
• Memory order unit maintains architectural ordering requirements

In order:
• Micro-op retirement
• Fault handling
• Retirement Unit maintains the illusion of in-order instruction retirement

Intel® Core™ Microarchitecture
• Wide Dynamic Execution
• Advanced Digital Media Boost
• Smart Memory Access
• Advanced Smart Cache
• Intelligent Power Capability

New, State-of-the-Art, Microarchitecture

Wide Dynamic Execution
• Start with Instruction Fetch: four(+) instructions / cycle
• >33% increase over other processors
• Instructions converted to micro-ops (uops): ~1 uop per x86 instruction

Micro-op Reduction (recall Processor Architecture 101)

Delivered Performance = Frequency * Instructions Per Cycle (IPC)

Fewer uops per instruction allow IPC to be increased while lowering Cdynamic (fewer bits and less toggling)

Power ∝ Cdynamic * V * V * Frequency

Techniques for Micro-op Reduction

• ESP Tracker (Extended Stack Pointer)
– Execute Stack Pointer updates in dedicated hardware
– Intel® Core™ microarchitecture increases bandwidth by 33%*

• Micro-Op Micro-Fusion
– Single uop representation of “multi-uop” instruction
– Intel® Core™ microarchitecture increases the number of instructions that can be fused*

• Macro-Fusion
– New technique in Intel® Core™ microarchitecture (more on next pages)

* Techniques pioneered on Intel® Pentium® M processors

New: Macro-Fusion
• Represent common x86 instruction pairs in a single micro-op
– CMP or TEST + Conditional Branch (Jcc)
• Enhanced Arithmetic Logic Unit (ALU) for macro-fusion
– Single dispatch: efficiency
– Single cycle execution: performance

Without Macro-Fusion

Instruction Queue (oldest first):
load eax, [mem1]
cmp eax, [mem2]
jne targ
store [mem3], ebx
inc ecx

• Read four instructions from the Instruction Queue
• Each instruction gets decoded into separate uops

Decoders dec0 to dec3:
Cycle 1: load eax, [mem1] / cmp eax, [mem2] / jne targ / store [mem3], ebx
Cycle 2: inc ecx

With Intel’s New Macro-Fusion

Instruction Queue (oldest first):
load eax, [mem1]
cmp eax, [mem2]
jne targ
store [mem3], ebx
inc ecx

• Read five instructions from the Instruction Queue
• Send fusable pair to a single decoder
• Single uop represents two instructions

Decoders dec0 to dec3:
Cycle 1: load eax, [mem1] / cmpjne eax, [mem2], targ / store [mem3], ebx / inc ecx
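The two decode pictures above can be mimicked with a toy model. This is only a sketch under simplifying assumptions: four decode slots per cycle as in the diagram, a fusable pair defined as CMP or TEST immediately followed by a conditional jump, and the slide's pseudo-mnemonics (`load`, `store`) used as plain strings. The helper names are hypothetical and the model says nothing about real decoder restrictions.

```python
from typing import List

FUSABLE_FIRST = {"cmp", "test"}   # per the slide: CMP or TEST + Jcc
DECODERS_PER_CYCLE = 4            # dec0..dec3 in the diagram


def is_conditional_branch(instr: str) -> bool:
    """Treat any 'j..' mnemonic other than 'jmp' as a Jcc (simplification)."""
    op = instr.split()[0]
    return op.startswith("j") and op != "jmp"


def fuse(instrs: List[str]) -> List[str]:
    """Merge each CMP/TEST immediately followed by a Jcc into one decode slot."""
    out, i = [], 0
    while i < len(instrs):
        op = instrs[i].split()[0]
        has_next = i + 1 < len(instrs)
        if op in FUSABLE_FIRST and has_next and is_conditional_branch(instrs[i + 1]):
            out.append(instrs[i] + " ; " + instrs[i + 1])
            i += 2
        else:
            out.append(instrs[i])
            i += 1
    return out


def decode_cycles(slots: List[str]) -> int:
    """Cycles needed when up to DECODERS_PER_CYCLE slots decode per cycle."""
    return -(-len(slots) // DECODERS_PER_CYCLE)  # ceiling division


QUEUE = [
    "load eax, [mem1]",
    "cmp eax, [mem2]",
    "jne targ",
    "store [mem3], ebx",
    "inc ecx",
]

if __name__ == "__main__":
    print("without fusion:", decode_cycles(QUEUE), "cycles")
    print("with fusion:   ", decode_cycles(fuse(QUEUE)), "cycles")
```

In this toy model the five-instruction queue needs two decode cycles without fusion and one with it, matching the Cycle 1 / Cycle 2 labels on the slides.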

Macro-Fusion (cont)

[Diagram: the fused micro-op "cmpjne eax, mem2, targ" passes from the Scheduler to Execution, where Branch Eval sends flags and target to Write back.]

• Lower latency
• Increased bandwidth
• “Virtually” increase storage

Macro-fusion makes the machine behave as if it is wider and deeper, without the additional cost

Enabling Greater Performance & Efficiency

Wide Dynamic Execution
• 4 wide rename
• 4 wide micro-op execution
• 4 wide retire
• Deeper out of order storage
• 32 discontiguous micro-ops considered for dispatch per cycle

33% Wider Than Previous Generation

Advanced Digital Media Boost
128-bit packed Multiply
plus 128-bit packed Add
plus 128-bit packed Load
plus 128-bit packed Store
plus (how about a CMPJCC)

2x Compute Throughput / Clock

Advanced Digital Media Boost

Let's scale a vector: B[i] := A[i] * C

[Animation, repeated across the following slides: vector A is loaded, multiplied and written to vector B on an existing processor and on an Intel® Core™ microarchitecture processor with Advanced Digital Media Boost. Both are assumed to have a 128-bit path from L1 to the processor.]

Existing processor:
• Multiply can’t keep up with load bandwidth
• Load eventually stalls waiting for the multiplier
• Existing implementations eventually stall the load pipe waiting for the multiplier

Intel® Core™ microarchitecture with Advanced Digital Media Boost:
• Multiplier operates on all data and handles all the memory data
• Load pipe is free to advance
• Keeps the pipeline free for computations
• Maintains 2X throughput compared to prior implementations
• 8 Single Precision Flops/cycle
• 4 Double Precision Flops/cycle
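The flops-per-cycle figures follow from simple lane arithmetic. The sketch below assumes the peak numbers come from one 128-bit packed multiply plus one 128-bit packed add issued per cycle, which matches the operations listed on the slides but is an interpretation rather than a stated formula. The `scale_vector` helper is a hypothetical pure-Python stand-in for the B[i] := A[i] * C loop; it only counts how many 128-bit packed multiplies the vector would need.

```python
VECTOR_WIDTH_BITS = 128  # width of one packed operation, per the slides


def lanes(element_bits: int) -> int:
    """Number of elements that fit in one 128-bit packed operation."""
    return VECTOR_WIDTH_BITS // element_bits


def peak_flops_per_cycle(element_bits: int, ops_per_cycle: int = 2) -> int:
    """Lanes times packed FP ops per cycle (assumed: one multiply plus one add)."""
    return lanes(element_bits) * ops_per_cycle


def scale_vector(a, c, element_bits=32):
    """B[i] := A[i] * C, processed one 128-bit chunk at a time.

    Returns the scaled list and the number of packed multiplies used.
    """
    step = lanes(element_bits)
    b = []
    packed_multiplies = 0
    for start in range(0, len(a), step):
        chunk = a[start:start + step]
        b.extend(x * c for x in chunk)   # one packed multiply per chunk
        packed_multiplies += 1
    return b, packed_multiplies


if __name__ == "__main__":
    print("single precision:", peak_flops_per_cycle(32), "flops/cycle")
    print("double precision:", peak_flops_per_cycle(64), "flops/cycle")
    print(scale_vector([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 2.0))
```

With 32-bit elements there are four lanes per operation, so eight single-precision values need two packed multiplies in this model.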

Leading Compute Density

Smart Memory Access
• Memory Disambiguation
• Improved Prefetchers

Hiding Latency to Memory Subsystem

Smart Memory Access – Goal

[Diagram: CORE1 and CORE2, each with an L1 Data Cache, share a Smart-Shared L2 Cache connected to the System Bus.]

WHEN: Ensure data can be used as early as possible

WHERE: Ensure user of data has it as close as possible

Without Memory Disambiguation

[Diagram: instruction sequence Store1 Y, Load2 Y, Store3 W, Load4 X (oldest first) accessing memory locations Data W, Data X, Data Y and Data Z, with arrows numbered 1 to 4.]

• Load4 must WAIT until previous stores complete before it can execute

Subsequent Loads Must Wait

Smart Memory Access: With Intel’s New Memory Disambiguation

[Diagram: the same four instructions (Store1 Y, Load2 Y, Store3 W, Load4 X) and memory locations, with the numbered arrows reordered.]

• Loads can decouple from Stores
• Load4 can get its data FIRST

Solving the Problem of When

Smart Memory Access: Memory Disambiguation
• Memory Disambiguation predictor
– Loads that are predicted NOT to forward from a preceding store are allowed to schedule as early as possible, increasing the performance of OOO memory pipelines
• Disambiguated loads checked at retirement
– Extension to existing coherency mechanism
– Invisible to software and system
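The predict-then-check idea can be illustrated with a generic confidence counter. The slides do not describe the predictor's internals, so everything here is an assumption for illustration: a per-load saturating counter, a threshold before a load is allowed to schedule early, and a reset whenever the check at retirement finds that the load did conflict with an earlier store. The class and function names are hypothetical.

```python
from collections import defaultdict


class DisambiguationPredictor:
    """Hypothetical per-load saturating counter; not Intel's actual design."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.counters = defaultdict(int)  # load instruction address -> confidence

    def may_hoist(self, load_pc: int) -> bool:
        """Predict that the load does NOT forward from a preceding store."""
        return self.counters[load_pc] >= self.threshold

    def update(self, load_pc: int, conflicted: bool) -> None:
        """Train on the outcome of the check made when the load retires."""
        if conflicted:
            self.counters[load_pc] = 0
        else:
            self.counters[load_pc] = min(self.threshold, self.counters[load_pc] + 1)


def run(predictor, outcomes):
    """outcomes: iterable of (load_pc, conflicted). Returns (hoisted, replays)."""
    hoisted = replays = 0
    for load_pc, conflicted in outcomes:
        if predictor.may_hoist(load_pc):
            hoisted += 1
            if conflicted:
                replays += 1   # mis-speculation: the load must be re-executed
        predictor.update(load_pc, conflicted)
    return hoisted, replays


if __name__ == "__main__":
    p = DisambiguationPredictor(threshold=2)
    print(run(p, [(0x40, False)] * 5))            # load trains, then runs early
    print(run(p, [(0x40, True), (0x40, False)]))  # a conflict resets confidence
```

The design choice in this sketch is to be conservative: a load only runs ahead of earlier stores after several conflict-free executions, and a single conflict sends it back to waiting.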

Smart Memory Access: Prefetchers

[Diagram, repeated across the following slides: Load1 (oldest) through Load4 (youngest) reach data through the L1 Data Cache and the Shared L2 Data Cache.]

• Memory is too far away
• Caches are closer when they have the data
• Prefetchers detect applications' data reference patterns
• And bring the data closer to the data consumer

Solving the Problem of Where
Minimizing Memory Latency

Prefetchers and Multi-Core

[Block diagram of the core pipeline, shared 2M/4M L2 Cache and FSB, as on the earlier slides.]

• Three individual prefetchers per core
• Two L2 prefetchers dynamically shared

Smart Memory Access
• 8 prefetchers per two-core processor
– 2 data and 1 instruction prefetcher per core, able to handle multiple simultaneous patterns
– 2 prefetchers in the L2 cache, tracking multiple patterns per core
• Prefetchers monitor demand traffic and regulate “aggression”
• Implementation “knobs” allow platform and segment specific settings tailored to applications and usage models
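Detecting a data reference pattern can be as simple as noticing a repeated stride. The sketch below is a generic single-stream stride detector, offered only as an illustration of the idea; the slides do not say which algorithms the hardware prefetchers use, and the `degree` parameter is a hypothetical stand-in for the "aggression" knob mentioned above.

```python
class StridePrefetcher:
    """Hypothetical stride detector; real hardware prefetchers are more elaborate."""

    def __init__(self, degree: int = 2):
        self.degree = degree        # how many strides ahead to fetch
        self.last_addr = None
        self.last_stride = None

    def observe(self, addr: int):
        """Record a demand access; return the addresses to prefetch, if any."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Issue prefetches only once the same non-zero stride is seen twice.
            if stride != 0 and stride == self.last_stride:
                prefetches = [addr + stride * k for k in range(1, self.degree + 1)]
            self.last_stride = stride
        self.last_addr = addr
        return prefetches


if __name__ == "__main__":
    pf = StridePrefetcher(degree=2)
    for address in (0, 64, 128, 192, 1000):
        print(address, "->", pf.observe(address))
```

In this model the third access of a 64-byte stride triggers prefetches for the next two lines, and an access that breaks the pattern produces none.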

Data Is Where You Need It, When You Need It

Advanced Smart Cache: Multi-core Optimized

[Diagram: Core 1 and Core 2 sharing one L2 Cache.]

All the Smart Cache benefits:
• L2 can adapt to each core’s load
• Fast data sharing
• No replicated data

Plus:
• 2X BW to L1 caches

Shared & Multi-Core Optimized, with 2x Bandwidth

Advanced Smart Cache: Dynamic Cache Allocation

[Diagram: Independent Cache (today), where Core1 and Core2 each have their own L2 Cache, versus Advanced Smart Cache, where Core1 and Core2 share one L2 Cache.]

Shared Cache adapts to mismatched loads. Independent Cache can thrash a heavy app even when the other cache is under-utilized.
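The thrashing effect can be reproduced with a toy cache model. This sketch assumes fully associative LRU caches measured in lines, a heavy app on core 1 that loops over 6 lines, a light app on core 2 that loops over 2 lines, and either two independent 4-line caches or one shared 8-line cache. These sizes and names are invented for illustration and are not taken from the slides.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative LRU cache that only counts misses (toy model)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, tag) -> None:
        if tag in self.lines:
            self.lines.move_to_end(tag)      # mark as most recently used
            return
        self.misses += 1
        self.lines[tag] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # evict least recently used


def simulate(shared: bool, rounds: int = 10) -> int:
    """Total L2 misses for a heavy app on core1 and a light app on core2."""
    heavy = [("core1", i) for i in range(6)]   # working set: 6 lines
    light = [("core2", i) for i in range(2)]   # working set: 2 lines
    if shared:
        l2 = LRUCache(8)
        caches = {"core1": l2, "core2": l2}
    else:
        caches = {"core1": LRUCache(4), "core2": LRUCache(4)}
    for _ in range(rounds):
        for tag in heavy + light:
            caches[tag[0]].access(tag)
    distinct = {id(c): c for c in caches.values()}   # count a shared cache once
    return sum(c.misses for c in distinct.values())


if __name__ == "__main__":
    print("independent caches:", simulate(shared=False), "misses")
    print("shared cache:      ", simulate(shared=True), "misses")
```

With the same total capacity, the independent configuration misses on every access of the heavy app while half of the other cache sits idle; the shared configuration only takes the initial cold misses.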

Advanced Smart Cache: Efficient Data Sharing

[Diagram: with Advanced Smart Cache, Core1 and Core2 share one L2 Cache in front of the FSB and main memory; with Independent Cache, each core has its own L2 Cache in front of the FSB and main memory.]

2X L2 to L1 Bandwidth

Intelligent Power Capability

Extending the power management architecture
• Intel® Pentium® M processor innovated a new power management architecture
• Intel® Core™ Duo extended the Pentium® M processor capability to multi-core

New Power Features within each processor core
• Ultra fine-grained power control
• Split Busses
• Platformization of Power Management Architecture

Enhancing Energy Efficiency

Intelligent Power Capability: Ultra Fine Grained Power Control

[Block diagram of the core pipeline, as on the earlier slides.]

Even during periods of high performance execution, many parts of the chip core can be shut off. An example could be a SW memory initialization executing from the front end with the IQ operating as a loop cache.

Intelligent Power Capability: Split Busses (core power feature)

Many buses are sized for worst-case data
(x86 instruction of 15 bytes)
(ALU can write-back 128 bits)

By splitting buses to deal with varying data widths, we can gain the performance benefit of bus width while maintaining Cdynamic closer to thinner buses

Improved Energy Efficiency

Platformization of Power Management Architecture
• Integrating best features from Server and Mobile products
• Exposing more to the system
– PSI-2: Power Status Indicator (Mobile)
– DTS: Digital Thermal Sensors
– PECI: Platform Environment Control Interface

Power Status Indicator (Mobile)
• Processor communicates power consumption to external platform components
– Optimization of voltage regulator efficiency
– Load line and power delivery efficiency

[Diagram: the processor signals PSI-2 / VID to the voltage regulator (VR).]

Enabling Efficient Processor and Platform Thermal Control…

DTS – Digital Thermal Sensor
• Several thermal sensors are located within the Processor to cover all possible hot spots
• Dedicated logic scans the thermal sensors and measures the maximum temperature on the die at any given time
• Accurately reporting Processor temperature enables advanced thermal control schemes

[Diagram: Core 1 and Core 2 each feed DTS Logic through an LPF; the DTS logic connects to DTS control and status.]

Platform Environment Control Interface (PECI)
• Processor provides its temperature reading over a multi-drop single-wire bus, allowing efficient platform thermal control

[Diagram: PROC #1, PROC #2 and PROC #3 connect over PECI to a Manager block, alongside the Processor Fan, Auxiliary Fan, Chassis Fan 1 and Chassis Fan 2.]

Microarchitecture Feature Summary

• Wide Dynamic Execution: 33% wider pipes (4 vs. 3) and greater efficiency
• Advanced Digital Media Boost: 2x compute throughput / clock
• Smart Memory Access: minimizing latency – data where and when needed
• Advanced Smart Cache: multi-core optimized, shared, with 2x bandwidth
• Intelligent Power Capability: improved energy efficient performance

New, State-of-the-Art, Microarchitecture

Intel® Core™ Microarchitecture and Software
• Software consistency across application space
– Wide Dynamic Execution will provide generic performance gains
– Smart Memory Access targets memory intensive apps
– Advanced Digital Media Boost provides a leap in capability for media and floating point apps
– Multi-Core and Advanced Smart Cache further improve the growing number of multi-threaded applications
• Software consistency across market segments
– New apps and optimizations can target a single microarchitecture

Immediate Performance Increase Across Applications and Segments

Summary

Continuing to drive aggressive multi-core ramp
– Dual-core ramp in 2006, quad-core starts in early 2007

Intel® Core™ microarchitecture delivers leading performance and performance/watt
– Conroe: >40% performance increase (1) / >40% less power
– Woodcrest: >80% performance increase (1) / >35% less power
– Mobile: extending leadership delivered with Intel® Core™ Duo with >20% performance increase (1)

On track for product introductions starting in Q3’06
– Based upon new Intel® Core™ microarchitecture

(1) Estimated SPECint* rate