Intel® Coretm Microarchitecture

IntelIntel® CoreCoreTM MicroarchitectureMicroarchitecture

Steve Pawlowski Intel Senior Fellow CTO, GM Architecture & Planning Digital Enterprise Group

Ofri Wechsler Intel Fellow Director, CPU Architecture Mobility Group ContinuingContinuing FromFrom LastLast FallFall’’ss IDFIDF……

Let’s Take A Look Inside IntelIntel® CoreCore™™ MicroarchitectureMicroarchitecture The Energy Efficient Performance Leader

Platform – Level Energy Efficient Platform Performance Capabilities

Energy 3X Woodcrest H2 ‘06 Performance Consumption

2X Perf Dempsey MV H1 ‘06 Paxville DP H2 ‘05 Energy 1X Irwindale H1 ‘05 Core™ Microarchitecture: Higher Performance AND Lower Power DP Performance Per Watt Comparison with SPECint_rate at the Platform Level

Source: Intel® HistoryHistory XEON® 2004 PENTIUM® M 2003 Microarchitecture XEONTM 2002 and EM64T ITANIUM® Architecture 2001 Power Management PENTIUM® 4 Hyper Micro Fusion 2000 Threading PENTIUM® III NT 1999 EPIC IE FIC E EF NC R A ® E RM PENTIUM Pro NetBurstTM OW FO 1995 P ER SSE 2 P i486TM SSE PENTIUM® 1989 1993 Integrated CE Out Of Order N FPU MA OR Pipelining Register Renaming RF PE Speculative Branch Prediction Execution Superscalar

*Graphics not representative of actual die photo or relative size 5 HistoricalHistorical DrivingDriving ForcesForces

Increased Performance Shrinking Geometry via Increased Frequency

100000 10

10000

1 1000 Feature Frequency Size (MHz) 100 (um) 0.1 10

1 0.01 1970 1980 1990 2000 2010 2020 1970 1980 19901990 20002000 20102010 2020

1946 1971 2005 20 Numbers I4004 Processor 65nm in Main Memory 2300 Transistors 1B+ Transistors TheThe ChallengesChallenges

Power Limitations Diminishing Voltage Scaling

1000 10 0.0.7um7um 0.0.5um5um 0.0.35um35um 0.0.25um25um 0.0.18um18um ~3 0 % 0.0.13um13um CPU Supply 90nm90nm 100 1 65nm65nm Power Voltage 45nm45nm (W) (V) 30nm30nm

10 0.1 1990 1995 2000 2005 2010 2015 1990 1993 1 997 2001 2005 2 00 9

Power = Capacitance x Voltage2 x Frequency also Power ~ Voltage3 EnergyEnergy EfficientEfficient PerformancePerformance –– HighHigh EndEnd

DATACENTER “ENERGY LABEL”

NASA Columbia 2 MWatt 60 TFlops goal 10,240 cpus – Itanium II $50M Source: NASA 30,720 Flops/Watt 1,288 Flops/Dollar Computational Efficiency 17,066 Flops/Watt 467 Flops/Dollar ASC Purple 6 MWatt 100 TFlops goal 12K+ cpus – Power5 $230M

Source: LLNL AA NewNew EraEra…… THE NEW

Performance Equals IPC THE OLD Multi-Core Power Efficiency Microarchitecture Performance Advancements Equals Frequency Unconstrained Power

Voltage Scaling IntelIntel® CoreCore™™ MicroarchitectureMicroarchitecture

Low Power High Performance Scalable

Woodcrest Intel® Wide Dynamic Execution Server Optimized Intel® Intelligent Power Capability Conroe Intel® Intel Desktop Advanced Smart Cache Optimized 65nm Intel® Smart Memory Access Merom Intel® Mobile Advanced Optimized Digital Media Boost

*Graphics not representative of actual die photo or relative size IntelIntel® WideWide DynamicDynamic ExecutionExecution

EACH CORE CORE 1 CORE 2

EFFICIENT 14 STAGE INSTRUCTION FETCH INSTRUCTION FETCH PIPELINE AND PRE-DECODE AND PRE-DECODE

DEEPER BUFFERS INSTRUCTION QUEUE INSTRUCTION QUEUE

4 WIDE - DECODE TO DECODE DECODE EXECUTE

4 WIDE - MICRO-OP RENAME / ALLOC RENAME / ALLOC EXECUTE

MICRO RETIREMENT UNIT RETIREMENT UNIT and (REORDER BUFFER) (REORDER BUFFER) MACRO FUSION

ENHANCED SCHEDULERS SCHEDULERS ALUs

EXECUTE EXECUTE

Perf • 33% Wider Execution over Previous Gen ADVANTAGE • Comprehensive Advancements Energy • Enabled In Each Core IntelIntel® WideWide DynamicDynamic ExecutionExecution Micro and Macro Fusion

Micro Macro MACRO FUSION EXAMPLE Fusion Fusion CMP+JMP IN 1 CLOCK

WITH MACRO FUSION WITHOUT MACRO FUSION

INSTRUCTION 3 INSTRUCTION 3 uCODE DECODE INSTRUCTION 2 INSTRUCTION 2 ROM INSTRUCTION 1 INSTRUCTION 1 DECODE DECODE COMBINED INST 2 & 3 INTERNAL INST 3 INTERNAL INST 1 INTERNAL INST 2 EXECUTE EXECUTE INTERNAL INST 1 COMPLETED INST 3 EXECUTE COMPLETED INST 2 COMPLETED INST 3 COMPLETED INST 1 COMPLETED INST 2 COMPLETED INST 1

Perf • Instruction Load Reduced ~ 15%** ADVANTAGE ** Energy • Micro-Ops Reduced ~ 10%

*Graphics not representative of actual die photo or relative size ** Workload dependant IntelIntel® IntelligentIntelligent PowerPower CapabilityCapability

Ultra Coarse Ultra Process Grained Fine Transistor Grained Grained

• 65nm • Aggressive • Low VCC Arrays • Low Leakage • Strained Silicon Clock Gating • Blocks Controlled Transistors • Low-K Dielectric • Enhanced Via Sleep • Sleep • More Metal Layers Speed-Step Transistors Transistors

• Mobile-Level Power Management Energy ADVANTAGE • Energy Efficient Performance

*Graphics not representative of actual die photo or relative size IntelIntel® AdvancedAdvanced SmartSmart CacheCache Dynamic L2 Cache Usage Core™ Microarchitecture Core™ Microarchitecture Independent L2 Shared L2 Decreased Increased Traffic Traffic

Dynamically, Not Bi-Directionally Shareable Available x

L1 L1 L1 L1 CACHE CACHE CACHE CACHE

CORE 1 CORE 2 CORE 1 CORE 2

Perf • Higher Cache Hit Rate ADVANTAGE • Reduced BUS Traffic Energy • Lower Latency to Data

*Graphics not rrepresentativeepresentative of actuaactuall ddieie photo oorr relative ssizize IntelIntel® SmartSmart MemoryMemory AccessAccess Hardware-based Memory Disambiguation Core™ Microarchitecture Other

INST 2 “LOAD [Y]” INST 2 “LOAD [Y]” IN INST 1 “STORE [X]” ORDER INST 1 “STORE [X]” DECODE/SCHEDULE DECODE/SCHEDULE

INST 2 “LOAD [Y]” INST 2 “LOAD [Y]”

INST 1 “STORE [X]” INST 1 “STORE [X]” OUT HARDWARE OF ORDER Mem. Dis. INST 2 “LOAD [Y]” INST 2 “LOAD [Y]” Predictor EXECUTE STALL

INST 1 “STORE [X]” Inst. 2 Must EXECUTE Wait For Inst. 2 “Load” Inst. 1 “Store” Can Occur INST 1 “STORE [X]” To Complete Before Inst. 1 “Store”

Perf • Higher Utilization of Pipeline ADVANTAGE • Masks latency to data access Energy • Higher Performance IntelIntel® AdvancedAdvanced DigitalDigital MediaMedia BoostBoost Single Cycle SSE In Each Core SSE Operation (SSE/SSE2/SSE3) SOURCE 127 0 Fusion Single 127 0 Support Cycle X4 X3 X2 X1 SSE SSE/2/3 OP

Y4 Y3 Y2 Y1 DECODE DECODE DEST

Core™ μarch CLOCK X4opY4 X3opY3 X2opY2 X1opY1 CYCLE 1

EXECUTE EXECUTE CLOCK Previous X2opY2 X1opY1 CYCLE 1

CLOCK X4opY4 X3opY3 CYCLE 2

Perf • Increased Performance ADVANTAGE • 128 bit Single Cycle in each core Energy • Improved Energy Efficiency

*Graphics not representative of actual die photo or relative size NextNext GenerationGeneration PlatformsPlatforms

Energy Efficient Performance

Intel® Core™ Microarchitecture • 80W Target TDP Intel® Wide • 40W LV Target TDP Dynamic Server • 2 Execution Cores Bensley Execution Optimized Bensley • 4MB L2 Cache • Server Platform *Ts Intel® • DP Configurations Intelligent Power Capability ® Intel® Desktop Averill • 65W Mainstream TDP Advanced Optimized • 2 Execution Cores Smart Cache Bridge • 2MB & 4MB L2 Cache Creek • Desktop Platform *Ts Intel® Smart Memory Access • Mobile TDP ® • 2 Execution Cores Intel Mobile Napa Optimized • 2MB & 4MB L2 Cache Advanced refresh Digital Media • Mobile Power Boost Optimizations • Mobile Platform *Ts ComprehensiveComprehensive PlatformPlatform ArchitectureArchitecture Memory Controller Considerations

Business ++ Technical

Rapid Adoption Of RELIABILITY Technology Transitions

PERFORMANCE 100% PERFORMANCE SDRAM 80% DDR DDR2 DDR3 POWER MANAGEMENT 60% FPM COHERENCY LATENCY 40% FBD EDO REDUCTION 20%

0% MEMORY ACCESS REDUCTION 1996 2000 2004 2008 2012

PROCESS SCALING RATES Market Segments Require Different Technologies PLATFORM-LEVEL MOBILE DESKTOP SERVER ENHANCEMENTS ComprehensiveComprehensive PlatformPlatform ArchitectureArchitecture MEMORY CONTROLLER CONSIDERATIONS

Memory Controller in Chipset Technical

Extended RAS RELIABILITY for Servers Memory Closer to Integrated GFX, PERFORMANCE Smart Cache and Smart Memory MEMORY ACCESS REDUCTION Access

Centralized COHERENCY LATENCY Coherency Control REDUCTION

CPU Power States POWER MANAGEMENT w/ Memory Alive

MC vs Processor PROCESS SCALING RATES Device Scaling

IO Acceleration PLATFORM-LEVEL Technology ENHANCEMENTS DPDP ServerServer ArchitectureArchitecture

Platform Performance: FSB Scaling Bensley Platform It’s all about 800MHz Bandwidth & 1067MHz Latency 1333MHz Latency

Large Point to Point Shared Caches Interconnect 64 GB M A B M A B M A B M A Central Coherency B Easy Capacity Resolution Expansion M A B M A B M A B M A B

17 GB/s M A B M A AM B M A B Local and Remote B Sustained & Memory Latencies Blackford Balanced M A B M A B M A B M A Consistent B Throughput

Perf CONSTANTLY ANALYZING THE REQUIREMENTS, Energy THE TECHNOLOGIES, AND THE TRADEOFFS

*Graphics not rrepresentativeepresentative of actuaactuall ddieie photo oorr relative ssizize CoreCore™™ MicroarchitectureMicroarchitecture AdvancesAdvances WithWith QuadQuad CoreCore

Energy Efficient Performance

Quad Core Clovertown H1 ‘07

Clovertown 3X Woodcrest H2 ‘06

Server 2X Dempsey MV H1 ‘06 Paxville DP H2 ‘05 Kentsfield 1X Irwindale H1 ‘05

Desktop

DP Performance Per Watt Comparison with SPECint_rate at the Platform Level Source: Intel®

*Graphics not representative of actual die photo or relative size DevelopingDeveloping forfor thethe IntelIntel® CoreCore™™ MicroarchitectureMicroarchitecture

Program For Multicore

ExploitExploit EnergyEnergy EfficientEfficient PerformancePerformance

Benefit From Utilize SSE Power Efficiency Enhancements FurtherFurther InsightInsight IntoInto IntelIntel® CoreCoreTM MicroarchitectureMicroarchitecture y CoreTM Microarchitecture White Paper at rear of auditorium y InteIntell®'s CoreTM Microarchitecture (Session, MATS001) y InteIntell® Multi-Core Architecture and Implementations (Session, MATS002) y Multi-Core and CoreTM Microarchitecture (Chalk Talk, MATC005) y Shop Talk wwithith Intel® Fellows tomorrow morning 8-9 y Technology Showcase y Check out www.intel.com/multi-core