Intel® Core™ Microarchitecture
Steve Pawlowski, Intel Senior Fellow; CTO, GM Architecture & Planning, Digital Enterprise Group
Ofri Wechsler, Intel Fellow; Director, CPU Architecture, Mobility Group

Continuing From Last Fall’s IDF…
Let’s Take A Look Inside Intel® Core™ Microarchitecture: The Energy Efficient Performance Leader
Platform-Level Energy Efficient Performance

Platform Capabilities: Performance, Energy Consumption

[Chart: DP Performance Per Watt Comparison with SPECint_rate at the Platform Level. Vertical axis marked 1X, 2X, 3X. Platforms plotted: Irwindale (H1 '05), Paxville DP (H2 '05), Dempsey MV (H1 '06), Woodcrest (H2 '06).]

Core™ Microarchitecture: Higher Performance AND Lower Power
Source: Intel®

History

• 1989: i486™
• 1993: Pentium®
• 1995: Pentium® Pro
• 1999: Pentium® III
• 2000: Pentium® 4
• 2001: Itanium®
• 2002: Xeon™
• 2003: Pentium® M
• 2004: Xeon®

[Timeline labels, from the PERFORMANCE era to the POWER EFFICIENT PERFORMANCE era: Integrated FPU, Pipelining, Superscalar, Branch Prediction, Out Of Order, Register Renaming, Speculative Execution, SSE, SSE 2, NetBurst™ Microarchitecture, EPIC Architecture, Hyper Threading, Micro Fusion, Power Management, EM64T]
*Graphics not representative of actual die photo or relative size

Historical Driving Forces
[Chart: Increased Performance via Increased Frequency. Frequency (MHz), log scale from 1 to 100000, 1970 to 2020.]
[Chart: Shrinking Geometry. Feature Size (um), log scale from 0.01 to 10, 1970 to 2020.]
• 1946: 20 Numbers in Main Memory
• 1971: 4004 Processor, 2300 Transistors
• 2005: 65nm, 1B+ Transistors

The Challenges
[Chart: Power Limitations. CPU Power (W), log scale from 10 to 1000, 1990 to 2015.]
[Chart: Diminishing Voltage Scaling. Supply Voltage (V), log scale from 0.1 to 10, 1990 to 2009, annotated "~30%".]
[Process generations marked: 0.7um, 0.5um, 0.35um, 0.25um, 0.18um, 0.13um, 90nm, 65nm, 45nm, 30nm]
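This slide's relation, Power = Capacitance x Voltage² x Frequency, can be tried numerically. The following is a minimal sketch, not Intel data: the capacitance, voltage, and frequency values are hypothetical and chosen only for illustration. It shows why power scales roughly with the cube of voltage when frequency is lowered together with voltage.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz):
    """Dynamic power in watts, using P = C * V^2 * f."""
    return capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical baseline operating point (illustrative values only).
C, V, F = 1e-9, 1.2, 3.0e9
base = dynamic_power(C, V, F)

# Lower the voltage by 15% at the same frequency: power follows V^2.
voltage_only = dynamic_power(C, V * 0.85, F) / base

# Lower frequency by the same 15% as well: power follows roughly V^3.
voltage_and_frequency = dynamic_power(C, V * 0.85, F * 0.85) / base

print(f"voltage only:          {voltage_only:.3f} of baseline power")
print(f"voltage and frequency: {voltage_and_frequency:.3f} of baseline power")
```

Under these assumptions a 15% voltage reduction alone leaves about 72% of the baseline power, and scaling frequency with it leaves about 61%.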
Power = Capacitance x Voltage² x Frequency; also Power ~ Voltage³

Energy Efficient Performance – High End
DATACENTER “ENERGY LABEL”
NASA Columbia (Source: NASA)
• 2 MWatt
• 60 TFlops goal
• 10,240 cpus – Itanium II
• $50M
• Computational Efficiency: 30,720 Flops/Watt, 1,288 Flops/Dollar

ASC Purple
• 6 MWatt
• 100 TFlops goal
• 12K+ cpus – Power5
• $230M
• Computational Efficiency: 17,066 Flops/Watt, 467 Flops/Dollar
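The per-watt figures on this slide can be reproduced from the stated power and performance goals. The sketch below rests on two assumptions that the slide does not state: 1 TFlops is taken as 1024 GFlops, and the result is expressed in thousands of Flops per watt, truncated to a whole number. The per-dollar figures are not recomputed here.

```python
# Power and performance goals as stated on the slide.
systems = {
    "NASA Columbia": {"tflops": 60, "megawatts": 2},
    "ASC Purple": {"tflops": 100, "megawatts": 6},
}


def kflops_per_watt(tflops, megawatts):
    """Thousands of Flops per watt, truncated to an integer.

    Assumption: 1 TFlops = 1024 GFlops (not stated on the slide).
    """
    flops = tflops * 1024 * 10**9
    watts = megawatts * 10**6
    return flops // (watts * 1000)


efficiency = {name: kflops_per_watt(**s) for name, s in systems.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:,} KFlops/Watt")
```

With these assumptions the script prints 30,720 for NASA Columbia and 17,066 for ASC Purple, matching the slide's per-watt numbers.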
Source: LLNL

A New Era…

THE OLD
• Performance Equals Frequency
• Unconstrained Power
• Voltage Scaling

THE NEW
• Performance Equals IPC
• Multi-Core
• Power Efficiency
• Microarchitecture Advancements

Intel® Core™ Microarchitecture
Low Power, High Performance, Scalable

65nm
• Woodcrest: Server Optimized
• Conroe: Desktop Optimized
• Merom: Mobile Optimized

• Intel® Wide Dynamic Execution
• Intel® Intelligent Power Capability
• Intel® Advanced Smart Cache
• Intel® Smart Memory Access
• Intel® Advanced Digital Media Boost
Intel® Wide Dynamic Execution
EACH CORE
• Efficient 14 Stage Pipeline
• Deeper Buffers
• 4 Wide - Decode to Execute
• 4 Wide - Micro-Op Execute
• Micro and Macro Fusion
• Enhanced ALUs

[Diagram: CORE 1 and CORE 2, each with Instruction Fetch and Pre-Decode, Instruction Queue, Decode, Rename / Alloc, Retirement Unit (Reorder Buffer), Schedulers, Execute]

ADVANTAGE (Perf, Energy)
• 33% Wider Execution over Previous Gen
• Comprehensive Advancements
• Enabled In Each Core

Intel® Wide Dynamic Execution: Micro and Macro Fusion
Micro Fusion, Macro Fusion

MACRO FUSION EXAMPLE: CMP+JMP IN 1 CLOCK

• WITHOUT MACRO FUSION: Instructions 1, 2, and 3 decode to Internal Inst 1, Internal Inst 2, and Internal Inst 3. Each executes and completes separately.
• WITH MACRO FUSION: Instruction 1 decodes to Internal Inst 1, while Instructions 2 and 3 decode to one Combined Inst 2 & 3. Execution completes Inst 1, Inst 2, and Inst 3.

[Diagram also shows a uCODE ROM beside the decoders]

ADVANTAGE (Perf, Energy)
• Instruction Load Reduced ~15%**
• Micro-Ops Reduced ~10%**
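The macro fusion example can be illustrated with a toy decoder pass. This is a minimal sketch, not a description of the hardware: the mnemonics and the set of fusible pairs are hypothetical, and the slide's "JMP" is modeled here as a conditional jump that directly follows a compare.

```python
# Hypothetical mnemonics; which pairs fuse is an assumption for illustration.
COMPARES = {"cmp", "test"}
COND_JUMPS = {"je", "jne", "jl", "jge"}


def fuse(instructions):
    """Merge each compare that is directly followed by a conditional jump."""
    fused = []
    i = 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if cur in COMPARES and nxt in COND_JUMPS:
            fused.append(f"{cur}+{nxt}")  # one internal operation for the pair
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused


stream = ["load", "add", "cmp", "jne", "store", "test", "je", "mov"]
result = fuse(stream)
print(result)
print(f"{len(stream)} instructions -> {len(result)} internal operations")
```

In this made-up stream, eight instructions become six internal operations. The reduction on real code is workload dependent, as the slide notes.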
** Workload dependent

Intel® Intelligent Power Capability
Process
• 65nm
• Strained Silicon
• Low-K Dielectric
• More Metal Layers

Coarse Grained
• Aggressive Clock Gating
• Enhanced Speed-Step

Ultra Fine Grained
• Low VCC Arrays
• Blocks Controlled Via Sleep Transistors

Transistor
• Low Leakage Transistors
• Sleep Transistors

ADVANTAGE (Energy)
• Mobile-Level Power Management
• Energy Efficient Performance
Intel® Advanced Smart Cache

Dynamic L2 Cache Usage

• Core™ Microarchitecture, Shared L2: Decreased Traffic; Dynamically, Bi-Directionally Available
• Independent L2: Increased Traffic; Not Shareable

[Diagram: in both designs, CORE 1 and CORE 2 each have their own L1 cache]

ADVANTAGE (Perf, Energy)
• Higher Cache Hit Rate
• Reduced BUS Traffic
• Lower Latency to Data
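The hit-rate claim for a dynamically shared L2 can be illustrated with a toy model. This is a sketch under stated assumptions, not the actual cache design: it uses a simple LRU policy, a hypothetical capacity of 8 lines in total, and a made-up workload in which one core needs more cache than the other.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = 0
        self.accesses = 0

    def access(self, key):
        self.accesses += 1
        if key in self.lines:
            self.hits += 1
            self.lines.move_to_end(key)
        else:
            self.lines[key] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used


def hit_rate(trace, cache_for_core):
    """Replay the trace; cache_for_core maps core id to a cache object."""
    for core, addr in trace:
        cache_for_core[core].access((core, addr))
    distinct = set(cache_for_core.values())
    return sum(c.hits for c in distinct) / sum(c.accesses for c in distinct)


# Hypothetical workload: core 0 loops over 6 lines, core 1 over 2 lines.
trace = []
for step in range(600):
    trace.append((0, step % 6))
    trace.append((1, step % 2))

split_rate = hit_rate(trace, {0: LRUCache(4), 1: LRUCache(4)})
shared = LRUCache(8)
shared_rate = hit_rate(trace, {0: shared, 1: shared})

print(f"independent 4+4 lines: {split_rate:.3f}")
print(f"shared 8 lines:        {shared_rate:.3f}")
```

In this workload the independent caches leave core 1's spare capacity unused while core 0 thrashes. The shared cache holds both working sets, so the combined hit rate is higher.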
Intel® Smart Memory Access

Hardware-based Memory Disambiguation

In both columns of the diagram, INST 1 “STORE [X]” and INST 2 “LOAD [Y]” are decoded and scheduled in order.

• Core™ Microarchitecture: with the hardware memory disambiguation predictor, execution proceeds out of order. The Inst. 2 “Load” can occur before the Inst. 1 “Store”.
• Other: execution stalls. Inst. 2 must wait for the Inst. 1 “Store” to complete.
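The store/load pair in this diagram can be sketched in software. This is a minimal illustration and not the hardware mechanism: the function name, the stored value, and the replay-on-conflict behavior are assumptions, since the slide does not say what happens when the prediction is wrong.

```python
def run_pair(store_addr, load_addr, predict_no_alias, memory):
    """Model INST 1 'STORE [X]' followed by INST 2 'LOAD [Y]'.

    Returns (value_loaded, outcome). The replay step is an assumption.
    """
    store_value = 99  # hypothetical value written by the store

    if predict_no_alias:
        # Speculatively run the load ahead of the older store.
        speculative = memory.get(load_addr, 0)
        memory[store_addr] = store_value
        if store_addr == load_addr:
            # The addresses did alias: discard the early result and reload.
            return memory[load_addr], "replayed"
        return speculative, "load ran early"

    # No prediction: the load waits until the store has completed.
    memory[store_addr] = store_value
    return memory.get(load_addr, 0), "load waited"


early = run_pair(0x10, 0x20, True, {0x10: 1, 0x20: 2})
conflict = run_pair(0x10, 0x10, True, {0x10: 1})
stalled = run_pair(0x10, 0x20, False, {0x10: 1, 0x20: 2})
print(early, conflict, stalled)
```

All three cases return the value that in-order execution would produce. The difference is only whether the load had to wait for the store.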
ADVANTAGE (Perf, Energy)
• Higher Utilization of Pipeline
• Masks latency to data access
• Higher Performance

Intel® Advanced Digital Media Boost

Single Cycle SSE
• In Each Core
• Single Cycle SSE
• Fusion Support

SSE Operation (SSE/SSE2/SSE3)
• SOURCE (bits 127 to 0): X4 X3 X2 X1
• DEST (bits 127 to 0): Y4 Y3 Y2 Y1
• Core™ μarch: CLOCK CYCLE 1 produces X4opY4, X3opY3, X2opY2, X1opY1
• Previous: CLOCK CYCLE 1 produces X2opY2, X1opY1; CLOCK CYCLE 2 produces X4opY4, X3opY3

ADVANTAGE (Perf, Energy)
• Increased Performance
• 128 bit Single Cycle in each core
• Improved Energy Efficiency
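The one-cycle versus two-cycle behavior in the SSE diagram can be modeled with a few lines of code. This is a sketch under assumptions: four 32-bit elements per 128-bit operand, and a 64-bit execution width for the previous design, which is inferred from the diagram's two results per cycle.

```python
import operator


def execute_packed(xs, ys, op, datapath_bits, lane_bits=32):
    """Return the per-cycle results of a packed operation.

    Each inner list holds the element results produced in one clock cycle.
    """
    lanes_per_cycle = datapath_bits // lane_bits
    cycles = []
    for start in range(0, len(xs), lanes_per_cycle):
        end = start + lanes_per_cycle
        cycles.append([op(x, y) for x, y in zip(xs[start:end], ys[start:end])])
    return cycles


xs = [1, 2, 3, 4]      # X1..X4 (illustrative values)
ys = [10, 20, 30, 40]  # Y1..Y4 (illustrative values)

core = execute_packed(xs, ys, operator.add, datapath_bits=128)
previous = execute_packed(xs, ys, operator.add, datapath_bits=64)

print(f"128-bit datapath: {len(core)} cycle(s) -> {core}")
print(f"64-bit datapath:  {len(previous)} cycle(s) -> {previous}")
```

The wider datapath finishes all four element operations in one modeled cycle, while the narrower one needs two, with the low half first.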
Next Generation Platforms

Energy Efficient Performance

Intel® Core™ Microarchitecture
• Intel® Wide Dynamic Execution
• Intel® Intelligent Power Capability
• Intel® Advanced Smart Cache
• Intel® Smart Memory Access
• Intel® Advanced Digital Media Boost

Server Optimized (Bensley)
• 80W Target TDP
• 40W LV Target TDP
• 2 Execution Cores
• 4MB L2 Cache
• Server Platform *Ts
• DP Configurations

Desktop Optimized (Averill, Bridge Creek)
• 65W Mainstream TDP
• 2 Execution Cores
• 2MB & 4MB L2 Cache
• Desktop Platform *Ts

Mobile Optimized (Napa refresh)
• Mobile TDP
• 2 Execution Cores
• 2MB & 4MB L2 Cache
• Mobile Power Optimizations
• Mobile Platform *Ts

Comprehensive Platform Architecture: Memory Controller Considerations
Business + Technical

Business
• Rapid Adoption Of Technology Transitions
• Market Segments Require Different Technologies: MOBILE, DESKTOP, SERVER

[Chart: share of memory technologies, 0% to 100%, 1996 to 2012: FPM, EDO, SDRAM, DDR, DDR2, DDR3, FBD]

Technical
• RELIABILITY
• PERFORMANCE
• MEMORY ACCESS REDUCTION
• COHERENCY LATENCY REDUCTION
• POWER MANAGEMENT
• PROCESS SCALING RATES
• PLATFORM-LEVEL ENHANCEMENTS

Comprehensive Platform Architecture: MEMORY CONTROLLER CONSIDERATIONS
Memory Controller in Chipset

Technical
• RELIABILITY: Extended RAS for Servers
• PERFORMANCE: Memory Closer to Integrated GFX
• MEMORY ACCESS REDUCTION: Smart Cache and Smart Memory Access
• COHERENCY LATENCY REDUCTION: Centralized Coherency Control
• POWER MANAGEMENT: CPU Power States w/ Memory Alive
• PROCESS SCALING RATES: MC vs Processor Device Scaling
• PLATFORM-LEVEL ENHANCEMENTS: IO Acceleration Technology

DP Server Architecture
Bensley Platform

Platform Performance: It’s all about Bandwidth & Latency

• FSB Scaling: 800MHz, 1067MHz, 1333MHz
• Large Shared Caches
• Point to Point Interconnect
• Central Coherency Resolution
• 64 GB: Easy Capacity Expansion
• 17 GB/s: Sustained & Balanced Throughput
• Local and Remote Memory Latencies Consistent

[Diagram: Blackford chipset connected to memory modules labeled AMB]

CONSTANTLY ANALYZING THE REQUIREMENTS, THE TECHNOLOGIES, AND THE TRADEOFFS (Perf, Energy)
Core™ Microarchitecture Advances With Quad Core

Energy Efficient Performance

• Server: Clovertown
• Desktop: Kentsfield

[Chart: DP Performance Per Watt Comparison with SPECint_rate at the Platform Level. Vertical axis marked 1X, 2X, 3X. Platforms plotted: Irwindale (H1 '05), Paxville DP (H2 '05), Dempsey MV (H1 '06), Woodcrest (H2 '06), Quad Core Clovertown (H1 '07).]

Source: Intel®
Developing for the Intel® Core™ Microarchitecture
Program For Multicore
Exploit Energy Efficient Performance

Benefit From Power Efficiency
Utilize SSE Enhancements

Further Insight Into Intel® Core™ Microarchitecture
• Core™ Microarchitecture White Paper at rear of auditorium
• Intel®'s Core™ Microarchitecture (Session, MATS001)
• Intel® Multi-Core Architecture and Implementations (Session, MATS002)
• Multi-Core and Core™ Microarchitecture (Chalk Talk, MATC005)
• Shop Talk with Intel® Fellows tomorrow morning 8-9
• Technology Showcase
• Check out www.intel.com/multi-core