Multi-Core Microprocessor Chips: Motivation & Challenges
Dileep Bhandarkar, Ph.D.
Architect at Large
Digital Enterprise Group
Intel Corporation
May 2006
Copyright © 2006 Intel Corporation. 2006 Intel Distinguished Lecture

Agenda
• Semiconductor Technology Evolution
• Design Challenges
• Why Multi-Core Processor Chips?
• Power/Performance Trade-Offs
• CMP Directions
• Beyond CMP
• Summary
©2006, Intel Corporation. Intel, the Intel logo, Pentium, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others.
www.intel.com/education 2006 Intel Distinguished Lecture

Intel only: On-time "2-year-cycle"

                    180nm      130nm      90nm        65nm        45nm
Wafer Size (mm):    200        200/300    300         300         300
1st Production:     1999       2001       2003        2005        2007
Transistors:        100nm LG   70nm LG    50nm LG     35nm LG     Details Coming!
                    CoSi2      CoSi2      NiSi        NiSi
                                          SiGe        SiGe
                                          Strain Si   Strain Si
Interconnects:      6 Al       6 Cu       7 Cu        8 Cu
                    SiOF       SiOF       Low-k       Low-k

45 nm Logic Process on Track for Delivery in 2007

Process Name        P1262      P1264      P1266       P1268
Lithography         90 nm      65 nm      45 nm       32 nm
1st Production      2003       2005       2007        2009

Moore's Law continues! Intel continues to develop a new technology generation every 2 years.
Intel 11th EMEA Academic Forum
Historical Driving Forces
[Figure, left panel: "Increased Performance via Increased Frequency". Frequency (MHz), log scale from 1 to 100000, plotted against year, 1970 to 2020.]
[Figure, right panel: "Shrinking Geometry". Feature size (um), log scale from 0.01 to 10, plotted against year, 1970 to 2020.]
1971: 4004 Processor, 2300 transistors
1978: 8086 Processor, IBM PC
1985: i386 Processor, 32-bit
1993: Pentium Processor, 3.1M transistors
2005: Montecito, 1.7B transistors

The Challenges
Power Limitations
[Figure: CPU power (W), log scale from 10 to 1000, plotted against year, 1990 to 2015.]

Diminishing Voltage Scaling
[Figure: supply voltage (V), log scale from 0.1 to 10, plotted against year, 1990 to 2009, for process generations 0.7um, 0.5um, 0.35um, 0.25um, 0.18um, 0.13um, 90nm, 65nm, 45nm and 30nm; annotated "~30%".]

Power = Capacitance x Voltage^2 x Frequency
also Power ~ Voltage^3

Design Challenges
• Memory latency not scaling as fast as processor speed
• Power growing non-linearly with single thread performance
• Designer productivity lagging design complexity
• Ability to validate and test complex design
• Keeping up with new process technology every two years

Long Latency DRAM Accesses: Needs Latency Tolerant Techniques
[Figure: bar chart of peak instructions during DRAM access (scale 0 to 1400) for the Pentium® Processor (66 MHz), Pentium-Pro Processor (200 MHz), Pentium III Processor (1100 MHz), Pentium 4 Processor (2 GHz), and Future CPUs.]

DRAM Latency Tolerance
• Continue building even larger caches
  – Every semiconductor process generation provides opportunity to double cache size
  – Cache becomes larger part of die
• Hide multiple threads of execution behind memory latency
• Intel implemented simultaneous multi-threading in 2000
• Implement multi-core products as Moore's Law allows

Situational Analysis
• With each process generation, transistor density doubles
  – Frequency has increased by ~1.5x; ~1.3x in future
  – Vcc has scaled by ~0.8x; ~0.9x in future
  – Capacitance has scaled by 0.7x
  – Total power may not scale down due to increased leakage
• Instruction Level Parallelism harder to find
• Increasing single-stream performance often requires non-linear increase in design complexity
• Many server applications are inherently parallel
• Parallelism exists in multimedia applications
• Multi-tasking usage models becoming popular
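
The scaling factors on this slide can be combined with Power = Capacitance x Voltage^2 x Frequency to see why power becomes a constraint. The sketch below is illustrative only. It assumes dynamic power alone (no leakage), that the 0.7x capacitance factor applies per transistor, and that the doubled transistor count stays in the same die area. None of these assumptions is stated on the slide, and the function names are hypothetical.

```python
# Sketch: dynamic power per unit area across one process generation.
# Assumptions (not from the slide): dynamic power only, capacitance
# factor applies per transistor, die area held constant.

def dynamic_power_ratio(cap_scale, vcc_scale, freq_scale):
    """Per-transistor dynamic power ratio from P = C * V^2 * f."""
    return cap_scale * vcc_scale ** 2 * freq_scale


def power_density_ratio(cap_scale, vcc_scale, freq_scale, density_scale=2.0):
    """Power per unit area when transistor density grows by density_scale."""
    return density_scale * dynamic_power_ratio(cap_scale, vcc_scale, freq_scale)


# Historical factors from the slide: C 0.7x, Vcc 0.8x, frequency 1.5x.
historical = power_density_ratio(0.7, 0.8, 1.5)

# Future factors from the slide: C 0.7x, Vcc 0.9x, frequency 1.3x.
future = power_density_ratio(0.7, 0.9, 1.3)

print(f"historical power density ratio per generation: {historical:.3f}")
print(f"future power density ratio per generation:     {future:.3f}")
```

Under these assumptions, power density rises by roughly a third per generation with the historical factors and by nearly half with the future factors, even though frequency gains are smaller. That is because supply voltage scales less.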

Processor Power
[Figure: Processor Power]

Design Complexity and Productivity Factors
• Huge transistor budgets stress ability to design and verify complex chips
• Multi-core fits well with increasing transistor budgets
• Multi-core design addresses density/designer gap

Iron Law of Performance
• Execution Time is the product of
  – Path Length
  – Cycles Per Instruction (CPI)
  – Cycle Time
• CPI is the sum of
  – infinite-cache core CPI
  – miss rate * effective memory latency
• Bad (good) news is that performance does not scale up (down) linearly with frequency
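
The Iron Law can be sketched numerically to show why performance does not track frequency. This is a minimal illustration, assuming that miss rate is expressed per instruction and that memory latency is fixed in nanoseconds, so its cost in cycles grows with clock frequency. The workload numbers are hypothetical and the function names are not from the slide.

```python
# Sketch of the Iron Law: time = path length * CPI * cycle time.
# Assumption: memory latency is constant in nanoseconds, so the
# latency measured in cycles grows with frequency.

def cycles_per_instruction(core_cpi, misses_per_instruction,
                           memory_latency_ns, frequency_ghz):
    """CPI = infinite-cache core CPI + miss rate * memory latency in cycles."""
    memory_latency_cycles = memory_latency_ns * frequency_ghz
    return core_cpi + misses_per_instruction * memory_latency_cycles


def execution_time_ns(path_length, core_cpi, misses_per_instruction,
                      memory_latency_ns, frequency_ghz):
    """Execution time = path length * CPI * cycle time."""
    cpi = cycles_per_instruction(core_cpi, misses_per_instruction,
                                 memory_latency_ns, frequency_ghz)
    cycle_time_ns = 1.0 / frequency_ghz
    return path_length * cpi * cycle_time_ns


# Hypothetical workload: 1e9 instructions, core CPI of 1.0,
# 0.01 misses per instruction, 100 ns memory latency.
workload = dict(path_length=1e9, core_cpi=1.0,
                misses_per_instruction=0.01, memory_latency_ns=100.0)

time_at_2ghz = execution_time_ns(frequency_ghz=2.0, **workload)
time_at_4ghz = execution_time_ns(frequency_ghz=4.0, **workload)
speedup = time_at_2ghz / time_at_4ghz

print(f"speedup from doubling frequency: {speedup:.2f}x")
```

With these hypothetical numbers, doubling the frequency gives well under a 2x speedup because the memory term of CPI doubles along with the clock. The same arithmetic explains the "good news": lowering frequency costs less performance than the frequency drop itself.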

The Magic of Voltage Scaling
• Power = Capacitance * Voltage^2 * Frequency
• Frequency ∝ Voltage in region of interest
• Power increases as the cube of Frequency
• Good news is that voltage scaling works
• 10% reduction in voltage yields
  – 10% reduction in frequency
  – roughly 30% reduction in power
  – less than 10% reduction in performance
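
The cubic relationship can be checked with a few lines of arithmetic. The sketch below assumes capacitance is unchanged and that frequency is exactly proportional to voltage, which the slide states only for the "region of interest". The function name is hypothetical.

```python
# Sketch: effect of scaling supply voltage on frequency and power.
# Assumptions: capacitance unchanged, frequency proportional to voltage.

def scale_with_voltage(voltage_scale):
    """Return (frequency_scale, power_scale) for a given voltage scale."""
    frequency_scale = voltage_scale  # frequency proportional to voltage
    power_scale = voltage_scale ** 2 * frequency_scale  # P = C * V^2 * f
    return frequency_scale, power_scale


freq_scale, power_scale = scale_with_voltage(0.9)

print(f"frequency reduction: {1 - freq_scale:.1%}")
print(f"power reduction:     {1 - power_scale:.1%}")
```

Under these assumptions, a 10% voltage reduction cuts power by about 27% (0.9 cubed is 0.729), which the slide rounds to 30%.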

Simple Dual Core Example
• Assume Single Core processor at 100W
  – 80W for core, 20W for cache and I/O
  – 50% of die area is core
• Dual core within same power envelope
  – 20W for I/O and cache
  – 40W per core
  – Die size increases by 50%
  – Reduce voltage by 21% to reduce core power to 40W
  – Frequency reduces by ~20%
  – Single thread perf reduces by ~15%
  – Throughput increases by 70-80%
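
The dual-core numbers above follow from the cubic power relationship. The sketch below reproduces them under stated assumptions: core power scales as the cube of voltage, frequency is proportional to voltage, and both cores are kept fully busy. The parameter `perf_sensitivity` is hypothetical. It is set to 0.75 so that a ~20% frequency drop costs ~15% single-thread performance, as on the slide.

```python
# Sketch: fitting two cores into a single-core power envelope.
# Assumptions: core power ~ voltage^3, frequency ~ voltage, and
# throughput scales perfectly across cores (an optimistic bound).

def dual_core_budget(total_w=100.0, core_w=80.0, cores=2,
                     perf_sensitivity=0.75):
    uncore_w = total_w - core_w                 # cache and I/O, unchanged
    per_core_w = (total_w - uncore_w) / cores   # power available per core
    power_scale = per_core_w / core_w           # required per-core power ratio
    voltage_scale = power_scale ** (1.0 / 3.0)  # invert power ~ voltage^3
    frequency_scale = voltage_scale             # frequency ~ voltage
    # Hypothetical model: performance loss is a fixed fraction of frequency loss.
    single_thread = 1.0 - perf_sensitivity * (1.0 - frequency_scale)
    throughput = cores * single_thread
    return {
        "per_core_w": per_core_w,
        "voltage_scale": voltage_scale,
        "frequency_scale": frequency_scale,
        "single_thread": single_thread,
        "throughput": throughput,
    }


result = dual_core_budget()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With the slide's inputs, the required voltage reduction comes out near 21%. Single-thread performance drops by roughly 15%, and throughput rises by about 70%, the low end of the slide's 70-80% range.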

Possible Improvements
• Develop new power efficient core
  – E.g. extensive clock gating
  – Big power savings with little or no performance loss
• Design a smaller core with lower performance
  – Area and power savings much greater than performance loss
  – Use larger number of cores
• Adjust frequency and power of each core with load factor
  – Inactive cores can be put in sleep mode
  – Maintain overall die power constant

A New Era…
THE OLD
• Performance Equals Frequency
• Unconstrained Power
• Voltage Scaling
THE NEW
• Performance Equals IPC
• Multi-Core
• Power Efficiency
• Microarchitecture Advancements

Intel Core Micro-architecture Five Key Innovations
• Intel® Wide Dynamic Execution
• Intel® Intelligent Power Capability
• Intel® Advanced Digital Media Boost
• Intel® Smart Memory Access
• Intel® Advanced Smart Cache

Multi-Core Trajectory
[Figure: trajectory from Dual-Core to Quad-Core; time axis labeled 2H 2006 and 1H 2007.]

Architecture Transitions
Microprocessor Design Model
65nm
  Shrink/Derivative:     PRESLER · YONAH · DEMPSEY
  New Microarchitecture: MEROM · CONROE · WOODCREST
(2 years)
45nm
  Shrink/Derivative:     PENRYN
  New Microarchitecture: NEHALEM
(2 years)
32nm
  Shrink/Derivative:     NEHALEM-C
  New Microarchitecture: GESHER
(2 years)

PRINCIPLES
1. One micro-architecture for all high volume market segments
2. Optimized for performance/watt
3. Parallel design teams
4. No waiting on new process technology
5. Chipset cadence offset for fast ramp

OBJECTIVE: Sustained Technology Leadership

Possible Evolution
• Transistor density doubles with each process generation
• New generation enables complex new core
• Possible alternative design point
  – Double the cache capacity in same area
  – Double the number of processor cores
  – Frequency improves with process technology

Process    Cores    Cache
90 nm      1        Cache
65 nm      2        2 x Cache
45 nm      4        4 x Cache

Ramping Multi-core Everywhere
Segment                                           2005       2006*    2007*
Desktop (Mainstream/Performance Desktop Client)   Shipping   >70%     >90%
Mobile (Mainstream/Performance Mobile Client)     Shipping   >70%     >90%
Server (Server & Workstation)                     Shipping   >85%     ~100%
* Data is projected run rate exiting the year. Source: Intel

Expect to ship >60 million multi-core processors by end of 2006

All products and dates are preliminary and subject to change without notice.

CMP Challenges
• How much Thread Level Parallelism is there in most workloads?
• Ability to generate code with lots of threads & performance scaling
• Thread synchronization
• Operating systems for parallel machines
• Single thread performance tradeoff
• Power limitations
• On-chip interconnect/cache infrastructure
• Memory and I/O bandwidth required

Intel's Software Tools and Support
• Thread Checker
• Thread Profiler
• Math Kernel Libraries
• Performance Primitives
• Compilers
• VTune™ Analyzers
• Solutions, Blueprints, Sizing/Scaling Guides
• Driver Optimization Labs
• Drivers
• Solution Services
• Developer Services
• Software College
• Early Access Programs

How Many Cores?
• Where does the doubling stop?
  – Driven by software issues
• Today Microsoft Windows supports only 64 threads!
• How many applications scale to 64 threads?
• How well does performance scale with thread count?
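
One common way to reason about the last question is Amdahl's law, which is not on the slide and is used here only as an illustrative model. The sketch assumes a fixed workload with a given parallel fraction and no synchronization or memory-bandwidth overhead, so it gives an optimistic bound. The parallel fractions below are hypothetical.

```python
# Sketch: Amdahl's law as a rough upper bound on multi-thread scaling.
# Assumptions: fixed problem size, no synchronization or bandwidth costs.

def amdahl_speedup(parallel_fraction, threads):
    """Speedup over one thread when only parallel_fraction can be parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


# Hypothetical parallel fractions, evaluated at 64 threads.
for parallel_fraction in (0.50, 0.90, 0.95, 0.99):
    speedup = amdahl_speedup(parallel_fraction, 64)
    print(f"parallel fraction {parallel_fraction:.2f}: {speedup:.1f}x at 64 threads")
```

Under this model, even a workload that is 95% parallel gains far less than 64x from 64 threads, which is one reason the core count is described above as driven by software issues.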

Looking Beyond CMP
• How far do we push the number of general purpose cores?
• Is there a role for application specific engines?
• Programming model for heterogeneous cores

Improving Power Efficiency
[Figure: performance versus power, with three regimes and their penalties.
  – Conventional design (one processor): 1% performance for 3% in power
  – CMP* parallelism (2 CMP, 3 CMP, 4 CMP): ~1% performance for 1% in power
  – Specialized (MIPS): >3% performance for 1% in power
  *CMP = Symmetric General Purpose (GP) cores]

Application Specific Engines
• Can achieve better power efficiency than general purpose cores
• Simpler design due to targeted application and lack of support for full operating system
• Challenge
  – Needs to support high volume application
  – Reconfigurable?
• Graphics and Multimedia engines are good candidates

Summary
• One billion transistors are here already!
• Chip Level Multiprocessing and large caches can exploit Moore's Law
• Amount of parallelism in future microprocessor systems will increase
• Heterogeneous cores may emerge eventually
• Need applications and tools that can exploit parallelism
• Design challenges and software issues remain

Collaborate, Innovate, Lead!

Closing Thought
"Don't be encumbered by past history, go off and do something wonderful."
-- Robert Noyce, Intel Co-founder