MultiMulti--CoreCore MicroprocessorMicroprocessor Chips:Chips: MotivationMotivation && ChallengesChallenges

DileepDileep Bhandarkar,Bhandarkar, Ph.Ph. D.D. Architect at Large Digital Enterprise Group Corporation

May 2006

Copyright © 2006 Intel Corporation. 2006 Intel Distinguished Lecture Agenda

yy SemiconductorSemiconductor TechnologyTechnology EvolutionEvolution yy DesignDesign ChallengesChallenges yy WhyWhy MultiMulti--CoreCore ProcessorProcessor Chips?Chips? yy Power/PerformancePower/Performance TradeTrade--OffsOffs yy CMPCMP DirectionsDirections yy BeyondBeyond CMPCMP yy SummarySummary

©2006, Intel Corporation Intel, the Intel logo, , and are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries *Other names and brands may be claimed as the property of others www.intel.com/education 2006 Intel Distinguished Lecture IntelIntel only:only: OnOn--timetime ““22--yearyear--cyclecycle”” 180nm 130nm 90nm 65nm 45nm Wafer Size (mm): 200 200/300 300 300 300 1st Production: 1999 2001 2003 2005 2007

Transistors: SiGe SiGe

Interconnects:

100nm LG 70nm LG 50nm LG 35nm LG Details CoSi2 CoSi2 NiSi NiSi Coming! Strain Si Strain Si 6 Al 6 Cu 7 Cu 8 Cu SiOF SiOF Low-k Low-k 4545 nmnm LogicLogic ProcessProcess onon TrackTrack forfor DeliveryDelivery inin 20072007

Process Name P1262 P1264 P1266 P1268

Lithography 90 nm 65 nm 45 nm 32 nm

1st Production 2003 2005 2007 2009

Moore'sMoore's LawLaw continues!continues! IntelIntel continuescontinues toto developdevelop aa newnew technologytechnology generationgeneration everyevery 22 yearsyears Intel 11th EMEA Academic Forum

HistoricalHistorical DrivingDriving ForcesForces

Increased Performance Shrinking Geometry via Increased Frequency

100000 10

10000

1 1000 Feature Frequency Size (MHz) 100 (um) 0.1 10

1 0.01 1970 1980 1990 2000 2010 2020 1970 1980 1990 2000 2010 2020

1971 1978 1985 1993 2005 4004 Processor 8008 Processor Processor Pentium Processor 2300 Transistors IBM PC 32-bit 3.1M transistors 1.7B Transistors TheThe ChallengesChallenges

Power Limitations Diminishing Voltage Scaling

1000 10 0.0.7um7um 0.0.5um5um 0.0.35um35um 0.0.25um25um 0.0.18um18um ~3 0 % 0.0.13um13um CPU Supply 90nm90nm 100 1 65nm65nm Power Voltage 45nm45nm (W) (V) 30nm30nm

10 0.1 1990 1995 2000 2005 2010 2015 1990 1993 1997 2 001 2005 2 00 9

Power = Capacitance x Voltage2 x Frequency also Power ~ Voltage3 Agenda

yy SemiconductorSemiconductor TechnologyTechnology EvolutionEvolution yy DesignDesign ChallengesChallenges yy WhyWhy MultiMulti--CoreCore ProcessorProcessor Chips?Chips? yy Power/PerformancePower/Performance TradeTrade--OffsOffs yy CMPCMP DirectionsDirections yy BeyondBeyond CMPCMP yy SummarySummary

©2005, Intel Corporation Intel, the Intel logo, Pentium, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries *Other names and brands may be claimed as the property of others www.intel.com/education 2006 Intel Distinguished Lecture DesignDesign ChallengesChallenges yy MemoryMemory latencylatency notnot scalingscaling asas fastfast asas processorprocessor speedspeed yy PowerPower growinggrowing nonnon--linearlylinearly withwith singlesingle threadthread performanceperformance yy DesignerDesigner productivityproductivity lagginglagging designdesign complexitycomplexity yy AbilityAbility toto validatevalidate andand testtest complexcomplex designdesign yy KeepingKeeping upup withwith newnew processprocess technologytechnology everyevery twotwo yearsyears

www.intel.com/education 2006 Intel Distinguished Lecture LongLong LatencyLatency DRAMDRAM Accesses:Accesses: NeedsNeeds LatencyLatency TolerantTolerant TechniquesTechniques

1400

1200

1000

800

600 ions during DRAM Access ions during DRAM ions during DRAM Access ions during DRAM 400

200 Peak Instruct Peak Instruct 0 Pentium® Pentium-Pro Pentium III Future CPUs Processor Processor Processor Processor 66 MHz 200 MHz 1100 MHz 2 GHz

www.intel.com/education 2006 Intel Distinguished Lecture DRAMDRAM LatencyLatency ToleranceTolerance yy ContinueContinue buildingbuilding eveneven largerlarger cachescaches – Every semiconductor process generation provides opportunity to double cache size – Cache becomes larger part of die yy HideHide multiplemultiple threadsthreads ofof executionexecution behindbehind memorymemory latencylatency yy IntelIntel implementedimplemented simultaneoussimultaneous multimulti-- threadingthreading inin 20002000 yy ImplementImplement multimulti--corecore productsproducts asas MooreMoore’’ss LawLaw allowsallows

www.intel.com/education 2006 Intel Distinguished Lecture Agenda

yy SemiconductorSemiconductor TechnologyTechnology EvolutionEvolution yy DesignDesign ChallengesChallenges yy WhyWhy MultiMulti--CoreCore ProcessorProcessor Chips?Chips? yy Power/PerformancePower/Performance TradeTrade--OffsOffs yy CMPCMP DirectionsDirections yy BeyondBeyond CMPCMP yy SummarySummary

©2005, Intel Corporation Intel, the Intel logo, Pentium, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries *Other names and brands may be claimed as the property of others www.intel.com/education 2006 Intel Distinguished Lecture SituationalSituational AnalysisAnalysis y With Each Process Generation transistor density doubles – Frequency has increased by ~1.5X; ~1.3x in future – Vcc has scaled by about ~0.8x; ~0.9x in future – Capacitance has scaled by 0.7x – Total power may not scale down due to increased leakage y Instruction Level Parallelism harder to find y Increasing single-stream performance often requires non-linear increase in design complexity y Many server applications are inherently parallel y Parallelism exists in multimedia applications y Multi-tasking usage models becoming popular

www.intel.com/education 2006 Intel Distinguished Lecture ProcessorProcessor PowerPower

www.intel.com/education 2006 Intel Distinguished Lecture DesignDesign ComplexityComplexity andand ProductivityProductivity factorsfactors yy HugeHuge transistortransistor budgetsbudgets stressstress abilityability toto designdesign andand verifyverify complexcomplex chipschips yy MultiMulti--corecore fitsfits wellwell withwith increasingincreasing transistortransistor budgetsbudgets yy MultiMulti--corecore designdesign addressesaddresses density/designerdensity/designer gapgap

www.intel.com/education 2006 Intel Distinguished Lecture Agenda

yy SemiconductorSemiconductor TechnologyTechnology EvolutionEvolution yy DesignDesign ChallengesChallenges yy WhyWhy MultiMulti--CoreCore ProcessorProcessor Chips?Chips? yy Power/PerformancePower/Performance TradeTrade--OffsOffs yy CMPCMP DirectionsDirections yy BeyondBeyond CMPCMP yy SummarySummary

©2005, Intel Corporation Intel, the Intel logo, Pentium, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries *Other names and brands may be claimed as the property of others www.intel.com/education 2006 Intel Distinguished Lecture IronIron LawLaw ofof PerformancePerformance yy ExecutionExecution TimeTime isis thethe productproduct ofof –– PathPath LengthLength –– CyclesCycles PerPer InstructionInstruction (CPI)(CPI) –– CycleCycle TimeTime yyCPICPI isis thethe sumsum ofof –– infiniteinfinite--cachecache corecore cpicpi –– missmiss raterate ** effectiveeffective memorymemory latencylatency yy BadBad (good)(good) newsnews isis thatthat performanceperformance doesdoes notnot scalescale upup (down)(down) linearlylinearly withwith frequencyfrequency

www.intel.com/education 2006 Intel Distinguished Lecture TheThe MagicMagic ofof VoltageVoltage ScalingScaling yy PowerPower == CapacitanceCapacitance ** VoltageVoltage2 * FrequencyFrequency yy FrequencyFrequency αα VoltageVoltage inin regionregion ofof interestinterest yy PowerPower increasesincreases asas thethe cubecube ofof FrequencyFrequency yy GoodGood newsnews isis thatthat voltagevoltage scalingscaling worksworks yy 10%10% reductionreduction inin voltagevoltage yieldsyields – 10% reduction in frequency – 30% reduction in power – less than 10% reduction in performance

www.intel.com/education 2006 Intel Distinguished Lecture SimpleSimple DualDual CoreCore ExampleExample yy AssumeAssume SingleSingle CoreCore processorprocessor atat 100W100W – 80W for core, 20W for cache and I/O – 50% die are is core yy DualDual corecore withinwithin samesame powerpower envelopenvelop – 20W for I/O and cache – 40W per core – Die size increases by 50% – Reduce voltage by 21% to reduce core power to 40W – Frequency reduces by ~20% – Single thread perf reduces by ~15% – Throughput increases by 70-80%

www.intel.com/education 2006 Intel Distinguished Lecture PossiblePossible ImprovementsImprovements yy DevelopDevelop newnew powerpower efficientefficient corecore – E.g. extensive clock gating – Big power savings with little or no performance loss yy DesignDesign aa smallersmaller corecore withwith lowerlower performanceperformance – Area and power savings much greater than performance loss – Use larger number of cores yy AdjustAdjust frequencyfrequency andand powerpower ofof eacheach corecore withwith loadload factorfactor – Inactive cores can be put in sleep mode – Maintain overall die power constant

www.intel.com/education 2006 Intel Distinguished Lecture AA NewNew EraEra…… THE NEW

Performance Equals IPC THE OLD Multi-Core Power Efficiency Performance Advancements Equals Frequency Unconstrained Power

Voltage Scaling IntelIntel CoreCore MicroMicro--architecturearchitecture FiveFive KeyKey InnovationsInnovations

Intel® Wide Intel® Intelligent Dynamic Execution Power Capability

Intel® Advanced Intel® Smart Digital Media Boost Memory Access

Intel® Advanced Smart Cache MultiMulti--CoreCore TrajectoryTrajectory

QuadQuad--CoreCore

DualDual--CoreCore

2H2H 20062006 1H1H 20072007 Architecture Transitions Microprocessor Design Model

Shrink/Derivative PRESLER · · DEMPSEY 65nm PRINCIPLES New Microarchitecture

2 YEARS · · WOODCREST 1. One micro-architecture for all high volume market Shrink/Derivative segments 45nm 2. Optimized for performance/watt New Microarchitecture

2 YEARS NEHALEM 3. Parallel design teams

Shrink/Derivative 4. No waiting on new NEHALEM-C process technology 32nm 5. Chipset cadence offset New Microarchitecture for fast ramp

2 YEARS GESHER

OBJECTIVE: Sustained Technology Leadership Agenda

yy SemiconductorSemiconductor TechnologyTechnology EvolutionEvolution yy DesignDesign ChallengesChallenges yy WhyWhy MultiMulti--CoreCore ProcessorProcessor Chips?Chips? yy Power/PerformancePower/Performance TradeTrade--OffsOffs yy CMPCMP DirectionsDirections yy BeyondBeyond CMPCMP yy SummarySummary

©2005, Intel Corporation Intel, the Intel logo, Pentium, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries *Other names and brands may be claimed as the property of others www.intel.com/education 2006 Intel Distinguished Lecture PossiblePossible EvolutionEvolution y Transistor density doubles with each process generation y New generation enables complex new core y Possible alternative design point – Double the cache capacity in same area – Double the number of processor cores – Frequency improves with process technology Core Core Core Core Core Core Core

Cache 2 x Cache 4 x Cache

90 nm 65 nm 45 nm www.intel.com/education 2006 Intel Distinguished Lecture Ramping Multi-core Everywhere

2005 2006* 2007*

Desktop Shipping >70% >90% Mainstream/Performance Desktop Client

Mobile Shipping >70% >90% Mobile Mainstream/Performance Client

Server Shipping >85% ~100% Server & Workstation * Data is projected run rate exiting the year. Source: Intel

Expect to ship >60 million multi-core processors by end of 2006

All products and dates are preliminary and subject to change without notice. CMPCMP ChallengesChallenges yy HowHow muchmuch ThreadThread LevelLevel ParallelismParallelism isis therethere inin mostmost workloads?workloads? yy AbilityAbility toto generategenerate codecode withwith lotslots ofof threadsthreads && performanceperformance scalingscaling yy ThreadThread synchronizationsynchronization yy OperatingOperating systemssystems forfor parallelparallel machinesmachines yy SingleSingle threadthread performanceperformance tradeofftradeoff yy PowerPower limitationslimitations yy OnOn--chipchip interconnect/cacheinterconnect/cache infrastructureinfrastructure yy MemoryMemory andand I/OI/O bandwidthbandwidth requiredrequired

www.intel.com/education 2006 Intel Distinguished Lecture IntelIntel’’ss SoftwareSoftware ToolsTools andand SupportSupport

Thread Checker Solutions, Blueprints, Thread Profiler Sizing/Scaling Guides

Math Kernel Libraries Driver Optimization Performance Primitives Labs Drivers

Solution Services Compilers Developer Services

Software College VTune™ Analyzers VTune™ Analyzers Early Access Programs

www.intel.com/education 2006 Intel Distinguished Lecture HowHow ManyMany Cores?Cores? yyWhereWhere doesdoes thethe doublingdoubling stop?stop? –– DrivenDriven byby softwaresoftware issuesissues yyTodayToday MicrosoftMicrosoft WindowsWindows supportssupports onlyonly 6464 threads!threads! yyHowHow manymany applicationsapplications scalescale toto 6464 threads?threads? yyHowHow wellwell doesdoes performanceperformance scalescale withwith threadthread count?count?

www.intel.com/education 2006 Intel Distinguished Lecture Agenda

yy SemiconductorSemiconductor TechnologyTechnology EvolutionEvolution yy DesignDesign ChallengesChallenges yy WhyWhy MultiMulti--CoreCore ProcessorProcessor Chips?Chips? yy Power/PerformancePower/Performance TradeTrade--OffsOffs yy CMPCMP DirectionsDirections yy BeyondBeyond CMPCMP yy SummarySummary

©2005, Intel Corporation Intel, the Intel logo, Pentium, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries *Other names and brands may be claimed as the property of others www.intel.com/education 2006 Intel Distinguished Lecture LookingLooking BeyondBeyond CMPCMP yyHowHow farfar dodo wewe pushpush thethe numbernumber ofof generalgeneral purposepurpose cores?cores? yyIsIs therethere areare rolerole forfor applicationapplication specificspecific engines?engines? yyProgrammingProgramming modelmodel forfor heterogeneousheterogeneous corescores

www.intel.com/education 2006 Intel Distinguished Lecture ImprovingImproving PowerPower EfficiencyEfficiency

penalty:

>3% performance Specialized MIPS for 1% in power

4 CMP penalty: *CMP = Symmetric General Purpose (GP) cores parallelism P* CM 3 CMP

Performance 2 CMP ~1% performance for 1% in power

One processor ign 1% performance al des ntion for 3% in power Conve

Power

www.intel.com/education 2006 Intel Distinguished Lecture ApplicationApplication SpecificSpecific EnginesEngines yy CanCan achieveachieve betterbetter powerpower efficiencyefficiency thanthan generalgeneral purposepurpose corescores yy SimplerSimpler designdesign duedue toto targetedtargeted applicationapplication andand lacklack ofof supportsupport forfor fullfull operatingoperating systemsystem yy ChallengeChallenge – Needs to support high volume application – Reconfigurable? yy GraphicsGraphics andand MultimediaMultimedia enginesengines areare goodgood candidatescandidates

www.intel.com/education 2006 Intel Distinguished Lecture Agenda

yy SemiconductorSemiconductor TechnologyTechnology EvolutionEvolution yy DesignDesign ChallengesChallenges yy WhyWhy MultiMulti--CoreCore ProcessorProcessor Chips?Chips? yy Power/PerformancePower/Performance TradeTrade--OffsOffs yy CMPCMP DirectionsDirections yy BeyondBeyond CMPCMP yy SummarySummary

©2005, Intel Corporation Intel, the Intel logo, Pentium, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries *Other names and brands may be claimed as the property of others www.intel.com/education 2006 Intel Distinguished Lecture SummarySummary yy OneOne billionbillion transistorstransistors areare herehere already!already! yy ChipChip LevelLevel MultiprocessingMultiprocessing andand largelarge cachescaches cancan exploitexploit MooreMoore’’ss LawLaw yy AmountAmount ofof parallelismparallelism inin futurefuture microprocessormicroprocessor systemssystems willwill increaseincrease yy HeterogeneousHeterogeneous corescores maymay emergeemerge eventuallyeventually yy NeedNeed applicationsapplications andand toolstools thatthat cancan exploitexploit parallelismparallelism yy DesignDesign challengeschallenges andand softwaresoftware issuesissues remainremain Collaborate,Collaborate, Innovate,Innovate, Lead!Lead! www.intel.com/education 2006 Intel Distinguished Lecture ClosingClosing ThoughtThought

““DonDon’’tt bebe encumberedencumbered byby pastpast history,history, gogo offoff andand dodo somethingsomething wonderful.wonderful.””

-- RobertRobert NoyceNoyce IntelIntel CoCo--founderfounder

www.intel.com/education 2006 Intel Distinguished Lecture