1
New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies
Fred Pollack Intel Fellow Director of Microprocessor Research Labs [email protected]
Micro32 Contributors: Shekhar Borkar, Ronny Ronen int l Fred Pollack e 2 Moore’s Law Transistors Per Die 108 256M 64M 7 Memory ® 10 16M Pentium III Microprocessor 4M 6 Pentium® II 10 1M 1M ® 256K Pentium Pro 105 Pentium® 10 64K i486™ 16K i386™ 104 4K 80286 1K 8086 103 8080 4004 102
101
100 ’70’70 ’73 ’73 ’76 ’76 ’79 ’79 ’82 ’82 ’85 ’85 ’88 ’88 ’91 ’91 ’94 ’94 '97 '97 2000 2000 Source: Intel Micro32 int l Fred Pollack e 3 In the Last 25 Years Life was Easy
Doubling of transistor density every 30 months Increasing die sizes, allowed by – Increasing Wafer Size – Process technology moving from “black art” to “manufacturing science” Doubling of transistors every 18 months
And, only constrained by cost & mfg limits
But how efficiently did we use the transistors?
Micro32 int l Fred Pollack e 4 Performance Efficiency of µarchitectures Tech Old µArch mm (linear) New µArch mm (linear) Area 1.0µ=== i386C 6.5 i486 11.5 Rti3.1 0.7µ=== i486C 9.5 Pentium® proc 17 3.2 0.5µ=== Pentium® proc 12.2 Pentium Pro® 17.3 2.1 0.18µ=== Pentium III® 10.3 Nextproc Gen ? 2--3 proc
Implications: (in the same technology) 1. New µArch ~ 2-3X die area of the last µArch 2. Provides 1.5-1.7X integer performance of the last µArch
We are on the Wrong Side of a Square Law
Micro32 int l Fred Pollack e 5 Power Efficiency
Power is proportional to Die-area * Frequency ~2X frequency with each process generation – Normally expect 1.5X from process technology – Less gates per pipeline stage, e.g. due to deeper pipelines – Pushing process technology Examples – On 0.35µ Pentium® processor at 200MHz vs Pentium II processor at 300Mhz. Difference due to pipeline depth. – Pentium II processor at 300MHz on .35u vs. Pentium III processor at 600Mhz on 0.25u » Same core µarchitecture » ~50Mhz in speed-path work. The rest was pushing the process technology Other Factors Decreasing voltage and capacitance with each new process technology Increasing use of circuit & µarch techniques for lower power ñIncreasing transistor sizing to push frequency
Micro32 int l Fred Pollack e 6 Trends and expectations
With Each Process Generation: 10,000 Frequency increased by ~2X Frequency doubles (not 1.5X) each generation Vcc will scale by only ~0.8 1,000 (not 0.7) Pentium® III proc Active power will scale by ~0.9 Mhz (not 0.5) Pentium II proc 100 Pentium Pro proc Active power density will Pentium proc increase by ~30-80% i486 (not stay constant) i386 Leakage power will make it 10 even worse, and
Micro32 int l Fred Pollack e 7 As the technology scales...
GATE GATE SOURCE Xj DRAIN
D Tox SOURCE DRAIN BODY BODY Leff = = = = = Width W 0.7, Length L 0.7, tox 0.7 1. Dimensions reduce 30%, this is good 0.7 × 0.7 Area Cap = C a = = 0.7, 0.7
Fringing Cap = C f = 0.7, Total Cap C = 0.7 2. Capacitance on a node reduces by 30%, this is good Die Area = X ×Y = 0.7 × 0.7 = 0.72 3. Transistor density (integration) doubles, this is good Cap = 0.7 = 1 Area 0.7 × 0.7 0.7 4. Capacitance per unit area increases 43%, this is not good Micro32 int l Fred Pollack e 8 Power density continues to get worse
1000 Rocket NozzleSun’s Nuclear Reactor Surface 100 2
Hot plate Pentium III ® processor processor Watts/cm 10 Pentium II ® Pentium Pro ® processor i386 Pentium ® processor i486 1 1.5µ 1µ 0.7µ 0.5µ 0.35µ 0.25µ 0.18µ 0.13µ 0.1µ 0.07µ
Surpassed hot-plate power density in 0.5µ Not too long to reach nuclear reactor
Micro32 int l Fred Pollack e 9 Some implications
We can’t build microprocessors with ever increasing die sizes The constraint is power – not manufacturability Given the trends: – What happens to power if we hold die size constant at each generation – What happens to die size, if we hold power constant at each generation
Micro32 int l Fred Pollack e 10 Constant die size (Allows ~100% growth in transistors each generation) 250 100
Lkg Pwr ) 2 200 Active Pwr ~15mm die 75 1.5X freq increase Power Density each generation 150 50
Watts 100
25 50 Power Density (W/cm Density Power 0 0 0.25µ 0.18µ 0.13µ 0.1µ
Limiters:Limiters: 1.1. PowerPower dissipation,dissipation, 2.2. PowerPower delivery,delivery, andand 3.3. PowerPower densitydensity Micro32 int l Fred Pollack e 11 If you limit die size due to power...
16 100 Die size ) Power Density 2 ~66 Watts total, 12 75 1.5X freq increase each generation
8 50
Die size (mm) 4 25 Power Density (W/cm Density Power 0 0 0.25µ 0.18µ 0.13µ 0.1µ DieDie sizesize hashas toto reducereduce ~25%~25% inin areaarea eacheach generationgeneration ––ImpliesImplies ~50% ~50% vs. vs. the the 200+% 200+% historical historical growth growth in in transistors transistors LimitsLimits performanceperformance PowerPower densitydensity doesdoes notnot improveimprove Micro32 int l Fred Pollack e 12 Therefore
Business-as-usual won’t work We need to look at alternatives – and we all are
Micro32 int l Fred Pollack e 13 Current Directions
Low-power circuit and µarch techniques* SIMD ISA extensions* On-die L2 caches Multiple CPU cores on die Multithreaded CPU On-Die L2 Caches
Micro32 int l Fred Pollack * Not discussed e 14 Memory is more power efficient
100 Static memory has 10X lower ) 2 active power density Lower leakage than logic Logic Leakage control is also easier Memory to implement than logic 10 Integrated L2 provides: 1. Higher bandwidth 2. Lower latency Power Density (Watts/cm Density Power 1 0.25µ 0.18µ 0.13µ 0.1µ So on-die L2 caches make sense
Micro32 int l Fred Pollack e 15 Can easily double the on-die L2 ... Logic Die Size Total Die size 20 75 Power Density Mem Area % 256KB 16 66 Watts constant, ) 512KB 2 1.5X freq increase 1MB 50 each generation 12 2MB
8 25 Die size (mm) Power Density (W/cm 4
0 0 0.25µ 0.18µ 0.13µ 0.1µ
Die space not used for logic can be used for on-die L2 cache – To improve performance with <10% increase in max power >512KB L2 good for server performance, but small impact on PC desktop performance And little help in “real” power density Micro32 int l Fred Pollack e 16 Example: Pentium® III processor on .18µ process technology
• 256KB L2 AdvancedAdvanced • 28 million SystemSystem transistors BufferingBuffering • 106 mm² die size AdvancedAdvanced • Multi-voltage Transfer capability: Transfer 1.1V-1.7V CacheCache • On-die GTL+ termination
Micro32 int l Fred Pollack e 17 Advanced System Buffering
Balanced increase in buffers to minimize bottlenecks Memory Bandwidth – Buffer sizes maximize utilization of the Prefetch 133MHz system bus bandwidth 1010 6 Fill buffers (vs 4) – 50% increase in concurrent non-blocking data cache operations 679 MB/s 8 Bus queue entries (vs 4) MB/s – Allows more outstanding memory/bus operations 4 Writeback buffers (vs 1) – Reduced blocking during cache replacement operations – Faster deallocation time for fill buffers 0.25µ 0.18µ Pentium® III Pentium® III processor processor
0.18µ Pentium® III processor 600 MHz (with ATC, ASB) vs. 0.25µ Pentium III processor 600 MHz Micro32 System configuration: Pre-production Intel VC820 board with 133 MHz system bus, 128MB RDRAM, Seagate int l Fred Pollack Barracuda SCSI, STB4400 Velocity AGP 2X. Intel internal design analysis tools used to obtain measurement data. e 18 Advanced Transfer Cache
Size Performance Gain – 256KB on-die level 2 cache at Equal MHz (600MHz) Organization 20% – 8-way set associative, 1024 sets – 32 byte line (32 bytes data, 4 bytes ECC) 12% – 36-bit physical address space Cache Bus % Improvement – Full speed, scaleable with core % Improvement frequency – 288-bit transfer width (256 data, 32 ECC) SPECint_base95 SPECfp_base95 – 2 cycle back-to-back throughput – >4x reduction in latency (as compared to 0.25µ Pentium® III processor)
0.18µ Pentium III processor 600 MHz (with ATC, ASB) vs. 0.25µ Pentium III processor 600B MHz Source: Intel MAP; Results estimated using Intel C/C++ Compiler 4.5 and Intel Fortran Compiler 4.5 Micro32 System configuration: Pre-production Intel VC820 board with 133 MHz system bus, 256MB RDRAM, int l Fred Pollack IBM371800 ATA-66, Diamond Viper 770 Ultra TNT2 AGP4X e 19 Power Density: Cache vs. Logic
Past: Thermal Uniformity Present: Logic vs. Cache
60 60
50 50
40 40
30 30 Y19 Y19 Y16 Y16 20 20 Y13 Y13 Y10 Y10 10 10 Y7 Y7 0 Y4 0
Y4 X1 X4 X7
X1 Y1 X3 X5 X10 X7 X13 X9 Y1 X16 X19 X11 X13 X15 X17 X19
As die temperature increases, CMOS logic slows down With low power density (past), can assume uniformity With increasing power density and on-die L2 cache, need to consider simplistic non-uniformity
Micro32 int l Fred Pollack e 20 Power Density: The Future
Power Map On-Die Temperature 110 250 100 200 90
150 80
100 70 Heat Flux (W/cm2)
60 (C) Temperature 50 50 0 40
With high power density, cannot assume uniformity – As die temperature increases, CMOS logic slows down – At high die temperatures, long-term reliability can be compromised Silicon is not a good heat conductor – Impact on packaging, w.r.t. cooling Micro32 int l Fred Pollack e 21 Packaging Implications
☛ Power removal for non-uniform heating is a big challenge since we need to spread the heat (smooth local concentrations) & then dissipate it in the ambient Hot spots created on die since we cannot completely smooth them away Die Attach
Spreader Need to spread out local concentrations
Micro32 int l Fred Pollack e 22 Multiple CPU on Die
Shared L2 more efficient than separate L2’s of ½ size About linear performance with CPU CPU die size vs. historical square law Unlikely that both CPUs are at Max Power at same time – Typical application power << max power on each CPU L2 Cache – Can throttle performance if both CPUs approach max power at same time. Can simplify interconnect in SMP system Also, can be used to build highly reliable system via FRC
Micro32 int l Fred Pollack e 23 Multithreading
Single CPU µArch augmented to look as 2 or more CPUs to software
Adds ~10% logic to CPU
Max Power increases <10%
Can increase throughput by 30+%
Helps to address increasing overhead of cache misses
Micro32 int l Fred Pollack e 24 Key Challenges for future Microarchitectures
Special Purpose Performance
Increased Execution Efficiency
Breaking the Dataflow Barrier – With efficiency
Micro32 int l Fred Pollack e 25 Special vs. General Purpose Performance
Special purpose performance can deliver more MIPS/mm². To Date: SIMD integer and floating-point instructions added to several ISAs – <10% in die area and power increase, and 1.5-4x increase in multimedia/3D kernels With future silicon budgets approaching 100M transistors, we need to consider: – Integration of other platform components (e.g. Memory controller, graphics) – Special purpose logic, programmable logic, & separately programmable engines – But all have very complex/costly software issues
Challenge: Design for “Valued Performance” Micro32 int l Fred Pollack e 26 Increased Execution Efficiency
Max Power Occurs on Code that Keeps Pipeline full and all superscalar units busy – Thermal solution designed for Max Power – Power delivery designed for Max Power But few apps spend any significant time operating at Max Power And the wider and deeper the execution core, the greater the inefficiency Thus, with power constraints, need to focus on techniques that increase execution core efficiency and only add modest additional logic and power
Micro32 int l Fred Pollack e 27 Challenges to Increased Efficiency
Improved prediction for less unused speculation Establish & use Confidence measures – For example, don’t speculate on a flaky branch if another thread can better use the execution resources
Micro32 int l Fred Pollack e 28 Data Flow Execution In Order Processors
11 R1R1 AA 11 R1R1 AA ------22 BB R1R1 22 33 BB R1R1 R2R2 CC BB == AA R2 C 3 R2 C 3 44 DD R2R2 ------DD == CC DD R2R2 44 Superscalar Source Scalar Out-of-Order Processors 11 33 R1R1 AA R2R2 CC ------– Order is not important, 2 B R1 D R2 --- data flow (dependencies) matters 2 44 B R1 D R2 --- Out-of-Order Superscalar The Goal: Shortest length as possible!
… But still limited by instruction dependencies
1.1. How How much much limited? limited? 11 22 33 44 2.2. Can Can we we break break the the Data Data Flow Flow Barrier? Barrier? Our Wet Dreams?
Micro32 int l Fred Pollack e 29 Breaking the Barrier: Beyond Data-Flow Execution Driving Idea: – Transform the DFG into an improved DFG which: » Has a shorter critical path (higher parallelism) » Has less instructions Families of Transformations – Safe transformations » Like compilers do, but using dynamic information – Speculative transformations » Guess intermediate values (results, addresses, flags,…) » Ignore dependencies » Verify and redo if wrong » Do it smart to reduce unused speculation (use confidence factor)
Micro32 int l Fred Pollack e 30 Simple Transformation Examples Move Elimination & Memory Bypass
11 R1R1 == R2R2move move [M1][M1] == R1R1store store 22 …… R3R3 == [M2][M2]load load 33 R4R4 == R3+1R3+1 alu alu Safe 44 Source - before & after OOO Transformation Predict M1==M2 [M1][M1] == R2R2 R1R1 == R2R2 X11 22 R3R3 == [M2][M2] ...... 33 R4R4 == R3R3 ++ 11 ...... After Move Elimination (use R2 for R1) 44
[M1][M1] == R2R2 R4R4 == R2R2 ++ 11 22 44 R3R3 == [M2][M2] ...... 33 Need After Memory Bypass Verification M1==M2
Micro32 int l Fred Pollack e 31 Value Prediction Predict the outcome value of an instruction instead of waiting for it to be ready/produced – But confidence factor is key to avoiding unused speculation Enables Dependency Elimination thus collapsing the DFG
11 11 33
22 22
33
Don’t wait for the outcome of “2” in order to execute “3” but predict it’s value
Micro32 int l Fred Pollack e 32 Value Identity Predictor
If you can’t predict the actual value, try to predict if it is identical to a value produced by a prior instruction Enables Dependency Redirection thus collapsing the DFG
After Gen eax … Gen ebx … Use ebx Before
But with confidence
Micro32 int l Fred Pollack e 33 Summary Logic Transistor growth constrained by power – not mfg – At constant power, 50% per process generation vs. over 200% in past Current Directions in microarchitecture that help – SIMD ISA extensions – On-die L2 caches – Multiple CPU cores on die – Multithreaded CPU Key Challenges for future Microarchitectures – Special purpose performance – Increased execution efficiency: improved prediction and confidence – Break the data-flow barrier, but in a power efficient manner CMOS Challenges beyond thermal power – Increasing power density – Leakage Power becoming a significant factor – Increasing and quickly-changing current with lower voltage (di/dt) – SER (soft error rate) – not just a memory problem Micro32 int l Fred Pollack e 34 Conclusion
Power is the key challenge – Need to address at all levels (process, circuits, architecture, compiler) – And, use multi-disciplinary approach
New Goal:
Double Valued Performance every 18 months, at the same power level
Micro32 int l Fred Pollack e