EE382A Lecture 16A:
Multi-Core Design Tradeoffs (cont.)
Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a
EE382A – Autumn 2009 Lecture 16A - 1 John P Shen
Performance Iron Law and Multi-core Scalability
• Law #1 – CPU (Latency) Performance
• Law #2 – MP (Throughput) Performance
• Law #3 – Multi-core Performance Scalability
  – A Performance Scalability Model
• Law #4 – Algorithm and Performance
• Law #5 – Power and Performance
[Diagram labels: Architecture; Performance; Algorithm; Power]
Law #1 – CPU (Latency) Performance
[John DeVale & Bryan Black, 2005]
[Figure labels: Deeper pipelining; Wider pipeline; Increased CPI penalty; Cycle Time; Increased]
Latency vs. Throughput Performance
• Reduce Latency of Application
  – Uni-processor, Single Program
  – Target Single-Thread Performance
  – Examples: SPEC, PC and Workstations
• Increase Throughput of System
  – Multi-processors, Many Threads/Tasks
  – Target Multi-threaded/Multi-tasking Throughput
  – Example: Database Transaction Processing
Law #2 – MP (Throughput) Performance
• Multi-Core Performance:
• Can Improve PerfMC by:
  – Increasing: n (no. of CPUs or cores)
  – Increasing: Frequency (CPU clock frequency)
  – Decreasing: PL (dynamic instruction count)
  – Decreasing: CPI (cycles/instruction)
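The equation for this slide did not survive in the text. As a hedged sketch, the four levers listed above are consistent with an Iron Law form, PerfMC = n × Frequency / (PL × CPI). That form, the function name `perf_mc`, and the numbers used are assumptions for illustration, not taken from the slide.

```python
def perf_mc(n: int, frequency_hz: float, path_length: float, cpi: float) -> float:
    """Assumed throughput model: programs completed per second across n cores.

    Assumption (not from the slide): every core runs its own copy of a program
    with dynamic instruction count `path_length` at `cpi` cycles/instruction.
    """
    if n <= 0 or frequency_hz <= 0 or path_length <= 0 or cpi <= 0:
        raise ValueError("all parameters must be positive")
    return n * frequency_hz / (path_length * cpi)


# Illustrative numbers only: 4 cores at 2 GHz, 1e9 instructions, CPI of 1.0.
baseline = perf_mc(4, 2e9, 1e9, 1.0)
# Doubling the core count doubles throughput under this model.
doubled = perf_mc(8, 2e9, 1e9, 1.0)
print(baseline, doubled)
```

Under this model the four levers are interchangeable: halving PL or CPI has the same effect as doubling n or Frequency.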
Law #3 – Multi-Core Performance Scalability
• Multi-Core Speedup:
• A Rigorous Scalability Model:
Scalability Degradation
[Carole Dulong et al., 2005]
[Figure: scaling curves for (x+y)=0.10, (x+y)=0.20, (x+y)=0.30]
Workload   1p-16p scaling, 1-(x+y)   R^2
SEMPHY     0.993                     0.999
PLSA       0.963                     0.999
Rsearch    0.931                     0.997
SVM-RFE    0.786                     0.970
SNPs       0.685                     0.967
GeneNet    0.642                     0.983
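The column header 1-(x+y) suggests a power-law speedup of the form n^(1-(x+y)), with x and y as the growth exponents of path length and CPI as cores are added. That reading is an assumption, since the model equation itself is missing from the text. Under it, the fitted exponents translate into 16-core speedups as in this minimal sketch.

```python
# Fitted 1p-16p scaling exponents, 1-(x+y), copied from the table above.
SCALING_EXPONENT = {
    "SEMPHY": 0.993,
    "PLSA": 0.963,
    "Rsearch": 0.931,
    "SVM-RFE": 0.786,
    "SNPs": 0.685,
    "GeneNet": 0.642,
}


def speedup(n: int, exponent: float) -> float:
    """Assumed model: speedup(n) = n ** (1 - (x + y)); `exponent` is 1-(x+y)."""
    return n ** exponent


for name, exponent in SCALING_EXPONENT.items():
    print(f"{name:8s} exponent={exponent:.3f} speedup@16={speedup(16, exponent):5.2f}")
```

An exponent of 1.0 would give ideal linear scaling; the further the exponent falls below 1.0, the more of the 16 cores is lost to path-length and CPI growth.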
Path-Length Breakdown
CPI Breakdown
Scalability Headroom
[Figure labels: Reduce PL(n); Reduce CPI(n); (x+y)=0.35]
Conspiring Forces Against MC Scaling
Three Forms of Scalability Impedance
• Architecture
  – Increase of Path-Length Undermines Scalability
  – Increase of CPI Undermines Scalability
• Algorithm
  – Limitation of Language and Algorithm
  – Tyranny of Amdahl’s Law (sequential bottleneck)
• Power/Thermal
  – Increased Complexity and Inefficiency of Design
  – Super-linear Power Scaling Relative to Performance
Law #4 – Algorithm and Performance (Amdahl’s Law & Gustafson’s Law)
f = sequential %
[Figure: two plots of Execution Time vs. Parallelism (n), one marked f and one marked f*]
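The two laws named in the title can be written down directly in their standard textbook forms. In this sketch, `f` is the sequential fraction of the single-core run (Amdahl) and `f_star` is the sequential fraction of the n-core run (Gustafson); reading the slide's f* as the latter is an assumption about its notation.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Fixed problem size: speedup = 1 / (f + (1 - f) / n)."""
    return 1.0 / (f + (1.0 - f) / n)


def gustafson_speedup(f_star: float, n: int) -> float:
    """Problem size scaled with n: speedup = f* + (1 - f*) * n."""
    return f_star + (1.0 - f_star) * n


# Illustrative comparison with a 10% sequential fraction.
for n in (2, 8, 32, 128):
    print(n, round(amdahl_speedup(0.10, n), 2), round(gustafson_speedup(0.10, n), 2))
```

Amdahl's speedup saturates at 1/f no matter how many cores are added, while Gustafson's keeps growing because the parallel work is assumed to grow with n.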
Two Distinct & Correlated Dimensions of Performance Scalability Impedance
Architecture Scalability:
Algorithm Scalability: (Amdahl’s Law)
Combined Effect on Actual Speedup (Algorithm and Architecture Scalability)
10X SU: n > 13; 20X SU: n > 33
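One plausible way to compose the two effects is to apply Amdahl's Law with the parallel portion sped up by n^(1-(x+y)) instead of n. This composition is an assumption: the slide's own combined equation, and the f and (x+y) values behind the thresholds above, are not in the text, so the sketch does not try to reproduce those numbers. The helper `min_cores_for` is hypothetical.

```python
from typing import Optional


def combined_speedup(n: int, f: float, xy: float) -> float:
    """Assumed composition: Amdahl's Law with parallel speedup n ** (1 - xy).

    f  : sequential fraction (algorithm scalability)
    xy : x + y, combined path-length and CPI growth exponent (architecture)
    """
    parallel_speedup = n ** (1.0 - xy)
    return 1.0 / (f + (1.0 - f) / parallel_speedup)


def min_cores_for(target: float, f: float, xy: float, n_max: int = 4096) -> Optional[int]:
    """Smallest core count reaching `target` speedup, or None if out of reach."""
    for n in range(1, n_max + 1):
        if combined_speedup(n, f, xy) >= target:
            return n
    return None


# Illustrative parameters only (not the slide's): f = 1%, x+y = 0.10.
print(min_cores_for(10, 0.01, 0.10), min_cores_for(20, 0.01, 0.10))
```

Either impedance alone raises the core count needed for a given speedup; together they compound, and a large enough f makes some targets unreachable at any n.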
Impact of Single-thread Performance on Multi-core Performance Scalability
10X SU: n > 17; 20X SU: n > 48
Impact of Single-thread Performance on Multi-core Performance Scalability
10X SU: n > 14; 20X SU: n > 36
Law #5 – Power and Performance
Power and Performance Landscape
[Figure: Power (W) and Spec2K/MHz plotted against Frequency (Hz); series: Pentium, Pentium EE, Pentium M. Source: John DeVale & Bryan Black, 2005]
Power and Throughput Performance
[Ed Grochowski, 2005]
[Figure: Relative Power vs. Relative Performance; points: i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), Pentium 4 (Psc); labels: Scalar/Latency Performance, power = perf^(1.74); Throughput Performance, power = perf^(1.0) ?; Low EPI]
CPU         EPI
i486        7 nJ
P5          10 nJ
P6          17 nJ
P4P-wmt     27 nJ
P4P-psc     29 nJ
Pentium M   9 nJ
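The two curves in the figure can be compared numerically. This minimal sketch uses the 1.74 exponent shown for scalar/latency performance and the 1.0 exponent that the figure marks with a question mark for throughput performance; the function name and the sample performance points are illustrative assumptions.

```python
LATENCY_EXPONENT = 1.74     # from the figure: power = perf^1.74
THROUGHPUT_EXPONENT = 1.0   # from the figure, marked "?" there


def relative_power(relative_perf: float, exponent: float) -> float:
    """Relative power needed for a given relative performance: perf ** exponent."""
    return relative_perf ** exponent


for perf in (1, 2, 4, 6):
    latency = relative_power(perf, LATENCY_EXPONENT)
    throughput = relative_power(perf, THROUGHPUT_EXPONENT)
    print(f"perf={perf}x  latency-path power={latency:6.2f}x  throughput-path power={throughput:5.2f}x")
```

The gap between the two columns is the argument for pursuing throughput by replication: the same performance gain costs far more power when it comes from making a single core faster.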
• The issue is not small vs. big cores, nor in-order vs. out-of-order cores. The key metric is EPI.
• The ideal core: ultra-low EPI with best possible single-thread or single-core performance and highly efficient power/performance scaling.
CPU          EPI     SU
i486         7 nJ    1
P5           10 nJ   2
P6           17 nJ   3.5
P4P-wmt      27 nJ   6
P4P-psc      29 nJ   6.5
Pentium M    9 nJ    5.5
Neo-Core ?   5 nJ    6.5
[Figure labels: P4P: 27 nJ; P5: 10 nJ; P-M: 9 nJ?; Neo: 5 nJ; points: i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), Pentium 4 (Psc), Neo-Core ?]
CPU          EPI     SU
i486         7 nJ    1
P5           10 nJ   2
P6           17 nJ   3.5
P4P-wmt      27 nJ   6
P4P-psc      29 nJ   6.5
Pentium M    9 nJ    5.5
Neo core?    5 nJ    6.5
• The MC scaling goal is not maximizing the number of cores but achieving the maximum throughput within a fixed power envelope (using the fewest cores).
• The key issue is not the power scaling of replicated cores, but the un-core power scaling that may push total power scaling towards the square law again.
• Assume a large-scale CMP with potentially many cores.
• Replication of cores results in proportional increases in both throughput performance and power (hopefully).
[Figure labels: EPI = 29 nJ; EPI = 10 nJ; EPI = 9 nJ; EPI = 5 nJ; points: i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), Pentium 4 (Psc)]
Holy Grail: <1 nJ building-block core and near-linear un-core power scaling
[Figure: 100W Power Envelope; EPI scale 10 nJ, 1 nJ, 0.1 nJ, 0.01 nJ across CPU Cores, Prog. Accelerators, Fixed Function Units; two steps each labeled "20x Performance Increase"]
NP/DSP/GPU    EPI
IXP2800       1 nJ
TMS320C6713   0.7 nJ
GeF7800GTX    0.6 nJ
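Because power is energy per instruction times instructions per second, a fixed power envelope caps throughput at power / EPI. The sketch below applies that to the 100 W envelope and the EPI points on this slide. It assumes the whole envelope is spent on instruction execution, with no un-core power, which is exactly the idealization the slide calls into question.

```python
POWER_ENVELOPE_W = 100.0  # from the slide


def max_throughput_ips(power_w: float, epi_nj: float) -> float:
    """Upper bound on instructions/second: power (W) / EPI (J per instruction).

    Assumption: all power goes to executing instructions (zero un-core power).
    """
    return power_w / (epi_nj * 1e-9)


# EPI points taken from the slide's scale, in nanojoules.
for epi_nj in (10, 1, 0.1, 0.01):
    gips = max_throughput_ips(POWER_ENVELOPE_W, epi_nj) / 1e9
    print(f"EPI={epi_nj:>5} nJ -> at most {gips:,.0f} billion instructions/s")
```

Each tenfold reduction in EPI buys a tenfold increase in the throughput ceiling at the same power, which is why EPI rather than core size is treated as the key metric.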
Power scaling challenges:
• Need low EPI cores
• Control un-core scaling
Speedup scaling challenges:
• Algorithm
  – Sequential %
• Architecture
  – PL scaling
  – CPI scaling
Quo Vadis?
• MC Scalability Strategies:
  – Algorithm: Languages & Specialized Parallelism
  – Architecture: CPI and Path Length Reduction
  – Power/Thermal: EPI Reduction & Scalable Un-core
• Power/Thermal is the most critical MC scalability wall
• Research Challenges:
  – Sequential % mitigation for compelling workloads
  – Ultra-low EPI core with great ST performance
  – Un-core fabric with near-linear power scaling
• Un-core scaling is the new power goblin