EE382A Lecture 16A:

Multi-Core Design Tradeoffs (cont.)

Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a

EE382A – Autumn 2009 Lecture 16A - 1 John P Shen

Performance Iron Law and Multi-core Scalability

• Law #1 – CPU (Latency) Performance
• Law #2 – MP (Throughput) Performance
• Law #3 – Multi-core Performance Scalability (Architecture)
  – A Performance Scalability Model
• Law #4 – Algorithm and Performance (Algorithm)
• Law #5 – Power and Performance (Power)

Law #1 – CPU (Latency) Performance

[Figure: CPU performance tradeoffs. Labels: deeper pipelining, wider pipeline, increased CPI penalty, increased cycle time. Source: John DeVale & Bryan Black, 2005]

Latency vs. Throughput Performance

• Reduce Latency of Application
  – Uni-processor, Single Program
  – Target Single-Thread Performance
  – Examples: SPEC, PC and Workstations

• Increase Throughput of System
  – Multi-processors, Many Threads/Tasks
  – Target Multi-threaded/Multi-tasking Throughput
  – Example: Database Transaction Processing

Law #2 – MP (Throughput) Performance

• Multi-Core Performance:

• Can Improve PerfMC by:
  – Increasing: n (no. of CPUs or cores)
  – Increasing: Frequency (CPU clock frequency)
  – Decreasing: PL (dynamic instruction count)
  – Decreasing: CPI (cycles/instruction)
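The four factors listed above suggest a throughput form of the iron law, PerfMC = n × Frequency / (PL × CPI). The slide's own equation did not survive extraction, so this form is inferred from the bullets rather than quoted. A minimal Python sketch under that assumption, with made-up input values:

```python
def perf_mc(n, frequency_hz, path_length, cpi):
    """Program completions per second, ASSUMING
    PerfMC = n * Frequency / (PL * CPI) (form inferred from the bullets)."""
    if min(n, frequency_hz, path_length, cpi) <= 0:
        raise ValueError("all inputs must be positive")
    return n * frequency_hz / (path_length * cpi)


# Hypothetical baseline: 1 core, 2 GHz, 1e9 dynamic instructions, CPI = 1.0
base = perf_mc(1, 2e9, 1e9, 1.0)

# Same core replicated four times, with PL and CPI assumed unchanged
quad = perf_mc(4, 2e9, 1e9, 1.0)

print(f"baseline: {base:.1f} programs/s, quad-core: {quad:.1f} programs/s")
```

The sketch treats PL and CPI as independent of n, which is the ideal case; Law #3 is about what happens when they are not.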

Law #3 – Multi-Core Performance Scalability

• Multi-Core Speedup:

• A Rigorous Scalability Model:

Scalability Degradation

1p-16p scaling [Carole Dulong et al., 2005]

Workload    1-(x+y)    R^2
SEMPHY      0.993      0.999
PLSA        0.963      0.999
Rsearch     0.931      0.997
SVM-RFE     0.786      0.970
SNPs        0.685      0.967
GeneNet     0.642      0.983

[Figure: scaling curves for (x+y) = 0.10, 0.20, and 0.30]
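To get a feel for what the 1-(x+y) values in the table above mean, the sketch below converts each one into a modeled speedup at 16 cores. It assumes a power-law model, SU(n) = n^(1-(x+y)), which is a reading of the table's column header and R^2 fit, not an equation recovered from the slides. The outputs are model values, not measurements.

```python
# 1-(x+y) exponents copied from the table above
EXPONENTS = {
    "SEMPHY": 0.993,
    "PLSA": 0.963,
    "Rsearch": 0.931,
    "SVM-RFE": 0.786,
    "SNPs": 0.685,
    "GeneNet": 0.642,
}


def modeled_speedup(n, exponent):
    """ASSUMED model: SU(n) = n ** (1 - (x + y)); exponent is 1 - (x + y)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n ** exponent


for name, exponent in EXPONENTS.items():
    print(f"{name:8s} {modeled_speedup(16, exponent):5.1f}x at 16 cores")
```

Under this assumption SEMPHY stays close to linear (about 15.7x at 16 cores), while GeneNet drops to roughly 5.9x.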

Path-Length Breakdown

CPI Breakdown

Scalability Headroom

[Figure: scalability headroom plot. Labels: Reduce PL(n), Reduce CPI(n), (x+y) = 0.35]

Conspiring Forces Against MC Scaling

Three Forms of Scalability Impedance

• Architecture
  – Increase of Path-Length Undermines Scalability
  – Increase of CPI Undermines Scalability

• Algorithm
  – Limitation of Language and Algorithm
  – Tyranny of Amdahl’s Law (sequential bottleneck)

• Power/Thermal
  – Increased Complexity and Inefficiency of Design
  – Super-linear Power Scaling Relative to Performance

Law #4 – Algorithm and Performance (Amdahl’s Law & Gustafson’s Law)

f = sequential %

[Figure: two panels plotting execution time against parallelism (n), one labeled f and one labeled f*]
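For reference, the two laws named in the slide title can be written as short functions. This is a sketch using the standard textbook forms, with f as the sequential fraction; treating the slide's f* as the sequential fraction measured on the scaled (parallel) run is an assumption.

```python
def amdahl_speedup(f, n):
    """Fixed problem size: SU = 1 / (f + (1 - f) / n)."""
    if not 0.0 <= f <= 1.0 or n < 1:
        raise ValueError("need 0 <= f <= 1 and n >= 1")
    return 1.0 / (f + (1.0 - f) / n)


def gustafson_speedup(f, n):
    """Problem size grows with n: SU = f + (1 - f) * n,
    where f is the sequential fraction of the scaled run."""
    if not 0.0 <= f <= 1.0 or n < 1:
        raise ValueError("need 0 <= f <= 1 and n >= 1")
    return f + (1.0 - f) * n


for n in (2, 4, 8, 16, 32):
    print(f"n={n:2d}  Amdahl={amdahl_speedup(0.1, n):5.2f}  "
          f"Gustafson={gustafson_speedup(0.1, n):5.2f}")
```

With f = 0.1, Amdahl's speedup can never exceed 1/f = 10 however many cores are added, while the Gustafson form keeps growing because the parallel work grows with n.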

Two Distinct & Correlated Dimensions of Performance Scalability Impedance

Architecture Scalability:

Algorithm Scalability: (Amdahl’s Law)
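The equations for the two dimensions did not survive extraction. One plausible way to combine them, offered only as a sketch, is to let the parallel part of Amdahl's Law scale as n^(1-(x+y)) instead of n. The combined formula, the function names, and the parameter values below are all assumptions for illustration; they are not taken from the slides and are not claimed to reproduce the slides' numbers.

```python
def combined_speedup(n, f, xy):
    """ASSUMED combination: SU = 1 / (f + (1 - f) / n ** (1 - xy)),
    with f the sequential fraction and xy = x + y the architecture loss."""
    if n < 1 or not 0.0 <= f <= 1.0:
        raise ValueError("need n >= 1 and 0 <= f <= 1")
    return 1.0 / (f + (1.0 - f) / n ** (1.0 - xy))


def min_cores_for(target, f, xy, n_max=4096):
    """Smallest n whose modeled speedup reaches target, or None if
    the target is not reached by n_max cores."""
    for n in range(1, n_max + 1):
        if combined_speedup(n, f, xy) >= target:
            return n
    return None


# Hypothetical parameters: no sequential code, 10% architecture loss
print(min_cores_for(10, f=0.0, xy=0.1))
# Hypothetical parameters: 10% sequential code caps speedup below 10x
print(min_cores_for(20, f=0.1, xy=0.0))
```

The second call returns None because a 10% sequential fraction limits the modeled speedup to less than 10x, so no core count in range reaches 20x.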

Combined Effect on Actual Speedup (Algorithm and Architecture Scalability)

• 10X SU: n > 13
• 20X SU: n > 33

Impact of Single-thread Performance on Multi-core Performance Scalability

• 10X SU: n > 17
• 20X SU: n > 48

Impact of Single-thread Performance on Multi-core Performance Scalability

• 10X SU: n > 14
• 20X SU: n > 36

Law #5 – Power and Performance

Power and Performance Landscape

[Figure: plot of Power (W) against Frequency (Hz) and Spec2K/MHz; Pentium EE is labeled. Source: John DeVale & Bryan Black, 2005]

Power and Throughput Performance

[Figure: relative power vs. relative performance for i486, Pentium, Pentium M, Pentium 4 (Wmt), and Pentium 4 (Psc). The scalar/latency performance curve is labeled power = perf^1.74; the throughput performance line, in the low-EPI region, is labeled power = perf^1.0 (?). Source: Ed Grochowski, 2005]

CPU          EPI
i486         7 nJ
P5           10 nJ
P6           17 nJ
P4P-wmt      27 nJ
P4P-psc      29 nJ
Pentium M    9 nJ
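The two power laws labeled in the figure, power = perf^1.74 for scalar/latency performance and the questioned power = perf^1.0 for throughput performance, can be compared numerically. This is a minimal sketch; the function name and the sample performance levels are illustrative choices, not values from the slide.

```python
def relative_power(relative_perf, exponent=1.74):
    """Relative power for a given relative performance, using
    power = perf ** exponent (1.74 is the exponent labeled in the figure)."""
    if relative_perf <= 0:
        raise ValueError("relative_perf must be positive")
    return relative_perf ** exponent


for perf in (1, 2, 4, 6, 8):
    latency = relative_power(perf)            # scalar/latency trend
    throughput = relative_power(perf, 1.0)    # linear (throughput) trend
    print(f"perf={perf}x  power(1.74)={latency:5.1f}x  power(1.0)={throughput:4.1f}x")
```

At 6x relative performance the 1.74 exponent implies roughly 22.6x relative power, against 6x if power scaled linearly with performance.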

• The issue is not small vs. big cores, nor in-order vs. out-of-order cores. The key metric is EPI.
• The ideal core: ultra-low EPI with the best possible single-thread or single-core performance and highly efficient power/performance scaling.

CPU          EPI      SU
i486         7 nJ     1
P5           10 nJ    2
P6           17 nJ    3.5
P4P-wmt      27 nJ    6
P4P-psc      29 nJ    6.5
Pentium M    9 nJ     5.5
Neo-Core ?   5 nJ     6.5

[Figure: relative power vs. relative performance for i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), Pentium 4 (Psc), and a hypothetical Neo-Core (?). Labeled EPI values: P4P 27 nJ, P5 10 nJ, P-M 9 nJ (?), Neo 5 nJ]

CPU          EPI      SU
i486         7 nJ     1
P5           10 nJ    2
P6           17 nJ    3.5
P4P-wmt      27 nJ    6
P4P-psc      29 nJ    6.5
Pentium M    9 nJ     5.5
Neo core ?   5 nJ     6.5

• The MC scaling goal is not maximizing the number of cores but achieving the maximum throughput within a fixed power envelope (using the fewest cores).
• The key issue is not the power scaling of replicated cores, but the un-core power scaling that may push total power scaling towards the square law again.
• Assume a large-scale CMP with potentially many cores.
• Replication of cores results in proportional increases in both throughput performance and power (hopefully).

[Figure: relative power vs. relative performance for i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), and Pentium 4 (Psc), with core-replication lines labeled EPI = 29 nJ, EPI = 10 nJ, EPI = 9 nJ, and EPI = 5 nJ]
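The concern about un-core power can be illustrated with a toy model: core power grows linearly with the core count n, while un-core power grows as n^alpha, where alpha = 1 is the near-linear goal and alpha = 2 is the square law. The model, the names, and every wattage below are hypothetical and chosen only to show the effect on how many cores fit in a fixed envelope.

```python
def total_power(n, core_w, uncore_w, alpha):
    """HYPOTHETICAL model: n * core_w for the replicated cores
    plus uncore_w * n ** alpha for the un-core."""
    return n * core_w + uncore_w * n ** alpha


def max_cores(envelope_w, core_w, uncore_w, alpha):
    """Largest core count whose modeled total power fits the envelope."""
    if core_w <= 0 or uncore_w < 0 or alpha < 0:
        raise ValueError("need core_w > 0, uncore_w >= 0, alpha >= 0")
    n = 0
    while total_power(n + 1, core_w, uncore_w, alpha) <= envelope_w:
        n += 1
    return n


# Made-up numbers: 100 W envelope, 4 W per core, 1 W of un-core at n = 1
for alpha in (1.0, 1.5, 2.0):
    print(f"alpha={alpha}: {max_cores(100.0, 4.0, 1.0, alpha)} cores fit")
```

In this toy setting, moving the un-core exponent from 1 to 2 cuts the number of cores that fit from 20 to 8, even though the cores themselves are unchanged.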

Holy Grail: <1 nJ building-block core and near-linear un-core power scaling

[Figure: EPI scale within a 100 W power envelope. CPU cores, programmable accelerators, and fixed-function units are placed along an EPI axis marked 10 nJ, 1 nJ, 0.1 nJ, and 0.01 nJ; the label "20x Performance Increase" appears twice]

NP/DSP/GPU     EPI
IXP2800        1 nJ
TMS320C6713    0.7 nJ
GeF7800GTX     0.6 nJ
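Because EPI is energy per instruction, a power budget divided by EPI gives an upper bound on the sustained instruction rate. The sketch below applies that to the 100 W envelope and the EPI values shown above. It assumes the whole budget is spent executing instructions, ignoring un-core and idle power, so the results are ceilings rather than predictions.

```python
NJ = 1e-9  # joules per nanojoule


def instructions_per_second(power_w, epi_nj):
    """Upper bound on instruction rate: power (J/s) / EPI (J/instruction)."""
    if power_w < 0 or epi_nj <= 0:
        raise ValueError("need power_w >= 0 and epi_nj > 0")
    return power_w / (epi_nj * NJ)


ENVELOPE_W = 100.0
EPI_NJ = {
    "10 nJ core": 10.0,
    "IXP2800": 1.0,
    "TMS320C6713": 0.7,
    "GeF7800GTX": 0.6,
}

for name, epi in EPI_NJ.items():
    rate = instructions_per_second(ENVELOPE_W, epi)
    print(f"{name:12s} {rate:.2e} instructions/s at {ENVELOPE_W:.0f} W")
```

Each 10x reduction in EPI buys 10x more instructions per second inside the same envelope, which is why the slide frames the goal in terms of EPI rather than core count.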

Power scaling challenges:
• Need low EPI cores
• Control un-core scaling

Speedup scaling challenges:
• Algorithm
  – Sequential %
• Architecture
  – PL scaling
  – CPI scaling

Quo Vadis?

• MC Scalability Strategies:
  – Algorithm: Languages & Specialized Parallelism
  – Architecture: CPI and Path Length Reduction
  – Power/Thermal: EPI Reduction & Scalable Un-core
• Power/Thermal is the most critical MC scalability wall
• Research Challenges:
  – Sequential % mitigation for compelling workloads
  – Ultra-low EPI core with great ST performance
  – Un-core fabric with near-linear power scaling
• Un-core scaling is the new power goblin
