EE382A Lecture 16A: Multi-Core Design Tradeoffs (Cont.)

Department of Electrical Engineering, Stanford University
http://eeclass.stanford.edu/ee382a
EE382A – Autumn 2009, Lecture 16A, John P Shen

Performance Iron Law and Multi-core Scalability
• Law #1 – CPU (Latency) Performance
• Law #2 – MP (Throughput) Performance
• Law #3 – Multi-core Performance Scalability
  – A Performance Scalability Model
• Law #4 – Algorithm and Performance
• Law #5 – Power and Performance
[Figure: diagram labeled Architecture, Performance, Algorithm, Power]

Law #1 – CPU (Latency) Performance
[Figure labels: Deeper pipelining; Wider pipeline; Increased CPI penalty; Cycle Time. Source: John DeVale & Bryan Black, 2005]

Latency vs. Throughput Performance
• Reduce Latency of Application
  – Uni-processor, Single Program
  – Target Single-Thread Performance
  – Examples: SPEC, PC and Workstations
• Increase Throughput of System
  – Multi-processors, Many Threads/Tasks
  – Target Multi-threaded/Multi-tasking Throughput
  – Example: Database Transaction Processing

Law #2 – MP (Throughput) Performance
• Multi-Core Performance: Perf_MC = (n × Frequency) / (PL × CPI)
• Can Improve Perf_MC by:
  – Increasing: n (number of CPUs or cores)
  – Increasing: Frequency (CPU clock frequency)
  – Decreasing: PL (path length, i.e. dynamic instruction count)
  – Decreasing: CPI (cycles/instruction)

Law #3 – Multi-Core Performance Scalability
• Multi-Core Speedup: [equation lost in extraction]
• A Rigorous Scalability Model: [equation lost in extraction]

Scalability Degradation [Carole Dulong et al., 2005]
[Figure: scaling curves for (x+y)=0.10, (x+y)=0.20, (x+y)=0.30]

1p-16p scaling   1-(x+y)   R^2
SEMPHY           0.993     0.999
PLSA             0.963     0.999
Rsearch          0.931     0.997
SVM-RFE          0.786     0.970
SNPs             0.685     0.967
GeneNet          0.642     0.983

Path-Length Breakdown
[Figure not recovered]

CPI Breakdown
[Figure not recovered]

Scalability Headroom?
[Figure labels: Reduce PL(n); Reduce CPI(n); (x+y)=0.35]

Conspiring Forces Against MC Scaling
Three Forms of Scalability Impedance
• Architecture
  – Increase of Path-Length Undermines Scalability
  – Increase of CPI Undermines Scalability
• Algorithm
  – Limitation of Language and Algorithm
  – Tyranny of Amdahl’s Law (sequential bottleneck)
• Power/Thermal
  – Increased Complexity and Inefficiency of Design
  – Super-linear Power Scaling Relative to Performance

Law #4 – Algorithm and Performance (Amdahl’s Law & Gustafson’s Law)
f = sequential %
[Figure: Execution Time vs. Parallelism (n) panels, marked f and f*]

Two Distinct & Correlated Dimensions of Performance Scalability Impedance
• Architecture Scalability: [equation lost in extraction]
• Algorithm Scalability (Amdahl’s Law): Speedup = 1 / (f + (1 − f)/n)

Combined Effect on Actual Speedup (Algorithm and Architecture Scalability)
• 10X SU: n > 13
• 20X SU: n > 33

Impact of Single-thread Performance on Multi-core Performance Scalability
• 10X SU: n > 17
• 20X SU: n > 48

Impact of Single-thread Performance on Multi-core Performance Scalability
• 10X SU: n > 14
• 20X SU: n > 36

Law #5 – Power and Performance

Power and Performance Landscape
[Figure: Power (W) and Spec2K/MHz plotted against Frequency (Hz) for Pentium, Pentium EE, and Pentium M. Source: John DeVale & Bryan Black, 2005]

Power and Throughput Performance [Ed Grochowski, 2005]
[Figure: Relative Power vs. Relative Performance for i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), and Pentium 4 (Psc). Curves: power = perf^1.74 and power = perf^1.0 (marked "?"). Other labels: Scalar/Latency Performance; Throughput Performance; Low EPI]

CPU         EPI
i486        7 nJ
P5          10 nJ
P6          17 nJ
P4P-wmt     27 nJ
P4P-psc     29 nJ
Pentium M   9 nJ

• The issue is not small vs. big cores, nor in-order vs. out-of-order cores. The key metric is EPI.
• The ideal core: ultra-low EPI with best possible single-thread or single-core performance and highly efficient power/performance scaling.

CPU            EPI     SU
i486           7 nJ    1
P5             10 nJ   2
P6             17 nJ   3.5
P4P-wmt        27 nJ   6
P4P-psc        29 nJ   6.5
Pentium M      9 nJ    5.5
Neo-Core (?)   5 nJ    6.5

[Figure labels: P4P: 27 nJ; P5: 10 nJ; P-M: 9 nJ?; Neo: 5 nJ; points for i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), Pentium 4 (Psc), Neo-Core (?)]

• The MC scaling goal is not maximizing the number of cores but achieving the maximum throughput within a fixed power envelope (using the fewest cores).
• The key issue is not the power scaling of replicated cores, but the un-core power scaling that may push total power scaling towards the square law again.
• Assume a large-scale CMP with potentially many cores.
• Replication of cores results in proportional increases in both throughput performance and power (hopefully).
[Figure labels: EPI = 29 nJ; EPI = 10 nJ; EPI = 9 nJ; EPI = 5 nJ; points for i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), Pentium 4 (Psc)]

Holy Grail: <1 nJ building-block core and near-linear un-core power scaling
[Figure: EPI scale (10 nJ, 1 nJ, 0.1 nJ, 0.01 nJ) spanning CPU Cores, Programmable Accelerators, and Fixed Function Units. Other labels: 100 W Power Envelope; 20x Performance Increase]

NP/DSP/GPU    EPI
IXP2800       1 nJ
TMS320C6713   0.7 nJ
GeF7800GTX    0.6 nJ

Power scaling challenges:
• Need low EPI cores
• Control un-core scaling
Speedup scaling challenges:
• Algorithm
  – Sequential %
• Architecture
  – PL scaling
  – CPI scaling

Quo Vadis?
• MC Scalability Strategies:
  – Algorithm – Languages & Specialized Parallelism
  – Architecture – CPI and Path Length Reduction
  – Power/Thermal – EPI Reduction & Scalable Un-core
• Power/Thermal is the most critical MC scalability wall
• Research Challenges:
  – Sequential % mitigation for compelling workloads
  – Ultra-low EPI core with great ST performance
  – Un-core fabric with near-linear power scaling
• Un-core scaling is the new power goblin
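
Law #4 names Amdahl's Law and Gustafson's Law with f as the sequential fraction. A minimal Python sketch of the two textbook formulas follows. It covers only the algorithm dimension and does not model the path-length or CPI growth that the architecture slides discuss. The function names and the sample value f = 0.05 are illustrative assumptions, not values taken from the lecture.

```python
from typing import Optional


def amdahl_speedup(f: float, n: int) -> float:
    """Fixed-workload speedup on n cores when a fraction f of the work is sequential."""
    return 1.0 / (f + (1.0 - f) / n)


def gustafson_speedup(f: float, n: int) -> float:
    """Scaled-workload speedup on n cores when a fraction f of the parallel run is sequential."""
    return f + (1.0 - f) * n


def cores_needed(target: float, f: float, max_n: int = 4096) -> Optional[int]:
    """Smallest core count whose Amdahl speedup reaches target, or None if none is found up to max_n."""
    for n in range(1, max_n + 1):
        if amdahl_speedup(f, n) >= target:
            return n
    return None


if __name__ == "__main__":
    f = 0.05  # assumed sequential fraction, for illustration only
    for n in (1, 4, 16, 64):
        print(n, round(amdahl_speedup(f, n), 2), round(gustafson_speedup(f, n), 2))
    print(cores_needed(5.0, f))
    print(cores_needed(25.0, f))  # Amdahl speedup is bounded by 1/f = 20
```

Under Amdahl's Law the speedup saturates at 1/f no matter how many cores are added, which is the "sequential bottleneck" listed among the conspiring forces. Gustafson's Law grows linearly with n because the problem size is assumed to grow with the machine.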

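The Law #5 slides treat EPI (energy per instruction) as the key metric inside a fixed power envelope. The sketch below works through that arithmetic using the identity power = EPI × instructions per second, plus the power = perf^1.74 trend quoted from the Grochowski figure. The EPI values are the ones listed in the slides. The function names are hypothetical, and the calculation assumes the whole envelope goes to the cores with no un-core power, which the slides themselves flag as unrealistic.

```python
# EPI values in nanojoules, as listed in the lecture's CPU/EPI table.
EPI_NJ = {
    "i486": 7.0,
    "P5": 10.0,
    "P6": 17.0,
    "P4P-wmt": 27.0,
    "P4P-psc": 29.0,
    "Pentium M": 9.0,
}


def throughput_at_power(power_w: float, epi_nj: float) -> float:
    """Instructions per second sustainable at power_w watts for a given EPI.

    Assumes all power is spent executing instructions (no un-core overhead).
    """
    return power_w / (epi_nj * 1e-9)


def relative_power(relative_perf: float, exponent: float = 1.74) -> float:
    """Relative power for a relative single-thread performance under power = perf^exponent."""
    return relative_perf ** exponent


if __name__ == "__main__":
    envelope_w = 100.0  # the 100 W power envelope named on the "Holy Grail" slide
    for cpu, epi in EPI_NJ.items():
        gips = throughput_at_power(envelope_w, epi) / 1e9
        print(f"{cpu:10s} EPI={epi:5.1f} nJ -> {gips:6.2f} G instr/s at {envelope_w:.0f} W")
    # Doubling single-thread performance along the 1.74 trend line:
    print(round(relative_power(2.0), 2))
```

Because throughput at a fixed envelope is inversely proportional to EPI, lowering EPI by a given factor raises the attainable throughput by the same factor. Chasing single-thread performance along the perf^1.74 curve costs power super-linearly.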