EE382A Lecture 16A: Multi-Core Design Tradeoffs (Cont.)

Department of Electrical Engineering, Stanford University
http://eeclass.stanford.edu/ee382a
EE382A – Autumn 2009, John P Shen

Performance Iron Law and Multi-core Scalability
• Law #1 – CPU (Latency) Performance
• Law #2 – MP (Throughput) Performance
• Law #3 – Multi-core Performance Scalability
  – A Performance Scalability Model
• Law #4 – Algorithm and Performance
• Law #5 – Power and Performance
[Diagram labels: Performance, Architecture, Algorithm, Power]

Law #1 – CPU (Latency) Performance
[John DeVale & Bryan Black, 2005]
[Figure: effect of deeper pipelining and a wider pipeline; labels include cycle time and increased CPI penalty]

Latency vs. Throughput Performance
• Reduce Latency of Application
  – Uni-processor, Single Program
  – Target Single-Thread Performance
  – Examples: SPEC, PC and Workstations
• Increase Throughput of System
  – Multi-processors, Many Threads/Tasks
  – Target Multi-threaded/Multi-tasking Throughput
  – Example: Database Transaction Processing

Law #2 – MP (Throughput) Performance
• Multi-Core Performance: [equation not recovered from the slide]
• Can Improve PerfMC by:
  – Increasing: n (no. of CPUs or cores)
  – Increasing: Frequency (CPU clock frequency)
  – Decreasing: PL (dynamic instruction count)
  – Decreasing: CPI (cycles/instruction)

Law #3 – Multi-Core Performance Scalability
• Multi-Core Speedup: [equation not recovered from the slide]
• A Rigorous Scalability Model: [equation not recovered from the slide]

Scalability Degradation
[Carole Dulong et al., 2005]
[Plot: curves for (x+y)=0.10, (x+y)=0.20, (x+y)=0.30]

1p-16p scaling:
Workload    1-(x+y)    R^2
SEMPHY      0.993      0.999
PLSA        0.963      0.999
Rsearch     0.931      0.997
SVM-RFE     0.786      0.970
SNPs        0.685      0.967
GeneNet     0.642      0.983

Path-Length Breakdown
[Figure not recovered]

CPI Breakdown
[Figure not recovered]

Scalability Headroom?
• Reduce PL(n)
• Reduce CPI(n)
[Plot label: (x+y)=0.35]

Conspiring Forces Against MC Scaling
Three Forms of Scalability Impedance
• Architecture
  – Increase of Path-Length Undermines Scalability
  – Increase of CPI Undermines Scalability
• Algorithm
  – Limitation of Language and Algorithm
  – Tyranny of Amdahl’s Law (sequential bottleneck)
• Power/Thermal
  – Increased Complexity and Inefficiency of Design
  – Super-linear Power Scaling Relative to Performance

Law #4 – Algorithm and Performance (Amdahl’s Law & Gustafson’s Law)
f = sequential %
[Figure: execution time vs. parallelism (n) for sequential fractions f and f*]

Two Distinct & Correlated Dimensions of Performance Scalability Impedance
• Architecture Scalability: [equation not recovered from the slide]
• Algorithm Scalability (Amdahl’s Law): [equation not recovered from the slide]

Combined Effect on Actual Speedup (Algorithm and Architecture Scalability)
• 10X SU: n > 13
• 20X SU: n > 33

Impact of Single-thread Performance on Multi-core Performance Scalability
• 10X SU: n > 17
• 20X SU: n > 48

Impact of Single-thread Performance on Multi-core Performance Scalability
• 10X SU: n > 14
• 20X SU: n > 36

Law #5 – Power and Performance

Power and Performance Landscape
[John DeVale & Bryan Black, 2005]
[Figure: Power (W) and Spec2K/MHz vs. Frequency (Hz) for Pentium, Pentium EE, and Pentium M]

Power and Throughput Performance
[Ed Grochowski, 2005]
[Figure: Relative Power vs. Relative Performance; curves power = perf^1.74 and power = perf^1.0 (?); region labels: Scalar/Latency Performance, Throughput Performance, Low EPI; points: i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), Pentium 4 (Psc)]

CPU         EPI
i486        7 nJ
P5          10 nJ
P6          17 nJ
P4P-wmt     27 nJ
P4P-psc     29 nJ
Pentium M   9 nJ

• The issue is not small vs. big cores, nor in-order vs. out-of-order cores. The key metric is EPI.
• The ideal core: ultra-low EPI with the best possible single-thread or single-core performance and highly efficient power/performance scaling.

CPU         EPI      SU
i486        7 nJ     1
P5          10 nJ    2
P6          17 nJ    3.5
P4P-wmt     27 nJ    6
P4P-psc     29 nJ    6.5
Pentium M   9 nJ     5.5
Neo-Core ?  5 nJ     6.5

[Figure labels: P4P: 27 nJ; P5: 10 nJ; P-M: 9 nJ?; Neo: 5 nJ; points: i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), Pentium 4 (Psc), Neo-Core ?]

• The MC scaling goal is not maximizing the number of cores but achieving the maximum throughput within a fixed power envelope (using the fewest cores).
• The key issue is not the power scaling of replicated cores, but the un-core power scaling that may push total power scaling towards the square law again.
• Assume a large-scale CMP with potentially many cores.
• Replication of cores results in proportional increases in both throughput performance and power (hopefully).

CPU         EPI      SU
i486        7 nJ     1
P5          10 nJ    2
P6          17 nJ    3.5
P4P-wmt     27 nJ    6
P4P-psc     29 nJ    6.5
Pentium M   9 nJ     5.5
Neo core?   5 nJ     6.5

[Figure labels: EPI = 29 nJ, EPI = 10 nJ, EPI = 9 nJ, EPI = 5 nJ; points: i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), Pentium 4 (Psc)]

Holy Grail: <1 nJ building-block core and near-linear un-core power scaling
[Figure: EPI scale 10 nJ, 1 nJ, 0.1 nJ, 0.01 nJ across CPU Cores, Prog. Accelerators, Fixed Function Units; 100 W Power Envelope; 20x Performance Increase]

NP/DSP/GPU     EPI
IXP2800        1 nJ
TMS320C6713    0.7 nJ
GeF7800GTX     0.6 nJ

Power scaling challenges:
• Need low EPI cores
• Control un-core scaling

Speedup scaling challenges:
• Algorithm
  – Sequential %
• Architecture
  – PL scaling
  – CPI scaling

Quo Vadis?
• MC Scalability Strategies:
  – Algorithm – Languages & Specialized Parallelism
  – Architecture – CPI and Path Length Reduction
  – Power/Thermal – EPI Reduction & Scalable Un-core
• Power/Thermal is the most critical MC scalability wall
• Research Challenges:
  – Sequential % mitigation for compelling workloads
  – Ultra-low EPI core with great ST performance
  – Un-core fabric with near-linear power scaling
• Un-core scaling is the new power goblin
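
Worked sketch (not from the slides): the equations for Laws #2 to #4 did not survive extraction, so the Python below is only a hedged illustration of the quantities the slides name. The Amdahl and Gustafson functions use the standard textbook forms with f as the sequential fraction. The multi-core performance function is an assumed form, inferred from the four levers listed under Law #2 (n, Frequency, PL, CPI); the slide's own equation may differ. The power-envelope function relies only on the definition of EPI: power equals energy per instruction times instructions per second. All function names are hypothetical.

```python
import math


def amdahl_speedup(f, n):
    """Fixed-size speedup on n cores when a fraction f of the work is sequential."""
    return 1.0 / (f + (1.0 - f) / n)


def amdahl_limit(f):
    """Upper bound on Amdahl speedup as n grows without bound."""
    return math.inf if f == 0 else 1.0 / f


def gustafson_speedup(f, n):
    """Scaled speedup on n cores; f is the sequential fraction of the parallel run."""
    return f + (1.0 - f) * n


def multicore_perf(n, freq_hz, path_length, cpi):
    """ASSUMED form of Law #2: n * Frequency / (PL * CPI), in program runs per second."""
    return n * freq_hz / (path_length * cpi)


def throughput_in_envelope(power_w, epi_nj):
    """Instructions per second sustainable in a power envelope: power / EPI."""
    return power_w / (epi_nj * 1e-9)


if __name__ == "__main__":
    # Compare the two algorithm-scalability views for a 10% sequential fraction.
    for n in (1, 4, 16, 64):
        print(n, round(amdahl_speedup(0.10, n), 2), round(gustafson_speedup(0.10, n), 2))
    # Instruction throughput inside a 100 W envelope for EPI values from the slides.
    for epi in (29, 10, 9, 5, 1):
        print(epi, "nJ ->", throughput_in_envelope(100, epi), "instructions/s")
```

Under these assumptions, a 100 W envelope sustains about 1e10 instructions per second at 10 nJ per instruction and about 1e11 at 1 nJ, which is the arithmetic behind treating EPI as the key metric for throughput within a fixed power budget.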
