
EECS 570 Lecture 2
Shared Memory

Winter 2021
Prof. Satish Narayanasamy
http://www.eecs.umich.edu/courses/eecs570/

[Image: Intel Paragon XP/S]

Slides developed in part by Drs. Adve, Falsafi, Martin, Musuvathi, Narayanasamy, Nowatzyk, Wenisch, Sarkar, Mikko Lipasti, Jim Smith, John Shen, Mark Hill, David Wood, Guri Sohi, Natalie Enright Jerger, Michel Dubois, Murali Annavaram, Per Stenström, and probably others

Announcements

Discussion this Friday.
• Will discuss Programming Assignment 1

Readings

For today:
❒ David Wood and Mark Hill. “Cost-Effective Parallel Computing,” IEEE Computer, 1995.
❒ Mark Hill et al. “21st Century Computer Architecture.” CCC White Paper, 2012.

For Wednesday 1/27:
❒ H. Kim, R. Vuduc, S. Baghsorkhi, J. Choi, Wen-mei Hwu. Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU), Ch. 1
❒ Christina Delimitrou and Christos Kozyrakis. Amdahl’s law for tail latency. Commun. ACM 61, July 2018

Sequential Merge Sort

16MB input (32-bit integers)

[Figure: sequential execution over time: Recurse(left), then Recurse(right), then Merge to scratch array, then Copy back to input array]

Parallel Merge Sort (as Parallel Directed Acyclic Graph)

16MB input (32-bit integers)

[Figure: parallel execution over time: Recurse(left) and Recurse(right) run in parallel, then Merge to scratch array, then Copy back to input array]
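To make the DAG concrete, here is a minimal parallel merge sort sketch in C using OpenMP tasks. OpenMP, the function names, and the int-array layout are illustrative assumptions (the slides do not prescribe an API); each task is a Recurse node and the taskwait realizes the edges into the Merge node.

#include <string.h>

/* Sequential merge of a[lo..mid) and a[mid..hi) into tmp[lo..hi). */
static void merge(int *a, int *tmp, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
}

/* Call from inside "#pragma omp parallel" / "#pragma omp single";
 * compile with -fopenmp. Sorts a[lo..hi) using tmp as scratch. */
void msort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo <= 1) return;                 /* base case: a leaf node */
    int mid = lo + (hi - lo) / 2;
    #pragma omp task                          /* Recurse(left) node */
    msort(a, tmp, lo, mid);
    msort(a, tmp, mid, hi);                   /* Recurse(right), this thread */
    #pragma omp taskwait                      /* edges into the Merge node */
    merge(a, tmp, lo, mid, hi);               /* merge to scratch array */
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));   /* copy back */
}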

Parallel DAG for Merge Sort (2-core)

[Figure: DAG with two Sequential Sort nodes feeding a Merge node; time flows downward]

Parallel DAG for Merge Sort (4-core)

Parallel DAG for Merge Sort (8-core)

The DAG Execution Model of a Parallel Computation

• Given an input, dynamically create a DAG

• Nodes represent sequential computation
  ❒ Weighted by the amount of work
• Edges represent dependencies:
  ❒ Node A → Node B means that B cannot be scheduled unless A is finished (see the sketch below)
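As one concrete (hypothetical) representation of this model, a runtime could store each node roughly as follows; all names here are illustrative, not from the slides:

typedef struct node node_t;
struct node {
    long    work;        /* node weight: units of sequential work */
    int     num_deps;    /* unfinished predecessors; 0 means schedulable */
    node_t **succs;      /* out-edges: nodes that depend on this one */
    int     num_succs;
};

/* When a node finishes, each successor loses one dependency; any that
 * reach zero become ready. (A real runtime would decrement atomically.) */
void node_finished(node_t *n, void (*make_ready)(node_t *)) {
    for (int i = 0; i < n->num_succs; i++)
        if (--n->succs[i]->num_deps == 0)
            make_ready(n->succs[i]);
}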

Sorting 16 elements in four cores

Sorting 16 elements in four cores (4 element arrays sorted in constant time)

[Figure: the merge-sort DAG with node weights: constant-time leaf sorts and copies of weight 1, two 8-element merges of weight 8, and one 16-element merge of weight 16]

Performance Measures

• Given a graph G, a scheduler S, and P processors

• TP(S) : Time on P processors using scheduler S

• TP : Time on P processors using best scheduler

• T1 : Time on a single processor (sequential cost)

• T∞ : Time assuming infinite resources

Work and Depth

• T1 = Work
  ❒ The total number of operations executed by a computation

• T∞ = Depth
  ❒ The longest chain of sequential dependencies (critical path) in the parallel DAG

T∞ (Depth): Critical Path Length (Sequential Bottleneck)

T1 (Work): Time to Run Sequentially

Sorting 16 elements in four cores (4 element arrays sorted in constant time)

[Figure: the same weighted DAG; Work = the sum of all node weights, Depth = the sum of weights along the critical path]

Some Useful Theorems

Work Law

• “You cannot avoid work by parallelizing”

T1 / P ≤ TP

Speedup = T1 / TP

• Can speedup be more than 2 when we go from 1 core to 2 cores in practice?

Depth Law

• More resources should make things faster
• You are limited by the sequential bottleneck

TP ≥ T∞

Amount of Parallelism

Parallelism = T1 / T∞

Maximum Speedup Possible

Speedup = T1 / TP ≤ T1 / T∞ = Parallelism

“speedup is bounded above by available parallelism”

Greedy Scheduler

• If more than P nodes can be scheduled, pick any subset of size P

• If fewer than P nodes can be scheduled, schedule them all

More Reading: http://www.cs.cmu.edu/afs/cs/academic/class/15492-f07/www/scribe/lec4/lecture4.pdf

Performance of the Greedy Scheduler

TP(Greedy) ≤ T1 / P + T∞

Work law T1 / P ≤ TP

Depth law T∞ ≤ TP
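Why the bound holds (the standard Graham/Brent argument, sketched): call a scheduler step complete if all P processors are busy and incomplete otherwise; a greedy scheduler leaves a processor idle only when fewer than P nodes are ready.

\begin{align*}
\#\text{complete steps} &\le T_1 / P && \text{each retires } P \text{ units of work}\\
\#\text{incomplete steps} &\le T_\infty && \text{each shortens the critical path by } 1\\
\Rightarrow\; T_P(\text{Greedy}) &\le T_1/P + T_\infty
\end{align*}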

Greedy is optimal within a factor of 2

TP ≤ TP(Greedy) ≤ 2 TP

Work law T1 / P ≤ TP

Depth law T∞ ≤ TP

Work/Depth of Merge Sort (Sequential Merge)

• Work T1 : O(n log n)

• Depth T∞ : O(n)
  ❒ Takes O(n) time to merge n elements

• Parallelism:

❒ T1 / T∞ = O(log n) → really bad! (derivation below)
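Where these bounds come from: the two recursive sorts run in parallel, but the sequential merge of n elements sits on the critical path. A quick derivation:

\begin{align*}
T_1(n) &= 2\,T_1(n/2) + \Theta(n) = \Theta(n \log n)\\
T_\infty(n) &= T_\infty(n/2) + \Theta(n) = \Theta(n + n/2 + n/4 + \cdots) = \Theta(n)\\
\text{Parallelism} &= T_1(n)/T_\infty(n) = \Theta(\log n)
\end{align*}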

Main Message

• Analyze the Work and Depth of your algorithm
• Parallelism is Work/Depth
• Try to decrease Depth
  ❒ the critical path
  ❒ a sequential bottleneck
• If you increase Depth
  ❒ better increase Work by a lot more!

Amdahl’s law

• Sorting takes 70% of the execution time of a sequential program

• You replace the sorting algorithm with one that scales perfectly on multi-core hardware

• How many cores do you need to get a 4x speed-up on the program?

Amdahl’s law, f = 70%

Speedup(f, c) = 1 / ((1 – f) + f / c)

f = the parallel portion of execution
1 – f = the sequential portion of execution
c = number of cores used
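Plugging the question’s numbers into this formula shows why the charts that follow never reach 4x:

\begin{align*}
\frac{1}{(1-f) + f/c} = 4,\quad f = 0.7
\;\Rightarrow\; 0.3 + \frac{0.7}{c} = 0.25
\;\Rightarrow\; \frac{0.7}{c} = -0.05
\end{align*}

No positive c satisfies this: as c → ∞ the speedup approaches only 1/(1–f) = 1/0.3 ≈ 3.33x, so a 4x speedup is unreachable no matter how many cores are used.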

Amdahl’s law, f = 70%

[Chart: speedup vs. #cores (1–16), with perfect scaling on the 70% parallel portion; the speedup achieved falls well short of the desired 4x, and the limit as c→∞ is 1/(1–f) = 3.33]

Amdahl’s law, f = 10%

[Chart: speedup vs. #cores (1–16); even with perfect scaling on the 10% parallel portion, the Amdahl’s law limit is just 1.11x]

Amdahl’s law, f = 98%

[Chart: speedup vs. #cores (1–121); with f = 98% the limit is 1/(1–f) = 50x, approached only at very high core counts]

Lesson

• Speedup is limited by sequential code

• Even a small percentage of sequential code can greatly limit potential speedup


21st Century Computer Architecture
A CCC community white paper

http://cra.org/ccc/docs/init/21stcenturyarchitecturewhitepaper.pdf

Slides from M. Hill, HPCA 2014 Keynote

20th Century ICT Set Up

• Information & Communication Technology (ICT) Has Changed Our World

• Required innovations in algorithms, applications, programming languages, …, & system software

• Key (invisible) enablers of (cost-)performance gains
  ❒ Semiconductor technology (“Moore’s Law”)
  ❒ Computer architecture (~80x per Danowitz et al.)

Enablers: Technology + Architecture

[Figure: processor performance gains over time, split into contributions from technology and from architecture. Danowitz et al., CACM 04/2012, Figure 1]

21st Century ICT Promises More

Data-centric personalized health care
Computation-driven scientific discovery
Human network analysis
Much more: known & unknown

21st Century App Characteristics

BIG DATA • ALWAYS ONLINE • SECURE/PRIVATE

Whither enablers of future (cost-)performance gains?

Technology’s Challenges 1/2

Late 20th Century → The New Reality
❒ Moore’s Law (2× transistors/chip) → Transistor count still 2×, BUT…
❒ Dennard Scaling (~constant power/chip) → Gone. Can’t repeatedly double power/chip

Technology’s Challenges 2/2

Late 20th Century → The New Reality
❒ Moore’s Law (2× transistors/chip) → Transistor count still 2×, BUT…
❒ Dennard Scaling (~constant power/chip) → Gone. Can’t repeatedly double power/chip
❒ Modest (hidden) transistor unreliability → Increasing transistor unreliability can’t be hidden
❒ Focus on computation over communication → Communication (energy) more expensive than computation
❒ One-time costs amortized via mass market → One-time cost much worse & want specialized platforms

How should architects step up as technology falters?

“Timeline” from DARPA ISAT

[Figure: system capability (log scale) vs. decade, 80s through 50s; the CMOS curve levels off into a “fallow period” (our focus) until a new technology ramps up]

Source: Advancing Computer Systems without Technology Progress, ISAT Outbrief (http://www.cs.wisc.edu/~markhill/papers/isat2012_ACSWTP.pdf), Mark D. Hill and Christos Kozyrakis, DARPA/ISAT Workshop, March 26-27, 2012. Approved for Public Release, Distribution Unlimited. The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

21st Century Comp Architecture

20th Century → 21st Century
❒ Single-chip in stand-alone computer → Architecture as infrastructure: spanning sensors to clouds
❒ Performance via invisible instruction-level parallelism → Energy first: ● Parallelism ● Specialization ● Cross-layer design; performance + security, privacy, availability, programmability, …
❒ Predictable technologies: CMOS, DRAM, & disks → New technologies (non-volatile memory, near-threshold, 3D, photonics, …); rethink memory & storage, reliability, communication
Cross-cutting: break current layers with new interfaces

Cost-Effective Computing [Wood & Hill, IEEE Computer 1995]

Premise: Isn’t speedup(P) < P inefficient?
  ❒ If only throughput matters, use P computers instead…

• Key observation: much of a computer’s cost is NOT CPU

Let Costup(P) = Cost(P) / Cost(1)
Parallel computing is cost-effective if:
  ❒ Speedup(P) > Costup(P)

• E.g., for SGI PowerChallenge w/ 500 MB
  ❒ Costup(32) = 8.6
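Reading the example through the definition: because memory and other non-CPU components dominate system cost, adding processors raises cost sublinearly, so the bar for cost-effectiveness is Costup(P), not P:

\begin{align*}
\text{Costup}(P) &= \frac{\text{Cost}(P)}{\text{Cost}(1)},\qquad
\text{cost-effective} \iff \text{Speedup}(P) > \text{Costup}(P)\\
\text{Costup}(32) &= 8.6 \;\Rightarrow\; \text{Speedup}(32) > 8.6 \text{ suffices (not } > 32\text{)}
\end{align*}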

Parallel Programming Models and Interfaces

Programming Models

• High level paradigm for expressing an algorithm
  ❒ Examples:
    ❍ Functional
    ❍ Sequential, procedural
    ❍ Shared memory
    ❍ Message Passing
• Embodied in languages that support concurrent execution
  ❒ Incorporated into language constructs
  ❒ Incorporated as libraries added to existing sequential language
• Top level features (for conventional models: shared memory, message passing):
  ❒ Multiple threads are conceptually visible to programmer
  ❒ Communication/synchronization are visible to programmer

An Incomplete Taxonomy

MP Systems:
• Shared Memory: bus-based, or switching fabric-based (incoherent or coherent)
• Message Passing: COTS-based, or switching fabric-based
• Emerging & Lunatic Fringe: VLIW, SIMD, Vector, Data Flow, GPU, MapReduce / WSC, Reconfigurable / FPGA, …

Programming Model Elements

• For both Shared Memory and Message Passing
• Processes and threads
  ❒ Process: A shared address space and one or more threads of control
  ❒ Thread: A program sequencer and private address space
  ❒ Task: Less formal term, part of an overall job
  ❒ Created, terminated, scheduled, etc.
• Communication
  ❒ Passing of data
• Synchronization
  ❒ Communicating control information
  ❒ To assure reliable, deterministic communication

Historical View

[Figure: nine processor (P) + memory (M) + I/O stacks, grouped by where the machines are joined]

Join at:      I/O (Network)   | Memory        | Processor
Program with: Message passing | Shared Memory | Dataflow, SIMD, VLIW, CUDA, other data parallel

Message Passing Programming Model

[Figure: Process P’s local address space holds x; Process Q’s holds y. P executes “send x, Q, t”; Q executes “recv y, P, t”; the two operations match]

• User level send/receive abstraction
  ❒ Match via local buffer (x, y), process (Q, P), and tag (t)
  ❒ Need naming/synchronization conventions

Message Passing Architectures

• Cannot directly access memory of another node

[Figure: four nodes, each CPU($) + Mem + NIC, connected by an interconnect]

• IBM SP-2, Intel Paragon, Myrinet, Quadrics QSW

• Cluster of workstations (e.g., MPI on the Flux cluster)

MPI – Message Passing Interface API

• A widely used standard
  ❒ For a variety of systems
• SMP clusters, workstation clusters, MPPs, heterogeneous systems
• Also works on Shared Memory MPs
  ❒ Easy to emulate distributed memory on shared memory HW
• Can be used with a number of high level languages

• Available in the Flux cluster at Michigan
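A minimal MPI point-to-point sketch in C (the payload, tag, and ranks are illustrative assumptions). Note how MPI_Send/MPI_Recv match on source, tag, and communicator, mirroring the send/recv matching shown earlier:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 42;           /* illustrative payload */
    const int tag = 0;
    if (rank == 0) {
        /* send one int to rank 1 */
        MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive one int from rank 0; matches on (source, tag, comm) */
        MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

Run with, e.g., mpirun -np 2 ./a.out.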

Processes and Threads

• Lots of flexibility (advantage of message passing)
  1. Multiple threads sharing an address space
  2. Multiple processes sharing an address space
  3. Multiple processes with different address spaces (and different OSes)
• 1 and 2 easily implemented on shared memory HW (with single OS)
  ❒ Process and thread creation/management similar to shared memory
• 3 probably more common in practice
  ❒ Process creation often external to execution environment; e.g. shell script
  ❒ Hard for user process on one system to create process on another OS

Communication and Synchronization

• Combined in the message passing paradigm
  ❒ Synchronization of messages part of communication semantics
• Point-to-point communication
  ❒ From one process to another
• Collective communication
  ❒ Involves groups of processes
  ❒ e.g., broadcast

Message Passing: Send()

• Send(<what>, <where-to>, <how>)
• What:
  ❒ A data structure or object in …
  ❒ A buffer allocated from special memory
  ❒ A word or …
• Where-to:
  ❒ A specific processor
  ❒ A set of specific processors
  ❒ A queue, dispatcher, scheduler
• How:
  ❒ Asynchronously vs. synchronously
  ❒ Typed
  ❒ In-order vs. out-of-order
  ❒ Prioritized

Message Passing: Receive()

• Receive(<data>, <info>, <what>, <how>)
• Data: mechanism to return message content
  ❒ A buffer allocated in the user process
  ❒ Memory allocated elsewhere
• Info: meta-info about the message
  ❒ Sender-ID
  ❒ Type, Size, Priority
  ❒ Flow control information
• What: receive only certain messages
  ❒ Sender-ID, Type, Priority
• How:
  ❒ Blocking vs. non-blocking

Synchronous vs Asynchronous

• Synchronous Send
  ❒ Stall until message has actually been received
  ❒ Implies a message acknowledgement from receiver to sender
• Synchronous Receive
  ❒ Stall until message has actually been received
• Asynchronous Send and Receive
  ❒ Sender and receiver can proceed regardless
  ❒ Returns a request handle that can be tested to see whether the message has been sent/received


• Blocking communications may deadlock

Process 0:                      Process 1:
Send(Process1, Message);        Send(Process0, Message);
Receive(Process1, Message);     Receive(Process0, Message);

• Requires careful (safe) ordering of sends/receives

Process 0:                      Process 1:
Send(Process1, Message);        Receive(Process0, Message);
Receive(Process1, Message);     Send(Process0, Message);
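One standard way out of this deadlock is nonblocking communication; a minimal sketch using MPI’s MPI_Isend/MPI_Irecv (the exchange function and its arguments are illustrative):

#include <mpi.h>

/* Both ranks call exchange() naming the other as peer. Because the
 * send and receive are only posted (not completed) before the wait,
 * neither rank stalls waiting for the other to post first. */
void exchange(int peer, int *sendbuf, int *recvbuf, int n) {
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, n, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* both complete here */
}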

Message Passing Paradigm Summary

Programming Model (Software) point of view:
• Disjoint, separate name spaces
• “Shared nothing”
• Communication via explicit, typed messages: send & receive

Message Passing Paradigm Summary

Computer Engineering (Hardware) point of view:
• Treat inter-process communication as I/O device
• Critical issues:
  ❒ How to optimize API overhead
  ❒ Minimize communication latency
  ❒ Buffer management: how to deal with early/unsolicited messages, message typing, high-level flow control
  ❒ Event signaling & synchronization
  ❒ Support for common functions (barrier synchronization, task distribution, scatter/gather, data structure maintenance)

Shared Memory Programming Model

Shared-Memory Model

❒ Multiple execution contexts sharing a single address space
  ❍ Multiple programs (MIMD)
  ❍ Or more frequently: multiple copies of one program (SPMD)
❒ Implicit (automatic) communication via loads and stores
❒ Theoretical foundation: PRAM model

[Figure: processors P1–P4 connected to a single Memory System]

Global Shared Physical Address Space

[Figure: per-process virtual address spaces (P0…Pn) map into a common physical address space; loads and stores to the shared portion are visible to all processes, and each process also keeps a private portion]

• Communication, sharing, synchronization via loads/stores to shared variables
• Facilities for address translation between local/global address spaces
• Requires OS support to maintain this mapping

Why Shared Memory?

Pluses
  ❒ For applications, looks like multitasking uniprocessor
  ❒ For OS, only evolutionary extensions required
  ❒ Easy to do communication without OS
  ❒ Software can worry about correctness first, then performance
Minuses
  ❒ Proper synchronization is complex
  ❒ Communication is implicit, so harder to optimize
  ❒ Hardware designers must implement
Result
  ❒ Traditionally bus-based Symmetric Multiprocessors (SMPs), and now CMPs, are the most successful parallel machines ever
  ❒ And the first with multi-billion-dollar markets

Thread-Level Parallelism

struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id, amt;

if (accts[id].bal >= amt) {
    accts[id].bal -= amt;
    spew_cash();
}

Compiled code:
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash

• Thread-level parallelism (TLP)
  ❒ Collection of asynchronous tasks: not started and stopped together
  ❒ Data shared loosely, dynamically
• Example: database/web server (each query is a thread)
  ❒ accts is shared, can’t register allocate even if it were scalar
  ❒ id and amt are private variables, register allocated to r1, r2

Synchronization

• Mutual exclusion: locks, …
• Order: barriers, signal-wait, …

• Implemented using read/write/modify to shared location
  ❒ Language-level:
    ❍ libraries (e.g., locks in pthread; see the sketch below)
    ❍ Programmers can write custom synchronizations
  ❒ Hardware ISA
    ❍ E.g., test-and-set

• OS provides support for managing threads
  ❒ fork, join, futex signal/wait

We’ll cover synchronization in more detail in a few weeks
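Tying this to the earlier account example, a minimal mutual-exclusion sketch with pthread locks; MAX_ACCT and spew_cash() are assumed from the slide, and the per-account lock is an illustrative design choice (finer-grained than one global lock):

#include <pthread.h>

#define MAX_ACCT 1024                    /* assumption: size not given */
extern void spew_cash(void);             /* assumed from the slide */

struct acct_t { int bal; pthread_mutex_t lock; };
struct acct_t accts[MAX_ACCT];           /* init each lock with pthread_mutex_init */

void withdraw(int id, int amt) {
    pthread_mutex_lock(&accts[id].lock);    /* enter critical section */
    if (accts[id].bal >= amt) {
        accts[id].bal -= amt;               /* read-modify-write is now atomic */
        spew_cash();
    }
    pthread_mutex_unlock(&accts[id].lock);  /* leave critical section */
}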

Paired vs. Separate Processor/Memory?

• Separate processor/memory
  ❒ Uniform memory access (UMA): equal latency to all memory
    + Simple software, doesn’t matter where you put data
    – Lower peak performance
  ❒ Bus-based UMAs common: symmetric multi-processors (SMP)
• Paired processor/memory
  ❒ Non-uniform memory access (NUMA): faster to local memory
    – More complex software: where you put data matters
    + Higher peak performance: assuming proper data placement

[Figure: left, four CPU($) sharing a bus to four memories (separate); right, four paired CPU($) + Mem nodes with routers (NUMA)]

Shared vs. Point-to-Point Networks

• Shared network: e.g., bus (left)
  + Low latency
  – Low bandwidth: doesn’t scale beyond ~16 processors
  + Shared property simplifies cache coherence protocols (later)
• Point-to-point network: e.g., mesh or ring (right)
  – Longer latency: may need multiple “hops” to communicate
  + Higher bandwidth: scales to 1000s of processors
  – Coherence protocols are complex

[Figure: left, CPU($) and memories on a shared bus; right, CPU($) + Mem + router nodes connected point-to-point in a ring]

Implementation #1: Snooping Bus MP

[Figure: four CPU($) on a shared bus, with four memories below]

• Two basic implementations
• Bus-based systems
  ❒ Typically small: 2–8 (maybe 16) processors
  ❒ Typically processors split from memories (UMA)
    ❍ Sometimes multiple processors on single chip (CMP)
    ❍ Symmetric multiprocessors (SMPs)
    ❍ Common, I use one every day

Implementation #2: Scalable MP

[Figure: four CPU($) + Mem + router nodes connected point-to-point]

• General point-to-point network-based systems
  ❒ Typically processor/memory/router blocks (NUMA)
    ❍ Glueless MP: no need for additional “glue” chips
  ❒ Can be arbitrarily large: 1000s of processors
    ❍ Massively parallel processors (MPPs)
  ❒ In reality only government (DoD) has MPPs…
    ❍ Companies have much smaller systems: 32–64 processors
    ❍ Scalable multi-processors

Cache Coherence

Both processors run the withdrawal code (cache/memory state shown as CPU0, CPU1, Mem on the next slides):

Processor 0:            Processor 1:
0: addi r1,accts,r3     0: addi r1,accts,r3
1: ld 0(r3),r4          1: ld 0(r3),r4
2: blt r4,r2,6          2: blt r4,r2,6
3: sub r4,r2,r4         3: sub r4,r2,r4
4: st r4,0(r3)          4: st r4,0(r3)
5: call spew_cash       5: call spew_cash

• Two $100 withdrawals from account #241 at two ATMs
  ❒ Each transaction maps to thread on different processor
  ❒ Track accts[241].bal (address is in r3)

No-Cache, No-Problem

Processor 0 runs first (memory value at right):   Mem
0: addi r1,accts,r3                               500
1: ld 0(r3),r4                                    500
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)                                    400
5: call spew_cash
Then Processor 1:
0: addi r1,accts,r3                               400
1: ld 0(r3),r4                                    400
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)                                    300
5: call spew_cash

• Scenario I: processors have no caches
  ❒ No problem

Cache Incoherence

Processor 0 runs first:        P0$     P1$     Mem
0: addi r1,accts,r3                            500
1: ld 0(r3),r4                 V:500           500
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)                 D:400           500
5: call spew_cash
Then Processor 1:
0: addi r1,accts,r3            D:400           500
1: ld 0(r3),r4                 D:400   V:500   500
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)                 D:400   D:400   500
5: call spew_cash

• Scenario II: processors have write-back caches
  ❒ Potentially 3 copies of accts[241].bal: memory, P0$, P1$
  ❒ Can get incoherent (inconsistent)

Snooping Cache-Coherence Protocols

Bus provides serialization point

Each cache controller “snoops” all bus transactions
  ❒ takes action to ensure coherence
    ❍ invalidate
    ❍ update
    ❍ supply value
  ❒ action depends on the state of the block and the protocol (sketch below)
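As a concrete illustration (not from the slides), a simplified MSI-style snooper can be sketched as a state-transition function; real protocols add more states and bus actions:

typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

/* Reaction of a snooping controller when it observes another
 * processor's bus transaction for a block it holds (simplified MSI). */
line_state_t snoop(line_state_t state, int bus_read, int bus_write) {
    if (state == MODIFIED) {
        if (bus_read)  return SHARED;    /* supply dirty value, downgrade */
        if (bus_write) return INVALID;   /* supply value, then invalidate */
    } else if (state == SHARED) {
        if (bus_write) return INVALID;   /* another core wants exclusivity */
    }
    return state;                        /* otherwise: no action needed */
}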

Scalable Cache Coherence

• Scalable cache coherence: two part solution

• Part I: bus bandwidth ❒ Replace non-scalable bandwidth substrate (bus)… ❒ …with scalable bandwidth one (point-to-point network, e.g., mesh)

• Part II: processor snooping bandwidth ❒ Interesting: most snoops result in no action ❒ Replace non-scalable broadcast protocol (spam everyone)… ❒ …with scalable directory protocol (only spam processors that care)

• We will cover this in Unit 3

Shared Memory Summary

• Shared-memory multiprocessors
  + Simple software: easy data sharing, handles both DLP and TLP
  – Complex hardware: must provide illusion of global address space
• Two basic implementations
  ❒ Symmetric (UMA) multi-processors (SMPs)
    ❍ Underlying communication network: bus (ordered)
    + Low-latency, simple protocols that rely on global order
    – Low-bandwidth, poor scalability
  ❒ Scalable (NUMA) multi-processors (MPPs)
    ❍ Underlying communication network: point-to-point (unordered)
    + Scalable bandwidth
    – Higher-latency, complex protocols
