
Lecture 28: Introduction to Parallel Processing and Some Architectural Ramifications


Multiprocessing: Flynn's Taxonomy

• Flynn's Taxonomy of Parallel Machines
  – How many Instruction streams?
  – How many Data streams?
  – (note: we'll spend most time talking about just 1 class…)
• SISD: Single I Stream, Single D Stream
  – A uniprocessor
• SIMD: Single I, Multiple D Streams
  – Each "processor" works on its own data
    • Not superscalar
  – But all execute the same instrs in lockstep
  – E.g., MMX
• MISD: Multiple I, Single D Stream
  – Not used much
• MIMD: Multiple I, Multiple D Streams
  – Each executes its own instructions and operates on its own data
  – This is your typical off-the-shelf multiprocessor (made using a bunch of "normal" processors)
    • Each node is superscalar
  – Lessons will apply to multi-core too!
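To make the two classes we care about most concrete, here is a minimal sketch (my own illustration, not from the slides): the first loop is SIMD-flavored, one instruction stream applied across many data elements, which a compiler can map onto MMX/SSE-style vector units; the second is MIMD-flavored, each thread running its own instruction stream on its own slice of the data. It assumes a C compiler with OpenMP support (e.g., gcc -fopenmp).

/* SIMD vs. MIMD, sketched in C with OpenMP (assumes -fopenmp). */
#include <stdio.h>

#define N 8

int main(void) {
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    /* SIMD flavor: one instruction stream, many data elements in lockstep.
     * The compiler can map this loop onto vector (MMX/SSE-style) units. */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* MIMD flavor: each thread runs its own instruction stream on its own
     * slice of the data, like the nodes of an off-the-shelf multiprocessor. */
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += c[i];

    printf("sum = %ld\n", sum);   /* 3*(0+1+...+7) = 84 */
    return 0;
}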


Or…in pictures!

[Diagrams: uniprocessor (P–M), pipelined, superscalar, and VLIW/"EPIC" organizations on one side; SMP ("Symmetric", also "CSM") and distributed (processor–memory nodes connected by a network) organizations on the other.]

Multiprocessors

• Why do we need multiprocessors?
  – Uniprocessor speed improving fast
  – But there are things that need even more speed
    • Wait for a few years for Moore's law to catch up?
    • Or use multiple processors and do it now?
    • (Is Moore's Law still catching up? M/C?)
• Multiprocessor problem
  – Most code is sequential (for uniprocessors)
    • MUCH easier to write and debug
  – Correct parallel code very, very difficult to write
    • Efficient and correct is much more difficult
    • Debugging even more difficult

Let's look at a few MIMD example configurations…

MIMD Multiprocessors

• Centralized shared memory (note: just 1 memory)
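As a software-level illustration of the centralized shared-memory organization, here is a minimal sketch (not from the slides): two POSIX threads write disjoint halves of one array that lives in the single shared memory, so no explicit data movement is needed. Compile with gcc -pthread.

/* Centralized shared memory: all threads see shared_data directly. */
#include <pthread.h>
#include <stdio.h>

#define N 8
int shared_data[N];                    /* the one shared memory */

static void *fill_half(void *arg) {
    long t = (long)arg;                /* thread id: 0 or 1 */
    for (int i = (int)t * N / 2; i < ((int)t + 1) * N / 2; i++)
        shared_data[i] = i * i;        /* write straight into shared memory */
    return NULL;
}

int main(void) {
    pthread_t p0, p1;
    pthread_create(&p0, NULL, fill_half, (void *)0L);
    pthread_create(&p1, NULL, fill_half, (void *)1L);
    pthread_join(p0, NULL);
    pthread_join(p1, NULL);
    for (int i = 0; i < N; i++)
        printf("%d ", shared_data[i]); /* both halves visible to main */
    printf("\n");
    return 0;
}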


MIMD Multiprocessors

• Distributed memory: multiple, distributed memories here.
• Before, we did parallel processing by chaining together separate processors. Now we can do it on the same chip.
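For contrast with the shared-memory sketch above, here is a minimal distributed-memory sketch (again my own illustration, not from the slides): each process owns its own private memory, so data moves only when it is sent explicitly. It assumes an MPI installation (compile with mpicc, run with mpirun -np 2).

/* Distributed memory: private memories, explicit messages. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                  /* lives only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);       /* now a separate copy here */
    }

    MPI_Finalize();
    return 0;
}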


Ok, after all of that, what does parallel processing really do for performance?

Speedup

• Metric for performance on latency-sensitive applications
• Speedup = Time(1) / Time(P) for P processors
  – note: must use the best sequential algorithm for Time(1) -- the parallel algorithm may be different
• [Plot: speedup vs. # of processors (1, 2, 4, 8, 16, 32, 64): "linear" speedup is the ideal; typical curves roll off past some # of processors; occasionally see "superlinear" speedup… why?]
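A tiny sketch of applying this metric. The timings below are made-up placeholders (not measurements from any real machine), chosen to show the typical roll-off rather than the ideal linear curve.

/* Speedup = Time(1) / Time(P), with hypothetical timings. */
#include <stdio.h>

int main(void) {
    double time_1   = 64.0;                                     /* best sequential time (s) */
    int    procs[]  = {1, 2, 4, 8, 16, 32, 64};
    double time_p[] = {64.0, 33.0, 17.5, 9.6, 5.9, 4.2, 3.8};   /* made-up parallel times   */

    for (int i = 0; i < 7; i++)
        printf("P = %2d  speedup = %5.1f\n", procs[i], time_1 / time_p[i]);
    return 0;
}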


Parallel Performance

• Serial sections
  – Very difficult to parallelize the entire app
  – Amdahl's law:

    SpeedupOverall = 1 / ((1 - FParallel) + FParallel / SpeedupParallel)

  – With SpeedupParallel = 1024:
    • FParallel = 0.5 gives SpeedupOverall = 1.998
    • FParallel = 0.99 gives SpeedupOverall = 91.2
• Let's look at an example in context of CPI

Parallel Programming

• Parallel software is the problem
• Need to get significant performance improvement
  – Otherwise, just use a faster uniprocessor, since it's easier!
• Difficulties
  – Partitioning
  – Coordination
  – Communications overhead
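As a quick check of the two numbers above, here is the slide's formula evaluated in C; nothing beyond the formula itself is assumed.

/* Amdahl's Law check for SpeedupParallel = 1024. */
#include <stdio.h>

static double amdahl(double f_parallel, double speedup_parallel) {
    /* SpeedupOverall = 1 / ((1 - FParallel) + FParallel / SpeedupParallel) */
    return 1.0 / ((1.0 - f_parallel) + f_parallel / speedup_parallel);
}

int main(void) {
    printf("FParallel = 0.50 -> SpeedupOverall = %.3f\n", amdahl(0.50, 1024));  /* ~1.998 */
    printf("FParallel = 0.99 -> SpeedupOverall = %.1f\n", amdahl(0.99, 1024));  /* ~91.2  */
    return 0;
}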


Amdahl's Law

• Sequential part can limit speedup
• Example: 100 processors, 90× speedup?
  – Tnew = Tparallelizable/100 + Tsequential
  – Speedup = 1 / ((1 - Fparallelizable) + Fparallelizable/100) = 90
  – Solving: Fparallelizable = 0.999
• Need sequential part to be 0.1% of original time

Scaling Example

• Workload: sum of 10 scalars, and 10 × 10 matrix sum
  – Speed up from 10 to 100 processors
• Single processor: Time = (10 + 100) × tadd
• 10 processors
  – Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  – Speedup = 110/20 = 5.5 (55% of potential)
• 100 processors
  – Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  – Speedup = 110/11 = 10 (10% of potential)
• Assumes load can be balanced across processors

Scaling Example (cont)

• What if matrix size is 100 × 100?
• Single processor: Time = (10 + 10000) × tadd
• 10 processors
  – Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
  – Speedup = 10010/1010 = 9.9 (99% of potential)
• 100 processors
  – Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
  – Speedup = 10010/110 = 91 (91% of potential)
• Assuming load balanced

Speedup Challenge

• To get full benefit of parallelism need to be able to parallelize the entire program!
• Amdahl's Law
  – Timeafter = (Timeaffected / Improvement) + Timeunaffected
  – Example: We want 100 times speedup with 100 processors
    • Timeunaffected = 0!!!
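The arithmetic on these two scaling-example slides, collected into one small C check. Times are in units of tadd, the load is assumed perfectly balanced, and only the n × n matrix sum (not the 10 scalar adds) is parallelized, exactly as in the slides.

/* Scaling example: 10 scalars (sequential) + n x n matrix sum (parallel). */
#include <stdio.h>

static double speedup(int n, int p) {
    double time_1 = 10.0 + (double)n * n;       /* single processor (units of tadd) */
    double time_p = 10.0 + (double)n * n / p;   /* p processors, balanced load      */
    return time_1 / time_p;
}

int main(void) {
    int sizes[] = {10, 100}, procs[] = {10, 100};
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            int n = sizes[i], p = procs[j];
            double s = speedup(n, p);
            printf("%3d x %3d matrix, %3d processors: speedup = %5.1f (%2.0f%% of potential)\n",
                   n, n, p, s, 100.0 * s / p);
        }
    return 0;   /* reproduces 5.5 (55%), 10 (10%), 9.9 (99%), 91 (91%) */
}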


Problem

• Shared memory easy with no caches
  – P1 writes, P2 can read
  – Only one copy of data exists (in memory)
• Caches store their own copies of the data
  – Those copies can easily get inconsistent
  – Classical example: adding to a sum
    • P1 loads allSum, adds its mySum, stores new allSum
    • P1's cache now has dirty data, but memory not updated
    • P2 loads allSum from memory, adds its mySum, stores allSum
    • P2's cache also has dirty data
    • Eventually P1 and P2's cached data will go to memory
    • Regardless of write-back order, the final value ends up wrong

Seems like lots of trouble. Why do it? Because we sort of have to…
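A software analogue of the allSum example, as a minimal sketch (not the slides' code): two threads each perform an unsynchronized load–add–store on a shared allSum, so updates can be lost, much like the stale cached copies described above. Compile with gcc -pthread.

/* Lost updates to a shared sum from unsynchronized read-modify-write. */
#include <pthread.h>
#include <stdio.h>

long allSum = 0;                     /* shared */

static void *worker(void *arg) {
    long mySum = (long)arg;          /* each thread's private contribution */
    for (int i = 0; i < 100000; i++) {
        /* unsynchronized load-add-store: the two threads can interleave
         * here and overwrite each other's update */
        allSum = allSum + mySum;
    }
    return NULL;
}

int main(void) {
    pthread_t p1, p2;
    pthread_create(&p1, NULL, worker, (void *)1L);
    pthread_create(&p2, NULL, worker, (void *)1L);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    /* Would be 200000 if no update were ever lost; often prints less. */
    printf("allSum = %ld\n", allSum);
    return 0;
}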


Let’s look back to Lecture 01


There’s another kind of parallelism too.


Multithreading (§7.5 Hardware Multithreading)

• Performing multiple threads of execution in parallel
  – Replicate registers, PC, etc.
  – Fast switching between threads
• Fine-grain multithreading
  – Switch threads after each cycle
  – Interleave instruction execution
  – If one stalls, others are executed
• Coarse-grain multithreading
  – Only switch on long stall (e.g., L2-cache miss)
  – Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)
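To make the fine-grain policy concrete, here is a toy sketch (my own illustration, not from the slides): two hypothetical hardware thread contexts, with the core issuing from whichever thread is ready each cycle, so one thread's stall cycles are mostly hidden by the other. A single-threaded core would simply idle during those stall cycles.

/* Toy simulation of fine-grain multithreading with two thread contexts. */
#include <stdio.h>
#include <stdbool.h>

#define NTHREADS 2
#define NINSTRS  4

int main(void) {
    /* stall_cycles[t][i]: extra cycles instruction i of thread t must wait
     * (e.g., a cache miss); made-up numbers for illustration. */
    int stall_cycles[NTHREADS][NINSTRS] = { {0, 3, 0, 0}, {0, 0, 2, 0} };
    int pc[NTHREADS]         = {0, 0};   /* next instruction per thread        */
    int busy_until[NTHREADS] = {0, 0};   /* cycle when the thread is ready     */
    int done = 0;

    for (int cycle = 0; done < NTHREADS; cycle++) {
        bool issued = false;
        for (int k = 0; k < NTHREADS && !issued; k++) {
            int t = (cycle + k) % NTHREADS;          /* round-robin choice     */
            if (pc[t] < NINSTRS && busy_until[t] <= cycle) {
                printf("cycle %2d: issue T%d.I%d\n", cycle, t, pc[t]);
                busy_until[t] = cycle + 1 + stall_cycles[t][pc[t]];
                if (++pc[t] == NINSTRS) done++;
                issued = true;
            }
        }
        if (!issued) printf("cycle %2d: all threads stalled\n", cycle);
    }
    return 0;
}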
