
CS 61C: Great Ideas in Computer Architecture (Machine Structures)
Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/fa10

Agenda
• Kinds of Parallelism
• Administrivia
• Technology Break
• Amdahl's Law

10/8/10 Fall 2010 -- Lecture #17

Alternative Kinds of Parallelism: The Programming Viewpoint
• Job-level parallelism/process-level parallelism
  – Running independent programs on multiple processors simultaneously
  – Example?
• Parallel processing program
  – Single program that runs on multiple processors simultaneously
  – Example?


Alternative Kinds of Parallelism: Hardware vs. Software
• Concurrent software can also run on serial hardware
• Sequential software can also run on parallel hardware
• Focus is on parallel processing software: sequential or concurrent software running on parallel hardware

Alternative Kinds of Parallelism: The Processor Viewpoint
• Multicore microprocessor
  – Microprocessor containing multiple processors ("cores") in a single IC, with shared memory
• Cluster: set of computers interconnected across a local area network
  – Like a multiprocessor, but with no shared memory
• Our "warehouse-scale" computers are multicore clusters!



Alternative Kinds of Parallelism: Shared Memory Multiprocessor
[Diagram: processors connected through network interfaces (NI) to a shared memory]

Alternative Kinds of Parallelism: Computer Cluster
[Diagram: independent nodes, each with its own network interface (NI), connected by a Local Area Network (LAN) or cluster interconnect; no shared memory]


Alternative Kinds of Parallelism: Single Instruction/Single Data Stream
• Single Instruction, Single Data stream (SISD)
  – Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
[Diagram: a single Processing Unit]

Alternative Kinds of Parallelism: Multiple Instruction/Single Data Stream
• Multiple Instruction, Single Data streams (MISD)
  – Computer that exploits multiple instruction streams against a single data stream for data operations that can be naturally parallelized. For example, certain kinds of array processors.
  – No longer commonly encountered, mainly of historical interest only

Alternative Kinds of Parallelism: Single Instruction/Multiple Data Stream
• Single Instruction, Multiple Data streams (SIMD)
  – Computer that exploits multiple data streams against a single instruction stream to perform operations that may be naturally parallelized, e.g., an array processor or Graphics Processing Unit (GPU)

Alternative Kinds of Parallelism: Multiple Instruction/Multiple Data Streams
• Multiple Instruction, Multiple Data streams (MIMD)
  – Multiple autonomous processors simultaneously executing different instructions on different data. Clusters/distributed systems are generally recognized to be MIMD architectures, either exploiting a single shared memory space or a distributed memory space



Flynn Taxonomy
• In 2010, SIMD and MIMD most commonly encountered
• Most common parallel processing programming style: Single Program Multiple Data
  – Single program that runs on all processors of an MIMD
  – Cross-processor execution coordination through conditional expressions (save for Dave and parallelism)
• SIMD (aka hw-level data parallelism): specialized function units for handling lock-step calculations involving arrays
  – Scientific computing, signal processing, multimedia (audio/video processing)
• More on SIMD on Monday

SIMD Architectures
• Data parallelism: executing one operation on multiple data streams
• Example to provide context:
  – Multiplying a coefficient vector by a data vector (e.g., in filtering)
    y[i] := c[i] × x[i], 0 ≤ i < n
• Sources of performance improvement:
  – One instruction is fetched & decoded for the entire operation
  – Multiplications are known to be independent
  – Pipelining/concurrency in memory access as well

Agenda
• Kinds of Parallelism
• Administrivia
• Technology Break
• Amdahl's Law

Midterm Results: Scores
• total: min 13, max 84, median 65, mean 63.1, st. dev. 14.4
• 1/4 got >= 75 out of 85
• 1/2 got >= 65 out of 85
• 2/3 >= 60
• 3/4 >= 55
• 80% >= 50
• 90% >= 45


Midterm Re-Grades: The Rules
• Written grading appeals only!
• Attend Tuesday's discussion to learn about exam solutions and grading scheme
• If you feel your exam was incorrectly graded, explain why and attach a legible write-up to the examination booklet
• You have ONE week from Tuesday to hand it in in class, lab, or discussion
• We want to be fair, and correct mistakes, but are unlikely to change partial credit as assigned
• NOTE: We reserve the right to re-examine the whole examination should you turn it in for a re-grade.

Course Kicks Into High Gear
• This week's EC2 Lab
• Next week's SIMD Lab
• EC2 Project #2 (Due Saturday, 10/23, 1 Second to Midnight)
• Following week's Thread Parallelism Lab



Agenda
• Kinds of Parallelism
• Administrivia
• Technology Break
• Amdahl's Law


Big Idea: Amdahl's Law
• Speedup due to enhancement E is:
    Speedup w/ E = Exec time w/o E / Exec time w/ E
• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1), and the remainder of the task is unaffected. Then:
    Execution Time w/ E = Execution Time w/o E × [ (1-F) + F/S ]
    Speedup w/ E = 1 / [ (1-F) + F/S ]
• Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall?

Big Idea: Amdahl's Law

Speedup = 1 / [ (1-F) + F/S ]
  where (1-F) is the non-speed-up part and F/S is the speed-up part

• If the portion of the program that can be parallelized is small, then the speedup is limited
• The non-parallel portion limits the performance

Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall?

Speedup = 1 / (0.5 + 0.5/2) = 1 / (0.5 + 0.25) = 1.33


Example #1: Amdahl's Law
Speedup w/ E = 1 / [ (1-F) + F/S ]
• Consider an enhancement which runs 20 times faster but which is only usable 25% of the time
    Speedup w/ E = 1/(.75 + .25/20) = 1.31
• What if it's usable only 15% of the time?
    Speedup w/ E = 1/(.85 + .15/20) = 1.17
• Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
• To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less
    Speedup w/ E = 1/(.001 + .999/100) = 90.99

Parallel Speed-up Example
[Diagram: the sum Z0 + Z1 + … + Z10 (non-parallel part), and two 10×10 matrices X (X1,1 … X10,10) and Y (Y1,1 … Y10,10) whose element-wise work is partitioned 10 ways and performed on 10 parallel processing units (parallel part)]
• 10 "scalar" operations (non-parallelizable)
• 100 parallelizable operations
• 110 operations total

Example #2: Amdahl's Law
Speedup w/ E = 1 / [ (1-F) + F/S ]
• Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 processors
    Speedup w/ E = 1/(.091 + .909/10) = 1/0.1819 = 5.5
• What if there are 100 processors?
    Speedup w/ E = 1/(.091 + .909/100) = 1/0.10009 = 10.0
• What if the matrices are 100 by 100 (or 10,010 adds in total) on 10 processors?
    Speedup w/ E = 1/(.001 + .999/10) = 1/0.1009 = 9.9
• What if there are 100 processors?
    Speedup w/ E = 1/(.001 + .999/100) = 1/0.01099 = 91

Scaling
• Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
  – Strong scaling: when speedup can be achieved on a multiprocessor without increasing the size of the problem
  – Weak scaling: when speedup is achieved on a multiprocessor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor. Just a single processor with twice the load of the others cuts the speedup almost in half

Summary

• Flynn Taxonomy of Parallel Architectures
  – SIMD: Single Instruction Multiple Data
  – MIMD: Multiple Instruction Multiple Data
  – SISD: Single Instruction Single Data
  – MISD: Multiple Instruction Single Data
• Amdahl's Law
  – Parallel code speed-up limited by the non-parallelized portion of the code
  – Strong scaling is hard to achieve

