
CS 61C: Great Ideas in Computer Architecture (Machine Structures)
Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/fa10

Agenda
• Kinds of Parallelism
• Administrivia
• Technology Break
• Amdahl's Law

10/8/10 Fall 2010 -- Lecture #17

Alternative Kinds of Parallelism: The Programming Viewpoint
• Job-level parallelism/process-level parallelism
  – Running independent programs on multiple processors simultaneously
  – Example?
• Parallel processing program
  – Single program that runs on multiple processors simultaneously
  – Example?


Alternative Kinds of Parallelism: Hardware vs. Software
• Concurrent software can also run on serial hardware
• Sequential software can also run on parallel hardware
• Focus is on parallel processing software: sequential or concurrent software running on parallel hardware

Alternative Kinds of Parallelism: The Processor Viewpoint
• Multicore microprocessor
  – Microprocessor containing multiple processors ("cores") in a single IC, with shared memory
• Cluster: set of computers interconnected across a local area network
  – Like a multiprocessor, but with no shared memory
• Our "warehouse-scale" computers are multicore clusters!



Alternative Kinds of Parallelism: Shared Memory Multiprocessor
[Diagram: processors connected through network interfaces (NI) to a shared memory]

Alternative Kinds of Parallelism: Computer Cluster
[Diagram: independent nodes, each with its own network interface (NI), connected by a Local Area Network (LAN) or cluster interconnect; no shared memory]


Alternative Kinds of Parallelism: Single Instruction/Single Data Stream
• Single Instruction, Single Data stream (SISD)
  – Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
[Diagram: a single Processing Unit]

Alternative Kinds of Parallelism: Multiple Instruction/Single Data Stream
• Multiple Instruction, Single Data streams (MISD)
  – Computer that exploits multiple instruction streams against a single data stream for data operations that can be naturally parallelized. For example, certain kinds of array processors.
  – No longer commonly encountered, mainly of historical interest only

Alternative Kinds of Parallelism: Single Instruction/Multiple Data Stream
• Single Instruction, Multiple Data streams (SIMD)
  – Computer that exploits multiple data streams against a single instruction stream to perform operations that may be naturally parallelized, e.g., an array processor or Graphics Processing Unit (GPU)

Alternative Kinds of Parallelism: Multiple Instruction/Multiple Data Streams
• Multiple Instruction, Multiple Data streams (MIMD)
  – Multiple autonomous processors simultaneously executing different instructions on different data. Clusters/distributed systems are generally recognized to be MIMD architectures, either exploiting a single shared memory space or a distributed memory space



Flynn Taxonomy
• In 2010, SIMD and MIMD most commonly encountered
• Most common parallel processing programming style: Single Program Multiple Data
  – Single program that runs on all processors of an MIMD
  – Cross-processor execution coordination through conditional expressions (save for Dave and parallelism)
• SIMD (aka hw-level data parallelism): specialized function units for handling lock-step calculations involving arrays
  – Scientific computing, signal processing, multimedia (audio/video processing)
• More on SIMD on Monday

SIMD Architectures
• Data parallelism: executing one operation on multiple data streams
• Example to provide context:
  – Multiplying a coefficient vector by a data vector (e.g., in filtering)
    y[i] := c[i] × x[i], 0 ≤ i < n
• Sources of performance improvement:
  – One instruction is fetched & decoded for the entire operation
  – Multiplications are known to be independent
  – Pipelining/concurrency in memory access as well

Agenda
• Kinds of Parallelism
• Administrivia
• Technology Break
• Amdahl's Law

Midterm Results: Scores
• total: min 13, max 84, median 65, mean 63.1, st. dev. 14.4
• 1/4 got >= 75 out of 85
• 1/2 got >= 65 out of 85
• 2/3 >= 60
• 3/4 >= 55
• 80% >= 50
• 90% >= 45


Midterm Re-Grades: The Rules
• Written grading appeals only!
• Attend Tuesday's discussion to learn about exam solutions and grading scheme
• If you feel your exam was incorrectly graded, explain why and attach a legible write-up to the examination booklet
• You have ONE week from Tuesday to hand it in in class, lab, or discussion
• We want to be fair, and correct mistakes, but are unlikely to change partial credit as assigned
• NOTE: We reserve the right to re-examine the whole examination should you turn it in for a re-grade.

Course Kicks Into High Gear
• This week's EC2 Lab
• Next week's SIMD Lab
• EC2 Project #2 (Due Saturday, 10/23, 1 Second to Midnight)
• Following week's Thread Parallelism Lab



Agenda
• Kinds of Parallelism
• Administrivia
• Technology Break
• Amdahl's Law


Big Idea: Amdahl's Law
• Speedup due to enhancement E is:
    Speedup w/ E = Exec time w/o E / Exec time w/ E
• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1), and the remainder of the task is unaffected. Then:
    Execution Time w/ E = Execution Time w/o E × [ (1-F) + F/S ]
    Speedup w/ E = 1 / [ (1-F) + F/S ]
• Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall?

Big Idea: Amdahl's Law

Speedup = 1 / [ (1-F) + F/S ]
  where (1-F) is the non-speed-up part and F/S is the speed-up part

• If the portion of the program that can be parallelized is small, then the speedup is limited
• The non-parallel portion limits the performance

Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall?

Speedup = 1 / (0.5 + 0.5/2) = 1 / (0.5 + 0.25) = 1.33


Example #1: Amdahl's Law
Speedup w/ E = 1 / [ (1-F) + F/S ]
• Consider an enhancement which runs 20 times faster but which is only usable 25% of the time
    Speedup w/ E = 1/(.75 + .25/20) = 1.31
• What if it's usable only 15% of the time?
    Speedup w/ E = 1/(.85 + .15/20) = 1.17
• Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
• To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less
    Speedup w/ E = 1/(.001 + .999/100) = 90.99

Parallel Speed-up Example
[Diagram: the sum Z0 + Z1 + … + Z10 (non-parallel part), and two 10×10 matrices X (X1,1 … X10,10) and Y (Y1,1 … Y10,10) whose element-wise work is partitioned 10 ways and performed on 10 parallel processing units (parallel part)]
• 10 "scalar" operations (non-parallelizable)
• 100 parallelizable operations
• 110 operations total

Example #2: Amdahl's Law
Speedup w/ E = 1 / [ (1-F) + F/S ]
• Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 processors
    Speedup w/ E = 1/(.091 + .909/10) = 1/0.1819 = 5.5
• What if there are 100 processors?
    Speedup w/ E = 1/(.091 + .909/100) = 1/0.10009 = 10.0
• What if the matrices are 100 by 100 (or 10,010 adds in total) on 10 processors?
    Speedup w/ E = 1/(.001 + .999/10) = 1/0.1009 = 9.9
• What if there are 100 processors?
    Speedup w/ E = 1/(.001 + .999/100) = 1/0.01099 = 91

Scaling
• Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
  – Strong scaling: when speedup can be achieved on a multiprocessor without increasing the size of the problem
  – Weak scaling: when speedup is achieved on a multiprocessor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor. Just a single processor with twice the load of the others cuts the speedup almost in half

Summary

• Flynn Taxonomy of Parallel Architectures
  – SIMD: Single Instruction Multiple Data
  – MIMD: Multiple Instruction Multiple Data
  – SISD: Single Instruction Single Data
  – MISD: Multiple Instruction Single Data
• Amdahl's Law
  – Parallel code speed-up limited by the non-parallelized portion of the code
  – Strong scaling is hard to achieve

