CS650 Computer Architecture Lecture 10 Introduction to Multiprocessors

NJIT Computer Science Dept CS650 Computer Architecture CS650 Computer Architecture Lecture 10 Introduction to Multiprocessors and PC Clustering Andrew Sohn Computer Science Department New Jersey Institute of Technology Lecture 10: Intro to Multiprocessors/Clustering 1/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Key Issues Run PhotoShop on 1 PC or N PCs Programmability • How to program a bunch of PCs viewed as a single logical machine. Performance Scalability - Speedup • Run PhotoShop on 1 PC (forget the specs of this PC) • Run PhotoShop on N PCs • Will it run faster on N PCs? Speedup = ? Lecture 10: Intro to Multiprocessors/Clustering 2/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Types of Multiprocessors Key: Data and Instruction Single Instruction Single Data (SISD) • Intel processors, AMD processors Single Instruction Multiple Data (SIMD) • Array processor • Pentium MMX feature Multiple Instruction Single Data (MISD) • Systolic array • Special purpose machines Multiple Instruction Multiple Data (MIMD) • Typical multiprocessors (Sun, SGI, Cray,...) Single Program Multiple Data (SPMD) • Programming model Lecture 10: Intro to Multiprocessors/Clustering 3/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Shared-Memory Multiprocessor Processor Prcessor Prcessor Interconnection network Main Memory Storage I/O Lecture 10: Intro to Multiprocessors/Clustering 4/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Distributed-Memory Multiprocessor Processor Processor Processor IO/S MM IO/S MM IO/S MM Interconnection network IO/S MM IO/S MM IO/S MM Processor Processor Processor Lecture 10: Intro to Multiprocessors/Clustering 5/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Key Issues Programmability • Intel Pentium, AMD machines • Performance (Scalability) - Speedup • Run PhotoShop on 1 machine, or n machines • Execution time on 1 machine = x • Execution time on n machines = y • Speedup = x/y = the best speedup = n (often higher due to vari- ous reasons including cache behavior) Lecture 10: Intro to Multiprocessors/Clustering 6/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Parallelization of Applications Parallel portion 80% Serial portion 20% Lecture 10: Intro to Multiprocessors/Clustering 7/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Parallelization for 4 Processors Consider a loop with 100 iterations: for (i=0;i<100;i++) a[i] = b[i] + c[i]; The loop has no loop-carried dependence, straight forward to parallelize the code for 4 processors. Each processor executes 1/4 of the loop, i.e. 25 iterations as fol- lows: P0: for (i=0;i<25;i++) a[i] = b[i] + c[i]; P1: for (i=25;i<50;i++) a[i] = b[i] + c[i]; P2: for (i=50;i<75;i++) a[i] = b[i] + c[i]; P3: for (i=75;i<100;i++) a[i] = b[i] + c[i]; Lecture 10: Intro to Multiprocessors/Clustering 8/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Shared-Memory Multiprocessor Processor Prcessor Prcessor Interconnection network for (i=0;i<100;i++) a[i] = b[i] + c[i]; Storage I/O Main Memory Lecture 10: Intro to Multiprocessors/Clustering 9/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Distributed-Memory Multiprocessor Processor Processor for (i=0;i<25;i++) for (i=25;i<50;i++) IO/S a[i] = b[i] + c[i]; IO/S a[i] = b[i] + c[i]; Interconnection network for (i=50;i<75;i++) for (i=75;i<100;i++) IO/S a[i] = b[i] + c[i]; IO/S a[i] = b[i] + c[i]; Processor Processor Lecture 10: Intro to Multiprocessors/Clustering 10/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Amdahl’s Law Maximum speedup is limited by the serial fraction of a program • N = the number of processors, • s = the time spent by a processor on serial part of a program, • p = the time spent by a processor on parallel part of a program • t = total time = s + p = 1 Maximum possible speedup is given by: sp+ 1 1 D ==------------ ---------------------- =------------------------------- p p 1 s + ---- 1 – p + ---- 1 – p⎛⎞1 – ---- N N ⎝⎠N 1 1 – ---- D p = ------------- 1 1 – ---- N Lecture 10: Intro to Multiprocessors/Clustering 11/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Amdahl’s Law - Example N = 10 PCs, the speedup to achieve D = 8, the portion of program to be parallelized will be 1 1 1 – ---- 1 – --- D 8 0.875 p ==------------- --------------- ==------------- 0.972 1 1 0.9 1 – ---- 1 – ------ N 10 What do these numbers mean? want to achieve 8 fold speedup on 10 PCs --> 97.2% of the entire code has to be parallelized! Lecture 10: Intro to Multiprocessors/Clustering 12/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Amdahl’s Law - Example main() { ......... Parallel portion 97.2% ......... ......... ......... ......... ......... ......... ......... ......... ......... ......... ......... Serial portion 2.8% ......... } Lecture 10: Intro to Multiprocessors/Clustering 13/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Clustering of PCs • A Poor Man’s Supercomputer • Linux box clustering • MS Windows clustering • For scalability and high availability • Mail server clustering: MS Exchange • DB server clustering - MS SQL server, MySQL sever, ... • VPN box clustering • Firewall clustering • Filtering clustering • Web clustering Lecture 10: Intro to Multiprocessors/Clustering 14/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Data Center Infrastructure Chapter 6 of our textbook Lecture 10: Intro to Multiprocessors/Clustering 15/15 12/7/2003 A. Sohn.

CS650 Computer Architecture Lecture 10 Introduction to Multiprocessors

2.5 Classification of Parallel Computers

Balanced Multithreading: Increasing Throughput Via a Low Cost Multithreading Hierarchy

Amdahl's Law Threading, Openmp

Parallel Generation of Image Layers Constructed by Edge Detection

High-Performance Message Passing Over Generic Ethernet Hardware with Open-MX Brice Goglin

An Investigation of Symmetric Multi-Threading Parallelism for Scientific Applications

What Is SPMD? Messages

Exploiting Automatic Vectorization to Employ SPMD on SIMD Registers

Polynomial-Time Algorithms for Enforcing Sequential Consistency in SPMD Programs with Arrays

Computer Systems Architecture

Data Management and Control-Flow Aspects of an SIMD/SPMD Parallel Language/Compiler

Computer Architecture: Parallel Processing Basics