
Introduction to Parallel Processing
• Parallel Architecture: definition & broad issues involved
  – A Generic Parallel Computer Architecture
• The Need and Feasibility of Parallel Computing: Why?
  – Scientific Supercomputing Trends
  – CPU Performance and Technology Trends, Parallelism in Microprocessor Generations
  – Computer System Peak FLOP Rating History/Near Future
• The Goal of Parallel Processing
• Elements of Parallel Computing
• Factors Affecting Parallel System Performance
• Parallel Architectures History
  – Parallel Programming Models
  – Flynn’s 1972 Classification of Computer Architecture
• Current Trends in Parallel Architectures
  – Modern Parallel Architecture Layered Framework
• Shared Address Space Parallel Architectures
• Message-Passing Multicomputers: Message-Passing Programming Tools
• Data Parallel Systems
• Dataflow Architectures
• Systolic Architectures: Matrix Multiplication Example

PCA Chapter 1.1, 1.2

Parallel Computer Architecture
A parallel computer (or multiple-processor system) is a collection of communicating processing elements (processors) that cooperate to solve large computational problems fast by dividing such problems into parallel tasks, exploiting Thread-Level Parallelism (TLP), i.e. parallel processing.

• Broad issues involved:
  – The concurrency (parallelism) and communication characteristics of parallel algorithms for a given computational problem (represented by dependency graphs).
  – Computing resources and computation allocation:
    • The number of processing elements (PEs), the computing power of each element, and the amount/organization of physical memory used.
    • What portions of the computation and data are allocated or mapped to each PE.
  – Data access, communication and synchronization (ordering):
    • How the processing elements cooperate and communicate.
    • How data is shared/transmitted between processors.
    • Abstractions and primitives for cooperation/communication and synchronization.
    • The characteristics and performance of the parallel system network (system interconnects).
  – Parallel processing performance and scalability goals:
    • Maximize the performance enhancement of parallelism: maximize parallel speedup,
      – by minimizing parallelization overheads and balancing the workload on processors.
    • Scalability of performance to larger systems/problems.
Definitions: Task = computation done on one processor. Processor = programmable computing element that runs stored programs written using a pre-defined instruction set. Processing elements = PEs = processors.

A Generic Parallel Computer Architecture

[Figure: A generic parallel computer architecture, consisting of (1) processing (compute) nodes, each with memory (Mem) and a network interface, AKA communication assist (CA), and (2) the parallel machine network (custom or industry-standard interconnects) connecting the nodes.]

(1) Processing nodes (homogeneous or heterogeneous): each processing node contains one or more processing elements (PEs) or processors (custom or commercial; single or multiple processors per chip, e.g. 2-8 cores per chip), caches ($), a memory system, and a communication assist (CA): network interface and communication controller. Parallel programming environments (custom or industry standard) run on top.
(2) Parallel machine network (system interconnects): the function of a parallel machine network is to efficiently (i.e. at low communication cost) transfer information (data, results, ...) from a source node to a destination node as needed to allow cooperation among parallel processing nodes in solving large computational problems divided into a number of parallel computational tasks.
(Parallel computer = multiple-processor system.)

The Need and Feasibility of Parallel Computing
• Application demands (the driving force): more computing cycles/data memory needed.
  – Scientific/engineering computing: CFD, biology, chemistry, physics, ...
  – General-purpose computing: video, graphics, CAD, databases, transaction processing, gaming, ...
  – Mainstream multithreaded programs are similar to parallel programs.
• Technology trends: Moore’s Law still alive.
  – The number of transistors on a chip is growing rapidly. Clock rates are expected to continue to go up, but only slowly; actual performance returns are diminishing due to deeper pipelines.
  – Increased transistor density allows integrating multiple processor cores per chip, creating Chip-Multiprocessors (CMPs) even for mainstream computing applications (desktop, ...).
• Architecture trends:
  – Instruction-level parallelism (ILP) is valuable (superscalar, VLIW) but limited.
  – Increased clock rates require deeper pipelines with longer latencies and higher CPIs.
  – Coarser-level parallelism (at the task or thread level, TLP), utilized in multiprocessor systems and in multi-tasking (multiple independent programs), is the most viable approach to further improve performance: the main motivation for the development of chip-multiprocessors (CMPs), i.e. multi-core processors.
• Economics:
  – The increased use of commodity off-the-shelf (COTS) components in high-performance parallel computing systems, instead of the costly custom components used in traditional supercomputers, leads to much lower parallel system cost.
    • Today’s microprocessors offer high performance and have multiprocessor support, eliminating the need to design expensive custom PEs.
    • Commercial System Area Networks (SANs) offer an alternative to more costly custom networks.

Why is Parallel Processing Needed? Challenging Applications in Applied Science/Engineering
(The traditional driving force for HPC/parallel processing: such applications have very high 1- computational and 2- data memory requirements that cannot be met with single-processor architectures, and many contain a large degree of computational parallelism.)
• Astrophysics
• Atmospheric and ocean modeling
• Biomolecular simulation
• Computational Fluid Dynamics (CFD)
• Computer vision and image understanding
• Databases and data-intensive computing
• Engineering analysis (CAD/CAM)
• Global climate modeling and forecasting
• Material sciences

• Military applications
• Quantum chemistry
• VLSI design
• ...
(Driving force for High Performance Computing (HPC) and multiple-processor system development.)

Why is Parallel Processing Needed? Scientific Computing Demands
The computational and data memory demands of such applications exceed the capabilities of even the fastest current uniprocessor systems.

(For comparison: a uniprocessor delivers only about 5-16 GFLOPS.)

GFLOP = 10^9 FLOPS; TeraFLOP = 1000 GFLOPS = 10^12 FLOPS; PetaFLOP = 1000 TeraFLOPS = 10^15 FLOPS

Scientific Supercomputing Trends
• A proving ground and driver for innovative architecture and advanced high-performance computing (HPC) techniques:
  – The market is much smaller relative to the commercial (desktop/server) segment.
  – Dominated by costly vector machines starting in the 1970s through the 1980s.
  – Microprocessors (including GPUs) have made huge gains in the performance needed for such applications, enabled by high transistor density per chip:
    • High clock rates (bad: higher CPI?).
    • Multiple pipelined floating-point units.
    • Instruction-level parallelism.
    • Effective use of caches.
    • Multiple processor cores per chip (2 cores 2002-2005, 4 at the end of 2006, 6-12 cores in 2011, 16 cores in 2013, 100s in 2016).
• However, even the fastest current single-microprocessor systems

still cannot meet the needed computational demands (as shown in the previous slide).
• Currently: large-scale microprocessor-based multiprocessor systems and computer clusters are replacing (have replaced?) vector supercomputers that utilize custom processors.

Uniprocessor Performance Evaluation
• CPU performance benchmarking is heavily program-mix dependent.
• Ideal performance requires a perfect machine/program match.
• Performance measures:
  – Total CPU time:
    T = TC / f = TC x C = I x CPI x C = I x (CPI_execution + M x k) x C   (in seconds)
    where TC = total program execution clock cycles, f = clock rate, C = CPU clock cycle time = 1/f, I = instructions executed count, CPI = cycles per instruction, CPI_execution = CPI with ideal memory, M = memory stall cycles per memory access, and k = memory accesses per instruction.
  – MIPS rating = I / (T x 10^6) = f / (CPI x 10^6) = f x I / (TC x 10^6)   (in million instructions per second)
  – Throughput rate: Wp = 1/T = f / (I x CPI) = (MIPS) x 10^6 / I   (in programs per second)
• Performance factors (I, CPI_execution, M, k, C) are influenced by: instruction-set architecture (ISA), compiler design, CPU micro-architecture, implementation and control, cache and memory hierarchy, program access locality, and program instruction mix and instruction dependencies.
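A quick numeric sketch of these formulas in C; the instruction count, CPI, stall and clock values below are made-up illustration numbers, not taken from the slides:

```c
#include <stdio.h>

/* Illustrative use of T = I x CPI x C and the MIPS rating formula.
   All input values are hypothetical, chosen only to show the arithmetic. */
int main(void) {
    double I        = 2.0e9;     /* instructions executed            */
    double cpi_exec = 1.2;       /* CPI with ideal memory            */
    double M        = 0.05;      /* memory stall cycles per access   */
    double k        = 1.3;       /* memory accesses per instruction  */
    double f        = 3.0e9;     /* clock rate (Hz)                  */
    double C        = 1.0 / f;   /* clock cycle time (s)             */

    double cpi  = cpi_exec + M * k;       /* effective CPI            */
    double T    = I * cpi * C;            /* total CPU time (s)       */
    double mips = f / (cpi * 1.0e6);      /* MIPS rating              */

    printf("CPI = %.3f, T = %.3f s, MIPS = %.1f\n", cpi, T, mips);
    return 0;
}
```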

Single CPU Performance Trends
• The microprocessor is currently the most natural building block for multiprocessor systems in terms of cost and performance.
• This is even more true with the development of cost-effective multi-core microprocessors that support TLP at the chip level.

[Figure: Single CPU performance trends, 1965-1995: performance (log scale, 0.1 to 100) vs. year for supercomputers (custom processors), mainframes, minicomputers, and microprocessors (commodity processors); microprocessor performance grows the fastest.]

Microprocessor Frequency Trend
[Figure: Clock frequency (MHz, 10 to 10,000) and gate delays per clock (1 to 100) vs. year (1987-2005) for Intel (386, 486, Pentium, Pentium Pro, Pentium II), IBM PowerPC (601, 603, 604, 604+, MPC750), and DEC Alpha (21064A, 21066, 21164, 21164A, 21264, 21264S) processors. Historically, processor frequency scaled by 2x per generation while the number of gate delays per clock dropped by about 25% per generation; this is no longer the case.]
• Reality check: clock frequency scaling is slowing down. (Did silicon finally hit the wall?) Why?
  1- Static power leakage
  2- Clock distribution delays
• Fewer gate delays per clock leads to deeper pipelines with more stages (e.g. the Intel Pentium 4E has 30+ stages).
• Result: deeper pipelines, longer stalls, higher CPI (lowers the effective performance per cycle).
• Solution: exploit TLP at the chip level with chip-multiprocessors (CMPs).

Transistor Count Growth Rate
(The enabling technology for chip-level thread-level parallelism (TLP).)
• About a 4,000,000x transistor density increase in the last 45 years, from the 4-bit Intel 4004 (2,300 transistors, circa 1970) to currently ~9 billion transistors per chip.
• Moore’s Law (2x transistors/chip every 1.5 years) still holds.
• One billion transistors/chip was reached in 2005, two billion in 2008-9, now ~nine billion.
• Transistor count grows faster than clock rate: currently ~40% per year.
• Single-threaded uniprocessors do not efficiently utilize the increased transistor count (limited ILP, increased cache size).
• Solution: the increased transistor count enables Thread-Level Parallelism (TLP) at the chip level: Chip-Multiprocessors (CMPs) + Simultaneous Multithreaded (SMT) processors.

Parallelism in Microprocessor VLSI Generations
[Figure: Transistors per chip (log scale, 1,000 to 100,000,000) vs. year (1970-2005), from the i4004, i8008, i8080 and i8086 through the i80286, i80386, R2000/R3000, Pentium, and R10000.]
• Bit-level parallelism: single-issue, multi-cycle non-pipelined (CPI >> 1), then pipelined (CPI = 1).
• Instruction-level parallelism (ILP): multiple micro-operations per cycle (AKA operation-level parallelism), superscalar/VLIW, CPI < 1; still a single thread per chip.
• Thread-level parallelism (TLP): Simultaneous Multithreading (SMT, e.g. Intel’s Hyper-Threading) and Chip-Multiprocessors (CMPs, e.g. IBM POWER4/5, Intel Pentium D, Core Duo, AMD Dual-Core Opteron, Sun UltraSPARC T1 (Niagara)): chip-level TLP/parallel processing, even more important due to the slowing clock rate increase.
Improving microprocessor generation performance by exploiting more levels of parallelism.

Dual-Core Chip-Multiprocessor (CMP) Architectures
• Single die, shared L2 (or L3) cache: cores communicate using the shared on-chip cache through an on-chip crossbar/switch (lowest communication latency). Examples: IBM POWER4/5, Intel Pentium Core Duo (Yonah), Conroe (Core 2), i7, Sun UltraSPARC T1 (Niagara).
• Single die, private caches, shared system interface: cores communicate using on-chip interconnects (the shared system interface). Examples: AMD Dual-Core Opteron, Athlon 64 X2, AMD Phenom, ..., Intel Itanium 2 (Montecito).
• Two dies in a shared package, private caches, private system interfaces: cores communicate off-chip over the external Front Side Bus (FSB) (highest communication latency). Examples: Intel Pentium D, Intel quad-core (two dual-core chips).
Source: Real World Technologies, http://www.realworldtech.com/page.cfm?ArticleID=RWT101405234615

Example Six-Core CMP: AMD Phenom II X6
Six processor cores sharing 6 MB of level 3 (L3) cache. (CMP = Chip-Multiprocessor.)

Example Eight-Core CMP: Intel Nehalem-EX
Eight processor cores sharing 24 MB of level 3 (L3) cache; each core is 2-way SMT (2 threads per core), for a total of 16 threads.

Example 100-Core CMP: Tilera TILE-Gx Processor
A Network-on-Chip (NoC) example: no shared cache; cores are connected by on-chip network switches and communication links.

For more information see: http://www.tilera.com/

Microprocessors vs. Vector Processors, Uniprocessor Performance: LINPACK
[Figure: LINPACK performance (MFLOPS, log scale, 1 to 10,000) vs. year (1975-2000) for vector processors (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94; n = 100 and n = 1,000 curves) and microprocessors (Sun 4/260, MIPS M/120, M/2000, R4400, IBM RS6000/540, Power2/990, HP 9000/735, HP 9000/750, DEC Alpha, DEC Alpha AXP, DEC 8200). Microprocessors passed 1 GFLOP (10^9 FLOPS) in the 1990s; a single microprocessor core now delivers roughly 5-16 GFLOPS.]

Parallel Performance: LINPACK
Current top LINPACK performance (since ~June 2016): 93,014,600 GFlop/s = 93,014.6 TeraFlops/s = 93.014 PetaFlops/s, achieved by Sunway TaihuLight (@ National Supercomputing Center, Wuxi, China) with 10,649,600 total processor cores (40,960 Sunway SW26010 260C 260-core processors @ 1.45 GHz).
[Figure: LINPACK performance (GFLOPS, log scale, 0.1 to 10,000) vs. year (1985-1996) for MPP peak and CRAY peak systems: Xmp/416(4), Ymp/832(8), nCUBE/2(1024), iPSC/860, CM-2, CM-200, Delta, C90(16), CM-5, T3D, T932(32), Paragon XP/S, Paragon XP/S MP (1024) and (6768), and ASCI Red, which reached 1 TeraFLOP (10^12 FLOPS = 1000 GFLOPS).]
GFLOP = 10^9 FLOPS; TeraFLOP = 1000 GFLOPS = 10^12 FLOPS; PetaFLOP = 1000 TeraFLOPS = 10^15 FLOPS; ExaFLOP = 1000 PetaFLOPS = 10^18 FLOPS
The current ranking of the top 500 parallel supercomputers in the world is found at: www.top500.org

Why is Parallel Processing Needed? LINPACK Performance Trends
[Figure: The two previous LINPACK plots shown side by side: uniprocessor performance (MFLOPS, 1975-2000) and parallel system performance (GFLOPS, 1985-1996).]

GFLOP = 10^9 FLOPS; TeraFLOP = 1000 GFLOPS = 10^12 FLOPS; PetaFLOP = 1000 TeraFLOPS = 10^15 FLOPS; ExaFLOP = 1000 PetaFLOPS = 10^18 FLOPS

Computer System Peak FLOP Rating History
Current top peak FP performance (since ~June 2016): 125,436,000 GFlops/s = 125,436 TeraFlops/s = 125.43 PetaFlops/s, achieved by Sunway TaihuLight (@ National Supercomputing Center, Wuxi, China) with 10,649,600 total processor cores (40,960 Sunway SW26010 260C 260-core processors @ 1.45 GHz).
[Figure: Computer system peak FLOP rating history, crossing the TeraFLOP mark (10^12 FLOPS = 1000 GFLOPS) and then the PetaFLOP mark (10^15 FLOPS = 1000 TeraFLOPS, a quadrillion FLOPS), up to the Sunway TaihuLight.]
The current ranking of the top 500 parallel supercomputers in the world is found at: www.top500.org

November 2005 Top 500 List (the TOP500 list started in 1992)

Source (and for current list): www.top500.org

[Slides #21-#31: screenshots of the top-10 entries from successive TOP500 lists (www.top500.org): the November 2005 list and the 32nd (November 2008), 34th (November 2009), 36th (November 2010), 38th (November 2011), 40th (November 2012), 41st (June 2013), 42nd (November 2013), 43rd (June 2014), 45th (June 2015), and 47th (June 2016) lists; power consumption is listed in kW. The 40th list (November 2012) was led by the Cray XK7 Titan (@ Oak Ridge National Laboratory): LINPACK performance 17.59 PetaFlops/s (quadrillion Flops per second), peak performance 27.1 PetaFlops/s, and 560,640 total processor cores: 299,008 Opteron cores (18,688 AMD Opteron 6274 16-core processors @ 2.2 GHz) + 261,632 GPU cores (18,688 Tesla Kepler K20x GPUs @ 0.7 GHz). From 2005 to 2017 the top system became roughly 360x faster.]

The Goal of Parallel Processing
• Goal of applications in using parallel machines: maximize speedup over single-processor performance.

  Speedup (p processors) = Performance (p processors) / Performance (1 processor)

• For a fixed problem size (input data set), performance = 1/time, giving the fixed-problem-size parallel speedup:
  Speedup_fixed-problem (p processors) = Time (1 processor) / Time (p processors)
• Ideal speedup = number of processors = p.
  – Very hard to achieve due to parallelization overheads: communication cost, dependencies, load imbalance, ...

The Goal of Parallel Processing
• The parallel processing goal is to maximize the fixed-problem-size parallel speedup:
  Speedup = Time(1) / Time(p) < (Sequential work on one processor) / Max(Work + Synch Wait Time + Comm Cost + Extra Work)
  where Max is taken over processors (i.e. the processor with the maximum execution time), and the synchronization wait time, communication cost, and extra work are the parallelization overheads.
• Ideal speedup = p = number of processors.
  – Very hard to achieve: it implies (requires) no parallelization overheads and perfect load balance among all processors.
• Maximize parallel speedup by:
  1- Balancing computations on processors (every processor does the same amount of work) and the same amount of overheads.
  2- Minimizing communication cost and other overheads associated with each step of parallel program creation and execution.
• Parallel performance scalability: achieve a good speedup for the parallel application on the parallel architecture as problem size and machine size (number of processors) are increased.
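A minimal C sketch of the speedup and efficiency arithmetic above; the one-processor time and the per-processor parallel times are hypothetical illustration values:

```c
#include <stdio.h>

/* Fixed-problem-size speedup: parallel time is set by the slowest
   processor (work + synch wait + comm cost + extra work). */
int main(void) {
    double t1 = 100.0;                        /* time on one processor (s) */
    double tp_per_proc[4] = {27.0, 30.0, 25.0, 28.0};  /* p = 4 processors */
    int p = 4;

    double tp = 0.0;                          /* max over processors       */
    for (int i = 0; i < p; i++)
        if (tp_per_proc[i] > tp) tp = tp_per_proc[i];

    double speedup    = t1 / tp;              /* 100 / 30 = 3.33, below p  */
    double efficiency = speedup / p;
    printf("Speedup = %.2f (ideal %d), efficiency = %.2f\n",
           speedup, p, efficiency);
    return 0;
}
```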

Elements of Parallel Computing
[Figure: Computing problems (the HPC driving force) are analyzed into parallel algorithms and data structures (dependency analysis, task dependency graphs), expressed as parallel programs in high-level languages and mapped (parallel computations/tasks assigned to processors; binding: compile, load) onto the hardware architecture (processing nodes/network) with operating system support; the result is assessed by parallel performance evaluation (e.g. parallel speedup).]

Elements of Parallel Computing
1- Computing Problems:

  – Numerical computing (the driving force): science and engineering numerical problems demand intensive integer and floating-point computations.
  – Logical reasoning: artificial intelligence (AI) problems demand logic inferences, symbolic manipulations and large space searches.
2- Parallel Algorithms and Data Structures:
  – Special algorithms and data structures are needed to specify the computations and communication present in computing problems (from dependency analysis).
  – Most numerical algorithms are deterministic, using regular data structures.
  – Symbolic processing may use heuristics or non-deterministic searches.
  – Parallel algorithm development requires interdisciplinary interaction.

3- Hardware Resources (the parallel architecture):
  A- Processors, memory, and peripheral devices (processing nodes) form the hardware core of a computer system (its computing power).
  B- Processor connectivity (system interconnects, network) and memory organization influence the system architecture (communication/connectivity).
4- Operating Systems:
  – Manage the allocation of resources to running processes.
  – Mapping to match algorithmic structures with hardware architecture and vice versa: processor scheduling, memory mapping, interprocessor communication.
  – Parallelism exploitation is possible at: 1- algorithm design, 2- program writing, 3- compilation, and 4- run time.

5- System Software Support:
  – Needed for the development of efficient programs in high-level languages (HLLs).
  – Assemblers, loaders.
  – Portable parallel programming languages/libraries.
  – User interfaces and tools.
6- Compiler Support (the two approaches to parallel programming, illustrated next):
  (a) Implicit parallelism approach:
    • Parallelizing compiler: can automatically detect parallelism in sequential source code and transform it into parallel constructs/code.
    • Source code is written in conventional sequential languages.
  (b) Explicit parallelism approach:
    • The programmer explicitly specifies parallelism using:
      – a sequential compiler (conventional sequential HLL) and a low-level library of the target parallel computer, or
      – a concurrent (parallel) HLL.
    • Concurrency-preserving compiler: the compiler in this case preserves the parallelism explicitly specified by the programmer. It may perform some program flow analysis, dependence checking, and limited optimizations for parallelism detection.

Two Approaches to Parallel Programming
(a) Implicit parallelism: the programmer writes source code in sequential languages (C, C++, FORTRAN, LISP, ...); a parallelizing compiler automatically detects parallelism in the sequential source code and transforms it into parallel constructs/code, producing parallel object code that is executed by the runtime system.
(b) Explicit parallelism: the programmer writes source code in concurrent dialects of C, C++, LISP, ... and explicitly specifies parallelism using parallel constructs; a concurrency-preserving compiler produces concurrent object code that is executed by the runtime system.

Factors Affecting Parallel System Performance
• Parallel-algorithm related (i.e. inherent parallelism):
  – Available concurrency and its profile, grain size, uniformity, and patterns.
    • Dependencies between computations, represented by a dependency graph.
  – Type of parallelism present: functional and/or data parallelism.
  – Required communication/synchronization, its uniformity and patterns.
  – Data size requirements.
  – Communication-to-computation ratio (C-to-C ratio, lower is better).
• Parallel-program related:
  – Programming model used.
  – Resulting data/code memory requirements, locality and working-set characteristics.
  – Parallel task grain size.
  – Assignment (mapping) of tasks to processors: dynamic or static.
  – Cost of communication/synchronization primitives.
• Hardware/architecture related:
  – Total CPU computational power available (number of processors, hardware parallelism).
  – Types of computation modes supported.
  – Shared address space vs. message passing.
  – Communication network characteristics (topology, bandwidth, latency).
  – Memory hierarchy properties.
(Concurrency = parallelism.)

A Simple Parallel Execution Example
[Figure: A task dependency graph with tasks A-G, its sequential execution on one processor, and a possible parallel execution schedule on two processors P0 and P1, with communication and idle periods shown on the time axis.]
• Assume the computation time for each task A-G is 3, the communication time between parallel tasks is 1, and that communication can overlap with computation.
• Sequential time on one processor: T1 = 21. Time of the two-processor schedule: T2 = 16.
• Speedup on two processors = T1/T2 = 21/16 = 1.3.
• What would the speedup be with 3 processors? 4 processors? 5, ...?
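As a rough check on the question above, the following C sketch computes the given two-processor speedup and a simple work-based bound for 3-5 processors; it ignores the dependency and communication structure of the graph, so it only bounds the achievable speedup:

```c
#include <stdio.h>

/* Speedup arithmetic for the 7-task example (A-G, 3 time units each).
   T1 = 21 and T2 = 16 come from the schedules in the example; the
   bound below simply divides total work among p processors. */
int main(void) {
    const int n_tasks = 7, task_time = 3;
    const int t1 = n_tasks * task_time;      /* sequential time = 21       */
    const int t2 = 16;                       /* given 2-processor schedule */

    printf("Speedup on 2 processors = %.2f\n", (double)t1 / t2);

    for (int p = 3; p <= 5; p++) {
        int t_bound = (t1 + p - 1) / p;      /* ceil(total work / p)       */
        printf("p = %d: parallel time >= %d, so speedup <= %.2f\n",
               p, t_bound, (double)t1 / t_bound);
    }
    return 0;
}
```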

Evolution of Computer Architecture
[Figure: Evolution tree: from non-pipelined scalar sequential execution, to lookahead with limited instruction fetch/execute (I/E) overlap, to functional parallelism, which develops along two paths: (1) multiple functional units and pipelining (pipelined, single- or multiple-issue processors), and (2) vector/data-parallel execution: implicit vector (memory-to-memory) and explicit vector (register-to-register) machines, leading to SIMD machines (associative processors, processor arrays, data parallel processors) and MIMD machines (shared-memory multiprocessors, and message-passing multicomputers: computer clusters and massively parallel processors (MPPs)). I/E = instruction fetch and execute; SIMD = single instruction stream over multiple data streams; MIMD = multiple instruction streams over multiple data streams.]

Parallel Architectures History
Historically, parallel architectures were tied to parallel programming models:
• Divergent architectures, with no predictable pattern of growth.

[Figure: Divergent architectures, each with its own application software and system software stack: systolic arrays, SIMD, data parallel architectures, message passing, dataflow. More on this next lecture.]

Parallel Programming Models
• The programming methodology used in coding parallel applications.
• Specifies: 1- communication and 2- synchronization (ordering).
• Examples:
  – Multiprogramming, or multi-tasking (not true parallel processing!): no communication or synchronization at the program level; a number of independent programs run on different processors in the system. (However, a good way to utilize multi-core processors for the masses!)
  – Shared memory address space (SAS): parallel program threads or tasks communicate implicitly using a shared memory address space (shared data in memory).
  – Message passing: explicit point-to-point communication (via send/receive pairs) is used between parallel program tasks using messages.
  – Data parallel: more regimented, global actions on data (i.e. the same parallel operations over all elements of an array or vector); can be implemented with shared address space or message passing.

Flynn’s 1972 Classification of Computer Architecture (Taxonomy)
(Instruction stream = thread of control or hardware context.)

(a) • Single Instruction stream over a Single Data stream (SISD): conventional sequential machines or uniprocessors.
(b) • Single Instruction stream over Multiple Data streams (SIMD): vector computers, arrays of synchronized processing elements; data parallel systems (GPUs?).
(c) • Multiple Instruction streams and a Single Data stream (MISD): systolic arrays for pipelined execution.
(d) • Multiple Instruction streams over Multiple Data streams (MIMD): parallel computers:
  • Shared memory multiprocessors (tightly coupled processors).
  • Multicomputers: unshared distributed memory, message passing used instead (e.g. clusters); loosely coupled processors.
Architectures are classified according to the number of instruction streams (threads) and the number of data streams.

Flynn’s Classification of Computer Architecture (Taxonomy)
[Figure: Block diagrams of the four classes: SISD (a conventional uniprocessor), SIMD (a control unit (CU) driving an array of synchronized processing elements (PEs), each with memory (M)), MISD (systolic arrays for pipelined execution), and MIMD (parallel computers or multiprocessor systems; a distributed-memory multiprocessor system is shown).]

Current Trends in Parallel Architectures
• The extension of conventional (sequential) “computer architecture” to support communication and cooperation:

  – OLD: Instruction Set Architecture (ISA).
  – NEW: communication architecture, which defines:
    1- Critical abstractions, boundaries, and primitives (interfaces).
    2- Organizational structures that implement the interfaces (in hardware or software).
• Compilers, libraries and the OS are important bridges today (i.e. software abstraction layers). More on this next lecture.

Modern Parallel Architecture Layered Framework

[Figure: Layered framework: parallel applications (e.g. CAD, scientific modeling) at the top; programming models below them (multiprogramming, shared address space, message passing, data parallel); a communication abstraction at the user/system boundary, implemented by compilation or library software and operating-system support in user/system space; and, below the hardware/software boundary, the communication hardware (ISA) and the physical communication medium, i.e. the hardware: processing nodes & interconnects. More on this next lecture.]

Shared Address Space (SAS) Parallel Architectures
• Any processor can directly reference any memory location (in the shared address space).
  – Communication occurs implicitly as a result of loads and stores.
• Convenient: communication is implicit via loads/stores.
  – Location transparency.
  – Similar programming model to time-sharing in uniprocessors, except that processes run on different processors.

• Good throughput on multiprogrammed workloads (i.e. multi-tasking).
• Naturally provided on a wide range of platforms.
  – Wide range of scale: a few to hundreds of processors.
• Popularly known as the shared memory machines or model.
  – Ambiguous: memory may be physically distributed among processing nodes (i.e. distributed shared memory multiprocessors).
• Sometimes called tightly-coupled parallel computers.

Shared Address Space (SAS) Parallel Programming Model
• Process: a virtual address space plus one or more threads of control.
• Portions of the address spaces of processes are shared:

[Figure: Virtual address spaces for a collection of processes (P0 ... Pn) communicating via shared addresses: the shared portion of each virtual address space maps to common physical addresses in the machine physical address space, while each process also keeps a private portion (P0 private ... Pn private); loads and stores to shared addresses are the communication mechanism.]
In SAS, communication is implicit via loads/stores; ordering/synchronization is explicit, using synchronization primitives.
• Writes to a shared address are visible to other threads (in other processes too).
• A natural extension of the uniprocessor model:
  – Conventional memory operations are used for communication (thus communication is implicit via loads/stores).
  – Special atomic operations are needed for synchronization, i.e. for event ordering and mutual exclusion: using locks, semaphores, etc. (thus synchronization is explicit).
• The OS uses shared memory to coordinate processes.
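A minimal sketch of this model using POSIX threads (pthreads is not mentioned in the slides; it is just one common way to program a shared address space): two threads communicate implicitly through a shared counter, and a mutex makes the synchronization explicit. Compile with cc -pthread.

```c
#include <pthread.h>
#include <stdio.h>

/* Shared address space: both threads see the same 'counter' variable.
   Communication is implicit (plain loads/stores); synchronization is
   explicit (the mutex serializes the read-modify-write). */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* explicit synchronization          */
        counter++;                    /* implicit communication via a store */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* prints 2000000 */
    return 0;
}
```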

Models of Shared-Memory Multiprocessors
1 • The Uniform/Centralized Memory Access (UMA) model:
  – All physical memory is shared by all processors.
  – All processors have equal access (i.e. equal memory bandwidth and access latency) to all memory addresses.
  – Also referred to as Symmetric Memory Processors (SMPs).
2 • The distributed-memory or Non-Uniform Memory Access (NUMA) model:
  – Shared memory is physically distributed locally among processors; access latency to remote memory is higher.
3 • The Cache-Only Memory Architecture (COMA) model:
  – A special case of a NUMA machine where all distributed main memory is converted to caches.
  – No memory hierarchy at each processor.

[Figure: Block diagrams of the three models: (1) UMA or Symmetric Memory Processors (SMPs): processors (P) with caches (C) and I/O controllers connected to shared memory modules (Mem) and I/O devices through an interconnect (bus, crossbar, or multistage network); (2) distributed-memory NUMA: nodes, each with a processor (P), cache ($) and local memory (M), connected by a network; (3) COMA: nodes, each with a processor (P), cache (C) and cache directory (D), connected by a network.]

Uniform Memory Access (UMA) Example: Intel Pentium Pro Quad (circa 1997)
[Figure: A 4-way SMP: four P-Pro processor modules, each with a CPU, 256-KB L2 cache, interrupt controller and bus interface, share a single P-Pro front-side bus (64-bit data, 36-bit address, 66 MHz) with the memory controller (MIU, 1-, 2-, or 4-way interleaved DRAM) and two PCI bridges driving PCI buses and I/O cards.]
• All coherence and glue logic is in the processor module.
• Highly integrated, targeted at high volume.
• Used as the computing node in Intel’s ASCI Red MPP.
• Bus-based Symmetric Memory Processors (SMPs): a single Front Side Bus (FSB) is shared among the processors, which severely limits scalability to only ~2-4 processors.

Non-Uniform Memory Access (NUMA) Example: 8-Socket AMD Opteron Node

(Circa 2003.)
• Dedicated point-to-point interconnects (HyperTransport links) are used to connect the processors, alleviating the traditional limitations of FSB-based SMP systems.
• Each processor has two integrated DDR memory channel controllers, so memory bandwidth scales up with the number of processors.
• A NUMA architecture, since a processor can access its own memory at a lower latency than remote memory directly connected to other processors in the system.
• A total of 32 processor cores when quad-core Opteron processors are used (128 cores with 16-core processors).

Non-Uniform Memory Access (NUMA) Example: 4-Socket Intel Nehalem-EX Node

Distributed Shared-Memory Multiprocessor System Example:

NUMA MPP Example: Cray T3E (circa 1995-1999; MPP = Massively Parallel Processor system)
[Figure: Each node contains a processor (P) with cache ($), local memory (Mem), and a memory controller and network interface (NI), i.e. the communication assist (CA), connected through an X/Y/Z switch to a 3D torus point-to-point network; external I/O attaches to the network. A more recent Cray MPP example: Cray X1E.]
• Scales up to 2048 processors, DEC Alpha EV6 microprocessor (COTS).
• Custom 3D torus point-to-point network, 480 MB/s links.
• The communication assist generates communication requests for non-local references.
• No hardware mechanism for cache coherence (SGI Origin etc. provide this).
• An example of Non-Uniform Memory Access (NUMA).

Message-Passing Multicomputers
• Comprised of multiple autonomous computers (computing nodes) connected via a suitable network: an industry-standard System Area Network (SAN) or a proprietary network.
• Each node consists of one or more processors, local memory, attached storage and I/O, and a Communication Assist (CA).
• Local memory is only accessible by the local processors in a node (no shared memory among nodes).
• Inter-node communication is carried out explicitly by message passing through the connection network via send/receive operations (thus communication is explicit).
• Process communication is achieved using a message-passing programming environment (e.g. PVM, MPI; portable, platform-independent).
  – The programming model is more removed or abstracted from basic hardware operations.
• Include:

  1- A number of commercial Massively Parallel Processor systems (MPPs).
  2- Computer clusters that utilize commodity off-the-shelf (COTS) components.
• Also called loosely-coupled parallel computers.

Message-Passing Abstraction

[Figure: Process P (the sender) executes Send(X, Q, t), transmitting the data at local address X to recipient process Q with tag t; process Q (the recipient) executes a blocking Receive(Y, P, t) and waits until a matching message from sender P with tag t arrives, storing it at local address Y. Each address is named within the local process address space.]
• Send specifies the buffer to be transmitted and the receiving process.
• Receive specifies the sending process and the application storage to receive into.
• Communication is explicit via sends/receives; memory-to-memory copy is possible, but processes must be named.
• Optional tag on the send and matching rule on the receive.
• The user process names local data and entities in the process/tag space too.
• In the simplest form, the send/receive match achieves implicit pairwise synchronization, i.e. event ordering of computations according to data dependencies; thus synchronization is implicit.
• Many possible overheads: copying, buffer management, protection, ...
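A minimal sketch of this send/receive abstraction using MPI (introduced a few slides later); the message size and tag value are arbitrary illustration choices. Rank 0 plays sender P and rank 1 plays recipient Q; run with, e.g., mpirun -np 2:

```c
#include <mpi.h>
#include <stdio.h>

/* Blocking send/receive pair. The tag (42) is the matching rule;
   MPI_Recv blocks until a matching message arrives, which is the
   implicit pairwise synchronization described above. */
int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double x[4] = {1.0, 2.0, 3.0, 4.0};            /* local buffer X */
        MPI_Send(x, 4, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double y[4];                                    /* local buffer Y */
        MPI_Recv(y, 4, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received y[3] = %.1f\n", y[3]);
    }

    MPI_Finalize();
    return 0;
}
```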

Message-Passing Example: Intel Paragon (circa 1992-1993)

[Figure: An Intel Paragon node: each node is a 2-way SMP with two i860 processors (each with an L1 cache) on a 64-bit, 50 MHz memory bus, 4-way interleaved DRAM, and a communication assist (CA) consisting of a memory/DMA controller, driver, and network interface (NI). Nodes attach to a 2D grid point-to-point network (8 bits, 175 MHz, 2D bidirectional, with a processing node attached to every switch). Also pictured: Sandia’s Intel Paragon XP/S-based supercomputer.]

Message-Passing Example: IBM SP-2 MPP (circa 1994-1998)

[Figure: An IBM SP-2 node: a POWER2 CPU with L2 cache on a memory bus with a memory controller and 4-way interleaved DRAM; the network interface card (NIC, containing its own i860, NI, DMA engine and DRAM, acting as the communication assist, CA) sits on the MicroChannel I/O bus. Nodes are connected by a general (multi-stage) interconnection network formed from 8-port switches.]
• Each node is made out of an essentially complete RS6000 workstation.
• The network interface is integrated on the I/O bus (bandwidth limited by the I/O bus).
(MPP = Massively Parallel Processor system.)

Message-Passing MPP Example: IBM Blue Gene/L (circa 2005)
• (2 processors/chip) x (2 chips/compute card) x (16 compute cards/node board) x (32 node boards/cabinet) x (64 cabinets) = 128k = 131,072 (0.7 GHz PowerPC 440) processors (64k nodes; 64 cabinets, 64x32x32), 2.8 GFLOPS peak per processor core.
• System location: Lawrence Livermore National Laboratory.
• Networks: a 3D torus point-to-point network and a global tree network (both proprietary).
• Packaging hierarchy: chip (2 processors): 2.8/5.6 GF/s, 4 MB; compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR; node board (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR; cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR; system (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR.
• Design goals: high computational power efficiency, high computational density per volume.
• LINPACK performance: 280,600 GFLOPS = 280.6 TeraFLOPS = 0.2806 PetaFLOPS. Peak FP performance: about 367,000 GFLOPS = 367 TeraFLOPS = 0.367 PetaFLOPS.

Message-Passing Programming Tools
• Message-passing programming environments include:
  – Message Passing Interface (MPI):
    • Provides a standard for writing concurrent message-passing programs.
    • MPI implementations include parallel libraries used by existing

programming languages (C, C++). (Both MPI and PVM are examples of the explicit-parallelism approach to parallel programming.)
  – Parallel Virtual Machine (PVM):
    • Enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource.
    • PVM support software executes on each machine in a user-configurable pool, and provides a computational environment for concurrent applications.
    • User programs written, for example, in C, Fortran or Java are provided access to PVM through the use of calls to PVM library routines.
• Both MPI and PVM are portable (platform-independent) and allow the user to explicitly specify parallelism.

Data Parallel Systems
(SIMD in Flynn’s taxonomy.)
• Programming model (data parallel):
  – Similar operations are performed in parallel on each element of a data structure.
  – Logically, a single thread of control performs sequential or parallel steps.
  – Conceptually, a processor is associated with each data element.
• Architectural model:
  – An array of many simple processors, each with little memory, attached to a control processor that issues instructions; the PEs do not sequence through instructions themselves.
  – Specialized and general communication, global synchronization: all PEs are synchronized (same instruction or operation in a given cycle).
• Example machines:
  – Thinking Machines CM-1, CM-2 (and CM-5)
  – Maspar MP-1 and MP-2
• Other data parallel architectures: vector machines. (PE = Processing Element.)
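A tiny sketch of the data parallel model in C, not tied to any of the machines above: the same operation is applied to every element of an array, which is what a SIMD machine would do with one element per PE; the OpenMP directive is only used here to express the element-wise parallelism on a conventional shared-memory multiprocessor (compile with -fopenmp, or the pragma is simply ignored):

```c
#include <stdio.h>

#define N 8

/* Data parallel style: one logical operation, y[i] = a*x[i] + y[i],
   applied to all elements. On a SIMD machine each PE would own one
   element; the pragma expresses the same idea on a CMP/SMP. */
int main(void) {
    double a = 2.0, x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];          /* same operation on every element */

    printf("y[N-1] = %.1f\n", y[N - 1]); /* 2*7 + 1 = 15.0 */
    return 0;
}
```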

Dataflow Architectures
• Represent computation as a graph of essential data dependencies (non-von-Neumann, i.e. not a program-counter-based architecture).
  – A logical processor sits at each node, activated by the availability of its operands (i.e. data or results).
  – A message (token) carrying the tag of the next instruction is sent to the next processor.
  – The tag is compared with others in a matching store; a match fires execution. (Tokens = copies of computation results.)
• Example dataflow graph (the dependency graph for an entire computation/program): a = (b + 1) x (b - c), d = c x e, f = a x d.
• Research dataflow machine prototypes include:
  – The MIT Tagged-Token Dataflow Architecture
  – The Manchester Dataflow Machine
[Figure: One node of a dataflow machine: a token store and program store feed token matching and a waiting token queue; a match triggers instruction fetch and execution, and result tokens are formed and sent over the token distribution network to other nodes.]
• The Tomasulo approach for dynamic instruction execution utilizes a dataflow-driven execution engine:
  – The data dependency graph for a small window of instructions is constructed dynamically as instructions are issued in program order.
  – The execution of an issued instruction is triggered by the availability of its operands (the data it needs) over the CDB.
[Figure: Speculative Tomasulo processor (Tomasulo’s algorithm): an instruction queue (IQ) issues instructions in order, results are stored, and instructions commit (retire) in order through a FIFO, usually implemented as a circular buffer. From 550 Lecture 6.]
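A toy C illustration (not a model of any machine above) of dataflow-style firing on operand availability for the small graph a = (b+1) x (b-c), d = c x e, f = a x d; the input values are arbitrary:

```c
#include <stdio.h>

/* Toy dataflow firing for: a = (b+1)*(b-c); d = c*e; f = a*d.
   Each "node" fires as soon as all of its input values are available,
   regardless of program order: the graph, not a program counter,
   drives execution. */
int main(void) {
    double b = 4.0, c = 2.0, e = 3.0;

    /* These three nodes have all operands available immediately,
       so they may fire in any order (or in parallel). */
    double t1 = b + 1.0;      /* node: +  */
    double t2 = b - c;        /* node: -  */
    double d  = c * e;        /* node: *  */

    /* This node waits for the result tokens t1 and t2. */
    double a = t1 * t2;       /* node: *  */

    /* This node waits for the result tokens a and d. */
    double f = a * d;         /* node: *  */

    printf("a = %.1f, d = %.1f, f = %.1f\n", a, d, f);  /* 10, 6, 60 */
    return 0;
}
```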

Example of Flynn’s Taxonomy’s MISD (Multiple Instruction Streams, Single Data Stream): Systolic Architectures
• Replace the single processor with an array of regular processing elements.
• Orchestrate the data flow for high throughput with less memory access.
[Figure: Memory (M) feeding a chain of processing elements (PEs) whose results return to memory, contrasted with a single memory/processor pair. PE = Processing Element, M = Memory.]
• Different from linear pipelining:
  – Nonlinear array structure, multidirectional data flow; each PE may have (small) local instruction and data memory.
• Different from SIMD: each PE may do something different.
• Initial motivation: VLSI Application-Specific Integrated Circuits (ASICs).
• Represent algorithms directly by chips connected in a regular pattern.

• A possible example of MISD in Flynn’s classification of computer architecture.

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication, C = A x B
• Processors are arranged in a 2-D grid; each processor accumulates one element of the product.
• The rows of A are fed in from the left and the columns of B from the top, skewed (aligned) in time so that matching operand pairs meet at each PE.
[Figure sequence, T = 0 through T = 7: at each time step, every PE that holds an operand pair multiplies it and adds the result to its running sum, then passes its A operand to the right and its B operand downward. For example, at T = 1 the top-left PE computes a0,0*b0,0; at T = 3 it completes C00 = a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0; by T = 7 the bottom-right PE completes C22 and the whole product is done.]
• On one processor the computation is O(n^3), here t = 27 multiply-accumulate steps; the systolic array finishes in 7 steps, so speedup = 27/7 = 3.85.
Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/
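A sequential C simulation of the 3x3 systolic schedule above (the matrices are arbitrary test values): because row i of A and column j of B are skewed by i and j cycles, PE(i,j) performs its k-th multiply-accumulate at time step t = i + j + k + 1, which reproduces the T = 1 ... T = 7 progression shown in the example:

```c
#include <stdio.h>

#define N 3

/* Step-by-step simulation of the 3x3 systolic matrix multiply.
   PE(i,j) accumulates C[i][j]; the skewed operand arrival means its
   k-th multiply-accumulate happens at time step t = i + j + k + 1. */
int main(void) {
    double a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double b[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    double c[N][N] = {{0}};

    for (int t = 1; t <= 3*N - 2; t++) {          /* 7 time steps for N = 3   */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int k = t - 1 - i - j;            /* operand pair arriving now */
                if (k >= 0 && k < N)
                    c[i][j] += a[i][k] * b[k][j];
            }
        printf("after T = %d: C[0][0] = %g, C[2][2] = %g\n",
               t, c[0][0], c[2][2]);
    }
    return 0;
}
```

All 27 multiply-accumulates still happen, but they are spread over only 7 steps because up to 9 PEs work in the same step, which is where the 27/7 speedup in the example comes from.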