Parallel Computer Architecture

Introduction to Parallel Computing
CIS 410/510
Department of Computer and Information Science
Lecture 2 – Parallel Architecture

Outline
• Parallel architecture types
• Instruction-level parallelism
• Vector processing
• SIMD
• Shared memory
  ❍ Memory organization: UMA, NUMA
  ❍ Coherency: CC-UMA, CC-NUMA
• Interconnection networks
• Distributed memory
• Clusters
• Clusters of SMPs
• Heterogeneous clusters of SMPs

Introduction to Parallel Computing, University of Oregon, IPCC
Lecture 2 – Parallel Architecture

Parallel Architecture Types
• Uniprocessor
  – Scalar processor
  – Vector processor
  – Single Instruction Multiple Data (SIMD)
• Shared Memory Multiprocessor (SMP)
  – Shared memory address space
  – Bus-based memory system
  – Interconnection network
[Figure: block diagrams of each type, showing processors and memory joined by a bus or a network]

Parallel Architecture Types (2)
• Distributed Memory Multiprocessor
  – Message passing between nodes
  – Massively Parallel Processor (MPP)
    • Many, many processors
• Cluster of SMPs
  – Shared memory addressing within SMP node
  – Message passing between SMP nodes
  – Can also be regarded as MPP if processor number is large
[Figure: processor and memory nodes attached to an interconnection network through network interfaces]

Parallel Architecture Types (3)
• Multicore
  – Multicore processor (cores can be hardware multithreaded, i.e. hyperthreaded)
  – GPU accelerator
  – "Fused" processor accelerator
• Multicore SMP+GPU Cluster
  – Shared memory addressing within SMP node
  – Message passing between SMP nodes
  – GPU accelerators attached
[Figure: multicore chip with per-core caches and shared memory; GPU attached to the processor over PCI; SMP nodes joined by an interconnection network]
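
The two communication styles named in the architecture types above (shared memory addressing within an SMP node, message passing between nodes) can be contrasted with a small Python sketch. It is an illustration only, under stated assumptions: threads stand in for processors, a shared list stands in for the shared address space, and `queue.Queue` stands in for the interconnection network. The function names are hypothetical and do not come from the lecture.

```python
import queue
import threading


def shared_memory_sum(data, n_workers=4):
    """Workers communicate through loads and stores to one shared list."""
    partial = [0] * n_workers  # assumed stand-in for a shared address space

    def worker(rank):
        # Each worker takes a strided slice and stores its result in place.
        partial[rank] = sum(data[rank::n_workers])

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def message_passing_sum(data, n_workers=4):
    """Workers own private data and communicate only by explicit messages."""
    network = queue.Queue()  # assumed stand-in for the interconnection network

    def worker(rank, local_data):
        # No shared state is written: the partial sum is sent as a message.
        network.put((rank, sum(local_data)))

    threads = [
        threading.Thread(target=worker, args=(r, data[r::n_workers]))
        for r in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Receive exactly one message per worker and combine the partial sums.
    return sum(network.get()[1] for _ in range(n_workers))
```

Both functions compute the same total. What differs is how the partial results reach the combiner: implicitly through memory in the first case, explicitly through messages in the second. A cluster of SMPs would use the first style inside a node and the second between nodes.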
How do you get parallelism in the hardware?
• Instruction-Level Parallelism (ILP)
• Data parallelism
  ❍ Increase amount of data to be operated on at same time
• Processor parallelism
  ❍ Increase number of processors
• Memory system parallelism
  ❍ Increase number of memory units
  ❍ Increase bandwidth to memory
• Communication parallelism
  ❍ Increase amount of interconnection between elements
  ❍ Increase communication bandwidth

Instruction-Level Parallelism
• Opportunities for splitting up instruction processing
• Pipelining within instruction
• Pipelining between instructions
• Overlapped execution
• Multiple functional units
• Out of order execution
• Multi-issue execution
• Superscalar processing
• Superpipelining
• Very Long Instruction Word (VLIW)
• Hardware multithreading (hyperthreading)

Parallelism in Single Processor Computers
• History of processor architecture innovation
[Figure: taxonomy from unpipelined to pipelined designs. ILP branch (multiple E units): horizontal control (FPS AP-120B, VLIW) and issue-when-ready (CDC 6600, scoreboarding; IBM 360/91, reservation stations). Scalar instructions only: CDC 7600. Vector branch (vector instructions): register to register (CRAY-1) and memory to memory (CDC Cyber-205).]

Vector Processing
• Scalar processing
  ❍ Processor instructions operate on scalar values
  ❍ Integer registers and floating point registers
• Vectors
  ❍ Set of scalar data
  ❍ Vector registers
    ◆ integer, floating point (typically)
  ❍ Vector instructions operate on vector registers (SIMD)
• Vector unit pipelining
• Multiple vector units
• Vector chaining
[Photo: Cray 2. Liquid-cooled with inert fluorocarbon. (That's a waterfall fountain!!!)]

Data Parallel Architectures
• SIMD (Single Instruction Multiple Data)
  ❍ Logical single thread (instruction) of control
  ❍ Processor associated with data elements
• Architecture
  ❍ Array of simple processors with memory
  ❍ Processors arranged in a regular topology
  ❍ Control processor issues instructions
    ◆ All processors execute same instruction (maybe disabled)
  ❍ Specialized synchronization and communication
  ❍ Specialized reduction operations
  ❍ Array processing
[Figure: control processor driving a grid of processing elements (PEs)]

AMT DAP 500
• Active Memory Technology (AMT)
• Distributed Array Processor (DAP)
[Figure: DAP 500 block diagram. Labels: user interface, host connection unit, master control unit, code memory, 32 × 32 array of processor elements (accumulator, activity control, carry, data), array memory (32K bits), fast data channel.]

Thinking Machines Connection Machine
(Tucker, IEEE Computer, Aug. 1988)
• 16,000 processors!!!

Vector and SIMD Processing Timeline
(a) Multivector track
• CDC 7600 (CDC, 1970)
• Cray 1 (Russell, 1978)
• CDC Cyber 205 (Levine, 1982)
• ETA 10 (ETA, Inc. 1989)
• Cray Y-MP (Cray Research, 1989)
• Cray/MPP (Cray Research, 1993)
• Fujitsu, NEC, Hitachi models
(b) SIMD track
• Illiac IV (Barnes et al, 1968)
• Goodyear MPP (Batcher, 1980)
• BSP (Kuck and Stokes, 1982)
• IBM GF/11 (Beetem et al, 1985)
• DAP 610 (AMT, Inc. 1987)
• CM2 (TMC, 1990)
• MasPar MP1 (Nickolls, 1990)
• MasPar MP2 (1991)

What's the maximum parallelism in a program?
• "MaxPar: An Execution Driven Simulator for Studying Parallel Systems," Ding-Kai Chen, M.S. Thesis, University of Illinois, Urbana-Champaign, 1989.
• Analyze the data dependencies in application execution
[Figure: parallelism profiles for a 512-point FFT and Flo52]

Dataflow Architectures
• Represent computation as graph of dependencies
• Operations stored in memory until operands are ready
• Operations can be dispatched to processors
• Tokens carry tags of next instruction to processor
• Tag compared in matching store
• A match fires execution
• Machine does the hard parallelization work
• Hard to build correctly
Example program:
  a = (b + 1) × (b − c)
  d = c × e
  f = a × d
[Figure: dataflow graph for the example program, and the machine pipeline: token store, program store, waiting/matching, instruction fetch, execute, form token, token queue, all connected by a network]

Shared Physical Memory
• Add processors to single processor computer system
• Processors share computer system resources
  ❍ Memory, storage, …
• Sharing physical memory
  ❍ Any processor can reference any memory location
  ❍ Any I/O controller can reference any memory address
  ❍ Single physical memory address space
• Operating system runs on any processor, or all
  ❍ OS sees single memory address space
  ❍ Uses shared memory to coordinate
• Communication occurs as a result of loads and stores

Caching in Shared Memory Systems
• Reduce average latency
  ❍ Automatic replication closer to processor
• Reduce average bandwidth demand on memory
• Data is logically transferred from producer to consumer through memory
  ❍ store reg → mem
  ❍ load reg ← mem
• Processors can share data efficiently
• What happens when store and load are executed on different processors?
• Cache coherence problems

Shared Memory Multiprocessors (SMP)
• Architecture types
[Figure: a single processor with memory, and multiple processors connected to memory by multi-port memory, a shared bus, or an interconnection network]
• Differences lie in memory system interconnection
[Figure: processors, memory modules, I/O controllers and I/O devices joined by interconnects. "What does this look like?"]

Bus-based SMP
• Memory bus handles all memory read/write traffic
• Processors share bus
• Uniform Memory Access (UMA)
  ❍ Memory (not cache) uniformly equidistant
  ❍ Accesses take the same amount of time (generally) to complete
• May have multiple memory modules
  ❍ Interleaving of physical address space
• Caches introduce memory hierarchy
  ❍ Lead to data consistency problems
  ❍ Cache coherency hardware necessary (CC-UMA)
[Figure: processors (P) with caches ($) on a shared bus with memory modules (M) and I/O controllers (C)]

Crossbar SMP
• Replicates memory bus for every processor and I/O controller
  ❍ Every processor has direct path
• UMA SMP architecture
• Can still have cache coherency issues
• Multi-bank memory or interleaved memory
• Advantages
  ❍ Bandwidth scales linearly (no shared links)
• Problems
  ❍ High incremental cost (cannot afford for many processors)
  ❍ Use switched multi-stage interconnection network
[Figure: processors and I/O controllers connected to memory banks through a crossbar]

"Dance Hall" SMP and Shared Cache
• Interconnection network connects processors to memory
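
The cache coherence problem raised in the shared memory slides above (a store and a load executed on different processors) can be made concrete with a toy Python model. This is a minimal sketch, not the lecture's design: the `Memory` and `Cache` classes are hypothetical, the caches are assumed to be write-through, and the coherent variant assumes a simple write-invalidate scheme in which a store removes the copies held by peer caches.

```python
class Memory:
    """Single shared physical memory; unwritten cells read as 0."""

    def __init__(self):
        self.cells = {}

    def read(self, addr):
        return self.cells.get(addr, 0)

    def write(self, addr, value):
        self.cells[addr] = value


class Cache:
    """Private write-through cache; optionally invalidates peer copies."""

    def __init__(self, memory, coherent):
        self.memory = memory
        self.coherent = coherent
        self.lines = {}   # addr -> locally cached value
        self.peers = []   # other caches attached to the same memory

    def load(self, addr):
        if addr not in self.lines:
            # Miss: replicate the value closer to the processor.
            self.lines[addr] = self.memory.read(addr)
        # Hit: without coherence this copy may be stale.
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.memory.write(addr, value)  # write-through (assumption)
        if self.coherent:
            for peer in self.peers:
                # Write-invalidate (assumption): drop the peer's copy.
                peer.lines.pop(addr, None)


def run(coherent):
    """P1 caches a value, P0 overwrites it, then P1 loads it again."""
    mem = Memory()
    p0, p1 = Cache(mem, coherent), Cache(mem, coherent)
    p0.peers, p1.peers = [p1], [p0]
    p1.load(0x10)          # P1 caches the initial value 0
    p0.store(0x10, 42)     # P0 stores a new value
    return p1.load(0x10)   # value observed by P1
```

With `run(False)` the second load by P1 hits its own stale line and returns the old value, even though memory already holds the new one. With `run(True)` the invalidation forces a miss and P1 fetches the updated value. Real coherence hardware (CC-UMA, CC-NUMA) is considerably more involved than this model.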

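The bus-based and crossbar slides mention interleaving of the physical address space across multiple memory modules. The sketch below shows one common form, low-order word interleaving, in Python. The slides do not say which interleaving scheme is meant, so the scheme, the function names, and the default module count and word size are all assumptions made for illustration.

```python
def interleave(addr, n_modules=4, word_bytes=8):
    """Map a physical byte address to (module, word index inside the module).

    Assumes low-order interleaving: consecutive words go to consecutive modules.
    """
    word = addr // word_bytes
    return word % n_modules, word // n_modules


def modules_touched(start, n_words, n_modules=4, word_bytes=8):
    """List the module hit by each word of a sequential access stream."""
    return [
        interleave(start + i * word_bytes, n_modules, word_bytes)[0]
        for i in range(n_words)
    ]
```

For a sequential stream of eight words starting at address 0, `modules_touched(0, 8)` cycles through modules 0, 1, 2, 3 twice, so consecutive accesses fall on different modules and can overlap in time. This is the memory system parallelism the lecture lists earlier.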