
Parallel (High-Performance) Architectures

Tarek El-Ghazawi

Department of Electrical and Computer Engineering, The George Washington University


Outline

Definitions and Conceptual Classifications
» Parallel Processing, MPPs, and Related Terms
» Flynn's Classification of Computer Architectures
Operational Models for Parallel Computers
Interconnection Networks
MPPs Performance

Definitions and Conceptual Classifications

What is Parallel Processing?
- A form of data processing which emphasizes the exploration and exploitation of inherent parallelism in the underlying problem.
Other related terms
» Massively Parallel Processors (MPPs)
» Heterogeneous Processing
  – In the 1990s, heterogeneous workstations from different vendors
  – Now, accelerators such as GPUs, FPGAs, Intel's Xeon Phi, …

Definitions and Conceptual Classifications

Why Massively Parallel Processors (MPPs)?
» Increase processing speed and memory, allowing studies of problems with higher resolutions or bigger sizes
» Provide a low-cost alternative to using expensive processor and memory technologies (as in traditional vector machines)

Stored Program Computer

The IAS machine was one of the first electronic computers, developed under the direction of John von Neumann at the Institute for Advanced Study (IAS), Princeton
» First electronic implementation of the stored program concept
» This computer organization is now popular as the "von Neumann architecture," which stores program and data in a common memory
» The instruction cycle can be: Fetch, Decode, Execute, and Write Back

von Neumann Machine

[Figure: a processor (control unit with instruction register and µ-instruction register, ALU, and registers) connected to a common memory holding both program and data, with address and data/instruction paths between them.]

Flynn's Classification

Not necessarily the most thorough classification, but the easiest to remember (1966)
Based on the multiplicity of instruction and data streams:
Categories = {single instruction (SI), multiple instruction (MI)} × {single data (SD), multiple data (MD)}
The ordered pairs generated are:
» Single instruction, single data (SISD)
» Single instruction, multiple data (SIMD)
» Multiple instruction, single data (MISD)
» Multiple instruction, multiple data (MIMD)

Definitions and Conceptual Classifications: Flynn's Classification

SISD - This is the uniprocessor architecture

[Figure: SISD — the CU receives the IS from memory M and issues a µIS to the PE (ALU); the PE exchanges a DS with M and with I/O.]

CU = Control Unit, PE = Processing Element (same as ALU), M = Memory, IS = Instruction Stream, DS = Data Stream, µIS = Micro-Instruction Stream

Definitions and Conceptual Classifications: Flynn's Classification

SIMD

[Figure: a single CU, with its control-unit memory holding the program, broadcasts one µIS to processors PE1 … PEN; each PE exchanges a DS with its own local memory LM1 … LMN.]

CU = Control Unit, PE = Processing Element, M = Memory, IS = Instruction Stream, DS = Data Stream, LM = Local Memory

Definitions and Conceptual Classifications: Flynn's Classification

MIMD (Shared or Distributed Memory)

[Figure: N independent processors, each with its own CU issuing an IS/µIS to its own PE; the PEs exchange DS with shared or distributed memory.]

CU = Control Unit, PE = Processing Element, M = Memory, IS = Instruction Stream, DS = Data Stream

Definitions and Conceptual Classifications: Flynn's Classification

MISD
» Like systolic arrays (see the example on the following slides)

[Figure: several CUs, each issuing its own µIS to its own PE; a single data stream from memory passes through the PEs in sequence.]

CU = Control Unit, PE = Processing Element, M = Memory

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication
• Processors arranged in a 2-D grid
• Each processor accumulates one element of the product
• Operands are aligned (skewed) in time: rows of A enter the array from the left, columns of B from the top (a software sketch of this dataflow follows the figure)

[Figure sequence, T = 0 through T = 7: at each step the a(i,k) operands advance one cell to the right and the b(k,j) operands one cell down; each PE(i,j) adds a(i,k)·b(k,j) to its running sum. After T = 7 every PE(i,j) holds c(i,j) = a(i,0)·b(0,j) + a(i,1)·b(1,j) + a(i,2)·b(2,j), i.e. the complete product C = A×B (Done).]
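A minimal software sketch of the dataflow pictured above (an illustration, not part of the original slides): the time skew is modeled by letting PE(i,j) consume the operand pair a[i][k], b[k][j] at global step t = i + j + k, so the full 3x3 product is accumulated in 3(N-1) + 1 = 7 compute steps, matching T = 1 … 7 on the slides.

/* Software model of the 3x3 systolic array: at global step t, PE(i,j)
 * receives a[i][k] from the left and b[k][j] from the top, where k = t-i-j,
 * and adds their product to its accumulator c[i][j]. */
#include <stdio.h>
#define N 3

int main(void)
{
    double a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double b[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    double c[N][N] = {{0}};

    for (int t = 0; t <= 3 * (N - 1); t++)           /* 7 compute steps for N = 3 */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int k = t - i - j;                   /* operand pair reaching PE(i,j) now */
                if (k >= 0 && k < N)
                    c[i][j] += a[i][k] * b[k][j];    /* one multiply-accumulate per PE per step */
            }

    for (int i = 0; i < N; i++)
        printf("%6.1f %6.1f %6.1f\n", c[i][0], c[i][1], c[i][2]);
    return 0;
}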

Definitions

Some styles of parallelism that could be seen in a program
» Data Parallelism - many data items can be processed in the same manner at the same time
  – SIMD or vector processors

» Functional Parallelism - program has different independent modules that can execute simultaneously

» Overlapped/Temporal Parallelism - program has a sequence of tasks that can be executed in an overlapped fashion. – the most important form of this overlapped parallelism is PIPELINING.

Pipelining

Example of pipelining
» A pipelined processor with 3 stages, e.g. for floating-point addition: Align (S0), Add (S1), Normalize (S2)
» 4 tasks, each with 3 subtasks, each subtask performed by one stage
» Takes 6 clocks instead of 12
» Up to 3 times faster than non-pipelined execution if the number of tasks is large enough
» An example is pipelined instruction processing; the stages could be instruction fetch, decode, operand fetch, execute, or combinations of these
» A form of overlapped/temporal parallelism

[Figure: 3-stage pipeline S0 → S1 → S2 with its space-time diagram; subtask j of task Ti occupies stage Sj during clock i + j, so tasks T0 … T3 complete after 6 clocks.]
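A way to quantify the "3 times faster" claim above (this formula is implied by the slide's numbers rather than stated on it): with k pipeline stages and n tasks, pipelined execution takes k + n − 1 clocks versus n·k clocks unpipelined, so

Speedup(n) = n·k / (k + n − 1), which approaches k as n grows.

For k = 3 stages and n = 4 tasks this gives 12/6 = 2; the full factor of 3 is reached only when the number of tasks is large.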

Operational Models for Parallel Computers

Basic categories of commercial high-performance computers (historical perspective):
» SIMD [stand-alone SIMD machines are obsolete]
» MIMD
» Vector Processors
» Clusters [networks of workstations]
» A modern system, however, may employ many of these styles at different levels of the architecture

Operational Models for Parallel Computers: Vector Processors

Designed to handle vector operations efficiently, a vector machine
» can replace an entire loop which manipulates an array with a single vector instruction (see the loop sketch after the next figure)
» can fetch an entire array from memory in a pipelined fashion (in conjunction with parallel, high-speed interleaved memory)
» can keep the elements of an entire array in a CPU vector register
» can also handle scalar operations, as it has both vector and scalar instructions and functional units and accepts scalar and vector data
» supports classes of operations involving scalar and/or vector operands: S × V → V; V → S; V × V → V; V → V

Operational Models for Parallel Computers: Vector Processor

[Figure: scalar units (pipes 1 and 2) with scalar registers and an Instruction Processing Unit (IPU) on one side; a vector instruction controller, vector registers, a vector access controller, and vector units (pipes 1 and 2) on the other; all tied to a high-speed, interleaved main memory.]
» IPU fetches and decodes both scalar and vector instructions
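As a concrete illustration of the first bullet above (an assumption for illustration, not from the slides), the scalar loop below is the kind of array manipulation a vector machine can replace with a short sequence of vector instructions; the vectorized form sketched in the comment uses exactly the operand classes S × V → V and V × V → V listed above (the mnemonics are illustrative, not a real ISA).

/* Scalar form of y = s*x + y.  On a vector processor the loop becomes,
 * conceptually:
 *   V1 <- vload  x          ; memory -> vector register
 *   V2 <- vload  y
 *   V3 <- vsmul  s, V1      ; S x V -> V
 *   V4 <- vadd   V3, V2     ; V x V -> V
 *         vstore V4 -> y
 */
void saxpy_like(int n, double s, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = s * x[i] + y[i];
}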

Operational Models for Parallel Computers: SIMD

SIMD machines are also known as Array Processors
» A recognized architecture which emphasizes data parallelism
» Stand-alone MPP systems in the 1980s and 1990s, including the Goodyear Aerospace MPP; CM-1, CM-2, and CM-200; MasPar MP-1 and MP-2
» More recently, SIMD hardware units and extensions: Intel MMX, IBM VMX/AltiVec
» The CUM (control-unit memory) holds the program, and the CU fetches and decodes instructions
» Microinstructions are broadcast to all processors
» Synchronization is in hardware
» Independent branching problem!

[Figure: the CU, with its CUM, broadcasts one µIS to PE1 … PEN; each PE has its own PE memory (PEM) and exchanges a DS with it; the PEs connect through an interconnection network to the front end (F.E.) and I/O.]
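A minimal sketch of the "SIMD extension" idea mentioned above (an illustration, not from the slides; it uses x86 SSE rather than the MMX/AltiVec units named on the slide): one instruction operates on four packed floats at once, the single-instruction, multiple-data pattern in miniature.

/* One SSE instruction adds four float lanes at a time. */
#include <xmmintrin.h>

void vec_add4(const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);      /* load 4 floats from a */
    __m128 vb = _mm_loadu_ps(b);      /* load 4 floats from b */
    __m128 vc = _mm_add_ps(va, vb);   /* one instruction adds all 4 lanes */
    _mm_storeu_ps(c, vc);             /* store the 4 results */
}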

Operational Models for Parallel Computers: MIMD

MIMD
» A class of MPPs which emphasizes control/functional parallelism
» Full-blown processors operating asynchronously
» Synchronization is often done in software (single O.S.)
» SPMD mode to emulate SIMD – application-level synchronization (see the SPMD sketch below)
» Examples: clusters, IBM BlueGene, SGI Altix, Cray XC50, Sunway TaihuLight, …

[Figure: processors, each with its own CU, PE, and local memory (LM), exchanging IS/DS through an interconnection network with local or distributed shared memory.]
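A minimal SPMD sketch (an assumption for illustration, not from the slides): one program is launched on every processor of the MIMD machine; each copy uses its rank to pick its share of the work, emulating the SIMD "same operation everywhere" style with application-level synchronization.

/* Single Program, Multiple Data with MPI: every rank runs this same code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many copies are running? */

    double local = (double)rank, total = 0.0;
    /* application-level synchronization point: all ranks contribute, root gets the sum */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks over %d processes = %g\n", size, total);
    MPI_Finalize();
    return 0;
}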

Operational Models for Parallel Computers: MIMD

Taxonomy of MIMD
» Shared address space: multiprocessors (MPs)
  – Central-memory MPs (non-scalable): UMA/SMP (e.g. SGI Power Challenge)
  – Distributed-memory MPs (scalable): NUMA/DSM (e.g. SGI Altix), COMA (e.g. KSR)
» Multiple address space: multicomputers (NORMA, scalable) – MPPs, clusters
Note: a NUMA could be either cache-coherent or not

Some Abbreviations

SPMD - Single Program Multiple Data

SIMD – Single Instruction Multiple Data MIMD – Multiple Instruction Multiple Data MISD – Multiple Instruction Single Data

SMP - Symmetric Multiprocessor DSM - Distributed Shared Memory

UMA - uniform memory access
NUMA - non-uniform memory access

Operational Models for Parallel Computers: MIMD

Uniform Memory Access (UMA) MP Model
- Equal access time to memory locations by all processors
» Symmetric -- equal access to peripherals, and all processors can run the O.S. kernel; aka Symmetric Multiprocessor (SMP)
» PVP – parallel vector processors

[Figure: processors P1 … Pn with their caches connected through a system interconnect (bus, crossbar, or MIN) to shared memory modules SM1 … SMm and to I/O.]

Operational Models for Parallel Computers: MIMD

Non-Uniform Memory Access – one-level NUMA
Local memory is mapped into a global address space
» Fastest access is into the local memory (LM), or a memory in the same cluster in the hierarchical cluster model (next slide)
» Second fastest access is into global shared memory (GSM)
» Third is into a memory in an external cluster
» Example: the SGI Altix system

[Figure: processors P, each with a local memory LM; the shared local memories are connected through an interconnection network (IN).]

Operational Models for Parallel Computers: MIMD

Hierarchical NUMA
GSM – Global Shared Memory
CSM – Cluster Shared Memory
CIN – Cluster Interconnection Network

[Figure: clusters 1 … n, each containing processors P and cluster shared memories CSM connected by a CIN; the clusters and the global shared memories GSM are tied together by a global interconnection network.]
A more hierarchical example

Operational Models for Parallel Computers: MIMD

Cache-Only Memory Access (COMA) MPs
» A special case of NUMA
» Data eventually migrates to where it will be used
» E.g. KSR-1 and KSR-2

[Figure: nodes, each consisting of a processor P, a cache C, and a directory D, connected by an interconnection network (IN).]

Operational Models for Parallel Computers: MIMD

Multicomputers – MPPs or clusters
» No remote memory access (NORMA)
» Only message-passing communications
» Examples: all clusters, IBM Blue Gene, Cray XC50, Sunway TaihuLight

[Figure: nodes, each a processor P with its own memory M, connected by a message-passing interconnection network (IN).]

Interconnection Networks

Why interconnection networks?
» Bus system: O(n) delay and O(1) cost
» Fully connected system: O(1) delay and O(n) cost per node (O(n²) links overall)
» The bus is cheapest; the fully connected system is fastest. Interconnection networks, as a discipline, explore the tradeoffs between the two

[Figure: a bus system connecting P1 … Pn on a single shared bus, versus a fully connected system with a direct link between every pair of processors.]

Interconnection Networks

Network Performance and Cost Metrics

» Distance - between two nodes - the number of hops to get from one to the other

» Average Distance - average delay (average number of hops)

» Latency - time from request till start of transmission

» Bandwidth - transmission rate when all processors are sending and receiving

» Bisection Bandwidth - rate of transmission (or number of wires) from one half of the machine to the other half

» Node degree - Number of ports per node (cost)

» Connectivity - minimum number of parallel paths between any two nodes (the sketch below evaluates several of these metrics for common topologies)
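A hedged sketch (not from the slides): closed-form values of degree, diameter, and bisection width for three of the fixed topologies shown on the next slide, assuming 64 nodes (an 8x8 mesh and a 6-dimensional hypercube).

/* Standard textbook values of a few of the metrics above. */
#include <stdio.h>

int main(void)
{
    int n = 64;        /* number of nodes (assumed) */
    int side = 8;      /* 2-D mesh is side x side  */
    int dim = 6;       /* hypercube dimension, n = 2^dim */

    printf("ring:      degree 2,  diameter %d, bisection width 2\n", n / 2);
    printf("2-D mesh:  degree <=4, diameter %d, bisection width %d\n",
           2 * (side - 1), side);
    printf("hypercube: degree %d,  diameter %d, bisection width %d\n",
           dim, dim, n / 2);
    return 0;
}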

Interconnection Networks

Fixed Interconnection Networks

[Figure: example fixed topologies — linear array, ring, 2-D mesh, 2-D torus ("small" systems), 3-D torus (large systems, e.g. the Cray XE6), and the 3-cube (hypercube).]

Interconnection Networks

Multistage Interconnection Networks (MINs)
» Set an array of switching elements to establish connections (a routing sketch follows the figures below)

[Figure: switching elements in different settings]

[Figure: an Omega MIN]
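A hedged sketch of how switch settings are chosen in an Omega MIN (an illustration, not from the slides): with N inputs there are log2(N) stages of 2x2 switches, each stage preceded by a perfect shuffle; at stage s the switch looks at bit s of the destination address (most-significant bit first) and routes to the upper output for 0 or the lower output for 1 (destination-tag routing).

/* Destination-tag routing through an N-input Omega network. */
#include <stdio.h>

/* Print the switch setting used at every stage for one source->destination circuit. */
static void omega_route(unsigned src, unsigned dst, int stages)
{
    unsigned addr = src;
    for (int s = stages - 1; s >= 0; s--) {
        /* perfect shuffle = rotate the link label left by one bit */
        addr = ((addr << 1) | ((addr >> (stages - 1)) & 1u)) & ((1u << stages) - 1u);
        unsigned out = (dst >> s) & 1u;        /* destination-tag bit for this stage */
        addr = (addr & ~1u) | out;             /* switch exits on upper(0) or lower(1) port */
        printf("stage %d: switch %u, exit %s\n",
               stages - 1 - s, addr >> 1, out ? "lower" : "upper");
    }
}

int main(void)
{
    omega_route(2, 5, 3);   /* 8-input Omega network: route input 2 to output 5 */
    return 0;
}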

Interconnection Networks

Communications Protocol
» Circuit switching

» Packet switching
  – Store-and-forward routing - forwards a whole packet at a time, hop by hop
  – Cut-through routing - divides each packet into flits and pipelines them (a simple latency comparison follows below)
    · packets are not buffered in their entirety at intermediate nodes
    · packet blocking mechanisms:
      » Virtual cut-through – if a message is blocked, its flits are accumulated in the node at the location of the lead flit (requires a larger buffer – a whole packet must fit)
      » Wormhole – when a message is blocked, trailing flits are stored at their current nodes (small flit buffers)
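A hedged sketch of the standard no-contention latency model for the two schemes above (an illustration, not from the slides), for a packet of L bytes split into f-byte flits, crossing D hops over links of bandwidth B bytes/s.

/* Store-and-forward pays the full packet time at every hop;
 * cut-through pays only the head-flit time per hop plus one packet time. */
#include <stdio.h>

int main(void)
{
    double L = 1024.0, f = 8.0, B = 1e9;   /* assumed packet size, flit size, bandwidth */
    int    D = 4;                          /* assumed hop count */

    double store_and_forward = D * (L / B);          /* whole packet retransmitted each hop */
    double cut_through       = D * (f / B) + L / B;  /* only the head flit pays the per-hop cost */

    printf("store-and-forward: %.3f us\n", store_and_forward * 1e6);
    printf("cut-through:       %.3f us\n", cut_through * 1e6);
    return 0;
}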

Interconnection Networks

Current systems replaced the FIFO on each link with DMA, buffers, and more intelligence
Diminishing role of topology
» With store-and-forward routing, topology was important
» The introduction of pipelined (wormhole) routing made it less so
» Cost is in the node-network interface
» Simplifies programming

[Figure: a 3-cube (hypercube) with nodes labeled 000 through 111.]
