Parallel (High-Performance) Computer Architectures
Tarek El-Ghazawi
Department of Electrical and Computer Engineering The George Washington University
Tarek El-Ghazawi, Introduction to High-Performance Computing slide 1 Introduction to Parallel Computing Systems
Outline
Definitions and Conceptual Classifications
» Parallel Processing, MPPs, and Related Terms
» Flynn's Classification of Computer Architectures
Operational Models for Parallel Computers
Interconnection Networks
MPP Performance
Definitions and Conceptual Classification
What is Parallel Processing?
- A form of data processing which emphasizes the exploration and exploitation of inherent parallelism in the underlying problem.
Other related terms
» Massively Parallel Processors
» Heterogeneous Processing
– In the 1990s, heterogeneous workstations from different processor vendors
– Now, accelerators such as GPUs, FPGAs, Intel's Xeon Phi, …
» Grid Computing
» Cloud Computing
Definitions and Conceptual Classification
Why Massively Parallel Processors (MPPs)?
» Increase processing speed and memory, allowing studies of problems with higher resolutions or bigger sizes
» Provide a low-cost alternative to expensive processor and memory technologies (as in traditional vector machines)
Stored Program Computer
The IAS machine was an early electronic computer, developed under the direction of John von Neumann at the Institute for Advanced Study (IAS), Princeton
» Among the first electronic implementations of the stored-program concept
» Its organization is now known as the "von Neumann architecture," which stores program and data in a common memory
» The instruction cycle can be: Fetch, Decode, Execute, and Write Back
von Neumann Machine
[Block diagram: a processor (registers, ALU, and a control unit with an instruction register and a µ-instruction register) connected to a single memory holding both program and data; the processor sends addresses to memory and exchanges data/instructions with it.]
Flynn's Classification
Not necessarily the most thorough classification, but the easiest to remember (1966)
Based on the multiplicity of instruction and data streams:
Categories = {single instruction (SI), multiple instruction (MI)} X {single data (SD), multiple data (MD)}
The ordered pairs generated are:
» Single instruction, single data (SISD)
» Single instruction, multiple data (SIMD)
» Multiple instruction, single data (MISD)
» Multiple instruction, multiple data (MIMD)
Definitions and Conceptual Classifications: Flynn's Classification
SISD - This is the uniprocessor architecture
[Block diagram: a single CU sends a µIS to one PE (the ALU); a single IS flows from memory M to the CU, and a single DS flows between M, the PE, and I/O.]
CU = Control Unit, PE = Processing Element (same as ALU), M = Memory, IS = Instruction Stream, DS = Data Stream, µIS = Microinstruction Stream
Definitions and Conceptual Classifications: Flynn's Classification
SIMD
[Block diagram: a single CU, whose memory holds the program, issues one IS/µIS that is broadcast to processors PE1 … PEN; each PE has its own local memory (LM1 … LMN) and its own DS.]
CU = Control Unit, PE = Processing Element, IS = Instruction Stream, DS = Data Stream, LM = Local Memory
Definitions and Conceptual Classifications: Flynn's Classification
MIMD (Shared Memory)
[Block diagram: N processors, each pairing its own CU with its own PE (CU1/PE1 … CUN/PEN); each CU fetches its own IS and each PE exchanges its own DS with a shared or distributed memory.]
CU = Control Unit, PE = Processing Element, IS = Instruction Stream, DS = Data Stream
Definitions and Conceptual Classifications: Flynn's Classification
MISD
» Systolic-array-like
[Block diagram: a single DS from memory flows through a chain of processors; each processor has its own CU issuing its own µIS to its PE.]
CU = Control Unit, PE = Processing Element, M = Memory
Systolic Array Example: 3x3 Systolic Array Matrix Multiplication
» Processors are arranged in a 2-D grid; each processor accumulates one element of the product
» Rows of A enter from the left and columns of B from the top, skewed ("aligned in time") by one clock per row and column so that matching operands meet at the right processor
» At each clock, a processor multiplies the a and b values reaching it, adds the product to its running sum, then passes the a value to the right and the b value downward
[Figure sequence, T = 0 through T = 7: snapshots of the skewed operand streams and the accumulating partial sums, e.g. the top-left processor building a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0]
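The data flow in these snapshots can be simulated directly. A sketch in plain Python, assuming (as above) that a values move rightward, b values move downward, and the input streams are skewed by one clock per row and column:

```python
def systolic_matmul(A, B):
    """Simulate C = A x B on an n x n systolic array.

    a values flow rightward and b values flow downward; the input
    streams are skewed so that a[i][k] and b[k][j] meet at
    processor (i, j) at clock t = i + j + k.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a_reg = [[None] * n for _ in range(n)]   # a value held at (i, j)
    b_reg = [[None] * n for _ in range(n)]   # b value held at (i, j)
    for t in range(3 * n - 2):               # last operands meet at t = 3n - 3
        # shift a right and b down (high index first, so each value moves one hop)
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # inject the skewed input streams at the array boundaries
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else None
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else None
        # every processor multiplies and accumulates in the same clock
        for i in range(n):
            for j in range(n):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C
```

For an n x n array the last operands meet within 3n - 2 clock steps, which is why the 3x3 example above is finished by T = 7.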
[Final snapshot, T = 7: all nine elements of the product are complete - Done]
Definitions
Some styles of parallelism that can be seen in a program
» Data Parallelism - many data items can be processed in the same manner at the same time
– SIMD or vector processors
» Functional Parallelism - program has different independent modules that can execute simultaneously
» Overlapped/Temporal Parallelism - the program has a sequence of tasks that can be executed in an overlapped fashion
– the most important form of overlapped parallelism is PIPELINING
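The payoff of overlapped execution can be quantified with a quick sketch: on a k-stage pipeline the first task takes k clocks to flow through, after which one task completes every clock (an idealized model that ignores stalls and hazards):

```python
def pipeline_clocks(n_tasks, n_stages):
    """Clocks to run n_tasks through an n_stages pipeline:
    n_stages clocks to fill it, then one task completes per clock."""
    return n_stages + (n_tasks - 1)

def pipeline_speedup(n_tasks, n_stages):
    """Speedup versus a non-overlapped unit needing n_stages clocks per task."""
    return n_tasks * n_stages / pipeline_clocks(n_tasks, n_stages)
```

For example, 4 tasks on a 3-stage pipeline take `pipeline_clocks(4, 3)` = 6 clocks instead of 12, and as the task count grows the speedup approaches the stage count of 3.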
Pipelining
Example of pipelining: a 3-stage pipeline
» a pipelined processor of 3 stages (S0, S1, S2), e.g. for floating-point addition: Align, Add, Normalize
» 4 tasks, each with 3 subtasks, each subtask performed by one stage
» takes 6 clocks instead of 12
» approaches 3 times the non-pipelined speed when the number of tasks is large enough
» [Space-time diagram: task Ti enters S0 at clock i, S1 at clock i+1, and S2 at clock i+2, so successive tasks overlap]
» an example is pipelined instruction processing; the stages could be instruction fetch, decode, operand fetch, and execute, or combinations of these
» a form of overlapped/temporal parallelism
Operational Models for Parallel Computers
Basic categories of commercial high-performance computers (a historical perspective):
» SIMD [stand-alone SIMD machines are obsolete]
» MIMD
» Vector processors
» Clusters [networks of workstations]
» A modern system, however, may employ many of these styles at different levels of the architecture
Operational Models for Parallel Computers: Vector Processor
Designed to handle vector operations efficiently, a vector machine
» can replace an entire loop that manipulates an array with a single vector instruction
» can fetch an entire array from memory in a pipelined fashion (in conjunction with parallel, high-speed interleaved memory)
» can keep the elements of an entire array in a CPU vector register
» can also handle scalar operations, as it has both vector and scalar instructions and functional units and accepts scalar and vector data
» supports classes of operations involving scalar and/or vector operands, including: S x V -> V; V -> S; V x V -> V; V -> V
Operational Models for Parallel Computers: Vector Processor
[Block diagram: the Instruction Processing Unit (IPU) fetches instructions from high-speed (interleaved) main memory and decodes both scalar and vector instructions; scalar instructions go to the scalar units (scalar registers feeding pipes 1 and 2), while vector instructions pass through a vector instruction controller to the vector units (vector registers feeding pipes 1 and 2), with a vector access controller moving operands between the vector registers and memory.]
» The IPU fetches and decodes both scalar and vector instructions
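As a software analogy for the loop-replacement idea, NumPy array expressions can stand in for vector instructions (an illustration only, not actual vector hardware):

```python
import numpy as np

a = np.arange(1024, dtype=np.float64)
b = np.ones(1024)

# Scalar (loop) style: one element per iteration
c_loop = np.empty(1024)
for i in range(1024):
    c_loop[i] = 2.0 * a[i] + b[i]

# Vector style: each whole-array expression stands in for one
# vector instruction (S x V -> V for 2.0 * a, then V x V -> V for the add)
c_vec = 2.0 * a + b

assert np.allclose(c_loop, c_vec)
```

The entire loop collapses into one line of whole-array operations, which is exactly the transformation a vectorizing compiler performs for a vector machine.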
Operational Models for Parallel Computers: SIMD
SIMD machines are also known as array processors
» A recognized architecture which emphasizes data parallelism
» Stand-alone MPP systems in the 1980s and 1990s, including the Goodyear Aerospace MPP; the CM-1, CM-2, and CM-200; and the MP-1 and MP-2
» More recently, SIMD hardware units and extensions: Intel MMX, IBM VMX/AltiVec
» The control unit memory (CUM) holds the program; the CU fetches and decodes instructions
» Microinstructions are broadcast to all processors
» Synchronization is in hardware
» Independent branching is a problem!
[Block diagram: the CU, with its CUM, broadcasts the µIS over a broadcasting bus to an array of PEs, each with its own PE memory (PEM) and DS; the PEs are linked by an interconnection network, which also connects to the front end (F.E.) and I/O.]
Operational Models for Parallel Computers: MIMD
MIMD
» A class of MPPs which emphasizes control/functional parallelism
» Full-blown processors operating asynchronously
» Synchronization is often done in software (single O.S.)
» SPMD mode can emulate SIMD, with application-level synchronization
» Examples: clusters, IBM BlueGene, SGI Altix, Cray XC50, Sunway TaihuLight, …
[Block diagram: multiple processors, each with its own CU, PE, and local memory (LM), connected through an interconnection network to local or distributed shared memory.]
Operational Models for Parallel Computers: MIMD
Taxonomy of MIMD
» Shared address space: multiprocessors (MPs)
– Central-memory MPs (non-scalable): UMA/SMP (e.g. SGI Power Challenge)
– Distributed MPs (scalable): NUMA/DSM (e.g. SGI Altix) and COMA (e.g. KSR)
» Multiple address spaces: multicomputers (NORMA), which are scalable (MPPs, clusters)
Note: a NUMA could be either cache-coherent or not
Some Abbreviations
SPMD – Single Program Multiple Data
SIMD – Single Instruction Multiple Data
MIMD – Multiple Instruction Multiple Data
MISD – Multiple Instruction Single Data
SMP – Symmetric Multiprocessor
DSM – Distributed Shared Memory
UMA – Uniform Memory Access
NUMA – Non-Uniform Memory Access
Operational Models for Parallel Computers: MIMD
Uniform Memory Access (UMA) MP model - equal access time to all memory locations from all processors
» Symmetric - equal access to peripherals, and all processors run the O.S. kernel; hence Symmetric Multiprocessor (SMP)
» PVP - parallel vector processors
[Block diagram: processors P1 … Pn with their caches connect through a system interconnect (bus, crossbar, or MIN) to shared memory modules SM1 … SMm and I/O.]
Operational Models for Parallel Computers: MIMD
Non-Uniform Memory Access - one-level NUMA
Local memory is mapped into a global address space
» the fastest access is into local memory (LM), or into a memory in the same cluster in the hierarchical cluster model (next slide)
» the second fastest access is into global shared memory (GSM)
» the third is into a memory in an external cluster
» an example is the SGI Altix system
[Block diagram: processor-plus-local-memory (P + LM) pairs attach to an interconnection network (IN); together, the shared local memories form the global address space.]
Operational Models for Parallel Computers: MIMD
Hierarchical NUMA - a more hierarchical example
[Block diagram: clusters 1 … n, each containing processors (P) and cluster shared memory (CSM) joined by a cluster interconnection network (CIN); the clusters connect through a global interconnection network to global shared memory modules (GSM).]
GSM – Global Shared Memory; CSM – Cluster Shared Memory; CIN – Cluster Interconnection Network
Operational Models for Parallel Computers: MIMD
Cache-Only Memory Access (COMA) MPs
» A special case of NUMA
» Information eventually migrates to where it will be used
» E.g., KSR-1 and KSR-2
[Block diagram: each processor (P) has a cache (C) with a directory (D); the directories connect to an interconnection network (IN), and all of memory behaves as cache.]
Operational Models for Parallel Computers: MIMD
Multicomputers – MPPs or clusters
» Distributed memory
» No remote memory access (NORMA)
» Message-passing communication only
» Examples: all clusters, IBM Blue Gene, Cray XC50, Sunway TaihuLight
[Block diagram: processor-plus-memory (P + M) nodes connected by a message-passing interconnection network.]
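The NORMA style can be sketched in plain Python, with threads standing in for nodes and a queue standing in for the message-passing interconnection network (a real multicomputer would use a message-passing library such as MPI; the names here are illustrative):

```python
import threading
import queue

def node(chunk, network):
    # Each "node" computes on its own private chunk of data and shares
    # results only by sending messages (NORMA: no remote memory access).
    network.put(sum(chunk))

def message_passing_sum(data, n_nodes=4):
    network = queue.Queue()          # stands in for the message-passing IN
    chunks = [data[i::n_nodes] for i in range(n_nodes)]
    workers = [threading.Thread(target=node, args=(c, network)) for c in chunks]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # combine the partial results received over the "network"
    return sum(network.get() for _ in range(n_nodes))
```

Each node touches only its own chunk; the only sharing between nodes is the explicit messages on the queue.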
Interconnection Networks
Why interconnection networks?
» Bus system: O(n) delay and O(1) cost
» Fully connected system: O(1) delay and O(n) cost per node (O(n²) links in total)
» The bus is cheapest and the fully connected system is fastest; interconnection networks are a discipline exploring the tradeoffs between the two
[Figures: a bus system connecting P1 … Pn to a single shared bus, and a fully-connected system with a direct link between every pair of processors.]
Interconnection Networks
Network Performance and Cost Metrics
» Distance - between two nodes: the number of hops to get from one to the other
» Diameter - the maximum distance in the network (and the maximum delay)
» Average Distance - average delay (average number of hops)
» Latency - time from request till start of transmission
» Bandwidth - transmission rate when all processors are sending and receiving
» Bisection Bandwidth - rate of transmission (or number of wires) from one half of the machine to the other half
» Node degree - Number of ports per node (cost)
» Connectivity - Min number of parallel paths between any two nodes
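For concreteness, the standard textbook values of several of these metrics for a few fixed topologies can be tabulated in a small sketch (degree is the maximum node degree, and the mesh bisection value assumes even k; names are illustrative):

```python
def ring(n):
    """n nodes in a cycle: cutting the ring in half severs 2 links."""
    return {"degree": 2, "diameter": n // 2, "bisection_width": 2}

def mesh_2d(k):
    """k x k two-dimensional mesh (interior nodes have degree 4)."""
    return {"degree": 4, "diameter": 2 * (k - 1), "bisection_width": k}

def hypercube(d):
    """2**d nodes, one per d-bit label; neighbors differ in one bit."""
    return {"degree": d, "diameter": d, "bisection_width": 2 ** (d - 1)}
```

A 3-cube (d = 3, 8 nodes) thus has degree 3, diameter 3, and bisection width 4, while an 8-node ring has diameter 4 but bisection width only 2.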
Interconnection Networks
Fixed Interconnection Networks
» Linear array
» Ring
» 2D mesh and 2D torus - "small" systems
» 3D torus - large systems (e.g. the Cray XE6)
» Hypercube (e.g. a 3-cube)
[Figures: drawings of each topology.]
Interconnection Networks
Multistage Interconnection Networks (MINs)
» An array of switches is set to establish connections between inputs and outputs
[Figures: 2x2 switching elements shown in their different settings, and an Omega MIN built from stages of such switches.]
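Omega-network routing is self-routing by destination tag: each stage performs a perfect shuffle (a left rotation of the node address) and then the 2x2 switch selects the output matching the next destination bit. A minimal sketch under those assumptions:

```python
def omega_route(src, dst, n_bits):
    """Destination-tag routing through an Omega network with 2**n_bits ports.

    At each of the n_bits stages: perfect-shuffle the current address
    (rotate its bits left by one), then let the 2x2 switch replace the
    low-order bit with the next destination bit (MSB first). After all
    stages, the message has arrived at dst regardless of src.
    """
    mask = (1 << n_bits) - 1
    addr = src
    path = [addr]
    for stage in range(n_bits):
        # perfect shuffle = rotate the n_bits-wide address left by one
        addr = ((addr << 1) | (addr >> (n_bits - 1))) & mask
        # switch setting: low bit becomes this stage's destination bit
        dst_bit = (dst >> (n_bits - 1 - stage)) & 1
        addr = (addr & ~1) | dst_bit
        path.append(addr)
    return path
```

For 8 ports, `omega_route(5, 2, 3)` returns the stage-by-stage positions `[5, 2, 5, 2]`, ending at port 2 after log2(8) = 3 stages.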
Interconnection Networks
Communication Protocols
» Circuit switching
» Packet switching
– Store-and-forward routing - each node receives and stores a whole packet before forwarding it
– Cut-through routing - divides each packet into flits that are pipelined through the network; packets are not buffered whole at intermediate nodes
Packet blocking mechanisms
» Virtual cut-through – if a message is blocked, its flits are accumulated at the node holding the lead flit (requires a larger buffer – the whole packet must fit)
» Wormhole – when a message is blocked, trailing flits are stored at their current nodes (requires only small flit buffers)
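The latency difference between the two packet-switching styles can be sketched with the usual idealized model (uniform link bandwidth B, packet length L, flit length Lf, h hops; routing and switching delays are ignored):

```python
def store_and_forward_latency(packet_bits, link_bw, hops):
    """The whole packet is retransmitted at every hop: h * (L / B)."""
    return hops * packet_bits / link_bw

def cut_through_latency(packet_bits, flit_bits, link_bw, hops):
    """The header flit crosses h hops, then the remaining flits stream
    behind it in pipeline fashion: h * (Lf / B) + (L - Lf) / B."""
    return (hops * flit_bits + (packet_bits - flit_bits)) / link_bw
```

Store-and-forward latency grows as h * L/B, while cut-through grows only as h * Lf/B plus a single packet transmission time, which is why pipelined routing makes latency nearly independent of distance for long packets.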
Interconnection Networks
Current systems replaced the FIFO on each link with DMA, buffers, and more intelligence
Diminishing role of topology
» With store-and-forward routing, topology was important
» The introduction of pipelined (wormhole) routing made it less so
» The cost is in the node-network interface
» This simplifies programming
[Figure: a 3-cube with nodes labeled 000 through 111.]