
Parallel (High-Performance) Architectures

Tarek El-Ghazawi

Department of Electrical and Computer Engineering, The George Washington University


Outline

Definitions and Conceptual Classifications
» Parallel Processing, MPPs, and Related Terms
» Flynn's Classification of Computer Architectures
Operational Models for Parallel Computers
Interconnection Networks
MPPs Performance

Definitions and Conceptual Classifications

What is Parallel Processing?
- A form of data processing which emphasizes the exploration and exploitation of inherent parallelism in the underlying problem.
Other related terms
» Massively Parallel Processors (MPPs)
» Heterogeneous Processing
  – In the 1990s, heterogeneous workstations from different vendors
  – Now, accelerators such as GPUs, FPGAs, Intel's Xeon Phi, …

Definitions and Conceptual Classifications

Why Massively Parallel Processors (MPPs)?
» Increase processing speed and memory, allowing studies of problems with higher resolutions or bigger sizes
» Provide a low-cost alternative to using expensive processor and memory technologies (as in traditional vector machines)

Stored Program Computer

The IAS machine was one of the first electronic computers, developed under the direction of John von Neumann at the Institute for Advanced Study (IAS), Princeton
» First electronic implementation of the stored program concept
» This computer organization is now popular as the "von Neumann architecture," which stores program and data in a common memory
» The instruction cycle can be: Fetch, Decode, Execute, and Write Back

von Neumann Machine

[Figure: a processor (control unit with instruction register and µ-instruction register, ALU, and registers) connected to a common memory holding both program and data, with address and data/instruction paths between them.]

Flynn's Classification

Not necessarily the most thorough classification, but the easiest to remember (1966)
Based on the multiplicity of instruction and data streams:
Categories = {single instruction (SI), multiple instruction (MI)} × {single data (SD), multiple data (MD)}
The ordered pairs generated are:
» Single instruction, single data (SISD)
» Single instruction, multiple data (SIMD)
» Multiple instruction, single data (MISD)
» Multiple instruction, multiple data (MIMD)

Definitions and Conceptual Classifications: Flynn's Classification

SISD - This is the uniprocessor architecture

[Figure: SISD — the CU receives the IS from memory M and issues a µIS to the PE (ALU); the PE exchanges a DS with M and with I/O.]

CU = Control Unit, PE = Processing Element (same as ALU), M = Memory, IS = Instruction Stream, DS = Data Stream, µIS = Micro-Instruction Stream

Definitions and Conceptual Classifications: Flynn's Classification

SIMD

[Figure: a single CU, with its control-unit memory holding the program, broadcasts one µIS to processors PE1 … PEN; each PE exchanges a DS with its own local memory LM1 … LMN.]

CU = Control Unit, PE = Processing Element, M = Memory, IS = Instruction Stream, DS = Data Stream, LM = Local Memory

Definitions and Conceptual Classifications: Flynn's Classification

MIMD (Shared or Distributed Memory)

[Figure: N independent processors, each with its own CU issuing an IS/µIS to its own PE; the PEs exchange DS with shared or distributed memory.]

CU = Control Unit, PE = Processing Element, M = Memory, IS = Instruction Stream, DS = Data Stream

Definitions and Conceptual Classifications: Flynn's Classification

MISD
» Like systolic arrays (see the example on the following slides)

[Figure: several CUs, each issuing its own µIS to its own PE; a single data stream from memory passes through the PEs in sequence.]

CU = Control Unit, PE = Processing Element, M = Memory

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication
• Processors arranged in a 2-D grid
• Each processor accumulates one element of the product
• Operands are aligned (skewed) in time: rows of A enter the array from the left, columns of B from the top (a software sketch of this dataflow follows the figure)

[Figure sequence, T = 0 through T = 7: at each step the a(i,k) operands advance one cell to the right and the b(k,j) operands one cell down; each PE(i,j) adds a(i,k)·b(k,j) to its running sum. After T = 7 every PE(i,j) holds c(i,j) = a(i,0)·b(0,j) + a(i,1)·b(1,j) + a(i,2)·b(2,j), i.e. the complete product C = A×B (Done).]
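A minimal software sketch of the dataflow pictured above (an illustration, not part of the original slides): the time skew is modeled by letting PE(i,j) consume the operand pair a[i][k], b[k][j] at global step t = i + j + k, so the full 3x3 product is accumulated in 3(N-1) + 1 = 7 compute steps, matching T = 1 … 7 on the slides.

/* Software model of the 3x3 systolic array: at global step t, PE(i,j)
 * receives a[i][k] from the left and b[k][j] from the top, where k = t-i-j,
 * and adds their product to its accumulator c[i][j]. */
#include <stdio.h>
#define N 3

int main(void)
{
    double a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double b[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    double c[N][N] = {{0}};

    for (int t = 0; t <= 3 * (N - 1); t++)           /* 7 compute steps for N = 3 */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int k = t - i - j;                   /* operand pair reaching PE(i,j) now */
                if (k >= 0 && k < N)
                    c[i][j] += a[i][k] * b[k][j];    /* one multiply-accumulate per PE per step */
            }

    for (int i = 0; i < N; i++)
        printf("%6.1f %6.1f %6.1f\n", c[i][0], c[i][1], c[i][2]);
    return 0;
}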

Definitions

Some styles of parallelism that could be seen in a program
» Data Parallelism - many data items can be processed in the same manner at the same time
  – SIMD or vector processors

» Functional Parallelism - program has different independent modules that can execute simultaneously

» Overlapped/Temporal Parallelism - program has a sequence of tasks that can be executed in an overlapped fashion. – the most important form of this overlapped parallelism is PIPELINING.

Pipelining

Example of pipelining
» A pipelined processor with 3 stages, e.g. for floating-point addition: Align (S0), Add (S1), Normalize (S2)
» 4 tasks, each with 3 subtasks, each subtask performed by one stage
» Takes 6 clocks instead of 12
» Up to 3 times faster than non-pipelined execution if the number of tasks is large enough
» An example is pipelined instruction processing; the stages could be instruction fetch, decode, operand fetch, execute, or combinations of these
» A form of overlapped/temporal parallelism

[Figure: 3-stage pipeline S0 → S1 → S2 with its space-time diagram; subtask j of task Ti occupies stage Sj during clock i + j, so tasks T0 … T3 complete after 6 clocks.]
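A way to quantify the "3 times faster" claim above (this formula is implied by the slide's numbers rather than stated on it): with k pipeline stages and n tasks, pipelined execution takes k + n − 1 clocks versus n·k clocks unpipelined, so

Speedup(n) = n·k / (k + n − 1), which approaches k as n grows.

For k = 3 stages and n = 4 tasks this gives 12/6 = 2; the full factor of 3 is reached only when the number of tasks is large.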

Operational Models for Parallel Computers

Basic categories of commercial high-performance computers (historical perspective):
» SIMD [stand-alone SIMD machines are obsolete]
» MIMD
» Vector Processors
» Clusters [networks of workstations]
» A modern system, however, may employ many of these styles at different levels of the architecture

Operational Models for Parallel Computers: Vector Processors

Designed to handle vector operations efficiently, a vector machine
» can replace an entire loop which manipulates an array with a single vector instruction (see the loop sketch after the next figure)
» can fetch an entire array from memory in a pipelined fashion (in conjunction with parallel, high-speed interleaved memory)
» can keep the elements of an entire array in a CPU vector register
» can also handle scalar operations, as it has both vector and scalar instructions and functional units and accepts scalar and vector data
» supports classes of operations involving scalar and/or vector operands: S × V → V; V → S; V × V → V; V → V

Operational Models for Parallel Computers: Vector Processor

[Figure: scalar units (pipes 1 and 2) with scalar registers and an Instruction Processing Unit (IPU) on one side; a vector instruction controller, vector registers, a vector access controller, and vector units (pipes 1 and 2) on the other; all tied to a high-speed, interleaved main memory.]
» IPU fetches and decodes both scalar and vector instructions
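As a concrete illustration of the first bullet above (an assumption for illustration, not from the slides), the scalar loop below is the kind of array manipulation a vector machine can replace with a short sequence of vector instructions; the vectorized form sketched in the comment uses exactly the operand classes S × V → V and V × V → V listed above (the mnemonics are illustrative, not a real ISA).

/* Scalar form of y = s*x + y.  On a vector processor the loop becomes,
 * conceptually:
 *   V1 <- vload  x          ; memory -> vector register
 *   V2 <- vload  y
 *   V3 <- vsmul  s, V1      ; S x V -> V
 *   V4 <- vadd   V3, V2     ; V x V -> V
 *         vstore V4 -> y
 */
void saxpy_like(int n, double s, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = s * x[i] + y[i];
}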

Operational Models for Parallel Computers: SIMD

SIMD machines are also known as Array Processors
» A recognized architecture which emphasizes data parallelism
» Stand-alone MPP systems in the 1980s and 1990s, including the Goodyear Aerospace MPP; CM-1, CM-2, and CM-200; MasPar MP-1 and MP-2
» More recently, SIMD hardware units and extensions: Intel MMX, IBM VMX/AltiVec
» The CUM (control-unit memory) holds the program, and the CU fetches and decodes instructions
» Microinstructions are broadcast to all processors
» Synchronization is in hardware
» Independent branching problem!

[Figure: the CU, with its CUM, broadcasts one µIS to PE1 … PEN; each PE has its own PE memory (PEM) and exchanges a DS with it; the PEs connect through an interconnection network to the front end (F.E.) and I/O.]
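A minimal sketch of the "SIMD extension" idea mentioned above (an illustration, not from the slides; it uses x86 SSE rather than the MMX/AltiVec units named on the slide): one instruction operates on four packed floats at once, the single-instruction, multiple-data pattern in miniature.

/* One SSE instruction adds four float lanes at a time. */
#include <xmmintrin.h>

void vec_add4(const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);      /* load 4 floats from a */
    __m128 vb = _mm_loadu_ps(b);      /* load 4 floats from b */
    __m128 vc = _mm_add_ps(va, vb);   /* one instruction adds all 4 lanes */
    _mm_storeu_ps(c, vc);             /* store the 4 results */
}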

Operational Models for Parallel Computers: MIMD

MIMD
» A class of MPPs which emphasizes control/functional parallelism
» Full-blown processors operating asynchronously
» Synchronization is often done in software (single O.S.)
» SPMD mode to emulate SIMD – application-level synchronization (see the SPMD sketch below)
» Examples: clusters, IBM BlueGene, SGI Altix, Cray XC50, Sunway TaihuLight, …

[Figure: processors, each with its own CU, PE, and local memory (LM), exchanging IS/DS through an interconnection network with local or distributed shared memory.]
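A minimal SPMD sketch (an assumption for illustration, not from the slides): one program is launched on every processor of the MIMD machine; each copy uses its rank to pick its share of the work, emulating the SIMD "same operation everywhere" style with application-level synchronization.

/* Single Program, Multiple Data with MPI: every rank runs this same code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many copies are running? */

    double local = (double)rank, total = 0.0;
    /* application-level synchronization point: all ranks contribute, root gets the sum */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks over %d processes = %g\n", size, total);
    MPI_Finalize();
    return 0;
}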

Operational Models for Parallel Computers: MIMD

Taxonomy of MIMD
» Shared address space: multiprocessors (MPs)
  – Central-memory MPs (non-scalable): UMA/SMP (e.g. SGI Power Challenge)
  – Distributed-memory MPs (scalable): NUMA/DSM (e.g. SGI Altix), COMA (e.g. KSR)
» Multiple address space: multicomputers (NORMA, scalable) – MPPs, clusters
Note: a NUMA could be either cache-coherent or not

Some Abbreviations

SPMD - Single Program Multiple Data

SIMD – Single Instruction Multiple Data MIMD – Multiple Instruction Multiple Data MISD – Multiple Instruction Single Data

SMP - Symmetric Multiprocessor DSM - Distributed Shared Memory

UMA - uniform memory access
NUMA - non-uniform memory access

Operational Models for Parallel Computers: MIMD

Uniform Memory Access (UMA) MP Model
- Equal access time to memory locations by all processors
» Symmetric -- equal access to peripherals, and all processors can run the O.S. kernel; aka Symmetric Multiprocessor (SMP)
» PVP – parallel vector processors

[Figure: processors P1 … Pn with their caches connected through a system interconnect (bus, crossbar, or MIN) to shared memory modules SM1 … SMm and to I/O.]

Operational Models for Parallel Computers: MIMD

Non-Uniform Memory Access – one-level NUMA
Local memory is mapped into a global address space
» Fastest access is into the local memory (LM), or a memory in the same cluster in the hierarchical cluster model (next slide)
» Second fastest access is into global shared memory (GSM)
» Third is into a memory in an external cluster
» Example: the SGI Altix system

[Figure: processors P, each with a local memory LM; the shared local memories are connected through an interconnection network (IN).]

Operational Models for Parallel Computers: MIMD

Hierarchical NUMA
GSM – Global Shared Memory
CSM – Cluster Shared Memory
CIN – Cluster Interconnection Network

[Figure: clusters 1 … n, each containing processors P and cluster shared memories CSM connected by a CIN; the clusters and the global shared memories GSM are tied together by a global interconnection network.]
A more hierarchical example

Operational Models for Parallel Computers: MIMD

Cache-Only Memory Access (COMA) MPs
» A special case of NUMA
» Data eventually migrates to where it will be used
» E.g. KSR-1 and KSR-2

[Figure: nodes, each consisting of a processor P, a cache C, and a directory D, connected by an interconnection network (IN).]

Operational Models for Parallel Computers: MIMD

Multicomputers – MPPs or clusters
» No remote memory access (NORMA)
» Only message-passing communications
» Examples: all clusters, IBM Blue Gene, Cray XC50, Sunway TaihuLight

[Figure: nodes, each a processor P with its own memory M, connected by a message-passing interconnection network (IN).]

Interconnection Networks

Why interconnection networks?
» Bus system: O(n) delay and O(1) cost
» Fully connected system: O(1) delay and O(n) cost per node (O(n²) links overall)
» The bus is cheapest; the fully connected system is fastest. Interconnection networks, as a discipline, explore the tradeoffs between the two

[Figure: a bus system connecting P1 … Pn on a single shared bus, versus a fully connected system with a direct link between every pair of processors.]

Interconnection Networks

Network Performance and Cost Metrics

» Distance - between two nodes - the number of hops to get from one to the other

» Average Distance - average delay (average number of hops)

» Latency - time from request till start of transmission

» Bandwidth - transmission rate when all processors are sending and receiving

» Bisection Bandwidth - rate of transmission (or number of wires) from one half of the machine to the other half

» Node degree - Number of ports per node (cost)

» Connectivity - minimum number of parallel paths between any two nodes (the sketch below evaluates several of these metrics for common topologies)
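A hedged sketch (not from the slides): closed-form values of degree, diameter, and bisection width for three of the fixed topologies shown on the next slide, assuming 64 nodes (an 8x8 mesh and a 6-dimensional hypercube).

/* Standard textbook values of a few of the metrics above. */
#include <stdio.h>

int main(void)
{
    int n = 64;        /* number of nodes (assumed) */
    int side = 8;      /* 2-D mesh is side x side  */
    int dim = 6;       /* hypercube dimension, n = 2^dim */

    printf("ring:      degree 2,  diameter %d, bisection width 2\n", n / 2);
    printf("2-D mesh:  degree <=4, diameter %d, bisection width %d\n",
           2 * (side - 1), side);
    printf("hypercube: degree %d,  diameter %d, bisection width %d\n",
           dim, dim, n / 2);
    return 0;
}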

Interconnection Networks

Fixed Interconnection Networks

[Figure: example fixed topologies — linear array, ring, 2-D mesh, 2-D torus ("small" systems), 3-D torus (large systems, e.g. the Cray XE6), and the 3-cube (hypercube).]

Interconnection Networks

Multistage Interconnection Networks (MINs)
» Set an array of switching elements to establish connections (a routing sketch follows the figures below)

[Figure: switching elements in different settings]

[Figure: an Omega MIN]
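A hedged sketch of how switch settings are chosen in an Omega MIN (an illustration, not from the slides): with N inputs there are log2(N) stages of 2x2 switches, each stage preceded by a perfect shuffle; at stage s the switch looks at bit s of the destination address (most-significant bit first) and routes to the upper output for 0 or the lower output for 1 (destination-tag routing).

/* Destination-tag routing through an N-input Omega network. */
#include <stdio.h>

/* Print the switch setting used at every stage for one source->destination circuit. */
static void omega_route(unsigned src, unsigned dst, int stages)
{
    unsigned addr = src;
    for (int s = stages - 1; s >= 0; s--) {
        /* perfect shuffle = rotate the link label left by one bit */
        addr = ((addr << 1) | ((addr >> (stages - 1)) & 1u)) & ((1u << stages) - 1u);
        unsigned out = (dst >> s) & 1u;        /* destination-tag bit for this stage */
        addr = (addr & ~1u) | out;             /* switch exits on upper(0) or lower(1) port */
        printf("stage %d: switch %u, exit %s\n",
               stages - 1 - s, addr >> 1, out ? "lower" : "upper");
    }
}

int main(void)
{
    omega_route(2, 5, 3);   /* 8-input Omega network: route input 2 to output 5 */
    return 0;
}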

Interconnection Networks

Communications Protocol
» Circuit switching

» Packet switching
  – Store-and-forward routing - forwards a whole packet at a time, hop by hop
  – Cut-through routing - divides each packet into flits and pipelines them (a simple latency comparison follows below)
    · packets are not buffered in their entirety at intermediate nodes
    · packet blocking mechanisms:
      » Virtual cut-through – if a message is blocked, its flits are accumulated in the node at the location of the lead flit (requires a larger buffer – a whole packet must fit)
      » Wormhole – when a message is blocked, trailing flits are stored at their current nodes (small flit buffers)
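A hedged sketch of the standard no-contention latency model for the two schemes above (an illustration, not from the slides), for a packet of L bytes split into f-byte flits, crossing D hops over links of bandwidth B bytes/s.

/* Store-and-forward pays the full packet time at every hop;
 * cut-through pays only the head-flit time per hop plus one packet time. */
#include <stdio.h>

int main(void)
{
    double L = 1024.0, f = 8.0, B = 1e9;   /* assumed packet size, flit size, bandwidth */
    int    D = 4;                          /* assumed hop count */

    double store_and_forward = D * (L / B);          /* whole packet retransmitted each hop */
    double cut_through       = D * (f / B) + L / B;  /* only the head flit pays the per-hop cost */

    printf("store-and-forward: %.3f us\n", store_and_forward * 1e6);
    printf("cut-through:       %.3f us\n", cut_through * 1e6);
    return 0;
}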

Interconnection Networks

Current systems replaced the FIFO on each link with DMA, buffers, and more intelligence
Diminishing role of topology
» With store-and-forward routing, topology was important
» The introduction of pipelined (wormhole) routing made it less so
» Cost is in the node-network interface
» Simplifies programming

[Figure: a 3-cube (hypercube) with nodes labeled 000 through 111.]
