Computer Architectures

Parallel (High-Performance) Computer Architectures
Tarek El-Ghazawi
Department of Electrical and Computer Engineering, The George Washington University
(Introduction to High-Performance Computing)

Introduction to Parallel Computing Systems: Outline
» Definitions and Conceptual Classifications
  – Parallel Processing, MPPs, and Related Terms
  – Flynn's Classification of Computer Architectures
» Operational Models for Parallel Computers
» Interconnection Networks
» MPP Performance

Definitions and Conceptual Classification: What is Parallel Processing?
A form of data processing that emphasizes the exploration and exploitation of the inherent parallelism in the underlying problem.
Other related terms:
» Massively Parallel Processors (MPPs)
» Heterogeneous Processing
  – In the 1990s, heterogeneous workstations from different processor vendors
  – Now, accelerators such as GPUs, FPGAs, Intel's Xeon Phi, …
» Grid Computing
» Cloud Computing

Definitions and Conceptual Classification: Why Massively Parallel Processors (MPPs)?
» Increase processing speed and memory, allowing studies of problems at higher resolutions or larger sizes
» Provide a low-cost alternative to expensive processor and memory technologies (as in traditional vector machines)

Stored Program Computer
The IAS machine was developed under the direction of John von Neumann at the Institute for Advanced Study (IAS), Princeton:
» First electronic implementation of the stored program concept
» This computer organization is now known as the "von Neumann architecture"; it stores program and data in a common memory
» The instruction cycle can be: Fetch, Decode, Execute, and Write Back

von Neumann Machine
[Diagram: a processor (registers, ALU, and a control unit with instruction and µ-instruction registers) exchanges addresses and data/instructions with a memory holding both program and data]

Flynn's Classification (1966)
Not necessarily the most thorough classification, but the easiest to remember. It is based on the multiplicity of instruction and data streams:
Categories = {single instruction (SI), multiple instruction (MI)} × {single data (SD), multiple data (MD)}
The ordered pairs generated are:
» Single instruction, single data (SISD)
» Single instruction, multiple data (SIMD)
» Multiple instruction, single data (MISD)
» Multiple instruction, multiple data (MIMD)

Legend for the diagrams that follow: CU = control unit; PE = processing element (same as ALU); M = memory; LM = local memory; IS = instruction stream; DS = data stream; µIS = microinstruction stream.

Flynn's Classification: SISD
This is the uniprocessor architecture.
[Diagram: one CU issues an instruction stream to one PE, which exchanges a data stream with memory and I/O]

Flynn's Classification: SIMD
[Diagram: one CU fetches the instruction stream from a program memory and broadcasts it to PE1 … PEN; each PE applies the same instruction to a data stream from its own local memory LM1 … LMN]
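The SISD/SIMD distinction can be made concrete with a few lines of code. The sketch below is purely illustrative (plain Python, made-up names, no real hardware semantics): SISD issues one instruction per datum in sequence, while SIMD broadcasts a single instruction to N processing elements that each hold one operand pair in local memory.

```python
# Illustrative sketch of Flynn's SISD vs. SIMD execution styles.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# SISD: a single control unit issues one instruction per datum,
# so the additions happen one after another.
c_sisd = []
for i in range(len(a)):
    c_sisd.append(a[i] + b[i])

# SIMD: the control unit broadcasts ONE instruction ("add") to N
# processing elements; each PE applies it to the operand pair in
# its own local memory, all in lockstep (modeled here as one
# application of the same operation to every PE's local data).
def simd_broadcast(op, local_data):
    return [op(*pe_mem) for pe_mem in local_data]

c_simd = simd_broadcast(lambda x, y: x + y, list(zip(a, b)))

print(c_sisd)  # [11, 22, 33, 44]
print(c_simd)  # [11, 22, 33, 44]
```

Both produce the same result; the difference is how many instruction streams are issued to get it.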
Flynn's Classification: MIMD (Shared Memory)
[Diagram: CU1 … CUN each issue an independent instruction stream to PE1 … PEN; the PEs' data streams access a shared or distributed memory]

Flynn's Classification: MISD
» Systolic-array-like
[Diagram: a single data stream passes in turn through a chain of PEs, each driven by its own CU and instruction stream out of a common memory]

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication
• Processors are arranged in a 2-D grid
• Each processor accumulates one element of the product
• Rows of A enter from the left and columns of B from the top, skewed ("alignments in time") so that matching operands meet in the right cell at the right step
[Figure: timesteps T = 0 to T = 2 — at T = 0 the skewed rows of A and columns of B sit at the array boundary; at T = 1 cell (0,0) computes a0,0*b0,0; at T = 2 cell (0,0) adds a0,1*b1,0 while cells (0,1) and (1,0) compute a0,0*b0,1 and a1,0*b0,0]
[Figure continued: timesteps T = 3 to T = 6 — the wavefront of active cells advances one anti-diagonal per step, each active cell adding one product per step; by T = 6 every cell except the bottom-right one has accumulated all three terms of its dot product, and cell (2,2) still awaits its final term a2,2*b2,2]
[Figure continued: T = 7 — cell (2,2) adds its final term a2,2*b2,2; every cell now holds its complete dot product, and the multiplication is done]

Definitions: Styles of Parallelism
Some styles of parallelism that can be seen in a program:
» Data Parallelism - many data items can be processed in the same manner at the same time
  – SIMD or vector processors
» Functional Parallelism - the program has different independent modules that can execute simultaneously
» Overlapped/Temporal Parallelism - the program has a sequence of tasks that can be executed in an overlapped fashion
  – The most important form of overlapped parallelism is PIPELINING

Pipelining
Example of pipelining: a 3-stage pipeline
» pipeline processor of 3 stages, e.g.
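The systolic timestep walk-through can be checked with a short simulation. This is an illustrative sketch (plain Python, not tied to any real hardware): at each step, cell (i, j) receives A[i][k] from the left and B[k][j] from above, where k = t - i - j reflects the skewed ("aligned in time") operand streams, and performs one multiply-accumulate. The 3n - 2 = 7 compute steps match the slides' T = 1 through T = 7 (T = 0 being the initial alignment).

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array computing C = A * B."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    # Rows of A flow left-to-right and columns of B top-to-bottom,
    # skewed so that at step t cell (i, j) sees the operand pair
    # A[i][k], B[k][j] with k = t - i - j.
    for t in range(3 * n - 2):            # the wavefront needs 3n - 2 steps
        for i in range(n):
            for j in range(n):
                k = t - i - j
                if 0 <= k < n:            # cell (i, j) is active this step
                    C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(systolic_matmul(A, B))  # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]
```

Each cell performs exactly n multiply-accumulates, one per active step, which is why every processor ends up holding one element of the product.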

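The overlapped execution that pipelining provides can be sketched in code. This is an illustrative model (stage functions and names are made up, not from the original slides): with s stages and n items, overlapped execution finishes in s + n - 1 cycles instead of the s * n cycles a non-overlapped design would need.

```python
def run_pipeline(stages, items):
    """Simulate a linear pipeline: one latch per stage, one result per cycle once full."""
    s, n = len(stages), len(items)
    regs = [None] * s          # regs[st] latches the output of stage st
    out = []
    for cycle in range(s + n - 1):
        # Update from the last stage backwards so each latch is read
        # (by the stage after it) before it is overwritten this cycle.
        for st in reversed(range(s)):
            if st == 0:
                inp = items[cycle] if cycle < n else None
            else:
                inp = regs[st - 1]
            regs[st] = stages[st](inp) if inp is not None else None
        if regs[-1] is not None:
            out.append(regs[-1])   # one completed item per cycle, once the pipe is full
    return out

# A 3-stage pipeline: each item passes through all three stages,
# and after a 2-cycle fill delay one result emerges per cycle.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(run_pipeline(stages, [1, 2, 3, 4]))  # [1, 3, 5, 7]
```

Four items through three stages take 3 + 4 - 1 = 6 cycles here, versus 12 stage-executions done strictly one item at a time.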