CS 211: Computer Architecture ¾ Starting with Simple ILP Using Pipelining ¾ Explicit ILP - EPIC ¾ Key Concept: Issue Multiple Instructions/Cycle Instructor: Prof

Computer Architecture • Part I: Processor Architectures CS 211: Computer Architecture ¾ starting with simple ILP using pipelining ¾ explicit ILP - EPIC ¾ key concept: issue multiple instructions/cycle Instructor: Prof. Bhagi Narahari • Part II: Multi Processor Architectures Dept. of Computer Science ¾ move from Processor to System level Course URL: www.seas.gwu.edu/~narahari/cs211/ ¾ can utilize all the techniques covered thus far ¾ i.e., the processors used in a multi-processor can be EPIC ¾ move from fine grain to medium/coarse grain ¾ assume all processor issues are resolved when discussing system level Multiprocessor design Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU Moving from Fine grained to Coarser Multi-Processor Architectures grained computations. • Introduce Parallel Processing ¾ grains, mapping of s/w to h/w, issues • Overview of Multiprocessor Architectures ¾ Shared-memory, distributed memory ¾ SIMD architectures • Programming and Synchronization ¾ programming constructs, synch constructs, cache • Interconnection Networks • Parallel algorithm design and analysis Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU 1 Hardware and Software Parallelism 10 Software vs. Hardware Parallelism (Example) 11 Software parallelism Hardware Parallelism : (three cycles) -- Defined by machine architecture and hardware multiplicity -- Number of instruction issues per machine cycle -- k issues per machine cycle : k-issue processor L4 Software parallelism : Cycle 1 L1 L2 L3 -- Control and data dependence of programs -- Compiler extensions -- OS extensions (parallel scheduling, shared memory allocation, (communication links) Cycle 2 X2 X1 Implicit Parallelism : -- Conventional programming language -- Parallelizing compiler - Cycle 3 + Explicit Parallelism : -- Parallelising constructs in programming languages -- Parallelising programs development tools -- Debugging, validation, testing, etc. Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU 12 Software vs. Hardware Parallelism (Example) 13 Software vs. Hardware Parallelism (Example) Cycle 1 Hardware parallelism Cycle 1 Hardware parallelism L1 L1 L3 (2-issue processor) (dual processor -- (one memory access, single issue processors) Cycle 2 Cycle 2 L2 one arithmetic operation) L2 L4 Cycle 3 Cycle 3 L3 X1 X1 X2 Cycle 4 Cycle 4 L4 S1 S2 Instructions added Cycle 5 Cycle 5 for IPC X2 L5 L6 } Cycle 6 + Cycle 6 + - Cycle 7 - Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU 2 14 Program Partitioning: 16 Types of Software Parallelism Grains and Latencies Grain : -- Program segment to be executed on a single processor Control Parallelism : -- Coarse-grain, medium-grain, and fine-grain -- Two or more operations performed simultaneously Latency : -- Forms : pipelining or multiple functional units -- Time measure of the communication overhead -- Limitations : pipeline length and multiplicity -- Memory latency of functional units -- Processor latency (synchronization latency) Data Parallelism : Parallelism (Granularity) : -- The same operation performed on many data elements -- Instruction level (fine-grain -- 20 instructions in a segment) -- The highest potential for concurrency -- Loop level (fine grain -- 500 instructions) -- Requires compilation support, parallel programming -- Procedure level (medium grain -- 2000 instructions) languages, and hardware redesign -- Subprogram level (medium grain -- thousands of instructions) -- Job/program level (coarse grain) Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU Levels of Program Grains 17 Partitioning and Scheduling 18 Grain Packing : Jobs and programs Level 5 -- How to partition program into program segments Coarse to get the shortest possible execution time ? grain Increasing Higher degree -- What is the optimal size of concurrent grains ? communication Subprograms, job steps, of parallelism demand and Level 4 or parts of a program scheduling Program Graph : overhead Medium -- Each node (n, s) corresponds to the computational unit : grain n -- node name; s -- grain size Procedures, subroutines, Level 3 or tasks -- Each edge between two nodes (v,d) denotes the output variable v and communications delay d Example: Level 2 Nonrecursive loops or unfolded iterations 1. a := 1 10. j := e x f Fine 2. b := 2 11. k := d x f grain 3. c := 3 12. l := j x k 4. d := 4 13. m := 4 x l Level 1 Instructions or statements 5. e := 5 14. n := 3 x m 6. f := 6 15. o := n x i 7. g := a x b 16. p := o x h 8. h := c x d 17. q := p x q 9. a := d x e Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU 3 19 20 Fine-grain program graph (before packing) Coarse-grain program graph (after packing) 1,1 2,1 3,1 4,1 5,1 6,1 A,8 B,4 C,4 D,8 E,6 n,s Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU 21 Multiprocessor Architectures: 26 Program Flow Mechanisms Scheduling of the fine-grain and coarse-grain programs Control Flow : -- Conventional computers -- Instructions execution controlled by the PC -- Instructions sequence explicitly stated in user program Data Flow : -- Data driven execution -- Instructions executed as soon as their input data are available -- Higher degree of parallelism at the fine-grain level 28 Reduction computers : -- Use reduced instructions set -- Demand driven -- Instructions executed when their results are needed 38 42 Bhagi Narahari, Lab. For EmbeddedFine-grain Systems (LEMS), CS, GWU Coarse-grain Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU 4 Multiprocessor Architectures: Scope of Review: Parallel Processing Intro Course • We will focus on parallel control flow • Long term goal of the field: scale number architectures processors to size of budget, desired performance • Successes today: ¾ dense matrix scientific computing (Petrolium, Automotive, Aeronautics, Pharmaceuticals) ¾ file server, databases, web search engines ¾ entertainment/graphics • Machines today: workstations!! Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU Parallel Architecture Parallel Framework for Communication • Parallel Architecture extends traditional computer architecture with a • Layers: communication architecture ¾ Programming Model: ¾ Multiprogramming : lots of jobs, no communication ¾ abstractions (HW/SW interface) ¾ Shared address space: communicate via memory ¾ organizational structure to realize abstraction ¾ Message passing: send and recieve messages efficiently ¾ Data Parallel: several processors operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing) ¾ Communication Abstraction: ¾ Shared address space: e.g., load, store, atomic swap ¾ Message passing: e.g., send, recieve library calls ¾ Debate over this topic (ease of programming, large scaling) => many hardware designs 1:1 programming model Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU 5 Shared Address/Memory Multiprocessor Model Example: Small-Scale MP Designs • Communicate via Load and Store • Memory: centralized with uniform access ¾ Oldest and most popular model time (“uma”) and bus interconnect, I/O • Based on timesharing: processes on multiple • Examples: Sun Enterprise 6000, SGI processors vs. sharing single processor Challenge, Intel SystemPro • process: a virtual address space Processor Processor Processor Processor and 1 thread of control ¾ Multiple processes can overlap (share), but ALL One or One or One or One or more levels more levels more levels more levels threads share a process address space of cache of cache of cache of cache • Writes to shared address space by one thread are visible to reads of other threads ¾ I/O System Usual model: share code, private stack, some Main memory shared heap, some private heap Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU SMP Interconnect Large-Scale MP Designs • Processors to Memory AND to I/O • Memory: distributed with nonuniform access • Bus based: all memory locations equal time (“numa”) and scalable interconnect access time so SMP = “Symmetric MP” (distributed memory) ¾ Sharing limited BW as add processors, I/O • Examples:Processor T3E:Processor (see Ch. 1, ProcessorFigs 1-21, pageProcessor 45 of 1 cycle[CSG96])+ cache + cache + cache + cache • Crossbar: expensive to expand Memory I/O Memory I/O Memory I/O Memory I/O • Multistage network (less expensive to 40 cycles 100 cycles Low Latency expand than crossbar with more BW) Interconnection Network High Reliability • “Dance Hall” designs: All processors on the left, all memories on the right Memory I/O Memory I/O Memory I/O Memory I/O Processor Processor Processor Processor + cache + cache + cache + cache Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU Bhagi Narahari, Lab. For Embedded Systems (LEMS), CS, GWU 6 Shared Address Model Summary Message Passing Model • Each processor can name every physical • Whole computers (CPU, memory, I/O devices) location in the machine communicate as explicit I/O operations ¾ Essentially NUMA but integrated at I/O devices vs. • Each process can name all data it shares with memory

CS 211: Computer Architecture ¾ Starting with Simple ILP Using Pipelining ¾ Explicit ILP - EPIC ¾ Key Concept: Issue Multiple Instructions/Cycle Instructor: Prof

High Performance Computing Through Parallel and Distributed Processing

Parallel System Performance: Evaluation & Scalability

Scalable Task Parallel Programming in the Partitioned Global Address Space

Oblivious Network RAM and Leveraging Parallelism to Achieve Obliviousness

Compiling for a Multithreaded Dataflow Architecture : Algorithms, Tools, and Experience Feng Li

Massively Parallel Computers: Why Not Prirallel Computers for the Masses?

Scheduling on Asymmetric Parallel Architectures

CUDA C++ Programming Guide

14. Parallel Computing 14.1 Introduction 14.2 Independent

Parallel Programming in Openmp About the Authors

CUDA Dynamic Parallelism

Models for Parallel Computation in Multi-Core, Heterogeneous, and Ultra Wide-Word Architectures