Computer Architecture

CS 211: Computer Architecture
Instructor: Prof. Bhagi Narahari, Dept. of Computer Science
Course URL: www.seas.gwu.edu/~narahari/cs211/

• Part I: Processor Architectures
  - starting with simple ILP using pipelining
  - explicit ILP - EPIC
  - key concept: issue multiple instructions per cycle
• Part II: Multi-Processor Architectures
  - move from the processor to the system level
  - can utilize all the techniques covered thus far
    - i.e., the processors used in a multiprocessor can be EPIC
  - move from fine grain to medium/coarse grain
  - assume all processor-level issues are resolved when discussing system-level multiprocessor design


Multi-Processor Architectures: Moving from Fine-Grained to Coarser-Grained Computations . . .

• Introduce parallel processing
  - grains, mapping of s/w to h/w, issues
• Overview of multiprocessor architectures
  - shared-memory and message-passing architectures
  - SIMD architectures
• Programming and synchronization
  - programming constructs, synchronization constructs, cache coherency
• Interconnection networks
• Parallel algorithm design and analysis


Hardware and Software Parallelism

Hardware parallelism:
-- Defined by machine architecture and hardware multiplicity
-- Number of instruction issues per machine cycle
-- k issues per machine cycle : k-issue processor

Software parallelism:
-- Control and data dependence of programs
-- Compiler extensions
-- OS extensions (parallel scheduling, allocation, communication)
-- Conventional programming language + parallelizing compiler
-- Parallelizing constructs in programming languages
-- Parallel program development tools: debugging, validation, testing, etc.

Software vs. Hardware Parallelism (Example)

Software parallelism (three cycles): the example program has four loads (L1-L4), two multiplies (X1, X2), an add (+), and a subtract (-).
  Cycle 1: L1 L2 L3 L4
  Cycle 2: X1 X2
  Cycle 3: +  -

Software vs. Hardware Parallelism (Example), continued

Hardware parallelism on a 2-issue processor (one memory access and one arithmetic operation per cycle): the same eight instructions now take seven cycles.
  Cycle 1: L1   Cycle 2: L2   Cycle 3: L3 X1   Cycle 4: L4   Cycle 5: X2   Cycle 6: +   Cycle 7: -

Hardware parallelism on a dual processor (two single-issue processors): the work is split across the processors and takes six cycles, but extra instructions (stores S1, S2 and loads L5, L6) must be added for interprocessor communication.
  Cycle 1: L1 / L3   Cycle 2: L2 / L4   Cycle 3: X1 / X2   Cycle 4: S1 / S2   Cycle 5: L5 / L6   Cycle 6: + / -

Types of Software Parallelism

Control parallelism:
-- Two or more operations performed simultaneously
-- Forms: pipelining or multiple functional units
-- Limitations: pipeline length and multiplicity of functional units

Data parallelism:
-- The same operation performed on many data elements
-- The highest potential for concurrency
-- Requires compilation support, parallel programming languages, and hardware redesign

Program Partitioning: Grains and Latencies

Grain:
-- Program segment to be executed on a single processor
-- Coarse-grain, medium-grain, and fine-grain

Latency:
-- Time measure of the communication overhead
-- Memory latency
-- Processor latency (synchronization latency)

Parallelism (granularity):
-- Instruction level (fine grain -- 20 instructions in a segment)
-- Loop level (fine grain -- 500 instructions)
-- Procedure level (medium grain -- 2000 instructions)
-- Subprogram level (medium grain -- thousands of instructions)
-- Job/program level (coarse grain)


Levels of Program Grains

Level 5: Jobs and programs (coarse grain)
Level 4: Subprograms, job steps, or parts of a program (medium grain)
Level 3: Procedures, subroutines, or tasks (medium grain)
Level 2: Nonrecursive loops or unfolded iterations (fine grain)
Level 1: Instructions or statements (fine grain)
Moving down the levels, the degree of parallelism increases; moving up, communication demand and scheduling overhead increase.

Partitioning and Scheduling

Grain packing:
-- How to partition a program into program segments to get the shortest possible execution time?
-- What is the optimal size of concurrent grains?

Program graph:
-- Each node (n, s) corresponds to a computational unit: n -- node name; s -- grain size
-- Each edge label (v, d) between two nodes denotes the output variable v and the communication delay d

Example:
 1. a := 1        10. j := e x f
 2. b := 2        11. k := d x f
 3. c := 3        12. l := j x k
 4. d := 4        13. m := 4 x l
 5. e := 5        14. n := 3 x m
 6. f := 6        15. o := n x i
 7. g := a x b    16. p := o x h
 8. h := c x d    17. q := p x g
 9. i := d x e

Fine-Grain Program Graph (before packing) and Coarse-Grain Program Graph (after packing)

[Figure: the fine-grain graph has one node (n, s) per statement, e.g. nodes (1,1) through (6,1) for the constant assignments; after grain packing the statements are grouped into coarse-grain nodes A,8  B,4  C,4  D,8  E,6, where each label gives the node name and its grain size.]


Program Flow Mechanisms

Control flow:
-- Conventional computers
-- Instruction execution controlled by the PC
-- Instruction sequence explicitly stated in the user program

Data flow:
-- Data-driven execution
-- Instructions executed as soon as their input data are available
-- Higher degree of parallelism at the fine-grain level

Reduction computers:
-- Demand-driven execution based on reducing expressions
-- Instructions executed when their results are needed

Scheduling of the Fine-Grain and Coarse-Grain Programs

[Figure: two-processor schedules of the example program; the fine-grain schedule finishes at time 42, while the coarse-grain (packed) schedule finishes at time 38, since grain packing reduces the communication overhead.]

Multiprocessor Architectures: Scope of Review in this Course

• We will focus on parallel control-flow architectures

Parallel Processing Intro

• Long-term goal of the field: scale the number of processors to the size of the budget and the desired performance
• Successes today:
  - dense matrix scientific computing (petroleum, automotive, aeronautics, pharmaceuticals)
  - file servers, databases, web search engines
  - entertainment/graphics
• Machines today: workstations!!


Parallel Architecture

• Parallel architecture extends traditional computer architecture with a communication architecture
  - abstractions (HW/SW interface)
  - organizational structure to realize the abstraction efficiently

Parallel Framework for Communication

• Layers:
  - Programming model:
    - Multiprogramming: lots of jobs, no communication
    - Shared address space: communicate via memory
    - Message passing: send and receive messages
    - Data parallel: several processors operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing)
  - Communication abstraction:
    - Shared address space: e.g., load, store, atomic swap
    - Message passing: e.g., send, receive library calls
    - Debate over this topic (ease of programming vs. scaling to large machines) => many hardware designs built 1:1 with a programming model


Shared Address/Memory Multiprocessor Model

• Communicate via load and store
  - oldest and most popular model
• Based on timesharing: processes on multiple processors vs. sharing a single processor
• Process: a virtual address space and one thread of control
  - multiple processes can overlap (share), but ALL threads share a process address space
• Writes to the shared address space by one thread are visible to reads by other threads
  - usual model: share code, private stack, some shared heap, some private heap

Example: Small-Scale MP Designs

• Memory: centralized with uniform access time ("UMA") and bus interconnect, I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro

[Figure: several processors, each with one or more levels of cache, connected over a shared bus to main memory and the I/O system.]


SMP Interconnect

• Processors to memory AND to I/O
• Bus based: all memory locations have equal access time, so SMP = "Symmetric MP"
  - sharing limits bandwidth as processors and I/O are added
• Crossbar: expensive to expand
• Multistage network: less expensive to expand than a crossbar, with more bandwidth than a bus
• "Dance hall" designs: all processors on the left, all memories on the right

Large-Scale MP Designs

• Memory: distributed with nonuniform access time ("NUMA") and scalable interconnect (distributed memory)
• Example: Cray T3E (see Ch. 1, Figs 1-21, page 45 of [CSG96])

[Figure: nodes of processor + cache, each with local memory and I/O, connected by an interconnection network; typical access times are 1 cycle to cache, 40 cycles to local memory, and 100 cycles across the network; the network should provide low latency and high reliability.]


Shared Address Model Summary

• Each processor can name every physical location in the machine
• Each process can name all data it shares with other processes
• Data transfer via load and store
• Data size: byte, word, ... or cache blocks
• Uses virtual memory to map virtual addresses to local or remote physical addresses
• The memory hierarchy model applies: communication now moves data into the local processor cache (just as a load moves data from memory to cache)
  - latency, BW (cache block?), when to communicate?

Message Passing Model

• Whole computers (CPU, memory, I/O devices) communicate via explicit I/O operations
  - essentially NUMA, but integrated at the I/O devices rather than the memory system
• Send specifies a local buffer + the receiving process on a remote computer
• Receive specifies the sending process on a remote computer + the local buffer to place the data
  - usually send includes a process tag and receive has a rule on the tag: match one, match any
  - synch: when send completes, when the buffer is free, when the request is accepted, receive waits for send
• Send + receive => memory-to-memory copy, where each side supplies a local address, AND pairwise synchronization!

Message Passing Model (continued)

• Send + receive => memory-to-memory copy plus synchronization, handled by the OS even on one processor
• History of message passing:
  - network topology was important because a node could only send to its immediate neighbors
  - typically synchronous, blocking send & receive
  - later, DMA with non-blocking sends; DMA on receive into a buffer until the processor does a receive, and then the data is transferred to local memory
  - later, SW libraries to allow arbitrary communication
• Example: IBM SP-2, RS6000 workstations in racks
  - Network Interface Card has an Intel 960
  - 8x8 crossbar switch as the communication building block
  - 40 MByte/sec per link

Communication Models

• Shared memory
  - processors communicate through a shared address space
  - easy on small-scale machines
  - advantages: model of choice for uniprocessors and small-scale MPs; ease of programming; lower latency; easier to use hardware-controlled caching
• Message passing
  - processors have private memories and communicate via messages (a minimal sketch follows below)
  - advantages: less hardware, easier to design; focuses attention on costly non-local operations
• Can support either SW model on either HW base
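A minimal message-passing sketch in C using MPI (my choice of library for illustration; the slides do not prescribe one). Rank 0 sends a buffer to rank 1, which receives it into its own local buffer; the blocking MPI_Send/MPI_Recv pair corresponds to the synchronous send-receive described above. Run with at least two ranks, e.g. mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data[4] = {1, 2, 3, 4};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* send: local buffer + destination rank + message tag */
        MPI_Send(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int buf[4];
        /* receive: source rank + matching tag + local buffer for the data */
        MPI_Recv(buf, 4, MPI_INT, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d %d %d %d\n", buf[0], buf[1], buf[2], buf[3]);
    }
    MPI_Finalize();
    return 0;
}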


Popular Flynn Architecture Categories

• SISD (Single Instruction, Single Data)
  - uniprocessors
• MISD (Multiple Instruction, Single Data)
  - ???
• SIMD (Single Instruction, Multiple Data)
  - examples: Illiac-IV, CM-2
  - simple programming model, low overhead, flexibility, all custom integrated circuits
• MIMD (Multiple Instruction, Multiple Data)
  - examples: Sun Enterprise 5000, Cray T3D, SGI Origin
  - flexible
  - use off-the-shelf micros

SISD: A Conventional Computer

[Figure: instructions and data input flow into a single processor, which produces the data output.]
=> Speed is limited by the rate at which the computer can transfer information internally.
Ex: PC, Macintosh, workstations

The MISD Architecture

[Figure: several instruction streams (A, B, C) drive separate processors that all operate on a single data stream.]
=> More of an intellectual exercise than a practical configuration; few were built, and none are commercially available.

SIMD Architecture

[Figure: a single instruction stream drives processors A, B, C, each operating on its own data stream, e.g. Ci <= Ai * Bi.]
Ex: Cray vector processing, Thinking Machines CM*


MIMD Architecture

[Figure: independent instruction streams A, B, C drive processors A, B, C, each with its own data stream.]
• Unlike SISD and MISD machines, an MIMD computer works asynchronously
• Shared-memory (tightly coupled) MIMD
• Distributed-memory (loosely coupled) MIMD

Shared Memory MIMD Machine

[Figure: processors A, B, C, each with a memory bus, connected to a global memory system.]
• Communication: the source PE writes data to global memory and the destination PE retrieves it
=> Easy to build; conventional OSes for SISD machines can easily be ported
=> Limitation: reliability & expandability -- a memory component or any processor failure affects the whole system
=> Increasing the number of processors leads to memory contention
Ex: Silicon Graphics machines

Distributed Memory MIMD

[Figure: processors A, B, C, each with its own memory bus and local memory system, connected by IPC channels.]
• Communication: IPC over a high-speed network
• The network can be configured as a tree, mesh, cube, etc.
• Unlike shared-memory MIMD:
  - easily/readily expandable
  - highly reliable (any CPU failure does not affect the whole system)

Data Parallel Model

• Operations can be performed in parallel on each element of a large regular data structure, such as an array
• One control processor broadcasts to many PEs
  - when computers were large, this amortized the control portion over many replicated PEs
• Condition flag per PE so that individual PEs can skip an operation
• Data distributed across the PE memories
• Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs + memory on a chip was the PE
• Data parallel programming languages lay out data across the processors

Data Parallel Model (continued)

• Vector processors have similar ISAs, but no data placement restriction
• SIMD led to data parallel programming languages
• Advancing VLSI led to single-chip FPUs and fast microprocessors (making SIMD less attractive)
• The SIMD programming model led to the Single Program Multiple Data (SPMD) model
  - all processors execute an identical program
• Data parallel programming languages are still useful; they do communication all at once: "bulk synchronous" phases in which all processors communicate after a global barrier

Convergence in Parallel Architecture

• Complete computers connected to a scalable network via a communication assist
• Different programming models place different requirements on the communication assist
  - shared address space: tight integration with memory to capture memory events that interact with others, and to accept requests from other nodes
  - message passing: send messages quickly and respond to incoming messages (tag match, allocate buffer, transfer data, wait for receive posting)
  - data parallel: fast global synchronization
• High Performance Fortran is shared-memory, data parallel; the Message Passing Interface is a message-passing library; both work on many machines, with different implementations

Fundamental Issues

• Three issues characterize parallel machines:
  1) Naming / program partitioning
  2) Synchronization
  3) Latency and bandwidth

Fundamental Issue #1: Naming

• Naming: how to solve a large problem fast
  - what data is shared
  - how it is addressed
  - what operations can access the data
  - how processes refer to each other
• The choice of naming affects the code produced by a compiler: a load where you just remember an address, vs. keeping track of a processor number and a local virtual address for message passing
• The choice of naming affects replication of data: via load into the cache memory hierarchy, or via SW replication and consistency


Fundamental Issue #1: Naming (continued)

• Global physical address space: any processor can generate the address and access it in a single operation
  - memory can be anywhere: virtual address translation handles it
• Global virtual address space: the address space of each process can be configured to contain all shared data of the parallel program
• Segmented shared address space: shared locations are named uniformly for all processes of the parallel program

Fundamental Issue #2: Synchronization

• To cooperate, processes must coordinate
• Message passing is implicit coordination with the transmission or arrival of data
• A shared address space => additional operations to coordinate explicitly:
  e.g., write a flag, awaken a thread, interrupt a processor
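A minimal sketch of the "write a flag" style of explicit coordination in a shared address space (my example, not from the slides): one thread produces a value and sets a flag, another spins on the flag before reading. C11 atomics with release/acquire ordering keep the data write visible before the flag.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int shared_data;                      /* data written by the producer */
static atomic_int ready = 0;                 /* coordination flag */

void *producer(void *arg) {
    shared_data = 42;                        /* write shared data first */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

void *consumer(void *arg) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                    /* spin until the flag is set */
    printf("consumer sees %d\n", shared_data);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}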


Fundamental Issue #3: Latency and Bandwidth

• Bandwidth
  - need high bandwidth in communication
  - cannot scale, but try to stay close
  - match the limits in network, memory, and processor
  - the overhead to communicate is a problem in many machines
• Latency
  - affects performance, since a processor may have to wait
  - affects ease of programming, since overlapping communication and computation requires more thought
• Latency hiding
  - how can a mechanism help hide latency?
  - examples: overlap a message send with computation, prefetch data, switch to other tasks

Small-Scale Shared Memory

• Caches serve to:
  - increase bandwidth versus the bus/memory
  - reduce latency of access
  - valuable for both private data and shared data
• What about cache consistency?

[Figure: several processors, each with one or more levels of cache, share a bus to main memory and the I/O system.]

The Problem of Cache Coherency

[Figure: three CPUs with caches above a shared memory.
 (a) Cache and memory coherent: A' = A & B' = B.
 (b) Cache and memory incoherent after the CPU writes A' = 550: A' != A (A stale).
 (c) Cache and memory incoherent after I/O outputs A (gives 100) and inputs 440 to B: B' != B (B' stale).]

What Does Coherency Mean?

• Informally:
  - "Any read must return the most recent write"
  - too strict and too difficult to implement
• Better:
  - "Any write must eventually be seen by a read"
  - all writes are seen in proper order ("serialization")
• Two rules to ensure this:
  - "If P writes x and P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart"
  - writes to a single location are serialized: seen in one order
    - the latest write will be seen
    - otherwise one could see writes in an illogical order (an older value after a newer value)


Cache Coherency Solutions

• More detail after we cover cache and memory design:
  - snooping solution (snoopy bus)
  - directory-based schemes

Synchronization

• Why synchronize? We need to know when it is safe for different processes to use shared data
• Issues for synchronization:
  - an uninterruptible instruction to fetch and update memory (atomic operation)
  - user-level synchronization operations built on this primitive
  - for large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce the contention and latency of synchronization

Hardware-Level Synchronization

• The key is to provide an uninterruptible instruction or instruction sequence capable of atomically retrieving and updating a value
  - S/W synchronization mechanisms are then constructed from these H/W primitives
• Special load: load linked
• Special store: store conditional
  - if the contents of the memory location changed before the store conditional, the store conditional fails
  - store conditional returns a value specifying success or failure

Uninterruptible Instruction to Fetch and Update Memory

• Atomic exchange: interchange a value in a register with a value in memory (see the sketch below)
  - 0 => synchronization variable is free; 1 => synchronization variable is locked and unavailable
  - set the register to 1 & swap
  - the new value in the register determines success in getting the lock:
    0 if you succeeded in setting the lock (you were first), 1 if another processor had already claimed access
  - the key is that the exchange operation is indivisible
• Test-and-set: tests a value and sets it if the value passes the test
• Fetch-and-increment: returns the value of a memory location and atomically increments it
  - 0 => synchronization variable is free
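A spin lock built on atomic exchange, as described above. This sketch assumes the GCC/Clang __atomic builtins (my choice, not named in the slides): the exchange returns the old value, so seeing 0 means the caller obtained the lock, while 1 means another processor already holds it.

#include <stdio.h>

static int lock = 0;                         /* 0 = free, 1 = locked */

void acquire(int *l) {
    /* atomically swap 1 into the lock; an old value of 0 means we got it */
    while (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE) != 0)
        ;                                    /* spin while someone else holds it */
}

void release(int *l) {
    __atomic_store_n(l, 0, __ATOMIC_RELEASE);    /* mark the lock free again */
}

int main(void) {
    acquire(&lock);
    puts("in critical section");
    release(&lock);
    return 0;
}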


Coordination/Synchronization Constructs

• For both shared memory and message passing there are two types of synchronization activity:
  - sequence control ... to enable correct operation
  - access control ... to allow access to common resources
• Synchronization activities constitute an overhead!
• For SIMD these are done at the machine (H/W) level

Synchronization Constructs

• Barrier synchronization
  - for sequence control
  - processors wait at the barrier until all (or a subset) have completed
  - hardware implementations are available
  - can also be implemented in s/w
• Critical-section access control mechanisms
  - Test&Set lock protocols
  - semaphores


Barrier Synchronization

• Many programs require that all processes reach a "barrier" point before proceeding further
  - this constitutes a synchronization point
• Concept of a barrier:
  - when a processor hits a barrier it cannot proceed further until ALL processors have hit the barrier point
  - note that this forces a global synchronization point
• Can be implemented in S/W or hardware
  - in s/w it can be implemented with a shared variable; each processor checks the value of the shared variable

Barrier Synchronization Example

For i := 1 to N do in parallel
    A[i] := k * A[i];
    B[i] := A[i] + B[i];
endfor

BARRIER POINT

for i := 1 to N do in parallel
    C[i] := B[i] + B[i-1] + B[i-2];
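A software rendering of the example above, assuming POSIX pthread barriers (one possible implementation among several, with k = 2.0 and sizes chosen for illustration): each thread scales its slice of A and updates B, waits at the barrier, and only then computes C, which needs neighbouring B values produced by other threads.

#include <pthread.h>
#include <stdio.h>

#define N 8
#define NTHREADS 4

static double A[N], B[N], C[N];
static pthread_barrier_t barrier;

void *worker(void *arg) {
    long id = (long)arg;
    int lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;  /* this thread's slice */

    for (int i = lo; i < hi; i++) {          /* phase 1 */
        A[i] = 2.0 * A[i];
        B[i] = A[i] + B[i];
    }
    pthread_barrier_wait(&barrier);          /* wait until every B[i] is ready */
    for (int i = lo; i < hi; i++)            /* phase 2 reads neighbours' B values */
        C[i] = B[i] + (i > 0 ? B[i-1] : 0) + (i > 1 ? B[i-2] : 0);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    printf("C[N-1] = %f\n", C[N-1]);
    return 0;
}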


Barrier Synchronization: Implementation

• Bus based
  - each processor sets a single bit when it arrives at the barrier
  - the collection of bits is sent to AND (or OR) gates
  - the outputs of the gates are sent back to all processors
  - the number of synchronizations per cycle grows with N (processors) if a change in one processor's bit can be propagated in a single cycle
  - takes O(log N) in reality
  - how is the performance delay due to the barrier measured?
• Multiple barrier lines
  - a barrier bit is sent to each processor
  - each processor can set a bit for each barrier line
  - X1,...,Xn in the processor; Y1,...,Yn is the barrier setting

Synchronization: Message Passing

• Synchronous vs. asynchronous
• Synchronous: the sending and receiving processes synchronize in time and space
  - the system must check that (i) the receiver is ready, (ii) a path is available, and (iii) one or more messages are to be sent to the same or multiple destinations
  - also known as blocking send-receive
  - the sending and receiving processes cannot continue past the instruction until the message transfer is complete
• Asynchronous: send & receive do not have to synchronize


Lock Protocols

Test&Set(lock):
    temp <- lock
    lock := 1
    return temp

Reset(lock):
    lock := 0

A process waits for the lock to become 0; indefinite waiting can be removed by ???

Semaphores

• P(S) for shared variable/section S
  - test if S > 0: if so, decrement S and enter the critical section; else wait
• V(S)
  - increment S and exit
• Note that P and V are blocking synchronization constructs
• Can allow a number of concurrent accesses to S


Semaphores: Example

Compute Z = A*B + [ (C*D) * (I+G) ]
var S_w, S_y are semaphores; initially S_w = S_y = 0

P1: begin  U = A*B;  P(S_y);  Z = U + Y  end
P2: begin  W = C*D;  V(S_w)  end
P3: begin  X = I+G;  P(S_w);  Y = W*X;  V(S_y)  end

Next: Distributed Memory MPs

• Multiple processors connected through an interconnection network
• Network properties play a vital role in system performance
• Next...
  - interconnection network definitions
  - examples of routing on static topology networks -- you are required to read the notes for some detailed discussion of this
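A sketch of the example above using POSIX unnamed semaphores and three threads (the input values are illustrative): P maps to sem_wait and V to sem_post, and the two semaphores force Y = W*X to be computed before Z = U + Y.

#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>

static double A = 2, B = 3, C = 4, D = 5, I = 6, G = 7;   /* sample inputs */
static double U, W, X, Y, Z;
static sem_t S_w, S_y;                                     /* both start at 0 */

void *P1(void *arg) { U = A * B; sem_wait(&S_y); Z = U + Y; return NULL; }
void *P2(void *arg) { W = C * D; sem_post(&S_w); return NULL; }
void *P3(void *arg) { X = I + G; sem_wait(&S_w); Y = W * X; sem_post(&S_y); return NULL; }

int main(void) {
    pthread_t t1, t2, t3;
    sem_init(&S_w, 0, 0);                    /* P3 must wait for P2's V(S_w) */
    sem_init(&S_y, 0, 0);                    /* P1 must wait for P3's V(S_y) */
    pthread_create(&t1, NULL, P1, NULL);
    pthread_create(&t2, NULL, P2, NULL);
    pthread_create(&t3, NULL, P3, NULL);
    pthread_join(t1, NULL); pthread_join(t2, NULL); pthread_join(t3, NULL);
    printf("Z = %f (expected %f)\n", Z, A*B + (C*D)*(I+G));
    sem_destroy(&S_w); sem_destroy(&S_y);
    return 0;
}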


Interconnection Networks

Two types:
-- direct networks with static interconnections: point-to-point direct connections between system elements
-- indirect networks with dynamic interconnections: dynamically programmable switched channels

Relevant aspects:
-- scalability
-- communication efficiency (latency)
-- flexibility of reconfiguration
-- complexity
-- cost

Basic Concepts and Definitions

Node degree: the number of edges (links) connected to a node

Diameter: the maximum, over all pairs of nodes, of the shortest path between the two nodes

Bisection:
-- channel bisection width (b) is the number of edges cut along a bisection of the network
-- wire bisection width (B = bw, for channel width w) is the number of wires cut along a bisection of the network

Data routing functions: simple (primitive) and complex
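For concreteness (numbers added here, not in the slides): a binary 3-cube with N = 8 nodes has node degree 3, diameter 3 (the longest of the shortest paths between any pair of nodes), and channel bisection width b = 4; with 16-bit channels the wire bisection width is B = bw = 4 x 16 = 64.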


Network Performance

Functionality: data routing, interrupt handling, synchronization, request/response combining, . . .

Network latency: the worst-case time delay for a message to be transferred through the network

Bandwidth: maximum data transfer rate (Mbits/sec)

Hardware complexity: implementation costs for components

Scalability: modular expansion with increasing machine resources

Static Connection Networks

[Figure: example static topologies, including the linear array and the ring.]


Bus Connection

[Figure: processors P1..Pn with caches C1..Cn and the I/O subsystem attached to an interconnection bus, which connects to memories M1..Mn and secondary storage.]

Dynamic Connection Networks

Characteristics: connections established dynamically based on program demands

Types: bus system, multistage interconnection network, crossbar switch network

Priorities: arbitration logic

Contention: conflict resolution


Interconnection Networks (continued)

• The topology of the interconnection network (ICN) determines routing delays
  - need efficient routing algorithms
• Switching techniques also determine latency
  - packet, circuit, wormhole
• Details of some static topology networks are covered in the notes
  - you are required to read these notes

Crossbar Network

[Figure: processors P1..Pn connected through a crossbar switch to memories M1..Mn.]


SIMD Architectures

• Single Instruction stream, Multiple Data stream
  - each processor executes the same instruction on different data
• Efficient in applications requiring large amounts of data processing
  - low-level image processing
  - multimedia processing
  - scientific computing
• Synchronization is implicit
  - all processors are in lock step with the control unit

SIMD Architectures (continued)

• Control Unit (CU)
  - broadcasts instructions to the processors
  - has memory for the program
  - executes control-flow instructions (branches)
• Processing Elements (PE)
  - data distributed among the PE memories
  - each PE can be enabled or disabled using a mask
  - MASK instruction broadcast by the CU


SIMD PE Organization

• Simple processors
  - do not need to fetch instructions
  - can be simple microcontrollers
• CPU
• Local memory to store data
• General-purpose registers
• Address register -- the address (id) of the PE
• Data transfer registers for the network (DTR_in, DTR_out)
• Status flag -- enabled/disabled
• Index register -- used in memory access
  - offset by x_i in memory i of PE i

SIMD Masking Schemes

• All PEs execute the same instruction (broadcast by the CU)
  - the masking scheme allows a subset of PEs to ignore/suspend it
  - only enabled processors execute the instruction
  - the masking/status register denotes whether the PE is enabled
    - if Reg = 1 then the PE is active, else inactive
  - the CU can broadcast a MASK vector
    - one bit for each PE, or log N bits to enable sets of PEs
  - data/conditional masks
    - allow each PE to set its mask register depending on data
    - e.g., "If A < 0 then S = 1" sets the mask to 1 if the value of A in the PE's local memory is less than 0
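A scalar C sketch of data-dependent masking (array contents and the masked operation are my illustration): each "PE" i sets its mask bit from its local A value, and the subsequent broadcast operation takes effect only where the mask is 1.

#include <stdio.h>

#define NPE 8

int main(void) {
    int A[NPE] = {3, -1, 4, -5, 9, -2, 6, 0};
    int S[NPE];                              /* per-PE mask register */

    for (int i = 0; i < NPE; i++)            /* "If A < 0 then S = 1" */
        S[i] = (A[i] < 0);

    for (int i = 0; i < NPE; i++)            /* broadcast operation: negate A */
        if (S[i])                            /* disabled PEs ignore it */
            A[i] = -A[i];

    for (int i = 0; i < NPE; i++)
        printf("%d ", A[i]);                 /* negative values have been flipped */
    printf("\n");
    return 0;
}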


Processing on SIMD

• The CU broadcasts instructions
• The PEs execute them -- they can be simple decode-and-execute units
• The CU can also broadcast a data value
• The time taken to process a task is the time to complete the task on all processors
• All processors execute the same instruction in one cycle
  - note also that the processors are hardware synchronized (tied to the same clock?)

Matrix Multiplication on SIMD Using the CU

• Assume we have N processors in a SIMD configuration
• Algorithm to multiply two N x N matrices, using the CU to broadcast an element:

For i := 1 to N do
    For j := 1 to N do
        C[i,j] := 0
        for k := 1 to N do
            C[i,j] := C[i,j] + A[i,k]*B[k,j]

• Note each row of A is required N times, for C[i,1], C[i,2], ..., C[i,N]


Matrix Multiplication on SIMD Using the CU (continued)

• Assume each processor P_k stores column k of matrix B
• The CU broadcasts the current value of A
• Each processor k computes C[i,k] for all values of i
  - processor k computes column k of the result matrix

Sample Code

For i := 1 to N do
    In parallel for ALL processors P_k (i.e., enable all processors)
        Broadcast i          /* send the value of i to all processors */
        C[i] := 0            /* initialize C[i] to 0 in each processor k */
        For j := 1 to N do
            Broadcast j
            Broadcast A[i,j]
            MULT A[i,j], B[j] -> temp
            ADD  C[i], temp  -> C[i]
        Endfor (j loop)
Endfor
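A C sketch of the column-distributed scheme above, with an OpenMP loop standing in for the N PEs (an assumption; the slides use SIMD pseudocode). Each "processor" k owns column k of B and of C and accumulates C[i][k] as the control loops broadcast A[i][j]. Compile with -fopenmp; without it the pragma is ignored and the code simply runs sequentially.

#include <stdio.h>

#define N 4

int main(void) {
    double A[N][N], B[N][N], C[N][N];
    for (int i = 0; i < N; i++)                  /* sample data */
        for (int j = 0; j < N; j++) { A[i][j] = i + 1; B[i][j] = j + 1; C[i][j] = 0; }

    for (int i = 0; i < N; i++)                  /* CU loops over i and j ... */
        for (int j = 0; j < N; j++) {
            double a = A[i][j];                  /* ... and broadcasts A[i][j] */
            #pragma omp parallel for             /* PE k updates its own column */
            for (int k = 0; k < N; k++)
                C[i][k] += a * B[j][k];
        }

    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}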


Time Analysis

• There are N^2 iterations at the control unit
  - time taken is N^2
• Instructions are broadcast to all PEs
• Essentially the k loop (over result columns) has been parallelized across the N processors
• Requires each processor to store N elements of B and N elements of the result matrix
• Ideal speedup and efficiency
  - speedup of N using N processors, for 100% efficiency

Storage Rules

• In the previous example, the algorithm required that a row of B be operated on at each cycle
  - since B was stored column-wise (one column per processor), this was not a problem
• What if a column of B has to be processed at each cycle?
  - since an entire column is stored in one processor, this requires N cycles
  - no parallelism, and a waste of N processors
• Need better ways to store matrices
  - allow row or column fetching in parallel
  - skewed storage rules allow this
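A sketch of one skewed storage rule (the specific formula is my illustration; the slides only name the technique): element A[i][j] of an N x N matrix is placed in memory module (i + j) mod N. Within row i the column index varies and within column j the row index varies, so in both cases the N elements land in N distinct modules and can be fetched in one parallel cycle.

#include <stdio.h>

#define N 4

/* skewed storage: memory module holding A[i][j] */
int module_of(int i, int j) { return (i + j) % N; }

int main(void) {
    printf("modules touched by row 1    : ");
    for (int j = 0; j < N; j++) printf("%d ", module_of(1, j));   /* all distinct */
    printf("\nmodules touched by column 2 : ");
    for (int i = 0; i < N; i++) printf("%d ", module_of(i, 2));   /* all distinct */
    printf("\n");
    return 0;
}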


Progression to MIMD

• Multiple Instruction, Multiple Data
• Shared memory or distributed memory
• Each processor executes its own program
  - each processor must store its instructions and data
  - larger memory required
  - more complex (than SIMD) processors
  - can also have heterogeneous processors

MIMD Issues

• H/W
  - processors are more complex and need more memory
  - flexible communication
• S/W
  - each processor creates and terminates processes -> language constructs needed
  - O/S at each node
  - coordination/synchronization constructs
    - shared memory
    - message passing
  - load balancing and program partitioning
  - algorithm design: exploit functional parallelism

Language Constructs

• Similar to concurrent programming
• Language constructs to express parallelism must:
  - define the subtasks to be executed in parallel
  - start and stop their execution
  - coordinate and specify interaction
• Examples (a FORK-JOIN sketch follows below):
  - FORK-JOIN (subsumes all other models)
  - Cobegin-Coend (Parbegin-Parend)
  - Forall/Doall
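A minimal FORK-JOIN sketch with POSIX threads (one concrete realization of the construct, not the slides' notation): the parent forks two subtasks that run in parallel and joins them before continuing.

#include <pthread.h>
#include <stdio.h>

void *subtask(void *arg) {
    printf("subtask %ld running in parallel\n", (long)arg);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, subtask, (void *)1L);   /* FORK subtask 1 */
    pthread_create(&t2, NULL, subtask, (void *)2L);   /* FORK subtask 2 */
    pthread_join(t1, NULL);                           /* JOIN */
    pthread_join(t2, NULL);
    printf("both subtasks finished\n");
    return 0;
}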


Moving from Multiprocessors to Distributed Systems

Local Calls (Subroutines)

[Figure: on the local computer, the main program calls ABC(a, b, c) in a library; control transfers to the subroutine and returns when it completes.]

Remote Calls (Sockets)

[Figure: the main program on the local computer sends (a, b, c) over the network to a remote computer identified by an IP number and port; the remote side receives (a, b, c), executes ABC(a, b, c), and returns the result over the network.]


Object Request Broker (ORB) and Java Remote Method Invocation (RMI)

[Figure: in both schemes the main program on the local computer calls ABC(a, b, c); the call is carried across the network by middleware -- an ORB platform in the ORB case, the JVM's RMI machinery in the Java case -- and the method executes on the remote computer.]


Scalability

• Performance must scale with
  - system size
  - problem/workload size
• Amdahl's Law (a worked instance follows below)
  - perfect speedup cannot be achieved, since there is an inherently sequential part to every program
• Scalability measures
  - efficiency (speedup per processor)

Parallel Algorithms

• Solving problems on a multiprocessor architecture requires the design of parallel algorithms
• How do we measure the efficiency of a parallel algorithm?
  - 10 seconds on machine 1 and 20 seconds on machine 2 -- which algorithm is better?
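A worked instance (the numbers are chosen here for illustration): if a fraction f of a program is parallelizable, Amdahl's Law bounds the speedup on P processors by S(P) = 1 / ((1 - f) + f/P). With f = 0.9 and P = 16, S = 1 / (0.1 + 0.9/16) = 6.4 and the efficiency is S/P = 0.4; even with unlimited processors the speedup can never exceed 1/(1 - f) = 10.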


Parallel Algorithm Complexity

• Parallel time complexity
  - expressed in terms of input size and system size (number of processors)
  - T(N,P): input size N, P processors
  - relationship between N and P:
    - independent size analysis -- no link between N and P
    - dependent size -- P is a function of N, e.g., P = N/2
• Speedup: how much faster than sequential
  - S(P) = T(N,1) / T(N,P)
• Efficiency: speedup per processor
  - S(P) / P

Parallel Computation Models

• Shared memory
  - protocol for shared memory? What happens when two processors/processes try to access the same data?
  - EREW: Exclusive Read, Exclusive Write
  - CREW: Concurrent Read, Exclusive Write
  - CRCW: Concurrent Read, Concurrent Write
• Distributed memory
  - explicit communication through message passing
  - send/receive instructions


Formal Models of Parallel Computation

• Alternating Turing machine
• P-RAM model
  - extension of the sequential Random Access Machine (RAM) model
• RAM model
  - one program
  - one memory
  - one accumulator
  - one read/write tape

P-RAM Model

• P programs, one per processor
• One memory
  - in distributed memory it becomes P memories
• P accumulators
• One read/write tape
• Depending on the shared memory protocol we have
  - EREW PRAM
  - CREW PRAM
  - CRCW PRAM


PRAM Model

• Assumes synchronous execution
• Idealized machine
  - helps in developing theoretically sound solutions
  - actual performance will depend on machine characteristics and language implementation

PRAM Algorithms -- Summing

• Add N numbers in parallel using P processors
  - how to parallelize?


Parallel Summing

• Using N/2 processors, N numbers can be summed in O(log N) time by pairwise (tree) addition
• Independent size analysis (see the sketch below):
  - each processor does a sequential sum of N/P values, and the P partial sums are then combined in O(log P) parallel steps, for O(N/P + log P) time overall

Parallel Sorting on CREW PRAM

• Sort N numbers using P processors
  - assume P is unlimited for now
• Given an unsorted list (a1, a2, ..., an), create the sorted list W, where W[i] holds the element of rank i
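A sketch of the sum-then-combine scheme in C with OpenMP (array size and data are illustrative): each thread sums roughly N/P elements sequentially and the reduction combines the partial sums. Compile with -fopenmp; without it the loop simply runs on one thread.

#include <stdio.h>

#define N 1024

int main(void) {
    double a[N], sum = 0.0;
    for (int i = 0; i < N; i++) a[i] = 1.0;      /* sample data: true sum = N */

    /* each thread sums its own block; reduction(+) combines the partial sums */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}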


Parallel Sorting on CREW PRAM (continued)

• Using P = N^2 processors
• Each processor P(i,j) compares ai and aj
  - if ai > aj then R[i,j] = 1 else R[i,j] = 0
  - time = O(1)
• Each "row" of processors P(i,j), j = 1 to N, does a parallel sum to compute the rank
  - compute R[i] = sum of R[i,j]
  - time = O(log N)
• Write ai into W[R(i)]
• Total time complexity = O(log N) (a C sketch of this rank-based sort follows below)

Parallel Algorithms

• The design of a parallel algorithm has to take the system architecture into consideration
• Must minimize interprocessor communication in a distributed-memory system
  - communication time is much larger than computation time
  - communication time can dominate computation if the problem is not "partitioned" well
• Efficiency
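A C sketch of the rank-based (enumeration) sort above, with an OpenMP loop standing in for the PRAM processors (my rendering, not the slides'). Ranks are computed by comparisons; ties are broken by index so every element gets a distinct slot in W.

#include <stdio.h>

#define N 8

int main(void) {
    int a[N] = {5, 2, 9, 1, 7, 2, 8, 3};
    int W[N];

    /* conceptually, one processor per (i, j) pair does a comparison;
       the per-row sum of those comparisons is the rank of a[i] */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        int rank = 0;
        for (int j = 0; j < N; j++)
            if (a[j] < a[i] || (a[j] == a[i] && j < i))   /* tie-break by index */
                rank++;
        W[rank] = a[i];                                    /* write a[i] into W[R(i)] */
    }

    for (int i = 0; i < N; i++)
        printf("%d ", W[i]);                               /* sorted output */
    printf("\n");
    return 0;
}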


Next topics…

• Memory design
  - single processor -- high-performance processors
  - focus on cache
  - multiprocessor cache design
• "Special architectures"
  - embedded systems
  - reconfigurable architectures -- FPGA technology
  - cluster and networked computing

