Multiprocessors - Flynn's Taxonomy (1966)

• Single Instruction stream, Single Data stream (SISD)
  – Conventional uniprocessor
  – Although ILP is exploited
    • Single -> single instruction stream
    • The data is not "streaming"
• Single Instruction stream, Multiple Data stream (SIMD)
  – Popular for some applications like image processing
  – One can construe vector processors to be of the SIMD type
  – MMX extensions to the ISA reflect the SIMD philosophy
    • Also apparent in "multimedia" processors (Equator Map-1000)
  – "Data Parallel" programming paradigm

Flynn's Taxonomy (c'ed)

• Multiple Instruction stream, Single Data stream (MISD)
  – Don't know of any
• Multiple Instruction stream, Multiple Data stream (MIMD)
  – The most general
  – Covers:
    • Shared-memory multiprocessors
    • Multicomputers (including networks of workstations cooperating on the same problem)
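The SIMD "data parallel" idea — one instruction stream applied to many data elements in lockstep, versus SISD's one element at a time — can be sketched in plain Python. This is only a toy illustration of the programming model (the chunking stands in for vector instructions; real SIMD hardware performs each chunk in a single instruction):

```python
# Toy illustration of the SIMD / "data parallel" idea: the same
# operation is applied to a whole chunk of data per step, instead
# of one scalar per step (SISD).

def sisd_add(a, b):
    """Scalar (SISD) processing: one add per loop iteration."""
    out = []
    for x, y in zip(a, b):          # one instruction, one data element
        out.append(x + y)
    return out

def simd_add(a, b, width=4):
    """Process the data in fixed-width 'vector' chunks, the way an
    MMX-style unit applies one add to several elements at once."""
    out = []
    for i in range(0, len(a), width):
        # one conceptual vector instruction covers `width` elements
        out.extend(x + y for x, y in zip(a[i:i+width], b[i:i+width]))
    return out

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
assert sisd_add(a, b) == simd_add(a, b)
```

Both functions compute the same result; the difference is how many elements one "instruction" covers.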

Multiprocessors CSE 471 Aut 01 1

Shared-memory Multiprocessors

• Shared-Memory = single shared address space (extension of the uniprocessor model; communication via Load/Store)
• Uniform Memory Access: UMA
  – Today, almost uniquely in shared-bus systems
  – The basis for SMPs (Symmetric MultiProcessors)
  – Cache coherence enforced by "snoopy" protocols
  – Form the basis for clusters (but in clusters, access to the memory of other clusters is not UMA)

Shared-memory Multiprocessors (c'ed)

• Non-Uniform Memory Access: NUMA
  – NUMA-CC: cache coherent (directory-based protocols or SCI)
  – NUMA without cache coherence (enforced by software)
  – COMA: Cache-Only Memory Architecture
  – Clusters
• Distributed Shared Memory: DSM
  – Most often a network of workstations
  – The shared address space is the "virtual address space"
  – The O.S. enforces coherence on a page-per-page basis
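The "communication via Load/Store" point can be sketched with two threads sharing one address space (Python threads stand in for processors; the variable names are illustrative). One thread simply stores a value into shared memory and the other loads it — no explicit message is ever sent:

```python
import threading

# Two "processors" (threads) share one address space and communicate
# by storing to and loading from a shared location.

shared = {"flag": False, "data": None}   # the shared memory
ready = threading.Condition()

def producer():
    with ready:
        shared["data"] = 42              # store into shared memory
        shared["flag"] = True
        ready.notify()

def consumer(result):
    with ready:
        while not shared["flag"]:        # wait until the store is visible
            ready.wait()
        result.append(shared["data"])    # load from shared memory

result = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(result,))
t2.start(); t1.start()
t1.join(); t2.join()
assert result == [42]
```

The condition variable plays the role the hardware coherence protocol plays on a real machine: it makes the producer's store visible to the consumer in a well-defined order.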


Message-passing Systems

• Processors communicate by messages
  – Primitives are of the form "send", "receive"
  – The user (programmer) has to insert the messages
  – Message-passing libraries (e.g., MPI)
• Communication can be:
  – Synchronous: the sender must wait for an ack from the receiver (e.g., in RPC)
  – Asynchronous: the sender does not wait for a reply to continue

Shared-memory vs. Message-passing

• An old debate that is no longer very important
• Many systems are built to support a mixture of both paradigms
  – "send, receive" can be supported by the O.S. in shared-memory systems
  – "load/store" in a virtual address space can be used in a message-passing system (the message-passing library can use "small" messages to that effect, e.g., passing a pointer to a memory area in another computer)
• Does a network of workstations, with a page as the unit of coherence, follow the shared-memory paradigm or the message-passing paradigm?
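The point that "send, receive" can be supported on top of shared memory — and the synchronous vs. asynchronous distinction — can be sketched with per-processor mailboxes shared between two threads (names like `send_sync` are illustrative, not from any real library):

```python
import queue
import threading

# "send" / "receive" primitives layered on top of shared memory:
# each processor (thread) has a mailbox; send deposits a message,
# receive blocks until one arrives.

mailbox = {"A": queue.Queue(), "B": queue.Queue()}

def send(dest, msg):
    mailbox[dest].put(msg)       # asynchronous: sender continues at once

def receive(me):
    return mailbox[me].get()     # blocks until a message arrives

def send_sync(dest, me, msg):
    """Synchronous send: wait for an ack from the receiver (as in RPC)."""
    send(dest, (me, msg))
    return receive(me)           # block until the ack comes back

def server():
    sender, msg = receive("B")
    send(sender, "ack:" + msg)   # reply so the blocked sender can proceed

t = threading.Thread(target=server)
t.start()
reply = send_sync("B", "A", "hello")
t.join()
assert reply == "ack:hello"
```

`send` alone models the asynchronous case; `send_sync` models the synchronous case, since the caller cannot continue until the receiver has acknowledged the message.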

The Pros and Cons

• Shared-memory pros
  – Ease of programming (SPMD: Single Program Multiple Data paradigm)
  – Good for communication of small items
  – Less O.S. overhead
  – Hardware-based cache coherence
• Message-passing pros
  – Simpler hardware (more scalable)
  – Explicit communication (both good and bad; some programming languages have primitives for it); easier for long messages
  – Use of message-passing libraries

Caveat about Parallel Processing

• Multiprocessors are used to:
  – Speed up computations
  – Solve larger problems
• Speedup
  – Time to execute on 1 processor / Time to execute on N processors
  – Speedup is limited by the communication/computation ratio and by synchronization
• Efficiency
  – Speedup / Number of processors
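The speedup and efficiency definitions above are direct to compute; a minimal sketch (the 100 s / 25 s / 8-processor numbers are made-up examples):

```python
def speedup(t1, tn):
    """Time on 1 processor divided by time on N processors."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Speedup divided by the number of processors."""
    return speedup(t1, tn) / n

# e.g., a job taking 100 s on 1 processor and 25 s on 8 processors:
s = speedup(100.0, 25.0)        # 4.0
e = efficiency(100.0, 25.0, 8)  # 0.5 -- half the machine's potential is idle
```

An efficiency well below 1 is the signature of the communication/computation and synchronization limits mentioned above.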


SMP (Symmetric MultiProcessors aka Multis): Single Shared-bus Systems

[Figure: processors with private caches connected by a shared bus to an interleaved memory and an I/O adapter]

Amdahl's Law for Parallel Processing

• Recall Amdahl's law
  – If a fraction x of your program is sequential, speedup is bounded by 1/x
• At best linear speedup (if no sequential section)
• What about superlinear speedup?
  – Theoretically impossible
  – "Occurs" because adding a processor might mean adding more overall memory and caching (e.g., fewer page faults!)
  – Have to be careful about the sequential fraction x; it might become lower if the data set increases
• Speedup and Efficiency should have the number of processors and the size of the input set as parameters
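Amdahl's law as stated above can be written out as a one-line formula (a sketch, with the sequential part running at speed 1 and the rest sped up perfectly by n processors):

```python
def amdahl_speedup(seq_fraction, n):
    """Amdahl's law: the sequential fraction runs unimproved, the
    remaining (1 - seq_fraction) is divided among n processors."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / n)

# With a 10% sequential fraction, speedup is bounded by 1/0.1 = 10,
# no matter how many processors are used:
assert round(amdahl_speedup(0.1, 10), 2) == 5.26
assert amdahl_speedup(0.1, 10**9) < 10.0
# With no sequential section, speedup is at best linear:
assert amdahl_speedup(0.0, 16) == 16.0
```

Note how quickly the bound bites: even 10 processors on a 10%-sequential program achieve barely half of the 10x limit.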


Multiprocessor with Centralized Switch

[Figure: processors with private caches on one side of a single interconnection network, memory modules on the other]

Multiprocessor with Decentralized Switches

[Figure: processor-memory nodes, each attached to its own switch; the switches together form the interconnection network]
