
Sharing memory in MIMD machines

- Multiprocessors
  - UMA
  - NUMA
- Multicomputers - distributed shared memory (DSM)
  - motivation
  - consistency models
  - implementation
    - implementation problems
      - granularity
      - page replacement
    - implementation approaches
  - thrashing

Multicomputers and Multiprocessors

- Multiprocessors (shared memory)
  - Any CPU can use the usual load/store operations to access any memory word
  - (+) Simple communication between processes - shared memory locations
  - (+) Synchronization is well understood and uses classical techniques (semaphores, etc.) - see the sketch below
  - (-) Complex hardware - the bus becomes a bottleneck with more than 10-20 CPUs
- Multicomputers (no shared memory)
  - Each CPU has its own private memory
  - (+) Simple hardware with a network connection
  - (-) Complex communication - processes have to use message passing and deal with lost messages, blocking, etc.
  - RPCs help, but aren't perfect (no globals, can't pass large data structures, etc.)
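As a concrete illustration of the point above (not taken from the slides), the following minimal C sketch shows two threads communicating through an ordinary shared variable and synchronizing with a POSIX semaphore; the thread and variable names are invented for the example.

    /* Hypothetical sketch: shared-memory communication plus a classical
     * synchronization primitive (a semaphore), as a multiprocessor allows. */
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static int shared_word;      /* any thread can load/store it directly */
    static sem_t ready;          /* signals that shared_word has been written */

    static void *producer(void *arg) {
        (void)arg;
        shared_word = 42;        /* ordinary store into shared memory */
        sem_post(&ready);        /* let the consumer proceed */
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        sem_wait(&ready);        /* block until the producer has written */
        printf("consumer read %d\n", shared_word);  /* ordinary load */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        sem_init(&ready, 0, 0);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        sem_destroy(&ready);
        return 0;
    }

On a multicomputer the same exchange would instead require an explicit message from producer to consumer, which is exactly the extra complexity the slide contrasts.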

Bus-based multiprocessors

[Figure: three CPUs, each with a cache, sharing one memory over a single bus]

- Symmetric Multiprocessor (SMP)
  - Multiple CPUs (2-30), one shared physical memory, connected by a bus
- Caches must be kept consistent
  - Each CPU has a "snooping" cache, which "snoops" on what's happening on the bus (see the sketch below)
    - On a read hit, fetch the data from the local cache
    - On a read miss, fetch the data from memory and store it in the local cache
      - The same data may be in multiple caches
    - (Write through) On a write, store in memory and in the local cache
      - Other caches are snooping; if they hold that word, they invalidate their cache entry
      - After the write completes, memory is up to date and the word is only in one cache

NUMA Machines

[Figure: CPU-memory nodes, each on a local bus, connected by a LAN/interconnect]

- NUMA = Non-Uniform Memory Access multiprocessor
  - Multiple CPUs, each with its own memory
  - CPUs share a single address space
  - Accesses to local memory locations are much faster (maybe 10x) than accesses to remote memory locations
- No hardware caching, so it matters which data is stored in which memory
  - A reference to a remote page causes a hardware page fault, which traps to the OS, which decides whether or not to move the page to the local machine
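The write-through snooping behaviour described under "Bus-based multiprocessors" can be sketched as a toy simulation. This is an illustrative model only; the cache size, the struct line layout, and the direct-mapped indexing are all invented and do not describe any particular hardware.

    /* Toy model: write-through caches with bus snooping and invalidation. */
    #include <stdio.h>
    #include <stdbool.h>

    #define NCPUS   3
    #define MEMSZ   16
    #define CACHESZ 4

    struct line { bool valid; int addr; int value; };

    static int memory[MEMSZ];
    static struct line cache[NCPUS][CACHESZ];

    static int read_word(int cpu, int addr) {
        struct line *l = &cache[cpu][addr % CACHESZ];
        if (l->valid && l->addr == addr)          /* read hit: local cache      */
            return l->value;
        l->valid = true;                          /* read miss: fetch from      */
        l->addr  = addr;                          /* memory, then cache locally */
        l->value = memory[addr];
        return l->value;
    }

    static void write_word(int cpu, int addr, int value) {
        memory[addr] = value;                     /* write through to memory    */
        cache[cpu][addr % CACHESZ] = (struct line){ true, addr, value };
        for (int other = 0; other < NCPUS; other++) {   /* bus snooping         */
            if (other == cpu) continue;
            struct line *l = &cache[other][addr % CACHESZ];
            if (l->valid && l->addr == addr)
                l->valid = false;                 /* invalidate stale copy      */
        }
    }

    int main(void) {
        read_word(1, 5);            /* CPU 1 caches word 5                      */
        write_word(0, 5, 99);       /* CPU 0 writes it; CPU 1's copy invalidated */
        printf("CPU 1 rereads word 5: %d\n", read_word(1, 5));   /* prints 99   */
        return 0;
    }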

Distributed Shared Memory (DSM) overview

- Basic idea (Kai Li, 1986)
  - Collection of workstations, connected by a LAN, all sharing a single paged, virtual address space
  - Each page is present on exactly one machine, and a reference to a local page is done in the usual fashion
  - A reference to a remote page causes a hardware page fault, which traps to the OS, which sends a message to the remote machine to get the page; the faulting instruction can then complete (see the sketch after the advantages below)
  - (+) To the programmer, a DSM machine looks like a conventional shared-memory machine
    - Easy to modify old programs
  - (-) Poor performance - lots of pages being sent back and forth over the network

Advantages of DSM

- Simpler abstraction - the programmer does not have to worry about data movement; may be easier to implement than RPC since the address space is the same
- Easier portability - sequential programs can in principle be run directly on DSM systems
- Possibly better performance
  - locality of data - data is moved in large blocks, which helps programs with good locality of reference
  - on-demand data movement
  - larger memory space - no need to do paging on disk
- Flexible communication - the sender and receiver need not exist at the same time; nodes can join and leave the DSM system without affecting the others
- Process migration simplified - a process can easily be moved to a different machine since all machines share the address space
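To make the fault path in the DSM overview concrete, here is a hypothetical single-process sketch. The "remote" owner is faked with an in-process array so the flow can actually run, and all names (dsm_page_fault, fetch_page_from_owner, dsm_read) are invented for illustration; a real system would hook the hardware page-fault handler and send real messages.

    /* Minimal sketch of the DSM fault path, assuming one remote owner node. */
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 64    /* tiny pages to keep the demo small */
    #define NPAGES    4

    static char remote_store[NPAGES][PAGE_SIZE];   /* pages held by the owner  */
    static char local_store[NPAGES][PAGE_SIZE];    /* pages resident locally   */
    static int  resident[NPAGES];                  /* 1 if mapped on this node */

    /* Stand-in for "send a message to the owner and wait for the page". */
    static void fetch_page_from_owner(int page_no, char *buf) {
        memcpy(buf, remote_store[page_no], PAGE_SIZE);
    }

    /* Stand-in for the OS handler invoked on a reference to a remote page. */
    static void dsm_page_fault(int page_no) {
        char buf[PAGE_SIZE];
        fetch_page_from_owner(page_no, buf);           /* message the owner    */
        memcpy(local_store[page_no], buf, PAGE_SIZE);  /* install the page     */
        resident[page_no] = 1;                         /* faulting access retries */
    }

    /* A load: local references go straight through, remote ones fault first. */
    static char dsm_read(int page_no, int offset) {
        if (!resident[page_no])
            dsm_page_fault(page_no);
        return local_store[page_no][offset];
    }

    int main(void) {
        strcpy(remote_store[2], "hello from the owner");
        printf("%c\n", dsm_read(2, 0));   /* faults, fetches page 2, prints 'h' */
        printf("%c\n", dsm_read(2, 1));   /* now resident: prints 'e', no fault */
        return 0;
    }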

Comparison of shared memory systems

- SMP = single bus multiprocessor
  - Hardware access to all of memory, hardware caching of blocks
- NUMA
  - Hardware access to all of memory, software caching of pages
- Distributed Shared Memory (DSM)
  - Software access to remote memory, software caching of pages

Maintaining memory coherency

- DSM systems allow concurrent access to shared data
- Concurrency may lead to unexpected results - what if a read does not return the value stored by the most recent write (the write did not propagate)? (see the sketch below)
- Memory is coherent if the value returned by a read operation is always the value the programmer expected
- To maintain coherency of shared data, a mechanism that controls (and synchronizes) memory accesses is used
- This mechanism only allows a restricted set of memory access orderings
- Memory consistency model - the set of allowable memory access orderings
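A minimal sketch of the incoherent outcome mentioned above, under the invented assumption that each node keeps its own copy of a shared variable x and that a separate step propagates writes between copies.

    /* Two per-node copies of x; without propagation, node 1 reads a stale value. */
    #include <stdio.h>

    static int cached_x[2] = {0, 0};   /* per-node copies of the shared variable x */

    static void write_x(int node, int value) { cached_x[node] = value; }
    static int  read_x(int node)             { return cached_x[node]; }

    /* The propagation step a coherence protocol would perform (update/invalidate). */
    static void propagate(int from, int to)  { cached_x[to] = cached_x[from]; }

    int main(void) {
        write_x(0, 7);                          /* node 0 writes x = 7             */
        printf("node 1 reads %d\n", read_x(1)); /* prints 0: write not propagated  */
        propagate(0, 1);                        /* coherence mechanism runs        */
        printf("node 1 reads %d\n", read_x(1)); /* prints 7: now coherent          */
        return 0;
    }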

Consistency models

- Notation: r_ij[a] - operation number j performed by process P_i on memory element a
- Strict consistency (strongest model)
  - The value returned by a read operation is always the same as the value written by the most recent write operation
    - Writes become instantly available to all processes
  - Requires absolute global time to correctly order operations - not possible
- Sequential consistency (Lamport 1979)
  - All processes see all memory access operations in the same order
    - The interleaving of operations doesn't matter, as long as all processes see the same ordering
  - A read operation may not return the result of the most recent write operation!
    - Running a program twice may give different results each time
  - If order matters, use semaphores!

Consistency models (cont.)

- Causal consistency (Hutto and Ahamad 1990)
  - Writes are seen by all processes in the same order if they are potentially causally related
  - Read/write (or two write) operations on the same item are causally related
  - Causality is transitive - if a process carries out an operation B that causally depends on a preceding operation A, then all subsequent operations by that process are causally related to A (even if they are on different items)
- PRAM consistency (Lipton & Sandberg 1988)
  - All processes see the memory writes done by a single process in the same (correct) order
  - PRAM = pipelined RAM
    - Writes done by a single process can be pipelined; it doesn't have to wait for one write to finish before starting another
    - Writes by different processes may be seen in a different order by a third process
  - Easy to implement - order the writes on each processor independently of all others
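The difference between sequential consistency and the weaker models can be seen in the classic two-process litmus test below. This is an illustrative C11/pthreads sketch, not something from the slides: relaxed atomics stand in for a weak memory model, and compiling the loads and stores with memory_order_seq_cst instead would enforce sequential consistency.

    /* Litmus test: under sequential consistency the outcome r1==0 && r2==0 is
     * impossible, because no single interleaving of the four operations that
     * respects each process's program order can produce it.  Weaker models
     * (e.g. PRAM) allow it, since the two processes may see the writes in
     * different orders. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int x, y;
    static int r1, r2;

    static void *p1(void *arg) {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);   /* write x      */
        r1 = atomic_load_explicit(&y, memory_order_relaxed);  /* then read y  */
        return NULL;
    }

    static void *p2(void *arg) {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);   /* write y      */
        r2 = atomic_load_explicit(&x, memory_order_relaxed);  /* then read x  */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Sequentially consistent results: (0,1), (1,0), (1,1).               */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }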

Consistency models (cont.)

- Processor consistency (Goodman 1989)
  - PRAM consistency +
  - Coherency on the same data item - all processes agree on the order of write operations to the same data item
- Weak consistency (Dubois 1988)
  - Consistency need only apply to a group of memory accesses rather than to individual memory accesses
  - Use synchronization variables to make all memory changes visible to all other processes (e.g., on exiting a critical section)
    - All accesses to synchronization variables must be sequentially consistent
    - All write operations must complete before an access to a synchronization variable is allowed
    - Access to non-synchronization variables is allowed only after the synchronization-variable access has completed
- Release consistency (Gharachorloo 1990)
  - Two synchronization operations (see the lock sketch after the comparison below)
    - acquire - all changes to synchronized variables are propagated to the acquiring process
    - release - all changes to synchronized variables are propagated to the other processes
    - The programmer has to write the accesses to these variables

Comparison of consistency models

- Models differ in how restrictive they are, how difficult they are to implement, ease of programming, and performance
- Strict consistency - most restrictive, but impossible to implement
- Sequential consistency - widely used, intuitive semantics, not much extra burden on the programmer
  - But does not allow much concurrency
- Causal & PRAM consistency - allow more concurrency, but have non-intuitive semantics and put more of a burden on the programmer to avoid doing things that require more consistency
- Weak and release consistency - intuitive semantics, but put an extra burden on the programmer
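As a rough analogy for release consistency (the lock sketch referenced above), the following pthreads example treats pthread_mutex_lock as the acquire operation and pthread_mutex_unlock as the release. The mapping is an illustration of the model's programming style, not a statement about how any particular DSM implements it.

    /* Acquire/release style: shared data only needs to be consistent at the
     * synchronization points, not after every individual memory access. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static int shared_counter;          /* protected (synchronized) shared data */

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000; i++) {
            pthread_mutex_lock(&m);     /* acquire: see all changes released so far */
            shared_counter++;           /* updates need only be consistent here     */
            pthread_mutex_unlock(&m);   /* release: propagate changes to others     */
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %d\n", shared_counter);   /* always 2000 */
        return 0;
    }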

Implementation issues

- How to keep track of the location of remote data
- How to overcome the communication delays and the high overhead associated with executing communication protocols
- How to make shared data concurrently accessible at several nodes to improve system performance

Granularity

- Granularity - the size of the shared memory unit
- Page-based DSM
  - Single page - simple to implement
  - Multiple pages - take advantage of locality of reference, amortize network overhead over multiple pages
    - Disadvantage - false sharing (see the sketch below)
- Shared-variable DSM
  - Share only those variables that are needed by multiple processes
  - Updating is easier and false sharing can be avoided, but more burden is put on the programmer
- Object-based DSM
  - Retrieve not only the data but the entire object - data, methods, etc.
  - Old programs have to be heavily modified
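The false-sharing disadvantage noted above can be sketched with a toy model in which two nodes update two unrelated variables that happen to live on the same page. The counters and function names are invented, and a page transfer is modelled as a simple ownership change.

    /* False sharing: the page ping-pongs even though no data is truly shared. */
    #include <stdio.h>

    static int page_owner = 0;        /* which node currently holds the page   */
    static int transfers  = 0;        /* how many times the page moved         */
    static int var_a, var_b;          /* unrelated variables, same page        */

    /* Before touching anything on the page, a node must own the page. */
    static void acquire_page(int node) {
        if (page_owner != node) {
            page_owner = node;
            transfers++;              /* whole page shipped across the network */
        }
    }

    int main(void) {
        for (int i = 0; i < 100; i++) {
            acquire_page(0); var_a++;         /* node 0 only ever touches var_a */
            acquire_page(1); var_b++;         /* node 1 only ever touches var_b */
        }
        /* Nearly every update moved the page: the false-sharing penalty. */
        printf("page transfers: %d\n", transfers);
        return 0;
    }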


Implementing sequential consistency on page-based DSM

- Can a page move? …be replicated?
- Nonreplicated, nonmigrating pages
  - All requests for the page have to be sent to the owner of the page
  - Easy to enforce sequential consistency - the owner orders all access requests
  - No concurrency
- Nonreplicated, migrating pages
  - All requests for the page have to be sent to the owner of the page
  - Each time a remote page is accessed, it migrates to the processor that accessed it
  - Easy to enforce sequential consistency - only processes on that processor can access the page
  - No concurrency

Implementing sequential consistency on page-based DSM (cont.)

- Replicated, migrating pages
  - All requests for the page have to be sent to the owner of the page
  - Each time a remote page is accessed, it's copied to the processor that accessed it
  - Multiple read operations can be done concurrently
  - Hard to enforce sequential consistency - must invalidate (the most common approach) or update other copies of the page during a write operation (see the sketch below)
- Replicated, nonmigrating pages
  - Replicated at fixed locations
  - All requests to the page have to be sent to one of the owners of the page
  - Hard to enforce sequential consistency - must update other copies of the page during a write operation
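A hypothetical sketch of the invalidation approach for replicated, migrating pages referenced above: reads replicate the page, and a write first invalidates every other copy so that all processes continue to see the writes in a single order. The copyset array and the one-word "page" are simplifications for illustration.

    /* Write-invalidate for a replicated page, assuming a fixed set of nodes. */
    #include <stdio.h>
    #include <stdbool.h>

    #define NNODES 4

    static int  page_value;               /* contents of the (one-word) page   */
    static bool has_copy[NNODES];         /* the page's copyset                */

    static int read_page(int node) {
        has_copy[node] = true;            /* replicate: reads can now be local */
        return page_value;
    }

    static void write_page(int writer, int value) {
        for (int n = 0; n < NNODES; n++)  /* invalidate every other replica    */
            if (n != writer)
                has_copy[n] = false;
        has_copy[writer] = true;
        page_value = value;               /* only now is the write performed   */
    }

    int main(void) {
        read_page(1); read_page(2);       /* nodes 1 and 2 replicate the page  */
        write_page(0, 42);                /* node 0 writes: copies 1, 2 invalidated */
        printf("node 2 copy valid? %s\n", has_copy[2] ? "yes" : "no");  /* no  */
        printf("node 2 rereads: %d\n", read_page(2));                   /* 42  */
        return 0;
    }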


Thrashing

- Occurs when the system spends a large amount of time transferring shared data blocks from one node to another (compared to the time spent on useful computation)
  - Interleaved data access by two nodes causes a data block to move back and forth
  - Read-only blocks are invalidated as soon as they are replicated
- Handling thrashing
  - The application specifies when to prevent other nodes from moving a block - the application has to be modified
  - "Nailing" a block after a transfer for a minimum amount of time t - it is hard to select t, and a wrong choice makes inefficient use of the DSM (see the sketch below)
    - adaptive nailing?
  - Tailoring the coherence semantics to the access pattern (Munin) - requires the programmer to annotate shared data
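A minimal sketch of the "nailing" heuristic, under the stated assumption that a node simply refuses to surrender a block until a minimum hold time t has elapsed. The names (may_release_block, block_arrived) and the one-second t are invented for the example.

    /* Nailing: hold a freshly arrived block for at least NAIL_SECONDS. */
    #include <stdio.h>
    #include <time.h>

    #define NAIL_SECONDS 1       /* the hard-to-choose parameter t */

    static time_t arrival_time;  /* when the block last arrived here */

    /* Called when a remote node asks for a block this node currently holds. */
    static int may_release_block(void) {
        return difftime(time(NULL), arrival_time) >= NAIL_SECONDS;
    }

    static void block_arrived(void) {
        arrival_time = time(NULL);   /* start of the nailing interval */
    }

    int main(void) {
        block_arrived();
        printf("release immediately? %s\n", may_release_block() ? "yes" : "no");
        /* A real DSM would queue or retry the remote request until t expires. */
        return 0;
    }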
