Distributed Shared Memory

Sharing memory in MIMD machines Multicomputers and Multiprocessors n Multiprocessors (shared memory) n Multiprocessors u Any process can use usual load / store operations to access any u UMA memory word u NUMA 8Complex hardware — bus becomes a bottleneck with more than n Multicomputers - distributed shared memory 10-20 CPUs (DSM) 4 Simple communication between processes — shared memory u motivation locations u consistency models 4 Synchronization is well-understood, uses classical techniques u implementation (semaphores, etc.) F implementation problems n Multicomputers (no shared memory) • granularity u Each CPU has its own private memory • page replacement 4 Simple hardware with network connection F implementation approaches 8Complex communication — processes have to use message- u thrashing passing, have to deal with lost messages, blocking, etc. u RPCs help, but aren’t perfect (no globals, can’t pass large data structures, etc.) 1 2 Bus-based multiprocessors NUMA Machines CPU CPU CPU n Symmetric Memory CPU Memory CPU Memory Multiprocessor Cache Cache Cache (SMP) u Multiple CPUs (2–30), one shared physical memory, connected by Bus Bus a bus LAN n Caches must be kept consistent n NUMA = Non-Uniform Memory Access u Each CPU has a “snooping” cache, which “snoops” what’s n NUMA multiprocessor happening on the bus u Multiple CPUs, each with its own memory F On read hit, fetch data from local cache u CPUs share a single virtual memory F On read miss, fetch data from memory, and store in local cache u Accesses to local memory locations are much faster (maybe 10x) • Same data may be in multiple caches than accesses to remote memory locations F (Write through) On write, store in memory and local cache n No hardware caching, so it matters which data is stored in which • Other caches are snooping; if they have that word they memory invalidate their cache entry u A reference to a remote page causes a hardware page fault, which • After write completes, the memory is up-to-date and the traps to the OS, which decides whether or not to move the page to word is only in one cache 3 the local machine 4 Distributed Shared Memory (DSM) overview Advantages of DSM n Simpler abstraction - programmer does not have to worry about data movement, may be easier to implement than RPC since the address space is the same n Basic idea (Kai Li, 1986) n easier portability - sequential programs can in principle be run directly u Collection of workstations, connected by a LAN, all sharing a single on DSM systems paged, virtual address space n possibly better performance u Each page is present on exactly one machine, and a reference to a local page is done in the usual fashion u locality of data - data moved in large blocks which helps programs with good locality of reference u A reference to a remote page causes a hardware page fault, which traps to the OS, which sends a message to the remote machine to u on-demand data movement get the page; the faulting instruction can then complete u larger memory space - no need to do paging on disk 4 To programmer, DSM machine looks like a conventional shared- n flexible communication - no need for sender and receiver to exist, can memory machine join and leave DSM system without affecting the others u Easy to modify old programs n process migration simplified - one process can easily be moved to a 8 Poor performance — lots of pages being sent back and forth over the different machine since they all share the address space network 5 6 Comparison of shared memory systems Maintaining memory coherency n DSM systems allow concurrent access to shared data n concurrency may lead to unexpected results - what if the read does not return the value stored by the most recent write (write did not propagate)? n Memory is coherent if the value returned by the read operation is always the value the programmer expected n To maintain coherency of shared data a mechanism that controls (and synchronizes) memory accesses is used. n This mechanism only allows a restricted set of memory u SMP = single bus multiprocessor access orderings F Hardware access to all of memory, hardware caching of blocks n memory consistency model - the set of allowable memory u NUMA access orderings F Hardware access to all of memory, software caching of pages u Distributed Shared Memory (DSM) 7 8 F Software access to remote memory, software caching of pages Consistency models Consistency models (cont.) n Causal consistency (Hutto and Ahmad 1990) n Notation rij[a] - operation number J performed by process Pi on memory element a u the processes are seen in the same order if they are potentially n Strict consistency (strongest model) causally related u Value returned by a read operation is always the same as the u read/write (two write) operations on the same item are causally value written by the most recent write operation related F Writes become instantly available to all processes u causality is transitive - if a process carries out an operation B that causally depends on the preceding op A - all consequent ops by these u Requires absolute global time to correctly order operations — not possible process are causally related to A (even if they are on different items) n PRAM consistency (Lipton & Sandberg 1988) n Sequential consistency (Lamport 1979) u All processes see only memory writes done by a single process in the u All processes see all memory access operations in the same order same (correct) order F Interleaving of operations doesn’t matter, if all processes see the same ordering u PRAM = pipelined RAM F Writes done by a single process can be pipelined; it doesn’t have u Read operation may not return result of most recent write operation! to wait for one to finish before starting another F writes by different processes may be seen in different order on a F Running a program twice may give different results each time third process u If order matters, use semaphores! u Easy to implement — order writes on each processor independent of 9 all others 10 Consistency models (cont.) Comparison of consistency models n Processor consistency (Goodman 1989) u PRAM + (cont.) u coherency on the same data item - all processes agree on the order of write operations to the same data item n Models differ in how restrictive they are, difficulty to implement, ease of n Weak consistency (Dubois 1988) programming, and performance u Consistency need only apply to a group of memory accesses rather n Strict consistency — most restrictive, but impossible to implement than individual memory accesses n Sequential consistency — widely used, intuitive semantics, not much u Use synchronization variables to make all memory changes visible to extra burden on programmer all other processes (e.g., exiting critical section) u But does not allow much concurrency F all access to synchronization variables must be sequentially consistent n Causal & PRAM consistency — allow more concurrency, but have non-intuitive semantics, and put more of a burden on the programmer F write operations are completed before access to synchvar to avoid doing things that require more consistency F access to non-synchvar is allowed only after sycnhvar access is completed n Weak and Release consistency — intuitive semantics, but put extra n Release consistency (Gharachorloo 1990) burden on the programmer u two synchronization vars F acquire - all changes to synchronized vars are propagated to the process F release - all changes to syncrhonized vars are propagated to other processes 11 12 F programmer has to write accesses to these variables Implementation issues Granularity n Granularity - size of shared memory unit n how to keep track of the location of remote data n Page-based DSM n how to overcome the communication delays and high u Single page — simple to implement overhead associated with execution of communication u Multiple pages — take advantage of locality of reference, amortize protocols network overhead over multiple pages n how to make shared data concurrently accessible at several F Disadvantage — false sharing nodes to improve system performance n Shared-variable DSM u Share only those variables that are need by multiple processes u Updating is easier, can avoid false sharing, but puts more burden on the programmer n Object-based DSM u Retrieve not only data, but entire object — data, methods, etc. u Have to heavily modify old programs 13 14 Implementing sequential consistency Implementing sequential consistency on page-based DSM on page-based DSM (cont.) n Can a page move? …be replicated? n Replicated, migrating pages n Nonreplicated, nonmigrating pages u All requests for the page have to be sent to the owner of the page u All requests for the page have to be sent to the owner of the page u Each time a remote page is accessed, it’s copied to the processor u Easy to enforce sequential consistency — owner orders all access that accessed it request u Multiple read operations can be done concurrently u No concurrency u Hard to enforce sequential consistency — must invalidate (most n Nonreplicated, migrating pages common approach) or update other copies of the page during a write operation u All requests for the page have to be sent to the owner of the page n Replicated, nonmigrating pages u Each time a remote page is accessed, it migrates to the processor that accessed it u Replicated at fixed locations u Easy to enforce sequential consistency — only processes on that u All requests to the page have to be sent to one of the owners of the processor can access the page page u No concurrency u Hard to

Distributed Shared Memory

2.5 Classification of Parallel Computers

Chapter 5 Multiprocessors and Thread-Level Parallelism

Chapter 5 Thread-Level Parallelism

Trafficdb: HERE's High Performance Shared-Memory Data Store

Scheduling of Shared Memory with Multi - Core Performance in Dynamic Allocator Using Arm Processor

The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors

Lecture 17: Multiprocessors

Use of Shared Memory in the Context of Embedded Multi-Core Processors: Exploration of the Technology and Its Limits

GPU Computing with CUDA Lecture 3 - Efficient Shared Memory Use

Instruction Level Parallelism Example

CIS 501 Computer Architecture This Unit: Shared Memory

Programming with Shared Memory PART I