Memory Consistency and Multiprocessor Performance
Adapted from UCB CS252 S01, Copyright 2001 UCB

Memory Consistency Model
- Defines memory correctness for parallel execution
- Execution appears to be that of some correct execution on a theoretical parallel computer that has n sequential processors
- In particular, remote writes must appear at each local processor in some correct sequence

Sequential Consistency Model
- Sequential consistency: memory reads/writes are globally serialized; assume every cycle only one processor can proceed by one step, and a write's result appears on all other processors immediately
- Processors do not reorder local reads and writes
- Note: the number of possible total orderings is an exponential function of the number of instructions

Memory Consistency Examples
- Initially A == 0 and B == 0

      P1:  A = 1;                  P2:  B = 1;
           L1: if (B == 0) ...          L2: if (A == 0) ...

- Is it impossible for both if statements L1 and L2 to be true?
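A minimal C11 sketch of the example above (the variables A, B and the observed values r1, r2 follow the slide; the threading scaffolding is illustrative). With memory_order_seq_cst the compiler and hardware must preserve sequential consistency, so both reads cannot return 0; switching the accesses to memory_order_relaxed models the reordering that weaker machines allow, and then both if statements can be true.

```c
/* Minimal sketch of the A/B example, using C11 threads and atomics.
 * Variable names follow the slide; everything else is illustrative.
 * With memory_order_seq_cst, "r1 == 0 && r2 == 0" is impossible;
 * with memory_order_relaxed, hardware/compiler reordering may let
 * both if statements (L1 and L2) be true. */
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static atomic_int A = 0, B = 0;   /* initially A == 0 and B == 0 */
static int r1, r2;                /* what L1 and L2 observe */

static int p1(void *arg) {        /* P1: A = 1; L1: if (B == 0) ... */
    (void)arg;
    atomic_store_explicit(&A, 1, memory_order_seq_cst);
    r1 = atomic_load_explicit(&B, memory_order_seq_cst);
    return 0;
}

static int p2(void *arg) {        /* P2: B = 1; L2: if (A == 0) ... */
    (void)arg;
    atomic_store_explicit(&B, 1, memory_order_seq_cst);
    r2 = atomic_load_explicit(&A, memory_order_seq_cst);
    return 0;
}

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, p1, NULL);
    thrd_create(&t2, p2, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    printf("r1=%d r2=%d\n", r1, r2);   /* never "r1=0 r2=0" under SC */
    return 0;
}
```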

Performance of Sequential Consistency
- SC must delay all memory accesses until all invalidates are done
- What if the write invalidate is delayed and the processor continues?
- How about memory-level parallelism?

Relaxed Memory Consistency
- Total store order (TSO)
  - Only writes are globally serialized; assume every cycle at most one write can proceed, and the write result appears immediately
  - Processors may reorder local reads/writes without RAW dependence
- Processor consistency (PC)
  - Writes from one processor appear in the same order on all other processors
  - Processors may reorder local reads/writes without RAW dependence
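Continuing the sketch above: under TSO or PC, the store-then-load reordering is exactly what lets both reads return 0 in the A/B example. One way to forbid that outcome again, assuming the same illustrative variables, is an explicit fence between each store and the following load:

```c
/* Variant of p1/p2 from the previous sketch: relaxed accesses model the
 * store->load reordering a weaker model permits; the seq_cst fence between
 * each store and the following load forbids "both read 0" again. */
#include <stdatomic.h>

static atomic_int A = 0, B = 0;
static int r1, r2;

static int p1_fenced(void *arg) {
    (void)arg;
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* keep store A before load B */
    r1 = atomic_load_explicit(&B, memory_order_relaxed);
    return 0;
}

static int p2_fenced(void *arg) {
    (void)arg;
    atomic_store_explicit(&B, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* keep store B before load A */
    r2 = atomic_load_explicit(&A, memory_order_relaxed);
    return 0;
}
```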

Relaxed Memory Consistency (example)
- An architect has designed an SMP that uses processor consistency

      P1 code               P2 code            P3 code
      A = get_data();       if (updated)       if (flag)
      updated = 1;            flag = 1;          process_data(A);

- A programmer found that P3 sometimes uses stale data of A. Initially updated == 0 and flag == 0. Can you defend the architect? (A C11 sketch of this scenario appears after the next slide.)

Memory Consistency and ILP
- Speculate on loads, recover on violation
- With ILP and SC, what will happen on this?

      P1 code     P2 code     P1 exec          P2 exec
      A = 1       B = 1       spec. load B     spec. load A
      print B     print A     store A          store B
                              retire load B    flush at load A

- SC can be maintained, but it is expensive, so use TSO or PC
- Speculative execution and rollback can still improve performance
- Performance: ILP + strong MC ≅ weak MC
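Here is the promised sketch of the P1/P2/P3 scenario in C11. Relaxed atomics stand in for the weakly ordered SMP in the question (an approximation, not a model of processor consistency itself): nothing forces P1's write of A to be visible to P3 just because P3 sees flag == 1, so P3 can read stale A. The function names and the data parameter are illustrative.

```c
/* P1/P2/P3 scenario sketched with C11 relaxed atomics as a stand-in for a
 * weakly ordered SMP. With memory_order_relaxed, P3 seeing flag == 1 does
 * not guarantee it also sees P1's write of A, reproducing the "stale A"
 * complaint. Upgrading the stores of updated/flag to memory_order_release
 * and the loads to memory_order_acquire restores the intended chain. */
#include <stdatomic.h>

static atomic_int A = 0;        /* the shared data */
static atomic_int updated = 0;  /* initially updated == 0 */
static atomic_int flag = 0;     /* initially flag == 0 */

void p1(int data) {             /* A = get_data(); updated = 1; */
    atomic_store_explicit(&A, data, memory_order_relaxed);
    atomic_store_explicit(&updated, 1, memory_order_relaxed);
}

void p2(void) {                 /* if (updated) flag = 1; */
    if (atomic_load_explicit(&updated, memory_order_relaxed))
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

int p3(void) {                  /* if (flag) process_data(A); */
    if (atomic_load_explicit(&flag, memory_order_relaxed))
        return atomic_load_explicit(&A, memory_order_relaxed);  /* may be stale */
    return -1;                  /* illustrative "not ready" value */
}
```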

Memory Consistency in Real Programs
- Not really an issue for many common programs; they are synchronized
- A program is synchronized if all accesses to shared data are ordered by synchronization operations

      write (x)
      ...
      release (s) {unlock}
      ...
      acquire (s) {lock}
      ...
      read (x)

Synchronization
- Why synchronize? Need to know when it is safe for different processes to use shared data
- Issues for synchronization:
  - Uninterruptable instruction to fetch and update memory (atomic operation)
  - Synchronization operations need this kind of primitive
  - For large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce contention and latency of synchronization
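A minimal sketch of the synchronized-access pattern above, with a C11 mutex playing the role of the synchronization variable s (the names are illustrative):

```c
/* The synchronized pattern from the slide with a C11 mutex as the
 * synchronization variable s: every access to the shared data x is
 * ordered by acquire (lock) / release (unlock), so the outcome does not
 * depend on the machine's memory consistency model.
 * Call mtx_init(&s, mtx_plain) once before the first use. */
#include <threads.h>

static int x;       /* shared data, protected by s */
static mtx_t s;     /* the synchronization variable */

void writer(int value) {
    mtx_lock(&s);       /* acquire (s)  {lock}   */
    x = value;          /* write (x)             */
    mtx_unlock(&s);     /* release (s)  {unlock} */
}

int reader(void) {
    mtx_lock(&s);       /* acquire (s)  {lock}   */
    int v = x;          /* read (x)              */
    mtx_unlock(&s);     /* release (s)  {unlock} */
    return v;
}
```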

Uninterruptable Instructions to Fetch and Update Memory
- Atomic exchange: interchange a value in a register for a value in memory
- Test-and-set: tests a value and sets it if the value passes the test
- Fetch-and-increment: returns the value of a memory location and atomically increments it
- (A spin-lock sketch using these primitives follows the table below.)

Parallel App: Commercial Workload
- Online transaction processing (OLTP) workload (like TPC-B or -C)
- Decision support system (DSS) (like TPC-D)
- Web index search (AltaVista)

      Benchmark     % Time User Mode   % Time Kernel   % Time I/O (CPU Idle)
      OLTP          71%                18%             11%
      DSS (range)   82-94%             3-5%            4-13%
      DSS (avg)     87%                4%              9%
      AltaVista     > 98%              < 1%            < 1%
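As promised above, a sketch of how these primitives are used: a spin lock built on atomic exchange (serving as test-and-set) and a counter updated with fetch-and-increment, written with C11 atomics. The names are illustrative and the lock is deliberately naive.

```c
/* Spin lock built on atomic exchange (test-and-set) and a counter updated
 * with fetch-and-increment, using C11 atomics. A sketch, not a tuned lock:
 * a real implementation would add backoff or test-and-test-and-set to
 * reduce coherence traffic (the contention issue noted on the slide). */
#include <stdatomic.h>

static atomic_int lock = 0;        /* 0 = free, 1 = held */
static atomic_int counter = 0;     /* shared counter */

void spin_lock(atomic_int *l) {
    /* Atomic exchange: keep swapping in 1 until the old value was 0. */
    while (atomic_exchange_explicit(l, 1, memory_order_acquire) != 0)
        ;                          /* spin */
}

void spin_unlock(atomic_int *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}

int next_ticket(void) {
    /* Fetch-and-increment: returns the old value and bumps the counter. */
    return atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

void critical_section_demo(void) {
    spin_lock(&lock);
    /* ... access shared data here ... */
    spin_unlock(&lock);
}
```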

Alpha 4100 SMP
- 4 CPUs
- Alpha 21164 @ 300 MHz
- L1$: 8KB, direct mapped, write through
- L2$: 96KB, 3-way set associative
- L3$: 2MB (off chip), direct mapped
- Memory latency: 80 clock cycles
- Cache-to-cache transfer: 125 clock cycles

OLTP Performance as L3$ Size Varies
[Figure: execution-time breakdown (instruction execution, L2/L3 cache access, memory access, PAL code, idle) as the L3 cache size grows from 1 MB to 8 MB]

L3 Miss Breakdown
[Figure: memory CPI contribution vs. L3 cache size (1 MB to 8 MB), broken into instruction, capacity/conflict, cold, false sharing, and true sharing misses]

Memory CPI as Processor Count Increases
[Figure: memory CPI vs. processor count (1 to 8) with a 2MB 2-way L3, broken into the same miss categories]

OLTP Performance as L3$ Block Size Varies
[Figure: L3 miss breakdown vs. block size (32, 64, 128, 256 bytes): instruction, capacity/conflict, cold, false sharing, and true sharing misses]

SGI Origin 2000
- A pure NUMA
- 2 CPUs per node
- Scales up to 2048 processors
- Designed for scientific computation vs. commercial processing
- Scalable bandwidth is crucial to Origin

Parallel App: Scientific/Technical
- FFT Kernel: 1D complex number FFT
  - 2 matrix transpose phases => all-to-all communication
  - Sequential time for n data points: O(n log n)
  - Example is a 1 million point data set
- LU Kernel: dense matrix factorization
  - Blocking helps cache miss rate; 16x16 blocks (see the blocking sketch after the figure below)
  - Sequential time for an n x n matrix: O(n^3)
  - Example is a 512 x 512 matrix

FFT Kernel
[Figure: average cycles per memory reference vs. processor count (8 to 64), broken into hit, miss to local memory, remote miss to home, and 3-hop miss to remote cache]
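To make the blocking bullet concrete, here is a hedged sketch of a cache-blocked matrix transpose in C. The tile size and names are illustrative, and this is a generic example of blocking for cache reuse, not the SPLASH-2 LU or FFT code itself.

```c
/* Cache blocking illustrated on a matrix transpose: the B x B tile is
 * chosen to fit in cache (the slide's LU kernel uses 16 x 16 blocks for
 * the same reason). A generic sketch, not the actual benchmark kernel. */
#define B 16   /* block (tile) edge, illustrative */

void transpose_blocked(int n, const double *a, double *t) {
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            /* Transpose one B x B tile; the source rows and destination
             * rows stay resident in cache while the tile is processed. */
            for (int i = ii; i < ii + B && i < n; i++)
                for (int j = jj; j < jj + B && j < n; j++)
                    t[(long)j * n + i] = a[(long)i * n + j];
}
```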

LU Kernel
[Figure: average cycles per memory reference vs. processor count (8 to 64), broken into hit, miss to local memory, remote miss to home, and 3-hop miss to remote cache]

Parallel App: Scientific/Technical (continued)
- Barnes App: Barnes-Hut n-body algorithm solving a problem in galaxy evolution
  - n-body algorithms rely on forces dropping off with distance; if a body is far enough away, it can be ignored (e.g., gravity is 1/d^2)
  - Sequential time for n data points: O(n log n)
  - Example is 16,384 bodies
- Ocean App: Gauss-Seidel multigrid technique to solve a set of elliptical partial differential equations
  - Red-black Gauss-Seidel colors points in the grid so that points are consistently updated based on previous values of adjacent neighbors
  - Multigrid solves the finite difference equations by iteration using a hierarchy of grids
  - Communication occurs when a boundary is accessed by an adjacent subgrid
  - Sequential time for an n x n grid: O(n^2)
  - Input: 130 x 130 grid points, 5 iterations
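A hedged sketch of one red-black Gauss-Seidel sweep on a regular 2D grid (a generic 5-point stencil, not the actual Ocean code): each color reads only neighbors of the other color, so all points of one color can be updated in parallel, and boundary rows are the only data a neighboring subgrid needs.

```c
/* One red-black Gauss-Seidel sweep for a 5-point stencil on an n x n grid.
 * Points of one parity class (selected by `color`) are updated using only
 * neighbors of the other parity, so within a color all updates are
 * independent and can proceed in parallel. Generic sketch only. */
static void rb_gs_sweep(int n, double *u, const double *f, double h, int color) {
    for (int i = 1; i < n - 1; i++)
        for (int j = 1 + ((i + color) & 1); j < n - 1; j += 2)
            u[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j] +
                                   u[i * n + (j - 1)] + u[i * n + (j + 1)] -
                                   h * h * f[i * n + j]);
}

/* A full iteration updates one color, then the other. */
static void rb_gs_iteration(int n, double *u, const double *f, double h) {
    rb_gs_sweep(n, u, f, h, 0);
    rb_gs_sweep(n, u, f, h, 1);
}
```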

Barnes App
[Figure: average cycles per memory reference vs. processor count (8 to 64), broken into cache hit, local miss, remote miss, and 3-hop miss to remote cache]

Ocean App
[Figure: average cycles per memory reference vs. processor count (8 to 64), broken into cache hit, local miss, remote miss, and 3-hop miss to remote cache]

Multiprocessor Conclusion
- Some optimism about the future
- Parallel processing is beginning to be understood in some domains
- MPs deliver more performance than that achieved with a single-chip microprocessor
- MPs are highly effective for multiprogrammed workloads
- MPs proved effective for intensive commercial workloads, such as OLTP (assuming enough I/O to be CPU-limited), DSS applications (where query optimization is critical), and large-scale web-searching applications
- On-chip MPs appear to be growing: 1) in the embedded market, where natural parallelism often exists, they are an obvious alternative to a faster but less silicon-efficient CPU; 2) diminishing returns in high-end microprocessors encourage designers to pursue on-chip multiprocessing
