Introduction and Cache Coherency

EEC 581 Computer Architecture
Multiprocessor and Memory Coherence
Department of Electrical Engineering and Computer Science
Cleveland State University

Memory Hierarchy in a Multiprocessor
[Figure: four organizations built from processors (P), caches ($), and memory: shared cache; bus-based shared memory; fully-connected shared memory (dancehall), with an interconnection network between the caches and the memories; distributed shared memory, with a memory at each node and an interconnection network between nodes.]

Cache Coherency
• The closest cache level is private
• Multiple copies of a cache line can be present across different processor nodes
• Local updates lead to an incoherent state
• The problem appears with both write-through and writeback caches
  • Bus-based: updates are globally visible
  • Point-to-point interconnect: updates are visible only to the processor nodes communicated with

Example (Writeback Cache)
[Figure: three processors with private caches above a shared memory, all holding X = -100. One cache changes its copy of X from -100 to 505; memory and the other caches still hold X = -100 when the other processors read X (Rd?).]

Example (Write-through Cache)
[Figure: three processors with private caches above a shared memory. A write changes X from -100 to 505 in the writer's cache and in memory; a cache still holding X = -100 is read (Rd?).]

Defining Coherence
• An MP is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order
• Implicit definition of coherence
  • Write propagation: writes are visible to other processes
  • Write serialization: all writes to the same location are seen in the same order by all processes (extended to "all" locations, this is called write atomicity)
  • E.g., if w1 followed by w2 is seen by a read from P1, the two writes will be seen in the same order by all reads by other processors Pi

Sounds Easy?
[Figure: A = 0 and B = 0 initially. At T1 one processor writes A = 1 and another writes B = 2. As the two updates propagate among P0 to P3 over T2 and T3, some processors see A's update before B's while others see B's update before A's.]

Bus Snooping based on Write-Through Cache
• All writes appear as transactions on the shared bus to memory
• Two protocols
  • Update-based protocol
  • Invalidation-based protocol

Bus Snooping (Update-based Protocol on Write-Through Cache)
[Figure: a write of X = 505 appears as a bus transaction that changes memory from -100 to 505; a snooping cache holding X = -100 updates its copy to 505.]
• Each processor's cache controller constantly snoops on the bus
• Update local copies upon a snoop hit

Bus Snooping (Invalidation-based Protocol on Write-Through Cache)
[Figure: a write of X = 505 appears as a bus transaction that changes memory from -100 to 505; a snooping cache holding X = -100 invalidates its copy, and its later Load X obtains 505.]
• Each processor's cache controller constantly snoops on the bus
• Invalidate local copies upon a snoop hit

A Simple Snoopy Coherence Protocol for a WT, No Write-Allocate Cache
State diagram, written as Observed / Transaction:

State     Observed  Transaction  Next state  Initiated by
Valid     PrRd      ---          Valid       Processor
Valid     PrWr      BusWr        Valid       Processor
Valid     BusWr     ---          Invalid     Bus snooper
Invalid   PrRd      BusRd        Valid       Processor
Invalid   PrWr      BusWr        Invalid     Processor

Cache Coherence Protocols for WB Caches
• A cache has an exclusive copy of a line if
  • it is the only cache having a valid copy
  • memory may or may not have it
• Modified (dirty) cache line
  • The cache holding the line is the owner of the line, because it must supply the block

Cache Coherence Protocol (Update-based Protocol on Writeback Cache)
[Figure 1: Store X by one processor changes X from -100 to 505; a bus transaction updates the copies in the other two caches to 505.]
[Figure 2: a second Store X changes X from 505 to 333 in all three caches via update transactions; a Load X by another processor hits in its own cache.]
• Update the data for all processor nodes that share it
• If a processor node keeps updating the memory location, a lot of traffic is incurred

Cache Coherence Protocol (Invalidation-based Protocol on Writeback Cache)
[Figure 1: Store X by one processor changes its copy of X from -100 to 505; a bus transaction invalidates the copies (X = -100) in the other two caches.]
[Figure 2: Load X by another processor misses; the bus transaction is a snoop hit in the cache holding X = 505, and both caches end up with X = 505.]
[Figure 3: repeated Store X operations by one processor change its copy from 505 to 333, 987, and 444; the copy X = 505 held by another cache is invalidated by a bus transaction.]
• Invalidate the data copies held by the sharing processor nodes
• Reduced traffic when a processor node keeps updating the same memory location

MSI Writeback Invalidation Protocol
• Modified
  • Dirty
  • Only this cache has a valid copy
• Shared
  • Memory is consistent
  • One or more caches have a valid copy
• Invalid
• Writeback protocol: a cache line can be written multiple times before the memory is updated.
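The traffic difference between the update-based and invalidation-based writeback protocols above can be made concrete with a small simulation. This is a minimal sketch under stated assumptions, not a model of any real bus: it tracks a single one-word line X, the class name `System` and the helper `run` are hypothetical, the update variant broadcasts every store and also refreshes memory (a simplification), and the stored values 505, 333, 987, and 444 are borrowed from the slide figures.

```python
class System:
    """Hypothetical model: N private writeback caches sharing one line X."""

    def __init__(self, num_caches, initial_value, policy):
        assert policy in ("invalidate", "update")
        self.policy = policy
        self.memory = initial_value
        self.state = ["I"] * num_caches    # per-cache state of X: M, S, or I
        self.value = [None] * num_caches   # per-cache copy of X
        self.bus_transactions = 0

    def load(self, p):
        if self.state[p] == "I":           # read miss goes to the bus
            self.bus_transactions += 1
            for q, st in enumerate(self.state):
                if st == "M":              # dirty owner supplies the line
                    self.memory = self.value[q]
                    self.state[q] = "S"
            self.value[p] = self.memory
            self.state[p] = "S"
        return self.value[p]               # hit: no bus traffic

    def store(self, p, v):
        if self.policy == "invalidate":
            self._store_invalidate(p, v)
        else:
            self._store_update(p, v)

    def _store_invalidate(self, p, v):
        if self.state[p] != "M":           # must first become the only holder
            self.bus_transactions += 1
            for q in range(len(self.state)):
                if q != p:
                    if self.state[q] == "M":
                        self.memory = self.value[q]   # write back before dropping
                    self.state[q], self.value[q] = "I", None
        self.state[p], self.value[p] = "M", v          # later stores stay local

    def _store_update(self, p, v):
        # Assumption: every store is broadcast and memory is refreshed too.
        self.bus_transactions += 1
        self.memory = v
        for q in range(len(self.state)):
            if q == p or self.state[q] != "I":
                self.state[q], self.value[q] = "S", v


def run(policy):
    s = System(3, -100, policy)
    for p in range(3):
        s.load(p)                          # all three caches share X = -100
    for v in (505, 333, 987, 444):
        s.store(0, v)                      # P0 keeps storing to X
    last = s.load(2)                       # P2 reads X afterwards
    return s.bus_transactions, last


if __name__ == "__main__":
    for policy in ("invalidate", "update"):
        print(policy, run(policy))
```

Under these assumptions the invalidation-based run pays one bus transaction for the first store and one more when P2 reloads X, while the update-based run pays one transaction per store, so the gap grows with the number of consecutive stores by the same processor.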
MSI Writeback Invalidation Protocol
• Two types of request from the processor
  • PrRd
  • PrWr
• Three types of bus transactions posted by the cache controller
  • BusRd
    • PrRd misses the cache
    • Memory or another cache supplies the line
  • BusRd eXclusive (BusRdX, read-to-own)
    • PrWr is issued to a line that is not in the Modified state
  • BusWB
    • Writeback due to replacement
    • The processor is not directly involved in initiating this operation

MSI Example
[Figures: P1, P2, and P3 with private caches on a bus above MEMORY, which initially holds X = 10. The slides build the table below one row at a time. P3's write changes X to -25 in P3's cache only; when P1 then reads X, P3 supplies the line and memory is updated from 10 to -25.]

Processor Action  State in P1  State in P2  State in P3  Bus Transaction  Data Supplier
P1 reads X        S            ---          ---          BusRd            Memory
P3 reads X        S            ---          S            BusRd            Memory
P3 writes X       I            ---          M            BusRdX           Memory (*)
P1 reads X        S            ---          S            BusRd            P3 Cache
P2 reads X        S            S            S            BusRd            Memory

(*) Does not come from memory if the protocol has "BusUpgrade".

MESI Writeback Invalidation Protocol
• Reduces two types of unnecessary bus transactions
  • BusRdX that snoops and converts the block from S to M when you are the sole owner of the block
  • BusRd that gets the line in the S state when there are no sharers (which leads to the overhead above)
• Introduces the Exclusive state
  • One can write to the copy without generating BusRdX
• Illinois Protocol: proposed by Papamarcos and Patel in 1984
• Employed in Intel, PowerPC, MIPS

MESI Writeback Invalidation Protocol: Processor Request (Illinois Protocol)
Processor-initiated transitions, written as Observed / Transaction. S: shared signal.

State      Observed    Transaction    Next state
Invalid    PrRd        BusRd (not-S)  Exclusive
Invalid    PrRd        BusRd (S)      Shared
Invalid    PrWr        BusRdX         Modified
Shared     PrRd        ---            Shared
Shared     PrWr        BusRdX         Modified
Exclusive  PrRd        ---            Exclusive
Exclusive  PrWr        ---            Modified
Modified   PrRd, PrWr  ---            Modified

MESI Writeback Invalidation Protocol: Bus Transactions (Illinois Protocol)
• Whenever possible, the Illinois protocol performs a $-to-$ transfer rather than having memory supply the data
• Use a selection algorithm if there are multiple suppliers (alternative: add an O state or force a memory update)
• Most MESI implementations simply write to memory

Bus-snooper-initiated transitions, written as Observed / Transaction:

State      Observed  Transaction     Next state
Modified   BusRd     Flush           Shared
Modified   BusRdX    Flush           Invalid
Exclusive  BusRd     Flush (or ---)  Shared
Exclusive  BusRdX    ---             Invalid
Shared     BusRd     Flush*          Shared
Shared     BusRdX    Flush*          Invalid

Flush*: Flush for the data supplier; no action for other sharers.

MESI Writeback Invalidation Protocol (Illinois Protocol)
[Figure: combined state diagram overlaying the processor-initiated and bus-snooper-initiated transitions of the two preceding slides.]

MOESI Protocol
• Adds one additional state: the Owner state
  • Similar to the Shared state
  • The processor holding the line in the O state is responsible for supplying data (the copy in memory may be stale)
• Employed by
  • Sun UltraSparc
  • AMD Opteron
• In dual-core Opteron, cache-to-cache transfer is done through a system request interface (SRI) running at full CPU speed
[Figure: dual-core Opteron block diagram: CPU0 and CPU1, each with an L2, connected through the System Request Interface and Crossbar to the Mem Controller and HyperTransport.]

Implication on Multi-Level Caches
• How to guarantee coherence in a multi-level cache hierarchy
  • Snoop all cache levels?
  • Intel's 8870 chipset has a "snoop filter" for quad-core
• Maintaining the inclusion property
  • Ensure that data in the inner level is also present in the outer level
  • Only snoop the outermost level (e.g. L2)
  • L2 needs to know that L1 has write hits
    • Use a write-through cache
    • Use write-back but maintain another "modified-but-stale" bit in L2

Inclusion Property
• Not so easy …
  • Replacement: Different bus observes different access activities, e.g.
