1

EE 457 Unit 7c

Cache Coherency 2 Parallel Processing Paradigms

• SISD = Single Instruction, Single Data – Uniprocessor • SIMD = Single Instruction, Multiple Data – Multimedia/Vector Instruction Extensions, Graphics Units (GPU’s) • MIMD = Multiple Instruction, Multiple Data – CMP, CMT, Parallel Programming

SISD SIMD MIMD

Instruc. Stream Data Stream PE MU CU PE

Shared CU PE MU CU PE MU CU PE MU

PE MU CU PE 3 Typical CMP Organization

Chip Multi- Processor P P P P For EE 457 this is just a shared bus L1 L1 L1 L1

Private L1's require maintaining coherency via ______. Interconnect (On-Chip Network)

L2 L2 L2 L2 Bank Bank/ Bank Bank/

For EE457, just one bank.

Main Memory 4 Coherency • Most multi-core processors are systems where each processor has its own cache • Problem: Multiple cached copies of same memory block – Each processor can get their own copy, change it, and perform calculations on their own different values…INCOHERENT! • Solution: Snoopy caches… Example of incoherence if P2 Writes X we if P2 Reads X it now have two will be using a versions. How do we P1 Reads X P2 Reads X P1 Writes X 1 2 3 4a “stale” value of X 4b reconcile them?

P1 P2 P1 P2 P1 P2 P1 P2 P1 P2 $ $ $ $ $ $ $ $ $ $

M M M M M Block X 5 Snoopy or Snoopy 6 Solving Cache Coherency • If no writes, multiple copies are fine • Two options: When a block is modified – Go out and update everyone else’s copy – Invalidate all other sharers and make them come back to you to get a fresh copy • “Snooping” caches using invalidation policy is most common – Caches monitor activity on the bus looking for invalidation messages – If another cache needs a block you have the latest version of, forward it to mem & others

Coherency using “snooping” & invalidation

P1 wants to writes X, if P2 attempts to read/write x, it will P1 forwards data to so it first sends Now P1 can safely P1 & P2 Reads X miss, & request the to P2 and memory 1 2 “invalidation” over 3 write X 4 5 the bus for all sharers block over the bus at same time P1 P2 P1 P2 P1 P2 P1 P2 P1 P2

$ $ $ $ $ $ $ $ $ $

Invalidate block X if you have M it M M M M Block X 7 Coherence Definition

• A memory system is coherent if the value returned on a Load instruction is always the value given by the latest Store instruction with the same address • This simple definition allows to understand the basic problems of private caches in MP systems

P P P P P P

X X X’ X X’ X

X X’ X

Original State Write-Through Cache Write-Back Cache

ISCA ‘90 Tutorial “Memory System Architectures for Tightly-coupled Multiprocessors”, Michel Dubois and Faye A. Briggs © 1990. 8 Write Through Caches

• The bus interface unit of each processor “watches” the bus address lines and invalidates the cache when the cache contains a copy of the block with modified word • The state of a memory block b in cache i can be described by the following state diagram – State INV: there is no copy of block b in cache i or if there is, it is invalidated – State VAL: there is a valid copy of block b in cache i

ISCA ‘90 Tutorial “Memory System Architectures for Tightly-coupled Multiprocessors”, Michel Dubois and Faye A. Briggs © 1990. 9 Write Through Snoopy Protocol

• R(k): Read of block b by processor k • W(k): Write into block b by processor k • Solid lines: action taken by the local processor • Dotted lines: action taken by a remote processor (incoming bus request)

R(i), W(i)

INV VAL R(i) i = Local cache W(i) j = Remote cache

W(j) 10 Bus vs. Processor Actions

• Cache block state (state and transitions maintained for each cache block) – Format of transitions: Input Action / Output Action – Pr = Processor Initiated Action – Bus = Consequent action on the bus

Bus = Action (initiated by another BusWrite / -- processor) appearing on the bus and BusReadX / -- noticed by our snoopy cache

PrWr / BusWrite VAL INV BusWrite / -- PrRd / -- BusReadX / --

PrRd / BusRd RdX = Since I do not have the block, I PrWr / BusRdX need to read the block. But since my intent is to write, I ask that others invalid their copies

Michel Dubois, Murali Annavaram and Per Stenström © 2011. 11 Action Definitions

Acronyms Description PrRd Processor Read PrWr Processor Write BusRd Read request for a block BusWrite Write a word to memory and invalidate other copies BusUpgr Invalid other copies BusUpdate Update other copies BusRdX Read block and invalidate other copies Flush Supply a block to a requesting cache S Shared line is activated ~S Shared line is deactivated

Michel Dubois, Murali Annavaram and Per Stenström © 2011. 12 Cache Block State Notes

• Note that these state diagrams are high-level – A state transition may take multiple clock cycles – The state transition conditions may violate all-inclusive or mutually-exclusive VAL requirements – There may be several other intermediate states – Events such as replacements may not have been covered 13 Coherence Implementation

P P

L1 L1 L1 Data L1 Data Tags Tags … Snoop Snoop Tag Tag Replica Replica

L1 Shared Bus L1 Dual directory of tags is maintained L2 L2 L2 L2 to facilitate Bank Bank/ Bank Bank/ snooping 14 Write Back Caches

• Write invalidate protocols (“Ownership Protocols”) • Basic 3-state (MSI) Protocol – I = INVALID: Replaced (not in cache) or invalidated – RO (Read-Only) = Shared: Processors can read their copy. Multiple copies can exist. Each processing having a copy is called a “Keeper” – RW (Read-Write) = Modified: Processors can read/write its copy. Only one copy exists. Processor is the “Owner”

ISCA ‘90 Tutorial “Memory System Architectures for Tightly-coupled Multiprocessors”, Michel Dubois and Faye A. Briggs © 1990. 15 Write Invalidate Snoopy Protocol

W(i) W(i) R(j) RW RO R(i) R(i)

W(j) W(j)

W(i) R(i) INV

ISCA ‘90 Tutorial “Memory System Architectures for Tightly-coupled Multiprocessors”, Michel Dubois and Faye A. Briggs © 1990. 16 Remote Read

Local View W(i)

W(i) R(j) If you have the RW RO R(i) R(i) only couple and another processor wants to read the W(j) W(j) data

W(i) R(i) INV

Remote View W(i) W(i) R(j) RW RO R(i) R(i) The other processor goes from invalid to W(j) W(j) read-only

W(i) R(i) INV 17 Local Write

Local View W(i) W(i) R(j) RW RO R(i) R(i) Upgrade your access W(j) W(j)

W(i) R(i) INV

Remote View W(i) W(i) R(j) RW RO R(i) R(i) Invalidate others’ copy so no one else has the block W(j) W(j)

W(i) R(i) INV 18 Remote Read

Local View W(i) W(i) R(j) RW RO R(i) R(i) No change

W(j) W(j)

W(i) R(i) INV

Remote View W(i) W(i) R(j) RW RO R(i) R(i) Remote processor gets a copy too W(j) W(j)

W(i) R(i) INV 19 Action Definitions

Acronyms Description PrRd Processor Read PrWr Processor Write BusRd Read request for a block BusWrite Write a word to memory and invalidate other copies BusUpgr Invalid other copies BusUpdate Update other copies BusRdX Read block and invalidate other copies Flush Supply a block to a requesting cache S Shared line is activated ~S Shared line is deactivated

Michel Dubois, Murali Annavaram and Per Stenström © 2011. 20 Write Invalidate Snoopy Protocol

PrRd / -- PrWr / -- M (RW)

PrRd / -- BusRd / BusRd / -- PrWr / Flush BusUpgr

BusRdX / PrWr/ Flush S BusRdX (RO) BusUpgr / -- BusRdX /-- PrRd / BusRd

I (INV)

BusRd / -- BusUpgr / -- BusRdX / --

Michel Dubois, Murali Annavaram and Per Stenström © 2011. 21 Remote Read PrRd / -- PrRd / -- Local View PrWr / -- Remote View PrWr / -- M M (RW) (RW)

PrRd / -- PrRd / -- BusRd / BusRd / -- PrWr / BusRd / BusRd / -- PrWr / Flush BusUpgr Flush BusUpgr

BusRdX / BusRdX / PrWr/ PrWr/ Flush Flush S BusRdX S BusRdX (RO) (RO) BusUpgr / -- BusUpgr / -- BusRdX /-- PrRd / BusRdX /-- PrRd / BusRd BusRd

I I (INV) (INV) BusRd / -- BusRd / -- BusUpgr / -- BusUpgr / -- BusRdX / -- BusRdX / --

I demote myself from Modified to Shared to let you promote yourself from Invalid to Shared

Michel Dubois, Murali Annavaram and Per Stenström © 2011. 22 Local Write PrRd / -- PrRd / -- Local View PrWr / -- Remote View PrWr / -- M M (RW) (RW)

PrRd / -- PrRd / -- BusRd / BusRd / -- PrWr / BusRd / BusRd / -- PrWr / Flush BusUpgr Flush BusUpgr

BusRdX / BusRdX / PrWr/ PrWr/ Flush Flush S BusRdX S BusRdX (RO) (RO) BusUpgr / -- BusUpgr / -- BusRdX /-- PrRd / BusRdX /-- PrRd / BusRd BusRd

I I (INV) (INV) BusRd / -- BusRd / -- BusUpgr / -- BusUpgr / -- BusRdX / -- BusRdX / --

I promote myself from Shared to Modified. Sorry, please demote yourself from Shared to Invalid

Michel Dubois, Murali Annavaram and Per Stenström © 2011. 23 Write Invalid Snoopy Protocol

• Read miss: – If the block is not present in any other cache, or if it is present as a Shared copy, then the memory responds and all copies remain shared – If the block is present in a different cache in Modified state, then that cache responds, delivers the copy and updates memory at the same time; both copies become Shared • Read Hit – No action is taken 24 Write Invalid Snoopy Protocol

• Write hit: – If the local copy is Modified then no action is taken – If the local copy is Shared, then an invalidation signal must be sent to all processors which have a copy 25 Write Invalid Snoopy Protocol

• Write miss: – If the block is Shared in other cache or not present in other caches, memory responds in both cases, and in the first case all shared copies are invalidated – If the block is Modified in another cache, that cache responds, then Invalidates its copy • Replacement – If the block is Modified, then memory must be updated 26 Coherency Example

Processor Bus Activity P1 $ P1 Block P2 $ P2 Block Memory Activity Content State Content State Contents (M,S,I) (M,S,I) - - - - A P1 reads BusRd A S - - A block X P2 reads BusRd A S A S A block X P1 writes BusUpgr B M - I A block X=B P2 reads BusRd / B S B S B block X Flush 27 Updated Coherency Example

Processor Bus Activity P1 $ P1 Block P2 $ P2 Block Memory Activity Content State Content State Contents (M,S,I) (M,S,I) - - - - A P1 reads BusRd A S - - A block X P1 writes BusUpgr B M - - A X=B P2 writes BusRdX / - I C M B X=C Flush P1 reads BusRd C S C S C block X 28 Problem with MSI

PrRd / -- PrWr / -- • Read miss followed M by write causes two (RW)

PrRd / -- bus accesses BusRd / BusRd / -- PrWr / Flush BusUpgr

• Solution: MESI BusRdX / PrWr/ Flush S BusRdX – New “Exclusive” state (RO) BusUpgr / -- that indicates you BusRdX /-- PrRd / BusRd have the only copy and can freely modify I (INV) BusRd / -- BusUpgr / -- BusRdX / -- 29 Exclusive State & Shared Signal

• Exclusive state avoid need to perform BusUpgr when moving from Shared to Modified even when no other copy exists • New state definitions: – Exclusive = only copy of unmodified (clean) cache block – Shared = multiple copies exist of modified (dirty) cache block • New “Shared” handshake signal is introduced on the bus – When a read request is placed on the bus, other snooping caches assert this signal if they have a copy – If signal is not asserted, the reader can assume exclusive access 30 Updated MESI Protocol

• Convert RO to two states: Shared & Exclusive

W(j) E R(i)

R(i)• ~S W(i) W(i) RW R(j) R(j) RO R(i) R(i) (M)

W(j) W(j) S R(i) R(i) W(i) INV (I) W(j)

R(i)•S 31 Updated MESI Protocol

• Final Resulting Protocol

W(i) W(i) R(i) M E R(i) (RW) W(j)

W(j) R(i)• ~S W(i) W(i) R(j)

R(j)

I S R(i) (INV) W(j)

R(i)•S 32 MESI

Processor Bus Activity P1 $ P1 Block P2 Block P3 Block Memory Activity Content State State State Contents (MESI) (MESI) (MESI) - - - - A P1 reads BusRdX A E - - A block X P1 writes - B M - - A X=B BusRd / P2 reads X B S S - B Flush P3 reads BusRd B S S S B block X

When P3 reads and the block is in the shared state, the slow memory supplies the data.

We can add an “Owned” state where one cache takes “ownership” of a shared block and supplies it quickly to other readers when they request it. The result is MOESI. 33 Owned State

• In original MSI, lowering from M to S or I causes a flush of the block – This also causes an updating of main memory which is slow • It is best to postpone updating main memory until absolutely necessary – The M=>S transition is replaced by M=>O – Main memory is left in the stale state until the Owner needs to be invalidated in which case it is flushed to main memory – In the interim, any other cache read request is serviced by the owner quickly • Summary: Owner is responsible for… – Supplying a copy of the block when another cache requests it – Transferring ownership back to main memory when it is invalidated 34 MOESI PrRd / -- PrWr / -- No need to do BusUpgr PrWr/BusUpgr M PrWr / -- BusRd / Flush PrWr/ BusRdX PrWr / BusRd / PrRd / -- BusUpgr BusRd / Flush ..or.. Flush

O S E

BusRdX / BusUpgr / -- PrRd•S / BusRdX / BusRd / Flush BusRdX /-- BusRd Flush Flush PrRd / -- I

BusUpgr / -- PrRd • ~S / BusRdX/Flush BusRd / -- BusRd BusUpgr / -- BusRdX / -- 35 Characteristics of Cached Data

` Shared, Modified O Ownership

Shared, Unmodified Exclusive, Modified S M Validity Exclusiveness Exclusive, Unmodified E

Invalid I

A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Futurebus, P. Sweazy and A. J. Smith © 1986. 36 MOESI State Pairs

“Intervenient” M O

“Only Cached Copy” “Shareable Data”

E S “Data Matches Owner”

I

A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Futurebus, P. Sweazy and A. J. Smith © 1986. 37

DIRECTORY-BASED COHERENCE 38 Symmetric Shared-memory Processor (SMP)

SMP = most multicore systems Multi-core • Share a single memory CPU Core Core Core Core • Using a single address space (w/regi (w/regi (w/regi (w/regi sters) sters) sters) sters) •

L1 Cache L1 Cache L1 Cache L1 Cache (UMA) latency from all cores

L2 Cache L2 Cache L2 Cache L2 Cache • Only up to approx. 32 cores

on-chip interconnect What about AWS x1.32xlarge with 64 cores? Multi-socket! L3 Cache • It uses 4 x (16-core CPU) • Memory/address space is Main Memory Addresses 0x0000 to 0xFFFF still shared, but latency is not uniform… (NUMA, see next) Shared-memory: each core can access/address the entire memory; Symmetric: uniform access time 39 Distributed Shared-memory System (DSM)

CPU interconnect (ring, mesh, e.g., Intel Xeon UPI or AMD Epyc Infinity Fabric …)

Multi-core Multi-core CPU CPU Core Core Core Core Core Core Core Core (w/regi (w/regi (w/regi (w/regi (w/regi (w/regi (w/regi (w/regi sters) sters) sters) sters) sters) sters) sters) sters)

L1 Cache L1 Cache L1 Cache L1 Cache L1 Cache L1 Cache L1 Cache L1 Cache

L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache

on-chip interconnect on-chip interconnect

L3 Cache L3 Cache

Main Memory Main Memory Addresses 0x00000 to 0x0FFFF Addresses 0x10000 to 0x1FFFF

Shared-memory: each core can access/address the entire memory; Distributed: more bandwidth; faster local memory 40 Datacenter Cluster

TCP/IP over Ethernet

I/O subsystem and network card I/O subsystem and network card Multi-core Multi-core CPU CPU Core Core Core Core Core Core Core Core (w/regi (w/regi (w/regi (w/regi (w/regi (w/regi (w/regi (w/regi sters) sters) sters) sters) sters) sters) sters) sters)

L1 Cache L1 Cache L1 Cache L1 Cache L1 Cache L1 Cache L1 Cache L1 Cache

L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache

on-chip interconnect on-chip interconnect

L3 Cache L3 Cache

Main Memory Main Memory Addresses 0x0000 to 0xFFFF Addresses 0x0000 to 0xFFFF

Memory is not shared! Nodes have different address spaces; use message-passing protocols or RPCs to exchange data 41 Directory-Based Coherence: Why? Memory is distributed to increase bandwidth Distributed shared-memory (DSM) used in multi-CPU systems • Each CPU has its own memory and DDR4 channels (34 GB/s) • We can write more data in parallel (more bandwidth) Snooping broadcasts are not scalable Snooping protocols require broadcasts to L1/L2 caches of all cores of all CPUs at every miss (read/write). Every core has to handle every miss event in the system (ignoring most)… not scalable! Solution: Directory-Based Coherence Protocols • Each CPU has a directory with state of blocks of its memory • It knows which local/remote cores have copies of the blocks • It forwards invalidate/data-fetch requests to those cores only Easy to implement at L3 cache; used also for SMPs (e.g., Intel i7) 42 Example of Directory at L3 cache

Core 1 Core 2 Core 1 Core 2 Core 1 Core 2 (w/regi (w/regi (w/regi (w/regi (w/regi (w/regi sters) sters) sters) sters) sters) sters)

0 1 2 3 L1 0 1 2 3 0 1 4 5 L1 0 1 2 3 ✘0 1 4 5 L1 ✘ 0 1 2 3 L2 0 1 2 3 0 1 4 5 L2 0 1 2 3 0 1 4 5 L2 miss 0,1,2,3 miss 0,1,4,5 invalidate 0 invalidate 0 0 1 2 3 0 1 2 3 0 1 2 3 L3 Cache L3 Cache L3 Cache 4 5 4 5 a) Core1 reads blocks 0,1,2,3; b) Core2 reads blocks 0,1,4,5; c) To write to block 0 in L1, Core1 asks read miss to directory; data received read miss to directory; data received the directory to send an invalidate message to nodes with the block; Core 1 Core 2 Core 1 Core 2 Core2 receives it and invalidates; (w/regi (w/regi (w/regi (w/regi then, Core1 can modify block 0. sters) sters) sters) sters) Key Points 0 1 2 3 1 4 5 L1 0 1 2 3 0 1 4 5 L1 • The directory forwards 0 1 2 3 1 4 5 L2 0 1 2 3 0 1 4 5 L2 invalidate/data requests 0 miss 0 0 • … only to cores with the specific 0 1 2 3 0 1 2 3 L3 Cache L3 Cache block (better , more 4 5 4 5 latency) d.1) Core2 sends read miss to directory; d.2) Directory sends modified version which asks Core1 to writeback to L3 of the block to Core2 43 Directory-Based Protocol: At each cache

Uncache local read miss: send miss to directory d Shared local read hit No core remote invalidate command Many has block clean local read miss: send miss to directory, copies receive

Each L1/L2 controller uses these state transitions to decide the state

: : send miss dir to of blocks in its local L1/L2 caches.

invalidate command: invalidate -

write backwrite Very similar to snooping protocol, but invalidation/fetch interacts with the directory (at the CPU where it is

local local miss write stored in memory). remote fetch remote Fetch/invalidate commands received Modified only for blocks managed by this core Cached at owner HOW TO READ transition initiated by local event: actions taken core local write miss: local write/read hit write back, send miss to directory transition initiated by block directory commands 44 Directory-Based Protocol: At each directory

Uncache remote read miss at i: S = {i}, send data to i d Shared No core Copies at has block core set remote read miss at i:

S S = S + {i}, send data to i

: :

i

i = = {}

S Each L1/L2 controller uses these state transitions to decide the state of blocks in its local L1/L2 caches.

}, }, send data to Very similar to snooping protocol, i

= { but invalidation/fetch interacts with

S S remote miss write at remote

remote writeback: remote the directory (at the CPU where it is stored in memory). Fetch/invalidate commands received Modified only for blocks managed by this core Cached at owner HOW TO READ core remote write miss at i: send fetch-invalidate to owner, S = {i}, send data to i transition initiated by remote events from core i 45

BACKUP 46 Write Invalidate Snoopy Protocol

W(i) W(i) R(j) RW RO R(i) Dual directory of R(i) tags is maintained to facilitate snooping W(j) W(j)

W(i) R(i) INV

W(i) W(i) R(j) RW RO R(i) R(i)

W(j) W(j)

W(i) R(i) INV 47 Write Through Caches

• The bus interface unit of each processor “watches” the bus address lines and invalidates the cache when the cache contains a copy of the block with the modified word 48 Cache Hierarchy

• A hierarchy of cache can help mitigate P the cache miss penalty

• L1 Cache L1 Cache – 64 KB – 2 cycle access time – Common Miss Rate ~ 5% L2 Cache • L2 Cache – 1 MB – 20 cycle access time L3 Cache – Common Miss Rate ~ 1% • Main Memory – 300 cycle access time Memory 49 Credits

• Some of the material in this presentation is taken from: – : A Quantitative Approach • John Hennessy & David Patterson • Some of the material in this presentation is derived from course notes and slides from – Prof. Michel Dubois (USC) – Prof. Murali Annavaram (USC) – Prof. David Patterson (UC Berkeley)