Foundations: Shared-Memory Systems and Coherence

What is the meaning of Memory when you have multiple access ports into global memory? What if you have caches?

[Figure: three processors Pa, Pb, Pc issue interleaved reads and writes (ra1, ra2, wa3; rb1, wb2, wb3; wc1, wc2, rc3, rc4) through separate ports into global memory]

Sequential consistency: the final state (of memory) is as if all reads and writes were executed in some fixed serial order (with each processor's program order also maintained) → Lamport

[This notion borrows from similar notions of sequential consistency in transaction processing systems.]
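To make the definition concrete, here is a minimal C sketch (illustrative, not from the lecture) of the classic two-thread litmus test. Under sequential consistency some write must come first in the serial order, so at least one of r1, r2 must end up 1; weaker memory models can produce r1 = r2 = 0.

    #include <pthread.h>
    #include <stdio.h>

    /* Shared locations. volatile only keeps the compiler from caching
       them in registers; it does NOT give sequential consistency on
       real hardware, which is exactly the point of the example. */
    volatile int x = 0, y = 0;
    volatile int r1, r2;

    void *t1(void *arg) { x = 1; r1 = y; return NULL; }
    void *t2(void *arg) { y = 1; r2 = x; return NULL; }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Under SC, r1 == 0 && r2 == 0 is impossible. */
        printf("r1=%d r2=%d%s\n", r1, r2,
               (r1 == 0 && r2 == 0) ? "  <- not sequentially consistent!" : "");
        return 0;
    }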

6.173 Fall 2010, Agarwal

A hardware designer's physical perspective of sequential consistency

[Figure: two processors P, each with a cache, share a physical Memory; MEM holds foo1 through foo4 at foo's home location, and each cache holds copies of foo1 through foo4]

Flush foo* from cache, wait till done.

Does it always work?

[Figure: processors Pa, Pb, Pc issue their interleaved reads and writes into memory, as in the earlier diagram]

Key: using a fence to wait until the flush is done is the mechanism that guarantees sequential consistency.

We will revisit this in more detail shortly
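A minimal C sketch of that flush-then-fence discipline. flush_cache_range is a hypothetical stub (real ISAs expose this as a loop of cache-flush instructions); the fence is real C11, but whether it also waits for the flush to complete is platform-specific, which is the subtlety the lecture revisits.

    #include <stdatomic.h>

    /* Stub for illustration; a real version would issue one ISA
       cache-flush instruction per line covering [p, p+len). */
    static void flush_cache_range(void *p, unsigned long len) {
        (void)p; (void)len;
    }

    struct foo { int foo1, foo2, foo3, foo4; };
    static struct foo foo_data;          /* shared, at foo's home in memory */

    void munge_and_publish(void) {
        foo_data.foo1 = 42;                              /* MUNGE foo        */
        flush_cache_range(&foo_data, sizeof foo_data);   /* Flush foo*       */
        atomic_thread_fence(memory_order_seq_cst);       /* Fence: wait till
                                                            done/visible     */
    }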


One other cache nasty to watch out for

[Figure: MEM holds foo1 through foo4 plus another shared item yyy; foo1 (old value xxx) and yyy fall into the same cache line, and each processor's cache holds a copy of that line]

One processor updates foo1 and flushes foo* from its cache, waiting till done. The other processor updates yyy and flushes yyy from its cache, waiting till done. Each flush writes back a whole cache line, including that cache's possibly stale copy of the other item.

Correct final value: foo1 = yyy
Wrong final value: foo1 = xxx (the stale copy of foo1 was written back along with yyy's line)

This problem is called "False Sharing". It leads to bugs with software coherence and to poor performance with hardware coherence. Solutions? Pad shared data structures so multiple shared items do not fall into the same cache line, as in the sketch below.
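A minimal padding sketch in C, assuming a 64-byte line (an assumption; check the target's actual line size). The struct and names are illustrative:

    #include <assert.h>
    #include <stdint.h>

    #define CACHE_LINE 64   /* assumed line size in bytes */

    /* Each counter is aligned to, and fills, one cache line, so two
       cores updating different counters never share a line. */
    struct padded_counter {
        _Alignas(CACHE_LINE) volatile uint64_t value;
    };

    static_assert(sizeof(struct padded_counter) == CACHE_LINE,
                  "one counter per cache line");

    struct padded_counter counters[8];   /* e.g., one per core */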

Summary of New Multicore Instructions
• Send message
• Receive message
• Synchronization
  – Barrier
  – Test-and-set (see the sketch below)
  – F&A and relatives (e.g., F&Op, CmpXch)
• Flush cache line
• Memory fence
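As one concrete use of these primitives, a minimal spinlock built on test-and-set. The __sync builtins are a GCC/Clang assumption, not part of the lecture:

    /* __sync_lock_test_and_set atomically exchanges the word with 1
       and returns the old value (acquire semantics). */
    static volatile int lock_word = 0;

    void lock(void) {
        while (__sync_lock_test_and_set(&lock_word, 1))
            ;                            /* old value 1: held, so spin */
    }

    void unlock(void) {
        __sync_lock_release(&lock_word); /* store 0 with release semantics */
    }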


Outline
• Memory architecture
• Cache coherence in small multicores
• Cache coherence in manycores

Recall: the Shared Memory Algorithmic Model

[Figure: processors P, P, …, P all issue writes (wrt) and reads against a single Shared Memory]


Shared Memory Structures in Parallel Computers

[Figure: three organizations. Monolithic memory: processors P … P share one memory M through a network. Distributed memory: memories M M M M sit across the network from the processors and their caches C. Distributed, local memory: each processor/cache pair gets its own memory M, all joined by the network.]

Like Legos, we can move the Ps, Cs and Ms around. But what about multicore chips?

Shared-Memory Structure in Cutting Edge Multicores

[Figure: a multicore chip: tiles of processor P, cache C and memory controller connected by an on-chip ring network, with memories M at the edges of the chip]


Shared-Memory Structure in Cutting Edge Multicores

Tile processor: 64 cores

[Figure: a distributed multicore chip: an 8x8 mesh network joins tiles, each with a processor P, cache C and memory controller; memories M sit at the chip's edges]

Caches and Cache Coherence

[Figure: processors P with caches C reach memories M over a network]

A World Without Caches

[Figure: every read (rd) crosses the network from a processor P to a memory M and back]

With Caches

[Figure: each processor P now has a cache C; a read (rd) that hits in the cache is satisfied locally, without crossing the network]


How are Caches Different from Fast Local Memory (SRAM)?

Key insight: why use a cache when local memory exists?

Anatomy of a common-case LD operation (LD A):
  If A is replicated in the local store,
    then fetch it from the local store (HW: 1 cycle; SW: 10 cycles)
  else
    send a message over the network to get A from DRAM (HW: 100 cycles; SW: 110 cycles)

[Figure: processors P with local stores m and memories M on a network]

We can do all of this in hardware too; this is what typical caches do. When the check-and-fetch is done in hardware, we call the store a cache! Discuss.
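A minimal sketch of the software version of that load path: a direct-mapped, software-managed local store. The names (sw_load, remote_fetch) and sizes are illustrative assumptions:

    #include <stdint.h>

    #define LINES 1024                 /* local-store capacity (assumed) */

    static uint32_t tags[LINES];       /* which address each slot holds  */
    static uint32_t data[LINES];
    static int      valid[LINES];

    /* Stand-in for the miss path: a message to A's home DRAM
       (~100 cycles in HW, ~110 total in SW). Returns dummy data here. */
    static uint32_t remote_fetch(uint32_t addr) { (void)addr; return 0; }

    /* Software LD A: the hit path is the 10-cycle case, the miss
       path the 110-cycle case. */
    uint32_t sw_load(uint32_t addr) {
        uint32_t i = addr % LINES;            /* direct-mapped placement */
        if (valid[i] && tags[i] == addr)      /* A replicated locally?   */
            return data[i];                   /* fetch from local store  */
        data[i] = remote_fetch(addr);         /* else go over the network */
        tags[i] = addr;
        valid[i] = 1;
        return data[i];
    }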


Cache Coherence Problem

[Figure: two processors' caches hold copies of the same location; one processor writes (wrt) its copy while the other cache still holds the old value. Which value should a later read return? That is the coherence problem.]

Solving the Coherence Problem
– Small multicores
  > Software coherence
  > Snooping caches
– Manycores
  > Software coherence
  > Full-map directories
  > Limited pointers
  > Chained pointers
    · Singly linked
    · Doubly linked
  > LimitLESS schemes
  > Hierarchical methods

We will study: coherence structures, coherence protocols, cache-side state diagrams, and directory-side state diagrams.


Software Coherence: Saw this before

    GET_foo_LOCK
      MUNGE foo data structure (write x, y, z, …)
      Flush foo* from cache
      Fence: wait till changes that result from the flush are visible to everyone
    RELEASE_foo_LOCK

We can stick the locking, flushes and fences in library code to provide clean abstractions.

Hardware Cache Coherence: Snooping Caches

[Figure: two processors with dual-ported caches (one port for the processor, one for snooping the tags) share a bus or ring to Shared Memory. (1) a processor writes a; (2) the address is broadcast; (3) the other cache snoops it; (4) it matches its tags; (5) it invalidates its copy of a.]

• Works for small multicores (memory off chip)
• Broadcast the address on a shared write
• Everyone listens (snoops) on the bus/ring to see if any of their own addresses match
• Invalidate the copy on a match
• How do you know when to broadcast and invalidate?
  – State associated with each cache line
  – Key benefit: no global state in main memory

Let's look at this in more detail next…
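A minimal sketch of wrapping that sequence in library code, so callers never see the flushing or fencing. All of the helpers here are hypothetical stubs standing in for whatever the platform provides; the __sync builtins are GCC/Clang-specific:

    /* Hypothetical platform hooks, stubbed for illustration. */
    static volatile int foo_lock;
    static void lock(volatile int *l)   { while (__sync_lock_test_and_set(l, 1)) ; }
    static void unlock(volatile int *l) { __sync_lock_release(l); }
    static void flush_cache_range(void *p, unsigned long n) { (void)p; (void)n; }
    static void fence(void)             { __sync_synchronize(); }

    struct foo { int x, y, z; };
    static struct foo foo_data;

    /* The library entry point: callers just supply the MUNGE step and
       never see the locking, flushing, or fencing. */
    void foo_update(void (*munge)(struct foo *)) {
        lock(&foo_lock);                                /* GET_foo_LOCK      */
        munge(&foo_data);                               /* MUNGE (x, y, z)   */
        flush_cache_range(&foo_data, sizeof foo_data);  /* Flush foo*        */
        fence();                                        /* wait till visible */
        unlock(&foo_lock);                              /* RELEASE_foo_LOCK  */
    }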

Hardware Cache Coherence: Snooping Caches, Invalidate versus Update

[Figure: the same snooping organization: dual-ported caches on a shared bus or ring to Shared Memory, with the write (1), broadcast (2), snoop (3), match (4), update or invalidate (5) steps]

• Broadcast the address on a shared write
• Everyone listens (snoops) on the bus/ring to see if any of their own addresses match
• If an address matches, either:
  – Invalidate the local copy (called an invalidate or ownership protocol), OR
  – Update the local copy with the new data from the bus (the writer must broadcast the value along with the address)
Only a cache-side state machine is needed.

Update versus Invalidate Protocols

Tradeoffs between:
• Update protocols
• Ownership protocols
Update is better when there is poor write locality; invalidate is better otherwise.

Competitive snooping idea:
– Do write updates
– If more than a "few" updates, then switch to ownership
"Few" → switch mode when the cost of all updates so far equals the cost of an invalidation. The cost of this approach is no worse than twice the optimal (try to prove this). "Competitive algorithms are cool."

Discuss paper.
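A minimal sketch of the competitive switch as per-line logic a snooping cache could keep. INVALIDATE_COST and the names are illustrative assumptions:

    #define INVALIDATE_COST 8   /* assumed relative cost of one invalidation */

    struct line_state {
        int valid;
        int credit;             /* remaining update budget for this line */
    };

    /* Called when a remote write to this line is snooped. Returns 1 if we
       keep the line (apply the update), 0 if we drop it (invalidate).
       Once the accumulated update cost reaches the invalidation cost,
       switching loses at most a factor of two versus the offline optimum. */
    int on_remote_write(struct line_state *s) {
        if (--s->credit > 0)
            return 1;           /* still cheap: apply the update */
        s->valid = 0;           /* budget spent: invalidate      */
        return 0;
    }

    /* A local access re-arms the budget: recent use suggests the line
       is worth keeping updated again. */
    void on_local_access(struct line_state *s) {
        s->credit = INVALIDATE_COST;
    }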

Snooping Caches: Definitions

[Figure: the snooping organization once more: dual-ported cache tags on a shared bus or ring, with the write (1), broadcast (2), snoop (3), match (4), update (5) steps]

• For each address a, the cache keeps a cache-side state machine; the state is stored with the cache tags.
• Transition labels: "My local request" and "My local response" on the processor side; "Ext. bus request" and "My bus response" on the bus side.
• Assume the cache block size is one word for now; let's deal with the cache-block complexity later.

State diagram for ownership protocols

Three states per address:
• "Invalid": invalid
• "Shared": read-clean (shared data)
• "Modified": write-dirty
This is "MSI"; variants such as MESI and MOESI exist.


State diagram for cache block in ownership protocols
(a: address; transitions read "event: action")
• invalid, on Local Read: fetch block; go to read-clean
• invalid, on Local Write: broadcast a, fetch block; go to write-dirty
• read-clean, on Local Write: broadcast a; go to write-dirty
• read-clean, on Remote Write or local replace: go to invalid
• write-dirty, on Remote Read: update memory; go to read-clean
• write-dirty, on Remote Write or local replace: update memory; go to invalid

In the ownership protocol, the writer owns an exclusive copy.

State diagram for update protocols
(a: address; v: the written value)
• invalid, on Local Read: fetch block; go to read-clean
• invalid, on Local Write: broadcast a, v; fetch block; go to write-dirty
• read-clean, on Remote Write: update local copy; stay in read-clean
• read-clean, on Local Write: broadcast a, v; go to write-dirty
• write-dirty, on Local Write: broadcast a, v; stay in write-dirty
• write-dirty, on Remote Write: update local copy
• write-dirty, on local replace: update memory; go to invalid
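A minimal sketch of the ownership-protocol (MSI) cache-side state machine as C code, one function per event source. The event and action names are illustrative; a real controller is per-line hardware:

    enum state { INVALID, READ_CLEAN, WRITE_DIRTY };

    enum local_event { LOCAL_READ, LOCAL_WRITE, LOCAL_REPLACE };
    enum bus_event   { REMOTE_READ, REMOTE_WRITE };

    /* Assumed actions, standing in for the bus and memory hardware. */
    extern void broadcast(unsigned addr);
    extern void fetch_block(unsigned addr);
    extern void update_memory(unsigned addr);

    /* My local requests drive one set of transitions... */
    enum state on_local(enum state s, enum local_event e, unsigned a) {
        switch (e) {
        case LOCAL_READ:
            if (s == INVALID) { fetch_block(a); return READ_CLEAN; }
            return s;                       /* hit: no bus traffic        */
        case LOCAL_WRITE:
            if (s == INVALID)    { broadcast(a); fetch_block(a); }
            if (s == READ_CLEAN) { broadcast(a); }
            return WRITE_DIRTY;             /* writer owns exclusive copy */
        case LOCAL_REPLACE:
            if (s == WRITE_DIRTY) update_memory(a);
            return INVALID;
        }
        return s;
    }

    /* ...and snooped external bus requests drive the other. */
    enum state on_bus(enum state s, enum bus_event e, unsigned a) {
        if (e == REMOTE_READ && s == WRITE_DIRTY) {
            update_memory(a);               /* supply the dirty data      */
            return READ_CLEAN;
        }
        if (e == REMOTE_WRITE && s != INVALID) {
            if (s == WRITE_DIRTY) update_memory(a);
            return INVALID;                 /* invalidate on remote write */
        }
        return s;
    }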


Maintaining coherence in manycores

• Software coherence – saw this before
• Hardware coherence
  > Full-map directories
  > Limited pointers
  > Chained pointers
    · Singly linked
    · Doubly linked
  > LimitLESS schemes
  > Hierarchical methods
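As a preview of the first hardware option, a minimal sketch of a full-map directory entry: one presence bit per cache plus a dirty bit, kept at each block's home memory. NCACHES and the names are illustrative:

    #include <stdint.h>

    #define NCACHES 64            /* assumed core/cache count */

    /* Full-map directory: the home memory records, for every block,
       exactly which caches hold a copy and whether one is exclusive. */
    struct dir_entry {
        uint64_t presence;        /* bit i set => cache i has a copy   */
        uint8_t  dirty;           /* 1 => exactly one exclusive copy   */
    };

    /* On a write, the home invalidates every other sharer. */
    void on_write(struct dir_entry *e, int writer,
                  void (*send_invalidate)(int cache)) {
        for (int i = 0; i < NCACHES; i++)
            if (i != writer && ((e->presence >> i) & 1))
                send_invalidate(i);
        e->presence = 1ull << writer;
        e->dirty = 1;
    }

Note that the presence field grows with the number of caches; that storage cost is exactly what the limited-pointer, chained-pointer, and LimitLESS schemes on this slide are designed to reduce.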

