Why Parallel Architecture

Performance results [§5.4] What cache line size performs best? Which protocol is best to use?

Questions like these can be answered by simulation. However, getting the answer right is part art and part science.

Parameters need to be chosen for the simulator. The authors selected a single-level, 4-way set-associative, 1-MB cache with 64-byte lines.

The simulation assumes an idealized memory model, in which every reference takes constant time. Why is this not realistic?

The simulated workload consists of six parallel programs from the SPLASH-2 suite and one multiprogrammed workload, consisting mainly of serial programs.

Effect of coherence protocol [§5.4.3] Three coherence protocols were compared:

• The Illinois MESI protocol (“Ill”, left bar).
• The three-state invalidation protocol (“3St”) with bus upgrade for S→M transitions. (This means that instead of rereading data from main memory when a block moves to the M state, we just issue a bus transaction invalidating the other copies.)
• The three-state invalidation protocol without bus upgrade (“3St-RdEx”). (This means that when a block moves to the M state, we reread it from main memory.)
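To make the difference concrete, here is a minimal sketch in C (illustrative only, not the authors’ simulator) of what each three-state variant puts on the bus when a processor writes a block it holds in the Shared state; the 64-byte figure matches the simulated line size.

    #include <stdio.h>

    typedef enum { INVALID, SHARED, MODIFIED } State;

    /* Write hit to a block currently in state SHARED. */
    State write_hit_shared(int use_upgrade) {
        if (use_upgrade) {
            /* 3St: BusUpgr carries no data -- it only invalidates
               the other cached copies. */
            printf("BusUpgr: address only, 0 data bytes\n");
        } else {
            /* 3St-RdEx: BusRdX rereads the whole 64-byte line from
               memory, even though we already hold a valid copy. */
            printf("BusRdX: address + 64 data bytes\n");
        }
        return MODIFIED;   /* either way, the block ends up Modified */
    }

    int main(void) {
        write_hit_shared(1);   /* 3St      */
        write_hit_shared(0);   /* 3St-RdEx */
        return 0;
    }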


In our parallel programs, which protocol seems to be best? Somewhat surprisingly, the result turns out to be the same for the multiprogrammed workload. The reason for this? The advantage of the four-state protocol is that no bus traffic is generated on E→M transitions. But E→M transitions are very rare (less than 1 per 1K references).

[Figure: Bus traffic (MB/s), divided into address-bus and data-bus traffic, for Barnes, LU, Ocean, Radiosity, Radix, and Raytrace under the Ill, 3St, and 3St-RdEx protocols.]

Effect of cache line size [§5.4.4] Recall from Lecture 11 that cache misses can be classified into four categories:

• Cold misses (called “compulsory misses” in the previous discussion) occur the first time that a block is referenced.
• Conflict misses are misses that would not occur if the cache were fully associative with LRU replacement.
• Capacity misses occur when the cache size is not sufficient to hold data between references.
• Coherence misses are misses caused by the coherence protocol.

Coherence misses can be divided into those caused by true sharing and those caused by false sharing. False-sharing misses are those caused by having a line size larger than one word. Can you explain?
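One way to see false sharing is the sketch below (names and iteration counts are illustrative), assuming 64-byte lines as in the simulations. The two threads never touch the same word, yet each increment invalidates the line in the other processor’s cache.

    #include <pthread.h>
    #include <stdio.h>

    /* x and y almost certainly occupy the same 64-byte line: false
       sharing. Putting y in its own line (e.g., 56 bytes of padding
       after x) would eliminate the coherence misses without changing
       the result. */
    struct { long x; long y; } counters;

    void *bump_x(void *arg) {
        for (long i = 0; i < 10000000; i++)
            counters.x++;    /* invalidates the line in the other cache */
        return NULL;
    }

    void *bump_y(void *arg) {
        for (long i = 0; i < 10000000; i++)
            counters.y++;    /* ...and vice versa */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_x, NULL);
        pthread_create(&t2, NULL, bump_y, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", counters.x, counters.y);
        return 0;
    }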

True-sharing misses, on the other hand, occur when a processor writes some words into a cache block, invalidating the block in another processor’s cache, after which the other processor reads one of the modified words.

How could we attack each of the four kinds of misses?

• To reduce capacity misses, we could
• To reduce conflict misses, we could
• To reduce cold misses, we could
• To reduce coherence misses, we could

If we increase the line size, the number of coherence misses might go up or down. Why? As we increase the line size, false-sharing misses increase. However, true-sharing misses decrease, because more data is brought in on each miss.

Increasing the line size has other disadvantages.

• It increases conflict misses. Why?

• It increases bus traffic. Why?

So it is not clear which line size will work best.

[Figure: Miss rate (%), from 0 to 0.6, vs. line size (8 to 256 bytes) for Barnes, LU, and Radiosity. Each bar is divided into cold, capacity, true-sharing, false-sharing, and upgrade components.]

Which line size do the results for the first three applications suggest is best?

For the second set of applications, Radix shows a greatly increasing number of false-sharing misses with increasing block size.

[Figure: Miss rate (%), from 0 to 12, vs. line size (8 to 256 bytes) for Ocean, Radix, and Raytrace, divided into the same five components.]

However, this is not the whole story. Larger line sizes also create more bus traffic.


[Figure: Bus traffic (bytes/instruction) for Barnes, Radiosity, and Raytrace at line sizes of 8, 16, 32, 64, 128, and 256 bytes, divided into address-bus and data-bus traffic.]

With this in mind, which line size would you say is best?

Invalidate vs. update [§5.4.5]

Which is better, an update or an invalidation protocol? At first glance, it might seem that update schemes would always be superior to write-invalidate schemes. Why might this be true?

Why might this not be true?

When there are not many “external rereads,”

When there is a high degree of sharing,

Update and invalidation schemes can be combined (see §5.4.5). For example, in a producer-consumer pattern,

Let’s look at real programs.

[Figure: Miss rate (%) for LU, Ocean, Radix, and Raytrace under invalidate (inv) and update (upd) protocols, with misses divided into cold, capacity, true-sharing, and false-sharing components.]

Where there are many coherence misses,

If there were many capacity misses,

So let’s look at bus traffic …

Note that in two of the applications, updates in an update protocol are much more prevalent than upgrades in an invalidation protocol. Each of these operations produces bus traffic; therefore, the update protocol causes more traffic.

The main problem is that one processor tends to write a block multiple times before another processor reads it. This causes several bus transactions instead of one, as there would be in an invalidation protocol.

[Figure: Upgrade/update rate (%) for LU, Ocean, Raytrace, and Radix under invalidate (inv), update (upd), and, for Ocean and Radix, mixed (mix) protocols.]
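A sketch of the troublesome pattern (illustrative, not from CS&G): the producer writes all 16 words of a 64-byte line before the consumer reads any of them, so an update protocol pays 16 bus updates where an invalidation protocol pays one upgrade plus a single consumer miss.

    #define WORDS_PER_LINE 16            /* 64-byte line, 4-byte words */

    volatile int line[WORDS_PER_LINE];   /* assume: one shared cache line */

    void producer(void) {
        for (int w = 0; w < WORDS_PER_LINE; w++)
            line[w] = w;     /* update: 16 bus updates; invalidate: 1 BusUpgr */
    }

    int consumer(void) {
        int sum = 0;
        for (int w = 0; w < WORDS_PER_LINE; w++)
            sum += line[w];  /* invalidate: a single miss fetches the line */
        return sum;
    }

    int main(void) {
        producer();
        return consumer() == 120 ? 0 : 1;   /* 0 + 1 + ... + 15 = 120 */
    }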

In addition, updates cause problems in non-bus-based multiprocessors.

Extending Cache Coherence [§6.6] There are several ways in which cache coherence can be extended. These ways interact with other aspects of the architecture.

Considering these issues yields a good understanding of where caches fit in the overall hardware/software architecture.

Shared-cache designs [§6.6.1] Why not let one cache serve multiple processors? This could be a 1st-level cache, a 2nd-level cache, etc.

[Figure: Processors P1 … Pn connected through a switch to a shared interleaved cache, backed by interleaved main memory.]

What would be the advantages?

• It eliminates the need for cache coherence.

• It may decrease the miss ratio. Why?

• It may reduce the need for a large cache. Why?

• It allows more effective use of long cache lines.

• It allows very fine-grained sharing, because communication latency between processors is very low.

Examples of such designs include the Alliant FX-8 (ca. 1985), which shared a cache among 8 processors, and the Encore Multimax (ca. 1987), which shared a cache between two processors.

Today, shared caches are being considered for single-chip multiprocessors, with 4–8 processors sharing an on-chip first-level cache.

What are some disadvantages of shared caches?

• Latency.

• Bandwidth.

• Speed of cache.

• Destructive interference—the converse of overlapping working sets.

Using shared second-level caches moderates both the advantages and disadvantages of shared first-level caches.

Virtually indexed caches and cache coherence [§6.6.2] Why not just store information in the cache by virtual address rather than physical address?

What would be the advantage of this?

[A TLB is like a cache that translates a virtual address of the form (page #, displacement) to a physical address of the form (frame #, displacement). Diagram: a TLB entry mapping a page number to a frame number.]
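A sketch of that translation in C, with a hypothetical eight-entry page table and an assumed 4-KB page size (neither comes from the notes):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 12u     /* assumed 4-KB pages: 12-bit displacement */

    /* Hypothetical tiny page table: page # -> frame #. */
    static const uint32_t frame_of[8] = {3, 7, 0, 5, 1, 6, 2, 4};

    uint32_t translate(uint32_t vaddr) {
        uint32_t page = vaddr >> PAGE_BITS;               /* page #       */
        uint32_t disp = vaddr & ((1u << PAGE_BITS) - 1);  /* displacement */
        return (frame_of[page] << PAGE_BITS) | disp;      /* frame #, disp */
    }

    int main(void) {
        /* page 2 maps to frame 0, so 0x2345 -> 0x0345 */
        printf("0x%04x -> 0x%04x\n", 0x2345u, translate(0x2345u));
        return 0;
    }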

With large caches, it becomes increasingly difficult to do address translation in parallel with cache access. Why?

Recall this diagram from Lecture 10: an address divided into page-number, set-number, and byte-within-block fields. What happens to the set number as the cache increases in size?
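A worked check using this lecture’s simulated cache, under an assumed (for illustration) 4-KB page size:

    sets = 1 MB / (4 ways × 64 B/line) = 4096  →  12 set-number bits
    set bits + byte-within-block bits = 12 + 6 = 18
    untranslated page-offset bits (4-KB page)  = 12

Since 18 > 12, six bits of the set number lie inside the page number and are unknown until translation completes, so the cache lookup can no longer start in parallel with translation.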

Rather than slow down every instruction by serializing address translation and cache access, it makes sense just to use a virtually addressed cache.

But virtually addressed caches have their own problems.

One problem occurs when memory is shared. Two processes may have different virtual names for the same block.

[Diagram: Process 1’s page table and Process 2’s page table both pointing to the same page.]

What’s wrong with this?

Why are duplicate copies of information bad? Think about what happens on a process switch. Assume

• The processor has been running Process 1. • It then switches to Process 2. • Later it switches back to Process 1.

This is called the synonym problem.

Solution:

• For each cache block, keep both a virtual tag and a physical tag. This allows a virtual address to be found, given a physical address, or vice versa.

• When a cache miss occurs, look up the physical address to find its virtual address (if any) within the cache.
• If the block is already in the cache, rename it to a new virtual address rather than reading in the block from memory.

Let’s see how this works in practice.

1. Look up referenced virtual address in cache.
2. Is virtual address found in cache? If yes, go to step 3; if no, go to step 4.
3. Cache hit. Proceed with reference.
4. Use translated physical address to look up physical tags.
5. Is physical address found in cache? If yes, go to step 7; if no, go to step 6.
6. Cache miss; fetch data from lower level in memory hierarchy, writing back block(s) if necessary.
7. Possibility of synonym in cache.
8. Move cache block B from synonym virtual index to replace block C at current virtual index, writing back block C if necessary.

Notes:

Step 4: We are looking for another copy of the same block being used by another process, with a different virtual address.

Step 6: Block at virtual address needs to be written back if it has been modified. The newly fetched block’s physical index may be the same as another block in cache; if so, that block needs to be written back too.

Step 8: If another process q was using this block and tries to access it again, it will get a miss on virtual address, but will find it by physical address. What will it therefore do with this block?
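A minimal runnable sketch of this flow in C may help. The direct-mapped organization, the tiny page table (which, echoing the example later in these notes, maps pages 2 and 7 to the same frame), and the linear physical-tag search are all simplifications; a real design uses the physically indexed tag memory of Figure 6.24, described next.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES 64u
    #define LINES      1024u

    typedef struct {
        bool     valid, dirty;
        uint32_t v_tag, p_tag;   /* full virtual/physical line numbers */
    } Line;

    static Line cache[LINES];

    static uint32_t index_of(uint32_t a) { return (a / LINE_BYTES) % LINES; }

    /* Hypothetical page table: pages 2 and 7 both map to frame 0x12.
       4-KB pages assumed. */
    static uint32_t translate(uint32_t va) {
        static const uint32_t frame_of[16] =
            {0, 1, 0x12, 3, 4, 5, 6, 0x12, 8, 9, 10, 11, 12, 13, 14, 15};
        return (frame_of[va >> 12] << 12) | (va & 0xFFFu);
    }

    /* Steps 4-5: find a cached copy by physical address. A real design
       uses the second, physically indexed tag memory; a linear search
       keeps the sketch short. */
    static Line *find_by_physical(uint32_t pa) {
        for (uint32_t i = 0; i < LINES; i++)
            if (cache[i].valid && cache[i].p_tag == pa / LINE_BYTES)
                return &cache[i];
        return NULL;
    }

    static void writeback(Line *l) { l->dirty = false; /* ...to memory... */ }

    void access_cache(uint32_t va) {
        Line *c = &cache[index_of(va)];
        if (c->valid && c->v_tag == va / LINE_BYTES) {   /* steps 2-3 */
            puts("hit");
            return;
        }
        uint32_t pa = translate(va);                     /* step 4 */
        Line *b = find_by_physical(pa);                  /* step 5 */
        if (b) {                                         /* step 7: synonym */
            if (b != c) {
                if (c->valid && c->dirty) writeback(c);  /* step 8: evict C */
                *c = *b;                                 /* move block B */
                b->valid = false;
            }
            c->v_tag = va / LINE_BYTES;  /* rename; no memory access needed */
            puts("synonym: renamed");
        } else {                                         /* step 6: real miss */
            if (c->valid && c->dirty) writeback(c);
            c->valid = true;  c->dirty = false;
            c->v_tag = va / LINE_BYTES;
            c->p_tag = pa / LINE_BYTES;
            puts("miss: fetched from memory");
        }
    }

    int main(void) {
        access_cache(0x7000);   /* P1 touches its page 7: ordinary miss */
        access_cache(0x2000);   /* P2's page 2, same frame: synonym     */
        access_cache(0x2000);   /* now a plain virtual-tag hit          */
        return 0;
    }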

Note that only physical addresses are put on the bus. Therefore, they can be snooped, and blocks can be invalidated (or updated) just like in coherence protocols for physically addressed caches.

Figure 6.24 in CS&G illustrates the organization of a virtually addressed cache:

[Figure 6.24: Organization of a virtually addressed cache. The virtual address (ASID, page #, page offset) divides into V-tag, V-index, and block offset and indexes the V-tag memory; the physical address (page-frame #, page offset) divides into P-tag, P-index, and block offset and indexes the P-tag memory. Each tag entry carries state bits, V-to-P and P-to-V pointers link the two tag memories, and both reference the cache data memory.]

Consider first a virtual address.

• “ASID” means “address-space identifier,” which is often called a process identifier (PID) in other books.
• The diagram shows the “V-index” overlapping the (virtual) page number. What is the significance of this?

• Is the V-index actually stored in the cache?

• What is the “state” field in the tag memories? It contains clean/dirty and valid/invalid bits, as well as replacement status, if it’s not a direct-mapped cache.

Now consider the physical address.

• What is the role of the P-index?

• How long is the P-index, relative to the V-index?

Note that each P-tag entry points to a V-tag entry and vice versa; this makes it possible to find a physical address given a virtual address, or vice versa.

In multilevel caches, the L1 cache can be virtually tagged for rapid access, while the L2 cache is physically tagged to avoid synonyms across processors.

Example: Suppose P1’s page 7 (addresses 7000₁₆ to 7FFF₁₆) is stored in page frame 12₁₆. Suppose also that P2 is using the same page, but calls it page 2.

[Diagram: P1’s page 7 and P2’s page 2 both mapping to the same page frame.]

P1 is the first to access block 0 on this page. Assuming cache lines are 16 bytes long, addresses 7000₁₆ to 700F₁₆ are now cached. Let’s assume the cache is direct mapped, and has 1024 (= 400₁₆) lines. Which line will this block be placed in?
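For reference, the index computation for a direct-mapped cache (working out the actual line number is left as the question intends):

    line = (byte address ÷ line size) mod number of lines
         = (address ÷ 10₁₆) mod 400₁₆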

Now suppose that P2 accesses this block. What will happen?

What number given above haven’t we used yet?

Where does this number get used or stored?

TLB coherence [§6.6.3]

A TLB is effectively just a cache of recently used page-table entries. Each processor has its own TLB.

How can two TLBs come to contain entries for the same page?

Name some times when TLB entries need to be changed.

•
•
•
•
•

Which of the above lead to coherence problems?

Solution 1: Virtually addressed caches

When a cache is virtually addressed, when is address translation needed?

Given that accesses to page mappings are rare in virtually addressed caches, what table might we be able to dispense with?

If we don’t have a _____, we don’t have a TLB coherence problem.

Well, we still have the problem of keeping the swap-out and protection information coherent. Assuming no TLB, where will this information be found?

This problem will be handled by the cache-coherence hardware when the OS updates the page tables.

Solution 2: TLB shootdown With the TLB-shootdown approach, a processor that changes a TLB entry tells other processors that may be caching the same TLB entries to invalidate their copies.

Step 1. A processor p that wants to modify a page table disables interprocessor interrupts and locks the page table. It also clears its active flag, which indicates whether it is actively using any page table.

Step 2. Processor p sends an interrupt to other processors that might be using the page table, describing the TLB actions to be performed.

Step 3. Each processor that receives the interrupt clears its active flag.

Step 4. Processor p busy-waits till the active flags of all interrupted processors are clear, then modifies the page table. Processor p then releases the page-table lock, sets its active flag, and resumes execution.

Step 5. Each interrupted processor busy-waits until none of the page tables it is using are locked. After executing the required TLB actions and setting its active flag, it resumes execution.
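The handshake can be sketched in C11-style code; every primitive here (send_ipi, my_cpu, and friends) is a hypothetical stand-in, not any real OS’s API.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NPROCS 8

    /* Hypothetical primitives, left undefined in this sketch. */
    extern void disable_interrupts(void), enable_interrupts(void);
    extern void send_ipi(int cpu, void *tlb_actions);
    extern void apply_tlb_actions(void *tlb_actions);
    extern int  my_cpu(void);

    atomic_bool active[NPROCS];  /* "actively using a page table" flags  */
    atomic_bool pt_locked;       /* lock on the page table being changed */

    /* Initiator: processor p modifies a page table that the processors
       marked in users[] may be caching entries from. */
    void shootdown(bool users[NPROCS], void *tlb_actions) {
        int me = my_cpu();
        disable_interrupts();                            /* step 1 */
        while (atomic_exchange(&pt_locked, true)) ;      /* lock page table */
        atomic_store(&active[me], false);

        for (int p = 0; p < NPROCS; p++)                 /* step 2 */
            if (p != me && users[p]) send_ipi(p, tlb_actions);

        for (int p = 0; p < NPROCS; p++)                 /* step 4 */
            if (p != me && users[p])
                while (atomic_load(&active[p])) ;        /* busy-wait */

        /* ... safe to modify the page table here ... */
        atomic_store(&pt_locked, false);
        atomic_store(&active[me], true);
        enable_interrupts();                             /* resume */
    }

    /* Receiver: runs in the interrupt handler on each interrupted CPU. */
    void shootdown_interrupt(void *tlb_actions) {
        atomic_store(&active[my_cpu()], false);          /* step 3 */
        while (atomic_load(&pt_locked)) ;                /* step 5 */
        apply_tlb_actions(tlb_actions);                  /* e.g., invalidate */
        atomic_store(&active[my_cpu()], true);           /* resume */
    }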

This solution has an overhead that scales linearly with the number of processors. Why?

Solution 3: Invalidate instructions

Some processors, notably the PowerPC, have a “TLB_invalidate_entry” instruction.

This instruction broadcasts the page address on the bus so that the snooping hardware on other processors can automatically invalidate the corresponding TLB entries without interrupting the processor.

A processor simply issues a TLB_invalidate_entry instruction immediately after changing a page-table entry.
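A sketch of that usage in C: classic PowerPC spells the instruction “tlbie” and takes the effective address in a register. Real kernels also need ordering instructions (e.g., sync/tlbsync), omitted here, so treat this as illustrative.

    static inline void tlb_invalidate_entry(void *ea) {
        __asm__ volatile("tlbie %0" : : "r"(ea) : "memory");
    }

    void change_pte(unsigned long *pte, unsigned long new_pte, void *ea) {
        *pte = new_pte;            /* update the page-table entry...     */
        tlb_invalidate_entry(ea);  /* ...then broadcast the invalidation */
    }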

This works well on a shared-bus system; if two processors change the same entry at the same time, the bus serializes their broadcasts, so only one change takes effect first.

