Cache organization in multicore architectures

Caching in the Piranha [OHL §2.2.1]

Each core has its own L1 caches, but the L2 cache is typically shared.

The L2 cache typically needs to be interleaved—what does that mean?
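
Roughly, interleaving means the L2 is divided into several independently accessible banks, with consecutive cache-line addresses spread across the banks so that different cores can access the L2 at the same time. Below is a minimal sketch of the index computation; the bank count and line size are made-up values, not the Piranha's actual parameters.

#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters only -- not the Piranha's actual configuration. */
#define LINE_SIZE 64u     /* bytes per cache line           */
#define NUM_BANKS  8u     /* number of independent L2 banks */

/* Consecutive cache lines rotate through the banks, so accesses to
   neighboring addresses can proceed in parallel in different banks. */
static unsigned l2_bank(uint64_t addr)
{
    uint64_t line = addr / LINE_SIZE;      /* drop the offset bits     */
    return (unsigned)(line % NUM_BANKS);   /* low line bits pick bank  */
}

int main(void)
{
    for (uint64_t a = 0; a < 4 * LINE_SIZE; a += LINE_SIZE)
        printf("address 0x%llx -> bank %u\n", (unsigned long long)a, l2_bank(a));
    return 0;
}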

In the DEC Piranha, there is no inclusion between the L1 and L2 caches.

Answer these questions about the Piranha. In the Piranha, the aggregate L1 capacity is 1 MB, while the L2 capacity is also 1 MB. Why do you think it doesn’t use the inclusion property?

The L2 is no-write-allocate. What does that mean?
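
Here is a toy direct-mapped model of the policy, a sketch only (the structure and sizes are invented, not the Piranha's): a write that misses is forwarded to the next level without allocating a line, while a read miss does fetch and allocate.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy direct-mapped cache model -- illustration only, not the Piranha L2. */
#define NUM_SETS  16u
#define LINE_SIZE 64u

static uint64_t tags[NUM_SETS];
static bool     valid[NUM_SETS];

static unsigned set_of(uint64_t addr) { return (unsigned)((addr / LINE_SIZE) % NUM_SETS); }
static uint64_t tag_of(uint64_t addr) { return (addr / LINE_SIZE) / NUM_SETS; }

static bool cache_hit(uint64_t addr)
{
    unsigned s = set_of(addr);
    return valid[s] && tags[s] == tag_of(addr);
}

/* No-write-allocate: a write that misses does NOT install the line here;
   it is simply forwarded to the next level of the hierarchy.            */
static void handle_write(uint64_t addr)
{
    if (cache_hit(addr))
        printf("write hit  0x%llx: update line in place\n", (unsigned long long)addr);
    else
        printf("write miss 0x%llx: forward to next level, no allocation\n",
               (unsigned long long)addr);
}

/* Reads, by contrast, do allocate on a miss (fetch-allocate). */
static void handle_read(uint64_t addr)
{
    if (!cache_hit(addr)) {
        unsigned s = set_of(addr);
        valid[s] = true;
        tags[s]  = tag_of(addr);
        printf("read miss  0x%llx: fetch and allocate\n", (unsigned long long)addr);
    }
}

int main(void)
{
    handle_write(0x1000);   /* miss: no allocation         */
    handle_read(0x1000);    /* miss: line is allocated     */
    handle_write(0x1000);   /* now a hit: updated in place */
    return 0;
}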

The L2 behaves as a large victim cache. What does that mean?

So, even when a clean line is replaced from the L1 cache, it may cause a write-back. Explain.
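
One way to picture it (a sketch with invented helper names, not the real controller): because the L2 is not inclusive and is filled mainly by L1 victims, an evicted L1 line may be the only on-chip copy, so it is moved into the L2 even when it is clean; otherwise the data would have to be refetched from memory the next time it is needed.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers -- stand-ins for the real cache-controller logic. */
static bool l2_has_line(uint64_t addr) { (void)addr; return false; }
static void l2_install(uint64_t addr)
{
    printf("  move line 0x%llx into the L2\n", (unsigned long long)addr);
}

/* Victim-cache behavior: when the L1 replaces a line, the line is moved
   into the L2 if the L2 does not already hold it -- even when the line
   is clean, because otherwise the only on-chip copy would be lost.       */
static void l1_evict(uint64_t addr, bool dirty)
{
    printf("L1 evicts 0x%llx (%s)\n", (unsigned long long)addr,
           dirty ? "dirty" : "clean");
    if (dirty || !l2_has_line(addr))
        l2_install(addr);            /* "write-back" even for a clean line */
}

int main(void)
{
    l1_evict(0x2000, false);         /* clean victim still goes to the L2 */
    l1_evict(0x3000, true);          /* dirty victim is written back      */
    return 0;
}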

L1 caches can own a line.

The owner of a block is either

• the L2 cache,

• an L1 in the exclusive state, or

• one of the L1s, when it is in the shared state.

On every L2 access, the L1 and L2 tag states are checked in parallel.

A duplicate copy of the L1 tags and state is kept at the L2 controllers. Any idea why this is done?
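
One way to picture it (a sketch with invented sizes, not the Piranha's real arrays): the L2 controller keeps its own copy of every L1 tag/state array, so on an L2 access it can determine which L1s hold the line without sending probes that would steal tag bandwidth from the cores.

#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters -- not the actual Piranha configuration. */
#define NUM_CORES   8u
#define L1_SETS    64u
#define LINE_SIZE  64u

/* Duplicate copy of each L1's tags, kept next to the L2 controller.
   dup_valid[c][s] / dup_tag[c][s] shadow core c's L1 set s.          */
static uint64_t dup_tag[NUM_CORES][L1_SETS];
static int      dup_valid[NUM_CORES][L1_SETS];

static unsigned set_of(uint64_t addr) { return (unsigned)((addr / LINE_SIZE) % L1_SETS); }
static uint64_t tag_of(uint64_t addr) { return (addr / LINE_SIZE) / L1_SETS; }

/* On an L2 access, the controller checks all the duplicate L1 tags in
   parallel (here, a simple loop) and learns which L1s hold the line,
   without ever touching the L1s themselves.                            */
static unsigned l1_holders(uint64_t addr)
{
    unsigned mask = 0, s = set_of(addr);
    for (unsigned c = 0; c < NUM_CORES; c++)
        if (dup_valid[c][s] && dup_tag[c][s] == tag_of(addr))
            mask |= 1u << c;
    return mask;
}

int main(void)
{
    /* Pretend cores 1 and 3 have fetched line 0x4000 into their L1s. */
    dup_valid[1][set_of(0x4000)] = 1; dup_tag[1][set_of(0x4000)] = tag_of(0x4000);
    dup_valid[3][set_of(0x4000)] = 1; dup_tag[3][set_of(0x4000)] = tag_of(0x4000);
    printf("L1 holders of 0x4000: mask = 0x%x\n", l1_holders(0x4000));
    return 0;
}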

Niagara cache-coherence protocol [OHL §2.2.2.2]

L1 caches are write-through:

• fetch-allocate,

• no write-allocate.

L1 lines are either valid or invalid.

The L2 cache has a directory that shadows the L1 tags. Why does it need this?

If a load misses in the L1 cache, then sent to the L2 cache are …

• the load request that missed in the L1, and

• the line from the L1 that is going to be replaced by this load.

Answer these questions. When these items get to the L2, what happens?

When an L1 cache is written to, the L2 directory is accessed, and the line is invalidated in the L1 caches that hold it.

The L2 cache is updated first. Only then is the L1 cache written to. How does this guarantee coherence?
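
As a sketch of the ordering (the helper names and the eight-core loop are invented, not Niagara's actual datapath): the store consults the L2 directory, invalidates the stale copies in other L1s, updates the L2, and only then updates the writer's own L1, so an L1 can never hold data newer than the L2's.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers -- stand-ins for the real Niagara datapaths. */
static unsigned directory_sharers(uint64_t addr)
{
    (void)addr;
    return 0x6;                         /* pretend cores 1 and 2 hold the line */
}
static void invalidate_l1(unsigned core, uint64_t addr)
{
    printf("  invalidate core %u's copy of 0x%llx\n",
           core, (unsigned long long)addr);
}
static void l2_write(uint64_t addr, uint64_t data)
{
    printf("  L2 updated: 0x%llx <- %llu\n",
           (unsigned long long)addr, (unsigned long long)data);
}
static void l1_write(unsigned core, uint64_t addr, uint64_t data)
{
    (void)addr; (void)data;
    printf("  core %u's own L1 updated last\n", core);
}

/* Order matters: stale copies are invalidated and the L2 is written
   BEFORE the writer's own L1, so at every instant a valid L1 copy
   matches the L2 -- an L1 never holds data the L2 hasn't seen yet.   */
static void store(unsigned core, uint64_t addr, uint64_t data)
{
    printf("core %u stores to 0x%llx\n", core, (unsigned long long)addr);
    unsigned sharers = directory_sharers(addr);   /* 1. L2 directory lookup  */
    for (unsigned c = 0; c < 8; c++)
        if ((sharers & (1u << c)) && c != core)
            invalidate_l1(c, addr);               /* 2. kill stale L1 copies */
    l2_write(addr, data);                         /* 3. update the L2        */
    l1_write(core, addr, data);                   /* 4. only now the L1      */
}

int main(void)
{
    store(0, 0x5000, 42);
    return 0;
}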

What could happen if the L1 cache were updated first?

A two-state protocol? Does this give up something compared to the MSI protocol?

Multithreading on the Niagara [OHL §2.2.2.1]

The Niagara does not employ branch prediction. How can it survive without branch prediction?

Let’s look at how fine-grained threading works.

First, consider Figure 2.9, on p. 37.

Each line of the diagram represents a different instruction. Exception: The boxes surrounded by dotted lines show one instruction that is being issued and the following instruction that is being fetched.

While one instruction from a thread is being issued, the next instruction is being fetched. (There is a mistake in the leftmost box; the lower instruction should be Fto-add.)

After the pipeline “works on” one thread by issuing/fetching a pair of instructions, it works on the next thread assuming that another thread is available.

Why would a thread not be available?

If all threads are available, they are executed round-robin.
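
A toy version of that selection (the thread count is Niagara's four per core, but the availability flags here are made up): rotate through the threads and take the next one that is ready, skipping any that are stalled, e.g., on a cache miss.

#include <stdbool.h>
#include <stdio.h>

#define NUM_THREADS 4            /* Niagara has four threads per core */

/* A thread is unavailable while it waits on a long-latency event,
   e.g. a load that missed in the cache.                             */
static bool available[NUM_THREADS] = { true, false, true, true };

/* Round-robin selection: start after the last thread chosen and take
   the next available one; if none is available, the pipeline idles.  */
static int select_next(int last)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (last + i) % NUM_THREADS;
        if (available[t])
            return t;
    }
    return -1;                   /* no thread ready: issue a bubble */
}

int main(void)
{
    int t = 0;
    for (int cycle = 0; cycle < 8; cycle++) {
        t = select_next(t);
        printf("cycle %d: issue from thread %d\n", cycle, t);
    }
    return 0;
}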

The scheduler “assumes” that all loads are cache hits. Two cases:

• Suppose the load misses; what happens then?

• Suppose the load is a cache hit; what happens?

Why do we say that the instruction after the load “is issued speculatively”?

Now, consider Figure 2.10, on p. 38. Here, only two threads are available. Explain why two consecutive instructions from thread t1 are executed.

The limits of throughput improvement [OHL §2.3.3]

Fig. 2.17 (p. 54) shows that throughput is not always improved by increasing the number of threads. Why not?

Improving Latency Automatically [OHL Ch. 3]

What advantages do CMPs have vs. SMPs if we are interested in automatically parallelizing a program?

One way of parallelizing programs automatically is to use “helper” threads.

A helper thread is a lobotomized thread that only performs certain kinds of actions, e.g.,

• making branch predictions early, and

• prefetching data into on-chip caches (see the sketch after this list).
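
As a very rough illustration of the second idea (not how any particular CMP implements helper threads), a helper thread can run ahead of the main thread and touch the data it will soon need, so that the main thread's loads hit in the cache. The sketch uses POSIX threads and GCC's __builtin_prefetch; the array, stride, and degree of overlap are invented.

#include <pthread.h>
#include <stdio.h>

#define N (1 << 20)
static double data[N];

/* Helper thread: runs ahead of the main computation and touches the
   data it will need, pulling it into the on-chip caches.             */
static void *prefetch_helper(void *arg)
{
    (void)arg;
    for (long i = 0; i < N; i += 8)          /* one cache line of doubles   */
        __builtin_prefetch(&data[i], 0, 1);  /* read prefetch, low locality */
    return NULL;
}

int main(void)
{
    for (long i = 0; i < N; i++)
        data[i] = (double)i;

    pthread_t helper;
    pthread_create(&helper, NULL, prefetch_helper, NULL);

    /* Main thread does the "real" work; ideally its loads now hit. */
    double sum = 0.0;
    for (long i = 0; i < N; i++)
        sum += data[i];

    pthread_join(helper, NULL);
    printf("sum = %f\n", sum);
    return 0;
}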

Why does this help?

Why is the benefit limited?

Thread-level speculation [OHL §3.2]

Another automatic technique is to divide the program up into several threads.

The only practical way to do this is to divide on the basis of …

• loop iterations, or

• procedure calls.

The idea is that (in the case of loop iterations) subsequent iterations will often be “almost independent.” Here is an example (found at www.crhc.uiuc.edu/ece412/lectures/lecture26.pdf).

for (i = 0; i < I_MAX; i++) {
    for (j = 0; j < J_MAX; j++) {
        a[i][j] = b[i][j] + c[i][j];
        b[j][i] = compute_b(input);
    }
}

As long as i and j are not too close, the iterations will be independent.

So, we can assign successive iterations to different threads.
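
Purely as a picture of the assignment (the thread count and trip count below are made up, and this ignores the dependences that the speculative hardware described next must handle): successive iterations of the outer loop are handed out round-robin, iteration i to thread i % NUM_THREADS.

#include <stdio.h>

#define I_MAX        16   /* made-up trip count                         */
#define NUM_THREADS   4   /* hypothetical number of speculative threads */

/* Conceptual assignment only: iteration i would be handed to speculative
   thread i % NUM_THREADS.  The hardware discussed next is what makes the
   speculation safe when the iterations are not quite independent.        */
int main(void)
{
    for (int i = 0; i < I_MAX; i++)
        printf("iteration %2d -> speculative thread %d\n", i, i % NUM_THREADS);
    return 0;
}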

Why is hardware support needed?

This hardware must handle five special situations (see Fig. 3.1, p. 66):

1. Forward data. Data must be forwarded from one thread to another quickly.

2. Detect too-early reads. If a data value is read by a later thread and afterwards written by an earlier thread, a violation has occurred. Hardware must notice this and, e.g., restart the later thread (a toy sketch of this check appears after the list).

3. Discard speculative changes after a violation. When a change is made to a variable by thread T, and then thread T needs to be restarted, this change must be undone.

4. Retire speculative writes in correct order. After threads finish, their state must be merged into the process’s state in correct order. Writes from later threads must be merged in later.

5. Keep earlier threads from seeing later threads’ changes. A thread must see only changes made by earlier threads. This is complicated by the fact that a processor that was running an earlier thread will later be running a later thread.
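
To make item 2 concrete, here is a toy software model of the bookkeeping (invented structures; in real TLS hardware this lives in the caches, e.g., as per-line "speculatively read" bits): each speculative thread records the addresses it loads, and when an earlier thread stores to an address, any later thread that has already read it is squashed so it can be restarted.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_THREADS 4      /* lower index = earlier (older) iteration */
#define MAX_READS  64

/* Toy model of per-thread speculative read sets. */
static uint64_t read_set[NUM_THREADS][MAX_READS];
static int      read_count[NUM_THREADS];
static bool     squashed[NUM_THREADS];

/* A speculative thread records every address it loads. */
static void spec_load(int thread, uint64_t addr)
{
    read_set[thread][read_count[thread]++] = addr;
}

/* When an earlier thread stores, any LATER thread that already read the
   same address used a stale value: it must be squashed and restarted.    */
static void spec_store(int thread, uint64_t addr)
{
    for (int later = thread + 1; later < NUM_THREADS; later++)
        for (int r = 0; r < read_count[later]; r++)
            if (read_set[later][r] == addr)
                squashed[later] = true;
}

int main(void)
{
    spec_load(2, 0x1000);      /* thread 2 (later) reads the address ... */
    spec_store(1, 0x1000);     /* ... then thread 1 (earlier) writes it  */
    for (int t = 0; t < NUM_THREADS; t++)
        if (squashed[t])
            printf("thread %d read too early -> restart it\n", t);
    return 0;
}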

One possibility is to use four different threads to handle four consecutive iterations of a loop.

Size of threads is an important issue. Why?

• Limited buffer size.

• True dependences.

• Restart overhead.

• Parallelization overhead.

Typically, a few thousand instructions is the right length for threads.
