
Cache organization in multicore architectures

Caching in the Piranha [OHL §2.2.1]

Each core has its own L1 caches, but the L2 cache is typically shared.

The L2 cache typically needs to be interleaved. What does that mean?

In the DEC Piranha, there is no inclusion between the L1 and L2 caches.

Answer these questions about the Piranha.

In the Piranha, the aggregate L1 capacity is 1 MB, while the L2 capacity is also 1 MB. Why do you think it doesn't use the inclusion property?

The L2 is no-write-allocate. What does that mean?

The L2 behaves as a large victim cache. What does that mean?

So, even when a clean line is replaced from the L1 cache, it may cause a write-back. Explain.

L1 caches can own a line. The owner of a block is either

• the L2 cache,
• an L1 in the exclusive state, or
• one of the L1s, when it is in the shared state.

On every L2 access, the L1 and L2 tag states are checked in parallel. A duplicate copy of the L1 tags and state is kept at the L2 controllers. Any idea on why this is done?

Niagara cache-coherence protocol [OHL §2.2.2.2]

L1 caches are write-through:

• fetch-allocate,
• no-write-allocate.

L1 lines are either valid or invalid.

The L2 cache has a directory that shadows the L1 tags. Why does it need this?

If a load misses in the L1 cache, then sent to the L2 cache are

• the load request that missed in the L1, and
• the line from the L1 that is going to be replaced by this load.

Answer these questions. When these items get to the L2, what happens?

When an L1 cache is written to, the L2 directory is consulted, and the line is invalidated in the other L1 caches that hold it.

The L2 cache is updated first. Only then is the L1 cache written to. How does this guarantee coherence?

What could happen if the L1 cache were updated first?

This is only a two-state protocol. Does it give up something compared to the MSI protocol?

Multithreading on the Niagara [OHL §2.2.2.1]

The Niagara does not employ branch prediction. How can it survive without branch prediction?

Let's look at how fine-grained threading works.

First, consider Figure 2.9, on p. 37.

Each line of the diagram represents a different instruction. Exception: the boxes surrounded by dotted lines show one instruction that is being issued and the following instruction that is being fetched.

While one instruction from a thread is being issued, the next instruction is being fetched. (There is a mistake in the leftmost box; the lower instruction should be Fto-add.)

After the pipeline "works on" one thread by issuing/fetching a pair of instructions, it works on the next thread, assuming that another thread is available.

Why would a thread not be available?

If all threads are available, they are executed round-robin.

The scheduler "assumes" that all loads are cache hits. There are two cases.

• Suppose the load misses; what happens then?
• Suppose the load is a cache hit; what happens?

Why do we say that the instruction after the load "is issued speculatively"?

Now, consider Figure 2.10, on p. 38. Here, only two threads are available. Explain why two consecutive instructions from thread t1 are executed.
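To make the round-robin policy concrete, below is a small C sketch of a scheduler that issues from available threads in order and parks a thread when one of its loads misses. It is only an illustration under simplifying assumptions, not the Niagara hardware: the thread_t structure, the cycle count, and the simulated miss/miss-service times are all made up, and the sketch models only thread selection, not the speculative issue and squash of the instruction after a load.

    /* Rough software sketch of fine-grained, round-robin thread selection
     * in the spirit of the pipeline described above.  All fields and
     * timings are hypothetical; this is not the Niagara implementation. */

    #include <stdio.h>
    #include <stdbool.h>

    #define NTHREADS 4

    typedef struct {
        const char *name;
        bool available;   /* false while the thread waits on a load miss */
        int  pc;          /* index of the next instruction to issue      */
    } thread_t;

    int main(void) {
        thread_t th[NTHREADS] = {
            {"t0", true, 0}, {"t1", true, 0}, {"t2", true, 0}, {"t3", true, 0}
        };
        int next = 0;                            /* round-robin pointer */

        for (int cycle = 0; cycle < 10; cycle++) {
            if (cycle == 8)
                th[1].available = true;          /* pretend t1's miss is serviced */

            /* Choose the next available thread, scanning round-robin. */
            int chosen = -1;
            for (int k = 0; k < NTHREADS; k++) {
                int cand = (next + k) % NTHREADS;
                if (th[cand].available) { chosen = cand; break; }
            }
            if (chosen < 0) {                    /* every thread is stalled */
                printf("cycle %2d: nothing to issue\n", cycle);
                continue;
            }

            printf("cycle %2d: issue %s.i%d\n", cycle, th[chosen].name, th[chosen].pc);

            /* The scheduler assumes loads hit.  Pretend t1's instruction 1
             * is a load that actually misses, so t1 is parked. */
            if (chosen == 1 && th[chosen].pc == 1)
                th[chosen].available = false;

            th[chosen].pc++;
            next = (chosen + 1) % NTHREADS;
        }
        return 0;
    }

Running the sketch shows the pipeline cycling t0, t1, t2, t3 while all four threads are available, then skipping t1 after its simulated miss until the miss is serviced.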
The limits of throughput improvement [OHL §2.3.3]

Fig. 2.17 (p. 54) shows that throughput is not always improved by increasing the number of threads. Why not?

Improving Latency Automatically [OHL Ch. 3]

What advantages do CMPs have vs. SMPs if we are interested in automatically parallelizing a program?

One way of parallelizing programs automatically is to use "helper" threads. A helper thread is a lobotomized thread that only performs certain kinds of actions, e.g.,

• making branch predictions early, and
• prefetching data into on-chip caches.

Why does this help?

Why is the benefit limited?

Thread-level speculation [OHL §3.2]

Another automatic technique is to divide the program up into several threads. The only practical way to do this is to divide on the basis of

• loop iterations, or
• procedure calls.

The idea is that (in the case of loop iterations) subsequent iterations will often be "almost independent." Here is an example (found at www.crhc.uiuc.edu/ece412/lectures/lecture26.pdf).

    for (i = 0; i < I_MAX; i++) {
        for (j = 0; j < J_MAX; j++) {
            a[i][j] = b[i][j] + c[i][j];
            b[j][i] = compute_b(input);
        }
    }

As long as i and j are not too close, the iterations will be independent. So, we can assign successive iterations to different threads.

Why is hardware support needed?

This hardware must handle five special situations (see Fig. 3.1, p. 66):

1. Forward data. Data must be forwarded from one thread to another quickly.

2. Detect too-early reads. If a data value is read by a later thread and afterwards written by an earlier thread, a violation has occurred. Hardware must notice this and, e.g., restart the later thread.

3. Discard speculative changes after a violation. When a change is made to a variable by thread T, and thread T then needs to be restarted, that change must be undone.

4. Retire speculative writes in correct order. After threads finish, their state must be merged into the process's state in correct order. Writes from later threads must be merged in later.

5. Keep earlier threads from seeing later threads' changes. A thread must see only changes made by earlier threads. This is complicated by the fact that a processor that was running an earlier thread will later be running a later thread.

One possibility is to use about four different threads to handle four consecutive iterations of a loop.

The size of threads is an important issue. Why?

• Limited buffer size.
• True dependences.
• Restart overhead.
• Parallelization overhead.

Typically, a few thousand instructions is the right length for threads.
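To make situation 2 (detecting too-early reads) concrete, here is a minimal C sketch of the bookkeeping involved. It is only an illustration under simplifying assumptions: the epoch_t structure and the record_read/commit_write functions are hypothetical names, and real TLS hardware keeps the equivalent read/modified bits per cache line rather than in per-thread arrays.

    /* Hypothetical sketch of too-early-read detection for thread-level
     * speculation.  Each speculative "epoch" remembers which addresses it
     * has read; when an earlier epoch commits a write, any later epoch
     * that already read that address has used a stale value and must be
     * squashed and restarted. */

    #include <stdio.h>
    #include <stdbool.h>

    #define MAX_ADDRS 1024        /* tracked addresses; illustration only */

    typedef struct {
        int  id;                  /* position in sequential order (0 = oldest) */
        bool read_set[MAX_ADDRS]; /* addresses this epoch has loaded           */
        bool violated;            /* set when the epoch must be restarted      */
    } epoch_t;

    /* A speculative thread records every address it loads. */
    static void record_read(epoch_t *t, int addr) {
        t->read_set[addr] = true;
    }

    /* An earlier thread's store checks all later epochs for a too-early read. */
    static void commit_write(const epoch_t *writer, epoch_t *epochs, int n, int addr) {
        for (int i = 0; i < n; i++) {
            if (epochs[i].id > writer->id && epochs[i].read_set[addr])
                epochs[i].violated = true;        /* violation: squash and restart */
        }
    }

    int main(void) {
        epoch_t epochs[2] = { {.id = 0}, {.id = 1} };

        record_read(&epochs[1], 42);              /* later epoch reads address 42 */
        commit_write(&epochs[0], epochs, 2, 42);  /* earlier epoch then writes it */

        printf("later epoch must restart: %s\n",
               epochs[1].violated ? "yes" : "no");
        return 0;
    }

A fuller sketch would also buffer each epoch's speculative writes so that they could be discarded after a violation or retired in sequential order (situations 3 and 4 above).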