ECE 587 Final Exam Solution

Name: __________     Time allowed: 90 minutes     Total Points: 80     Points Scored: _____

Problem No. 1 (20 points)

Consider a computer system with a two-level hierarchy, consisting of split L1 instruction and data caches and a unified L2 cache. The cache parameters and statistics for a program are as follows:

(i) 40% of the instructions in the program are load/store instructions.
(ii) The L1 data cache hit ratio is 95%.
(iii) The L1 instruction cache hit ratio is 100%.
(iv) The L2 cache hit ratio is 70%.
(v) The DRAM access latency is 200 cycles.
(vi) The program's CPI, assuming a perfect L2 cache with no misses, is 0.8.
(vii) Each cache in the system is a blocking cache that processes only one miss at a time and blocks all other accesses until the miss returns from the next cache level or from memory.

Answer the following questions:

(a) (8 points) What is the L2 cache miss rate in terms of misses per 1000 instructions?

L2 cache accesses per instruction = L1 instruction cache miss rate + L1 data cache miss rate.

Based on the given cache statistics:

L1 instruction cache miss rate = 0, because the L1 instruction cache hit ratio is 100%.
L1 data cache miss rate = L1 data accesses/instruction * L1 data cache miss ratio = 40% * (1 - 95%) = 0.02, or 20 misses per 1000 instructions.

Hence:

L2 cache accesses/instruction = 0.02
L2 cache miss rate = L2 cache accesses/instruction * L2 cache miss ratio = 0.02 * (1 - 70%) = 0.006 = 6 misses per 1000 instructions
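As a quick sanity check, the same arithmetic can be reproduced in a few lines of Python (a sketch; the variable names are ours, not part of the exam):

```python
# Part (a): L2 misses per 1000 instructions
ld_st_frac    = 0.40   # fraction of instructions that are loads/stores
l1d_hit_ratio = 0.95   # L1 data cache hit ratio
l1i_hit_ratio = 1.00   # L1 instruction cache hit ratio (never misses)
l2_hit_ratio  = 0.70   # L2 cache hit ratio

# The L2 is accessed only on L1 misses (instruction or data).
l2_accesses_per_instr = (1 - l1i_hit_ratio) + ld_st_frac * (1 - l1d_hit_ratio)
l2_misses_per_instr   = l2_accesses_per_instr * (1 - l2_hit_ratio)

print(l2_misses_per_instr * 1000)  # -> 6.0 misses per 1000 instructions
```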

(b) (4 points) What is the actual CPI of the program?

Actual CPI = CPI with perfect L2 + (L2 cache miss rate * L2 cache miss penalty)

L2 cache miss rate = 0.006 (calculated in part (a))
L2 cache miss penalty = DRAM access latency = 200 cycles

Therefore: Actual CPI = 0.8 + (0.006 * 200) = 2.0
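Continuing the sketch (standalone; the miss rate is carried over from part (a)):

```python
# Part (b): actual CPI
cpi_perfect_l2      = 0.8
l2_misses_per_instr = 0.006  # from part (a)
dram_latency        = 200    # cycles; the L2 miss penalty without an L3

actual_cpi = cpi_perfect_l2 + l2_misses_per_instr * dram_latency
print(actual_cpi)  # -> 2.0
```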

(c) (8 points) The system designers want to reduce the CPI of the program by 0.3. For that purpose, they are considering the addition of an L3 cache to the cache hierarchy. The hit latency of the L3 cache is 40 cycles. Simulation experiments suggest that the L3 cache will have a miss ratio of 50%. Will the addition of L3 cache be able to achieve the desired reduction in CPI? Justify your answer with calculations.

Desired CPI = Original CPI – desired reduction = 2.0 – 0.3 = 1.7

The addition of the L3 cache changes the average L2 cache miss penalty. Previously, each L2 miss had a constant penalty of 200 cycles. After the addition of the L3 cache, accesses that hit in the L3 cache incur a penalty of 40 cycles (the L3 hit latency), whereas accesses that miss in the L3 cache incur a penalty of 240 cycles (L3 hit latency + DRAM latency).

Therefore: New average L2 miss penalty = (L3 hit ratio * 40) + (L3 miss ratio * 240) = (50% * 40) + (50% * 240) = 140 cycles

Hence: New CPI = 0.8 + (0.006 * 140) = 1.64

Since the new CPI (1.64) is lower than the desired CPI (1.7), we conclude that the addition of the L3 cache is able to achieve the desired reduction in CPI.
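The same check in Python (a sketch under the stated parameters):

```python
# Part (c): does a 40-cycle, 50%-hit-ratio L3 meet the 1.7 CPI target?
l3_hit_latency = 40
l3_hit_ratio   = 0.50
dram_latency   = 200

# An L2 miss now either hits in the L3 (40 cycles) or continues to DRAM (40 + 200).
avg_l2_miss_penalty = (l3_hit_ratio * l3_hit_latency
                       + (1 - l3_hit_ratio) * (l3_hit_latency + dram_latency))
new_cpi = 0.8 + 0.006 * avg_l2_miss_penalty

print(avg_l2_miss_penalty, new_cpi)  # -> 140.0 1.64, below the 1.7 target
```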

Problem No. 2 (15 points)

The designers of a new processor have adopted the Pentium M methodology to decide whether a particular optimization should be included in the processor or not. They are considering four different candidate optimizations, the first two targeted towards performance and the next two targeted towards battery life. To test these optimizations, they simulate each one on an important industry benchmark. The following table shows the results of the simulation experiment:

Configuration     Optimization Target   Execution Time (seconds)   Average Power (watts)   Decision (chosen/not chosen)
Baseline          --                    10                          10                      --
Optimization-1    Performance           9.8                         10.5                    Chosen
Optimization-2    Performance           9                           13                      Chosen
Optimization-3    Battery Life          9.5                         11                      Not chosen
Optimization-4    Battery Life          11                          8                       Chosen

For each optimization, the table shows the optimization target (performance or battery life), the execution time for the benchmark when using that particular optimization and the average power consumption of the processor when using the optimization. The table also provides the execution time and average power for the baseline processor. For each optimization, indicate whether the optimization should be included in the processor or not by writing down “chosen” or “not chosen” in the last column of the table.

The correct answers are shown in the last column of the table. In each case, we first calculate the speedup by dividing the execution time of the baseline by the execution time with the optimization. Then we calculate the relative increase in power. Finally, we apply the power/performance criterion that matches the optimization target. For example, for Optimization-2 the speedup is 10/9, an 11% performance increase, and the power ratio is 13/10, a 30% power increase. This optimization meets the "Performance" criterion (at least 1% performance increase for every 3% power increase) and is therefore chosen.
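A minimal sketch of the performance criterion applied above; the helper function is our own illustration, not part of the exam:

```python
# Performance-targeted optimizations: require at least 1% performance gain
# for every 3% increase in power.
def meets_performance_criterion(t_base, p_base, t_opt, p_opt):
    perf_gain    = t_base / t_opt - 1.0   # fractional performance increase
    power_growth = p_opt / p_base - 1.0   # fractional power increase
    return perf_gain >= power_growth / 3.0

print(meets_performance_criterion(10, 10, 9.8, 10.5))  # Optimization-1 -> True
print(meets_performance_criterion(10, 10, 9.0, 13.0))  # Optimization-2 -> True
```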

Problem No. 3 (30 points)

(a) (4 points) In a simultaneous-multithreaded (SMT) processor, do we need to have separate branch predictors for each thread? Justify your yes/no answer.

It is not necessary to have a separate branch predictor for each thread. Multiple threads can share common branch predictor hardware by indexing the shared predictor tables with (a hash of) their individual branch PCs. However, for performance reasons, it may be preferable to have separate predictors per thread: with a shared predictor, the aliasing problem can become more severe, because branches from different threads map to the same predictor table entries, causing frequent mispredictions.
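To illustrate the sharing, here is a toy index function for a predictor table shared by multiple threads; the table size and hash are hypothetical:

```python
# Toy shared predictor: one table of 2-bit counters used by all threads.
TABLE_SIZE = 4096
counters   = [2] * TABLE_SIZE  # 2-bit counters, initialized to weakly taken

def predictor_index(pc, thread_id=0):
    # Mixing in the thread ID reduces cross-thread aliasing; with thread_id
    # omitted, branches from different threads can map to the same entry.
    return (pc ^ (thread_id * 0x9E37)) % TABLE_SIZE

def predict_taken(pc, thread_id=0):
    return counters[predictor_index(pc, thread_id)] >= 2
```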

(b) (4 points) Prefetching into the L2 cache can be done by using runahead execution or by using stream buffers. Which of these two techniques will have more accurate prefetching? Why?

Runahead execution will result in more accurate prefetching than stream buffers. In runahead execution, the processor speculates past a missing load instruction and starts fetching and pre-executing instructions that are independent of the missing load. This allows data and instructions to be brought into the processor caches based on the actual path the program is most likely to take. In contrast, a stream buffer simply predicts streaming access patterns, without any real knowledge of the addresses the program will access in the future.

(c) (4 points) In runahead execution, what would be the negative consequences of not implementing a runahead cache?

During runahead mode, the processor executes and retires instructions speculatively, since branches dependent on the missing load cannot be resolved, and the missing load and any of its dependents may have unresolved exceptions. Therefore, store instructions executed during runahead mode must not commit their results to the memory hierarchy. In the absence of a runahead cache, these store instructions would not be allowed to pseudo-retire and would stay stuck in the store buffer, eventually filling it up and stalling execution. With a runahead cache, the store instructions can write their results to the runahead cache and then pseudo-retire; the runahead cache then takes care of store-to-load forwarding during runahead mode.
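A highly simplified sketch of the forwarding role the runahead cache plays (purely illustrative; a real design tracks INV bits and byte validity per entry):

```python
# During runahead mode, pseudo-retired stores write here instead of to the
# memory hierarchy, and later runahead loads check it first.
runahead_cache = {}

def runahead_store(addr, value):
    runahead_cache[addr] = value  # store pseudo-retires, freeing its store buffer entry

def runahead_load(addr, memory):
    if addr in runahead_cache:    # store-to-load forwarding from the runahead cache
        return runahead_cache[addr]
    return memory.get(addr, 0)    # otherwise read non-speculative data

# The entire runahead cache is discarded when the processor exits runahead mode.
```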

(d) (3 points) In the runahead mode, how does the processor ensure that the dependents of the missing load do not consume bogus source operands?

There are INV (invalid) bits associated with each register. The missing load sets the INV bit of its destination register. Each instruction executed during runahead mode checks the INV bits of all its source registers. If any of them is set, the instruction marks its own destination register as INV and pseudo-retires without consuming the bogus data in the source register; this propagates invalidity to all dependents of the missing load.
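A toy model of the INV-bit mechanism (illustrative only):

```python
# Each register carries an INV bit during runahead mode.
inv = [False] * 32  # INV bits for a 32-entry register file

def execute_in_runahead(dest, srcs, compute):
    # If any source register is INV, the result would be bogus: mark the
    # destination INV and pseudo-retire without consuming the source data.
    if any(inv[s] for s in srcs):
        inv[dest] = True
        return None
    inv[dest] = False
    return compute()

inv[5] = True                               # r5 is the missing load's destination
execute_in_runahead(7, [5, 6], lambda: 42)  # r7 becomes INV; no bogus value is used
```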

(e) (4 points) State one disadvantage of using a trace cache as compared to a traditional instruction cache.

The trace cache stores redundant copies of instructions that also exist in the instruction cache (and the same instructions may appear in multiple traces), which causes a larger area overhead.

(f) (3 points) What is the main function of a store buffer?

The store buffer holds the addresses and data of stores that have not yet committed to the memory hierarchy. Load addresses are checked against the buffered store addresses, allowing store-to-load forwarding.
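A minimal sketch of that address check (word-granularity and fully simplified):

```python
# In-flight stores in program order (oldest first): (address, data) pairs.
store_buffer = [(0x100, 7), (0x108, 9), (0x100, 11)]

def load(addr, memory):
    # Search youngest-first so the load receives the most recent store's data.
    for st_addr, st_data in reversed(store_buffer):
        if st_addr == addr:
            return st_data          # store-to-load forwarding
    return memory.get(addr, 0)      # no match: read the memory hierarchy

print(load(0x100, {}))  # -> 11, forwarded from the youngest store to 0x100
```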

(g) (4 points) Why do out-of-order processors commit the result of a store instruction to the memory hierarchy ONLY after the store instruction reaches the head of the re-order buffer?

To enable precise interrupts. If an earlier instruction (in program order) incurred an exception after the store had already written its result to the memory hierarchy, the architectural state would become imprecise, because a store to memory cannot be undone.

(h) (4 points) State one advantage of the PAg predictor as compared to a PAp predictor.

The PAg predictor is more area-efficient than the PAp predictor: all the per-address history registers share a single global pattern history table, whereas PAp provides a separate pattern table per address, at a much larger area cost.
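A rough storage comparison under illustrative parameters (k history registers, h-bit histories, 2-bit counters; the numbers are hypothetical):

```python
k, h = 1024, 12                    # hypothetical sizes, for illustration only
pag_counters = 2 ** h              # PAg: one pattern table shared by all k entries
pap_counters = k * 2 ** h          # PAp: a separate pattern table per entry

print(pag_counters, pap_counters)  # -> 4096 vs 4194304 two-bit counters
```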

Problem No. 4 (15 points)

(a) (5 points) Is the following statement true or false: “If the accuracy of prefetcher-A is higher than the accuracy of prefetcher-B, then the coverage of prefetcher-A must also be higher than the coverage of prefetcher-B”? Justify your answer.

This statement is FALSE. Accuracy depends on both the number of prefetch hits (useful prefetches) and the total number of prefetches initiated by the prefetcher. In contrast, coverage depends on the prefetch hits relative to the number of misses in the absence of prefetching, and is independent of the number of prefetches issued. Prefetcher-B may be "less accurate" than prefetcher-A because it initiated additional prefetches compared to prefetcher-A, many of which did not result in prefetch hits. However, as long as some of those additional prefetches do result in prefetch hits, the coverage of prefetcher-B will be higher than that of prefetcher-A.

Example: Number of misses before prefetching = 10.

Prefetcher-A initiates 2 prefetches, both of which result in hits: Accuracy = 2/2 = 100%, Coverage = 2/10 = 20%.
Prefetcher-B initiates 5 prefetches, 4 of which result in hits: Accuracy = 4/5 = 80%, Coverage = 4/10 = 40%.

Conclusion: Prefetcher-A has higher accuracy than prefetcher-B, but lower coverage.
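The same numbers, using the standard definitions of accuracy and coverage:

```python
def accuracy(hits, prefetches):
    return hits / prefetches       # useful prefetches / prefetches issued

def coverage(hits, baseline_misses):
    return hits / baseline_misses  # useful prefetches / misses without prefetching

baseline_misses = 10
print(accuracy(2, 2), coverage(2, baseline_misses))  # A -> 1.0 0.2
print(accuracy(4, 5), coverage(4, baseline_misses))  # B -> 0.8 0.4
```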

(b) (4 points) Does the DRAM row buffer utilization increase/decrease/remain unchanged as we switch from single core to multi-core processors? Justify your answer.

DRAM row buffer utilization usually decreases as we switch to multi-core processors. DRAM traffic from multiple cores has lower row locality, because memory accesses from a second core often cause a DRAM row opened by a first core to be closed before that row has been fully utilized by the first core.

(c) (6 points) Consider a system with a two-level cache hierarchy. The L2 cache is 128KB, 8-way set-associative with a block size of 128 bytes. To cut down the number of main memory (DRAM) accesses, the system designers are exploring the following two optimizations to the cache hierarchy:

Optimization-1: Add one way to each set in the L2 cache, resulting in a 144KB 9-way set-associative L2 cache.
Optimization-2: Add a 128-entry fully-associative victim cache.

Which one of the above two optimizations would be more effective in reducing the number of DRAM accesses? Why?

Optimization-2 would be more effective in reducing the number of DRAM accesses. Since each cache block is 128 bytes, both optimizations add the same amount of cache capacity (128 blocks * 128 bytes = 16KB). The victim cache has the advantage of being fully associative, which makes it more effective: any cache miss that would be prevented by adding one way to each set would also be prevented by the victim cache. On top of that, the victim cache is more effective at dealing with "hot" sets, where adding just a single way may not be enough to prevent thrashing.
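A quick check that the two optimizations add the same capacity:

```python
block_size  = 128          # bytes
l2_capacity = 128 * 1024   # 128KB
l2_ways     = 8

num_sets = l2_capacity // (l2_ways * block_size)  # -> 128 sets

extra_way_bytes    = num_sets * block_size        # Optimization-1: one more block per set
victim_cache_bytes = 128 * block_size             # Optimization-2: 128 fully-associative entries

print(num_sets, extra_way_bytes // 1024, victim_cache_bytes // 1024)  # -> 128 16 16 (KB)
```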