EECS 470 Final Report: PotatoLakeZ

James Read, Donato Mastropietro, Skyler Hau, Nathan Richards, Pratham Dhanjal [jamread, donatom, hausky, nricha, pratham]@umich.edu

Table of Contents

Introduction

Design
    Out of Order Execution
    2-way Superscalar
    Pipeline Design and Stages
        Fetch
        Dispatch
        Issue
        Execute 1
        Execute 2
        Complete
        Retire
    Modules
        16-entry Reorder Buffer
        8-entry Reservation Station
        8-Entry Load-Store Queue
        Memory Systems
            Memory Arbiter
            128B Instruction Cache
            128B Data Cache
            FIFO Memory Buffer
        Branch Predictor
        FIFO Free List
        Map Table
        Architected Structures

Testing
    Module Tests
    runSingle script
    runAll script
    `define DEBUG

Analysis
    CPI - A Breakdown
    Synthesis Clock Periods
    Cache Analysis
    LSQ Analysis
    Branch Predictor Analysis
    Component Size Analysis
    Critical Path
    Reconsiderations and Improvements

Introduction

This project implements a synthesizable out-of-order processor in the SystemVerilog hardware description language. The processor supports the RV32IM instruction set (excluding system calls, fences, and division) and uses the MIPS R10K register renaming scheme. Its most important special feature is that it is superscalar, capable of processing two instructions in parallel. Our team had roughly two months to complete this project.

Advanced Features:
❖ 2-way superscalar
❖ Write-back data cache
❖ Cache associativity > 1
❖ Loads issue out-of-order with respect to stores
❖ Dynamic branch prediction (tournament predictor)
❖ Automated regression testing and visual debugging mechanisms

Design

Out of Order Execution

To show that the processor successfully implements an out-of-order design, we constructed a simple test program: a multiply, followed by an add dependent on the multiply, followed by an add independent of the multiply. The reservation station output shows that the independent add issues before the dependent add, clearly demonstrating the out-of-order behavior of the processor.

Figure 1: An ADDI instruction leaving the RS before the ADD instruction that follows it, as shown by our debug output

2-way Superscalar

An embarrassingly parallel program was written to prove that we successfully implemented a 2-way superscalar design. This program consists of 12 independent add instructions looped over some number of times. The graph shows that as the program keeps executing the same parallel instructions, the average CPI amortizes to ~0.6, well below the minimum CPI of 1.0 that a single-wide processor could achieve.
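As a point of reference (a simple bound, not a measurement from our runs), an ideal k-wide machine retires at most k instructions per cycle, so

    \text{CPI} = \frac{\text{cycles}}{\text{instructions}} \ge \frac{1}{k}
    \quad\Longrightarrow\quad
    \text{CPI}_{\min} = 0.5 \text{ (2-wide)}, \qquad 1.0 \text{ (1-wide)}

Our measured ~0.6 sits between these two bounds.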

Figure 2: Indications of Superscalar behavior

Pipeline Design and Stages

Figure 3: High-level PotatoLakeZ pipeline design with legend. The red arrows represent the critical path

Fetch

The processor uses a 2-wide fetch and receives information about structural hazards from dispatch and retire (discussed in their respective sections). Requests for instructions that are mis-aligned with respect to an 8B cache line are handled by re-fetching the most recent PC and invalidating all packets besides the mis-aligned instruction.
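A minimal sketch of what this alignment check can look like is shown below; the signal names are illustrative and not taken from our source.

    // With 8B (two-instruction) cache lines, a fetch PC that points at the
    // upper word of a line can only return one valid instruction packet.
    logic        misaligned;
    logic [31:0] fetch_pc;
    logic [1:0]  valid_mask;        // one valid bit per fetched packet

    assign misaligned = fetch_pc[2];        // PC sits on the second word of the line
    assign valid_mask = misaligned ? 2'b01  // keep only the mis-aligned instruction
                                   : 2'b11; // both packets valid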

Dispatch

Our processor has a 2-wide dispatch with some variation from the standard MIPS R10K design. The tournament branch predictor is instantiated within the dispatch stage. Dispatch ensures that instructions encountering hazards are re-sent to fetch; these hazards include full signals from the ROB and RS, an empty signal from the Free List, and hazard signals from the load and store queues.
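A minimal sketch of the kind of stall check dispatch performs is shown below; the signal names are hypothetical stand-ins for the real hazard inputs.

    // Combine the structural hazard inputs described above into one stall signal.
    logic rob_full, rs_full, freelist_empty, lq_hazard, sq_hazard;
    logic dispatch_stall;

    assign dispatch_stall = rob_full | rs_full | freelist_empty | lq_hazard | sq_hazard;

    // When dispatch_stall is high, the affected instructions are invalidated in
    // dispatch and their PC is re-sent to fetch so that no instruction is lost.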

Issue

The 2-wide issue of this processor is restricted by a lack of internal forwarding within the load and store queues. If a load and a store are dispatched simultaneously, there is no mechanism to correctly return their load/store queue positions, since those are written on the clock edge. Two loads or two stores do not suffer from this limitation, however.

Execute 1

In the execute 1 stage, the width of the processor expands to 5-wide: 2 ALUs, 2 multipliers, and a single port to the LSQ. We pipelined our multiplier into 4 separate stages to balance the reduction in clock period against the latency of producing the result.
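A minimal sketch of a 4-stage pipelined multiplier in this style is shown below. It is illustrative only: the module name and interface are assumptions, and the project's multiplier also carries destination tags and other metadata through its stages.

    module mult4 (
        input  logic        clock, reset, start,
        input  logic [31:0] a, b,
        output logic [63:0] product,
        output logic        done
    );
        // Each stage folds in 8 more bits of the multiplier.
        logic [63:0] prod  [4];
        logic [63:0] mcand [4];
        logic [31:0] mplr  [4];
        logic        valid [4];

        always_ff @(posedge clock) begin
            // Stage 0: start a new multiply with the low 8 multiplier bits.
            prod[0]  <= {32'b0, a} * b[7:0];
            mcand[0] <= {32'b0, a} << 8;
            mplr[0]  <= b >> 8;
            valid[0] <= start & ~reset;

            // Stages 1-3: accumulate the next 8 multiplier bits each cycle.
            for (int i = 1; i < 4; i++) begin
                prod[i]  <= prod[i-1] + mcand[i-1] * mplr[i-1][7:0];
                mcand[i] <= mcand[i-1] << 8;
                mplr[i]  <= mplr[i-1] >> 8;
                valid[i] <= valid[i-1] & ~reset;
            end
        end

        assign product = prod[3];
        assign done    = valid[3];
    endmodule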

Execute 2

The execute 2 stage consists solely of a ROB-sized buffer that funnels the 5-wide execute 1 stage back down to the 2-wide width of the rest of the pipeline. A buffer the size of the ROB was chosen because it requires no stall logic: the maximum number of in-flight instructions at any point is the ROB size. We realized that having such a large data structure at the tail end of our execute stage would most likely slow down the clock period of our processor, so we split it into its own stage to keep execute off the critical path. Interestingly, this stage was not our system's bottleneck.

Complete

After instructions are buffered out of the execute 2 stage, physical register tags are broadcast on the CDB and values are written to the physical register file.

Retire

The retire stage updates the architected versions of the free list and map table, in addition to sending store requests to the data cache. Depending on whether an instruction is “first” or “second” in the superscalar width (2-wide), some of the retiring values are forwarded to the map table and free list when a mispredicted branch or memory violation is being retired.

Modules

16-entry Reorder Buffer

The reorder buffer (ROB) follows the MIPS R10K conventions and contains much of the same overhead. We ended up with 16 entries, but we did test other sizes such as 8 and 32; unfortunately, testing with these different sizes exposed some unexpected bugs that we were unable to squash before the deadline. Our ROB uses a circular-buffer-style queue with head and tail pointers. The implementation relies heavily on the size being a power of two in order to exploit the ability of the head and tail pointers to wrap around, and the determination of whether the ROB is full relies on head and tail pointer arithmetic. We defined a `WRAP macro for cases where we needed to perform arithmetic on the head/tail without storing it into a bus while still accounting for wrap-around. The ROB mainly communicates with the dispatch, complete, and retire stages. Additionally, it receives information about loads and branches that require a flush of the system, which is discussed in further detail later.
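A minimal sketch of this power-of-two wrap-around arithmetic is shown below. The `WRAP name comes from our design, but the exact definition and the extra-pointer-bit full/empty scheme are illustrative assumptions, not the literal source.

    `define ROB_SIZE 16
    `define WRAP(idx) ((idx) & (`ROB_SIZE - 1))   // cheap modulo for power-of-two sizes

    // One extra pointer bit distinguishes a full queue from an empty one.
    logic [$clog2(`ROB_SIZE):0] head, tail;
    logic full, empty;

    assign empty = (head == tail);
    assign full  = (`WRAP(head) == `WRAP(tail)) && (head != tail);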

8-entry Reservation Station

The reservation station (RS) is a standard part of the R10K architecture and follows its conventions. It has 8 entries, which generally provide enough bandwidth to avoid stalling the rest of the processor. The reservation station uses a pair of priority selectors/encoders for dispatch and issue. The issue policy is stricter than a typical MIPS R10K machine's, as it will never issue 2 memory operations at the same time. This is because our data cache is blocking and our load and store queues do not allow internal forwarding when they associatively CAM (content-addressable memory) during the execute stage. For this reason, we have to let a single load get all the way through the system before we allow another one to issue. In general, the RS needs to check the status of memory before it allows a load to issue, and multiple in-flight loads are not handled by our memory system design. The repercussions of this stricter issue policy are discussed in the analysis section.

8-Entry Load-Store Queue

We chose to implement separate load and store queues, each with 8 entries. The main reason behind this design choice was to make store-to-load forwarding and detecting memory violations easier. When a load or store is dispatched, an entry in the respective queue is allocated and marked as busy. To keep track of the order of each memory operation, the current tail of the opposite queue is stored in the ROB entry during dispatch; for example, when a load is dispatched, the current tail of the store queue is saved into the load's ROB entry. Because we dispatch instructions in order, this ordering is guaranteed to be correct. It allows loads to issue out of order past pending stores, since we can detect during execution, using their LQ/SQ positions, whether they caused a memory violation. Our issuing scheme is non-speculative: we do not predict whether a memory hazard will occur before issuing, only detect it during execution.

As a load or store executes, the rest of its queue entry is filled in with the memory address and, for stores, the value. When a load executes, it CAMs the store queue to check whether a store that dispatched before the load, to the same memory address, has already executed; if so, that store's value is forwarded to the load. When a store executes, it CAMs the load queue to check whether a load has executed out of order and received the wrong value (a memory violation); if so, the violating load is flagged in the ROB. To handle memory violations, the instructions in the ROB before the load are retired as normal, but when the violating load reaches the head of the ROB, the ROB, RS, LQ, and SQ are flushed and the map table and free list are restored to the architected state. Any pending memory requests must also complete before the flush to avoid invalid instructions floating in the pipeline.

In addition to memory hazards, when the load queue or store queue is full, a signal is sent to dispatch and any loads/stores attempting to dispatch are sent back to fetch so the instructions are not lost. We also do not have logic for handling a load and a store dispatching at the same time, so when this happens, one of the instructions is invalidated and sent back to the fetch stage.
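A minimal sketch of the store-queue CAM a load performs at execute is shown below. The types, widths, and names are illustrative, and the age comparison is simplified (circular-queue wrap-around and youngest-match selection are omitted for brevity).

    typedef struct packed {
        logic        valid;      // entry allocated
        logic        executed;   // address (and value) known
        logic [31:0] addr;
        logic [31:0] value;
    } SQ_ENTRY;

    SQ_ENTRY     sq [8];
    logic [2:0]  load_sq_tail;   // store-queue tail captured when the load dispatched
    logic [31:0] load_addr;
    logic        fwd_hit;
    logic [31:0] fwd_value;

    always_comb begin
        fwd_hit   = 1'b0;
        fwd_value = '0;
        // Only stores older than the load (allocated before the captured tail)
        // are candidates for forwarding.
        for (int i = 0; i < 8; i++) begin
            if (sq[i].valid && sq[i].executed &&
                (i < load_sq_tail) && (sq[i].addr == load_addr)) begin
                fwd_hit   = 1'b1;
                fwd_value = sq[i].value;
            end
        end
    end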

Memory Systems

Memory Arbiter

Because the memory module is single-ported, requests must be arbitrated. Our priority scheme is static and strictly prioritizes data memory requests over instruction memory requests; however, if the instruction cache is in the middle of handling a memory request, a data request cannot interrupt it. The arbiter virtualizes communication from each cache to memory, meaning that each cache is unaware of the other cache: a waiting cache simply sees that memory has not yet responded with a ticket for its request.
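A minimal sketch of this static-priority selection is shown below. The BUS_* values mirror the bus commands described later in this section; the signal names and the in-flight tracking are illustrative assumptions.

    typedef enum logic [1:0] { BUS_NONE, BUS_LOAD, BUS_STORE } BUS_COMMAND;

    BUS_COMMAND  dcache_cmd, icache_cmd, mem_cmd;
    logic [31:0] dcache_addr, icache_addr, mem_addr;
    logic        icache_in_flight;   // I$ already has an outstanding memory transaction

    always_comb begin
        if (dcache_cmd != BUS_NONE && !icache_in_flight) begin
            // Data requests win whenever the I$ is not mid-transaction.
            mem_cmd  = dcache_cmd;
            mem_addr = dcache_addr;
        end else begin
            mem_cmd  = icache_cmd;
            mem_addr = icache_addr;
        end
    end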

128B Instruction Cache

The 16-line instruction cache (I$) is direct-mapped and was not parameterized to handle different associativity levels. The instruction cache is blocking and does not perform prefetching. Aside from servicing requests from the fetch stage, the I$ also takes in information regarding structural hazards, which include a branch prediction that fetches a different, non-sequential PC, load/store hazards, and full/empty hazards with the ROB, RS, and/or Free List. After the I$ receives a request, it sends its own request to the memArbiter module, which can receive two requests at any given time, one each from the D$ and the I$. Because the memArbiter prioritizes data accesses, and because we lack miss status handling registers, our processor incurs heavy latencies from non-overlapping misses. These observations are discussed further in the analysis section of our report.
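For reference, the address breakdown implied by a 128B, direct-mapped cache with 8B lines (16 lines total) is sketched below; the exact field names are ours, not the source's.

    logic [31:0] fetch_addr;
    logic [2:0]  line_offset;   // byte within the 8B line
    logic [3:0]  set_index;     // 16 lines -> 4 index bits
    logic [24:0] tag;           // remaining upper bits

    assign line_offset = fetch_addr[2:0];
    assign set_index   = fetch_addr[6:3];
    assign tag         = fetch_addr[31:7];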

128B Data Cache

The 128B data cache (D$) is parameterized to handle different associativities, although some edge cases break the parameterization. The D$ has a write-allocate, write-back policy and uses an LRU eviction policy. Transactions to memory use an enumerated SystemVerilog type called BUS_COMMAND (BUS_NONE, BUS_LOAD, BUS_STORE). A BUS_NONE request is a blank request to memory, a BUS_LOAD requests a line of memory for the cache, and a BUS_STORE writes a dirty line back to memory. Every time a cache eviction occurs, a BUS_STORE request is sent to memory along with a BUS_LOAD requesting the new line. Our design enumerates 5 different types of misses:

❖ NO_MISS
❖ LOAD_MISS_CLEAN
❖ LOAD_MISS_DIRTY
❖ STORE_MISS_CLEAN
❖ STORE_MISS_DIRTY

The D$ sends this miss metadata, along with any dirty line, to a memory buffer, which performs different BUS_COMMAND operations depending on the miss type. When memory responds to a miss, the data is sent to execute and to the D$ in parallel.
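A minimal sketch of the miss classification and the commands the memory buffer would issue for each case is shown below. The enum literals come from our design; the surrounding logic and names are illustrative.

    typedef enum logic [1:0] { BUS_NONE, BUS_LOAD, BUS_STORE } BUS_COMMAND;
    typedef enum logic [2:0] {
        NO_MISS,
        LOAD_MISS_CLEAN,
        LOAD_MISS_DIRTY,
        STORE_MISS_CLEAN,
        STORE_MISS_DIRTY
    } MISS_TYPE;

    MISS_TYPE   miss;
    BUS_COMMAND first_cmd, second_cmd;

    always_comb begin
        first_cmd  = BUS_NONE;
        second_cmd = BUS_NONE;
        case (miss)
            // Clean misses only need to fetch the new line.
            LOAD_MISS_CLEAN, STORE_MISS_CLEAN: first_cmd = BUS_LOAD;
            // Dirty misses write the victim back first, then fetch the new line.
            LOAD_MISS_DIRTY, STORE_MISS_DIRTY: begin
                first_cmd  = BUS_STORE;
                second_cmd = BUS_LOAD;
            end
            default: ;   // NO_MISS: nothing to request
        endcase
    end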

FIFO Memory Buffer

There are times when the cache has to send more than one command, such as when it needs to write dirty data back to memory. The 2-entry memory buffer takes in requests that stem from the misses described above; at most it will ever require a BUS_STORE followed by a BUS_LOAD, on a dirty miss.

Branch Predictor

The branch predictor is implemented as a tournament predictor which chooses between a G-Share predictor and a simple pattern history table (PHT) for the direction of a branch. G-Share combines a global branch history register with bits of the branch PC (Program Counter) to index its 2-bit counters. Since both prediction methods were of interest to us, we chose to combine the two in the tournament prediction method, which also offered better performance.
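A minimal sketch of a G-Share index and 2-bit counter update for a 16-entry table (the size discussed in the analysis section) is shown below. Standard G-Share XORs history with PC bits, which is what is assumed here; the signal names are illustrative, and in the real design the update index comes from the resolved branch rather than the current fetch.

    localparam HIST_BITS = 4;                       // 2**4 = 16 counters

    logic                 clock;
    logic [HIST_BITS-1:0] global_history;
    logic [1:0]           counters [2**HIST_BITS];  // 2-bit saturating counters
    logic [31:0]          branch_pc;
    logic [HIST_BITS-1:0] index;
    logic                 prediction, actual_taken, update_en;

    assign index      = global_history ^ branch_pc[HIST_BITS+1:2];  // skip the byte-offset bits
    assign prediction = counters[index][1];                          // MSB = predict taken

    always_ff @(posedge clock) begin
        if (update_en) begin
            // Saturating counter update on branch resolution.
            if (actual_taken  && counters[index] != 2'b11) counters[index] <= counters[index] + 1;
            if (!actual_taken && counters[index] != 2'b00) counters[index] <= counters[index] - 1;
            global_history <= {global_history[HIST_BITS-2:0], actual_taken};
        end
    end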

FIFO Free List

The free list is implemented as a FIFO queue of physical register numbers. The register at the head of the queue (the one freed longest ago) is assigned to an instruction needing a destination register at dispatch, and registers are added to the tail of the queue at retirement. To handle instructions without a destination register in a uniform way, dispatch drives a bus called num_no_regs, which is raised for the indices of instructions that have no destination; the free list then gives tags back to the dispatch stage based on this signal.

Map Table

The map table stores the mappings from architectural registers to physical registers. It is updated at dispatch with new mappings as instructions “grab” physical register tags. At complete, the tag's plus bit is set, letting the rest of the system know that the value in that physical register has been updated.

Architected Structures

When the processor reaches a mispredicted branch or a violating load, we roll back the free list and map table using the architected state of each, which is only changed during retirement. Due to the superscalar nature of the processor, we may have to forward some of the values being retired to their respective structures; this is done through selection logic that chooses between the architected state and the retire packets. The architected state is also updated during a mispredicted branch.

There are also some subtle differences between flushing the pipeline on a mispredicted branch and on a violating load. For branches, we still want to commit their values (if they are jump instructions) before flushing, whereas for loads, the state after rollback should appear as if the load never entered the system.

Testing

Module Tests

For most modules, we created an individual testbench for basic functionality tests. This gave us a baseline for each module's function, which sped up the integration of the overall pipeline. As we added more features to each module, we were able to refer to the baseline tests to guarantee that the module was still functioning properly.

runSingle script

Once the pipeline was assembled, we used our runSingle script which would check the output from any given program on our pipeline against ground truth output from the project 3 pipeline. This allowed us to very quickly identify problems within our implementation. This script could also run programs with the synthesized version of the processor, by specifying a command line argument. The script also supports verbose outputs, which shows the differences between the ground truth and the output from our processor, in a color coded fashion.

Figure 4: Verbose output from the runSingle script. Yellow output is the program output, and blue output is the writeback output.

runAll script

Our runAll script would run all programs in the test_progs folder on our processor. This was essential in the final stages of debugging, as we could quickly run all programs and see where new changes to our processor either fixed things, or broke things further. The best use case for the runAll script was during the phase where the memory latency was hardcoded to detect bugs in memory logic.

`define DEBUG

Throughout our processor, we used the DEBUG macro which, if defined, would output ~92 lines of output per cycle, giving a detailed look into what was happening on each cycle. Due to the sheer amount of output being generated, larger programs took a significantly longer time to complete with DEBUG defined. Having the ability to turn the debug output off for regression testing saved us a significant amount of time.
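A minimal example of the style of conditional debug output used throughout the pipeline is shown below; the signals and the message are illustrative placeholders, and the real output is roughly 92 lines per cycle.

    logic        clock;
    logic [31:0] cycle_count;
    logic [3:0]  rob_head, rob_tail;
    logic [7:0]  rs_busy;

    `ifdef DEBUG
        always_ff @(posedge clock) begin
            // Compiled in only when DEBUG is defined; silent otherwise.
            $display("cycle %0d | ROB head=%0d tail=%0d | RS busy=%b",
                     cycle_count, rob_head, rob_tail, rs_busy);
        end
    `endif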

Analysis

CPI - A Breakdown

Figure 5: CPI average by overall number of instructions

The average-CPI figure indicates that insertion sort performs the best in terms of bandwidth. The likely reason for the high CPI, even with a 2-wide machine, is our memory latency and stalling logic, which is discussed later in this section and in Reconsiderations and Improvements.

Synthesis Clock Periods

Table 1: Synthesis clock periods of different components

Overall, we felt that the synthesis clock periods of our individual components were very competitive and we were proud of that result. Especially early on, our main goal was to create optimised designs, which we think these clock periods reflect.

Cache Analysis

Given the timeline of this class and the pace at which our group was operating, we were unable to implement a non-blocking data cache. As expected, a higher associativity yields a higher hit rate (Figure 6). Unfortunately, we were unable to test with a fully associative or a direct-mapped cache, because the implementation makes use of 2-dimensional logical buses: when the cache is fully associative or direct-mapped, the set index or the way index has zero width, and since the logic always indexes into two dimensions of the cache, those combinational assignments likely do not work with the parameterization. This could have been fixed by using 1-dimensional buses and doing a bit of extra arithmetic with the set and way values. The discrepancy between the hit rates in Figure 7 shows that the I$ would have benefited from prefetching. It would possibly benefit from a victim cache as well, as there were likely hot sets involved in function calls within the C programs. The cost of D$ misses is also quite substantial because our cache misses are not overlapped; these drawbacks are discussed in our Reconsiderations and Improvements section.

Figure 6: Hit rate for different cache associativities

Figure 7: Overall cache hit percentages with total hits

LSQ Analysis

To prove that our LSQ design worked successfully, we tracked multiple metrics of our LSQ during each test run. The metrics include the number of loads and stores retired, the number of memory violations, and the number of store-to-load forwards. From these we calculated the rates of memory violations (memory violations / number of loads), store-to-load forwarding (forwards / number of loads), and load/store hazards (number of hazards / (number of loads + number of stores)). The averages for each are shown in the figure below.

Figure 8: Various rates for features implemented in the LSQ

We were disappointed to see the small store-to-load forwarding rate, as the feature resulted in a very minimal program speedup. On average, the percentage speedup due to forwarding was a measly 0.01% (number of forwards × D$ miss rate × memory latency / total clock cycles). We think this small speedup is partly due to our cache being blocking, resulting in fewer memory operations being handled at the same time. Another performance hit likely comes from wasted execution time on loads that are later squashed due to memory hazards. We would have liked to reduce these numbers by implementing a speculative load issuing scheme rather than our non-speculative design.
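Written out, the estimate above is

    \text{speedup} \;\approx\; \frac{N_{\text{forwards}} \times \text{miss rate}_{D\$} \times \text{latency}_{\text{mem}}}{\text{total clock cycles}} \;\approx\; 0.01\%

which reflects that a forward only saves a memory round trip when the load would otherwise have missed in the D$.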

Branch Predictor Analysis

For many different kinds of workloads, the branch predictor was able to predict the direction of conditional branches very accurately. However, the accuracy can be extremely high for certain applications, like matrix multiplication, while only moderate for others. Overall, we were very satisfied with how well our tournament branch predictor worked, but we were surprised by some of the statistics.

Figure 9: Comparison of conditional branch prediction accuracy for different workloads, with a branch predictor size of 16.

For the programs in the figure above, the size of the branch predictor did not affect its accuracy. As used here, size means the maximum number of branches the predictor can track, i.e., two raised to the number of branch PC bits used for indexing. The sizes tested were 8, 16, 32, and 64. Our tournament predictor is composed of a G-Share predictor and a predictor composed of a simple array of 2-bit saturating counters. In our design, the hybrid predictor defaults to the simple predictor and switches to the G-Share predictor when the simple predictor does not perform well.

Component Size Analysis

We believe the smaller ROB size had a better average CPI than the larger ROB size because of the small size of the I-cache. When a branch mispredicts with a large ROB, the I$ most likely fills with a large number of instructions fetched along the mispredicted path. When that branch resolves at retire and the ROB is cleared, we have to re-fetch the instructions that were displaced from the I-cache, whereas with a smaller ROB there is still room for those instructions, so they may not have been kicked out by the mispredicted branch's instructions. Figure 10 demonstrates the point.

Figure 10: Average CPI for varying component sizes

Critical Path

After running synthesis on our processor, we found that the critical path ran from issue to execute, from execute to the load queue, from the load queue to the D$, and finally from the D$ to the memory buffer. This is the worst-case serial path for a load that misses in the store queue (no value to forward), misses in the D$, and must be sent to memory. While our group considered adding a flip-flop along this path, we realized that, given the high volume of loads and stores, the resulting reduction in clock period would not be enough to offset the increased CPI from adding another stop on the way to memory. Refer to Figure 3 for a high-level overview of the critical path, and to Table 1 to see how the main components synthesized; together they indicate that the bottleneck lies in inter-component travel rather than intra-component CAM logic. Sending wires in parallel to the load queue, D$, and memory buffer would likely have reduced the clock period.

Reconsiderations and Improvements

Given the time spent on our advanced features and the benefits they offered, it is clear that store-to-load forwarding and allowing loads to issue out-of-order with respect to stores were not worthwhile features. The critical path running to memory, coupled with the blocking cache and load/store structural hazards, points to a specific feature that would have greatly improved performance: a non-blocking cache. The LSQ analysis shows that the speedup gained from store-to-load forwarding is essentially non-existent (<0.1%), and the 2.76% memory hazard rate accounts for instructions flushed out of the pipeline. A non-blocking cache would reduce the stalling involved in waiting for a miss to return, even though we only miss on 6.8% of cache accesses. We would also have liked to implement a speculative load issuing scheme so we could reduce the number of memory hazards themselves. On a similar note, our D$ outperforms our I$ by approximately 20% (hit rate). We did not have instruction prefetching, which we would also have liked to add to make up for the latency incurred by the 26.3% of instruction fetches that miss. Finally, the I$ would likely have benefited from a victim cache, especially for programs that have several function calls and potentially hot sets.

Due to time constraints, we were also not able to make our processor completely parameterizable in terms of width. Our processor is only two-wide, and changing the width parameter to anything greater will break the processor. This limitation kept our CPI high on many programs.

In general, our team would have benefited from a more in-depth design phase, with more consideration of the impact certain advanced features would have on performance rather than just whether we wanted to implement them. All in all, we are very proud of this project and of the time spent throughout the semester assembling these structures. We will each be sending individual emails with our opinions on group contributions. While we are still disappointed about the features we never got around to implementing, we are proud to have gotten through this unusual semester, and we thank the 470 staff immensely for their assistance!
