EECS 470 Final Report: PotatoLakeZ Processor

James Read, Donato Mastropietro, Skyler Hau, Nathan Richards, Pratham Dhanjal
[jamread, donatom, hausky, nricha, pratham]@umich.edu

Table of Contents

  Introduction
  Design
    Out of Order Execution
    2-way Superscalar
    Pipeline Design and Stages: Fetch, Dispatch, Issue, Execute 1, Execute 2, Complete, Retire
    Modules: 16-entry Reorder Buffer, 8-entry Reservation Station, 8-entry Load-Store Queue
    Memory Systems: Memory Arbiter, 128B Instruction Cache, 128B Data Cache, FIFO Memory Buffer
    Branch Predictor
    FIFO Free List
    Map Table
    Architected Structures
  Testing: Module Tests, runSingle script, runAll script, `define DEBUG
  Analysis: CPI - A Breakdown, Synthesis Clock Periods, Cache Analysis, LSQ Analysis, Branch Predictor Analysis, Component Size Analysis, Critical Path
  Reconsiderations and Improvements

Introduction

This project implements a synthesizable out-of-order processor in the SystemVerilog hardware description language. The processor supports the RV32IM instruction set without system calls, fences, or division, and is implemented using the MIPS R10K register-renaming convention. The project's most important special feature is that it is 2-way superscalar, capable of executing two instructions in parallel. Our team had roughly two months to complete this project.

Advanced Features:
❖ 2-way superscalar
❖ Write-back data cache
❖ Cache associativity > 1
❖ Loads issue out of order with respect to stores
❖ Dynamic branch prediction (tournament predictor)
❖ Automated regression testing and visual debugging mechanisms

Design

Out of Order Execution

To show that the processor successfully implements an out-of-order design, a simple test program was constructed with a multiply, followed by an add dependent on the multiply, followed by an add independent of the multiply.
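The ordering this test exercises can be sketched with a toy issue model (Python, illustrative only; the actual reservation station is SystemVerilog, and the latencies and register names here are our assumptions). An instruction may issue once its source registers are ready, so the independent add does not have to wait for the multiply.

```python
program = [
    # (name, dest, sources, latency in cycles) -- illustrative values
    ("mul",     "r3", ("r1", "r2"), 4),
    ("add_dep", "r4", ("r3", "r1"), 1),  # depends on the multiply
    ("add_ind", "r5", ("r1", "r2"), 1),  # independent of the multiply
]

ready = {"r1", "r2"}     # architected sources start ready
waiting = list(program)
inflight = []            # (finish_cycle, dest)
issue_order = []
cycle = 0
while waiting or inflight:
    # "broadcast" finished results, waking up dependents
    for op in list(inflight):
        if op[0] <= cycle:
            ready.add(op[1])
            inflight.remove(op)
    # issue any instruction whose operands are ready, regardless of order
    for name, dest, srcs, lat in list(waiting):
        if all(s in ready for s in srcs):
            issue_order.append(name)
            inflight.append((cycle + lat, dest))
            waiting.remove((name, dest, srcs, lat))
    cycle += 1

print(issue_order)  # ['mul', 'add_ind', 'add_dep']
```

The independent add issues while the multiply is still executing, exactly the behavior the debug output in Figure 1 demonstrates.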
The output from the reservation station shows that the independent add issues before the dependent add, which clearly demonstrates the processor's out-of-order implementation.

Figure 1: An ADDI instruction leaving the RS before the ADD instruction that follows it, output by our debug output.

2-way Superscalar

An embarrassingly parallel program was written to demonstrate that we successfully implemented a 2-way superscalar processor. This program consisted of 12 unrelated add instructions, looped over some number of times. The graph shows that as the program continues to run the same parallel instructions, the average CPI amortizes to ~0.6, well below the minimum CPI of 1.0 that a single-wide processor could achieve.

Figure 2: Indications of superscalar behavior.

Pipeline Design and Stages

Figure 3: High-level PotatoLakeZ pipeline design with legend. The red arrows represent the critical path.

Fetch

The processor uses a 2-wide fetch and receives information about structural hazards from dispatch and retire (discussed in their respective sections). Requests for mis-aligned instructions (with respect to an 8B cache line) are handled by re-fetching the most recent PC and invalidating all packets besides the mis-aligned instruction.

Dispatch

Our processor has a 2-wide dispatch with some variation from the standard MIPS R10K design. The tournament branch predictor is instantiated within the dispatch stage. Dispatch ensures that hazardous instructions are re-sent to fetch. These hazards include full signals from the ROB and RS, an empty signal from the free list, and hazard signals from the load and store queues.

Issue

The 2-wide issue of this processor is restricted by a lack of internal forwarding within the load and store queues. If a load and a store are dispatched simultaneously, there is no mechanism to correctly return the load/store queue positions, since those are written on the clock edge.
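The clock-edge limitation can be illustrated with a toy model (Python; the function and signal names are ours, not the design's). Because the queue tails are registers, an allocation made this cycle is not visible to the other instruction of the same dispatch pair unless it is forwarded combinationally.

```python
def sq_tail_seen_by_load(sq_tail_reg, older_store_same_cycle, forward):
    """Store-queue position a load captures at dispatch to mark which
    stores are older than it. If an older store dispatches in the same
    cycle, the correct bound includes that store; a registered tail,
    updated only on the clock edge, does not."""
    if older_store_same_cycle and forward:
        return sq_tail_reg + 1   # would require intra-cycle forwarding
    return sq_tail_reg           # what the registered tail provides

# Without forwarding, the load undercounts its older stores by one and
# would miss forwarding/violation checks against the store dispatched
# alongside it -- hence the pair must be split across cycles instead.
print(sq_tail_seen_by_load(3, older_store_same_cycle=True, forward=False))  # 3
print(sq_tail_seen_by_load(3, older_store_same_cycle=True, forward=True))   # 4
```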
Two loads or two stores, however, do not suffer from this lack of internal forwarding.

Execute 1

In the execute 1 stage, the width of the processor expands to 5-wide: 2 ALUs, 2 multipliers, and a 1-wide port to the LSQ. We decided to pipeline our multiplier in 4 separate stages to balance reducing our clock period against the latency of producing the result.

Execute 2

The execute 2 stage consists solely of a ROB-sized buffer that funnels the 5-wide execute 1 stage back down to the 2-wide width of the rest of the pipeline. A buffer the size of the ROB was chosen because it needs no stall logic: the maximum number of in-flight instructions at any one point is the ROB size. We realized that having such a large data structure at the tail end of our execute stage would most likely slow down the clock period of our processor, so we split it into its own stage to ensure that the critical path would not be the execute stage. Interestingly enough, this was not our system's bottleneck.

Complete

After instructions are buffered out of the execute 2 stage, physical register tags are broadcast on the CDB and values are written to the physical register file.

Retire

The retire stage updates the architected versions of the free list and map table, in addition to sending store requests to the data cache. Depending on an instruction's position in the pipeline, either "first" or "second" in the superscalar width (2-wide), some of the values are forwarded to the map table and free list when a mispredicted branch or memory violation is being retired.

Modules

16-entry Reorder Buffer

The reorder buffer (ROB) follows the MIPS R10K conventions and carries much of the same overhead. We settled on 16 entries, but we also tested other sizes such as 8 and 32. Unfortunately, testing with these different sizes exposed some unexpected bugs that we were unable to squash before the deadline.
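The ROB's storage is a circular buffer whose full/empty checks reduce to head/tail pointer arithmetic. A Python reconstruction of that bookkeeping follows; the report names a `WRAP macro but does not define it, so the power-of-two masking here is our assumption of the standard form.

```python
ROB_SIZE = 16                       # must be a power of two

def wrap(x):
    """What a `WRAP-style macro computes: an index into the storage
    array, exploiting the power-of-two size as a bit-mask."""
    return x & (ROB_SIZE - 1)

class RobPointers:
    """Head/tail kept one bit wider than an index so that a full queue
    (count == ROB_SIZE) is distinguishable from an empty one."""
    def __init__(self):
        self.head = 0
        self.tail = 0

    def count(self):
        return (self.tail - self.head) % (2 * ROB_SIZE)

    def full(self):
        return self.count() == ROB_SIZE

    def empty(self):
        return self.count() == 0

    def dispatch(self):             # allocate the entry at the tail
        assert not self.full()
        entry = wrap(self.tail)
        self.tail = (self.tail + 1) % (2 * ROB_SIZE)
        return entry

    def retire(self):               # free the entry at the head
        assert not self.empty()
        entry = wrap(self.head)
        self.head = (self.head + 1) % (2 * ROB_SIZE)
        return entry

rob = RobPointers()
for _ in range(ROB_SIZE):
    rob.dispatch()
print(rob.full())                   # True: fullness is pure pointer math
```

Keeping the pointers one bit wider than an index is the usual way to make the full and empty cases distinguishable without a separate counter.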
Our ROB uses a circular-buffer-style queue with head and tail pointers. The implementation relies heavily on the size being a power of two, so as to exploit the ability of the head and tail pointers to wrap around. The determination of whether the ROB is full relies on head and tail pointer arithmetic. We defined a `WRAP macro for cases where we needed to perform arithmetic on the head/tail without storing it into a bus, while still accounting for wrap-around. The ROB mainly communicates with the dispatch, complete, and retire stages. Additionally, it receives information about loads and branches that require a flush of the system; this is discussed in further detail later.

8-entry Reservation Station

The reservation station (RS) is a standard part of the R10K architecture and follows its conventions. It has 8 entries, which generally provide enough bandwidth to avoid stalling the rest of the processor. The reservation station uses a pair of priority selectors/encoders for dispatch and issue. The issue policy is stricter than that of a typical MIPS R10K machine, as it will never issue 2 memory operations at the same time. This is because our data cache is blocking, and our load and store queues do not allow internal forwarding when they associatively CAM (content-addressable memory) during the execute stage. For this reason, we have to let a single load get all the way through the system before we allow another one to issue. In general, the RS must check the status of memory before it allows a load to issue; multiple in-flight loads are not handled by our memory system design. The repercussions of this stricter issue policy are discussed in the analysis section.

8-entry Load-Store Queue

We chose to implement separate load and store queues, each with 8 entries. The main reason behind this design choice was to make store-to-load forwarding and the detection of memory violations easier.
When a load or store is dispatched, an entry in the respective queue is allocated and marked as busy. To keep track of the relative order of memory operations, the current tail of the opposite queue is stored in the ROB entry during dispatch. For example, when a load is dispatched, the current tail of the store queue is saved into the load's entry in the ROB. Because we dispatch instructions in order, this ordering is guaranteed to be correct. It allows loads to issue out of order past pending stores, since any memory violation can be detected during execution from the LQ/SQ positions. Our issuing scheme is non-speculative: we do not predict whether a memory hazard will occur prior to issuing, only detect it during execution.

As a load or store executes, the rest of its queue entry is filled in with the memory address and, for stores, the value. When a load executes, it CAMs the store queue to check whether a store that dispatched before the load, with the same memory address, has already executed. If so, the value from that store is forwarded to the load. When a store executes, it CAMs the load queue to check whether a load has executed out of order and received the wrong value (a memory violation). If so, the violating load is flagged in the ROB.

To handle memory violations, the instructions in the ROB before the load are retired as normal, but when the violating load reaches the head of the ROB, the ROB, RS, LQ, and SQ are flushed, and the map table and free list are restored to the architected state. Any pending memory requests must also complete before the flush can happen, to avoid invalid instructions floating in the pipeline.
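The two CAM checks can be sketched as follows (a Python model; the entry fields and the flat, non-circular queues are our simplification of the hardware).

```python
def load_cams_store_queue(store_queue, sq_tail_at_dispatch, addr):
    """Executing load: find the youngest older store to the same address
    that has already executed; its value is forwarded to the load.
    Entries before sq_tail_at_dispatch (the SQ tail captured when the
    load dispatched) are the stores older than the load."""
    value = None
    for st in store_queue[:sq_tail_at_dispatch]:
        if st["executed"] and st["addr"] == addr:
            value = st["value"]     # later match = younger older-store
    return value                    # None -> the load must go to memory

def store_cams_load_queue(load_queue, lq_tail_at_dispatch, addr):
    """Executing store: flag younger loads to the same address that have
    already executed -- they read memory too early (memory violation)."""
    return [i for i, ld in enumerate(load_queue)
            if i >= lq_tail_at_dispatch and ld["executed"]
            and ld["addr"] == addr]

sq = [{"addr": 0x100, "value": 7, "executed": True},
      {"addr": 0x104, "value": 9, "executed": True}]
print(load_cams_store_queue(sq, sq_tail_at_dispatch=2, addr=0x100))  # 7

lq = [{"addr": 0x100, "executed": True},
      {"addr": 0x200, "executed": False}]
print(store_cams_load_queue(lq, lq_tail_at_dispatch=0, addr=0x100))  # [0]
```

In the second example the store finds a younger, already-executed load to its address; in the real design that load would be flagged in the ROB and trigger the flush-and-restore sequence described above.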
