Superscalar Instruction Dispatch with Exception Handling
Total Page:16
File Type:pdf, Size:1020Kb
Superscalar Instruction Dispatch With Exception Handling Design and Testing Report Joseph McMahan December 2013 Contents 1 Design 3 1.1 Overview . ............. ............. 3 1.1.1 The Tomasulo Algorithm . ............. 4 1.1.2 Two-Phase Cloc"ing .......... ............. 4 1.1.3 #esolving Data Hazards . ............. ' 1.1.4 (oftware Interface............ ............. + 1.2 High ,evel Design ............. ............. 10 1.3 ,ow ,evel Design ............. ............. 12 1.3.1 Buses . ............. ............. 12 1.3.2 .unctional /nits ............. ............ 13 1.3.3 Completion File ............. ............. 14 1.3.4 Bus !0cles . ............. ............. 17 1.3.' *nstruction Deco&e an& Dispatch . ............. 18 2 Testing 23 2.1 Tests . ............. ............. 23 2.1.1 ,oa& Data2 Ad& ............. ............. 23 2.1.2 #esource Hazard3 Onl0 One (tation Available ......... 24 2.1.3 Multipl0 . ............. ............. 26 2.1.4 #esource Hazard3 Two Instructions o) (ame Type . 26 2.1.' Data Hazar&3 #A5........... ............. 29 2.1.4 Data Hazar&3 5A5 .......... ............. 31 2.1.+ Data Hazar&3 5AR . ............. 31 2.1.1 Memor0 Hazar&3 5AR . ............. 35 2.1.6 #esource Hazard3 No (tations Available ............ 3' 2.1.10 (tall on .ull Completion File . ............ 31 2.1.11 Han&le 89ception............ ............. 40 1 3 Code 43 3.1 Mo&ules . ............. ............. 43 3.1.1 rstation.v . ............. ............. 43 3.1.2 reg.v . ............. ............. 49 3.1.3 fetch.v . ............. ............. 53 3.1.4 store.v . ............. ............. '6 3.1.' mem-unit.v . ............. ............. 63 3.1.4 ID.v . ............. ............. 70 3.1.+ alu.v . ............. ............. 84 3.1.1 alu-unit.v . ............. ............. 14 3.1.6 mult.v . ............. ............. 90 3.1.10 mp-unit.v . ............. ............. 92 3.1.11 cf.v . ............. ............. 64 3.1.12 IMem.v . ............. ............. 107 3.1.13 cpu.v . ............. ............. 110 3.1.14 assemble.p0 . ............. ............. 114 2 Chapter 1 Design 1.1 Overview This report covers the &esign2 implementation2 an& testing of a basic superscalar ! / with exception han&ling. The ! / has a pea" throughput of two instructions per c0cle. The &esign uses the Tomasulo algorithm for parallel execution :see section 1.1.1;. This calls for an instruction &ispatch that &0namicall0 allocates resources an& allows for localize& control logic. There is no state machine governing the ! / actions< each functional unit operates un&er its own control. As long as the required resources are available, the instruction &eco&e unit will &ispatch two instructions ever0 cycle2 allowing the )unctional units to han&le the execution. Naturall02 the instruction deco&e will have to resolve the &ata hazar&s that arise in &ispatching two instructions ever0 cycle. The process for resolving #ea&-After- 5rite :RA5;2 5rite-After-Write :5A5;2 an& 5rite-After-#ead :5AR; hazards for both the register names an& memor0 accesses is covered in section 1.1.3. The functional units consist of an a&&er2 a multiplier2 a fetch unit2 an& a store unit. A completion >le is used to trac" all instructions that have been issued but not 0et retired into the retirement registers. (ection 1.3.3 covers the implementation an& purpose of the completion >le in the system. 3 writes are issued to the memory hierarch0. WAW The problem o) subsequent writes at the same a&&ress is han&led b0 the mem- ory interface alwa0s issuing writes in sequential order. RAW When a rea& comman& is issued b0 the instruction &ispatch an& goes out on an instruction bus, the allocated fetch reservation station chec"s i) the requeste& a&&ress exists in an0 o) the store stations :meaning it belongs to a pen&ing write operation;. *) there are no matches, it latches the a&dress )rom the instruction bus an& signals to the memor0 inter)ace that it has a rea& request rea&0 for the cache. *) there is a match2 it instea& latches the &ata &irectl0 from the store unit2 circumventing the memor0 access entirel0. This metho& not onl0 solves the issues that might arise when one gives strict priorit0 to reads over writes, it actuall0 eliminates the &ela0 )or memor0 access in situations with close rea&-after-write operations. 1.1.4 Software Interface op src/&st a&&r 2 2 28 Table 1.1: 7umber of bits &evoted to each >el& for a loa& or store comman&. op &st src1 src2 :empt0; imm ' 2 2 2 ' 16 Table 1.23 7umber of bits &evoted to each >el& for all AL/ an& multipl0 comman&s. The ! / ta"es 32-bit machine instructions. Tables 1.1.4 an& 1.1.4 show how the 32-bit instructions are &ivi&ed for loa&?store instructions :1.1.4; an& all other instructions :1.1.4;. + operation assembl0 t0pe resource funct a&& a&& 0 00 00 NOR nor 0 00 01 shi)t left logical sll 0 00 10 shi)t right logical srl 0 00 11 multipl0 mult 0 01 00 load lower imme&iate li 0 10 00 load upper immediate lui 0 10 01 store word sw 1 0 n?a load wor& lw 1 1 n?a Table 1.3: *nstruction opco&es. Note that loa& an& store instructions use 2-bit opco&es, while all others use 5-bit opco&es. ,oa& an& store comman&s use a separate instruction format from all other com- man&s; these comman&s use a 2-bit opco&e, a 2-bit register number2 an& a 28-bit wor& ad&ress. :The wor& a&&ress is later pa&&ed with two zeros to form a 30-bit b0te a&&ress.; All other instructions use a '-bit opco&e, 2-bits each )or a &estination register an& two source registers, an& a 16-bit imme&iate >el&. The immediate >el& allows for values speci>e& in a program to be loa&ed into registers. Table 1.1.4 shows how all opco&es are &ivi&e&. The >rst bit in&icates the in- struction type: memory or non-memory. @ariable-length opco&es are used an& the >rst bit in&icates how man0 bits to expect for the instruction. 1 is used for memory operations, 0 for all others. .or memor0 operations, the secon& bit indicates a loa& or a store operation. .or non-memory operations :all instructions e9cept loa&-wor& an& store-word;2 the remaining four bits give the rest o) the instruction. The secon& an& thir& bit make up the resource i&enti>er< an instruction can possibl0 use the AL/2 the multiplier2 or the register >le. A &esignate& >el& for resource t0pe ma"es it easier )or the instruction &eco&e to "now i) a resource hazard exists between two instructions. The >nal two bits give the function code, which speci>es which operation the )unctional unit is to perform. The multiplier has onl0 one possible operation. The register >le has two3 loa& lower imme&iate :load the imme&iate >el& into the lower 14 bits) an& loa& 1 1.2 High eve! Design The &esign of the ! / is bro"en into three major components: the instruction &eco&e/&ispatch unit :*D;2 the completion >le :!.;2 an& a set of )unctional units :./;. The system contains two instruction buses an& >ve &ata buses. 8ach bus has a single owner with write privileges. Figure 1.2 shows the bloc"s an& buses in the high-level &esign. The two instruction buses are output b0 the ID. At pea" throughput2 each has a new instruction every cycle. Each register in the #. monitors the B&stC :&estination; >el& o) the instruction buses; i) a register >n&s that its tag matches an instructionDs &estination >el&2 then it latches the allocate& station for that instruction2 which tells it what &ata to wait for. When &ata shows up with that tag attached2 the register latches it. The registers also monitor the “srcC :source) >el&s of each instruction. *) a tag match is foun& :i.e.2 a register is use& as an operan&;2 the register signals to the #. controller that it needs to output its contents onto a bus. Ever0 functional unit monitors the instruction buses. The reservation stations in each function unit compare their tags to the tag in the “allocate& stationC >el& of each instruction. *) it matches, the station latches the two source tags :which tell it what &ata to wait for; an& the function tag. When all operan&s have been acquired2 the functional unit executes the instruction2 an& puts it out on a &ata bus. The ALU, multiplier2 an& fetch unit all have ownership of one &ata bus. The completion >le has two :since multiple registers frequentl0 nee& to output their con- tents at the same time). The store unit never has &ata to output2 as it simple sen&s store requests to the memor0 controller. Ever0 functional unit �as well as the #. an& ID; monitor the tags on all >ve &ata buses. Ever0 bus is ta"en as an input into ever0 bloc" in the &esign. ,ocali%ed control in each ./ han&les how to execute each instruction2 meaning that the s0stem-wi&e control is minimal. 7o state machine is used for the ! U. The ID &ispatches instructions every cycle, an& the )unctional units han&le the execution. 10 Figure 1.13 High-level &esign o) the superscalar ! /. Onl0 the connections control- ling bus contents are &rawn here; however, every unit in the &esign takes as input ever0 bus. 11 Figure 1.23 Each functional unit has a reservation station o) two entries, which latch instructions an& collect the &ata needed for operations. 5hen operan& &ata are rea&02 the functional unit executes the &esired operation. 1.3.2 "unctiona! $nits The ! / has four functional units: a four-function ALU, a multiplier2 a fetch unit :for memor0 rea&s)2 an& a store unit :for memor0 writes).