Superscalar Instruction Dispatch with Exception Handling

Superscalar Instruction Dispatch With Exception Handling Design and Testing Report Joseph McMahan December 2013 Contents 1 Design 3 1.1 Overview . ............. ............. 3 1.1.1 The Tomasulo Algorithm . ............. 4 1.1.2 Two-Phase Cloc"ing .......... ............. 4 1.1.3 #esolving Data Hazards . ............. ' 1.1.4 (oftware Interface............ ............. + 1.2 High ,evel Design ............. ............. 10 1.3 ,ow ,evel Design ............. ............. 12 1.3.1 Buses . ............. ............. 12 1.3.2 .unctional /nits ............. ............ 13 1.3.3 Completion File ............. ............. 14 1.3.4 Bus !0cles . ............. ............. 17 1.3.' *nstruction Deco&e an& Dispatch . ............. 18 2 Testing 23 2.1 Tests . ............. ............. 23 2.1.1 ,oa& Data2 Ad& ............. ............. 23 2.1.2 #esource Hazard3 Onl0 One (tation Available ......... 24 2.1.3 Multipl0 . ............. ............. 26 2.1.4 #esource Hazard3 Two Instructions o) (ame Type . 26 2.1.' Data Hazar&3 #A5........... ............. 29 2.1.4 Data Hazar&3 5A5 .......... ............. 31 2.1.+ Data Hazar&3 5AR . ............. 31 2.1.1 Memor0 Hazar&3 5AR . ............. 35 2.1.6 #esource Hazard3 No (tations Available ............ 3' 2.1.10 (tall on .ull Completion File . ............ 31 2.1.11 Han&le 89ception............ ............. 40 1 3 Code 43 3.1 Mo&ules . ............. ............. 43 3.1.1 rstation.v . ............. ............. 43 3.1.2 reg.v . ............. ............. 49 3.1.3 fetch.v . ............. ............. 53 3.1.4 store.v . ............. ............. '6 3.1.' mem-unit.v . ............. ............. 63 3.1.4 ID.v . ............. ............. 70 3.1.+ alu.v . ............. ............. 84 3.1.1 alu-unit.v . ............. ............. 14 3.1.6 mult.v . ............. ............. 90 3.1.10 mp-unit.v . ............. ............. 92 3.1.11 cf.v . ............. ............. 64 3.1.12 IMem.v . ............. ............. 107 3.1.13 cpu.v . ............. ............. 110 3.1.14 assemble.p0 . ............. ............. 114 2 Chapter 1 Design 1.1 Overview This report covers the &esign2 implementation2 an& testing of a basic superscalar ! / with exception han&ling. The ! / has a pea" throughput of two instructions per c0cle. The &esign uses the Tomasulo algorithm for parallel execution :see section 1.1.1;. This calls for an instruction &ispatch that &0namicall0 allocates resources an& allows for localize& control logic. There is no state machine governing the ! / actions< each functional unit operates un&er its own control. As long as the required resources are available, the instruction &eco&e unit will &ispatch two instructions ever0 cycle2 allowing the )unctional units to han&le the execution. Naturall02 the instruction deco&e will have to resolve the &ata hazar&s that arise in &ispatching two instructions ever0 cycle. The process for resolving #ea&-After- 5rite :RA5;2 5rite-After-Write :5A5;2 an& 5rite-After-#ead :5AR; hazards for both the register names an& memor0 accesses is covered in section 1.1.3. The functional units consist of an a&&er2 a multiplier2 a fetch unit2 an& a store unit. A completion >le is used to trac" all instructions that have been issued but not 0et retired into the retirement registers. (ection 1.3.3 covers the implementation an& purpose of the completion >le in the system. 3 writes are issued to the memory hierarch0. WAW The problem o) subsequent writes at the same a&&ress is han&led b0 the memory interface alwa0s issuing writes in sequential order. RAW When a rea& comman& is issued b0 the instruction &ispatch an& goes out on an instruction bus, the allocated fetch reservation station chec"s i) the requeste& a&&ress exists in an0 o) the store stations :meaning it belongs to a pen&ing write operation;. *) there are no matches, it latches the a&dress )rom the instruction bus an& signals to the memor0 inter)ace that it has a rea& request rea&0 for the cache. *) there is a match2 it instea& latches the &ata &irectl0 from the store unit2 circumventing the memor0 access entirel0. This metho& not onl0 solves the issues that might arise when one gives strict priorit0 to reads over writes, it actuall0 eliminates the &ela0 )or memor0 access in situations with close rea&-after-write operations. 1.1.4 Software Interface op src/&st a&&r 2 2 28 Table 1.1: 7umber of bits &evoted to each >el& for a loa& or store comman&. op &st src1 src2 :empt0; imm ' 2 2 2 ' 16 Table 1.23 7umber of bits &evoted to each >el& for all AL/ an& multipl0 comman&s. The ! / ta"es 32-bit machine instructions. Tables 1.1.4 an& 1.1.4 show how the 32-bit instructions are &ivi&ed for loa&?store instructions :1.1.4; an& all other instructions :1.1.4;. + operation assembl0 t0pe resource funct a&& a&& 0 00 00 NOR nor 0 00 01 shi)t left logical sll 0 00 10 shi)t right logical srl 0 00 11 multipl0 mult 0 01 00 load lower imme&iate li 0 10 00 load upper immediate lui 0 10 01 store word sw 1 0 n?a load wor& lw 1 1 n?a Table 1.3: *nstruction opco&es. Note that loa& an& store instructions use 2-bit opco&es, while all others use 5-bit opco&es. ,oa& an& store comman&s use a separate instruction format from all other comman&s; these comman&s use a 2-bit opco&e, a 2-bit register number2 an& a 28-bit wor& ad&ress. :The wor& a&&ress is later pa&&ed with two zeros to form a 30-bit b0te a&&ress.; All other instructions use a '-bit opco&e, 2-bits each )or a &estination register an& two source registers, an& a 16-bit imme&iate >el&. The immediate >el& allows for values speci>e& in a program to be loa&ed into registers. Table 1.1.4 shows how all opco&es are &ivi&e&. The >rst bit in&icates the instruction type: memory or non-memory. @ariable-length opco&es are used an& the >rst bit in&icates how man0 bits to expect for the instruction. 1 is used for memory operations, 0 for all others. .or memor0 operations, the secon& bit indicates a loa& or a store operation. .or non-memory operations :all instructions e9cept loa&-wor& an& store-word;2 the remaining four bits give the rest o) the instruction. The secon& an& thir& bit make up the resource i&enti>er< an instruction can possibl0 use the AL/2 the multiplier2 or the register >le. A &esignate& >el& for resource t0pe ma"es it easier )or the instruction &eco&e to "now i) a resource hazard exists between two instructions. The >nal two bits give the function code, which speci>es which operation the )unctional unit is to perform. The multiplier has onl0 one possible operation. The register >le has two3 loa& lower imme&iate :load the imme&iate >el& into the lower 14 bits) an& loa& 1 1.2 High eve! Design The &esign of the ! / is bro"en into three major components: the instruction &eco&e/&ispatch unit :*D;2 the completion >le :!.;2 an& a set of )unctional units :./;. The system contains two instruction buses an& >ve &ata buses. 8ach bus has a single owner with write privileges. Figure 1.2 shows the bloc"s an& buses in the high-level &esign. The two instruction buses are output b0 the ID. At pea" throughput2 each has a new instruction every cycle. Each register in the #. monitors the B&stC :&estination; >el& o) the instruction buses; i) a register >n&s that its tag matches an instructionDs &estination >el&2 then it latches the allocate& station for that instruction2 which tells it what &ata to wait for. When &ata shows up with that tag attached2 the register latches it. The registers also monitor the “srcC :source) >el&s of each instruction. *) a tag match is foun& :i.e.2 a register is use& as an operan&;2 the register signals to the #. controller that it needs to output its contents onto a bus. Ever0 functional unit monitors the instruction buses. The reservation stations in each function unit compare their tags to the tag in the “allocate& stationC >el& of each instruction. *) it matches, the station latches the two source tags :which tell it what &ata to wait for; an& the function tag. When all operan&s have been acquired2 the functional unit executes the instruction2 an& puts it out on a &ata bus. The ALU, multiplier2 an& fetch unit all have ownership of one &ata bus. The completion >le has two :since multiple registers frequentl0 nee& to output their contents at the same time). The store unit never has &ata to output2 as it simple sen&s store requests to the memor0 controller. Ever0 functional unit �as well as the #. an& ID; monitor the tags on all >ve &ata buses. Ever0 bus is ta"en as an input into ever0 bloc" in the &esign. ,ocali%ed control in each ./ han&les how to execute each instruction2 meaning that the s0stem-wi&e control is minimal. 7o state machine is used for the ! U. The ID &ispatches instructions every cycle, an& the )unctional units han&le the execution. 10 Figure 1.13 High-level &esign o) the superscalar ! /. Onl0 the connections control- ling bus contents are &rawn here; however, every unit in the &esign takes as input ever0 bus. 11 Figure 1.23 Each functional unit has a reservation station o) two entries, which latch instructions an& collect the &ata needed for operations. 5hen operan& &ata are rea&02 the functional unit executes the &esired operation. 1.3.2 "unctiona! $nits The ! / has four functional units: a four-function ALU, a multiplier2 a fetch unit :for memor0 rea&s)2 an& a store unit :for memor0 writes).

Superscalar Instruction Dispatch with Exception Handling

IEEE Paper Template in A4

Computer Science 246 Computer Architecture Spring 2010 Harvard University

Dynamic Scheduling

United States Patent (19) 11 Patent Number: 5,680,565 Glew Et Al

Identifying Bottlenecks in a Multithreaded Superscalar

Advanced Processor Designs

Computer Architecture Out-Of-Order Execution

CSE 586 Computer Architecture Lecture 4

MIPS Architecture with Tomasulo Algorithm [12]

In-Line Interrupt Handling and Lock-Up Free Translation Lookaside Buffers (Tlbs)

Improving the Precise Interrupt Mechanism of Software- Managed TLB Miss Handlers

Identifying Bottlenecks in a Multithreaded Superscalar