<<

1

Single-Cycle Processors: Datapath & Control

Arvind Computer Science & Artificial Intelligence Lab M.I.T.

Based on the material prepared by Arvind and Krste Asanovic 6.823 L5- 2 Instruction Set Architecture (ISA) Arvind versus Implementation

• ISA is the hardware/software interface – Defines set of programmer visible state – Defines instruction format (bit encoding) and instruction semantics –Examples:MIPS, , IBM 360, JVM • Many possible implementations of one ISA – 360 implementations: model 30 (c. 1964), z900 (c. 2001) –x86 implementations:8086 (c. 1978), 80186, 286, 386, 486, Pentium, , Pentium-4 (c. 2000), AMD Athlon, Transmeta Crusoe, SoftPC – MIPS implementations: , , , ... –JVM:HotSpot, PicoJava, ARM Jazelle, ...

September 26, 2005 6.823 L5- 3 Arvind Performance

Time = Instructions Cycles Time Program Program * Instruction * Cycle

– Instructions per program depends on source code, compiler technology, and ISA – Cycles per instructions (CPI) depends upon the ISA and the – Time per cycle depends upon the microarchitecture and the base technology

Microarchitecture CPI cycle time Microcoded >1 short this lecture Single-cycle unpipelined 1 long Pipelined 1 short

September 26, 2005 6.823 L5- 4 Arvind

Microarchitecture: Implementation of an ISA

Controller control status points lines

Data path

Structure: How components are connected. Static Behavior: How data moves between components Dynamic

September 26, 2005 Hardware Elements

• Combinational circuits OpSelect – Mux, Demux, Decoder, ALU, ... - Add, Sub, ... - And, Or, Xor, Not, ... Sel - GT, LT, EQ, Zero, ... Sel lg(n) lg(n) O O0 A A0 0 O O1 A O 1 A Result 1 . A . . ALU Mux . lg(n) . . . Comp? Demux Decoder O B An-1 On n-1 1 • Synchronous state elements – Flipflop, Register, , SRAM, DRAM D register Clk ... D0 D1 D2 Dn-1 En En En ff ff ff ff ... ff Clk D Clk ... Q Q Q0 Q1 Q2 Qn-1 Edge-triggered: Data is sampled at the rising edge

September 26, 2005 6.823 L5- 6 Arvind Register Files Clock WE

we ReadSel1 rs1 rd1 ReadData1 ReadSel2 rs2 Register rd2 ReadData2 file WriteSel ws 2R+1W WriteData wd ws clk wd rs1 5 32 5 … … register 0 rd1 32 32

we … register 1

… rs2 32 5 register 31 rd2 32 32 • No timing issues in reading a selected register • Register files with a large number of ports are difficult to design – Intel’s , GPR File has 128 registers with 8 read ports and 4 write ports!!! September 26, 2005 6.823 L5- 7 Arvind A Simple Memory Model

WriteEnable Clock

Address MAGIC ReadData RAM WriteData

Reads and writes are always completed in one cycle • a Read can be done any time (i.e. combinational) • a Write is performed at the rising clock edge if it is enabled ⇒ the write address and data must be stable at the clock edge

Later in the course we will present a more realistic model of memory

September 26, 2005 6.823 L5- 8 Arvind

Implementing MIPS:

Single-cycle per instruction datapath & control logic

September 26, 2005 6.823 L5- 9 Arvind The MIPS ISA Processor State 32 32-bit GPRs, R0 always contains a 0 32 single precision FPRs, may also be viewed as 16 double precision FPRs FP , used for FP compares & exceptions PC, the program some other special registers Data types 8-bit byte, 16-bit half word 32-bit word for integers 32-bit word for single precision floating point 64-bit word for double precision floating point Load/Store style instruction set data addressing modes- immediate & indexed branch addressing modes- PC relative & register indirect Byte addressable memory- big endian mode All instructions are 32 bits September 26, 2005 6.823 L5- 10 Arvind Instruction Execution

Execution of an instruction involves

1. instruction fetch 2. decode and register fetch 3. ALU operation 4. memory operation (optional) 5. write back

and the computation of the address of the next instruction

September 26, 2005 6.823 L5- 11 Arvind Datapath: Reg-Reg ALU Instructions

RegWrite 0x4 Add clk inst<25:21> we inst<20:16> rs1 rs2 PC addr inst inst<15:11> rd1 ws ALU z Inst. wd rd2 clk GPRs Memory

inst<5:0> ALU Control

OpCode RegWrite Timing? 6 5 5 5 5 6 0 rs rt rd 0 func rd ← (rs) func (rt) 31 26 25 21 20 16 15 11 5 0 September 26, 2005 6.823 L5- 12 Arvind Datapath: Reg-Imm ALU Instructions

RegWrite 0x4 clk Add inst<25:21> we rs1 rs2 addr PC inst<20:16> rd1 inst ws ALU wd rd2 z clk Inst. GPRs Memory inst<15:0> Imm Ext inst<31:26> ALU Control

OpCode ExtSel

655 16 opcode rs rt immediate rt ← (rs) op immediate 31 26 25 2120 16 15 0 September 26, 2005 6.823 L5- 13 Arvind Conflicts in Merging Datapath

RegWrite 0x4 Introduce clk Add muxes inst<25:21> we rs1 rs2 addr PC inst<20:16> rd1 inst ws ALU inst<15:11> wd rd2 z clk Inst. GPRs Memory inst<15:0> Imm Ext inst<31:26> ALU inst<5:0> Control

OpCode ExtSel 6 5 5 5 5 6 0 rs rt rd 0 func rd ← (rs) func (rt) opcode rs rt immediate rt ← (rs) op immediate

September 26, 2005 6.823 L5- 14 Arvind Datapath for ALU Instructions

RegWrite 0x4 clk Add <25:21> we rs1 <20:16> rs2 PC addr rd1 inst ws ALU <15:11> wd rd2 z clk Inst. GPRs Memory <15:0> Imm Ext <31:26>, <5:0> ALU Control

OpCode RegDst ExtSel OpSel BSrc rt / rd Reg / Imm 6 5 5 5 5 6 0 rs rt rd 0 func rd ← (rs) func (rt) opcode rs rt immediate rt ← (rs) op immediate

September 26, 2005 6.823 L5- 15 Arvind Datapath for Memory Instructions

Should program and data memory be separate?

Harvard style: separate (Aiken and Mark 1 influence) - read-only program memory - read/write data memory at some level the two memories have to be the same

Princeton style: the same (von Neumann’s influence) - A Load or Store instruction requires accessing the memory more than once during its execution

September 26, 2005 6.823 L5- 16 Arvind

Load/Store Instructions:Harvard Datapath

RegWrite MemWrite

0x4 clk WBSrc ALU / Mem Add “base” we rs1 clk rs2 rd1 we PC addr inst ws addr wd ALU rd2 z Inst. GPRs rdata clk Data Memory disp Imm Memory Ext wdata ALU Control

OpCode RegDst ExtSel OpSel BSrc 6 5 5 16 opcode rs rt displacement (rs) + displacement 31 26 25 21 20 16 15 0 rs is the base register rt is the destination of a Load or the source for a Store September 26, 2005 6.823 L5- 17 Arvind MIPS Control Instructions Conditional (on GPR) PC-relative branch 6 5 5 16 opcode rs offset BEQZ, BNEZ Unconditional register-indirect jumps 6 5 5 16 opcode rs JR, JALR Unconditional absolute jumps 6 26 opcode target J, JAL

• PC-relative branches add offset×4 to PC+4 to calculate the target address (offset is in words): ±128 KB range • Absolute jumps append target×4 to PC<31:28> to calculate the target address: 256 MB range • jump-&-link stores PC+4 into the link register (R31) • All Control Transfers are delayed by 1 instruction we will worry about the branch delay slot later September 26, 2005 6.823 L5- 18 Arvind Conditional Branches (BEQZ, BNEZ)

PCSrc br RegWrite MemWrite WBSrc

pc+4

0x4 Add Add

clk

we rs1 clk rs2 PC addr rd1 we inst ws addr wd rd2 ALU clk Inst. GPRs z rdata Memory Data Imm Memory Ext wdata ALU Control

OpCode RegDst ExtSel OpSel BSrc zero?

September 26, 2005 6.823 L5- 19 Arvind Register-Indirect Jumps (JR)

PCSrc br RegWrite MemWrite WBSrc rind

pc+4

0x4 Add Add

clk

we rs1 clk rs2 PC addr rd1 we inst ws addr wd rd2 ALU clk Inst. GPRs z rdata Memory Data Imm Memory Ext wdata ALU Control

OpCode RegDst ExtSel OpSel BSrc zero?

September 26, 2005 6.823 L5- 20 Arvind Register-Indirect Jump-&-Link (JALR)

PCSrc br RegWrite MemWrite WBSrc rind

pc+4

0x4 Add Add

clk

we rs1 clk rs2 PC addr 31 rd1 we inst ws addr wd rd2 ALU clk Inst. GPRs z rdata Memory Data Imm Memory Ext wdata ALU Control

OpCode RegDst ExtSel OpSel BSrc zero?

September 26, 2005 6.823 L5- 21 Arvind Absolute Jumps (J, JAL)

PCSrc br RegWrite MemWrite WBSrc rind jabs pc+4

0x4 Add Add

clk

we rs1 clk rs2 PC addr 31 rd1 we inst ws addr wd rd2 ALU clk Inst. GPRs z rdata Memory Data Imm Memory Ext wdata ALU Control

OpCode RegDst ExtSel OpSel BSrc zero?

September 26, 2005 6.823 L5- 22 Arvind Harvard-Style Datapath for MIPS

PCSrc br RegWrite MemWrite WBSrc rind jabs pc+4

0x4 Add Add

clk

we rs1 clk rs2 PC addr 31 rd1 we inst ws addr wd rd2 ALU clk Inst. GPRs z rdata Memory Data Imm Memory Ext wdata ALU Control

OpCode RegDst ExtSel OpSel BSrc zero?

September 26, 2005 23

Five-minute break to stretch your legs 6.823 L5- 24 Arvind Single-Cycle Hardwired Control:

We will assume • clock period is sufficiently long for all of the following steps to be “completed”:

1. instruction fetch 2. decode and register fetch 3. ALU operation 4. data fetch if required 5. register write-back setup time

⇒ tC > tIFetch + tRFetch + tALU+ tDMem+ tRWB

• At the rising edge of the following clock, the PC, the register file and the memory are updated

September 26, 2005 6.823 L5- 25 Hardwired Control is pure Arvind

ExtSel BSrc OpSel op code combinational MemWrite WBSrc zero? logic RegDst RegWrite PCSrc

September 26, 2005 6.823 L5- 26 Arvind ALU Control & Immediate Extension

Inst<5:0> (Func) Inst<31:26> (Opcode) ALUop + 0?

OpSel ( Func, Op, +, 0? )

Decode Map

ExtSel

( sExt16, uExt16, High16)

September 26, 2005 6.823 L5- 27 Arvind Hardwired Control Table

Opcode ExtSel BSrc OpSel MemW RegW WBSrc RegDst PCSrc ALU * Reg Func no yes ALU rd pc+4 ALUi sExt16 Imm Op no yes ALU rt pc+4

ALUiu uExt16 Imm Op no yes ALU rt pc+4

LW sExt16 Imm + no yes Mem rt pc+4

SW sExt16 Imm + yes no * * pc+4

BEQZz=0 sExt16 *0? no no * * br

BEQZz=1 sExt16 * 0? no no * * pc+4 J * * * no no * * jabs JAL * * * no yes PC R31 jabs JR * ** no no * * rind JALR ** *no yesPC R31 rind

BSrc = Reg / Imm WBSrc = ALU / Mem / PC RegDst = rt / rd / R31 PCSrc = pc+4 / br / rind / jabs

September 26, 2005 6.823 L5- 28 Arvind Pipelined MIPS To pipeline MIPS:

• First build MIPS without pipelining with CPI=1

• Next, add pipeline registers to reduce cycle time while maintaining CPI=1

September 26, 2005 6.823 L5- 29 Arvind Pipelined Datapath

0x4 Add we rs1 rs2 PC addr rd1 we rdata IR ws addr wd rd2 ALU GPRs rdata Inst. Data Memory Imm Memory Ext wdata

write fetch decode & Reg-fetch execute memory -back phase phase phase phase phase Clock period can be reduced by dividing the execution of an instruction into multiple cycles

tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably) However, CPI will increase unless instructions are pipelined

September 26, 2005 6.823 L5- 30 Arvind An Ideal Pipeline

stage stage stage stage 1 2 3 4

• All objects go through the same stages

• No sharing of resources between any two stages

• Propagation delay through all pipeline stages is equal

• The scheduling of an object entering the pipeline is not affected by the objects in other stages These conditions generally hold for industrial assembly lines. But can an instruction pipeline satisfy the last condition? September 26, 2005 6.823 L5- 31 How to divide the datapath Arvind into stages

Suppose memory is significantly slower than other stages. In particular, suppose

tIM = 10 units tDM = 10 units tALU = 5 units tRF = 1 unit tRW = 1 unit

Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance

September 26, 2005 6.823 L5- 32 Arvind Alternative Pipelining

0x4 Add we rs1 rs2 PC addr rd1 we rdata IR ws addr wd rd2 ALU GPRs rdata Inst. Data Memory Imm Memory Ext wdata

write fetch decode & Reg-fetch execute memory -back phase phase phase phase phase

ttCC >> maxmax {t{tIMIM,, ttRFRF,+t tALUALU,, ttDMDM,+t, ttRWRWRW}}} = = ttDMDM+ tRW ⇒ increase the critical path by 10% Write-back stage takes much less time than other stages. Suppose we combined it with the memory phase

September 26, 2005 6.823 L5- 33 Arvind Maximum Speedup by Pipelining

Assumptions Unpipelined Pipelined Speedup

tC tC 1. tIM = tDM = 10, tALU = 5, tRF = tRW= 1 4-stage pipeline 27 10 2.7

2. tIM =tDM = tALU = tRF = tRW = 5 4-stage pipeline 25 10 2.5

3. tIM =tDM = tALU = tRF = tRW = 5 5-stage pipeline 25 5 5.0

It is possible to achieve higher speedup with more stages in the pipeline.

September 26, 2005 34

Thank you !