CS/ECE 3330 Computer Architecture

CS/ECE 3330 Computer Architecture Chapter 4 The Processor Introduction n CPU performance factors § Instruction count – Determined by ISA and compiler § CPI and Cycle time – Determined by CPU hardware n We will examine two MIPS implementations § A simplified version § A more realistic pipelined version n Simple subset, shows most aspects § Memory reference: lw, sw § Arithmetic/logical: add, sub, and, or, slt § Control transfer: beq, j 1 COMP 140 – Summer 2014 Instruction Execution n PC → instruction memory, fetch instruction n Register numbers → register file, read registers n Depending on instruction class § Use ALU to calculate – Arithmetic result – Memory address for load/store – Branch target address § Access data memory for load/store § PC ← target address or PC + 4 2 COMP 140 – Summer 2014 CPU Overview 3 COMP 140 – Summer 2014 Multiplexers n Can’t just join wires together § Use multiplexers 4 COMP 140 – Summer 2014 Control 5 COMP 140 – Summer 2014 Logic Design Basics n Information encoded in binary § Low voltage = 0, High voltage = 1 § One wire per bit § Multi-bit data encoded on multi-wire buses n Combinational element § Operate on data § Output is a function of input n State (sequential) elements § Store information 6 COMP 140 – Summer 2014 Combinational Elements AND-gate Adder Y = A & B Y = A + B A A Y + Y B B Multiplexer Arithmetic/Logic Unit Y = S ? I1 : I0 Y = F(A, B) A I0 M u Y ALU Y I1 x B S F 7 COMP 140 – Summer 2014 Sequential Elements n Register: stores data in a circuit § Uses a clock signal to determine when to update the stored value § Edge-triggered: update when Clk changes from 0 to 1 Clk D Q D Clk Q 8 COMP 140 – Summer 2014 Sequential Elements n Register with write control § Only updates on clock edge when write control input is 1 § Used when stored value is required later Clk D Q Write Write Clk D Q 9 COMP 140 – Summer 2014 Clocking Methodology n Combinational logic transforms data during clock cycles § Between clock edges § Input from state elements, output to state element § Longest delay determines clock period 10 COMP 140 – Summer 2014 Building a Datapath n Datapath § Elements that process data and addresses in the CPU – Registers, ALUs, mux’s, memories, … n We will build a MIPS datapath incrementally § Refining the overview design 11 COMP 140 – Summer 2014 Instruction Fetch Increment by 4 for next 32-bit instruction register 12 COMP 140 – Summer 2014 Recall: MIPS R-format Instructions op rs rt rd shamt funct 6 bits 5 bits 5 bits 5 bits 5 bits 6 bits n Example: add $t0, $s1, $s2 n Instruction fields § op: operation code (opcode) § rs: first source register number § rt: second source register number § rd: destination register number § shamt: shift amount (00000 for now) § funct: function code (extends opcode) 13 COMP 140 – Summer 2014 R-Format Instructions n Read two register operands n Perform arithmetic/logical operation n Write register result 14 COMP 140 – Summer 2014 Recall: MIPS I-format Instructions op rs rt constant or address 6 bits 5 bits 5 bits 16 bits n Examples: § lw $t0, 32($s3) § beq rs, rt, L1 n Immediate arithmetic and load/store instructions § rt: destination or source register number § Constant: –215 to +215 – 1 § Address: offset added to base address in rs 15 COMP 140 – Summer 2014 Load/Store Instructions n Read register operands n Calculate address using 16-bit offset § Use ALU, but sign-extend offset n Load: Read memory and update register n Store: Write register value to memory 16 COMP 140 – Summer 2014 Branch Instructions n Read register operands n Compare operands § Use ALU, subtract and check Zero output n Calculate target address § Sign-extend displacement § Shift left 2 places (word displacement) § Add to PC + 4 – Already calculated by instruction fetch 17 COMP 140 – Summer 2014 Branch Instructions Just re-routes wires Sign-bit wire replicated 18 COMP 140 – Summer 2014 Composing the Elements n First-cut data path does an instruction in one clock cycle § Each datapath element can only do one function at a time § Hence, we need separate instruction and data memories n Use multiplexers where alternate data sources are used for different instructions 19 COMP 140 – Summer 2014 R-Type/Load/Store Datapath 20 COMP 140 – Summer 2014 Full Datapath 21 COMP 140 – Summer 2014 ALU Control n ALU used for § Load/Store: F = add § Branch: F = subtract § R-type: F depends on funct field ALU control Function 0000 AND 0001 OR 0010 add 0110 subtract 0111 set-on-less-than 1100 NOR 22 COMP 140 – Summer 2014 ALU Control n Assume 2-bit ALUOp derived from opcode § Combinational logic derives ALU control opcode ALUOp Operation funct ALU function ALU control lw 00 load word XXXXXX add 0010 sw 00 store word XXXXXX add 0010 beq 01 branch equal XXXXXX subtract 0110 R-type 10 add 100000 add 0010 subtract 100010 subtract 0110 AND 100100 AND 0000 OR 100101 OR 0001 set-on-less-than 101010 set-on-less-than 0111 23 COMP 140 – Summer 2014 The Main Control Unit n Control signals derived from instruction R-type 0 rs rt rd shamt funct 31:26 25:21 20:16 15:11 10:6 5:0 Load/ 35 or 43 rs rt address Store 31:26 25:21 20:16 15:0 Branch 4 rs rt address 31:26 25:21 20:16 15:0 opcode always read, write for sign-extend read except R-type and add for load and load 24 COMP 140 – Summer 2014 Datapath With Control 25 COMP 140 – Summer 2014 op rs rt rd sham funct R-Type Instruction 6 bits 5 bits 5 bits 5 bits 5 bits 6 bits rd rs rt add $t1,$t2,$t3 26 COMP 140 – Summer 2014 op rs rt constant or address Load Instruction 6 bits 5 bits 5 bits 16 bits rt rs lw $t1,offset($t2) 27 COMP 140 – Summer 2014 op rs rt constant or address Branch-on-Equal 6 bits 5 bits 5 bits 16 bits rs rt beq $t1,$t2,offset 28 COMP 140 – Summer 2014 Implementing Jumps Jump 2 address 31:26 25:0 n Jump uses word address n Update PC with concatenation of § Top 4 bits of old PC § 26-bit jump address § 00 n Need an extra control signal decoded from opcode 29 COMP 140 – Summer 2014 Datapath With Jumps Added 30 COMP 140 – Summer 2014 Performance Issues n Longest delay determines clock period § Critical path: load instruction § Instruction memory → register file → ALU → data memory → register file n Not feasible to vary period for different instructions n Violates design principle § Making the common case fast n We will improve performance by pipelining 31 COMP 140 – Summer 2014 Pipelining Analogy n Pipelined laundry: overlapping execution § Parallelism improves performance n Four loads: n Speedup = 8hrs/3.5hrs = 2.3 n Non-stop: (n=# loads) n Speedup = 2hrs*n/(0.5hrs*n + 1.5hrs) ≈ 4 (# of stages) 32 COMP 140 – Summer 2014 MIPS Pipeline n Five stages, one step per stage § IF: Instruction fetch from memory § ID: Instruction decode & register read § EX: Execute operation or calculate address § MEM: Access memory operand § WB: Write result back to register Instr 1 IF ID EX MEM WB Instr 2 IF ID EX MEM WB Instr 3 IF ID EX MEM WB Time à 33 COMP 140 – Summer 2014 Pipeline Performance n Assume time for stages is § 100ps for register read or write § 200ps for other stages n Compare pipelined datapath with single-cycle datapath Instr Instr fetch Register ALU op Memory Register Total time read access write lw 200ps 100 ps 200ps 200ps 100 ps 800ps sw 200ps 100 ps 200ps 200ps 700ps R-format 200ps 100 ps 200ps 100 ps 600ps beq 200ps 100 ps 200ps 500ps 34 COMP 140 – Summer 2014 Pipeline Performance Single-cycle (Tc= 800ps) Pipelined (Tc= 200ps) 35 COMP 140 – Summer 2014 Pipeline Speedup n If all stages are balanced § i.e., all take the same time § Time between instructionspipelined = Time between instructionsnonpipelined Number of stages n If not balanced, speedup is lower n Speedup due to increased throughput § Latency (time for each instruction) does not decrease 36 COMP 140 – Summer 2014 Pipelining and ISA Design n MIPS ISA designed for pipelining § All instructions are 32-bits – Easier to fetch and decode in one cycle – c.f. x86: 1- to 17-byte instructions § Few and regular instruction formats – Can decode and read registers in one step § Load/store addressing – Can calculate address in 3rd stage, access memory in 4th stage § Alignment of memory operands – Memory access takes only one cycle 37 COMP 140 – Summer 2014 Hazards n Situations that prevent starting the next instruction in the next cycle n Structure hazards § A required resource is busy n Data hazard § Need to wait for previous instruction to complete its data read/write n Control hazard § Deciding on control action depends on previous instruction 38 COMP 140 – Summer 2014 Structure Hazards n Conflict for use of a resource n In MIPS pipeline with a single memory § Load/store requires data access § Instruction fetch would have to stall for that cycle – Would cause a pipeline “bubble” n Hence, pipelined datapaths require separate instruction/data memories § Or separate instruction/data caches 39 COMP 140 – Summer 2014 Data Hazards n An instruction depends on completion of data access by a previous instruction § add $s0, $t0, $t1 sub $t2, $s0, $t3 add $s0, $t0, $t1 IF ID EX MEM WB sub $t2, $s0, $t3 IF ID ID ID EX MEM WB 40 COMP 140 – Summer 2014 Data Hazards n Think about which pipeline stage generates or uses a value add $s0, $t0, $t1 IF ID EX MEM WB sub $t2, $s0, $t3 IF ID EX MEM WB lw $s0, 20($t1) IF ID EX MEM WB sw $t3, 12($t0) IF ID EX MEM WB 41 COMP 140 – Summer 2014 Forwarding (aka Bypassing) n Use result when it is computed § Don’t wait for it to be stored in a register § Requires extra connections in the datapath 42 COMP 140 – Summer 2014 Load-Use Data Hazard n Can’t always avoid stalls by forwarding § If

CS/ECE 3330 Computer Architecture

IEEE Paper Template in A4

Computer Science 246 Computer Architecture Spring 2010 Harvard University

Dynamic Scheduling

United States Patent (19) 11 Patent Number: 5,680,565 Glew Et Al

Identifying Bottlenecks in a Multithreaded Superscalar

Advanced Processor Designs

Computer Architecture Out-Of-Order Execution

CSE 586 Computer Architecture Lecture 4

Superscalar Instruction Dispatch with Exception Handling

MIPS Architecture with Tomasulo Algorithm [12]

In-Line Interrupt Handling and Lock-Up Free Translation Lookaside Buffers (Tlbs)

Improving the Precise Interrupt Mechanism of Software- Managed TLB Miss Handlers