
Instruction Execution & Instruction Pipelining

CS160 Ward 1 CS160 Ward 2

Instruction Execution

• Simple fetch-decode-execute cycle:
  1. Get address of next instruction from PC
  2. Fetch next instruction into IR
  3. Change PC
  4. Determine instruction type (add, shift, …)
  5. If instruction has operand in memory, fetch it into a register
  6. Execute instruction, storing result
  7. Go to step 1

Indirect Cycle

• Instruction execution may require memory access to fetch operands
• Memory fetch may use indirect addressing
  – Actual address to be fetched is in memory
  – Requires more memory accesses
  – Can be thought of as an additional instruction subcycle
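The fetch-decode-execute cycle above can be sketched as a toy interpreter. The one-address instruction format and the opcodes (LOAD, ADD, STORE, HALT) are invented for this sketch, not part of any real ISA:

```python
# Toy machine: memory holds both instructions and data.
memory = {0: ("LOAD", 100), 1: ("ADD", 101), 2: ("STORE", 102), 3: ("HALT", None),
          100: 5, 101: 7, 102: 0}

pc = 0          # program counter: address of next instruction
acc = 0         # accumulator register

while True:
    ir = memory[pc]          # steps 1-2: use PC to fetch next instruction into IR
    pc += 1                  # step 3: change PC to point past the fetched instruction
    op, addr = ir            # step 4: determine instruction type
    if op == "HALT":
        break
    operand = memory[addr]   # step 5: operand is in memory, fetch it into a register
    if op == "LOAD":         # step 6: execute instruction, storing the result
        acc = operand
    elif op == "ADD":
        acc += operand
    elif op == "STORE":
        memory[addr] = acc
                             # step 7: loop back to step 1

print(acc, memory[102])
```

Running this loads 5, adds 7, and stores 12 back to location 102.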

Basic Instruction Cycle States

Data Flow (Instruction Fetch)

• Depends on CPU design; below is typical
• Fetch (Step 2):
  – PC contains address of next instruction
  – Address moved to MAR
  – Address placed on address bus; control unit requests memory read
  – Result placed on data bus, copied to MBR, then to IR
  – Meanwhile PC incremented by 1 instruction word
    • Typically byte addressable, so PC incremented by 4 for a 32-bit ISA


Data Flow (Instruction Fetch)

[Diagram: 3-bus architecture; buses combined in a 1-bus architecture]

Data Flow (Data Fetch)

• IR is examined
• If indirect addressing, an indirect cycle is performed:
  – Rightmost bits of MBR transferred to MAR
  – Control unit requests memory read
  – Result (address of operand) moved to MBR

Data Flow (Indirect)

[Diagram of the indirect cycle]

Data Flow (Execute)

• May take many forms
• Depends on instruction being executed
• May include:
  – Memory read/write
  – Input/output
  – Register transfers
  – ALU operations


Data Path [1]

[Diagram of the data path]

Data Path [2]

• Data path cycle: two operands go from registers through the ALU, and the result is stored (the faster the better)
• Data path cycle time: the time to accomplish the above
• Width of the data path is a major determining factor in performance

RISC and CISC Based Processors

• RISC (Reduced Instruction Set Computer) vs. CISC (Complex Instruction Set Computer)
  – "War" started in the late 1970's
  – RISC:
    • Simple interpret step, so each instruction is faster
    • Issues instructions more quickly (faster)
  – Basic concept: if RISC needs 5 instructions to do what 1 CISC instruction does, but 1 CISC instruction takes 10 times longer than 1 RISC instruction, then RISC is faster

RISC vs. CISC Debate

• CISC & RISC both improved during the 1980's
  – CISC driven primarily by technology
  – RISC driven by technology & architecture
• RISC has basically won
  – Improved by over 50% / year
  – Took advantage of new technology
  – Incorporated new architectural features (e.g., Instruction-Level Parallelism (ILP), superscalar concepts)
• Compilers have become a critical component in performance
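The "basic concept" arithmetic works out as follows, using the slide's own numbers (5 RISC instructions vs. 1 CISC instruction that takes 10 times as long):

```python
# Numbers taken from the slide's example; time units are arbitrary.
risc_count, cisc_count = 5, 1       # instructions needed for the same task
risc_time = 1                        # time per RISC instruction
cisc_time = 10 * risc_time           # 1 CISC instruction takes 10x longer

risc_total = risc_count * risc_time  # 5 time units
cisc_total = cisc_count * cisc_time  # 10 time units
print(risc_total, cisc_total)        # RISC finishes the task in half the time
```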


More on RISC & CISC

• RISC-CISC debate largely irrelevant now
  – Concepts are similar
  – Modern RISC has a "complicated" ISA
  – CISC leader (Intel family) now has a large RISC component
• Heavy reliance on compiler technology

Prefetching

Prefetch

• Fetch accesses main memory
• Execution usually does not access main memory
• Can fetch next instruction during execution of current instruction (will discuss later)
• Called instruction prefetch

Improved Performance

• But not doubled:
  – Fetch is usually shorter than complete execution of one instruction
  – Any jump or branch means that prefetched instructions are not the required instructions
• Prefetch more than one instruction?
  – If fetch takes longer than execute, then prefetching more than one saves the day!
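The improvement (and why it is less than 2x) can be illustrated with a quick overlap calculation. The stage times below are assumed for illustration, not taken from the slides:

```python
# Assumed illustrative times (nanoseconds): fetch is shorter than execute.
fetch, execute = 4, 6
n = 10  # number of instructions, no branches

# Strictly sequential: every instruction pays fetch + execute.
no_prefetch = n * (fetch + execute)

# With prefetch, the fetch of instruction i+1 overlaps the execute of
# instruction i; since fetch < execute, every fetch after the first is
# completely hidden.
with_prefetch = fetch + n * execute

print(no_prefetch, with_prefetch)  # improved, but not halved
```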


Prefetching, Waiting & Branching

Pipelining

Pipelining

• Add more stages to improve performance
• For example, 6 stages:
  – Fetch instruction (FI) (memory access)
  – Decode instruction (DI)
  – Calculate operands (CO)
  – Fetch operands (FO) (possible memory access)
  – Execute instruction (EI)
  – Write operand (result) (WO) (possible memory access)
• Overlap these stages

Timing of Pipeline

• Again, cache saves the day!


Important Points

• An instruction pipeline passes instructions through a series of stages, with each stage performing some part of the instruction.
• If the speeds of two processors, one with a pipeline and one without, are the same, the pipelined architecture will not improve the overall time required to execute one instruction.
• If the speeds of two processors, one with a pipeline and one without, are the same, the pipelined architecture has a higher throughput (number of instructions processed per second).

Execution Time

• Assume that a pipelined instruction has 4 stages, and the maximum times required in the stages are 10, 11, 10 and 12 nanoseconds, respectively.
• How long will the processor take to perform each stage? 12 nanoseconds
• How long does it take for one instruction to complete execution? 48 nanoseconds
• How long will it take for ten instructions to complete execution? (48 + 9*12) = 156 nanoseconds, i.e., 15.6 nanoseconds/instruction
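The worked example above follows directly from the stage times; a short calculation reproduces each answer:

```python
# Stage times from the slide's example (nanoseconds).
stage_times = [10, 11, 10, 12]

clock = max(stage_times)                        # every stage gets the longest time: 12 ns
one_instruction = clock * len(stage_times)      # 4 stages x 12 ns = 48 ns
n = 10
# First instruction takes the full 48 ns; each later one completes
# one clock period after its predecessor.
ten_instructions = one_instruction + (n - 1) * clock   # 48 + 9*12 = 156 ns
per_instruction = ten_instructions / n                 # throughput: 15.6 ns/instruction

print(clock, one_instruction, ten_instructions, per_instruction)
```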

4-Stage Pipeline

• Another example (with interstage buffers):
  – Fetch instruction (F) (memory access)
  – Decode instruction and fetch operands (D)
  – Execute instruction (E) (possible memory access)
  – Write operand (result) (W)

Some Potential Causes of Pipeline "Stalls"

• What might cause the pipeline to "stall"?
  – One stage takes too long, for example:
    • Different execution times (data hazard)
    • Cache miss (control hazard)
    • Memory unit busy (structural hazard)
  – Branch to a new location (control hazard)
  – Call a subroutine (control hazard)


Long Stages – Execution [1]

• For a variety of reasons, some instructions take longer to execute than others (e.g., add vs. divide) ⇒ stall

[Pipeline chart, clock cycles 1–9: instructions I1–I5 proceed through F, D, E, W; a long execute stage stalls a following instruction 2 cycles (data hazard example)]

Long Stages – Execution [2]

• Solution:
  – Sub-divide the long execute stage into multiple stages
  – Provide multiple execution hardware so that multiple instructions can be in the execute stage at the same time
  – Produces one instruction completion every cycle after the pipeline is full

Long Stages – Cache Miss

• Delay in getting the next instruction due to a cache miss (will discuss later)

[Diagram: control hazard example; while the fetch waits, the Decode, Execute & Write stages become idle (stall)]

Long Stages – Memory Access Conflict

• Two different instructions require access to memory at the same time

  Load X(R1),R2  (place contents of memory location X+R1 into R2)

[Diagram: structural hazard example; I2 is writing to R2 when I3 wants to write to a register, causing a stall]


Conditional Branch in a Pipeline

[Diagram: instruction 3 is a conditional branch to instruction 15; a control hazard]

Six-Stage Instruction Pipeline

[Flowchart including unconditional-branch logic]

• Note important architectural point: when is the PC changed?

Data Dependencies in a Pipeline

  r1 = r2 + r3
  r4 = r1 - r3

• Pipeline stalls until r1 has been written (RAW hazard)

Alternative Pipeline Depiction

  r1 = r2 + r3
  r4 = r1 - r3


Example of Avoiding Stalls

• Stalls eliminated by rearranging the code in (a) to (b)

RAW Hazard: Forwarding

• Hardware optimization to avoid stall
• Allows ALU to reference result in next instruction
• Example:
    Instruction K:   C ← add A B
    Instruction K+1: D ← subtract E C
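The cycle savings from forwarding can be sketched with a simplified model of the 4-stage F/D/E/W pipeline, using the earlier RAW example (r1 = r2 + r3; r4 = r1 - r3). The model tracks only register availability and assumes one fetch per cycle:

```python
# Simplified cycle count for an in-order 4-stage pipeline (F, D, E, W).
# Without forwarding, a consumer's D stage waits until the cycle after
# the producer's W; with forwarding, the consumer's E stage can use the
# producer's ALU output the cycle after the producer's E.
def run(instrs, forwarding):
    avail = {}      # register -> first cycle its value can be consumed
    fetch = 1       # cycle in which the current instruction enters F
    last_w = 0
    for dest, srcs in instrs:
        if forwarding:
            e = fetch + 2                       # F, D, then E
            for s in srcs:
                e = max(e, avail.get(s, 0))     # E waits for forwarded values
        else:
            d = fetch + 1
            for s in srcs:
                d = max(d, avail.get(s, 0))     # D waits for written-back values
            e = d + 1
        w = e + 1
        avail[dest] = (e + 1) if forwarding else (w + 1)
        last_w = w
        fetch += 1
    return last_w   # cycle in which the last instruction completes W

prog = [("r1", ("r2", "r3")),   # r1 = r2 + r3
        ("r4", ("r1", "r3"))]   # r4 = r1 - r3  (RAW on r1)
print(run(prog, forwarding=False), run(prog, forwarding=True))
```

In this model the stalling design finishes in cycle 7 (two bubble cycles), while forwarding removes the bubbles entirely and finishes in cycle 5.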

CS160 Ward 35 CS160 Ward 36 Unconditional Branches: Instruction Conditional Branches: Delayed Branch Queue & Prefetching

With more stages, need to reorder more instructions.


Conditional Branches: Dynamic Prediction

Pipelining Speedup Factors [1]

[Graph: speedup without branches]

• Never get speedup better than the number of stages in a pipeline
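The stage-count limit follows from the ideal pipeline time formula. The formula below is the standard idealization S = (n·k)/(k + n − 1) for n instructions in a k-stage pipeline (no branches or stalls), not something stated on the slides:

```python
# Ideal speedup of a k-stage pipeline over an unpipelined machine,
# for n instructions: unpipelined time n*k clocks vs. pipelined k + n - 1.
def speedup(n, k):
    return (n * k) / (k + n - 1)

k = 6  # stages, matching the 6-stage example above
# As n grows, speedup approaches k but never reaches it.
print(speedup(10, k), speedup(1000, k), speedup(10**6, k))
```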

Pipelining Speedup Factors [2]

• Never get speedup better than the number of instructions without a branch

Classic 5-Stage RISC Pipeline

• Instruction Fetch (IF) (memory access)
• Instruction Decode & Register Fetch (ID)
• Execute / Effective Address (load-store) (EX)
• Memory Access (MEM) (possible memory access)
• Write-back (to register) (WB)


Another 5-Stage Pipeline

• Instruction Fetch (IF) (memory access)
• Instruction Decode (ID)
• Operand Fetch (OF) (possible memory access)
• Execute Instruction (EX)
• Write-back (WB) (possible memory access)

Achieving Maximum Speed

• Program must be written to accommodate the instruction pipeline
• To minimize stalls, for example:
  – Avoid introducing unnecessary branches
  – Delay references to result register(s)

More About Pipelines

• Although hardware that uses an instruction pipeline will not run at full speed unless programs are written to accommodate the pipeline, a programmer can choose to ignore pipelining and assume the hardware will automatically increase speed whenever possible.
• Modern compilers are typically able to move instructions around to optimize pipeline execution for RISC architectures.

Advanced Pipeline Topics

• Structural hazards (e.g., resource conflicts)
• Data hazards (e.g., RAW, WAR, WAW)
• Control hazards (e.g., branch prediction, cache miss)
• Dynamic scheduling
• Out-of-order issue
• Out-of-order execution
• . . .


Parallelizations

Superscalar Architecture [1]

• Superscalar architecture (parallelism inside a single processor: Instruction-Level Parallelism) example
  – 2 pipelines with 2 ALUs
  – 2 instructions must not conflict over a resource (e.g., register, memory access unit) and must not depend on the other's result
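The dual-issue condition can be sketched as a register-level check. This is a simplification: it tests only register conflicts and the result dependence named on the slide, ignoring memory-unit conflicts; the instruction encoding is invented for the sketch:

```python
# Two instructions, each as (destination register, source registers).
# They may issue in the same cycle only if they don't write the same
# register and the second doesn't consume the first's result.
def can_dual_issue(i1, i2):
    d1, _ = i1
    d2, s2 = i2
    return d1 != d2 and d1 not in s2

add = ("r1", ("r2", "r3"))   # r1 = r2 + r3
sub = ("r4", ("r1", "r3"))   # r4 = r1 - r3   (depends on r1)
mul = ("r5", ("r6", "r7"))   # r5 = r6 * r7   (independent)

print(can_dual_issue(add, sub), can_dual_issue(add, mul))
```

The dependent pair must issue in separate cycles; the independent pair can share a cycle across the two pipelines.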

Superscalar Architecture [2]

• Superscalar architecture (parallelism inside a single processor: Instruction-Level Parallelism) example
  – Multiple functional units (i.e., ALUs)

Processor-Level Parallelism

• Array processor: single control unit, large array of identical processors that perform the same instructions on different data
• Vector processor: heavily pipelined ALU efficient at executing instructions on data pairs (like matrices), plus vector registers (sets of registers that can be loaded simultaneously)
• Multiprocessors: more than one independent CPU sharing the same memory
• Multicomputers: large number of interconnected computers that don't share memory but pass data on buses to share it
