Course Outline CS 211: Architecture

• Introduction: Trends, Performance models
• Review of computer organization and ISA implementation
• Overview of Pipelining
• ILP Processors: Superscalar Processors – Next! ILP Intro and Superscalar
• ILP: EPIC/VLIW Processors
• Optimization techniques for ILP processors – getting max performance out of an ILP design
• Part 2: Other components – memory, I/O

Introduction to ILP Processors & Concepts


Introduction to Instruction-Level Parallelism (ILP)
• What is ILP?
– Processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel
• ILP is transparent to the user
– Multiple operations are executed in parallel even though the system is handed a single program written with a sequential processor in mind
• Same execution hardware as a normal RISC machine
– May be more than one of any given type of hardware

Architectures for Instruction-Level Parallelism
Scalar (baseline): Instruction Parallelism = D; Operation Latency = 1; Peak IPC = 1
[Figure: baseline scalar pipeline - successive instructions pass through IF, DE, EX, WB one per cycle; time measured in cycles of the baseline machine.]


Superpipelined Machine
Superpipelined Execution: IP = D x M; OL = M minor cycles; Peak IPC = 1 per minor cycle (M per baseline cycle)
[Figure: superpipelined pipeline diagram - the major cycle is divided into M minor cycles, and successive instructions enter IF, DE, EX, WB one minor cycle apart.]

Superscalar Machines
Superscalar (Pipelined) Execution: IP = D x N; OL = 1 baseline cycle; Peak IPC = N per baseline cycle
[Figure: superscalar pipeline diagram - N instructions enter IF, DE, EX, WB together in each baseline cycle.]


Superscalar and Superpipelined
Superscalar parallelism: Operation Latency = 1; Issuing Rate = N; Superscalar Degree (SSD) = N (determined by the issue rate)
Superpipeline parallelism: Operation Latency = M; Issuing Rate = 1; Superpipelined Degree (SPD) = M (determined by the operation latency)
[Figure: superscalar vs. superpipelined execution - IFetch, Dcode, Execute, Writeback stages plotted against time in cycles of the base machine.]
Superscalar and superpipelined machines of equal degree have roughly the same performance, i.e. if N = M then both have about the same IPC.

Limitations of Inorder Pipelines
• CPI of inorder pipelines degrades very sharply if the machine parallelism is increased beyond a certain point, i.e. when N x M approaches the average distance between dependent instructions (a rough numeric illustration follows below)
• Forwarding is no longer effective ⇒ must stall more often
– Pipeline may never be full due to frequent dependency stalls!!
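A rough numeric illustration of the N x M point above (the numbers are assumed for illustration, not taken from the slides): a machine that issues N = 4 instructions per cycle with an operation latency of M = 2 base cycles can have N x M = 8 instructions in flight; if dependent instructions are on average only about 4 instructions apart in the program, an instruction's producer is usually still in flight when the instruction wants to issue, so forwarding cannot hide the latency and the pipeline stalls frequently.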

What is Parallelism, and Where?

x = a + b;
y = b * 2;
z = (x - y) * (x + y);

[Figure: dataflow graph of the code above - a and b feed the add (x) and the multiply-by-2 (y); x and y feed the subtract and the add, whose results feed the final multiply (z).]

What is Parallelism?
• Work
– T1: time to complete the computation on a sequential system
• Critical Path
– T∞: time to complete the same computation on an infinitely parallel system
• Average Parallelism
– Pavg = T1 / T∞ (a worked example follows below)
• For a p-wide system
– Tp ≥ max{ T1/p, T∞ }
– Pavg >> p ⇒ Tp ≈ T1/p
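Worked example (my own count, using the snippet above): the computation contains T1 = 5 operations (a+b, b*2, x-y, x+y, and the final multiply). The longest dependence chain is a+b, then x-y (or x+y), then the multiply, so T∞ = 3. Hence Pavg = T1/T∞ = 5/3 ≈ 1.7, and on a p = 2-wide machine Tp ≥ max{ 5/2, 3 } = 3.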


Example Execution

Functional Unit                Operations Performed         Latency
Integer Unit 1                 Integer ALU operations       1
                               Integer multiplication       2
                               Loads                        2
                               Stores                       1
Integer Unit 2 / Branch Unit   Integer ALU operations       1
                               Integer multiplication       2
                               Loads                        2
                               Stores                       1
                               Test-and-branch              1
Floating-point Unit 1          Floating-point operations    3
Floating-point Unit 2          Floating-point operations    3

[Figure: the same code fragment scheduled on the machine above, once as a sequential execution and once as an ILP execution.]


ILP: Instruction-Level Parallelism
• ILP is a measure of the amount of inter-dependencies between instructions
• Average ILP = no. of instructions / no. of cycles required
– code1: ILP = 1, i.e. must execute serially
– code2: ILP = 3, i.e. can execute at the same time

code1: r1 ← r2 + 1        code2: r1 ← r2 + 1
       r3 ← r1 / 17               r3 ← r9 / 17
       r4 ← r0 - r3               r4 ← r0 - r10

Inter-instruction Dependences
• Data dependence (Read-after-Write, RAW)
  r3 ← r1 op r2
  r5 ← r3 op r4
• Anti-dependence (Write-after-Read, WAR)
  r3 ← r1 op r2
  r1 ← r4 op r5
• Output dependence (Write-after-Write, WAW)
  r3 ← r1 op r2
  r5 ← r3 op r4
  r3 ← r6 op r7
• Control dependence
(A short C fragment illustrating these dependence kinds follows below.)
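The same three dependence kinds can be seen in a short C fragment. This is a sketch of mine, not from the slides; the variable names are arbitrary:

    int deps(int a, int b, int c, int d) {
        int sum = a + b;   /* I1                                          */
        int t = sum * 2;   /* I2: RAW (true) dependence on sum, from I1   */
        a = c + 1;         /* I3: WAR (anti) dependence on a, with I1     */
        a = d - 3;         /* I4: WAW (output) dependence on a, with I3   */
        return t + a;
    }

Renaming the destinations of I3 and I4 to fresh variables would remove the WAR and WAW dependences; only the true (RAW) dependence remains to constrain the execution order.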


Scope of ILP Analysis

r1 ⇐ r2 + 1
r3 ⇐ r1 / 17
r4 ⇐ r0 - r3          ILP = 1 (this chain must execute serially)

r11 ⇐ r12 + 1
r13 ⇐ r19 / 17
r14 ⇐ r0 - r20        ILP = 2 (the two independent chains can be interleaved)

Out-of-order execution permits more ILP to be exploited

Questions Facing ILP System Designers
• What gives rise to instruction-level parallelism in conventional, sequential programs?
• How is the potential parallelism identified and enhanced, and how much is there?
• What must be done in order to exploit the parallelism that has been identified?
• How should the work of identifying, enhancing and exploiting the parallelism be divided between the hardware and the software (the compiler)?
• What are the alternatives in selecting the architecture of an ILP processor?


Sequential Processor
[Figure: sequential instructions feed a conventional processor containing a single execution unit.]

ILP Processors: Superscalar
[Figure: sequential instructions feed a superscalar processor whose scheduling logic drives multiple execution units - instruction scheduling and parallelism extraction are done by hardware.]


ILP Processors: EPIC/VLIW
[Figure: a serial program (C code) is translated by the compiler into scheduled instructions, which feed an EPIC processor - scheduling and parallelism extraction are done by the compiler.]

ILP Architectures
• Between the compiler and the run-time hardware, the following functions must be performed:
– Dependencies between operations must be determined
– Operations that are independent of any operation that has not yet completed must be determined
– Independent operations must be scheduled to execute at some particular time, on some specific functional unit, and must be assigned a register into which the result may be deposited


ILP Architecture Classifications
• Sequential Architectures
– The program is not expected to convey any explicit information regarding parallelism
• Dependence Architectures
– The program explicitly indicates the dependencies between operations
• Independence Architectures
– The program provides information as to which operations are independent of one another

Sequential Architecture
• Program contains no explicit information regarding dependencies that exist between instructions
• Dependencies between instructions must be determined by the hardware
– It is only necessary to determine dependencies with sequentially preceding instructions that have been issued but not yet completed
• Compiler may re-order instructions to facilitate the hardware's task of extracting parallelism


Sequential Architecture Example
• Superscalar processor is a representative ILP implementation of a sequential architecture
– For every instruction issued by a superscalar processor, the hardware must check whether its operands interfere with the operands of any other instruction that is either
» (1) already in execution,
» (2) issued but waiting for completion of interfering instructions that would have been executed earlier in a sequential program, or
» (3) being issued concurrently but would have been executed earlier in the sequential execution of the program
• Superscalar processors attempt to issue multiple instructions per cycle
– However, essential dependencies are specified by the sequential ordering, so operations must be processed in sequential order
– This proves to be a performance bottleneck that is very expensive to overcome
• The alternative to issuing multiple instructions per cycle is deeper pipelining, i.e. issuing instructions faster


Dependence Architecture
• Compiler or programmer communicates to the hardware the dependencies between instructions
– Removes the need to scan the program in sequential order (the bottleneck for superscalar processors)
• Hardware determines at run-time when to schedule each instruction

Dependence Architecture Example
• Dataflow processors are representative of dependence architectures
– Execute an instruction at the earliest possible time subject to the availability of input operands and functional units
– Dependencies are communicated by providing with each instruction a list of all successor instructions
– As soon as all input operands of an instruction are available, the hardware fetches the instruction
– The instruction is executed as soon as a functional unit is available
• Few dataflow processors currently exist


Independence Architecture
• By knowing which operations are independent, the hardware needs no further checking to determine which instructions can be issued in the same cycle
• The set of independent operations is far greater than the set of dependent operations
– Only a subset of the independent operations is specified
• The compiler may additionally specify on which functional unit and in which cycle an operation is executed

Independence Architecture Example
• EPIC/VLIW processors are examples of independence architectures
– Specify exactly which functional unit each operation is executed on and when each operation is issued
– Operations are independent of other operations issued at the same time as well as those that are in execution
– The compiler emulates at compile time what a dataflow processor does at run-time
– The hardware needs to make no run-time decisions


Compiler vs. Processor
[Figure: division of work between compiler and hardware. The steps are Frontend and Optimizer → Determine Dependences → Determine Independences → Bind Operations to Function Units → Bind Transports to Busses → Execute. A superscalar does everything after the frontend in hardware; a dataflow machine determines dependences in the compiler; an independence architecture also determines independences in the compiler; a VLIW additionally binds operations to function units in the compiler; a TTA also binds transports to busses in the compiler.]

VLIW and Superscalar
• The basic structure of both VLIW and superscalar processors consists of a number of execution units (EUs), each capable of parallel operation on data fetched from a register file
• VLIW and superscalar require highly multiported register files
– The limit on register ports places an inherent limitation on the maximum number of EUs

B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May 1993.


VLIW & Superscalar - Differences
• Presentation of instructions:
– VLIW processors receive multi-operation instructions
– Superscalar processors accept a traditional sequential stream but can issue more than one instruction per cycle
• VLIW needs very long instructions in order to specify what each EU should do
• Superscalar receives a stream of conventional instructions
• The decode and issue unit in a superscalar issues multiple instructions for the EUs
– It has to figure out dependencies and find independent instructions
• VLIW expects dependency-free code, whereas superscalar typically does not
– Superscalars cope with dependencies using hardware


Instruction Scheduling
• Dependencies must be detected and resolved
• Static scheduling: accomplished by the compiler, which avoids dependencies by rearranging code
• Dynamic scheduling: detection and resolution performed by hardware
– The processor typically maintains an issue window (prefetched instructions) and an execution window (instructions being executed), and checks for dependencies in the issue window (a sketch of such a check follows below)

Instruction Scheduling: The Optimization Goal
• Given a source program P, schedule the instructions so as to minimize the overall execution time on the functional units in the target machine.
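A minimal sketch of the kind of check performed on the issue window. This is my own illustration, not the course's code; the type and function names are invented, and it models a simple design without register renaming:

    #include <stdbool.h>

    typedef struct {            /* one decoded instruction in the issue window */
        int dest;               /* destination register number                  */
        int src1, src2;         /* source register numbers                      */
    } Inst;

    /* True if 'younger' conflicts with 'older' (RAW, WAR, or WAW). */
    static bool depends_on(const Inst *younger, const Inst *older) {
        bool raw = (younger->src1 == older->dest) || (younger->src2 == older->dest);
        bool war = (older->src1 == younger->dest) || (older->src2 == younger->dest);
        bool waw = (younger->dest == older->dest);
        return raw || war || waw;
    }

    /* An instruction may issue only if it conflicts with no older,
       still-unfinished instruction in the window.                  */
    static bool can_issue(const Inst *window, int n_older, const Inst *cand) {
        for (int i = 0; i < n_older; i++)
            if (depends_on(cand, &window[i]))
                return false;
        return true;
    }

A machine with register renaming would only need the RAW test; the WAR and WAW checks here stand in for the storage conflicts that renaming removes.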


EPIC/VLIW vs Superscalar: Summary
• In EPIC processors
– The compiler manages hardware resources
– Synergy between compiler and architecture is key
– Some compiler optimizations will be covered in depth
• In superscalar processors
– The architecture is "self-managed"
– Notably, instruction dependence analysis and scheduling are done by hardware

Next. . . .
• Basic ILP techniques: dependence analysis, simple code optimization
– First look at S/W (compiler) techniques to extract ILP
• Superscalar processors / dynamic ILP
– Branches
– Scheduling algorithms implemented in hardware
• EPIC processors
– Intel IA64 family
– Compiler optimizations needed
• Overview of compiler optimization


Superscalar Processors / Superscalar Terminology

• Basic
Superscalar         Able to issue > 1 instruction per cycle
Superpipelined      Deep, but not superscalar, pipeline. E.g., MIPS has 8 stages
Branch prediction   Logic to guess whether or not a branch will be taken, and possibly the branch target

• Advanced
Out-of-order        Able to issue instructions out of program order
Speculation         Execute instructions beyond branch points, possibly nullifying later
Register renaming   Able to dynamically assign physical registers to instructions
Retire unit         Logic to keep track of instructions as they complete


Superscalar Execution Example: Single Order, Data Dependence - In Order
• Assumptions
– Single FP adder takes 2 cycles
– Single FP multiplier takes 5 cycles
– Can issue add & multiply together
– Must issue in-order
– Operand order is in, in, out (two sources, then destination)

v: addt $f2, $f4, $f10
w: mult $f10, $f6, $f10
x: addt $f10, $f8, $f12
y: addt $f4, $f6, $f4
z: addt $f4, $f8, $f10

[Figure: data-flow graph of v-z; the chain v → w → x through $f10 is the critical path = 9 cycles. An in-order issue timeline is shown for a single adder with the data dependences.]

Adding Advanced Features
• Out-of-order issue
– Can start y as soon as the adder is available
– Must hold back z until $f10 is not busy and the adder is available
• With register renaming
v: addt $f2, $f4, $f10a
w: mult $f10a, $f6, $f10a
x: addt $f10a, $f8, $f12
y: addt $f4, $f6, $f4
z: addt $f4, $f8, $f14


Flow Path Model of Superscalars
[Figure: instruction flow from the I-cache through the branch predictor, decode buffer and dispatch buffer into integer, floating-point, media and memory units; data flow through the reorder buffer (ROB) and register file at completion; memory data flow through the store buffer/queue to the D-cache.]

Superscalar Pipeline Design
[Figure: pipeline stages Fetch → Instruction Buffer → Decode → Dispatch Buffer → Dispatch → Issuing Buffer → Execute → Completion Buffer → Complete → Store Buffer → Retire.]


Inorder Pipelines
[Figure: the Intel i486 five-stage inorder pipeline (IF, D1, D2, EX, WB) and the dual U-pipe/V-pipe Intel Pentium.]
Inorder pipeline: no WAW, no WAR hazards (almost always true).

Out-of-order Pipelining 101
[Figure: pipeline with stages IF, ID, RD, EX (INT, Fadd1-Fadd2, Fmult1-Fmult3, LD/ST units), WB.]
Program order:
Ia: F1 ← F2 x F3
Ib: F1 ← F4 + F5
With out-of-order completion, Ib may write F1 before Ia does. What is the value of F1? WAW!!!

Page 11 In-order Issue into Superscalar Execution Check List Diversified Pipelines

INSTRUCTION PROCESSING CONSTRAINTS RD ← Fn (RS, RT) Inorder•• • Inst. Dest. Func Source •• • Reg. Unit Registers Resource Contention Code Dependences Stream

(Structural Dependences) •• •

Control Dependences Data Dependences INT Fadd1 Fmult1 LD/ST

Fadd2 Fmult2 (RAW)True Dependences Storage Conflicts Issue stage needs to check: Fmult3 1. Structural Dependence 2. RAW Hazard (WAR)Anti-Dependences Output Dependences (WAW) 3. WAW Hazard •• • 4. WAR Hazard


Superscalar Processors
• Tasks:
– Parallel decoding
– Superscalar instruction issue
– Parallel instruction execution
» Preservation of sequential consistency of execution
– Exceptions during execution
» Preservation of sequential consistency of exception processing

Parallel Execution
• When instructions are executed in parallel they will finish out of program order
– Unequal execution times
• Specific means are needed to preserve logical consistency
– Preserving sequential consistency of execution
– Preserving sequential consistency of exception processing


Page 12 More Hardware Features to Support ILP Hardware Features to Support ILP

• Pipelining – Advantages • Additional Functional Units » Relatively low cost of implementation - requires – Advantages latches within functional units » Does not suffer from increased latency » With pipelining, ILP can be doubled, tripled or more bottleneck – Disadvantages – Disadvantages » Adds delays to execution of individual operations » Amount of functional unit hardware proportional to » Increased latency eventually counterbalances increase in ILP » Interconnection network and register file size proportional to square of number of functional units
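Rough port-count illustration (the numbers are mine, for illustration only): with 8 functional units, each reading 2 operands and writing 1 result per cycle, a shared register file needs on the order of 16 read ports and 8 write ports; doubling to 16 units doubles the port count but roughly quadruples the area and wiring devoted to the ports and to the bypass/interconnection network, which is the square-law growth referred to above.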


Hardware Features to Support ILP
• Instruction Issue Unit
– Care must be taken not to issue an instruction if another instruction upon which it is dependent is not complete
– Requires complex control logic in superscalar processors
– Virtually trivial control logic in VLIW processors

– Little ILP is typically found within basic blocks
» A straight-line sequence of operations with no intervening control flow
– Multiple basic blocks must be executed in parallel
» Execution may continue along multiple paths before it is known which path will be executed


Page 13 Hardware Features to Support ILP Hardware Features to Support ILP

• Requirements for Speculative Execution • Speculative Execution – Terminate unnecessary speculative computation – Expensive in hardware once the branch has been resolved – Alternative is to perform speculative code motion at – Undo the effects of the speculatively executed compile time operations that should not have been executed » Move operations from subsequent blocks up past branch operations into proceeding blocks – Ensure that no exceptions are reported until it is known that the excepting operation should have – Requires less demanding hardware been executed » A mechanism to ensure that exceptions caused by speculatively scheduled operations are – Preserve enough execution state at each reported if and only if flow of control is such speculative branch point to enable execution to that they would have been executed in the non- resume down the correct path if the speculative speculative version of the code execution happened to proceed down the wrong » Additional registers to hold the speculative one. execution state


Instruction Level Parallelism (ILP)
• ILP: overlap execution of unrelated instructions
• How to extract parallelism from a program?
– Go beyond a single block to get more instruction level parallelism
• Who does instruction scheduling and parallelism extraction?
– Software or hardware or a mix?
– Superscalar processors require H/W solutions, but can also use some compiler help
• What new hardware features are required to support more ILP?
– Different requirements for superscalar and EPIC

Introduction to S/W Techniques for ILP


Review of ILP Techniques
• Key issue to worry about is hazards
– Control and data
– Arising out of dependencies
– They introduce stalls in execution
• How to increase ILP
– Reduce data hazards: RAW, WAR, WAW
– Handle control hazards better
– Increase ideal IPC (instructions per cycle)
• Note: the bottom line is how to better schedule instructions

Recall our old friend from pipelining
• CPI = ideal CPI + structural stalls + data hazard stalls + control stalls (a plugged-in example follows below)
– Ideal (pipeline) CPI: measure of the maximum performance attainable by the implementation
– Structural hazards: HW cannot support this combination of instructions
– Data hazards: instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
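Plugged-in example (the numbers are assumed for illustration, not measured): with an ideal CPI of 1.0 and averages of 0.1 structural, 0.3 data-hazard and 0.2 control stalls per instruction, CPI = 1.0 + 0.1 + 0.3 + 0.2 = 1.6, so the machine delivers only 1/1.6 ≈ 63% of its ideal throughput.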


Ideas to Reduce Stalls

Chapter 3:
Technique                                  Reduces
Dynamic scheduling                         Data hazard stalls
Dynamic branch prediction                  Control stalls
Issuing multiple instructions per cycle    Ideal CPI
Speculation                                Data and control stalls
Dynamic memory disambiguation              Data hazard stalls involving memory

Chapter 4:
Technique                                  Reduces
Loop unrolling                             Control hazard stalls
Basic compiler pipeline scheduling         Data hazard stalls
Compiler dependence analysis               Ideal CPI and data hazard stalls
Software pipelining and trace scheduling   Ideal CPI and data hazard stalls
Compiler speculation                       Ideal CPI, data and control stalls

Instruction-Level Parallelism (ILP)
• Basic Block (BB) ILP is quite small
– BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
– Average dynamic branch frequency of 15% to 25% => 4 to 7 instructions execute between a pair of branches
– Plus, instructions in a BB are likely to depend on each other
• To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
• Simplest: loop-level parallelism, exploiting parallelism among iterations of a loop
– Vector is one way
» Where is this useful?
– If not vector, then either dynamic via branch prediction or static via loop unrolling by the compiler


Quick recall of data hazards..
• True/flow dependencies - RAW
• Name dependencies - WAR, WAW
– Also known as false dependencies; WAW is also called an output dependence

Data Dependence and Hazards
• InstrJ is data dependent on InstrI if InstrJ tries to read an operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
• or InstrJ is data dependent on InstrK, which is dependent on InstrI
• Caused by a "True Dependence" (compiler term)
• If a true dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard


Name Dependence #1: Anti-dependence
• Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; 2 versions of name dependence
• InstrJ writes an operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1"
• If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard

Name Dependence #2: Output dependence
• InstrJ writes an operand before InstrI writes it
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1"
• If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard


ILP and Data Hazards
• HW/SW must preserve program order:
– The order instructions would execute in if executed sequentially one at a time, as determined by the original source program
– Does this mean we can never change the order of execution of instructions?
» Ask: what happens if we change the order of an instruction?
» Does the result change?
• HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
• Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict
– Register renaming resolves name dependences for registers
– Either by the compiler or by HW (a small renaming sketch follows below)
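A small sketch of what renaming does, in C. This is mine, not from the slides; the variable names are arbitrary:

    /* Before renaming: the two writes to 'a' create WAR and WAW conflicts
       with the earlier read of 'a'.                                        */
    int before(int a, int b, int c, int d) {
        int x = a + b;     /* reads a                      */
        a = c + 1;         /* WAR with the read above      */
        a = d - 3;         /* WAW with the previous write  */
        return x + a;
    }

    /* After renaming: fresh names a1, a2 remove the WAR and WAW dependences,
       so the three statements before the return no longer conflict.         */
    int after(int a, int b, int c, int d) {
        int x  = a + b;
        int a1 = c + 1;    /* renamed first write; kept to mirror the original */
        int a2 = d - 3;    /* renamed second write; this is the live value     */
        (void)a1;          /* silence the unused-variable warning              */
        return x + a2;
    }

Only the true (RAW) dependences remain to constrain the order; this is exactly what compile-time renaming or hardware register renaming achieves.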


Control Dependencies
• Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order
if p1 {
  S1;
};
if p2 {
  S2;
}
• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1

Control Dependence Ignored
• Control dependence need not be preserved
– Willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program
• Instead, 2 properties critical to program correctness are exception behavior and data flow


Exception Behavior
• Preserving exception behavior => any changes in instruction execution order must not change how exceptions are raised in the program (=> no new exceptions)
• Example:
DADDU R2,R3,R4
BEQZ R2,L1
LW R1,0(R2)
L1:
• Problem with moving LW before BEQZ?

Data Flow
• Data flow: the actual flow of data values among instructions that produce results and those that consume them
– Branches make the flow dynamic and determine which instruction is the supplier of data
• Example:
DADDU R1,R2,R3
BEQZ R4,L
DSUBU R1,R5,R6
L: …
OR R7,R1,R8
• Does OR depend on DADDU or DSUBU? Must preserve data flow on execution


Ok.. ILP through Software/Compiler
• Ask what you (SW/compiler) can do for the HW
• Quick look at one SW technique to
– Decrease CPU time
– Expose more ILP

Loop Unrolling: A Simple S/W Technique
• Parallelism within one "basic block" is minimal
– Need to look at larger regions of code to schedule
• Loops are very common
– Number of iterations, same tasks in each iteration
• Simple observation: if iterations are independent, then multiple iterations can be executed in parallel
• Loop unrolling: unrolling multiple iterations of a loop to create more instructions to schedule (a source-level sketch follows below)
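A source-level sketch of the idea; this is mine (the course example that follows works at the assembly level), and the function names and the assumption that n is a multiple of 4 are mine:

    /* Original loop: add a scalar to every element of a vector. */
    void add_scalar(double *x, int n, double s) {
        for (int i = 0; i < n; i++)
            x[i] = x[i] + s;
    }

    /* Unrolled by 4 (assumes n is a multiple of 4): the four adds in the
       body are independent of one another, so the scheduler has more
       instructions to overlap and fewer branches to execute.             */
    void add_scalar_unrolled(double *x, int n, double s) {
        for (int i = 0; i < n; i += 4) {
            x[i]     = x[i]     + s;
            x[i + 1] = x[i + 1] + s;
            x[i + 2] = x[i + 2] + s;
            x[i + 3] = x[i + 3] + s;
        }
    }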


FP Loop Hazards / Example FP Loop: Where are the Hazards?

Loop: LD F0,0(R1)     ;F0=vector element
      ADDD F4,F0,F2   ;add scalar from F2
      SD 0(R1),F4     ;store result
      SUBI R1,R1,8    ;decrement pointer 8B (DW)
      BNEZ R1,Loop    ;branch R1!=zero
      NOP             ;delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0

• Where are the stalls?

FP Loop Showing Stalls

1 Loop: LD F0,0(R1)    ;F0=vector element
2       stall
3       ADDD F4,F0,F2  ;add scalar in F2
4       stall
5       stall
6       SD 0(R1),F4    ;store result
7       SUBI R1,R1,8   ;decrement pointer 8B (DW)
8       BNEZ R1,Loop   ;branch R1!=zero
9       stall          ;delayed branch slot

• 9 clocks: rewrite the code to minimize stalls?

Revised FP Loop Minimizing Stalls

1 Loop: LD F0,0(R1)
2       stall
3       ADDD F4,F0,F2
4       SUBI R1,R1,8
5       BNEZ R1,Loop   ;delayed branch
6       SD 8(R1),F4    ;altered when moved past SUBI

Swap BNEZ and SD by changing the address of SD

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

• 6 clocks: unroll the loop 4 times to make the code faster?

Unroll Loop Four Times (straightforward way)

1 Loop: LD F0,0(R1)
2       ADDD F4,F0,F2
3       SD 0(R1),F4     ;drop SUBI & BNEZ
4       LD F6,-8(R1)
5       ADDD F8,F6,F2
6       SD -8(R1),F8    ;drop SUBI & BNEZ
7       LD F10,-16(R1)
8       ADDD F12,F10,F2
9       SD -16(R1),F12  ;drop SUBI & BNEZ
10      LD F14,-24(R1)
11      ADDD F16,F14,F2
12      SD -24(R1),F16
13      SUBI R1,R1,#32  ;alter to 4*8
14      BNEZ R1,LOOP
15      NOP

15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration (the count is unpacked below)
Assumes R1 is a multiple of 4
Rewrite the loop to minimize stalls?

Unrolled Loop That Minimizes Stalls

1 Loop: LD F0,0(R1)
2       LD F6,-8(R1)
3       LD F10,-16(R1)
4       LD F14,-24(R1)
5       ADDD F4,F0,F2
6       ADDD F8,F6,F2
7       ADDD F12,F10,F2
8       ADDD F16,F14,F2
9       SD 0(R1),F4
10      SD -8(R1),F8
11      SD -16(R1),F12
12      SUBI R1,R1,#32
13      BNEZ R1,LOOP
14      SD 8(R1),F16    ;8-32 = -24

14 clock cycles, or 3.5 per iteration

• What assumptions were made when the code was moved?
– OK to move the store past SUBI even though it changes the register
– OK to move loads before stores: do we get the right data?
– When is it safe for the compiler to make such changes?
When is it safe to move instructions?
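Unpacking the cycle counts above using the latency table from the earlier slides: in the straightforward unrolled loop, each of the four LD/ADDD/SD groups incurs 1 stall between LD and ADDD and 2 stalls between ADDD and SD, giving 15 instructions + 4 x (1+2) stalls = 27 cycles, i.e. about 6.8 cycles per original iteration; the rescheduled version removes all the stalls, giving 14 cycles, or 3.5 per iteration.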

Compiler Perspectives on Code Movement
• Definitions: the compiler is concerned about dependencies in the program; whether or not a HW hazard results depends on the given pipeline
• Try to schedule to avoid hazards
• (True) data dependencies (RAW if a hazard for HW)
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
• If dependent, can't execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory (a pointer-aliasing sketch follows below):
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?

Where are the data dependencies?

1 Loop: LD F0,0(R1)
2       ADDD F4,F0,F2
3       SUBI R1,R1,8
4       BNEZ R1,Loop   ;delayed branch
5       SD 8(R1),F4    ;altered when moved past SUBI
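The memory question above is essentially pointer aliasing. A small C sketch of mine (names invented) of why the compiler cannot always tell:

    /* If p and q might point to overlapping storage, the compiler may not
       freely reorder the store through p with later loads through q: the
       two addresses play the same role as 100(R4) and 20(R6) above.       */
    void scale_into(double *p, const double *q, int n) {
        for (int i = 0; i < n; i++)
            p[i] = q[i] * 2.0;   /* does p[i] alias q[i+1]? unknown in general */
    }

This is one reason C offers the restrict qualifier and why hardware performs dynamic memory disambiguation.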


Where are the name dependencies?

With the registers reused (original unrolled loop):
1 Loop: LD F0,0(R1)
2       ADDD F4,F0,F2
3       SD 0(R1),F4     ;drop SUBI & BNEZ
4       LD F0,-8(R1)
5       ADDD F4,F0,F2
6       SD -8(R1),F4    ;drop SUBI & BNEZ
7       LD F0,-16(R1)
8       ADDD F4,F0,F2
9       SD -16(R1),F4   ;drop SUBI & BNEZ
10      LD F0,-24(R1)
11      ADDD F4,F0,F2
12      SD -24(R1),F4
13      SUBI R1,R1,#32  ;alter to 4*8
14      BNEZ R1,LOOP
15      NOP

How can we remove them?

With each iteration given its own registers:
1 Loop: LD F0,0(R1)
2       ADDD F4,F0,F2
3       SD 0(R1),F4     ;drop SUBI & BNEZ
4       LD F6,-8(R1)
5       ADDD F8,F6,F2
6       SD -8(R1),F8    ;drop SUBI & BNEZ
7       LD F10,-16(R1)
8       ADDD F12,F10,F2
9       SD -16(R1),F12  ;drop SUBI & BNEZ
10      LD F14,-24(R1)
11      ADDD F16,F14,F2
12      SD -24(R1),F16
13      SUBI R1,R1,#32  ;alter to 4*8
14      BNEZ R1,LOOP
15      NOP

This is called "register renaming"


Compiler Perspectives on Code Movement
• Again, name dependencies are hard for memory accesses
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
• Our example required the compiler to know that if R1 doesn't change then:
0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
There were no dependencies between some loads and stores, so they could be moved past each other

• The final kind of dependence is called control dependence
• Example
if p1 {S1;};
if p2 {S2;};
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1


Compiler Perspectives on Code Movement
• Another kind of dependence is called name dependence: two instructions use the same name (register or memory location) but don't exchange data
• Anti-dependence (WAR if a hazard for HW)
– Instruction j writes a register or memory location that instruction i reads from, and instruction i is executed first
• Output dependence (WAW if a hazard for HW)
– Instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved

• Two (obvious) constraints on control dependences:
– An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch
– An instruction that is not control dependent on a branch cannot be moved to after the branch, so that its execution is controlled by the branch
• Control dependencies can be relaxed to get parallelism; we get the same effect if we preserve the order of exceptions (address in register checked by branch before use) and the data flow (value in register depends on branch)
– We can "violate" the two constraints above by putting some 'checks' in place
» Branch prediction, speculation


Where are the control dependencies?

1 Loop: LD F0,0(R1)
2       ADDD F4,F0,F2
3       SD 0(R1),F4
4       SUBI R1,R1,8
5       BEQZ R1,exit
6       LD F0,0(R1)
7       ADDD F4,F0,F2
8       SD 0(R1),F4
9       SUBI R1,R1,8
10      BEQZ R1,exit
11      LD F0,0(R1)
12      ADDD F4,F0,F2
13      SD 0(R1),F4
14      SUBI R1,R1,8
15      BEQZ R1,exit
....

When Safe to Unroll Loop?
• Example: where are the data dependencies? (A, B, C distinct & nonoverlapping)
for (i=1; i<=100; i=i+1) {
  A[i+1] = A[i] + C[i];    /* S1 */
  B[i+1] = B[i] + A[i+1];  /* S2 */
}
1. S2 uses the value A[i+1] computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
This is a "loop-carried dependence": a dependence between iterations
• Implies that iterations are dependent and can't be executed in parallel
• Not the case for our prior example; each iteration was distinct (a dependence-free contrast is sketched below)
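For contrast, a loop of the same shape but without a loop-carried dependence (a sketch of mine, in the same fragment style as the loop above; s is a loop-invariant scalar) can be unrolled or run in parallel exactly like the earlier FP loop:

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i] + s;   /* each iteration reads and writes only its own A[i] */
    }

No iteration reads a value that another iteration writes, so the iterations are independent.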


Next. . .
• How to deal with instruction flow
– Dynamic branch prediction
• How to deal with register/data flow
– Register renaming
• Dynamic branch prediction
• Dynamic scheduling using the Tomasulo method

Superscalar HW Schemes: Instruction Parallelism
• Why do this in HW at run time?
– Works when the real dependence can't be known at compile time
– Compiler is simpler
– Code for one machine runs well on another
• Key idea: allow instructions behind a stall to proceed
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
– Enables out-of-order execution => out-of-order completion
– The ID stage checked both for structural and data hazards
– Scoreboarding dates to the CDC 6600 in 1963

