Course Outline CS 211: Architecture

• Introduction: Trends, Performance models
• Review of computer organization and ISA implementation
• Overview of Pipelining
• ILP Processors: Superscalar Processors – Next! ILP Intro and Superscalar
• ILP: EPIC/VLIW Processors
• Optimization techniques for ILP processors – getting max performance out of an ILP design
• Part 2: Other components – memory, I/O

Introduction to ILP Processors & Concepts


Introduction to Instruction-Level Parallelism (ILP)
• What is ILP?
– Processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel
• ILP is transparent to the user
– Multiple operations are executed in parallel even though the system is handed a single program written with a sequential processor in mind
• Same execution hardware as a normal RISC machine
– May be more than one of any given type of hardware

Architectures for Instruction-Level Parallelism
Scalar (baseline): Instruction Parallelism = D; Operation Latency = 1; Peak IPC = 1
[Figure: baseline scalar pipeline - successive instructions pass through IF, DE, EX, WB one per cycle; time measured in cycles of the baseline machine.]


Superpipelined Machine
Superpipelined Execution: IP = D x M; OL = M minor cycles; Peak IPC = 1 per minor cycle (M per baseline cycle)
[Figure: superpipelined pipeline diagram - the major cycle is divided into M minor cycles, and successive instructions enter IF, DE, EX, WB one minor cycle apart.]

Superscalar Machines
Superscalar (Pipelined) Execution: IP = D x N; OL = 1 baseline cycle; Peak IPC = N per baseline cycle
[Figure: superscalar pipeline diagram - N instructions enter IF, DE, EX, WB together in each baseline cycle.]


Superscalar and Superpipelined
Superscalar parallelism: Operation Latency = 1; Issuing Rate = N; Superscalar Degree (SSD) = N (determined by the issue rate)
Superpipeline parallelism: Operation Latency = M; Issuing Rate = 1; Superpipelined Degree (SPD) = M (determined by the operation latency)
[Figure: superscalar vs. superpipelined execution - IFetch, Dcode, Execute, Writeback stages plotted against time in cycles of the base machine.]
Superscalar and superpipelined machines of equal degree have roughly the same performance, i.e. if N = M then both have about the same IPC.

Limitations of Inorder Pipelines
• CPI of inorder pipelines degrades very sharply if the machine parallelism is increased beyond a certain point, i.e. when N x M approaches the average distance between dependent instructions (a rough numeric illustration follows below)
• Forwarding is no longer effective ⇒ must stall more often
– Pipeline may never be full due to frequent dependency stalls!!
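A rough numeric illustration of the N x M point above (the numbers are assumed for illustration, not taken from the slides): a machine that issues N = 4 instructions per cycle with an operation latency of M = 2 base cycles can have N x M = 8 instructions in flight; if dependent instructions are on average only about 4 instructions apart in the program, an instruction's producer is usually still in flight when the instruction wants to issue, so forwarding cannot hide the latency and the pipeline stalls frequently.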

What is Parallelism, and Where?

x = a + b;
y = b * 2;
z = (x - y) * (x + y);

[Figure: dataflow graph of the code above - a and b feed the add (x) and the multiply-by-2 (y); x and y feed the subtract and the add, whose results feed the final multiply (z).]

What is Parallelism?
• Work
– T1: time to complete the computation on a sequential system
• Critical Path
– T∞: time to complete the same computation on an infinitely parallel system
• Average Parallelism
– Pavg = T1 / T∞ (a worked example follows below)
• For a p-wide system
– Tp ≥ max{ T1/p, T∞ }
– Pavg >> p ⇒ Tp ≈ T1/p
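Worked example (my own count, using the snippet above): the computation contains T1 = 5 operations (a+b, b*2, x-y, x+y, and the final multiply). The longest dependence chain is a+b, then x-y (or x+y), then the multiply, so T∞ = 3. Hence Pavg = T1/T∞ = 5/3 ≈ 1.7, and on a p = 2-wide machine Tp ≥ max{ 5/2, 3 } = 3.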


Example Execution

Functional Unit                Operations Performed         Latency
Integer Unit 1                 Integer ALU operations       1
                               Integer multiplication       2
                               Loads                        2
                               Stores                       1
Integer Unit 2 / Branch Unit   Integer ALU operations       1
                               Integer multiplication       2
                               Loads                        2
                               Stores                       1
                               Test-and-branch              1
Floating-point Unit 1          Floating-point operations    3
Floating-point Unit 2          Floating-point operations    3

[Figure: the same code fragment scheduled on the machine above, once as a sequential execution and once as an ILP execution.]


ILP: Instruction-Level Parallelism
• ILP is a measure of the amount of inter-dependencies between instructions
• Average ILP = no. of instructions / no. of cycles required
– code1: ILP = 1, i.e. must execute serially
– code2: ILP = 3, i.e. can execute at the same time

code1: r1 ← r2 + 1        code2: r1 ← r2 + 1
       r3 ← r1 / 17               r3 ← r9 / 17
       r4 ← r0 - r3               r4 ← r0 - r10

Inter-instruction Dependences
• Data dependence (Read-after-Write, RAW)
  r3 ← r1 op r2
  r5 ← r3 op r4
• Anti-dependence (Write-after-Read, WAR)
  r3 ← r1 op r2
  r1 ← r4 op r5
• Output dependence (Write-after-Write, WAW)
  r3 ← r1 op r2
  r5 ← r3 op r4
  r3 ← r6 op r7
• Control dependence
(A short C fragment illustrating these dependence kinds follows below.)
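The same three dependence kinds can be seen in a short C fragment. This is a sketch of mine, not from the slides; the variable names are arbitrary:

    int deps(int a, int b, int c, int d) {
        int sum = a + b;   /* I1                                          */
        int t = sum * 2;   /* I2: RAW (true) dependence on sum, from I1   */
        a = c + 1;         /* I3: WAR (anti) dependence on a, with I1     */
        a = d - 3;         /* I4: WAW (output) dependence on a, with I3   */
        return t + a;
    }

Renaming the destinations of I3 and I4 to fresh variables would remove the WAR and WAW dependences; only the true (RAW) dependence remains to constrain the execution order.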


Scope of ILP Analysis

r1 ⇐ r2 + 1
r3 ⇐ r1 / 17
r4 ⇐ r0 - r3          ILP = 1 (this chain must execute serially)

r11 ⇐ r12 + 1
r13 ⇐ r19 / 17
r14 ⇐ r0 - r20        ILP = 2 (the two independent chains can be interleaved)

Out-of-order execution permits more ILP to be exploited

Questions Facing ILP System Designers
• What gives rise to instruction-level parallelism in conventional, sequential programs?
• How is the potential parallelism identified and enhanced, and how much is there?
• What must be done in order to exploit the parallelism that has been identified?
• How should the work of identifying, enhancing and exploiting the parallelism be divided between the hardware and the software (the compiler)?
• What are the alternatives in selecting the architecture of an ILP processor?


Sequential Processor
[Figure: sequential instructions feed a conventional processor containing a single execution unit.]

ILP Processors: Superscalar
[Figure: sequential instructions feed a superscalar processor whose scheduling logic drives multiple execution units - instruction scheduling and parallelism extraction are done by hardware.]


ILP Processors: EPIC/VLIW
[Figure: a serial program (C code) is translated by the compiler into scheduled instructions, which feed an EPIC processor - scheduling and parallelism extraction are done by the compiler.]

ILP Architectures
• Between the compiler and the run-time hardware, the following functions must be performed:
– Dependencies between operations must be determined
– Operations that are independent of any operation that has not yet completed must be determined
– Independent operations must be scheduled to execute at some particular time, on some specific functional unit, and must be assigned a register into which the result may be deposited


ILP Architecture Classifications
• Sequential Architectures
– The program is not expected to convey any explicit information regarding parallelism
• Dependence Architectures
– The program explicitly indicates the dependencies between operations
• Independence Architectures
– The program provides information as to which operations are independent of one another

Sequential Architecture
• Program contains no explicit information regarding dependencies that exist between instructions
• Dependencies between instructions must be determined by the hardware
– It is only necessary to determine dependencies with sequentially preceding instructions that have been issued but not yet completed
• Compiler may re-order instructions to facilitate the hardware's task of extracting parallelism


Sequential Architecture Example
• Superscalar processor is a representative ILP implementation of a sequential architecture
– For every instruction issued by a superscalar processor, the hardware must check whether its operands interfere with the operands of any other instruction that is either
» (1) already in execution,
» (2) issued but waiting for completion of interfering instructions that would have been executed earlier in a sequential program, or
» (3) being issued concurrently but would have been executed earlier in the sequential execution of the program
• Superscalar processors attempt to issue multiple instructions per cycle
– However, essential dependencies are specified by the sequential ordering, so operations must be processed in sequential order
– This proves to be a performance bottleneck that is very expensive to overcome
• The alternative to issuing multiple instructions per cycle is deeper pipelining, i.e. issuing instructions faster


Dependence Architecture
• Compiler or programmer communicates to the hardware the dependencies between instructions
– Removes the need to scan the program in sequential order (the bottleneck for superscalar processors)
• Hardware determines at run-time when to schedule each instruction

Dependence Architecture Example
• Dataflow processors are representative of dependence architectures
– Execute an instruction at the earliest possible time subject to the availability of input operands and functional units
– Dependencies are communicated by providing with each instruction a list of all successor instructions
– As soon as all input operands of an instruction are available, the hardware fetches the instruction
– The instruction is executed as soon as a functional unit is available
• Few dataflow processors currently exist


Independence Architecture
• By knowing which operations are independent, the hardware needs no further checking to determine which instructions can be issued in the same cycle
• The set of independent operations is far greater than the set of dependent operations
– Only a subset of the independent operations is specified
• The compiler may additionally specify on which functional unit and in which cycle an operation is executed

Independence Architecture Example
• EPIC/VLIW processors are examples of independence architectures
– Specify exactly which functional unit each operation is executed on and when each operation is issued
– Operations are independent of other operations issued at the same time as well as those that are in execution
– The compiler emulates at compile time what a dataflow processor does at run-time
– The hardware needs to make no run-time decisions


Compiler vs. Processor
[Figure: division of work between compiler and hardware. The steps are Frontend and Optimizer → Determine Dependences → Determine Independences → Bind Operations to Function Units → Bind Transports to Busses → Execute. A superscalar does everything after the frontend in hardware; a dataflow machine determines dependences in the compiler; an independence architecture also determines independences in the compiler; a VLIW additionally binds operations to function units in the compiler; a TTA also binds transports to busses in the compiler.]

VLIW and Superscalar
• The basic structure of both VLIW and superscalar processors consists of a number of execution units (EUs), each capable of parallel operation on data fetched from a register file
• VLIW and superscalar require highly multiported register files
– The limit on register ports places an inherent limitation on the maximum number of EUs

B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May 1993.


VLIW & Superscalar - Differences
• Presentation of instructions:
– VLIW processors receive multi-operation instructions
– Superscalar processors accept a traditional sequential stream but can issue more than one instruction per cycle
• VLIW needs very long instructions in order to specify what each EU should do
• Superscalar receives a stream of conventional instructions
• The decode and issue unit in a superscalar issues multiple instructions for the EUs
– It has to figure out dependencies and find independent instructions
• VLIW expects dependency-free code, whereas superscalar typically does not
– Superscalars cope with dependencies using hardware


Instruction Scheduling
• Dependencies must be detected and resolved
• Static scheduling: accomplished by the compiler, which avoids dependencies by rearranging code
• Dynamic scheduling: detection and resolution performed by hardware
– The processor typically maintains an issue window (prefetched instructions) and an execution window (instructions being executed), and checks for dependencies in the issue window (a sketch of such a check follows below)

Instruction Scheduling: The Optimization Goal
• Given a source program P, schedule the instructions so as to minimize the overall execution time on the functional units in the target machine.
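A minimal sketch of the kind of check performed on the issue window. This is my own illustration, not the course's code; the type and function names are invented, and it models a simple design without register renaming:

    #include <stdbool.h>

    typedef struct {            /* one decoded instruction in the issue window */
        int dest;               /* destination register number                  */
        int src1, src2;         /* source register numbers                      */
    } Inst;

    /* True if 'younger' conflicts with 'older' (RAW, WAR, or WAW). */
    static bool depends_on(const Inst *younger, const Inst *older) {
        bool raw = (younger->src1 == older->dest) || (younger->src2 == older->dest);
        bool war = (older->src1 == younger->dest) || (older->src2 == younger->dest);
        bool waw = (younger->dest == older->dest);
        return raw || war || waw;
    }

    /* An instruction may issue only if it conflicts with no older,
       still-unfinished instruction in the window.                  */
    static bool can_issue(const Inst *window, int n_older, const Inst *cand) {
        for (int i = 0; i < n_older; i++)
            if (depends_on(cand, &window[i]))
                return false;
        return true;
    }

A machine with register renaming would only need the RAW test; the WAR and WAW checks here stand in for the storage conflicts that renaming removes.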


EPIC/VLIW vs Superscalar: Summary
• In EPIC processors
– The compiler manages hardware resources
– Synergy between compiler and architecture is key
– Some compiler optimizations will be covered in depth
• In superscalar processors
– The architecture is "self-managed"
– Notably, instruction dependence analysis and scheduling are done by hardware

Next. . . .
• Basic ILP techniques: dependence analysis, simple code optimization
– First look at S/W (compiler) techniques to extract ILP
• Superscalar processors / dynamic ILP
– Branches
– Scheduling algorithms implemented in hardware
• EPIC processors
– Intel IA64 family
– Compiler optimizations needed
• Overview of compiler optimization


Superscalar Processors / Superscalar Terminology

• Basic
Superscalar         Able to issue > 1 instruction per cycle
Superpipelined      Deep, but not superscalar, pipeline. E.g., MIPS has 8 stages
Branch prediction   Logic to guess whether or not a branch will be taken, and possibly the branch target

• Advanced
Out-of-order        Able to issue instructions out of program order
Speculation         Execute instructions beyond branch points, possibly nullifying later
Register renaming   Able to dynamically assign physical registers to instructions
Retire unit         Logic to keep track of instructions as they complete


Superscalar Execution Example: Single Order, Data Dependence - In Order
• Assumptions
– Single FP adder takes 2 cycles
– Single FP multiplier takes 5 cycles
– Can issue add & multiply together
– Must issue in-order
– Operand order is in, in, out (two sources, then destination)

v: addt $f2, $f4, $f10
w: mult $f10, $f6, $f10
x: addt $f10, $f8, $f12
y: addt $f4, $f6, $f4
z: addt $f4, $f8, $f10

[Figure: data-flow graph of v-z; the chain v → w → x through $f10 is the critical path = 9 cycles. An in-order issue timeline is shown for a single adder with the data dependences.]

Adding Advanced Features
• Out-of-order issue
– Can start y as soon as the adder is available
– Must hold back z until $f10 is not busy and the adder is available
• With register renaming
v: addt $f2, $f4, $f10a
w: mult $f10a, $f6, $f10a
x: addt $f10a, $f8, $f12
y: addt $f4, $f6, $f4
z: addt $f4, $f8, $f14


Flow Path Model of Superscalars
[Figure: instruction flow from the I-cache through the branch predictor, decode buffer and dispatch buffer into integer, floating-point, media and memory units; data flow through the reorder buffer (ROB) and register file at completion; memory data flow through the store buffer/queue to the D-cache.]

Superscalar Pipeline Design
[Figure: pipeline stages Fetch → Instruction Buffer → Decode → Dispatch Buffer → Dispatch → Issuing Buffer → Execute → Completion Buffer → Complete → Store Buffer → Retire.]


Inorder Pipelines
[Figure: the Intel i486 five-stage inorder pipeline (IF, D1, D2, EX, WB) and the dual U-pipe/V-pipe Intel Pentium.]
Inorder pipeline: no WAW, no WAR hazards (almost always true).

Out-of-order Pipelining 101
[Figure: pipeline with stages IF, ID, RD, EX (INT, Fadd1-Fadd2, Fmult1-Fmult3, LD/ST units), WB.]
Program order:
Ia: F1 ← F2 x F3
Ib: F1 ← F4 + F5
With out-of-order completion, Ib may write F1 before Ia does. What is the value of F1? WAW!!!

Page 11 In-order Issue into Superscalar Execution Check List Diversified Pipelines

INSTRUCTION PROCESSING CONSTRAINTS RD ← Fn (RS, RT) Inorder•• • Inst. Dest. Func Source •• • Reg. Unit Registers Resource Contention Code Dependences Stream

(Structural Dependences) •• •

Control Dependences Data Dependences INT Fadd1 Fmult1 LD/ST

Fadd2 Fmult2 (RAW)True Dependences Storage Conflicts Issue stage needs to check: Fmult3 1. Structural Dependence 2. RAW Hazard (WAR)Anti-Dependences Output Dependences (WAW) 3. WAW Hazard •• • 4. WAR Hazard


Superscalar Processors
• Tasks:
– Parallel decoding
– Superscalar instruction issue
– Parallel instruction execution
» Preservation of sequential consistency of execution
– Exceptions during execution
» Preservation of sequential consistency of exception processing

Parallel Execution
• When instructions are executed in parallel they will finish out of program order
– Unequal execution times
• Specific means are needed to preserve logical consistency
– Preserving sequential consistency of execution
– Preserving sequential consistency of exception processing


Page 12 More Hardware Features to Support ILP Hardware Features to Support ILP

• Pipelining – Advantages • Additional Functional Units » Relatively low cost of implementation - requires – Advantages latches within functional units » Does not suffer from increased latency » With pipelining, ILP can be doubled, tripled or more bottleneck – Disadvantages – Disadvantages » Adds delays to execution of individual operations » Amount of functional unit hardware proportional to » Increased latency eventually counterbalances increase in ILP » Interconnection network and register file size proportional to square of number of functional units
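Rough port-count illustration (the numbers are mine, for illustration only): with 8 functional units, each reading 2 operands and writing 1 result per cycle, a shared register file needs on the order of 16 read ports and 8 write ports; doubling to 16 units doubles the port count but roughly quadruples the area and wiring devoted to the ports and to the bypass/interconnection network, which is the square-law growth referred to above.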


Hardware Features to Support ILP
• Instruction Issue Unit
– Care must be taken not to issue an instruction if another instruction upon which it is dependent is not complete
– Requires complex control logic in superscalar processors
– Virtually trivial control logic in VLIW processors

– Little ILP is typically found within basic blocks
» A straight-line sequence of operations with no intervening control flow
– Multiple basic blocks must be executed in parallel
» Execution may continue along multiple paths before it is known which path will be executed


Page 13 Hardware Features to Support ILP Hardware Features to Support ILP

• Requirements for Speculative Execution • Speculative Execution – Terminate unnecessary speculative computation – Expensive in hardware once the branch has been resolved – Alternative is to perform speculative code motion at – Undo the effects of the speculatively executed compile time operations that should not have been executed » Move operations from subsequent blocks up past branch operations into proceeding blocks – Ensure that no exceptions are reported until it is known that the excepting operation should have – Requires less demanding hardware been executed » A mechanism to ensure that exceptions caused by speculatively scheduled operations are – Preserve enough execution state at each reported if and only if flow of control is such speculative branch point to enable execution to that they would have been executed in the non- resume down the correct path if the speculative speculative version of the code execution happened to proceed down the wrong » Additional registers to hold the speculative one. execution state


Instruction Level Parallelism (ILP)
• ILP: overlap execution of unrelated instructions
• How to extract parallelism from a program?
– Go beyond a single block to get more instruction level parallelism
• Who does instruction scheduling and parallelism extraction?
– Software or hardware or a mix?
– Superscalar processors require H/W solutions, but can also use some compiler help
• What new hardware features are required to support more ILP?
– Different requirements for superscalar and EPIC

Introduction to S/W Techniques for ILP


Review of ILP Techniques
• Key issue to worry about is hazards
– Control and data
– Arising out of dependencies
– They introduce stalls in execution
• How to increase ILP
– Reduce data hazards: RAW, WAR, WAW
– Handle control hazards better
– Increase ideal IPC (instructions per cycle)
• Note: the bottom line is how to better schedule instructions

Recall our old friend from pipelining
• CPI = ideal CPI + structural stalls + data hazard stalls + control stalls (a plugged-in example follows below)
– Ideal (pipeline) CPI: measure of the maximum performance attainable by the implementation
– Structural hazards: HW cannot support this combination of instructions
– Data hazards: instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
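Plugged-in example (the numbers are assumed for illustration, not measured): with an ideal CPI of 1.0 and averages of 0.1 structural, 0.3 data-hazard and 0.2 control stalls per instruction, CPI = 1.0 + 0.1 + 0.3 + 0.2 = 1.6, so the machine delivers only 1/1.6 ≈ 63% of its ideal throughput.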


Ideas to Reduce Stalls

Chapter 3:
Technique                                  Reduces
Dynamic scheduling                         Data hazard stalls
Dynamic branch prediction                  Control stalls
Issuing multiple instructions per cycle    Ideal CPI
Speculation                                Data and control stalls
Dynamic memory disambiguation              Data hazard stalls involving memory

Chapter 4:
Technique                                  Reduces
Loop unrolling                             Control hazard stalls
Basic compiler pipeline scheduling         Data hazard stalls
Compiler dependence analysis               Ideal CPI and data hazard stalls
Software pipelining and trace scheduling   Ideal CPI and data hazard stalls
Compiler speculation                       Ideal CPI, data and control stalls

Instruction-Level Parallelism (ILP)
• Basic Block (BB) ILP is quite small
– BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
– Average dynamic branch frequency of 15% to 25% => 4 to 7 instructions execute between a pair of branches
– Plus, instructions in a BB are likely to depend on each other
• To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
• Simplest: loop-level parallelism, exploiting parallelism among iterations of a loop
– Vector is one way
» Where is this useful?
– If not vector, then either dynamic via branch prediction or static via loop unrolling by the compiler


Quick recall of data hazards..
• True/flow dependencies - RAW
• Name dependencies - WAR, WAW
– Also known as false dependencies; WAW is also called an output dependence

Data Dependence and Hazards
• InstrJ is data dependent on InstrI if InstrJ tries to read an operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
• or InstrJ is data dependent on InstrK, which is dependent on InstrI
• Caused by a "True Dependence" (compiler term)
• If a true dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard


Name Dependence #1: Anti-dependence
• Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; 2 versions of name dependence
• InstrJ writes an operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1"
• If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard

Name Dependence #2: Output dependence
• InstrJ writes an operand before InstrI writes it
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1"
• If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard


ILP and Data Hazards
• HW/SW must preserve program order:
– The order instructions would execute in if executed sequentially one at a time, as determined by the original source program
– Does this mean we can never change the order of execution of instructions?
» Ask: what happens if we change the order of an instruction?
» Does the result change?
• HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
• Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict
– Register renaming resolves name dependences for registers
– Either by the compiler or by HW (a small renaming sketch follows below)
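A small sketch of what renaming does, in C. This is mine, not from the slides; the variable names are arbitrary:

    /* Before renaming: the two writes to 'a' create WAR and WAW conflicts
       with the earlier read of 'a'.                                        */
    int before(int a, int b, int c, int d) {
        int x = a + b;     /* reads a                      */
        a = c + 1;         /* WAR with the read above      */
        a = d - 3;         /* WAW with the previous write  */
        return x + a;
    }

    /* After renaming: fresh names a1, a2 remove the WAR and WAW dependences,
       so the three statements before the return no longer conflict.         */
    int after(int a, int b, int c, int d) {
        int x  = a + b;
        int a1 = c + 1;    /* renamed first write; kept to mirror the original */
        int a2 = d - 3;    /* renamed second write; this is the live value     */
        (void)a1;          /* silence the unused-variable warning              */
        return x + a2;
    }

Only the true (RAW) dependences remain to constrain the order; this is exactly what compile-time renaming or hardware register renaming achieves.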


Control Dependencies
• Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order
if p1 {
  S1;
};
if p2 {
  S2;
}
• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1

Control Dependence Ignored
• Control dependence need not be preserved
– Willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program
• Instead, 2 properties critical to program correctness are exception behavior and data flow


Exception Behavior
• Preserving exception behavior => any changes in instruction execution order must not change how exceptions are raised in the program (=> no new exceptions)
• Example:
DADDU R2,R3,R4
BEQZ R2,L1
LW R1,0(R2)
L1:
• Problem with moving LW before BEQZ?

Data Flow
• Data flow: the actual flow of data values among instructions that produce results and those that consume them
– Branches make the flow dynamic and determine which instruction is the supplier of data
• Example:
DADDU R1,R2,R3
BEQZ R4,L
DSUBU R1,R5,R6
L: …
OR R7,R1,R8
• Does OR depend on DADDU or DSUBU? Must preserve data flow on execution


Ok.. ILP through Software/Compiler
• Ask what you (SW/compiler) can do for the HW
• Quick look at one SW technique to
– Decrease CPU time
– Expose more ILP

Loop Unrolling: A Simple S/W Technique
• Parallelism within one "basic block" is minimal
– Need to look at larger regions of code to schedule
• Loops are very common
– Number of iterations, same tasks in each iteration
• Simple observation: if iterations are independent, then multiple iterations can be executed in parallel
• Loop unrolling: unrolling multiple iterations of a loop to create more instructions to schedule (a source-level sketch follows below)
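A source-level sketch of the idea; this is mine (the course example that follows works at the assembly level), and the function names and the assumption that n is a multiple of 4 are mine:

    /* Original loop: add a scalar to every element of a vector. */
    void add_scalar(double *x, int n, double s) {
        for (int i = 0; i < n; i++)
            x[i] = x[i] + s;
    }

    /* Unrolled by 4 (assumes n is a multiple of 4): the four adds in the
       body are independent of one another, so the scheduler has more
       instructions to overlap and fewer branches to execute.             */
    void add_scalar_unrolled(double *x, int n, double s) {
        for (int i = 0; i < n; i += 4) {
            x[i]     = x[i]     + s;
            x[i + 1] = x[i + 1] + s;
            x[i + 2] = x[i + 2] + s;
            x[i + 3] = x[i + 3] + s;
        }
    }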


FP Loop Hazards / Example FP Loop: Where are the Hazards?

Loop: LD F0,0(R1)     ;F0=vector element
      ADDD F4,F0,F2   ;add scalar from F2
      SD 0(R1),F4     ;store result
      SUBI R1,R1,8    ;decrement pointer 8B (DW)
      BNEZ R1,Loop    ;branch R1!=zero
      NOP             ;delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0

• Where are the stalls?

FP Loop Showing Stalls

1 Loop: LD F0,0(R1)    ;F0=vector element
2       stall
3       ADDD F4,F0,F2  ;add scalar in F2
4       stall
5       stall
6       SD 0(R1),F4    ;store result
7       SUBI R1,R1,8   ;decrement pointer 8B (DW)
8       BNEZ R1,Loop   ;branch R1!=zero
9       stall          ;delayed branch slot

• 9 clocks: rewrite the code to minimize stalls?

Revised FP Loop Minimizing Stalls

1 Loop: LD F0,0(R1)
2       stall
3       ADDD F4,F0,F2
4       SUBI R1,R1,8
5       BNEZ R1,Loop   ;delayed branch
6       SD 8(R1),F4    ;altered when moved past SUBI

Swap BNEZ and SD by changing the address of SD

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

• 6 clocks: unroll the loop 4 times to make the code faster?

Unroll Loop Four Times (straightforward way)

1 Loop: LD F0,0(R1)
2       ADDD F4,F0,F2
3       SD 0(R1),F4     ;drop SUBI & BNEZ
4       LD F6,-8(R1)
5       ADDD F8,F6,F2
6       SD -8(R1),F8    ;drop SUBI & BNEZ
7       LD F10,-16(R1)
8       ADDD F12,F10,F2
9       SD -16(R1),F12  ;drop SUBI & BNEZ
10      LD F14,-24(R1)
11      ADDD F16,F14,F2
12      SD -24(R1),F16
13      SUBI R1,R1,#32  ;alter to 4*8
14      BNEZ R1,LOOP
15      NOP

15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration (the count is unpacked below)
Assumes R1 is a multiple of 4
Rewrite the loop to minimize stalls?

Unrolled Loop That Minimizes Stalls

1 Loop: LD F0,0(R1)
2       LD F6,-8(R1)
3       LD F10,-16(R1)
4       LD F14,-24(R1)
5       ADDD F4,F0,F2
6       ADDD F8,F6,F2
7       ADDD F12,F10,F2
8       ADDD F16,F14,F2
9       SD 0(R1),F4
10      SD -8(R1),F8
11      SD -16(R1),F12
12      SUBI R1,R1,#32
13      BNEZ R1,LOOP
14      SD 8(R1),F16    ;8-32 = -24

14 clock cycles, or 3.5 per iteration

• What assumptions were made when the code was moved?
– OK to move the store past SUBI even though it changes the register
– OK to move loads before stores: do we get the right data?
– When is it safe for the compiler to make such changes?
When is it safe to move instructions?
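Unpacking the cycle counts above using the latency table from the earlier slides: in the straightforward unrolled loop, each of the four LD/ADDD/SD groups incurs 1 stall between LD and ADDD and 2 stalls between ADDD and SD, giving 15 instructions + 4 x (1+2) stalls = 27 cycles, i.e. about 6.8 cycles per original iteration; the rescheduled version removes all the stalls, giving 14 cycles, or 3.5 per iteration.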

Compiler Perspectives on Code Movement
• Definitions: the compiler is concerned about dependencies in the program; whether or not a HW hazard results depends on the given pipeline
• Try to schedule to avoid hazards
• (True) data dependencies (RAW if a hazard for HW)
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
• If dependent, can't execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory (a pointer-aliasing sketch follows below):
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?

Where are the data dependencies?

1 Loop: LD F0,0(R1)
2       ADDD F4,F0,F2
3       SUBI R1,R1,8
4       BNEZ R1,Loop   ;delayed branch
5       SD 8(R1),F4    ;altered when moved past SUBI
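The memory question above is essentially pointer aliasing. A small C sketch of mine (names invented) of why the compiler cannot always tell:

    /* If p and q might point to overlapping storage, the compiler may not
       freely reorder the store through p with later loads through q: the
       two addresses play the same role as 100(R4) and 20(R6) above.       */
    void scale_into(double *p, const double *q, int n) {
        for (int i = 0; i < n; i++)
            p[i] = q[i] * 2.0;   /* does p[i] alias q[i+1]? unknown in general */
    }

This is one reason C offers the restrict qualifier and why hardware performs dynamic memory disambiguation.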


Where are the name dependencies?

With the registers reused (original unrolled loop):
1 Loop: LD F0,0(R1)
2       ADDD F4,F0,F2
3       SD 0(R1),F4     ;drop SUBI & BNEZ
4       LD F0,-8(R1)
5       ADDD F4,F0,F2
6       SD -8(R1),F4    ;drop SUBI & BNEZ
7       LD F0,-16(R1)
8       ADDD F4,F0,F2
9       SD -16(R1),F4   ;drop SUBI & BNEZ
10      LD F0,-24(R1)
11      ADDD F4,F0,F2
12      SD -24(R1),F4
13      SUBI R1,R1,#32  ;alter to 4*8
14      BNEZ R1,LOOP
15      NOP

How can we remove them?

With each iteration given its own registers:
1 Loop: LD F0,0(R1)
2       ADDD F4,F0,F2
3       SD 0(R1),F4     ;drop SUBI & BNEZ
4       LD F6,-8(R1)
5       ADDD F8,F6,F2
6       SD -8(R1),F8    ;drop SUBI & BNEZ
7       LD F10,-16(R1)
8       ADDD F12,F10,F2
9       SD -16(R1),F12  ;drop SUBI & BNEZ
10      LD F14,-24(R1)
11      ADDD F16,F14,F2
12      SD -24(R1),F16
13      SUBI R1,R1,#32  ;alter to 4*8
14      BNEZ R1,LOOP
15      NOP

This is called "register renaming"


Compiler Perspectives on Code Movement
• Again, name dependencies are hard for memory accesses
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
• Our example required the compiler to know that if R1 doesn't change then:
0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
There were no dependencies between some loads and stores, so they could be moved past each other

• The final kind of dependence is called control dependence
• Example
if p1 {S1;};
if p2 {S2;};
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1


Compiler Perspectives on Code Movement
• Another kind of dependence is called name dependence: two instructions use the same name (register or memory location) but don't exchange data
• Anti-dependence (WAR if a hazard for HW)
– Instruction j writes a register or memory location that instruction i reads from, and instruction i is executed first
• Output dependence (WAW if a hazard for HW)
– Instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved

• Two (obvious) constraints on control dependences:
– An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch
– An instruction that is not control dependent on a branch cannot be moved to after the branch, so that its execution is controlled by the branch
• Control dependencies can be relaxed to get parallelism; we get the same effect if we preserve the order of exceptions (address in register checked by branch before use) and the data flow (value in register depends on branch)
– We can "violate" the two constraints above by putting some 'checks' in place
» Branch prediction, speculation


Where are the control dependencies?

1 Loop: LD F0,0(R1)
2       ADDD F4,F0,F2
3       SD 0(R1),F4
4       SUBI R1,R1,8
5       BEQZ R1,exit
6       LD F0,0(R1)
7       ADDD F4,F0,F2
8       SD 0(R1),F4
9       SUBI R1,R1,8
10      BEQZ R1,exit
11      LD F0,0(R1)
12      ADDD F4,F0,F2
13      SD 0(R1),F4
14      SUBI R1,R1,8
15      BEQZ R1,exit
....

When Safe to Unroll Loop?
• Example: where are the data dependencies? (A, B, C distinct & nonoverlapping)
for (i=1; i<=100; i=i+1) {
  A[i+1] = A[i] + C[i];    /* S1 */
  B[i+1] = B[i] + A[i+1];  /* S2 */
}
1. S2 uses the value A[i+1] computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
This is a "loop-carried dependence": a dependence between iterations
• Implies that iterations are dependent and can't be executed in parallel
• Not the case for our prior example; each iteration was distinct (a dependence-free contrast is sketched below)
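For contrast, a loop of the same shape but without a loop-carried dependence (a sketch of mine, in the same fragment style as the loop above; s is a loop-invariant scalar) can be unrolled or run in parallel exactly like the earlier FP loop:

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i] + s;   /* each iteration reads and writes only its own A[i] */
    }

No iteration reads a value that another iteration writes, so the iterations are independent.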


Next. . .
• How to deal with instruction flow
– Dynamic branch prediction
• How to deal with register/data flow
– Register renaming
• Dynamic branch prediction
• Dynamic scheduling using the Tomasulo method

Superscalar HW Schemes: Instruction Parallelism
• Why do this in HW at run time?
– Works when the real dependence can't be known at compile time
– Compiler is simpler
– Code for one machine runs well on another
• Key idea: allow instructions behind a stall to proceed
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
– Enables out-of-order execution => out-of-order completion
– The ID stage checked both for structural and data hazards
– Scoreboarding dates to the CDC 6600 in 1963

