Superscalar Processors

Superscalar Processors: Branch Prediction and Dynamic Scheduling

Superscalar: A Sequential Architecture

‹ Superscalar is a representative ILP implementation of a sequential architecture
- For every instruction issued by a superscalar processor, the hardware must check whether the operands interfere with the operands of any other instruction that is either (1) already in execution, (2) issued but waiting for completion of interfering instructions that would have been executed earlier in a sequential program, or (3) being issued concurrently but would have been executed earlier in the sequential execution of the program
- A superscalar processor issues multiple instructions per cycle

Superscalar Terminology

‹ Basic
- Superscalar: able to issue > 1 instruction / cycle
- Superpipelined: a deep, but not superscalar, pipeline; e.g., the MIPS R4000 has 8 stages
- Branch prediction: logic to guess whether or not a branch will be taken, and possibly the branch target
‹ Advanced
- Out-of-order: able to issue instructions out of program order
- Speculation: execute instructions beyond branch points, possibly nullifying them later
- Register renaming: able to dynamically assign physical registers to instructions
- Retire unit: logic to keep track of instructions as they complete

Superscalar Execution Example: Single Order, Data Dependence – In Order

‹ Assumptions
- A single FP adder takes 2 cycles
- A single FP multiplier takes 5 cycles
- Can issue an add & a multiply together
- Must issue in-order
- Operand format: in, in, out

v: addt $f2, $f4, $f10
w: mult $f10, $f6, $f10
x: addt $f10, $f8, $f12
y: addt $f4, $f6, $f4
z: addt $f4, $f8, $f10

(figure: data flow graph — v ($f2 + $f4) feeds w (× $f6), which feeds x (+ $f8 → $f12); y ($f4 + $f6) and z ($f4 + $f8 → $f10) hang off to the side; critical path v → w → x = 2 + 5 + 2 = 9 cycles)

‹ With a single adder, in-order issue, and these data dependences, execution is serialized along the critical path

Adding Advanced Features

‹ Out-of-Order Issue
- Can start y as soon as the adder is available
- Must hold back z until $f10 is not busy & the adder is available

Adding Advanced Features

‹ With Register Renaming

v: addt $f2, $f4, $f10a
w: mult $f10a, $f6, $f10a
x: addt $f10a, $f8, $f12
y: addt $f4, $f6, $f4
z: addt $f4, $f8, $f14

(z's destination is renamed from $f10 to $f14; v, w, and x use the renamed $f10a)

Flow Path Model of Superscalars

(figure: the I-cache and Branch Predictor feed FETCH → Instruction Buffer → DECODE, carrying the instruction flow; Integer, Floating-point, Media, and Memory units form EXECUTE; the Reorder Buffer (ROB) carries the register data flow into COMMIT; the Store Queue and D-cache carry the memory data flow)

Superscalar Pipeline Design

(figure: Fetch → Instruction Buffer → Decode → Dispatch Buffer → Dispatch → Issuing Buffer → Execute → Completion Buffer → Complete → Store Buffer → Retire; the upper stages carry the instruction flow, the lower stages the data flow)

Inorder Pipelines

(figure: parallel in-order pipes — IF, D1, D2, EX, WB replicated across pipelines, as in the Intel i486 and the U-pipe / V-pipe of the Intel Pentium)

‹ An inorder pipeline has no WAW and no WAR hazards (almost always true)

Pipelining 101: In-order Issue into Out-of-order Completion — Diversified Pipelines

‹ Instructions have the form RD ← Fn(RS, RT): a destination register, a function unit, and source registers
‹ In-order instruction stream, in program order:

Ia: F1 ← F2 x F3
Ib: F1 ← F4 + F5

(figure: after in-order IF, ID, and RD stages, instructions issue to diversified execution pipelines — INT, Fadd1–Fadd2, Fmult1–Fmult3, LD/ST — and then WB)

‹ Because the pipelines have different depths, Ib ("F4 + F5") can reach WB before Ia ("F2 x F3") — out-of-order completion
- What is the value of F1? WAW hazard!!
‹ The issue stage needs to check:
1. Structural dependence
2. RAW hazard
3. WAW hazard
4. WAR hazard

(figure: contrasting decoding and instruction issue in a scalar and a 4-way superscalar processor — both fetch from the I-cache into an instruction buffer and then decode/issue; the typical scalar fixed-point pipeline layout (F, D/I, ...) issues one instruction per cycle, while the 4-way superscalar (F, D, I, ...) decodes and issues up to four per cycle)

Superscalar Processors: Tasks

‹ Parallel decoding
‹ Superscalar instruction issue
‹ Parallel instruction execution
‹ Preserving sequential consistency of execution
‹ Preserving sequential consistency of exception processing

Superscalar Issues to be Considered

‹ Parallel decoding – a more complex task than in scalar processors
- A high issue rate can lengthen the decoding cycle; therefore use predecoding
- Partial decoding is performed while instructions are loaded into the instruction cache
‹ Superscalar instruction issue – a higher issue rate gives rise to higher processor performance, but amplifies the restrictive effects of control and data dependencies on processor performance as well
- To overcome these problems, designers use advanced techniques such as shelving, register renaming, and speculative branch processing

Superscalar Issues (continued)

‹ Parallel instruction execution task – also called "preservation of the sequential consistency of instruction execution"
- While instructions are executed in parallel, they are usually completed out of order with respect to a sequential operating procedure
‹ More EUs than in scalar processors, therefore a higher number of instructions in execution
- More dependency-check comparisons are needed

Pre-Decoding

‹ Predecoding – as the I-cache is being loaded, a predecode unit performs a partial decoding and appends a number of decode bits to each instruction. These bits usually indicate:
- the instruction class
- the type of resources required for the execution
- the fact that branch target addresses have been calculated
‹ Predecoding is used in the PowerPC 601, MIPS, and SuperSparc

The Principle of Predecoding

(figure: second-level cache (or memory) → typically 128 bits/cycle → predecode unit → e.g. 148 bits/cycle → I-cache)

‹ When instructions are written into the I-cache, the predecode unit appends 4-7 bits to each RISC instruction¹

¹ In the AMD K5, which is an x86-compatible CISC processor, the predecode unit appends 5 bits to each byte

Superscalar Instruction Issues

‹ Specify how false data dependencies and unresolved control dependencies are coped with during instruction issue
- The design options are either to avoid them during instruction issue, by using register renaming and speculative branch processing respectively, or not
‹ False data dependencies between register data may be removed by register renaming

Parallel Execution

‹ When instructions are executed in parallel, they will finish out of program order
- Unequal execution times
‹ Specific means are needed to preserve logical consistency
- Preservation of sequential consistency of execution
‹ Exceptions may occur during execution
- Preservation of sequential consistency of exception processing

Hardware Features to Support ILP

‹ Instruction Issue Unit
- Care must be taken not to issue an instruction if another instruction upon which it is dependent is not complete
- Requires complex control logic in superscalar processors
- Virtually trivial control logic in VLIW processors

Hardware Features to Support ILP

‹ Little ILP is typically found in basic blocks
- A basic block is a straight-line sequence of operations with no intervening control flow
‹ Multiple basic blocks must be executed in parallel
- Execution may continue along multiple paths before it is known which path will be executed
‹ Requirements for Speculative Execution
- Terminate unnecessary speculative computation once the branch has been resolved
- Undo the effects of the speculatively executed operations that should not have been executed
- Ensure that no exceptions are reported until it is known that the excepting operation should have been executed
- Preserve enough execution state at each speculative branch point to enable execution to resume down the correct path if the speculative execution happened to proceed down the wrong one

Hardware Features to Support ILP (continued)

‹ Speculative Execution
- Expensive in hardware
- The alternative is to perform speculative code motion at compile time, which requires less demanding hardware
• Move operations from subsequent blocks up past branch operations into preceding blocks
• Requires a mechanism to ensure that exceptions caused by speculatively scheduled operations are reported if and only if the flow of control is such that they would have been executed in the non-speculative version of the code
• Requires additional registers to hold the speculative execution state

Next. . . Superscalar

‹ How to deal with instruction flow
- Dynamic branch prediction
‹ How to deal with register/data flow
- Register renaming
‹ Solutions studied:
- Dynamic branch prediction algorithms
- Dynamic scheduling using the Tomasulo method

Summary of Discussions

‹ ILP processors
- VLIW/EPIC, superscalar
‹ Superscalar has hardware logic for extracting parallelism
- Solutions for stalls etc. must be provided in hardware
‹ Stalls play an even greater role in ILP processors
‹ Software solutions, such as code scheduling through code movement, can lead to improved execution times
- More sophisticated techniques are needed
- Can we provide some H/W support to help the compiler? – leads to EPIC/VLIW

Superscalar Pipeline Design

(figure: Fetch → Instruction Buffer → Decode → Dispatch Buffer → Dispatch → Issuing Buffer → Execute → Completion Buffer → Complete → Store Buffer → Retire; instruction flow through the buffers, data flow through Execute)

Instruction Fetch Bandwidth Solutions

‹ The ability to fetch a number of instructions from the cache is crucial to superscalar performance
- Use an instruction fetch buffer to prefetch instructions
- Fetch multiple instructions in one cycle to support the s-wide issue of superscalar processors
‹ Design the instruction cache (I-cache) to support this
- Shall discuss solutions when memory design is covered

(figure: Flow Path Model of Superscalars, repeated from earlier — I-cache and Branch Predictor → FETCH → Instruction Buffer → DECODE → Integer / Floating-point / Media / Memory EXECUTE units → Reorder Buffer (ROB) → COMMIT → Store Queue → D-cache)

Instruction Decoding Issues

‹ Primary tasks:
- Identify individual instructions
- Determine instruction types
- Detect inter-instruction dependences
‹ Predecoding
- Identify instruction classes
- Add more bits to the instruction after fetching
‹ Two important factors:
- Instruction set architecture
- Width of the parallel pipeline

(figure: The Principle of Predecoding, repeated from earlier — second-level cache (or memory) → typically 128 bits/cycle → predecode unit → e.g. 148 bits/cycle → I-cache; the predecode unit appends 4-7 bits to each RISC instruction, or 5 bits to each byte in the x86-compatible AMD K5)

Instruction Flow – Control Flow

‹ The throughput of the early pipeline stages places an upper bound on the performance of all subsequent stages
‹ Program control flow is represented by a Control Flow Graph (CFG)
- Nodes represent basic blocks of code
• A sequence of instructions with no incoming or outgoing branches
- Edges represent transfer of control flow from one block to another

Control Dependence and Branch Prediction

Control Flow Graph

‹ Shows possible paths of control flow through basic blocks
‹ Example (BB 1 – BB 5):

main: addi r2, r0, A      # BB 1
      addi r3, r0, B
      addi r4, r0, C
      addi r5, r0, N
      add  r10, r0, r0
      bge  r10, r5, end
loop: lw   r20, 0(r2)     # BB 2
      lw   r21, 0(r3)
      bge  r20, r21, T1
      sw   r21, 0(r4)     # BB 3
      b    T2
T1:   sw   r20, 0(r4)     # BB 4
T2:   addi r10, r10, 1    # BB 5
      addi r2, r2, 4
      addi r3, r3, 4
      addi r4, r4, 4
      blt  r10, r5, loop
end:

‹ Control Dependence
- Node X is control dependent on node Y if the computation in Y determines whether X executes

IBM's Experience on Pipelined Processors [Agerwala and Cocke 1987]

‹ Code characteristics (dynamic):
- loads – 25%
- stores – 15%
- ALU/RR – 40%
- branches – 20%
• 1/3 unconditional (always taken); unconditional – 100% schedulable
• 1/3 conditional taken
• 1/3 conditional not taken; conditional – 50% schedulable

Why Branches: Mapping a CFG to a Linear Instruction Sequence

‹ Basic blocks and their constituent instructions must be stored in sequential locations in memory
- In mapping a CFG to linear consecutive memory locations, additional unconditional branches must be added
‹ Encountering branches (conditional and unconditional) at run time induces deviations from the implied sequential control flow and consequent disruptions to sequential fetching of instructions
- These disruptions cause stalls in the instruction fetch (IF) stage and reduce overall IF bandwidth

(figure: a CFG with blocks A, B, C, D laid out in two possible linear orders; conditional branches and an added unconditional branch connect the blocks)

Branch Types and Implementation

‹ Types of Branches
- Conditional or unconditional?
- Subroutine call (aka link) – needs to save the PC?
- How is the branch target computed?
• Static targets, e.g. immediate, PC-relative
• Dynamic targets, e.g. register indirect

What's So Bad About Branches?

‹ Performance Penalties
- Use up execution resources
- Fragmentation of I-cache lines
- Disruption of sequential control flow
• Need to determine the branch direction (conditional branches)
• Need to determine the branch target

‹ Robs instruction fetch bandwidth and ILP

Branch Actions

‹ When branches occur, disruption to instruction fetch occurs
‹ For unconditional branches
- The subsequent instruction cannot be fetched until the target address is determined
‹ For conditional branches
- The machine must wait for resolution of the branch condition
- And if the branch is taken, it must also wait until the target address is computed
‹ Branch instructions are executed by the branch functional unit

CPU Performance

‹ Recall: CPU time = IC * CPI * Clk
- CPI = ideal CPI + stall cycles/instruction
- Minimizing CPI implies minimizing stall cycles
- Stall cycles arise from branch instructions
• How do we determine the number of stall cycles?
‹ Note: cost in superscalar/ILP processors = width (parallelism) x stall cycles
- 3 stall cycles on a 4-wide machine = 12 lost issue slots
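To make this cost model concrete, here is a minimal sketch of the arithmetic; the instruction mix and penalty values are hypothetical, chosen only for illustration (the 20% branch frequency echoes the IBM study above):

# Hypothetical numbers illustrating cost = width x stall cycles.
ideal_cpi = 0.25        # 4-wide machine: up to 4 instructions per cycle
branch_freq = 0.20      # ~20% of instructions are branches (IBM study)
mispredict_rate = 0.10  # assumed misprediction rate
branch_penalty = 3      # assumed stall cycles per mispredicted branch

stall_cpi = branch_freq * mispredict_rate * branch_penalty
cpi = ideal_cpi + stall_cpi
print(f"CPI = {cpi:.2f}")                          # 0.25 + 0.06 = 0.31
print(f"lost issue slots = {4 * branch_penalty}")  # 3 stalls x 4-wide = 12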

Branch Penalties / Stall Cycles

‹ When a branch occurs, two things are needed:
- The branch target address (BTA) has to be computed
- The branch condition has to be resolved
‹ Addressing modes affect the BTA delay
- For PC-relative, the BTA can be generated during the Fetch stage, for a 1-cycle penalty
- For register indirect, the BTA is generated after the Decode stage (to access the register) = 2-cycle penalty
- For register indirect with offset = 3-cycle penalty
‹ For branch condition resolution, the penalty depends on the method
- If condition code registers are used, then penalty = 2
- If the ISA permits comparison of 2 registers, the output of the ALU is needed => 3 cycles
‹ The penalty is the maximum of the penalties for condition resolution and BTA generation

Condition Resolution

(figure: the superscalar pipeline — Fetch, Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Branch/Execute, Finish, Completion Buffer, Complete, Store Buffer, Retire — annotated with condition resolution via the CC register (stall = 2) and via GP register value comparison (stall = 3))

Target Address Generation

(figure: the same pipeline annotated with target-address generation — PC-relative: stall = 1; register indirect: stall = 2; register indirect with offset: stall = 3)

What To Do With Branches

‹ To maximize sustained instruction fetch bandwidth, the number of stall cycles in the fetch stage must be minimized
‹ The primary aim of instruction flow techniques (branch prediction) is to minimize stall cycles and/or make use of these cycles to do useful work
- Predict the branch outcome and do work during potential stall cycles
- Note that there must be a mechanism to validate the prediction and to safely recover from misprediction

Doing Away with Branches: Riseman and Foster's Study

‹ 7 programs on the CDC-3600
‹ Assume an infinite machine:
- Infinite memory and instruction stack, infinite functional units
- Consider only true dependency at the data-flow limit
‹ If bounded to a single basic block, i.e. no bypassing of branches ⇒ the maximum speedup is 1.72
‹ Suppose one can bypass conditional branches and jumps (i.e. assume the actual branch path is always known, such that branches do not impede instruction execution):

Branches bypassed:  0     1     2     8     32    128
Max speedup:        1.72  2.72  3.62  7.21  24.4  51.2

Determining Branch Direction

Problem: cannot fetch subsequent instructions until the branch direction is determined

‹ Minimize the penalty
- Move the instruction that computes the branch condition away from the branch (ISA & compiler)
‹ Make use of the penalty
- Bias for not-taken
- Fill delay slots with useful/safe instructions (ISA & compiler)
- Follow both paths of execution (hardware)
- Predict the branch direction (hardware)

Determining Branch Target

Problem: cannot fetch subsequent instructions until the branch target is determined

‹ Minimize the delay
- Generate the branch target early in the pipeline
‹ Make use of the delay
- Bias for not taken
- Predict the branch target
• PC-relative vs register indirect targets

Keys to Branch Prediction

‹ Target Address Generation
- Access a register
• PC, GP register, link register
- Perform a calculation
• +/- offset, auto incrementing/decrementing
⇒ Target Speculation
‹ Condition Resolution
- Access a register
• Condition code register, data register, count register
- Perform a calculation
• Comparison of data register(s)
⇒ Condition Speculation

History-based Branch Target Speculation – Branch Target Buffer

‹ If you have seen this branch instruction before, can you figure out the target address faster?
- Create a history table
‹ How to organize the "history table"?
‹ Use a branch target buffer (BTB) to store previous branch target addresses
‹ The BTB is a small fully associative cache
- Accessed during instruction fetch using the PC
‹ A BTB entry can have three fields
- Branch instruction address (BIA)
- Branch target address (BTA)
- History bits
‹ When the PC matches a BIA, there is a hit in the BTB
- A hit in the BTB implies that the instruction being fetched is a branch instruction
- The BTA field can be used to fetch the next instruction if this particular branch is predicted to be taken
- Note: the branch instruction is still fetched and executed, for validation/recovery

Branch Target Buffer (BTB)

‹ A small "cache-like" memory in the instruction fetch stage
- Does not affect the instruction set architecture
‹ Remembers previously executed branches: their addresses (tag), prediction (most recent history), and most recent target addresses
‹ The instruction fetch stage compares the current PC against those in the BTB to "guess" the nPC
- If matched, then a prediction is made; else nPC = PC+4
- If predicted taken, then nPC = target address in the BTB; else nPC = PC+4
‹ When the branch is actually resolved, the BTB is updated

Branch Condition Speculation

‹ Biased For Not Taken
- Not effective in loops
‹ Software Prediction
- Encode an extra bit in the branch instruction
• Predict not taken: set bit to 0
• Predict taken: set bit to 1
- Bit set by the compiler or user; can use profiling information to aid prediction
- Static prediction: same behavior every time
‹ Prediction Based on Branch Offsets
- Positive offset: predict not taken
- Negative offset: predict taken
‹ Prediction Based on History
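A minimal sketch of the BTB lookup/update logic just described. The slides describe a fully associative cache; for brevity this sketch uses a direct-mapped table, and the field layout and 2-bit history policy are illustrative assumptions:

# Minimal branch target buffer sketch: tag (BIA), target (BTA), 2-bit history.
class BTB:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = {}                       # index -> (bia, bta, history)

    def lookup(self, pc):
        """At fetch: return the predicted nPC; miss or not-taken => PC+4."""
        e = self.table.get(pc % self.entries)
        if e and e[0] == pc and e[2] >= 2:    # history >= 2 => predict taken
            return e[1]                       # fetch from the stored BTA
        return pc + 4                         # sequential fetch

    def update(self, pc, taken, target):
        """At resolution: allocate/refresh the entry and nudge the history."""
        idx = pc % self.entries
        _, _, hist = self.table.get(idx, (pc, target, 1))
        hist = min(hist + 1, 3) if taken else max(hist - 1, 0)
        self.table[idx] = (pc, target, hist)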

Branch Instruction Speculation

(figure: the fetch stage augmented with a branch predictor using a BTB — an FA-mux selects the nPC from the speculative target, the speculative condition, or nPC(seq.) = PC+4, i.e. nPC = BP(PC); when the branch resolves in the Execute stage, the BTB is updated with the target address and history)

Branch Prediction Function

‹ Based on opcode only (% correct):

IBM1  IBM2  IBM3  IBM4  DEC   CDC
66    69    71    55    80    78

‹ Based on the history of the branch
- Branch prediction function F(X1, X2, ....)
- Use up to 5 previous branches for history (% correct):

History  IBM1  IBM2  IBM3  IBM4  DEC   CDC
0        64.1  64.4  70.4  54.0  73.8  77.8
1        91.9  95.2  86.6  79.7  96.5  82.3
2        93.3  96.5  90.8  83.4  97.5  90.6
3        93.7  96.7  91.2  83.5  97.7  93.5
4        94.5  97.0  92.0  83.7  98.1  95.3
5        94.7  97.1  92.2  83.9  98.2  95.7

History-based Prediction

‹ Make a prediction based on previous observation
- Historical info on the direction taken by a branch in previous executions can hint at the direction taken in the future
‹ How much history? What prediction?
- History means: how many branches, and for each branch, how much history?
‹ Where to store the history?

Finite State Machine based Predictors

‹ FSMs
- Capture history
- Are easy and fast to design and implement
- Transition from one state to another on input
‹ FSM branch prediction algorithm
- n state variables encode the direction taken by the last n executions of the branch
• Each state represents a particular history pattern in terms of taken/not-taken (T/NT)
• Output logic generates the prediction based on the history
- When the predicted branch is finally executed, use the actual outcome to transition to the next state
- Next-state logic: chain the state variables into a shift register

N-bit Predictors

‹ Store the outcome of the last N occurrences of the branch
- A 1-bit predictor stores the history of the last branch only
‹ The values of the N bits are the "state" of the branch predictor
‹ Use the history of the last N occurrences to predict the next
- Use the value of the N-bit 'state' to predict the branch – this is the prediction algorithm
- Implement the 'algorithm' using some logic gates
- How much time does the algorithm have to compute the outcome?
‹ The larger the size of N, the more hardware is needed to implement the N-bit predictor
‹ How many branch instructions?
- Size of / entries in the branch history table or BTB: 1024, 2048, etc.

2-bit Predictors

‹ Use 2 history bits to track the outcome of the 2 previous executions of the branch
- The 2 bits are the status of an FSM
- NN, NT, TN, TT
‹ Each FSM represents a prediction algorithm
‹ To support history-based prediction, the BTB includes a history field for each branch
- Retrieve the target address plus the history bits
- Feed the history bits to the logic that generates the next state and the prediction
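A minimal sketch of one such FSM, the classic 2-bit saturating counter — just one of several possible 2-bit prediction algorithms, with an assumed weakly-not-taken initial state:

# 2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken.
class TwoBitCounter:
    def __init__(self, state=1):   # start weakly not-taken (an assumption)
        self.state = state

    def predict(self):
        return self.state >= 2     # True => predict taken

    def update(self, taken):
        # Saturate at 0 and 3, so a single anomalous outcome (e.g. a loop
        # exit) does not immediately flip a strongly-held prediction.
        self.state = min(self.state + 1, 3) if taken else max(self.state - 1, 0)

# Example: a loop branch taken 9 times, then not taken once on exit.
ctr, correct = TwoBitCounter(), 0
for outcome in [True] * 9 + [False]:
    correct += (ctr.predict() == outcome)
    ctr.update(outcome)
print(f"{correct}/10 correct")     # 8/10 from the weakly-not-taken start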

Example Prediction Algorithm

(figure: a 2-bit FSM whose state is the last two branch outcomes — TT, TN, NT, NN — with T/N transitions; each state's output is the next prediction)

‹ Prediction accuracy approaches its maximum with as few as 2 preceding branch occurrences used as history
‹ Results (% correct):

IBM1  IBM2  IBM3  IBM4  DEC   CDC
93.3  96.5  90.8  83.4  97.5  90.6

How Does the Prediction Algorithm Work?

i = 100; x = 30; y = 50;
While (i > 0) do          /* Branch 1 */
{
  If (x > y) then         /* Branch 2 */
    {then part}           /* no changes to x, y in this code */
  else
    {else part}
  i = i - 1;
}

‹ Two branches in this code: B1, B2
- How many times is each executed?
‹ Assume the history bits are TN for B1 and TT for B2, using the same 2-bit predictor for all branches
- Prediction for B1: ? Prediction for B2: ?

Other Prediction Algorithms

(figure: two 2-bit FSM variants with states T, t?, n?, N — a saturation counter and a hysteresis counter, which differ in how a misprediction moves the state back toward the opposite prediction)

IBM RS/6000 Study [Nair, 1992]

‹ Five different branch types
- b: unconditional branch
- bl: branch and link (subroutine calls)
- bc: conditional branch
- bcr: conditional branch using the link register (subroutine returns)
- bcc: conditional branch using the count register (system calls)
‹ Separate branch function unit to overlap branch instructions with other instructions
‹ Two causes for branch stalls
- Unresolved conditions
- Branches downstream too close to unresolved branches
‹ Combining prediction accuracy with the BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide a net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement.

Number of Counter Bits Needed

‹ Benchmark prediction accuracy in % (overall CPI overhead):

Benchmark   3-bit         2-bit         1-bit         0-bit
spice2g6    97.0 (0.009)  97.0 (0.009)  96.2 (0.013)  76.6 (0.031)
doduc       94.2 (0.003)  94.3 (0.003)  90.2 (0.004)  69.2 (0.022)
gcc         89.7 (0.025)  89.1 (0.026)  86.0 (0.033)  50.0 (0.128)
espresso    89.5 (0.045)  89.1 (0.047)  87.2 (0.054)  58.5 (0.176)
li          88.3 (0.042)  86.8 (0.048)  82.5 (0.063)  62.4 (0.142)
eqntott     89.3 (0.028)  87.2 (0.033)  82.9 (0.046)  78.4 (0.049)

Accuracy of Different Schemes

(figure: frequency of mispredictions for nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li under three schemes — a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) BHT; misprediction rates range from 0% to 18%)

‹ Branch history table size: a direct-mapped array of 2^k entries
‹ Programs, like gcc, can have over 7000 conditional branches
‹ In collisions, multiple branches share the same predictor
‹ The variation of branch penalty with branch history table size levels out at 1024 entries

Advanced Branch Prediction: BTB for Superscalar Fetch

(figure: the PC indexes the I-cache and, in parallel, a Branch Target Address Cache and Branch History Table ahead of the decode and dispatch buffers and the BRN/SFX/SFX/CFX/FPU/LS reservation stations; a +16 incrementer computes the next sequential fetch address, with feedback from branch resolution)
- BTB input: the current cache line address
- BTB output: the next cache line address, and how many words of the current line are useful

Multi-level Branch Prediction

‹ So far, the prediction of each static branch instruction
- is based solely on its own past behavior and not on the behaviors of other neighboring static branch instructions
- does not take into account the dynamic context within which the branch is being executed
• e.g., it does not use any info on the particular control flow path taken in arriving at that branch
- uses the same algorithm for prediction regardless of dynamic context
‹ Experimental observations reveal that the behavior of some branches is strongly correlated with that of other branches
- More accurate prediction can be achieved by taking into account the branch history of other related branches and adapting the algorithm to the dynamic branching context
‹ Will cover some advanced branch prediction schemes later – after dataflow and scheduling

Control Flow Speculation

(figure: a speculation tree — each predicted branch forks NT/T paths, with tag1 at the first branch, tag2 at the second level, and tag3 at the third)

‹ Branch speculation involves predicting the direction of a branch and then proceeding to fetch along that path
- Fetching on that path may encounter more branch instructions
‹ Must provide validation and recovery mechanisms
‹ To identify speculated instructions, tagging is used
- A tagged instruction indicates a speculative instruction
- One tag value per basic block (branch)
‹ Validation occurs when the branch is executed and the outcome is known; the correctness of the prediction is then known
- Prediction correct = de-allocate the speculation tag
- Incorrect prediction = terminate the incorrect path and fetch from the correct path
‹ Leading Speculation
- Tag speculative instructions
- Advance the branch and following instructions
- Buffer the addresses of speculated branch instructions

Branch Mis-prediction (Mis-speculation) Recovery

‹ Eliminate the Incorrect Path
- Use the branch tag(s) to deallocate completion buffer entries occupied by speculative instructions (now determined to be mis-speculated)
- Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations
- Must ensure that the mis-speculated instructions produce no side effects
- How expensive is a misprediction?
‹ Start the New Correct Path
- Update the PC with the computed branch target (if the branch was predicted NT)
- Update the PC with the sequential instruction address (if it was predicted T)
- Can begin speculation once again upon encountering a new branch
- Must have remembered the alternate (non-predicted) path
- How soon can you restart?

Trailing Confirmation

(figure: the same speculation tree; as each branch resolves, its tag is deallocated along the confirmed path)

‹ When the branch is resolved, remove/deallocate the speculation tag
‹ Permit completion of the branch and following instructions

Impediments to Parallel/Wide Fetching

‹ Average Basic Block Size
- integer code: 4-6 instructions
- floating-point code: 6-10 instructions
‹ Branch Prediction Mechanisms
- must make multiple branch predictions per cycle
- potentially multiple predicted-taken branches
‹ Conventional I-Cache Organization – discussed later
- must fetch from multiple predicted-taken targets per cycle
- must align and collapse multiple fetch groups per cycle
…Trace Caching!!

Recap

‹ CPU time = IC * CPI * Clk
- CPI = ideal CPI + stall cycles/instruction
- Stall cycles are due to (1) control hazards and (2) data hazards
‹ What did branch prediction do?
- Tries to reduce the number of stall cycles from control hazards
‹ What about stall cycles from data hazards?
- Next..

Next – Register Dataflow and Dynamic Scheduling

‹ Branch prediction provides a solution to the control flow problem and increases instruction flow bandwidth
- Stalls due to control flow changes can decrease performance
‹ The next step is flow in the execute stage – register data flow
- Parallel execution of instructions
- Keep dependencies in mind
• Remove false dependencies, honor true dependencies
• An "infinite" register set can remove false dependencies
- Go back and look at the nature of true dependencies using the data flow diagram of a computation

(figure: Superscalar Pipeline Design, repeated from earlier — Fetch → Instruction Buffer → Decode → Dispatch Buffer → Dispatch → Issuing Buffer → Execute → Completion Buffer → Complete → Store Buffer → Retire)

Superscalar Execution Example: Single Order, Data Dependence – In Order (repeated from earlier)

‹ Assumptions
- A single FP adder takes 2 cycles
- A single FP multiplier takes 5 cycles
- Can issue an add & a multiply together; the critical path is 9 cycles
- Must issue in-order; operand format: in, in, out

v: addt $f2, $f4, $f10
w: mult $f10, $f6, $f10
x: addt $f10, $f8, $f12
y: addt $f4, $f6, $f4
z: addt $f4, $f8, $f10

HW Schemes: Instruction Parallelism

‹ Why in HW at run time?
- Works when the real dependence can't be known at compile time
- The compiler is simpler
- Code for one machine runs well on another
‹ Key idea: allow instructions behind a stall to proceed

DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14

- Enables out-of-order execution => out-of-order completion
- The ID stage checked for both structural and data hazards
- The scoreboard dates to the CDC 6600 in 1963

Where Do False Dependencies Come From – i.e., Who Messed Up?

‹ ALU ops are R_d ← Fi(Rj, Rk)
- If the functional unit is not available, then a structural hazard
- If a source operand is not available, then a data hazard due to a true dependency
- If the destination register R_d is not available, then a hazard due to anti and output dependencies, i.e., false dependencies
‹ Recycling/reuse of destination registers leads to dependencies
- Static recycling is due to register allocation in compilation
• Code generation targets an infinite set of symbolic registers
• Register allocation maps this into the finite set of architected registers, i.e., register recycling/reuse
- A dynamic form of register recycling occurs during the execution of loops

Adding Advanced Features (repeated from earlier)

‹ Out-of-Order Issue
- Can start y as soon as the adder is available
- Must hold back z until $f10 is not busy & the adder is available

v: addt $f2, $f4, $f10
w: mult $f10, $f6, $f10
x: addt $f10, $f8, $f12
y: addt $f4, $f6, $f4
z: addt $f4, $f8, $f10

‹ With Register Renaming

v: addt $f2, $f4, $f10a
w: mult $f10a, $f6, $f10a
x: addt $f10a, $f8, $f12
y: addt $f4, $f6, $f4
z: addt $f4, $f8, $f14

Register Renaming

‹ Dynamically assign different names to the multiple definitions of the same register, thereby removing false dependencies
‹ Use a separate rename register file (RRF) in addition to the architected register file (ARF)
- One way: duplicate the ARF and use the RRF as a shadow version — not an efficient way to use the RRF
‹ Where to place the RRF
- Implement a stand-alone structure similar to the ARF
- Incorporate the RRF as part of the reorder buffer
• Can be inefficient, since you need a rename entry for every instruction in flight even though not every instruction defines a register
‹ Do we really need separate ARF and RRF?
- They can be pooled together, which saves data-transfer interconnect but makes context switching more complex
‹ What does renaming involve?
- Source read – at the decode/dispatch stage
• Access the source register; check whether the operand is ready; determine which actual register should be accessed if renamed
- Destination allocate – at the decode/dispatch stage
• The destination register is used to index into the ARF; that register now has a pending write; set the mapping to specify which RRF entry is used (i.e., the renaming mapping)
- Register update
• Involves updating the RRF when the instruction completes and then copying from the RRF into the ARF
‹ Use a busy bit to indicate whether renaming has occurred for an ARF entry, and if so, a map to indicate the name
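A minimal sketch of this busy-bit-plus-map bookkeeping. It is a simplification under stated assumptions: register update, freeing of RRF entries, and in-flight ordering are omitted, and all names and sizes are illustrative:

# Rename-map sketch: each ARF entry carries a busy bit and a map to an RRF tag.
class RenameFile:
    def __init__(self, n_arch=8, n_rename=16):
        self.arf = [0.0] * n_arch
        self.busy = [False] * n_arch     # pending write to this arch register?
        self.map = [None] * n_arch       # RRF tag holding the newest definition
        self.free = list(range(n_rename))

    def read_source(self, r):
        """Source read: the ARF value if not busy, else the RRF tag to wait on."""
        return ("value", self.arf[r]) if not self.busy[r] else ("tag", self.map[r])

    def allocate_dest(self, r):
        """Destination allocate: grab a free RRF entry and remap r to it."""
        tag = self.free.pop(0)
        self.busy[r], self.map[r] = True, tag
        return tag

rf = RenameFile()
t1 = rf.allocate_dest(4)   # e.g. v writes arch register 4 -> RRF tag t1
t2 = rf.allocate_dest(4)   # a later write to the same register gets tag t2
print(rf.read_source(4))   # ('tag', 1): readers now wait on t2, not t1

The two allocations to the same architected register get distinct physical names, which is exactly how the WAW and WAR hazards in the v/w/x/y/z example disappear.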

Back to the Register Dataflow Problem

‹ Register renaming can eliminate false dependencies
‹ False dependencies can introduce stalls into the superscalar
‹ The register dataflow problem:
- Issue instructions in parallel if there are no true dependencies
‹ How much parallelism is there in the code?
- The data flow limit to program execution is the critical path in the program
- The data flow execution model stipulates that every instruction begins execution immediately in the cycle after all its operands become ready
• All register data flow techniques attempt to approach this limit
• What are the obstacles to this execution model?

What Will We Study…

‹ Today: cover the basic dynamic scheduling method
- Register renaming
- Tomasulo Method – V1.0!
‹ Next week – "modify" the basic scheduler to handle speculation, out of order, etc.

Register Dataflow – Key Concepts

‹ Simulating the data flow graph will eliminate false dependencies and allow maximum parallelism
- Subject to resource constraints
‹ How can we have 'infinite' registers?
- Remove references to registers and replace them with the data flow graph information
• Note that we should not actually construct the data flow graph
‹ Work with non-speculative instructions to provide a solution
- Then add branch prediction speculation support to modify the solution and get a speculative out-of-order processor!

Example

A: R4 ← R0 + R8        latencies:
B: R2 ← R0 * R4        add  = 2
C: R4 ← R4 + R8        mult = 3
D: R8 ← R4 * R2

‹ What is the (true) data flow?
‹ How does execution work?
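As a worked check of the dataflow limit for this example, here is a minimal sketch that starts each instruction one cycle after all of its true producers finish — ignoring resource constraints and the false dependencies on R4 and R8, which renaming would remove (the dependence lists are read off the code above):

# Dataflow-limit schedule: start = cycle after all true producers finish.
latency = {"A": 2, "B": 3, "C": 2, "D": 3}      # A, C are adds; B, D are mults
true_deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

finish = {}
for inst in ["A", "B", "C", "D"]:
    start = max((finish[p] for p in true_deps[inst]), default=0) + 1
    finish[inst] = start + latency[inst] - 1
    print(inst, "starts", start, "finishes", finish[inst])
# Critical path A -> B -> D: 2 + 3 + 3 = 8 cycles; C overlaps with B.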

HW Schemes: Instruction Parallelism

‹ Out-of-order execution divides the ID stage:
1. Issue — decode instructions, check for structural hazards
2. Read operands — wait until there are no data hazards, then read operands
‹ Two major schemes:
- Scoreboard
- Tomasulo algorithm
‹ They allow an instruction to execute whenever 1 & 2 hold, not waiting for prior instructions
‹ In-order issue, out-of-order execution, out-of-order commit (also called completion)

Advantages of Dynamic Scheduling

‹ Handles cases when dependences are unknown at compile time
- (e.g., because they may involve a memory reference)
‹ It simplifies the compiler
‹ Allows code compiled for one pipeline to run efficiently on a different pipeline
‹ Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling

A Dynamic Algorithm: Tomasulo's Algorithm

‹ For the IBM 360/91 (before caches!)
‹ Goal: high performance without special compilers
‹ The small number of floating-point registers (4 in the 360) prevented interesting compiler scheduling of operations
- This led Tomasulo to try to figure out how to get more effective registers — renaming in hardware!
‹ Why study a 1966 design? The descendants of this have flourished!
- Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, …

Tomasulo Method: Approach

‹ Tracks when operands for instructions are available
- Minimizes RAW hazards
‹ Register renaming
- Minimizes WAR and WAW hazards
‹ Many variations are in use today, but the key remains
- Tracking instruction dependencies to allow execution as soon as operands are available, and renaming registers to avoid WAR and WAW
- Basically, it tries to follow the data-flow execution

IBM 360/91 FP Unit: Diversified Pipelines

‹ Multiple functional units (FUs)
- Floating-point add
- Floating-point multiply/divide
‹ Three register files (a pseudo reg-reg machine in the FP unit)
- (4) floating-point registers (FLR)
- (6) floating-point buffers (FLB)
- (3) store data buffers (SDB)
‹ Out-of-order instruction execution:
- After decode, the instruction unit passes all floating-point instructions (in order) to the floating-point operation stack (FLOS)
- In the floating-point unit, instructions are then further decoded and issued from the FLOS to the two FUs
‹ Variable operation latencies (not pipelined):
- Floating-point add: 2 cycles
- Floating-point multiply: 3 cycles
- Floating-point divide: 12 cycles

Tomasulo: Inorder Issue, Out-of-order Complete

‹ Control & buffers are distributed with the Function Units (FUs)
- FU buffers are called "reservation stations"; they hold pending operands
• If the FU is busy, then instead of stalling, issue to a reservation station, which is a set of buffers for the FU — i.e., a virtual functional unit
‹ Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming
- Avoids WAR, WAW hazards
- More reservation stations than registers, so it can do optimizations compilers can't
‹ Results go to FUs from the RSs, not through registers, over the Common Data Bus that broadcasts results to all FUs
- The CDB connects the outputs of the FUs to the reservation stations and the store buffer
‹ Loads and stores are treated as FUs with RSs as well
‹ Integer instructions can go past branches, allowing ops beyond the basic block

Reservation Stations

‹ Buffers where instructions can wait for RAW hazard resolution and execution
‹ Associate more than one set of buffering registers (control, source, sink) with each FU ⇒ virtual FUs
- Add unit: three reservation stations
- Multiply/divide unit: two reservation stations
‹ Pending (not yet executing) instructions can have either value operands or pseudo operands (aka tags)

Reservation Station Components

Op: the operation to perform in the unit (e.g., + or –)
Vj, Vk: the values of the source operands
- Store buffers have a V field: the result to be stored
Qj, Qk: the reservation stations producing the source operands
- Note: Qj,Qk = 0 => ready
- Store buffers only have Qi, for the RS producing the result
Busy: indicates the reservation station or FU is busy

Register result status — indicates which functional unit will write each register, if one exists; blank when there are no pending instructions that will write that register.

Rename Tags

‹ Register names are normally bound to FLR registers
‹ When an FLR register is stale, the register "name" is bound to the pending-update instruction
‹ Tags are names used to refer to these pending-update instructions
‹ In Tomasulo, a "tag" is statically bound to the buffer where a pending-update instruction waits
‹ Instructions can be dispatched to RSs with either value operands or just tags
- A tag operand ⇒ an unfulfilled RAW dependence
- The instruction in the RS corresponding to the tag will eventually produce the actual value

Common Data Bus (CDB)

‹ The CDB is driven by all units that can update the FLR
- When an instruction finishes, it broadcasts both its "tag" and its result on the CDB
- Facilitates forwarding results directly from producer to consumer
- Why don't we need the destination register name?
‹ Sources of the CDB:
- Floating-point buffers (FLB)
- The two FUs (the add unit and the multiply/divide unit)
‹ The CDB is monitored by every unit that was left holding a tag instead of a value operand
- Listens for the tag broadcast on the CDB
- If a tag matches, grab the value
‹ Destinations of the CDB:
- Reservation stations
- Store data buffers (SDB)
• Hold data to be written into memory due to a store operation
- Floating-point registers (FLR)

Tomasulo Organization

(figure: the FP op queue and six load buffers (Load1–Load6, from memory) feed the reservation stations — Add1–Add3 in front of the FP adders, Mult1–Mult2 in front of the FP multipliers — alongside the FP registers and three store buffers to memory; the Common Data Bus (CDB) connects the FU outputs back to the reservation stations, registers, and store buffers)

Superscalar Execution Check List

INSTRUCTION PROCESSING CONSTRAINTS
‹ Resource contention (structural dependences)
‹ Code dependences
- Control dependences
- Data dependences
• True dependences (RAW)
• Storage conflicts
− Anti-dependences (WAR)
− Output dependences (WAW)

Three Stages of Tomasulo Algorithm

1. Issue — get an instruction from the FP op queue
   If a reservation station is free (no structural hazard), the control issues the instruction & sends the operands (renames registers).
2. Execute — operate on operands (EX)
   When both operands are ready, execute (wake up the instruction if all ready bits are set); if not ready, watch the Common Data Bus for the result.
3. Write result — finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark the reservation station available.

‹ Normal data bus: data + destination ("go to" bus)
‹ Common data bus: data + source ("come from" bus)
- 64 bits of data + 4 bits of functional-unit source address
- Write if it matches the expected functional unit (which produces the result)
- Does the broadcast forwarding
‹ Example speed: 3 clocks for FP +,−; 10 for *; 40 clocks for /

Dependence Resolution

‹ Structural dependence: virtual FUs
- Can send more instructions than the number of FUs
‹ True dependence: tags + CDB
- If an operand is available in the FLR, it is copied to the RS
- If an operand is not available, then a tag is copied to the RS instead; this tag identifies the source (RS/instruction) of the pending write
• RAW does not block subsequent instructions or the FU
‹ Anti-dependence: operand copying
- If an operand is available in the FLR, it is copied to the RS with the issuing instruction
‹ Output dependence: "register renaming" + result forwarding

Tomasulo Example

Instruction stream (3 load buffers, 3 FP adder reservation stations, 2 FP multiplier reservation stations; the slides step through tables of instruction status, reservation stations, and register result status against a clock cycle counter):

LD    F6, 34+R2
LD    F2, 45+R3
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Cycle-by-cycle highlights:
- Cycle 1: the first LD issues to Load1.
- Cycle 2: the second LD issues to Load2 — note: multiple loads can be outstanding.
- Cycle 3: MULTD issues to Mult1 with value R(F4) and tag Load2 — register names are removed ("renamed") in the reservation stations. Load1 is completing; what is waiting for Load1?
- Cycle 4: SUBD issues to Add1 with value M(A1) and tag Load2. Load2 is completing; what is waiting for Load2?
- Cycle 5: DIVD issues to Mult2 with value M(A1) and tag Mult1; the timers start counting down for Add1 (2 cycles) and Mult1 (10 cycles).
- Cycle 6: ADDD issues to Add2 despite the name dependency on F6 — the register result status for F6 now points to Add2.
- Cycle 7: Add1 (SUBD) is completing; what is waiting for it?
- Cycle 8: SUBD writes its result (M−M) on the CDB; Add2 grabs it as an operand for ADDD.

Tomasulo Drawbacks

‹ Complexity
- Delays of the 360/91, MIPS 10000, Alpha 21264, and IBM PPC 620 appear in CA:AQA 2/e, but not in silicon!
‹ Many associative stores (CDB) at high speed
‹ Performance limited by the Common Data Bus
- Each CDB must go to multiple functional units ⇒ high capacitance, high wiring density
- The number of functional units that can complete per cycle is limited to one!
• Multiple CDBs ⇒ more FU logic for parallel associative stores
‹ Non-precise interrupts!

Why Can Tomasulo Overlap Iterations of Loops?

‹ Register renaming
- Multiple iterations use different physical destinations for registers (dynamic loop unrolling)
‹ Reservation stations
- Permit instruction issue to advance past integer control-flow operations
- Also buffer old values of registers — totally avoiding the WAR stall that we saw in the scoreboard
‹ Another perspective: Tomasulo builds the data-flow dependency graph on the fly

Tomasulo's Scheme Offers 2 Major Advantages

(1) The distribution of the hazard detection logic
- Distributed reservation stations and the CDB
- If multiple instructions are waiting on a single result, and each instruction has its other operand, then the instructions can be released simultaneously by the broadcast on the CDB
- If a centralized register file were used, the units would have to read their results from the registers when the register buses are available
(2) The elimination of stalls for WAW and WAR hazards

Relationship Between Precise Interrupts and Speculation

‹ Speculation is a form of guessing
‹ Important for branch prediction:
- Need to "take our best shot" at predicting the branch direction
‹ If we speculate and are wrong, we need to back up and restart execution at the point at which we predicted incorrectly
- This is exactly the same as precise exceptions!
‹ Technique for both precise interrupts/exceptions and speculation: in-order completion or commit

Multiple Issue ILP Processors

‹ In a statically scheduled superscalar, instructions issue in order, and all pipeline hazards are checked at issue time
- An instruction causing a hazard will force subsequent instructions to be stalled
‹ In a statically scheduled VLIW, the compiler generates multiple-issue packets of instructions
‹ During instruction fetch, the pipeline receives a number of instructions from the IF stage – an issue packet
- Examine each instruction in the packet: if no hazard, then issue; else wait
- The issue unit examines all instructions in the packet
• This complexity implies further splitting of the issue stage

Dynamic Multiple Issue

‹ To issue multiple instructions per clock, the key is assigning reservation stations and updating the pipeline control tables
‹ Two approaches
- Run this step in half a clock cycle, so two instructions can be processed in one clock cycle
- Build the logic needed to handle two instructions at once

Dynamic Scheduling – Execution Core

‹ Current superscalar processors provide "out of order" execution
- The architecture is an out-of-order core sandwiched between an in-order front-end and an in-order back-end
- In-order front end
• Fetches and dispatches instructions in program order
- In-order back-end
• Completes and retires instructions in program order
- Dynamic execution core
• A refinement of the Tomasulo method
• Has three parts
− Instruction dispatch: rename registers, allocate reservation stations
− Instruction execution: execute, forward results
− Instruction completion

Extending Tomasulo

‹ Have to allow for speculative instructions
- Separate the bypassing of results among instructions from the actual completion of the instruction
‹ Tag instructions as speculative until they are validated – and then commit the instruction
‹ Key idea: allow instructions to execute out of order but commit in order
- Requires a reorder buffer (ROB) to hold and pass results among speculated instructions
- The ROB is a source of operands
- The register file is not updated until the instruction commits
• Therefore the ROB supplies operands in the interval between the completion of an instruction and its commit

Reorder Buffer (ROB)

‹ The reorder buffer contains all instructions in flight, i.e., all instructions dispatched but not completed
- Instructions waiting in reservation stations
- Instructions executing in functional units
- Instructions finished executing but waiting to be completed in program order
‹ The status of each instruction can be tracked using several bits in each ROB entry
- An additional bit indicates whether the instruction is speculative
- Only finished and non-speculative instructions can be committed
- An instruction marked invalid is not architecturally completed when exiting the reorder buffer

HW Support: Reorder Buffer

‹ Need a HW buffer for the results of uncommitted instructions: the reorder buffer
- 3 fields: instruction, destination, value
- Use the reorder buffer number instead of the reservation station when execution completes
- Supplies operands between execution complete & commit
- (The reorder buffer can be an operand source => more registers, like RSs)
- Instructions commit in order
- Once an instruction commits, its result is put into the register file
- As a result, it is easy to undo speculated instructions on mispredicted branches or exceptions

(figure: the reorder buffer sits between the FP op queue, the FP registers, and the reservation stations of the FP adders; each entry has Dest, Reg, Result, and Valid fields, and exceptions are held until commit)

What Are the Hardware Complexities of the Reorder Buffer (ROB)?

‹ How do you find the latest version of a register?
- (As specified by the Smith paper) need an associative comparison network
- Could use a future file, or just use the register result status buffer to track which specific reorder buffer entry has received the value
‹ Need as many ports on the ROB as on the register file

Four Steps of the Speculative Tomasulo Algorithm

1. Issue — get an instruction from the FP op queue
   If a reservation station and a reorder buffer slot are free, issue the instruction & send the operands & the reorder buffer number for the destination (this stage is sometimes called "dispatch")
2. Execution — operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute; this checks RAW (sometimes called "issue")
3. Write result — finish execution (WB)
   Write on the Common Data Bus to all awaiting FUs; mark the reservation station available
4. Commit — update the register with the reorder result
   When the instruction is at the head of the reorder buffer & the result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer (sometimes called "graduation")

Summary

‹ Reservation stations: implicit register renaming to a larger set of registers + buffering of source operands
- Prevents registers from becoming the bottleneck
- Avoids the WAR, WAW hazards of the scoreboard
- Allows loop unrolling in HW
‹ Not limited to basic blocks (integer units get ahead, beyond branches)
‹ Lasting contributions
- Dynamic scheduling & reorder buffer
- Register renaming
- Load/store disambiguation
‹ The 360/91's descendants are the Pentium III, PowerPC 604, MIPS 10000, HP-PA 8000, and Alpha 21264
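A minimal sketch of in-order commit from a circular reorder buffer, following the structure above (the three fields per entry plus done/speculative flags; all names are illustrative, and renaming/execution are not modeled):

# In-order commit from a circular ROB; results reach the register file only
# when the oldest entry is done and no longer speculative.
from collections import deque

class ROB:
    def __init__(self):
        self.buf = deque()        # entries in program (dispatch) order

    def dispatch(self, inst, dest, speculative):
        entry = {"inst": inst, "dest": dest, "value": None,
                 "done": False, "spec": speculative}
        self.buf.append(entry)
        return entry

    def commit(self, regfile):
        # Retire from the head only: preserves precise architectural state.
        while self.buf and self.buf[0]["done"] and not self.buf[0]["spec"]:
            e = self.buf.popleft()
            regfile[e["dest"]] = e["value"]

    def flush_speculative(self):
        # Mispredicted branch: squash every speculative entry still in flight.
        self.buf = deque(e for e in self.buf if not e["spec"])

rob, regs = ROB(), {}
a = rob.dispatch("MULTD", "F0", speculative=False)
b = rob.dispatch("ADDD", "F6", speculative=False)
b["value"], b["done"] = 8.0, True       # ADDD finishes first...
rob.commit(regs); print(regs)           # {} -- blocked behind MULTD
a["value"], a["done"] = 6.0, True
rob.commit(regs); print(regs)           # {'F0': 6.0, 'F6': 8.0}, in order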

More Branch Prediction: Inter-relating Branches

‹ So far, we have not considered inter-dependent branches
- Branches whose outcome depends on other branches

If (a <= 0) then {s1};   /* branch b1 */
If (a > 0)  then {s2};   /* branch b2 */

‹ What is the relation between b1 and b2?

Advanced Branch Prediction: 2-level Predictors — Global Branch Prediction

‹ Relate branches
- Globally (G)
- Individually (P) (also known as per-branch)
‹ Global: a single BHSR of k bits tracks the directions of the last k dynamic branches
‹ Individual: employ a set of k-bit BHSRs, one of which is selected for a branch (shift left on update)
- A global BHSR is shared by all branches, whereas an individual scheme has a BHSR dedicated to each branch (or subset)
‹ The PHT has options: global (g), individual (p), shared (s)
- Global has a single table for all static branches
- Individual has a PHT for each static branch
- Or a subset of branches shares each PHT

(figure: a k-bit Branch History Register, e.g. 1 1 1 1 0, indexes a Pattern History Table of entries from 00...00 to 11...11; the indexed old bits feed the prediction FSM logic, and branch resolution writes the new bits back)

2-Level Adaptive Prediction [Yeh & Patt]

‹ Two-level adaptive branch prediction
- 1st level: the history of the last k (dynamic) branches encountered
- 2nd level: the branch behavior of the last s occurrences of the specific pattern of these k branches
- Uses a Branch History Register (BHR) in conjunction with a Pattern History Table (PHT)
‹ Example: (k=8, s=6)
- The last k branches have the behavior (11100101)
- The s-bit history at the entry (11100101) is [101010]
- Using this history, the branch prediction algorithm predicts the direction of the branch
‹ Effectiveness:
- Average 97% accuracy for SPEC
- Used in Intel processors and the AMD K6

Global BHSR Scheme (GAs)

(figure: j bits of the branch address concatenated with a k-bit global Branch History Shift Register (BHSR) index a BHT of 2^(j+k) x 2 bits; the selected 2-bit entry provides the prediction)
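A minimal sketch of the GAs indexing in the figure — low PC bits concatenated with the global history into a table of 2-bit counters (the table sizes and the alternating test branch are illustrative assumptions):

# GAs-style two-level predictor: j PC bits ++ k global-history bits
# index a table of 2-bit saturating counters.
J, K = 4, 6
table = [1] * (1 << (J + K))     # counters start weakly not-taken
ghist = 0                        # k-bit global branch history

def index(pc):
    return ((pc & ((1 << J) - 1)) << K) | ghist

def predict(pc):
    return table[index(pc)] >= 2                # True => predict taken

def update(pc, taken):
    global ghist
    i = index(pc)                               # index with the OLD history...
    table[i] = min(table[i] + 1, 3) if taken else max(table[i] - 1, 0)
    ghist = ((ghist << 1) | int(taken)) & ((1 << K) - 1)   # ...then shift it in

pc, hits = 0x40, 0
for t in range(100):
    outcome = (t % 2 == 1)       # a branch alternating N,T,N,T,...
    hits += (predict(pc) == outcome)
    update(pc, outcome)
print(hits, "correct of 100")    # high after warm-up: the two history
                                 # contexts train two different counters

This is the correlation argument from the slides in miniature: a plain 2-bit counter mispredicts an alternating branch constantly, while the history-indexed table learns each context separately.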

Per-Branch BHSR Scheme (PAs)

(figure: i bits of the branch address select one of 2^i k-bit Branch History Shift Registers; the selected k history bits, concatenated with j address bits, index a BHT of 2^(j+k) x 2 bits, alongside a standard BHT)

Other Schemes

‹ Function Return Stack
- Register indirect targets are hard to predict from branch history
- Register indirect branches are mostly used for function returns ⇒
1. Push the return address onto a stack on each function call
2. On a register indirect branch, pop and return the top address as the prediction
‹ Combining Branch Predictors
- Each type of branch prediction scheme tries to capture a particular program behavior
- May want to include multiple prediction schemes in hardware
- Use another history-based prediction scheme to "predict" which predictor should be used for a particular branch
- You get the best of all worlds; this works quite well
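A minimal sketch of the return-address stack just described; the fixed depth and the drop-oldest overflow policy are assumptions for illustration:

# Return-address stack: push on call, pop on return to predict the target.
class ReturnStack:
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def on_call(self, pc, inst_size=4):
        if len(self.stack) == self.depth:
            self.stack.pop(0)              # overflow: drop the oldest entry
        self.stack.append(pc + inst_size)  # return address = instruction after call

    def predict_return(self):
        return self.stack.pop() if self.stack else None   # None: no prediction

ras = ReturnStack()
ras.on_call(0x1000)                  # call site in the caller
ras.on_call(0x2000)                  # nested call
print(hex(ras.predict_return()))     # 0x2004 -- innermost return first
print(hex(ras.predict_return()))     # 0x1004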

Next . . . VLIW/EPIC

‹ Static ILP processors: Very Long Instruction Word (VLIW) – Explicitly Parallel Instruction Computing (EPIC)
- The compiler determines dependencies and resolves hazards
- What hardware support is needed for this?
‹ Do 'standard' compiler techniques work, or do we need new compiler optimization methods to extract more ILP?
- Overview of compiler optimization