Modern Computer Architectures Lecture-1: Introduction

Sandeep Kumar Panda, Assoc. Prof., IT Department, CEB, BBSR

1 Introduction

– Computer performance has been increasing phenomenally over the last five decades. – Captured by Moore’s Law:

● Transistors per square inch roughly double every eighteen months. – Moore’s law is not exactly a law:

● But, has held good for nearly 50 years.

2 Introduction Cont…

● If commercial aircraft had seen a similar performance increase over the last 50 years, we would have:

– Commercial planes flying at 1000 times supersonic speed. – Aircraft the size of a chair. – Costing only a couple of thousand rupees.

3 Moore’s Law

Gordon Moore (co-founder of Intel) predicted in 1965: “Transistor density of minimum-cost semiconductor chips would double roughly every 18 months.” Moore’s Law: it’s worked for a long time.

Transistor density is correlated to processing speed. 4 Trends Related to Moore’s Law Cont… • Processor performance: • Twice as fast after every 2 years (roughly). • Memory capacity: • Twice as much after every 18 months (roughly).
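As a rough back-of-the-envelope sketch (the 50-year horizon below is only an assumed example, not a figure from the slides), the compound growth implied by doubling every 18 months can be computed in C:

#include <math.h>
#include <stdio.h>

/* Growth factor if transistor density doubles every 18 months (1.5 years). */
int main(void) {
    double years = 50.0;                      /* assumed horizon */
    double doublings = years / 1.5;           /* one doubling per 18 months */
    printf("Growth over %.0f years: about 2^%.1f = %.2e times\n",
           years, doublings, pow(2.0, doublings));
    return 0;
}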

5 Interpreting Moore’s Law

● Moore's law is not about just the density of transistors on a chip that can be achieved:

– But about the density of transistors at which the cost per transistor is the lowest.

● As more transistors are made on a chip:

– The cost to make each transistor reduces. – But the chance that the chip will not work due to a defect rises.

● Moore observed in 1965 there is a transistor density or complexity:

– At which "a minimum cost" is achieved. 6 Integrated Circuits Costs

IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

Final test yield: Fraction of packaged dies which pass the final testing stage.

7 8” MIPS64 Wafer (564 Dies) Drawing single-crystal Si ingot from furnace…. Then, slice into wafers and pattern it…

8 Integrated Circuits Costs

IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

Die cost = Wafer cost / (Dies per wafer × Die yield)

Final test yield: Fraction of packaged dies which pass the final testing stage. Die yield: Fraction of good dies on a wafer.
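A small C sketch of the two cost formulas above; all the numeric inputs (wafer cost, yields, test and packaging costs) are made-up illustrative values, not figures from the slides:

#include <stdio.h>

int main(void) {
    double wafer_cost = 5000.0;      /* assumed cost of one wafer        */
    double dies_per_wafer = 564.0;   /* e.g., the 8" MIPS64 wafer above  */
    double die_yield = 0.60;         /* assumed fraction of good dies    */
    double test_cost = 2.0, pkg_cost = 3.0;   /* assumed per-die costs   */
    double final_test_yield = 0.95;

    double die_cost = wafer_cost / (dies_per_wafer * die_yield);
    double ic_cost  = (die_cost + test_cost + pkg_cost) / final_test_yield;
    printf("Die cost = %.2f, IC cost = %.2f\n", die_cost, ic_cost);
    return 0;
}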

9 Integrated Circuits Capacity

10 Processor Performance

11 Where Has This Performance Improvement Come From?

● Technology – More transistors per chip – Faster logic

● Machine Organization/Implementation – Deeper pipelines – More instructions executed in parallel,…

● Instruction Set Architecture – Reduced Instruction Set Computers (RISC) – Multimedia extensions

● Compiler technology – Finding more parallelism in code – Greater levels of optimization 12 How Did Processor Performance Improve?

● Till 1980s, most of the performance improvements came from using innovations in manufacturing technologies: – VLSI – Reduction in feature size

● Improvements due to innovations in manufacturing technologies have slowed down since 1980s: – Smaller feature size gives rise to increased resistance, capacitance, propagation delays. – Larger power dissipation. (Aside: What is the power consumption of Intel Pentium Processor? Roughly 100 watts idle)

13 Feature Size

14 Feature size shrinks to about 70% of its previous value every 18 to 24 months Average Transistor Cost Per Year

15 Power Density Trend

16 Watch This

Click the chip

17 Power Consumption in a Processor

● Power=Dynamic power + Leakage power

● Dynamic power = Number of transistors × capacitance × voltage² × frequency

● Leakage power is rising and will soon match dynamic power.

Processor:    Pentium   P-Pro     P-II      P-III     P-4
Year:         1993      1995      1997      1999      2000
Transistors:  3.1M      5.5M      7.5M      9.5M      42M
Clock speed:  60 MHz    200 MHz   300 MHz   500 MHz   1.5 GHz

18 Dynamic Power

Pdyn = Σ (over i units) ki · Ci · V² · Ai · f

● Dynamic power in CMOS: current flows when a circuit is active – Transistor on – evaluates new inputs – Flip-flop, latch captures new value (clock edge)

● Terms – C: capacitance of circuit

● wire length, number and size of transistors – V: supply voltage – A: activity factor – f: frequency
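A small C sketch of the dynamic-power sum over units; the per-unit capacitances, activity factors, constant k, voltage and frequency below are all assumed example values:

#include <stdio.h>

/* Pdyn = sum over units of k * Ci * V^2 * Ai * f  (illustrative values only). */
int main(void) {
    double C[] = {1.0e-9, 0.5e-9, 2.0e-9};  /* assumed capacitance of each unit (F) */
    double A[] = {0.3, 0.1, 0.5};           /* assumed activity factor of each unit */
    double k = 0.5;                         /* assumed proportionality constant     */
    double V = 1.1;                         /* assumed supply voltage (V)           */
    double f = 2.0e9;                       /* assumed clock frequency (Hz)         */
    double p_dyn = 0.0;
    for (int i = 0; i < 3; i++)
        p_dyn += k * C[i] * V * V * A[i] * f;   /* contribution of unit i */
    printf("Dynamic power = %.2f W\n", p_dyn);
    return 0;
}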

● Future: Power dissipation a major factor 19 Current Chip Manufacturing

● Till recently most desktop processors were fabricated using a 65 nm process.

● Intel in January 2007 demonstrated a working 45nm chip:

– Intel began mass-producing in late 2007 (Atom and Core 2 Duo processors). – Compare: The diameter of an atom is of the order of 0.1 nm.

● A decade ago, chips were built using a 500 nm (0.5 micron) process.

● In 1971, a 10 micron process was used.

20 How Did Performance Improve? Cont… ● Since 1980s, most of the performance improvements have come from: – Architectural and organizational innovations

● What is the difference between: – Computer architecture and computer organization?

21 Architecture vs. Organization

● Architecture: – Also known as Instruction Set Architecture (ISA) – Programmer visible part of a processor: instruction set, registers, addressing modes, etc.

● Organization: – High-level design: How many caches? How many arithmetic and logic units? What type of pipelining, control design, etc.

– Sometimes known as micro-architecture. 22

Processor Views: The Art and Science of Instruction-Set Processor Design [Gerrit Blaauw & Fred Brooks, 1981]
ARCHITECTURE (ISA): programmer view – Functional appearance to user/system programmer – Opcodes, addressing modes, registers, etc.
IMPLEMENTATION (μarchitecture): processor designer view – Logical structure or organization that underlies the architecture – Pipelining, functional units, caches, physical registers
REALIZATION (Chip): chip/system designer view – Physical structure that embodies the implementation – Gates, cells, transistors, wires
23 Computer Architecture

● The structure of a computer that a machine language programmer must understand: – To be able to write a correct program for that machine.

● A family of computers of the same architecture should be able to run the same assembly language program. – Architecture leads to the notion of binary compatibility. 24 Instruction Set Architecture: The Critical Interface

software

instruction set

hardware

● Advantages of abstraction: – Lasts through many generations (portability) – Used in many different ways (generality) – Provides convenient functionality to higher levels – Permits an efficient implementation at lower levels 25 Course Objectives

● Modern processors such as Intel Pentium, AMD Athlon, etc. use: – Many architectural and organizational innovations not covered in a first course. – Innovations in memory, I/O, and storage designs as well. – Multiprocessors and clusters

● In this light, objective of this course:

– Study the architectural and organizational innovations used in modern computers. 26 A Few Architectural and Organizational Innovations

● RISC (Reduced Instruction Set Computers): – Exploited instruction-level parallelism:

● Initially through pipelining and later through multiple instruction issue (superscalar) – Use of on-chip caches

● Dynamic instruction scheduling

● Branch prediction 27 Today’s Objectives

● Study some preliminary concepts: – Amdahl’s law, performance benchmarking, etc.

● RISC versus CISC architectures.

● Types of parallelism in programs versus types of parallel computers.

. 28 Amdahl’s Law

● Quantifies overall performance gain due to improvement to a part of a computation.

● Amdahl’s Law: – Performance improvement gained from using some faster mode of execution is limited by the amount of time the enhancement is actually used.

Speedup = Execution time for a task without the enhancement / Execution time for the task using the enhancement

29 A Typical Computer System

30 Computer System Components

CPU

Caches Processor-Memory Bus

Adapter RAM Peripheral Buses

Controllers Controllers

I/O devices Displays Networks Keyboards 31 Amdahl’s Law and Speedup

● Speedup tells us: – How much faster a machine will run due to an enhancement. ● For using Amdahl’s law two things should be considered: – 1st… Fraction of the computation time in the original machine that can use the enhancement

● If a program executes in 30 seconds and 15 seconds of exec. uses enhancement, fraction = ½ – 2nd… Improvement gained by enhancement

● If enhanced task takes 3.5 seconds and original task took 7secs, we say the speedup is 2. 32 Amdahl’s Law Equations

Execution time_new = Execution time_old × [(1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup_overall = Execution Time_old / Execution Time_new = 1 / [(1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
(Use the previous equation and solve for speedup.)

Don’t just try to memorize these equations and plug numbers into them. It’s always important to think about the problem too!

33 Amdahl’s Law

ExTime_new = ExTime_old × [(1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

34 Amdahl’s Law: Example

● Floating point instructions improved to run 2 times faster.

● But, only 10% of actual instructions are FP

ExTimenew = ?

Speedupoverall = ?

35 Amdahl’s Law: Example

● Floating point instructions improved to run 2X; – But only 10% of actual instructions are FP.

ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old

Speedup_overall = 1 / 0.95 = 1.053
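A small C sketch of Amdahl’s Law that reproduces this FP example (the helper function name is just for illustration):

#include <stdio.h>

/* Amdahl's Law: speedup_overall = 1 / ((1 - f) + f / s_enh) */
double amdahl(double fraction_enhanced, double speedup_enhanced) {
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced);
}

int main(void) {
    /* FP instructions made 2x faster, but only 10% of execution uses FP. */
    printf("Overall speedup = %.3f\n", amdahl(0.10, 2.0));  /* prints 1.053 */
    return 0;
}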

36 Performance Measurements

● Performance measurement is important: – Helps us to determine if one processor (or computer) works faster than another. – A computer exhibits higher performance if it executes programs faster.

37 How to Select Computer Systems?

● Suppose you are asked by your principal to select a computer system for a specific application: – Say to run a web service for your college.

● How will you proceed?

38 Comparing Performance

● X is n times faster than Y: Execution time_Y / Execution time_X = n

● Throughput of X is n times that of Y: Tasks per unit time_X / Tasks per unit time_Y = n

39 If Only It Were That Simple

● X is n times faster than Y for application A: Execution time of app A on machine Y / Execution time of app A on machine X = n

● But what about different applications (or even parts of the same application)? – The GUI is 10 times faster on X than on Y… – File operations are 1.5 times faster on Y than on X… – Arithmetic ops are 2 times faster on Y than on X… – Memory ops are 3 times faster on X than on Y… – …and on and on and on… 40 Clock-Rate Based Performance Measurement

● Comparing performance based on clock rates alone is meaningless: – Execution time = Instruction count × CPI × Clock cycle time – Please remember:

● A lower CPI need not mean better performance.

● Also, a processor with a higher clock rate may execute programs much slower!

41 Example: Calculating Overall CPI

Typical instruction mix:
Operation   Freq   CPI(i)
ALU         40%    1
Load        27%    2
Store       13%    2
Branch      20%    5

Overall CPI = 1×0.4 + 2×0.27 + 2×0.13 + 5×0.2 = 2.2
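A small C sketch of the weighted-CPI calculation above (the arrays simply encode the instruction mix from the table):

#include <stdio.h>

/* Overall CPI = sum over instruction classes of (frequency * CPI of class). */
int main(void) {
    double freq[] = {0.40, 0.27, 0.13, 0.20};  /* ALU, Load, Store, Branch */
    double cpi[]  = {1.0,  2.0,  2.0,  5.0};
    double overall = 0.0;
    for (int i = 0; i < 4; i++)
        overall += freq[i] * cpi[i];
    printf("Overall CPI = %.2f\n", overall);   /* prints 2.20 */
    return 0;
}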

42 MIPS and MFLOPS

● Used extensively 30 years back.

● MIPS: millions of instructions processed per second.

● MFLOPS: Millions of FLoating point OPerations completed per Second

MIPS = Instruction Count / (Exec. Time × 10⁶) = Clock Rate / (CPI × 10⁶)

43 Problems with MIPS

● Three significant problems with using MIPS:

● So severe that they led someone to term MIPS: – “Meaningless Information about Processing Speed”

● Problem 1: – MIPS is instruction set dependent.

44 Problems with MIPS cont… ● Problem 2: – MIPS varies between programs on the same computer.

● Problem 3: – MIPS can vary inversely to performance!

● Let’s look at an example as to why MIPS doesn’t work…

45 A MIPS Example

● Consider the following computer. Instruction counts (in millions) for each instruction class:

Code type     A (1 cycle)   B (2 cycles)   C (3 cycles)
Compiler 1    5             1              1
Compiler 2    10            1              1

The machine runs at 100 MHz. Instruction class A requires 1 clock cycle, class B requires 2 clock cycles, and class C requires 3 clock cycles.

CPI = CPU Clock Cycles / Instruction Count = [ Σ (i = 1..n) CPI_i × N_i ] / Instruction Count

46 A MIPS Example cont…

CPI1 = [(5×1) + (1×2) + (1×3)] × 10⁶ / [(5 + 1 + 1) × 10⁶] = 10/7 = 1.43

MIPS1 = 100 MHz / 1.43 = 69.9

CPI2 = [(10×1) + (1×2) + (1×3)] × 10⁶ / [(10 + 1 + 1) × 10⁶] = 15/12 = 1.25

MIPS2 = 100 MHz / 1.25 = 80.0

So, compiler 2 has a higher MIPS rating and should be faster?

47 A MIPS Example cont…

● Now let’s compare CPU time (note this important formula!):

CPU Time = Instruction Count × CPI / Clock Rate

CPU Time1 = 7 × 10⁶ × 1.43 / (100 × 10⁶) = 0.10 seconds

CPU Time2 = 12 × 10⁶ × 1.25 / (100 × 10⁶) = 0.15 seconds

Therefore program 1 is faster despite a lower MIPS!
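A small C sketch that reproduces the whole two-compiler example (CPI, MIPS rating and CPU time), using only the counts and clock rate given above:

#include <stdio.h>

int main(void) {
    double clock_hz = 100e6;                  /* 100 MHz machine                   */
    double counts[2][3] = {{5e6, 1e6, 1e6},   /* compiler 1: class A, B, C counts  */
                           {10e6, 1e6, 1e6}}; /* compiler 2                        */
    double cycles_per_class[3] = {1, 2, 3};

    for (int c = 0; c < 2; c++) {
        double cycles = 0, insts = 0;
        for (int i = 0; i < 3; i++) {
            cycles += counts[c][i] * cycles_per_class[i];
            insts  += counts[c][i];
        }
        double cpi      = cycles / insts;
        double mips     = clock_hz / (cpi * 1e6);
        double cpu_time = insts * cpi / clock_hz;
        printf("Compiler %d: CPI = %.2f, MIPS = %.1f, CPU time = %.2f s\n",
               c + 1, cpi, mips, cpu_time);
    }
    return 0;
}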

48 Toy Benchmarks

● The performance of different computers can be compared by running some standard programs: – Quick sort, Merge sort, etc.

● But, the basic problem remains: – Even if you select based on a toy benchmark, the system may not perform well in a specific application. – What can be a solution then? 49 Synthetic Benchmarks

● Basic Principle: Analyze the distribution of instructions over a large number of practical programs.

● Synthesize a program that has the same instruction distribution as a typical program: – Need not compute something meaningful.

● Dhrystone, Khornerstone, Linpack are some of the older synthetic benchmarks: – More recent is SPEC.. 50 SPEC Benchmarks

● SPEC: Standard Performance Evaluation Corporation: – A non-profit organization (www.spec.org)

● CPU-intensive benchmarks for evaluating processor performance of workstations: – Generations: SPEC89, SPEC92, SPEC95, SPEC2000, … – SPEC2000 puts more emphasis on memory system performance.

51 Problems with Benchmarks

● The SPEC89 benchmark included a small kernel called matrix300: – Consists of 8 different 300×300 matrix operations. – Optimization of this inner loop resulted in a performance improvement by a factor of 9.

● Optimizing compilers can discard 25% of the Dhrystone code

● Solution: Benchmark suite

52 Other SPEC Benchmarks

● SPECviewperf: 3D graphics performance

– For applications such as CAD/CAM, visualization, content creations, etc.

● SPEC JVM98: performance of client-side Java virtual machine.

● SPEC JBB2000: Server-side Java application

● SPEC WEB2005: evaluating WWW servers

– Contains multiple workloads utilizing both http and https, dynamic content implemented in PHP and JSP.

53 Summary of SPEC Benchmarks

● CPU: CPU2006

● Graphics: SPECviewperf9

● Java Client/Server: jAppServer2004

● Mail Servers: MAIL2001

● Network File System: SFS97_R1

● Power (under development)

● Web Servers: WEB2005

54 BAPCo

● Non-profit consortium www.bapco.com

● SYSmark 2004 SE – Office productivity benchmark

55 Instruction Set Architecture (ISA)

● Programmer visible part of a processor:

– Instruction Set (what operations can be performed?)

– Instruction Format (how are instructions specified?)

– Registers (where are data located?)

– Addressing Modes (how is data accessed?)

– Exceptional Conditions (what happens if something goes wrong?)

56

ISA cont…

● ISA is important: – Not only from the programmer’s perspective. – From processor designer and implementer perspectives as well.

57 Evolution of Instruction Sets

Single Accumulator (Manchester Mark I, IBM 700 series 1953)

Stack (Burroughs, HP-3000 1960-70)

General Purpose Register Machines

Complex Instruction Sets (Vax, Intel 386, 1977–85)          RISC (MIPS, IBM RS6000, … 1987)

58 Different Types of ISAs

● Determined by the means used for storing data in CPU:

● The major choices are: – A stack, an accumulator, or a set of registers.

● Stack architecture: – Operands are implicitly on top of the stack.

59 Different Types of ISAs cont… ● Accumulator architecture: – One operand is in the accumulator (register) and the others are elsewhere. – Essentially this is a 1-address machine. – Found in older machines…

● General purpose registers: – Operands are in registers or specific memory locations. 60 Comparison of Architectures

Consider the operation: C =A + B

Stack       Accumulator   Register-Memory   Register-Register
Push A      Load A        Load  R1, A       Load  R1, A
Push B      Add  B        Add   R1, B       Load  R2, B
Add         Store C       Store C, R1       Add   R3, R1, R2
Pop C                                       Store C, R3

61 Types of GPR Computers

● Register-Register (0,3)

● Register-Memory (1,2)

● Memory-Memory (2,2) or (3,3)

62 Modern Computer Architectures Lecture-3: Some More Basic Concepts

63 Some More Architectural Issues ● Instruction length needs to be in multiples of bytes.

● Instruction encoding can be: – Variable or Fixed

● Variable encoding tries to use as few bits to represent a program as possible: – But at the cost of complexity of decoding.

● Fixed Encoding: Alpha, ARM, MIPS instructions: Operation and Modes | Addr1 | Addr2 | Addr3 64 How to Run an Application Faster?

● Improve performance of: – Processor – Memory – Hard disk – Network and Peripheral devices – Operating system – Compiler – Algorithm

65 How to Improve Processor Performance?

● Operate on multiple data at the same time: – Data Parallelism

● Operate on multiple operations at the same time: – Operation Parallelism

66 Basics of Parallel Computing

● If you are ploughing a field, which of the following would you rather use [Seymour Cray]:

– One strong ox? – A pair of cows? – Two pairs of goats? – 128 chickens?

● The answer would be different:

– If you are considering the problem of solving problems on a computer. 67 Basics of Parallel Computing cont… ● Consider another scenario: – You have to get a color image printed on a stack of papers. ● Would you rather: 1. For each sheet: print red, then green, then blue, and only then take up the next sheet? 2. As soon as you complete printing red on a sheet, advance it to the next color stage, and meanwhile take a new sheet for printing?

68 Flynn’s Classical Classification of Computers

● SISD (Single Instruction Single Data): – Uniprocessors.

● MISD (Multiple Instruction Single Data): – No practical examples exist

● SIMD (Single Instruction Multiple Data): – Specialized processors

● MIMD (Multiple Instruction Multiple Data):

– General purpose, commercially important 69

Flynn’s Classical Classification

Processors

SISD SIMD MISD MIMD

70 SISD

[Diagram: a control unit (C) issues one instruction stream (IS) to one processor (P), which exchanges one data stream (DS) with memory (M).]

71 SIMD

[Diagram: one control unit (C) issues a single instruction stream (IS) to multiple processors (P), each exchanging its own data stream (DS) with memory (M).]

72 MISD

[Diagram: multiple control units (C), each issuing its own instruction stream (IS) to a processor (P), all operating on a single data stream (DS) from memory (M).]

73 MIMD

[Diagram: multiple control units (C) and processors (P), each with its own instruction stream (IS) and data stream (DS), connected to memory (M).]

74 Modern Classification

Parallel architectures

Data-parallel architectures          Function-parallel architectures

75 Data Parallel Architectures

Data-parallel architectures

Systolic architectures     Associative and neural architectures     Vector architectures     SIMDs

76 Function Parallel Architectures Function-parallel architectures

Instruction-level Parallel Arch (ILPs)     Thread-level Parallel Arch     Process-level Parallel Arch (MIMDs)

ILPs: Pipelined processors, VLIWs, Superscalar processors     MIMDs: Distributed Memory MIMD, Shared Memory MIMD

77 Data Parallel Architectures

● SIMD Processors – Multiple processing elements driven by a single instruction stream

● Vector Processors – Uni-processors with vector instructions

● Associative Processors – SIMD like processors with associative memory

● Systolic Arrays – Application specific VLSI structures

78 Systolic Arrays [H.T. Kung 1978] • Simplicity, Regularity, Concurrency, Communication. Example : Band matrix multiplication

A = | A11 A12  0   0   0   0  |        B = | B11 B12  0   0   0   0  |
    | A21 A22 A23  0   0   0  |            | B21 B22 B23  0   0   0  |
    | A31 A32 A33 A34  0   0  |            | B31 B32 B33 B34  0   0  |
    |  0  A42 A43 A44 A45  0  |            |  0  B42 B43 B44 B45  0  |
    |  0   0  A53 A54 A55 A56 |            |  0   0  B53 B54 B55 B56 |
    |  0   0   0  A64 A65 A66 |            |  0   0   0  B64 B65 B66 |

C = A × B

79 [Diagram: snapshot of the systolic array at T = 0, with the A elements (A11, A12, A21, A22, A23, A31, …) and B elements (B11, B12, B21, B31, …) staged to enter the array.]

80 Classification for MIMD Computers

● Shared memory computers: – Processors communicate through a shared memory. – Typically processors connected to each other and to the shared memory through a bus.

● Distributed memory computers: – Processors do not share any physical memory. – Processors connected to each other through a network. 81 Shared Memory

● Shared memory located at a centralized location: – May consist of several interleaved modules --- same distance (access time) from any processor. – Also called the Uniform Memory Access (UMA) model.

82 Distributed Memory

● Memory is distributed to each processor: – Improves scalability.

● Non-Uniform Memory Access (NUMA)

– (a) Message passing architectures – No processor can directly access another processor’s memory. – (b) Distributed Shared Memory (DSM)– Memory is distributed, but the address space is shared.

83 UMA vs. NUMA Computers

[Diagram: (a) UMA model – processors P1…Pn, each with a cache, share a single main memory over a bus. (b) NUMA model – processors P1…Pn, each with a cache and a local main memory, connected by a network.]

(a) UMA Model (b) NUMA Model 84 Advantages of UMA Model

● Ease of programming: –When communication patterns are complex

● Lower communication overhead

● Hardware-controlled caching to reduce remote data fetching: –Remote data is cached

85 Advantages of Message- Passing Communication

● The hardware is much simpler and more standard

● Explicit communication => simpler to understand, and helps focus effort on reducing communication cost

● Synchronization is naturally associated with sending/receiving messages

86 Modern Computer Architectures Lecture-2: A Few Basic Concepts

87 RISC/CISC Controversy

● RISC: Reduced Instruction Set Computer

● CISC: Complex Instruction Set Computer

● Genesis of CISC architecture:

– Implementing commonly used instructions in hardware can lead to significant performance benefits. – For example, use of a FP processor can lead to performance improvements.

● Genesis of RISC architecture:

– The rarely used instructions can be eliminated

to save chip space --- an on-chip cache and a large number of registers can be provided. 88 Features of a CISC Processor

● Rich instruction set: – Some simple, some very complex

● Complex addressing modes: – Orthogonal addressing (every addressing mode is available for every instruction).

● Many instructions take multiple cycles: – Large variation in CPI

● Instructions are of variable sizes

● Small number of registers

● Microprogrammed control

● No (or inefficient) pipelining 89 Examples of CISC Philosophy

● One instruction could do the work of several instructions.

– For example, a single instruction could load two numbers to be added, add them, and then store the result back to memory directly.

● Many versions of the same instructions were supported:

– Different versions did almost the same thing with minor changes. – For example, one version would read two numbers from memory, and store the result in a register. Another version would read one number from memory and the other from a register and store the result to memory. 90 Features of a RISC Processor

● Small number of instructions

● Small number of addressing modes

● Large number of registers (>32)

● Instructions execute in one or two clock cycles

● Uniform-length instructions and fixed instruction format

● Register-Register Architecture: – Separate memory access instructions (load/store)

● Separate instruction/data caches

● Hardwired control

● Pipelining (Why are CISC processors not pipelined?) 91

CISC vs. RISC Organizations

[Diagram: (a) CISC organization – microprogrammed control unit with microprogrammed control memory, a unified cache, and main memory. (b) RISC organization – hardwired control unit, separate instruction and data caches, and main memory.]

(a) CISC Organization (b) RISC Organization

92 Why Does RISC Lead to Improved Performance?

● Increased GPRs (Also, Register Windows) – Lead to decreased data traffic to memory. – Remember memory is the bottleneck.

● Register-Register architecture leads to more uniform instructions: – Efficient pipelining becomes possible.

● However, larger instruction memory traffic: – Because a larger number of instructions results.

93 Why Does RISC Lead to Improved Performance? Cont…

● RISC-Like instructions can be scheduled to achieve efficiency: – Either by compiler or – By hardware (Dynamic instruction scheduling).

● Suppose we need to add 2 values and store the results back to memory.

● In CISC: – It would be done using a single instruction.

● In RISC: – 4 instructions would be necessary.

94 Early RISC Processors

● 1987 Sun SPARC

● 1990 IBM RS 6000

● 1996 IBM/Motorola PowerPC

95 Parallel Execution of Programs

● Parallel execution of a program can be done in one or more of the following ways: – Instruction-level (fine grained): individual instructions of any one thread are executed in parallel.

● Parallel execution across a sequence of instructions (block) -- could be a loop, a conditional, or some other sequence of statements. – Thread-level (medium grained): different threads of a process are executed in parallel. – Process-level (coarse grained): different processes (tasks) can be executed in parallel. 96 Exploitation of Instruction-Level Parallelism

● ILP can be exploited by deploying several available techniques: – Temporal parallelism (Overlapped execution):

● Pipelining – Spatial Parallelism:

● Superscalar execution (Multiple instructions that use multiple data MIMD)

● Vector processing (single instruction multiple data SIMD)

97 Modern Computer Architectures Lecture-4: ILP Exploitation Through Pipelining

98 Original ILP Apprehensions

● Flynn’s Bottleneck (1970):

– Speedup due to ILP can at best be 2. – Flynn’s study focused on ILP found in the basic blocks of some common programs. – Crossing basic block boundaries would involve crossing control dependencies --- would require flushing the pipeline.

● Flynn’s Bottleneck appears too pessimistic in retrospect:

– Present ILP exploitation level is much beyond Flynn’s bottleneck. 99 Pipelining

● Pipelining incorporates the concept of overlapped execution: – Used in many everyday applications without our notice.

● Has proved to be a very popular and successful way to exploit ILP: – Instruction pipes are being used in almost all modern processors.

100 A Pipeline Example

● Consider two alternate ways in which an engineering college can work: – Approach 1. Admit a batch of students, and admit the next batch only after the current batch completes (i.e. admit once every 4 years). – Approach 2. Admit students every year. – In the second approach:

● Average number of students graduating per year increases four times. 101

Pipelining

First Year Second Year Third Year Fourth Year First Year Second Year Third Year Fourth Year First Year Second Year Third Year Fourth Year

102 Pipelined Execution

Time

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB Program Flow IFetch Dcd Exec Mem WB

103 Advantages of Pipelining

● An n-stage pipeline: – Can improve performance up to n times.

● Not much investment in hardware: – No replication of hardware resources necessary. – The principle deployed is to keep the units as busy as possible.

● Transparent to the programmers: – Easy to use 104 Basic Pipelining Terminologies

● Pipeline cycle (or Processor cycle): – The time required to move an instruction one step further in the pipeline. – Not to be confused with clock cycle.

● Synchronous pipeline: – Pipeline cycle is constant (clock-driven).

● Asynchronous pipeline: – Time for moving from stage to stage varies – Handshaking communication between stages

105 Pipeline Cycle

● Pipeline cycle:

– Determined by the time required by the slowest stage.

● Pipeline designers try to balance the length (i.e. the processing time) of each pipeline stage.

– For a perfectly balanced pipeline, the execution time per instruction is t/n,

● where t is the execution time per instruction on nonpipelined machine and n is the number of pipe stages.

● However, it is very difficult to make the different pipeline stages perfectly balanced.

● Besides, pipelining itself involves some overhead.

– The pipeline overhead arises due to the latches used between two successive pipeline stages. 106 Synchronous Pipeline

- Transfers between stages are simultaneous. - One task or operation enters the pipeline per cycle.

[Diagram: stages S1, S2, …, Sk separated by latches (L), all driven by a common clock; τm is the stage delay and d the latch delay.]

107 Asynchronous Pipeline

- Transfers performed when individual stages are ready. - Handshaking protocol between processors.

[Diagram: stages S1, S2, …, Sk connected by Ready/Ack handshake signals between neighbouring stages.]

- Different amounts of delay may be experienced at different stages. - Can display variable throughput rate.

108 A Few Pipeline Concepts

[Diagram: adjacent stages Si and Si+1, with stage delay τm and latch delay d.]

Pipeline cycle: τ = max{τm} + d

Latch delay: d

Pipeline frequency: f = 1 / τ

109 Ideal Pipeline Speedup

● k-stage pipeline processes n tasks in k + (n-1) clock cycles:

– k cycles for the first task and n-1 cycles for the remaining n-1 tasks.

● Total time to process n tasks

● Tk = [k + (n−1)] τ

● For the non-pipelined processor

T1 = n k τ

110 Pipeline Speedup Expression

● Speedup:

Sk = T1 / Tk = n k τ / ([k + (n−1)] τ) = n k / [k + (n−1)]

● Observe that the memory bandwidth must increase by a factor of Sk: – Otherwise, the processor would stall waiting for data to arrive from memory.
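A small C sketch of the ideal pipeline speedup formula, showing how the speedup approaches k as the number of tasks n grows (the specific k and n values are just examples):

#include <stdio.h>

/* Ideal k-stage pipeline speedup for n tasks: Sk = n*k / (k + n - 1). */
double pipeline_speedup(double k, double n) {
    return (n * k) / (k + n - 1.0);
}

int main(void) {
    printf("5-stage pipeline, 100 tasks:   speedup = %.2f\n", pipeline_speedup(5, 100));
    printf("5-stage pipeline, 10000 tasks: speedup = %.2f\n", pipeline_speedup(5, 10000));
    return 0;   /* the second value is close to the stage count, 5 */
}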

111 Pipelines: A Few Basic Concepts

● Historically, there are two different types of pipelines:

– Instruction pipelines – Arithmetic pipelines

● Arithmetic pipelines (e.g. FP multiplication) are not popular in general purpose computers:

– Need a continuous stream of arithmetic operations. – E.g. Vector processors operating on an array.

● On the other hand instruction pipelines being used in almost every modern processor.

112 Pipelines: A Few Basic Concepts

● Pipeline increases instruction throughput: – But, does not decrease the execution time of the individual instructions. – In fact, slightly increases execution time of each instruction due to pipeline overheads.

● Pipeline overhead arises due to a combination of: – Pipeline register delay – Clock skew

113 Pipelines: A Few Basic Concepts ● Pipeline register delay: – Caused by latch set-up time

● Clock skew: – The maximum delay between clock arrival at any two registers.

● Once clock cycle is as small as the pipeline overhead: – No further pipelining would be useful. – Very deep pipelines may not be useful. 114 Pipeline Registers

● Pipeline registers are essential part of pipelines:

– There are 4 groups of pipeline registers in the 5-stage pipeline.

● Each group saves output from one stage and passes it as input to the next stage:

– IF/ID

– ID/EX

– EX/MEM

– MEM/WB

● This way, each time “something is computed”...

– Effective address, Immediate value, Register content, etc. – It is saved safely in the context of the instruction that needs it.

115 Pipeline Registers

IF/ID ID/EX EX/MEM MEM/WB

inst. 1 IF inst. 2 inst. 3


117 Pipeline Registers

IF/ID ID/EX EX/MEM MEM/WB

inst. 1 IF ID inst. 2 IF inst. 3


119 Pipeline Registers

IF/ID ID/EX EX/MEM MEM/WB

inst. 1 IF ID EX inst. 2 IF ID inst. 3 IF


121 Pipeline Registers

IF/ID ID/EX EX/MEM MEM/WB

inst. 1 IF ID EX MEM inst. 2 IF ID EX inst. 3 IF ID


123 Pipeline Registers

IF/ID ID/EX EX/MEM MEM/WB

inst. 1 IF ID EX MEM WB inst. 2 IF ID EX MEM . . . inst. 3 IF ID EX . . .

● Typically, we will not think too much about pipeline registers and one just assumes that values are passed “magically” down stages of the pipeline.

124 Pipeline Register Depiction

[Diagram: successive instructions flowing through IM (instruction memory), Reg (register read), ALU, DM (data memory) and Reg (write-back), with the IF/ID, ID/EX, EX/MEM and MEM/WB registers between stages. Every stage reads its input from, and writes its output to, the pipeline registers.]

125 Why Pipelining RISC Processors is Easy

● Recall the main principles of RISC:

– All operands are in registers – The only operations that affect memory are loads and stores. – Few instruction formats, fixed encoding (i.e., all instructions are the same size)

● Although pipelining could conceivably be implemented for any architecture,

– It would be inefficient. – Pentium has characteristics of RISC and CISC --- CISC Instructions are internally converted to RISC-like instructions. 126 Drags on Pipeline Performance

● Things are actually not so rosy, due to the following factors: – Difficult to balance the stages – Pipeline overheads: latch delays – Clock skew – Hazards

127 Exercise

● Consider an unpipelined processor: – Takes 4 cycles for ALU and other operations – 5 cycles for memory operations. – Assume the relative frequencies:

● ALU and other=60%,

● memory operations=40% – Cycle time =1ns

● Compute speedup due to pipelining: – Ignore effects of branching. – Assume pipeline overhead = 0.2ns

128 Solution

● Average instruction execution time for large number of instructions:

– Unpipelined = 1 ns × (60% × 4 + 40% × 5) = 4.4 ns – Pipelined = 1 ns + 0.2 ns = 1.2 ns

● Speedup=4.4/1.2=3.7 times
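A small C sketch repeating the exercise arithmetic (all inputs come from the exercise statement above):

#include <stdio.h>

int main(void) {
    double t_alu = 4.0, t_mem = 5.0;       /* cycles for ALU/other and memory ops */
    double f_alu = 0.60, f_mem = 0.40;     /* relative frequencies                */
    double cycle_ns = 1.0, overhead_ns = 0.2;

    double unpipelined_ns = cycle_ns * (f_alu * t_alu + f_mem * t_mem);  /* 4.4 ns */
    double pipelined_ns   = cycle_ns + overhead_ns;                     /* 1.2 ns */
    printf("Speedup = %.1f\n", unpipelined_ns / pipelined_ns);          /* ~3.7   */
    return 0;
}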

129 Pipeline Hazards

● Hazards can result in incorrect operations: – Structural hazards: Two instructions requiring the same hardware unit at same time. – Data hazards: Instruction depends on result of a prior instruction that is still in pipeline

● Data dependency – Control hazards: Caused by delay in decisions about changes in control flow (branches and jumps).

● Control dependency 130 Pipeline Interlock

● Pipeline interlock: – Resolving of pipeline hazards through hardware mechanisms.

● Interlock hardware detects all hazards: – Stalls appropriate stages of the pipeline to resolve hazards.

131 MIPS

● MIPS architecture: – First publicly known implementations of RISC architectures – Grew out of research at Stanford University

● MIPS Computer Systems was founded in 1984: – R2000 introduced in 1986. – Licensed the designs rather than selling processors.

132 Commercial Success of MIPS

● Popularly used as IP-cores (building-blocks) for embedded processor designs. – Both 32-bit and 64-bit basic cores are offered --- the design is licensed as MIPS32 and MIPS64. – MIPS cores have been commercially successful --- used in many consumer and industrial applications.

● MIPS cores can be found in: – Modern Cisco and Linksys routers, cable modems and ADSL modems, smartcards, laser printer engines, set- top boxes, robots, handheld computers, Sony PlayStation 2 and Sony PlayStation Portable.

● In cell phone/PDA applications, the MIPS core has been unable to displace the incumbent, competing ARM core 133 MIPS cont… ● MIPS: Microprocessor without Interlocked Pipeline Stages

● Operates on 64-bit data

● 32 64-bit registers

● Two 128 KB high-speed caches

● Constant 32-bit instruction length

● Initial MIPS processors achieved 1 instruction/cycle (CPI = 1)

134 MIPS cont… ● Registers R0 to R31:

– Value of R0 is always 0

● Small number of addressing modes:

– Only immediate and displacement modes supported, besides register mode. – Register indirect achieved by placing 0 in displacement field.

● Add R4, 0(R1) //[R4] <- [R4] + [[R1]] – Absolute addressing achieved by using R0 as the base register.

● Add R1, 100(R0) // [R1] <- [R1] + [100]

135 MIPS

● R0 is used to synthesize popular instructions: – E.g., there is no Mov instruction – Add R3,R0,#3 // move 3 to R3

● R0 helps in reducing the number of instructions.

136 MIPS

● MIPS memory: – Byte addressable with 64-bit address.

● MIPS provides 4 broad classes of instructions: – Load/Store – ALU operations – Branches and jumps – Floating point operations 137 MIPS Pipeline

● Uses a 5-stage pipeline

● IF: Instruction fetch

● ID: Decode operands and fetch register operands

● ALU: ALU operation or data operand address generation

● MEM: Data memory reference

● WB: Write back into the register file

138 MIPS Pipelining Stages

● 5 stages of MIPS Pipeline:

– IF Stage:

● Needs access to the Memory to load the instruction.

● Needs an adder to update the PC. – ID Stage:

● Needs access to the Register File for reading operands.

● Needs an adder (to compute the potential branch target). – EX Stage:

● Needs an ALU. – MEM Stage:

● Needs access to the Memory. – WB Stage:

● Needs access to the Register File in writing. 139 More Complete Picture of a Pipeline

[Diagram: pipeline stages Fetch, Decode, Execute, Memory, Writeback (Funit, Dunit, EXunit, Memunit, WBunit), together with the instruction cache, the register file and the data cache.]

140 Details of Pipeline Stages: MIPS R2000

Stage    Phase   Function performed
1. IF    1       Translate virtual instr. addr. using TLB
         2       Access I-cache using physical address
2. RD    1       Decode instruction
         2       Read reg. file; if a branch, generate target addr.
3. ALU   1       Start ALU op.; if a branch, check branch condition
         2       Finish ALU op.; if load/store, add base reg and offset to form effective addr., translate virtual addr.
4. MEM   1       Access D-cache
         2       Return data from D-cache, check tags & parity
5. WB    1       Write register file for both mem and other instrs.
         2       ---

141 Extending MIPS to Handle FP Operations

● It is impractical to expect:

– all MIPS FP operations complete in 1 or 2 cycles.

● FP should have the same pipeline stages as integer instructions:

– EX cycles may be repeated many times to complete an operation. – There may be multiple functional units.

[Diagram: IF and ID stages feeding parallel execution units (integer EX, FP multiply, FP add), which then feed MEM and WB.]

142 Further MIPS Enhancements cont…

● MIPS could achieve CPI of 1, to improve performance further, two possibilities: – Superscalar – Superpipelined

● Superscalar: – Replicate each pipeline stage so that two or more instructions can proceed simultaneously.

● Superpipeline: – Split pipeline stages into further stages.

143 Superscalar Processing Stages

[Diagram: space-time chart of instructions I-1 to I-4 advancing cycle by cycle through pipeline stages S1 to S6, where stage S4 is replicated into two units (u and v) so that two instructions can occupy it at once.]

144 Summary

● RISC architecture style saves chip area that is used to support more registers and cache: – Also is facilitated due to small and uniform sized instructions.

● Three main types of parallelism in a program: – Instruction-level – Thread-level – Process-level

145 Summary Cont…

● Two main types of parallel computers:

– SIMD – MIMD

● Instruction pipelines are found in almost all modern processors:

– Exploits instruction-level parallelism – Transparent to the programmers

● Hazards can slowdown a pipeline:

– In the next lecture, we shall examine hazards in more detail and available ways to resolve hazards.

146 References

[1] J.L. Hennessy & D.A. Patterson, “Computer Architecture: A Quantitative Approach,” Morgan Kaufmann Publishers, 3rd Edition, 2003.
[2] John Paul Shen and Mikko Lipasti, “Modern Processor Design,” Tata McGraw-Hill, 2005.

147 Feature Size

● The minimum feature size: – The width of the smallest line or gap that appears in your design.

148 Modern Computer Architectures

Lecture 6: Pipeline Hazards and Their Resolution Mechanisms

1 Module Objectives

● Hazards, their causes, and resolution

● Branch prediction

● Exploiting loop-level parallelism

● Dynamic instruction scheduling:

– Scoreboarding and Tomasulo’s algorithm

● Compiler techniques for exposing ILP

● Superscalar and VLIW processing

● Survey of some modern processors

2 Introduction

● What is ILP (Instruction-Level Parallelism)? – Parallel execution of different instructions belonging to the same thread.

● A thread usually consists of several basic blocks: – As well as several branches and loops.

● Basic block: – A sequence of instructions not having a branch instruction.

3 Introduction cont…

● Instruction pipelines can effectively exploit parallelism in a basic block: – An n-stage pipeline can improve performance up to n times. – Does not require much investment in hardware – Transparent to the programmers.

● Pipelining can be viewed to: – Decrease average CPI, and/or – Decrease clock cycle time for instructions.

4 Drags on Pipeline Performance

● Factors that can degrade pipeline performance: – Unbalanced stages – Pipeline overheads – Clock skew – Hazards

● Hazards cause the worst drag on the performance of a pipeline.

5 Pipeline Hazards

● What is a pipeline hazard? – A situation that prevents an instruction from executing during its designated clock cycles.

● There are 3 classes of hazards: – Structural Hazards – Data Hazards – Control Hazards

6 Structural Hazards

● Arise from resource conflicts among instructions executing concurrently:

– Same resource is required by two (or more) concurrently executing instructions at the same time.

● Easy way to avoid structural hazards:

– Duplicate resources (sometimes not practical)

● Examples of Resolution of Structural Hazard:

– An ALU to perform an arithmetic operation and an adder to increment PC. – Separate data cache and instruction cache accessed simultaneously in the same cycle. 7 Structural Hazard: Example

IF ID EXE MEM WB IF ID EXE MEM WB IF ID EXE MEM WB IF ID EXE MEM WB

8 Data Hazards

● Occur when an instruction under execution depends on:

– Data from an instruction ahead in pipeline.

● Example: A=B+C; D=A+E;

A=B+C; IF ID EXE MEM WB D=A+E; IF ID EXE MEM WB

– Dependent instruction uses old data:

● Results in wrong computations

9 Control Hazards

● Result from branch and other instructions that change the flow of a program (i.e. change PC).

● Example: 1: If(cond){

2: s1} 3: s2

● Statement in line 2 is control dependent on statement at line 1.

● Until condition evaluation completes:

– It is not known whether s1 or s2 will execute next.

10 Can You Identify Control Dependencies?

1: if(cond1){ 2: s1; 3: if(cond2){ 4: s2;} 5: }

11 Program Dependences Can Cause Hazards!

● Hazards can be caused by dependences within a program.

● There are three main types of dependences in a program: – Data dependence – Name dependence – Control dependence

12 Data Dependences

● An instruction j is data dependent on instruction i, iff either of: – Direct: Instruction i produces a result that is used by instruction j.
    r3 ← r1 op r2
    r5 ← r3 op r4
– Transitive:

● Instruction j is data dependent on instruction k and

● Instruction k is data dependent on instruction i.
    r3 ← r1 op r2
    r4 ← r3 op r2
    r5 ← r6 op r4

13 Detecting Data Dependences

● A data value may flow between instructions:

– (i) through registers – (ii) through memory locations.

● When data flow is through a register:

– Detection is rather straight forward.

● When data flow is through a memory location:

– Detection is difficult. – Two addresses may refer to the same memory location but look different. 100(R4) and 20(R6)

14 Types of Data Dependences

● Two types of data dependences: – True data dependence. – Name dependence:

● Two instructions use the same register or memory location (called a name).

● There is no true flow of data between the two instructions.

● Example: A=B+C; A=P+Q;

● Two types of name dependences: – Anti-dependence – Output dependence

15 Anti-Dependence

● Anti-dependence occurs between two instructions i and j, iff:

– j writes to a register or memory location that i reads. – Original ordering must be preserved to ensure that i reads the correct value.

● Example:

– ADD F0,F6,F8 – SUB F8,F4,F5

16 Output Dependence

● Output dependence occurs between two instructions i and j, iff: – The two instructions write to the same memory location.

● Ordering of the instructions must be preserved to ensure: – Finally written value corresponds to j.

17 Exercise

● Identify all the dependences in the following C code: 1. a=b+c; 2. b=c+d; 3. a=a+c; 4. c=b+a;

18 Hazard Resolution

● Name dependences: – Once identified, can be easily eliminated through simple compiler renaming techniques. – Memory-related dependences are difficult to identify:

● Hardware techniques (scoreboarding and dynamic instruction scheduling) are being used.

● True data dependences: – More difficult to handle. – Can not be eliminated; can only be overcome! – Many techniques have evolved over the years.

19 Data Hazards

● Data hazards are caused by data dependences: – But, mere existence of data dependence in a program may not result in a data hazard.

20 Types of Data Hazards

● Data hazards are of three types: – Read After Write (RAW) – Write After Write (WAW) – Write After Read (WAR)

● With an in-order execution machine: – WAW, WAR hazards can not occur.

● Assume instruction i is issued before j.

21 Read after Write (RAW) Hazards ● RAW hazard:

– Instruction j tries to read its operand before instruction i writes it. – j would incorrectly receive an old or incorrect value.
● Example:  i: ADD R1, R2, R3
            j: SUB R4, R1, R6

Program order: … i … j …

Instruction i is a write instruction issued before j; instruction j is a read instruction issued after i.

22 RAW Dependency: More Examples

● Example program (a):

– i1: load r1, addr; – i2: add r2, r1,r1;

● Program (b):

– i1: mul r1, r4, r5; – i2: add r2, r1, r1;

● Both cases, i2 does not get operand until i1 has completed writing the result

– In (a) this is due to load-use dependency – In (b) this is due to define-use dependency 23 Write After Write (WAW) Hazards

● WAW hazard:

– Instruction j tries to write an operand before instruction i writes it. (How can this happen???) – Writes are performed in the wrong order.

● Example:  i: DIV F1, F2, F3
            j: SUB F1, F4, F6

Instruction i is a write instruction issued before j; instruction j is a write instruction issued after i.

24 Out-of-order Pipelining

[Diagram: an out-of-order pipeline with IF, ID, RD and WB stages and parallel EX units (INT, Fadd1–Fadd2, Fmult1–Fmult3, LD/ST). In program order, Ia: F1 ← F2 × F3 precedes Ib: F1 ← F4 + F5, but Ib reaches WB before Ia (out-of-order write-back), so the final value of F1 is wrong --- a WAW hazard.]

25 Write After Read (WAR) Hazards

● WAR hazard:

– Instruction j tries to write an operand before instruction i reads it. – Instruction i, instead of getting the old value, receives the newer, undesired value:

● Example:  i: DIV F7, F1, F3
            j: SUB F1, F4, F6

Instruction i is a read instruction issued before j; instruction j is a write instruction issued after i. 26 WAR Hazards

● Called an “anti-dependence” by compiler writers. – This results from reuse of the name (register) “F1”.

● Can’t happen in MIPS 5 stage pipeline because: – All instructions take 5 stages, and – Reads are always in stage 2, and – Writes are always in stage 5

27 WAR and WAW Dependency: More Examples ● Example program (a):

– i1: mul r1, r2, r3; – i2: add r2, r4, r5;

● Example program (b):

– i1: mul r1, r2, r3; – i2: add r1, r4, r5;

● Both cases have dependence between i1 and i2

– in (a) r2 must be read before it is written into – in (b) r1 must be written by i2 after it has been written into by i1 28 Is There Really a Dependence/Hazard ?

● Exercise: –i1: mul r1, r2, r3; –i2: add r2, r4, r5;

29 Inter-Instruction Dependences

● Data dependence (Read-after-Write, RAW):
    r3 ← r1 op r2
    r5 ← r3 op r4

● Anti-dependence (Write-after-Read, WAR) – a false dependency:
    r3 ← r1 op r2
    r1 ← r4 op r5

● Output dependence (Write-after-Write, WAW) – a false dependency:
    r3 ← r1 op r2
    r5 ← r3 op r4
    r3 ← r6 op r7

Control dependence

30 Data Dependencies : Summary

Data dependencies in straight-line code

RAW: Read After Write dependency (flow dependency) – includes load-use and define-use dependency – a true dependency that cannot be overcome.

WAR: Write After Read dependency (anti dependency) – a false dependency.

WAW: Write After Write dependency (output dependency) – a false dependency.

False dependencies can be eliminated by register renaming. 31 Dependences and Hazards

Dependences → Hazards:
– Data (true) dependence → RAW hazard
– Name (output) dependence → WAW hazard
– Name (anti) dependence → WAR hazard
– Control dependence → Control hazard
– (no corresponding dependence) → Structural hazard

32 A Solution to WAR and WAW Hazards

● Rename Registers

– i1: mul r1, r2, r3; – i2: add r6, r4, r5;

● Register renaming can get rid of most false dependencies:

– Compiler can do register renaming in the register allocation process (i.e., the process that assigns registers to variables).

33 Use of Compiler Techniques to Tackle Data hazards

● A compiler can help eliminate some stalls caused by data hazards:

– Example: an instruction that uses result of a LOAD’s destination register should not immediately follow the LOAD instruction.

● The technique is called:

– “compiler-based pipeline instruction scheduling”

34 Hardware Techniques to Deal with Hazards

● Simple solution – Stall pipeline

● Pipeline interlock: – Lets some instruction(s) in the pipeline proceed, while others are made to wait for data, resources, etc.

35 An Example of a Structural Hazard

Mem Reg DM Reg Load ALU

Mem Reg DM Reg Instruction 1 ALU

Mem Reg DM Reg Instruction 2 ALU

Mem Reg DM Reg Instruction 3 ALU

Mem Reg DM Reg Instruction 4 ALU

Time Would there be a hazard here? 36 How is it Resolved?

Mem Reg DM Reg Load ALU

Mem Reg DM Reg Instruction 1 ALU

Mem Reg DM Reg Instruction 2 ALU

Stall Bubble Bubble Bubble Bubble Bubble

Mem Reg DM Reg Instruction 3 ALU

Time A Pipeline can be stalled by

inserting a “bubble” or NOP 37 Modern Computer Architectures

Lecture 7: Resolution Mechanisms for Pipeline Hazards

38 Pipeline Stall: Alternate Representation

Inst. #      Clock number: 1    2    3    4      5    6    7    8    9    10
LOAD                       IF   ID   EX   MEM    WB
Inst. i+1                       IF   ID   EX     MEM  WB
Inst. i+2                            IF   ID     EX   MEM  WB
Inst. i+3                                 stall  IF   ID   EX   MEM  WB
Inst. i+4                                        IF   ID   EX   MEM  WB
Inst. i+5                                             IF   ID   EX   MEM
Inst. i+6                                                  IF   ID   EX

● LOAD instruction “steals” an instruction fetch cycle which causes the pipeline to stall.

● Thus, no instruction completes on clock cycle 8

39 Performance with Stalls

● Stalls degrade performance of a pipeline: – Result in deviation from 1 instruction executing/clock cycle. – Let’s examine by how much stalls can impact CPI…

40 Stalls and Performance

• CPI_pipelined = Ideal CPI + Pipeline stall cycles per instruction = 1 + Pipeline stall cycles per instruction

• Ignoring overhead and assuming stages are balanced:

Speedup ≈ CPI_unpipelined / (1 + Pipeline stall cycles per instruction)

41 Speedup Due to Pipelining

Speedup ≈ Pipeline depth / (1 + Pipeline stall cycles per instruction)

42 Alternate Speedup Expression

Clock cycle pipelined = Clock cycle unpipelined / Pipeline depth

Pipeline depth = Clock cycle unpipelined / Clock cycle pipelined

Speedup from pipelining = [1 / (1 + Pipeline stall cycles per instruction)] × (Clock cycle unpipelined / Clock cycle pipelined)

43 An Example of Performance Impact of Structural Hazard

● Assume:

– Pipelined processor. – Data references constitute 40% of an instruction mix. – Ideal CPI of the pipelined machine is 1. – Consider two cases:

● Unified data and instruction cache vs. separate data and instruction cache.

● What is the impact on performance?

44 An Example cont… ● Avg. Inst. Time = CPI × Clock Cycle Time
(i) For separate caches: Avg. Instr. Time = 1 × 1 = 1
(ii) For the unified cache:

= (1 + 0.4 × 1) × Clock cycle time_ideal

= 1.4 × Clock cycle time_ideal = 1.4

● Speedup= 1/1.4 = 0.7

● 30% degradation in performance
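A small C sketch of this unified-vs-separate-cache comparison, using only the figures stated in the example:

#include <stdio.h>

/* Unified vs. separate caches: every data reference (40% of instructions)
   causes a 1-cycle structural stall when the cache is unified. */
int main(void) {
    double ideal_cpi = 1.0, data_ref_fraction = 0.40, stall_cycles = 1.0;
    double cpi_separate = ideal_cpi;
    double cpi_unified  = ideal_cpi + data_ref_fraction * stall_cycles;
    printf("CPI unified = %.2f, relative performance = %.2f\n",
           cpi_unified, cpi_separate / cpi_unified);   /* 1.40 and ~0.7 */
    return 0;
}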

45 Recollect Data Hazards What causes them? – Pipelining changes the order of read/write accesses to operands. – Order differs from that of an unpipelined machine.

● Example:

– ADD R1, R2, R3 – SUB R4, R1, R5 (For MIPS, ADD writes the register in WB but SUB needs it in ID.)

This is a data hazard

46 Illustration of a Data Hazard

Mem Reg DM Reg ADD R1, R2, R3 ALU

Mem Reg DM Reg SUB R4, R1, R5 ALU

Mem Reg DM AND R6, R1, R7 ALU

Mem Reg OR R8, R1, R9 ALU

XOR R10, R1, R11 Mem Reg

Time ADD instruction causes a hazard in next 3 instructions because register not written until after those 3 read it. 47 Forwarding

● Simplest solution to data hazard: – forwarding

● Result of the ADD instruction not really needed: – until after ADD actually produces it.

● Can we move the result from EX/MEM register to the beginning of ALU (where SUB needs it)? – Yes!

48 Forwarding cont…

● Generally speaking:

– Forwarding occurs when a result is passed directly to the functional unit that requires it. – Result goes from output of one pipeline stage to input of another.

49 When Can We Forward?

[Diagram: pipeline timing of ADD R1,R2,R3 followed by SUB R4,R1,R5; AND R6,R1,R7; OR R8,R1,R9; XOR R10,R1,R11. SUB gets its R1 value from the EX/MEM pipe register, AND gets it from the MEM/WB pipe register, and OR gets it by forwarding from the register file.]

If the forwarding line goes “forward” in time you can do forwarding; if it would have to be drawn backward, it’s physically impossible. 50 How to Implement Hazard Control Logic?

● In a pipeline, – All data hazards can be checked during ID phase of pipeline. – If a data hazard is detected, next instruction should be stalled. – Whether forwarding is needed can also be determined at this stage, control signals set.

● If hazard is detected, – Control unit of pipeline must stall pipeline and prevent instructions in IF, ID from advancing.

51 Branch Hazards

● Three simplest methods of dealing with branches: – Flush Pipeline:

● Redo the instructions following a branch, once an instruction is detected to be branch during the ID stage. – Branch Not Taken:

● A slightly higher performance scheme is to assume every branch to be not taken. – Another scheme is delayed branch.

52 An Example of Impact of Branch Penalty

● Assume for a MIPS pipeline: – 16% of all instructions are branches:

● 4% unconditional branches: 3 cycle penalty

● 12% conditional: 50% taken: 3 cycle penalty

53 Impact of Branch Penalty

● For a sequence of N instructions:

– N cycles to initiate each – 3 * 0.04 * N delays due to unconditional branches – 0.5 * 3 * 0.12 * N delays due to conditional taken

● Overall: – Total cycles = 1.3 × N – i.e., CPI = 1.3 cycles/instruction – 30% performance hit!
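A small C sketch of this branch-penalty CPI calculation, using the branch frequencies and penalties assumed in the example above:

#include <stdio.h>

/* 4% unconditional branches (3-cycle penalty),
   12% conditional branches of which 50% are taken (3-cycle penalty). */
int main(void) {
    double cpi = 1.0
               + 0.04 * 3.0          /* unconditional branches     */
               + 0.12 * 0.5 * 3.0;   /* taken conditional branches */
    printf("CPI = %.2f\n", cpi);     /* prints 1.30 */
    return 0;
}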

54 Reducing Branch Penalty

● Two approaches: 1) Move condition comparator to ID stage: ● Decide branch outcome and target address in the ID stage itself: ● Reduces branch delay to 2 cycles. 2)Branch prediction

55 Four Simple Branch Hazard Solutions #1: Stall

– until branch direction is clear – flushing pipe #2: Predict Branch Not Taken

– Execute successor instructions in sequence as if there is no branch – undo instructions in pipeline if branch actually taken – 47% branches not taken on average

56 Four Simple Branch Hazard Solutions cont… #3: Predict Branch Taken

– 53% branches taken on average (why?) – But branch target address not available after IF in MIPS

● MIPS still incurs 1 cycle branch penalty even with predict taken

● Other machines: branch target known before branch outcome computed, significant benefits can accrue

57 Four Simple Branch Hazard Solutions cont…

#4: Delayed Branch – Insert an unrelated successor in the branch delay slot:
    branch instruction
    sequential successor1
    sequential successor2
    ……                          (branch delay of length n)
    sequential successorn
    branch target if taken
– 1 slot delay required in the 5-stage pipeline

58 Delayed Branch

● Simple idea: Put an instruction that would be executed anyway right after a branch.

Branch IF ID EX MEM WB

Delay slot instruction        IF ID EX MEM WB

Branch target OR successor       IF ID EX MEM WB

● Question: What instruction do we put in the delay slot?

● Answer: one that can safely be executed no matter what the branch does.

– The compiler decides this. 59 Delayed Branch

● One possibility: An instruction from before

● Example (before → after scheduling):

Before:                      After:
  DADD R1, R2, R3              if R2 == 0 then
  if R2 == 0 then                DADD R1, R2, R3   (delay slot)
    <empty delay slot>         . . .
  . . .

● The DADD instruction is executed no matter what happens in the branch:

– Because it is executed before the branch! – Therefore, it can be moved 60 Delayed Branch

● We get to execute the DADD instruction “for free”

branch IF ID EX MEM WB

add instruction IF ID EX MEM WB

branch target/successor IF ID EX MEM WB

By this time, we know whether to take the branch or whether not to take it

61 Delayed Branch

● Another possibility: An instruction much before

● Example: DSUB R4, R5, R6 ...

DADD R1, R2, R3

if R1 == 0 then

delay slot

● The DSUB instruction can be replicated into the delay slot, and the branch target can be changed

62 Delayed Branch

● Another possibility: An instruction from before

● Example: DSUB R4, R5, R6 ...

DADD R1, R2, R3

if R1 == 0 then

DSUB R4, R5, R6

● The DSUB instruction can be replicated into the delay slot, and the branch target can be changed

63 Delayed Branch

● Yet another possibility: An instruction from inside the taken path

DADD R1, R2, R3 ● Example: if R1 == 0 then

delay slot

OR R7, R8, R9

DSUB R4, R5, R6

● The OR instruction can be moved into the delay slot ONLY IF its execution doesn’t disrupt the program execution (e.g., R7 is overwritten later) 64 Delayed Branch

● Third possibility: An instruction from inside the taken path DADD R1, R2, R3 ● Example: if R1 == 0 then

OR R7, R8, R9

OR R7, R8, R9

DSUB R4, R5, R6

● The OR instruction can be moved into the delay slot ONLY IF its execution doesn’t disrupt the program execution (e.g., R7 is overwritten later) 65 Delayed Branch Example

B1:  LD    R1,0(R2)
     DSUBU R1,R1,R3
     BEQZ  R1,L        ; R1 != 0 falls through to B2; R1 == 0 branches to B3
B2:  OR    R4,R5,R6
     DADDU R10,R4,R3
B3:  L: DADDU R7,R8,R9

1.) BEQZ is dependent on DSUBU and DSUBU on LD, 2.) If we knew that the branch was taken with a high probability, then DADDU could be moved into block B1, since it doesn’t have any dependencies with block B2, 3.) Conversely, knowing the branch was not taken, then OR could be moved into block B1, since it doesn’t affect anything in B3, 66 Delayed Branch

● Where to get instructions to fill branch delay slots?

– Before branch instruction – From the target address: Useful only if branch taken. – From fall through: Useful only if branch not taken.

67 Modern Computer Architectures

Lecture 8: Branch Prediction

68 Delayed Branch cont…

● Compiler effectiveness for single branch delay slot: – Fills about 60% of branch delay slots. – About 80% of instructions executed in branch delay slots useful in computation. – About (60% x 80%) i.e. 50% of slots usefully filled.

● Delayed Branch downside: what if multiple instructions issued per clock cycle (superscalar)?

69 Branch Prediction

• KEY IDEA: Hope that branch assumption is correct. • If yes, then we’ve gained a performance improvement. • Otherwise, discard instructions • program is still correct, all we’ve done is “waste” a clock cycle. Two approaches

– Direction Based Prediction – Profile Based Prediction 70 Direction Based Prediction

• Simple to implement • However, often branch behaviour is variable (dynamic). • Can’t capture such behaviour at compile time with simple direction based prediction! • Need history (aka profile)-based prediction.

71 History-based Branch Prediction

● An important example is State-based branch prediction:

● Needs 2 parts:

– “Predictor” to guess where/if instruction will branch (and to where) – “Recovery Mechanism”: i.e. a way to fix mistakes

72 History-based Branch Prediction cont…

● One bit predictor: – Use result from last time this instruction executed.

● Problem: – Even if a branch is almost always taken, we will be mispredicted at least twice (e.g., once when a loop is exited and once when it is next entered). – If the branch alternates between taken and not taken,

● we get 0% accuracy

73 1-bit Predictor

● Set bit to 1 or 0: –Depending on whether branch Taken (T) or Not-taken (NT) – Pipeline checks bit value and predicts – If incorrect then need to discard speculatively executed instruction

● Actual outcome used to set the bit value.

74 Example

● Let the initial value = T; the actual outcomes of the branch are: NT, NT, NT, T, T, T

– The predictions are: T, NT, NT, NT, T, T

● 2 wrong (the 1st and 4th), 4 correct = 66% accuracy

● 2-bit predictors can do even better

● In general, can have k-bit predictors.

75 2-bit Dynamic Branch Prediction Scheme

● Change the prediction only if mispredicted twice in a row:
– Four states: two “Predict Taken” states and two “Predict Not Taken” states.
– A taken outcome (T) moves the predictor toward the “Predict Taken” states; a not-taken outcome (NT) moves it toward the “Predict Not Taken” states.
– A single misprediction only moves the predictor from a strong state to the corresponding weak state; a second consecutive misprediction flips the prediction.

● Adds hysteresis to decision making process
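To make the scheme concrete, here is a minimal C sketch of one 2-bit predictor entry implemented as a saturating counter (one common encoding; the exact arcs of the state diagram above may differ slightly). The type and function names are illustrative only.

#include <stdbool.h>

/* One 2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken. */
typedef struct { unsigned state; } TwoBitCounter;

static bool predict_taken(const TwoBitCounter *p) {
    return p->state >= 2;                /* states 2 and 3 predict taken */
}

static void train(TwoBitCounter *p, bool taken) {
    if (taken) {
        if (p->state < 3) p->state++;    /* move toward "strongly taken" */
    } else {
        if (p->state > 0) p->state--;    /* move toward "strongly not taken" */
    }
}

A misprediction in a strong state only weakens the counter; the prediction itself flips only after two consecutive mispredictions, which is the hysteresis mentioned above.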

76 An Example of Computing Performance

● Program assumptions: – 23% loads and in ½ of cases, next instruction uses load value – 13% stores – 19% conditional branches – 2% unconditional branches – 43% other

77 Example cont… ● Machine Assumptions: – 5 stage pipe

● Penalty of 1 cycle on use of load value immediately after a load.

● Jumps are resolved in ID stage for a 1 cycle branch penalty.

● 75% branch prediction accuracy.

● 1 cycle delay on misprediction.

78 Example cont… ● CPI penalty calculation:

– Loads:

● 50% of the 23% of loads have 1 cycle penalty: .5*.23=0.115 – Jumps:

● All of the 2% of jumps have 1 cycle penalty: 0.02*1 = 0.02 – Conditional Branches:

● 25% of the 19% are mispredicted, have a 1 cycle penalty: 0.25*0.19*1 = 0.0475

● Total Penalty: 0.115 + 0.02 + 0.0475 = 0.1825

● Average CPI: 1 + 0.1825 = 1.1825
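The penalty arithmetic above can be written out directly; the small C program below simply reproduces the slide’s numbers (the variable names are illustrative).

#include <stdio.h>

int main(void) {
    double load_penalty   = 0.23 * 0.5;    /* half of the 23% loads stall 1 cycle      */
    double jump_penalty   = 0.02 * 1.0;    /* all of the 2% jumps pay 1 cycle          */
    double branch_penalty = 0.19 * 0.25;   /* 25% of the 19% conditional branches miss */
    double cpi = 1.0 + load_penalty + jump_penalty + branch_penalty;
    printf("CPI = %.4f\n", cpi);           /* prints CPI = 1.1825 */
    return 0;
}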

79 Exploiting Loop-level Parallelism: Motivation

● An instruction pipeline essentially exploits ILP within a basic block:

– On the average the size of a basic block is 7. – After every 7 instructions, a branch instruction is encountered.

● To obtain substantial performance benefits:

– ILP across multiple basic blocks needs to be exploited.

80 Software-based Scheduling vs. Hardware-based Scheduling • Disadvantage with software-based (compile-time) scheduling: • In many cases, much information cannot be extracted from the code at compile time • Examples: • whether two pointers refer to the same memory location. • the value of the induction variable of a loop

● It is still possible to assist hardware by exposing more ILP:

– Rearrange instructions for increased performance

81 Loop-level Parallelism

● It may be possible to execute different iterations of a loop in parallel.

● Example: – for(i=0;i<1000;i++){ – a[i]=a[i]+b[i]; – b[i]=b[i]*2; – }

82 Problems in Exploiting Loop- level Parallelism • Loop Carried Dependences: • A dependence across different iterations of a loop.

● Loop Independent Dependences: – A dependence within the body of the loop itself (i.e. within one iteration).

83 Loop-level Dependence

● Example:

– for(i=0;i<1000;i++){
–   a[i+1]=b[i]+c[i];
–   b[i+1]=a[i+1]+d[i];
– }

● Loop-carried dependence: each iteration uses the value b[i] produced by the preceding iteration.

● Also, loop-independent dependence on account of a[i+1]

84 Eliminating Loop-level Dependences Through Code Transformations

● We shall examine 3 techniques: – Static loop unrolling – Basic block transformations –

85 Static Loop Unrolling

- A high proportion of loop instructions are loop management instructions. - Eliminating this overhead can significantly increase the performance of the loop. - for(i=1000;i>0;i--){ - a[i]=a[i]+c; - }
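For reference, the same loop unrolled by a factor of 4 in C (anticipating the MIPS version on the next slides); the remainder loop is an added detail so the sketch works for any count, and is an assumption rather than part of the slide.

/* Original: for (i = 1000; i > 0; i--) a[i] = a[i] + c; */
void add_scalar_unrolled(double a[], double c) {
    int i;
    for (i = 1000; i >= 4; i -= 4) {      /* four original iterations per pass */
        a[i]     = a[i]     + c;
        a[i - 1] = a[i - 1] + c;
        a[i - 2] = a[i - 2] + c;
        a[i - 3] = a[i - 3] + c;
    }
    for (; i > 0; i--)                    /* left-over iterations, if any */
        a[i] = a[i] + c;
}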

86 Static Loop Unrolling

Loop : L.D    F0,0(R1)     ; F0 = array elem.
       ADD.D  F4,F0,F2     ; add scalar in F2
       S.D    F4,0(R1)     ; store result
       DADDUI R1,R1,#-8    ; decrement ptr
       BNE    R1,R2,Loop   ; branch if R1 != R2

87 Static Loop Unrolling cont…

Loop : L.D    F0,0(R1)
       ADD.D  F4,F0,F2
       S.D    F4,0(R1)
       L.D    F6,-8(R1)
       ADD.D  F8,F6,F2
       S.D    F8,-8(R1)
       L.D    F10,-16(R1)
       ADD.D  F12,F10,F2
       S.D    F12,-16(R1)
       L.D    F14,-24(R1)
       ADD.D  F16,F14,F2
       S.D    F16,-24(R1)
       DADDUI R1,R1,#-32
       BNE    R1,R2,Loop

88

Static Loop Unrolling cont…

Loop : L.D    F0,0(R1)
       ADD.D  F4,F0,F2
       S.D    F4,0(R1)
       L.D    F6,-8(R1)
       ADD.D  F8,F6,F2
       S.D    F8,-8(R1)
       L.D    F10,-16(R1)
       ADD.D  F12,F10,F2
       S.D    F12,-16(R1)
       L.D    F14,-24(R1)
       ADD.D  F16,F14,F2
       S.D    F16,-24(R1)
       DADDUI R1,R1,#-32
       BNE    R1,R2,Loop

– Note the renamed registers: this eliminates dependences between each of the n = 4 loop bodies taken from different iterations.
– Note the adjusted load/store offsets (0, -8, -16, -24).
– The loop overhead instructions (DADDUI, BNE) are adjusted for the unroll factor.

89 Transformation of A Basic Block

● It is possible to rewrite a loop to eliminate loop-carried dependences:

– Only if, there are no cyclic dependences.

With dependence:
for(i=1;i<1000;i++){
  a[i]=a[i]+b[i];
  b[i+1]=c[i]+d[i];
}

Without dependence:
a[1]=a[1]+b[1];
for(i=1;i<999;i++){
  b[i+1]=c[i]+d[i];
  a[i+1]=a[i+1]+b[i+1];
}
b[1000]=c[999]+d[999];

90 Modern Computer Architectures

Lecture 9: Advanced Branch Prediction Techniques

91 Handling Control Hazards: Branch Predictions

● Unless satisfactory resolution mechanisms are in place:

– Branches can significantly degrade the performance of a pipeline:

● We had so far looked at some very simple branch prediction techniques:

– Yet, yielded reasonably good performance benefits: of the order of 50% to 100%. – Can we do better by deploying more advanced branch prediction techniques?

92 Some Discussions on State- Based Predictor • The branch prediction buffer (BPB) is implemented as a special cache: – Accessed during the IF stage.

Lower Address Bits    Prediction Bits
01FF                  11
05CD                  10
------

93 Some Discussions on State-Based Predictor

● If an instruction is decoded as a branch:

– If the branch is predicted taken, fetching begins as soon as the target address is known.

● Branch taken prediction technique is of little use in MIPS 5 stage pipeline.

– Why? – Useful in deeper pipelines.

● What are the pros and cons of a using large BPB?

94 Predictors in Simple Pipelines

● Initial pipelined processors, e.g. MIPS, SOLARIS, etc.: – Did only trivial branch predictions.

● Possible reasons could be: – The penalty of mispredictions not as severe as in deeper pipelined processors. – Sophisticated branch predictors did not exist.

● Advanced branch prediction techniques have now become very important with: – Use of deeper pipelines. – Introduction of multiple-issue (superscalar) processors.

95 2-bit Predictor

● What is the prediction accuracy using a 4096 entry 2-bit for a typical application? – 99% to 80% depending upon the application.

● Can an n-bit (n>2) predictor do better? – 2-bit predictors do almost as well as any n-bit predictors.

● Can the accuracy of branch prediction be improved? – Correlating branch predictor.

96 Correlating Branch Predictor

● It may be possible to improve the accuracy of branch prediction:

– By observing the recent behavior of other branches.

● Example: if (a==2) { b=2; } if (b==2) { b=0; }

97 Correlating Branch Predictor

● An (m,n) predictor: – Makes use of the outcomes observed for the last m branches (global history). – Keeps 2^m separate n-bit predictors for each branch. – The behavior of a branch is predicted by choosing from among these 2^m branch predictors, based on the m-bit history.
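A minimal C sketch of such a predictor for a single branch, with (m, n) = (2, 2), is shown below; the structure layout and names are assumptions made for illustration, and a real predictor would also index by the branch address.

#include <stdbool.h>

/* (2,2) predictor for one branch: 2 bits of global history select one of
 * 2^2 = 4 two-bit saturating counters. */
typedef struct {
    unsigned history;       /* outcomes of the last 2 branches, 1 bit each */
    unsigned counter[4];    /* one 2-bit counter per history pattern       */
} CorrelatingPredictor;

static bool cp_predict(const CorrelatingPredictor *p) {
    return p->counter[p->history & 3] >= 2;          /* 2,3 => predict taken */
}

static void cp_train(CorrelatingPredictor *p, bool taken) {
    unsigned idx = p->history & 3;
    if (taken  && p->counter[idx] < 3) p->counter[idx]++;
    if (!taken && p->counter[idx] > 0) p->counter[idx]--;
    p->history = ((p->history << 1) | (taken ? 1u : 0u)) & 3;  /* shift in outcome */
}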

98 Correlating Branch Predictor

● Why does the outcome of one branch depend on the outcome of another branch?

● Depending on whether some preceding branch is taken or not: – Some variable may be set to some value or not.

99 Correlating Branch Predictor Example

● d=2;
● while(TRUE){
●   B1: if(d==0)
●         d=1;
●   B2: if(d==1)
●         d=0;
●       else d=2;
● }

• What values does d take? 2,0,2,0,2,0….

100 1-bit Predictor for the Example

d=?   B1 prediction   B1 actual   New B1 prediction   B2 prediction   B2 actual   New B2 prediction
2     NT              T           T                   NT              T           T
0     T               NT          NT                  T               NT          NT
2     NT              T           T                   NT              T           T
0     T               NT          NT                  T               NT          NT

Prediction Accuracy= 0%

101 (1,1) Correlating Predictor for the Example

d=?   B1 prediction   B1 actual   New B1 prediction   B2 prediction   B2 actual   New B2 prediction
2     NT/NT           T           T/NT                NT/NT           T           NT/T
0     T/NT            NT          T/NT                NT/T            NT          NT/T
2     T/NT            T           T/NT                NT/T            T           NT/T
0     T/NT            NT          T/NT                NT/T            NT          NT/T

Prediction accuracy nearly 100%

102 Branch Target Buffers

● Branch Target Buffer (BTB) contains:

– Prediction for the target address. – Can be used along with a separate branch prediction buffer (BPB).

● At the end of IF stage we know whether a branch would be taken:

– And if yes, and if the target address is known. – Then, we can have a 0-cycle penalty for branches.
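A minimal C sketch of how such a branch-target-buffer lookup might work in the IF stage is given below; the table size, fields, and direct-mapped indexing are illustrative assumptions (entries are kept only for branches predicted taken, as discussed on the next slides).

#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 1024

typedef struct {
    bool     valid;
    uint32_t branch_pc;   /* address of a branch predicted taken */
    uint32_t target_pc;   /* its predicted target address        */
} BTBEntry;

static BTBEntry btb[BTB_ENTRIES];

/* Looked up during IF: on a hit, fetch from the predicted target next cycle
 * (0-cycle penalty if the prediction turns out to be correct). */
static uint32_t next_fetch_pc(uint32_t pc) {
    BTBEntry *e = &btb[(pc >> 2) % BTB_ENTRIES];   /* direct-mapped index */
    if (e->valid && e->branch_pc == pc)
        return e->target_pc;
    return pc + 4;                                 /* fall through otherwise */
}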

103 Modern Computer Architectures

Lecture 10: Branch Target Prediction

104 Branch Target Buffers

● Target address to be stored in the BTB:

– only for predicted taken branches. – Why?

● Can the BPB and the BTB be combined?

– Complications would arise for 2-bit predictor. – Target address for branch-untaken would also have to be stored.

● In many commercial processors, e.g. PowerPC:

– Separate BTB and BPB are used.

105 Branch Target Buffers • What are the pros and cons of using a large BTB?

Instruction in buffer?   Prediction   Actual branch   Penalty (cycles)
Yes                      Taken        T               0

Yes Taken NT 2

No NT T 2

No NT NT 0

106 Branch Folding?

● What if the target instruction itself is stored in the BTB instead of the target address? – Unconditional branches can run in 0 cycles. – Pipeline would simply substitute the instruction in BTB for the branch instruction.

107 Exercise

● Consider the 5-stage MIPS pipeline. ● Assume: – 1-bit branch prediction is used and along with branch target prediction using a BTB. ● A benchmark program contains 20% branch instructions: – Of these, 20% are unconditional branches. – Of the conditional branches 50% are taken. – 95% chance that a taken branch instruction is found in prediction buffer. – 95% chance that a taken prediction is correct. ● Compute performance improvements compared to a simple prediction scheme such

as the branch untaken approach. 108 Solution

● With BPB and BTB: – Penalty due to control hazards= – 20%*(60%*5% + 5%)*2 Cycles – = 0.2 *(0.03+0.05)*2 Cycles – = 0.2 * (0.08) * 2 = 0.032 – Average Cycle Time = 1.032

● With Simple Prediction: – Penalty due to control hazards= – 20%*60%*1=0.2*0.6*1=0.12 – Average Cycle Time=1.12

● Performance Improvement = – 1.12/1.032 = 1.085, i.e. about 8.5% 109

Modern Computer Architectures

Lecture 11:Software Pipelining and Predicated Instructions

110 Software Pipelining

● Eliminates loop-independent dependence through code restructuring. – Reduces stalls – Helps achieve better performance in pipelined execution.

● As compared to simple loop unrolling: – Consumes less code space

111 Software Pipelining cont…

● Central idea: reorganize loops – Each iteration is made from instructions chosen from different iterations of the original loop.

(Figure: a software-pipelined iteration is assembled from instructions drawn from several consecutive iterations i0 … i5 of the original loop.)

112 Software Pipelining cont…

● Just as in a hardware pipeline: – In each iteration of a software-pipelined loop, some instruction of some iteration of the original loop is executed.

113 Software Pipelining cont… - How is this done?
  1 → unroll the loop body with an unroll factor of n (we have taken n = 3 for our example)
  2 → select the order of instructions from different iterations to pipeline
  3 → “paste” instructions from different iterations into the new pipelined loop body

114 Static Loop Unrolling Example

Loop : L.D    F0,0(R1)     ; F0 = array elem.
       ADD.D  F4,F0,F2     ; add scalar in F2
       S.D    F4,0(R1)     ; store result
       DADDUI R1,R1,#-8    ; decrement ptr
       BNE    R1,R2,Loop   ; branch if R1 != R2

115 Software Pipelining: Step 1

Iteration i:     L.D   F0,0(R1)
                 ADD.D F4,F0,F2
                 S.D   F4,0(R1)
Iteration i + 1: L.D   F0,0(R1)
                 ADD.D F4,F0,F2
                 S.D   F4,0(R1)
Iteration i + 2: L.D   F0,0(R1)
                 ADD.D F4,F0,F2
                 S.D   F4,0(R1)

Note:
1.) We are unrolling the loop, hence no loop overhead instructions are needed!
2.) A single loop body of the restructured loop will contain instructions from different iterations of the original loop body.

116 Software Pipelining: Step 2

Iteration i:     L.D   F0,0(R1)
                 ADD.D F4,F0,F2
                 S.D   F4,0(R1)       <- 1.)
Iteration i + 1: L.D   F0,0(R1)
                 ADD.D F4,F0,F2       <- 2.)
                 S.D   F4,0(R1)
Iteration i + 2: L.D   F0,0(R1)       <- 3.)
                 ADD.D F4,F0,F2
                 S.D   F4,0(R1)

Notes:
1.) We’ll select the following order in our pipelined loop: 1.) S.D (from iteration i), 2.) ADD.D (from iteration i + 1), 3.) L.D (from iteration i + 2).
2.) Each instruction (L.D, ADD.D, S.D) must be selected at least once, to make sure that we don’t leave out any instructions of the original loop in the pipelined loop.

117 Software Pipelining: Step 3

Iteration i:     L.D   F0,0(R1)
                 ADD.D F4,F0,F2
                 S.D   F4,0(R1)
Iteration i + 1: L.D   F0,0(R1)
                 ADD.D F4,F0,F2
                 S.D   F4,0(R1)
Iteration i + 2: L.D   F0,0(R1)
                 ADD.D F4,F0,F2
                 S.D   F4,0(R1)

The Pipelined Loop:
Loop : S.D    F4,16(R1)
       ADD.D  F4,F0,F2
       L.D    F0,0(R1)
       DADDUI R1,R1,#-8
       BNE    R1,R2,Loop

118 Software Pipelining: Step 4

Preheader Instructions to fill “software pipeline”

Pipelined Loop Body:
Loop : S.D    F4,16(R1)   ; M[ i ]
       ADD.D  F4,F0,F2    ; M[ i – 1 ]
       L.D    F0,0(R1)    ; M[ i – 2 ]
       DADDUI R1,R1,#-8
       BNE    R1,R2,Loop

Postheader Instructions to drain “software pipeline”

119 Software Pipelined Code

Loop : S.D    F4,16(R1)   ; M[ i ]
       ADD.D  F4,F0,F2    ; M[ i – 1 ]
       L.D    F0,0(R1)    ; M[ i – 2 ]
       DADDUI R1,R1,#-8
       BNE    R1,R2,Loop
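For comparison, a rough C rendering of the same software-pipelined schedule for a[i] = a[i] + c (iterating downward over a 1-based array, as in the earlier loop) is sketched below; the prologue/epilogue handling and the assumption N >= 2 are added details, not from the slides.

/* Original: for (i = N; i > 0; i--) a[i] = a[i] + c;   (1-based array, N >= 2) */
void sw_pipelined(double a[], int N, double c) {
    double loaded, summed;
    /* Prologue: fill the software pipeline. */
    loaded = a[N];
    summed = loaded + c;
    loaded = a[N - 1];
    /* Steady state: store for iteration i+2, add for iteration i+1,
     * load for iteration i (cf. S.D / ADD.D / L.D above). */
    for (int i = N - 2; i > 0; i--) {
        a[i + 2] = summed;
        summed   = loaded + c;
        loaded   = a[i];
    }
    /* Epilogue: drain the software pipeline. */
    a[2] = summed;
    a[1] = loaded + c;
}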

120 Software Pipelining Issues

● Register management can be tricky. – In more complex examples, we may need to increase the iterations between when data is read and when the results are used.

● Optimal software pipelining has been shown to be an NP-complete problem: – Present solutions are based on heuristics.

121 Software Pipelining versus Loop Unrolling

● Software pipelining takes less code space.

● Software pipelining and loop unrolling reduce different types of inefficiencies: – Loop unrolling reduces loop management overheads. – Software pipelining allows a pipeline to run at full efficiency by eliminating loop- independent dependencies.

122 Hardware Support for ILP: Predicated Instructions

• Consider: if (A == 0) { S = T; }
• The following MIPS code would be generated (assuming R1 = A, R2 = S, R3 = T):
      BNEZ  R1,L
      ADDU  R2,R3,R0
   L:
• With predicated instructions:
      CMOVZ R2,R3,R1    ; if (R1 == 0) move R3 to R2
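At the source level, the effect of the conditional move can be pictured with the branch-free C below; this is only an illustrative sketch using the register-to-variable mapping above, not code from the slides.

/* Branching version of:  if (A == 0) { S = T; } */
void with_branch(int A, int *S, int T) {
    if (A == 0)
        *S = T;
}

/* Branch-free version: both values are available and the condition only
 * selects which one is kept, mirroring CMOVZ R2,R3,R1. */
void with_cmov(int A, int *S, int T) {
    *S = (A == 0) ? T : *S;
}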

123 Predication

Predication can remove branch overheads even in general conditional-statement situations.

        instr1
        instr2
if :    p1,p2 <- cmp(a==b)
        jump L1
then :  (p1) instr 3
        (p1) instr 4
        jump L2
else :  (p2) instr 5
        (p2) instr 6

        instr 7
        instr 8

124 A Reflection on Predication

Traditional Architectures : 4 basic blocks          With Predication : 1 basic block

Without predication (4 basic blocks):
        instr1
        instr2
if :    p1,p2 <- cmp(a==b)
        jump p2
then :  (p1) instr 3
        (p1) instr 4
        jump
else :  (p2) instr 5
        (p2) instr 6
        instr 7
        instr 8

With predication (1 new basic block):
        instr1
        instr2
if :    p1,p2 <- cmp(a==b)
        (p1) instr 3
        (p1) instr 4
        (p2) instr 5
        (p2) instr 6
        instr 7
        instr 8

Predication results in more effective use of pipeline 125 Predicated Instructions

● Control-dependence is removed

● Number of instructions is reduced

● Should result in a lower CPI (better performance) without much hardware investment.

● Used in almost every modern processor:

– Special predicate registers supported for this purpose, e.g. Intel Pentium.

126 Modern Computer Architectures

Lecture 12: Dynamic Instruction Scheduling

127 Dynamic Instruction Scheduling

● Scheduling: Ordering the execution of instructions in a program so as to improve performance.

● Dynamic Scheduling:

– The hardware determines the order in which instructions execute.

– Contrast it with a statically scheduled processor where the compiler determines the order of execution.

128 Dynamic Instruction Scheduling: The Need

● We have seen that primitive pipelined processors tried to overcome data dependence through: – Forwarding:

● But, many data dependences can not be overcome this way. – Interlocking: brings down pipeline efficiency.

● Software based instruction restructuring: – Handicapped due to inability to detect many dependences.

129 Dynamic Instruction Scheduling: Key Idea

● Based on the idea of data flow computation --- “Execute an instruction as soon as its operands are available.”

130 Dataflow Execution

● Allow an instruction behind a stall to proceed if it is itself not stalled due to a dependence:

DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14

– ADDD must wait for DIVD (it needs F0), but SUBD depends on neither, so it can proceed past the stalled ADDD.

● Enables out-of-order execution: – May not lead to out-of-order completion.

131 Out of Order Execution cont…

● An instruction is in execution: – Between the time it begins execution and it completes execution.

● In a dynamically scheduled pipeline, – All instructions pass through issue stage in order (in-order issue).

132 Advantages of Dynamic Scheduling

● Can handle dependences unknown at compile time: – E.g. dependences involving memory references.

● Simplifies the compiler.

● Allows code compiled for one pipeline to run efficiently on a different pipeline.

● Hardware speculation can be used: – Can lead to further performance advantages,

builds on dynamic scheduling. 133 Overview of Dynamic Instruction Scheduling

● We shall discuss two schemes for implementing dynamic scheduling: – Scoreboarding: First used in the 1964 CDC 6600 computer. – Tomasulo’s Algorithm: Implemented for the FP unit of the IBM 360/91 in 1966.

● Since scoreboarding is a little closer to in-order execution, we’ll look at it first.

134 A Point to Note About Dynamic Scheduling

● WAR and WAW hazards that did not exist in an in-order pipeline: – Can arise in a dynamically scheduled processor.

135 Scoreboarding cont…

● Scoreboarding allows instructions to execute out of order: –When there are sufficient resources.

● Named after the scoreboard: –Originally developed for CDC 6600.

136 Scoreboarding The 5 Stage MIPS Pipeline

● Split the ID pipe stage of simple 5-stage pipeline into 2 stages:

–Issue: Decode instructions, check for structural hazards. –Read operands: Wait until no data hazards, then read operands.

137 Scoreboarding cont…

● Instructions pass through the issue stage in order.

● Instructions can bypass each other in the read operands stage: – Then enter execution out of order.

138 Scoreboarding Concepts

● We had observed that WAR and WAW hazards can occur in out-of- order execution: –Instructions involved in a dependence are stalled, –But, instructions having no dependence are allowed to continue. –Different units are kept as busy as possible.

139 Scoreboarding Concepts

● Essence of scoreboarding: – Execute instructions as early as possible. – When an instruction is stalled,

● Later instructions are issued and executed if they do not depend on any active or stalled instruction.

140 A Few More Basic Scoreboarding Concepts

● Every instruction goes through the scoreboard: – Scoreboard constructs the data dependences of the instruction. – Scoreboard decides when an instruction can execute. – Scoreboard also controls when an instruction can write its results into the destination register.

141 Scoreboarding

● Out-of-order execution requires multiple instructions to be in the EX stage simultaneously:

– Achieved with multiple functional units, along with pipelined functional units.

● All instructions go through the scoreboard:

– Centralized control of issue, operand reading, execution and writeback. – All hazard resolution is centralized in the scoreboard as well. 142 A Scoreboard for MIPS

(Figure: a scoreboard for MIPS – the registers are connected over data buses, a source of structural hazard, to the functional units: two FP multipliers, an FP divider, an FP adder, and an integer unit; the scoreboard exchanges control/status signals with all of them.)

143 4 Steps of Execution with Scoreboarding 1. Issue: when a f.u. for an instruction is free and no other active instruction has the same destination register: • Avoids structural and WAW hazards. 2. Read operands: when all source operands are available: • Note: forwarding not used. • A source operand is available if no earlier issued active instruction is going to write it. • Thus resolves RAW hazards dynamically.

144 Steps in Execution with Scoreboarding 3. Execution: begins when the f.u. receives its operands; scoreboard notified when execution completes. 4. Write Result: after WAR hazards have been resolved.

• Example:

• DIV.D F0, F2, F4

• ADD.D F10, F0, F8

• SUB.D F8, F8, F14 • ADD.D cannot proceed to read operands until DIV.D completes; SUB.D can execute but not write back until ADD.D has read F8. 145 An Assessment of Scoreboarding

● Pro: Factor of 1.7 improvement for FORTRAN and 2.5 for hand-coded assembly on CDC 6600! – Before semiconductor main memory or caches…

● Scoreboard on the CDC 6600: – Required about as much logic as a functional unit – quite low.

● Cons: – Large number of buses needed:

● However, if we wish to issue multiple instructions per clock, more wires are needed in any case. – Centralized hardware for hazard resolution.

146 An Assessment of Scoreboarding cont…

● Pro: A scoreboard effectively handles true data dependencies:

– Minimizes the number of stalls due to true data dependencies.

● Con: Anti dependences and output dependences (WAR and WAW hazards) are also handled using stalls:

– Could have been better handled.

147 A More Sophisticated Approach: Tomasulo’s Algorithm

● Developed for IBM 360/91: – Goal: To keep the floating point pipeline as busy as possible. – This led Tomasulo to try to figure out how to achieve renaming in hardware!

● The descendants of this have flourished! – , HP 8000, MIPS 10000, Pentium III, PowerPC 604,

148 Key Innovations in Dynamic Instruction Scheduling

● Reservation stations: – Single entry buffer at the head of each functional unit has been replaced by a multiple entry buffer.

● Common Data Bus (CDB): – Connects the output of the functional units to the reservation stations as well as registers.

● Register Tags: – Tag corresponds to the entry number for the instruction producing the result.

149 Reservation Stations

● The basic idea: – An instruction waits in the reservation station, until its operands become available. ● Helps overcome RAW hazards. – A reservation station fetches and buffers an operand as soon as it is available: ● Eliminates the need to get operands from registers.
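The per-entry bookkeeping that appears in the example tables later (Busy, Op, Vj, Vk, Qj, Qk, and the register status field) can be pictured as the C structures below; the exact field types and names are assumptions made for illustration.

typedef struct {
    int    busy;      /* is this reservation station in use?                  */
    int    op;        /* operation to perform on the operands                 */
    double vj, vk;    /* source operand values, once available                */
    int    qj, qk;    /* reservation stations that will produce the missing   */
                      /* operands (0 means the value is already in vj/vk)     */
} ReservationStation;

typedef struct {
    int    qi;        /* reservation station whose result will be written to  */
                      /* this register (0 means no pending write)             */
    double value;     /* current register contents                            */
} RegisterStatus;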

150 Modern Computer Architectures

Lecture 13: Tomasulo’s Algorithm

151 Tomasulo’s Algorithm

● Control & buffers distributed with Function Units (FU) – In the form of “reservation stations” associated with every function unit. – Store operands for issued but pending instructions.

● Register operands in instructions are replaced either by values (if already available) or by pointers to the reservation stations (RS) that will produce them: – Achieves register renaming. – Avoids WAR, WAW hazards without stalling. – Many more reservation stations than registers (why?), so can do optimizations that compilers can’t.

152 Tomasulo’s Algorithm cont… ● Results passed to FUs from RSs, – Not through registers, therefore similar to forwarding. – Over Common Data Bus (CDB) that broadcasts results to all FUs.

● Load and Stores: – Treated as FUs with RSs as well.

● Integer instructions can go past branches: – Allows FP ops beyond basic block in FP queue.

153 Tomasulo’s Scheme

(Figure: Tomasulo’s datapath – instructions come from the instruction queue; the registers and the address unit feed the load buffers, store buffers, and reservation stations, which in turn feed the adder and the multiplier; results from the functional units and the memory unit are broadcast on the CDB to all waiting stations and to the registers.)

154 Three Stages of Tomasulo’s Algorithm
1. Issue: Get an instruction from the instruction queue.
   – Issue the instruction only if a matching reservation station is free (no structural hazard).
   – Send the operand values from the registers, or the name of the functional unit that will produce the result (achieves renaming).
2. Execute: Operate on the operands (EX).
   – When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result: Finish execution (WB).
   – Write on the CDB to all awaiting units; mark the reservation station available.

155 Instruction stream Tomasulo Example Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 Load1 No LD F2 45+ R3 Load2 No MULTD F0 F2 F4 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 3 Load/Buffers Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No FU count 3 FP Adder R.S. down Add3 No Mult1 No 2 FP Mult R.S. Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 0 FU

Clock cycle 156 Tomasulo Example Cycle 1 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 Load1 Yes 34+R2 LD F2 45+ R3 Load2 No MULTD F0 F2 F4 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 1 FU Load1

157 Tomasulo Example Cycle 2 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 Load1 Yes 34+R2 LD F2 45+ R3 2 Load2 Yes 45+R3 MULTD F0 F2 F4 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 2 FU Load2 Load1

Note: Can have multiple loads outstanding 158 Tomasulo Example Cycle 3 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 Load1 Yes 34+R2 LD F2 45+ R3 2 Load2 Yes 45+R3 MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 Yes MULTD R(F4) Load2 Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 3 FU Mult1 Load2 Load1

• Note: registers names are removed (“renamed”) in Reservation Stations; MULT issued 159 • Load1 completing; what is waiting for Load1? Tomasulo Example Cycle 4 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 Load2 Yes 45+R3 MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 Yes SUBD M(A1) Load2 Add2 No Add3 No Mult1 Yes MULTD R(F4) Load2 Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 4 FU Mult1 Load2 M(A1) Add1

• Load2 completing; what is waiting for Load2? 160 Tomasulo Example Cycle 5 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 DIVD F10 F0 F6 5 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk 2 Add1 Yes SUBD M(A1) M(A2) Add2 No Add3 No 10 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 5 FU Mult1 M(A2) M(A1) Add1 Mult2

• Timer starts down for Add1, Mult1 161 Tomasulo Example Cycle 6 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk 1 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 9 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 6 FU Mult1 M(A2) Add2 Add1 Mult2

• Issue ADDD here despite name dependency on F6? 162 Tomasulo Example Cycle 7 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk 0 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 8 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 7 FU Mult1 M(A2) Add2 Add1 Mult2

• Add1 (SUBD) completing; what is waiting for it? 163 Tomasulo Example Cycle 8 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No 2 Add2 Yes ADDD (M-M) M(A2) Add3 No 7 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 8 FU Mult1 M(A2) Add2 (M-M) Mult2

164 Tomasulo Example Cycle 9 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No 1 Add2 Yes ADDD (M-M) M(A2) Add3 No 6 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 9 FU Mult1 M(A2) Add2 (M-M) Mult2

165 Tomasulo Example Cycle 10 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No 0 Add2 Yes ADDD (M-M) M(A2) Add3 No 5 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 10 FU Mult1 M(A2) Add2 (M-M) Mult2

• Add2 (ADDD) completing; what is waiting for it? 166 Tomasulo Example Cycle 11 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 4 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 11 FU Mult1 M(A2) (M-M+M(M-M) Mult2

• Write result of ADDD here? 167 Tomasulo Example Cycle 12 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 3 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 12 FU Mult1 M(A2) (M-M+M(M-M) Mult2

168 Tomasulo Example Cycle 13 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 2 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 13 FU Mult1 M(A2) (M-M+M(M-M) Mult2

169 Tomasulo Example Cycle 14 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 1 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 14 FU Mult1 M(A2) (M-M+M(M-M) Mult2

170 Tomasulo Example Cycle 15 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 15 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 0 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 15 FU Mult1 M(A2) (M-M+M(M-M) Mult2

• Mult1 (MULTD) completing; what is waiting for it? 171 Tomasulo Example Cycle 16 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 15 16 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 40 Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 16 FU M*F4 M(A2) (M-M+M(M-M) Mult2

• Just waiting for Mult2 (DIVD) to complete 172

(skip a couple of cycles)

173 Tomasulo Example Cycle 55 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 15 16 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 1 Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 55 FU M*F4 M(A2) (M-M+M(M-M) Mult2

174 Tomasulo Example Cycle 56 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 15 16 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 56 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 0 Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 56 FU M*F4 M(A2) (M-M+M(M-M) Mult2

• Mult2 (DIVD) is completing; what is waiting for it? 175 Tomasulo Example Cycle 57 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 15 16 Load3 No SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 56 57 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 56 FU M*F4 M(A2) (M-M+M(M-M) Result

• Once again: In-order issue, out-of-order execution and out-of-order completion. 176 Tomasulo’s Scheme: Drawbacks ● Performance is limited by the CDB: – The CDB connects to multiple functional units, leading to high capacitance and high wiring density. – The number of functional units that can complete per cycle is limited to one!

● Multiple CDBs → more FU logic for parallel stores.

● Imprecise exceptions! – Effective handling is a major performance bottleneck.

177 Interrupts/Exceptions

● Interrupts: external, I/O devices, OS.

● Exceptions: internal, errors – Illegal op code, divide by 0, overflow/underflow, page faults.

● OS needs to intervene to handle exceptions.

178 Imprecise Exceptions

● An exception is called imprecise when: – “The processor state when an exception is raised, does not look exactly the same compared to when the instructions are executed in- order.”

179 Imprecise Exceptions

● In an out-of-order execution model, an imprecise exception is said to occur if, when the exception is raised by an instruction:

● some instructions before it (in program order) have not yet completed, and

● some instructions after it have already completed

● For example: – A floating point instruction exception could be detected after an integer instruction that is much later in the program order is complete. 180 Handling Imprecise Exceptions in Dynamic Scheduling

● Instructions are issued in-order: – But, may execute out-of-order. – However, unless control-dependence is resolved an instruction is not executed. – No instruction is allowed to initiate execution until all branches that precede the instructions are complete.

● This is a performance bottleneck: – Average basic block size is about 6 instructions. 181 Modern Computer Architectures

Lecture 14: Dynamic Instruction Scheduling: Loop Example

182 Tomasulo’s Scheme- Loop Example Loop: LD F0 0 R1 MULTD F4 F0 F2 SD F4 0 R1 SUBI R1 R1 #8 BNEZ R1 Loop

183 Tomasulo’s Scheme- Loop Example

● Assume Multiply takes 4 clocks. ● Assume: – 1st load takes 8 clocks (L1 cache miss) – 2nd load takes 1 clock (hit) ● To be clear, we will show clocks for SUBI, BNEZ: – Reality: integer instructions ahead of FP Instructions. ● Show 2 iterations 184 Loop Example Instruction status: ExecWrite ITER Instruction j k IssueCompResult Busy Addr Fu 1 LD F0 0 R1 Load1 No 1 MULTD F4 F0 F2 Load2 No Iter- 1 SD F4 0 R1 Load3 No ation Count 2 LD F0 0 R1 Store1 No 2 MULTD F4 F0 F2 Store2 No 2 SD F4 0 R1 Store3 No Reservation Stations: S1 S2 RS Added Store Buffers Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Mult1 No SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Register result status Instruction Loop

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 185 0 80 Fu Loop Example Cycle 1 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 Load1 Yes 80 Load2 No Load3 No Store1 No Store2 No Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Mult1 No SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 1 80 Fu Load1

186 Loop Example Cycle 2 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 No Load3 No Store1 No Store2 No Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 2 80 Fu Load1 Mult1

187 Loop Example Cycle 3 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 No 1 SD F4 0 R1 3 Load3 No Store1 Yes 80 Mult1 Store2 No Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 3 80 Fu Load1 Mult1

188 ● Implicit renaming sets up data flow graph Loop Example Cycle 4 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 No 1 SD F4 0 R1 3 Load3 No Store1 Yes 80 Mult1 Store2 No Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 4 80 Fu Load1 Mult1

189 ● Dispatching SUBI Instruction (not in FP queue) Loop Example Cycle 5 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 No 1 SD F4 0 R1 3 Load3 No Store1 Yes 80 Mult1 Store2 No Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 5 72 Fu Load1 Mult1

● BNEZ instruction (not in FP queue) 190 Loop Example Cycle 6 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 0 R1 3 Load3 No 2 LD F0 0 R1 6 Store1 Yes 80 Mult1 Store2 No Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 6 72 Fu Load2 Mult1

● Notice that F0 never sees Load from location 80 191 Loop Example Cycle 7 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 0 R1 3 Load3 No 2 LD F0 0 R1 6 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 No Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 7 72 Fu Load2 Mult2

● Register file completely detached from computation 192 ● First and Second iteration completely overlapped Loop Example Cycle 8 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 0 R1 3 Load3 No 2 LD F0 0 R1 6 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 8 72 Fu Load2 Mult2

193 Loop Example Cycle 9 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 0 R1 3 Load3 No 2 LD F0 0 R1 6 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 9 72 Fu Load2 Mult2

● Load1 completing: who is waiting? 194

● Note: Dispatching SUBI Loop Example Cycle 10 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 0 R1 3 Load3 No 2 LD F0 0 R1 6 10 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 4 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 10 64 Fu Load2 Mult2

● Load2 completing: who is waiting? 195 ● Note: Dispatching BNEZ Loop Example Cycle 11 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 Load2 No 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 3 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 4 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 11 64 Fu Load3 Mult2

● Next load in sequence 196 Loop Example Cycle 12 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 Load2 No 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 2 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 3 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 12 64 Fu Load3 Mult2

● Why not issue third multiply? 197 Loop Example Cycle 13 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 Load2 No 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 1 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 2 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 13 64 Fu Load3 Mult2

● Why not issue third store? 198 Loop Example Cycle 14 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 14 Load2 No 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 0 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 1 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 14 64 Fu Load3 Mult2

● Mult1 completing. Who is waiting? 199 Loop Example Cycle 15 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 14 15 Load2 No 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2 2 MULTD F4 F0 F2 7 15 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Mult1 No SUBI R1 R1 #8 0 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 15 64 Fu Load3 Mult2

● Mult2 completing. Who is waiting? 200 Loop Example Cycle 16 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 14 15 Load2 No 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2 2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2 2 SD F4 0 R1 8 Store3 No Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 4 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 16 64 Fu Load3 Mult1

201 Loop Example Cycle 17 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 14 15 Load2 No 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2 2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2 2 SD F4 0 R1 8 Store3 Yes 64 Mult1 Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 17 64 Fu Load3 Mult1

202 Loop Example Cycle 18 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 14 15 Load2 No 1 SD F4 0 R1 3 18 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2 2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2 2 SD F4 0 R1 8 Store3 Yes 64 Mult1 Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 18 64 Fu Load3 Mult1

203 Loop Example Cycle 19 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 14 15 Load2 No 1 SD F4 0 R1 3 18 19 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 No 2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2 2 SD F4 0 R1 8 19 Store3 Yes 64 Mult1 Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 19 56 Fu Load3 Mult1

204 Loop Example Cycle 20 Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 Yes 56 1 MULTD F4 F0 F2 2 14 15 Load2 No 1 SD F4 0 R1 3 18 19 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 No 2 MULTD F4 F0 F2 7 15 16 Store2 No 2 SD F4 0 R1 8 19 20 Store3 Yes 64 Mult1 Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 20 56 Fu Load1 Mult1 • Once again: In-order issue, out-of-order execution and out-of-order completion. 205 Why Can Tomasulo’s Scheme Overlap Iterations of Loops?

● Register renaming using reservation stations: –Avoids the WAR stall that occurred in the scoreboard. –Also, multiple iterations use different physical destinations facilitating dynamic loop unrolling.

206 Tomasulo’s Scheme Offers Three Major Advantages

• 1. Distribution of hazard detection logic: – Distributed reservation stations. – If multiple instructions wait on a single result,

● the result can be passed to all of them simultaneously by broadcast on the CDB. – If a centralized register file were used,

● the waiting units would have to read their results from the registers one at a time. ● 2. Elimination of stalls for WAW and WAR hazards. ● 3. Possible to have superscalar execution: – Because results are directly available to the FUs, rather than read from registers.

207 Modern Computer Architectures

Lecture 15: (Cont…)

208 Speculative Execution

● So far, we have seen that dynamic instruction scheduling can improve pipeline performance: – Effectively takes care of data hazards.

● For higher performance: – We need to overcome control hazards.

● Main idea: After predicting a branch: – Continue executing assuming that the prediction is correct.

209 What is Speculative Execution?

● Allow an instruction to execute even before its control dependencies are resolved: – Speculating that all predictions are accurate.

210 Speculative Execution

● Extends dynamic scheduling scheme.

● In dynamic scheduling, instructions are fetched and issued: – In speculative execution, instructions are fetched, issued, and executed.

● Implemented in many modern processors: – PowerPC, Pentium, AMD Athlon, etc.

211 Speculative Execution

● Recollect that in Tomasulo’s approach: – Until the controlling branches have executed, – Instructions only allowed to be fetched and issued (but not actually executed).

● Speculation takes this approach a step further: – Actually executes an instruction based on branch prediction.

212 Hardware-Based Speculative Execution

● In speculative execution: – Fetch, issue, and execute instructions as if the branch predictions were correct. – Dynamic scheduling only fetches and issues such instructions.

213 Hardware-Based Speculation ● Combines three main ideas: – Profile-based branch prediction. – Speculation: allow instructions to be executed even before control dependences are resolved. – Dynamic instruction scheduling.

● Used in almost every modern processor: – PowerPC, Pentium, Alpha, AMD Athlon, etc.

214 What If Speculative Execution Needs to be Undone

● An instruction should execute in shadow: – Until it commits.

● Also, we must separate passing of results among instructions: – From actual completion of an instruction. – This would allow an instruction to be undone if a branch prediction turns out to be incorrect. 215 Support for Shadow Execution ● Additional set of hardware registers: – Called reorder buffers (ROBs) – Store results of instructions (in shadow) that have completed but not yet committed.

● Reason for calling it a “reorder buffer”: – Even though instructions may complete in any order: – They are reordered in the ROB so that they commit in-order.

216 Reorder Buffers (ROB)

● Put instructions back to order: – Instructions enter ROB out of order – Instructions leave ROB in order

● Results of an instruction become visible externally when it leaves ROB: – Registers updated; memory updated
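A reorder-buffer entry and the in-order commit step can be sketched in C as below; the field names, the register-only commit (stores omitted), and the circular-buffer layout are simplifying assumptions consistent with the description above.

#include <stdbool.h>

typedef struct {
    bool   busy;       /* allocated to an issued instruction            */
    bool   ready;      /* result computed (seen on the CDB)             */
    int    dest_reg;   /* architectural register to update              */
    double value;      /* result held "in shadow" until commit          */
} ROBEntry;

/* Commit: retire entries strictly in order from the head of the ROB. */
void commit(ROBEntry rob[], int *head, int size, double regs[]) {
    while (rob[*head].busy && rob[*head].ready) {
        regs[rob[*head].dest_reg] = rob[*head].value;  /* state becomes visible */
        rob[*head].busy = false;
        *head = (*head + 1) % size;                    /* advance circular head */
    }
}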

217 Speculative Execution: A Central Idea ● Allow instruction to execute out-of-order: – Force them to complete in-order – Also, prevent any irrevocable action (updating state, or generating exceptions) occurring out of order.

218 Major Changes Over Tomasulo’s Scheme

● Two major changes compared to the Tomasulo’s scheme are: – Addition of ROB – Elimination of the Store buffer whose functions are integrated into ROB.

219 Speculative Execution

(Figure: speculative Tomasulo datapath – instructions come from the instruction unit into the instruction queue; a reorder buffer holding Reg#/Data sits in front of the registers; the address unit, load buffers, and reservation stations feed the adder and multiplier; results are broadcast on the CDB to the memory unit and all waiting units.)

220 Four Steps of Speculative Execution

● Issue: – Issue an instruction if there is an empty reservation station and an empty slot in ROB.

● Execute: – If one or more operands not yet available:

● monitor CDB and wait. – Executing an instruction may take multiple cycles.

● Write result: Write to CDB

● Received at waiting reservation stations and ROB.

221 Four Steps of Speculative Execution

● Commit: – Normal commit: update registers, remove from ROB. – Store: Similar to normal commit, except that memory is updated rather than registers. – Branch with incorrect prediction: ROB is flushed, execution restarted at correct successor of the branch. 222 Speculation During Exceptional Events

● Most pipelines allow: – Only low-cost exceptional events to be handled in speculative mode.

● Processor is stalled for expensive exceptional events, such as: – Second-level cache miss, or TLB miss.

223 What About Imprecise Exceptions?

● Speculative execution scheme: –In-order issue, out-of-order execution, and out-of-order completion.

● Need to “fix” the imprecise exceptions in out-of-order completion.

224 Fixing Imprecise Exceptions

● An instruction executes in shadow: – Until it completes (commits). – Exceptions are masked until instruction commits.

● Passing of the results among instruction: – Separated from actual completion of instruction. – Allows an instruction to be undone.

● Instructions execute out of order, but commit in order.

225 Software-based Scheduling vs. Hardware-based Scheduling • Advantage of the Software Approach: Unlike with hardware-based approaches: • Overheads due to the analysis of an instruction sequence are not an issue. • We can afford to perform a more detailed analysis of the instruction sequence. • It is possible to consider many more factors in optimizing an instruction sequence.

226 Advanced Compiler Support for Exploiting ILP

● Remember loop-carried dependence: – Data values produced in earlier iterations are used in later iterations. – This prevents concurrent execution of the instructions from different loop iterations.

227 Loop-Carried Dependence

● Example:

– for(i=0;i<1000;i++){
–   a[i+1]=b[i]+c[i];
–   b[i+1]=a[i+1]+d[i];
– }

● Loop-carried dependence: each iteration uses the value b[i] produced by the preceding iteration.

● Also, loop-independent dependence on account of a[i+1]

228 Eliminating Loop-Carried Dependences

● When there are circular dependences across loop iterations: – Loop-carried dependences can not be eliminated through code transformations.

● Circular dependency: – One iteration (m) uses the result of another iteration (say n). – At the same time, the other iteration (n) uses results computed by this iteration (m).

229 Transformation of A Basic Block ● It is possible to rewrite a loop to eliminate loop-carried dependences:

– Only if, there are no cyclic dependences.

With dependence:
for(i=1;i<1000;i++){
  a[i]=a[i]+b[i];
  b[i+1]=c[i]+d[i];
}

Without dependence:
a[1]=a[1]+b[1];
for(i=1;i<999;i++){
  b[i+1]=c[i]+d[i];
  a[i+1]=a[i+1]+b[i+1];
}
b[1000]=c[999]+d[999];

230 Example of Cyclic Dependence

● Example:

for(i=1;i<1000;i++){
  a[i]=a[i+1]+b[i];
  b[i+1]=c[i]+d[i];
}

● Iteration i produces a result for iteration i+1.
● Iteration i uses a result produced by iteration i+1.
● Hence the dependence is cyclic.

231 Recurrent Dependence

● Sometimes a loop-carried dependence is in the form of a recurrence:

for(i=1;i<1000;i++){
  a[i]=a[i-5]+b[i];
}

● Dependence distance = 5. (Recurrent dependence)

232 Recurrent Dependence

● The larger the dependence distance: – The more parallelism can be obtained by unrolling the loop.

● For a loop with a dependence distance of 5: – Any sequence of 5 consecutive iterations has no dependences among its statements.

● Some architectures (e.g. vector processors): – Special support for handling recurrences.

233 Dependence Analysis

● Nearly all dependence analysis algorithms work on the assumption: – Array indices are affine.

● An array index is affine, iff: – It can be written in the form a*i + b, where i is the loop index. – Successive iterations then access elements a fixed stride apart.

● A multidimensional array is affine, iff: – Index along each direction is affine.

● Give example of an array index that is not affine.

234 Dependence Analysis

● Sparse arrays usually have indices of the form: – x[y[i]] : non-affine

● A loop-carried dependence exists iff: – There are two iteration indices j and k, both within the limits of the loop, such that – the jth iteration stores to element a*j+b and the kth iteration reads from element c*k+d, and – a*j+b = c*k+d

235 GCD Test

● When the array indices are affine:

– If a loop-carried dependence exists, GCD(c,a) must divide d-b.

● Example: Examine if there exists any loop-carried dependence in:

for(i=1;i<1000;i++){
    x[2*i+3]=x[2*i]+5.0;
}

Here a=2, b=3, c=2, d=0; GCD(a,c)=2 and d-b=-3. Since 2 does not divide -3, no dependence exists.
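The test above can be sketched as a small C helper (an illustration, not part of any particular compiler; it assumes the write index is a*i+b and the read index is c*i+d, as in the slide):

#include <stdlib.h>   /* abs */

/* Euclid's algorithm for the greatest common divisor. */
static int gcd(int x, int y) {
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x;
}

/* GCD test for affine indices: store to x[a*i+b], load from x[c*i+d].
 * Returns 0 when a loop-carried dependence is impossible;
 * 1 means a dependence *may* exist (the test is conservative). */
int gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(abs(a), abs(c));
    if (g == 0) return 1;              /* degenerate indices: be conservative */
    return ((d - b) % g) == 0;
}

/* Slide's example: a=2, b=3, c=2, d=0 -> GCD=2, d-b=-3, so no dependence. */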

236 GCD Test: A Reflection

● Is it possible that the GCD test succeeds,

– But, there is actually no dependence? – Yes. – This may be because the GCD test does not take the loop bound constraint into account.

● It may not be possible to apply GCD test

– When the values of a,b,c,d are not known at compile time – a,b,c, or d may be defined at run time.

237 Modern Computer Architectures

Lecture 16:Superscalar and VLIW Processors

238 A Practice Problem on Dependence Analysis

● Identify all dependences in the following code.

● Transform the code to eliminate the dependences.

for(i=1;i<1000;i++){ y[i]=x[i]/c; x[i]=x[i]+c; z[i]=y[i]+c; y[i]=c-y[i]; }

239 Transformed Code Without Dependence

for(i=1;i<1000;i++){ t[i]=x[i]/c; x[i]=x[i]+c; z[i]=t[i]+c; y[i]=c-t[i]; }

240 Global Code Scheduling

● Simple code transformations work well, only if: – The loop body is a straight line code.

● Issues become more complex in the presence of: – Nested loops, nested branches, etc.

● Instructions might have to be moved across branches: – This is called global code scheduling.

241 Two Paths to Higher ILP

● Superscalar processors: – Multiple issue, dynamically scheduled, speculative execution, branch prediction – More hardware functionalities and complexities.

● VLIW: – Let the compiler take on the complexity. – Simple hardware, smart compiler.

242 Superscalar Execution

● Scheduling of instructions is determined by a number of factors: – True Data Dependency: The result of one operation is an input to the next. – Resource constraints: Two operations require the same resource. – Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a-priori.

● An appropriate number of instructions is issued per cycle: – If up to m instructions can be issued in a cycle, the processor is called a superscalar processor of degree m.

243 Very Long Instruction Word (VLIW) Processors

● Hardware cost and complexity of superscalar schedulers is a major consideration in processor design. – VLIW processors rely on compile time analysis to identify and bundle together instructions that can be executed concurrently.

● These instructions are packed and dispatched together, – Thus the name very long instruction word. – This concept is employed in the Intel IA64 processors.

244 VLIW Processors

● The compiler has complete responsibility for selecting a set of instructions: – That can be executed concurrently.

● VLIW processors have static instruction issue capability: – As compared, superscalar processors have dynamic issue capability.

245 The Basic VLIW Approach

● VLIW processors deploy multiple independent functional units.

● Early VLIW processors operated in lock step: – There was no hazard detection in hardware at all. – A stall in any functional unit caused the entire pipeline to stall. 246 VLIW Processors

● Assume a 4-issue static superscalar processor: – During fetch stage, 1 to 4 instructions would be fetched. – The group of instructions that could be issued in a single cycle are called:

● An issue packet or a Bundle. – If an instruction could cause a structural or data hazard:

● It is not issued.

247 VLIW (Very Long Instruction Word) ● One single VLIW instruction:

– separately targets different functional units.

● MultiFlow TRACE, TI C6X, IA-64

[Schematic of a VLIW instruction: one bundle — add r1,r2,r3 | load r4,r5+4 | mov r6,r2 | mul r7,r8,r9 — with each slot dispatched to its own functional unit (FU).]

248 VLIW Processors: Some Considerations

● Issue hardware is simpler.

● Compiler has a bigger context from which to select co-scheduled instructions.

● Compilers, however, do not have runtime information such as cache misses. – Scheduling is, therefore, inherently conservative. – Branch and memory prediction is more difficult.

● Typical VLIW processors are limited to 4-way to 8-way parallelism.

249 VLIW Summary

● Each “instruction” is very large – Bundles multiple operations that are independent.

● Compiler detects hazards and determines the schedule.

● There is no (or only partial) hardware hazard detection: – No dependence check logic for instructions issued at the same cycle.

● Tradeoff instruction space for simple decoding – The long instruction word has room for many operations. – But slots have to be filled with NOPs if enough independent operations cannot be found (see the sketch below). 250
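A rough illustration of the bundle idea (not the encoding of any real VLIW machine): a bundle is a fixed set of operation slots, one per functional unit, and any slot the compiler cannot fill becomes a NOP.

#define N_SLOTS 4              /* e.g. one slot each for ALU, load/store, move, multiply units */

typedef enum { OP_NOP, OP_ADD, OP_LOAD, OP_MOV, OP_MUL } opcode_t;

typedef struct { opcode_t op; int dst, src1, src2; } operation_t;

typedef struct { operation_t slot[N_SLOTS]; } bundle_t;   /* one "very long instruction" */

/* The compiler packs independent operations; any slot it cannot fill is a NOP. */
static void init_bundle(bundle_t *b) {
    for (int i = 0; i < N_SLOTS; i++) {
        b->slot[i].op = OP_NOP;
        b->slot[i].dst = b->slot[i].src1 = b->slot[i].src2 = 0;
    }
}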

VLIW vs Superscalar

● VLIW - Compiler finds parallelism: – Superscalar – hardware finds parallelism

● VLIW – Simpler hardware: – Superscalar – More complex hardware

● VLIW – less parallelism can be exploited for a typical program: – Superscalar – Better performance

251 Superscalar Processors

● Commercial desktop processors now do four or more issues per clock: – Even in the embedded processor market, dual issue superscalar pipelines are becoming common.

252 Superscalar Execution With Dynamic Scheduling

● Multiple instruction issue: – Very well accommodated with dynamic instruction scheduling approach.

● The issue stage can be: – Replicated, pipelined, or both.

253 Limitations of Scalar Pipelines: A Reflection

● Maximum throughput bounded by one instruction per cycle.

● Inefficient unification of instructions into one pipeline: – ALU, MEM stages very diverse eg: FP

● Rigid nature of in-order pipeline: – If a leading instruction is stalled every subsequent instruction is stalled

254 A Rigid Pipeline

[Figure: a rigid in-order pipeline — stalls propagate backward, and bypassing of a stalled instruction is not allowed.]

255 Solving Problems of Scalar Pipelines: Modern Processors

● Maximum throughput bounded by one instruction per cycle: – parallel pipelines (superscalar)

● Inefficient unification into a single pipeline: – diversified pipelines.

● Rigid nature of in order pipeline – Allow out of ordering or dynamic instruction scheduling.

256 Machine Parallelism

(a) No Parallelism (Nonpipelined) (b) Temporal Parallelism (Pipelined) (c) Spatial Parallelism (Multiple units) (d) Combined Temporal and Spatial Parallelism

257 A Parallel Pipeline

Width = 3

258 Scalar and Parallel Pipeline

(a) The five-stage scalar pipeline (b) The five-stage Pentium Parallel Pipeline of width=2

259 Diversified Parallel Pipeline

260 A Dynamically Scheduled Speculative Pipeline

261 Distributed Reservation Stations

262 A Superscalar Pipeline

A degree six superscalar pipeline

263 Superscalar Pipeline Design

[Figure: superscalar pipeline stages and the buffers between them — Fetch → Instruction Buffer → Decode → Dispatch Buffer → Dispatch → Issuing Buffer → Execute → Completion Buffer → Complete → Store Buffer → Retire; the front end is governed by instruction flow, the back end by data flow.]

264 A Superscalar MIPS Processor

● Assume two instructions can be issued per clock cycle: – One of the instructions can be a load, store, or integer ALU operation. – The other can be a floating-point operation.

265 MIPS Pipeline with Pipelined Multi- Cycle Operations

[Figure: MIPS pipeline with pipelined multi-cycle FP units — IF and ID feed either the integer EX stage, the seven-stage multiplier (M1–M7), the four-stage FP adder (A1–A4), or an unpipelined DIV unit, before MEM and WB.]

Pipelined implementations ex: 7 outstanding MUL, 4 outstanding Add, unpipelined DIV. In-order execution, out-of-order completion

Tomasulo w/o ROB: out-of-order execution, out-of-order completion, in-order commit. 266 Modern Computer Architectures

Lecture 17:Superscalar and Vector Processors

267 Vector Processing

● What is a vector processor? – A vector processor supports high-level operations (add, subtract, multiply, etc.) on vectors. – SIMD processing

● A typical instruction might add two 64 element FP vectors.

● Commercialized long before ILP machines.

268 Vector Processing

[Figure: SCALAR (1 operation) vs. VECTOR (N operations) — add r3, r1, r2 adds the scalar registers r1 and r2 into r3; add.vv v3, v1, v2 adds the vector registers v1 and v2 element-wise into v3, over the current vector length.]
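Written as scalar C, a single vector add such as add.vv v3, v1, v2 corresponds to the loop below (64-element vectors assumed); every element operation is independent, so the hardware can pipeline or replicate the adds freely:

/* What one vector add does, expressed element-by-element in C. */
void vadd(double v3[64], const double v1[64], const double v2[64]) {
    for (int i = 0; i < 64; i++)
        v3[i] = v1[i] + v2[i];
}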

269 Why Vector Processors?

● One vector instruction is equivalent to executing an entire loop:

– Reduces instruction fetch and decode overheads and bandwidth.

● Each instruction is guaranteed to be independent of other instructions in the same vector:

– No data hazard check needed in an instruction. – Executed using an array of functional units, or a deep pipeline.

270 Why Vector Processors?

● Hardware needs to only check for data hazards between two instructions:

– Once per two vector instructions. – More instructions handled per data check.

● Memory access for entire vector, not a single word.

– Reduced Latency

● Multiple vector instructions in progress.

– Further parallelism

271 Basic Vector Architectures ● Two Types: – Vector-register:

● All operations except load and store based on registers. – Memory-memory:

● All operations are memory to memory.

● A vector register:

– Fixed length, holds a single vector

272 Issue 1: Memory Bandwidth

● Problem:

– Memory system needs to be able to produce and accept large amounts of data. – But how do we achieve this when there is poor access time?

● Solution:

– Creating multiple memory banks.

● Also, useful for fragmented accesses.

● Supports multiple loads per clock cycle.

273 Issue 2: Vector Length

● Problem:

– How do we support operations where the length is unknown or not the same as vector length?

● Solution:

– Provide a vector-length register; this alone solves the problem only if the actual length is at most the maximum vector length (MVL). – Otherwise, use a technique called strip mining.

274 Vector Length Register (VLR)

● A vector register can hold some maximum number of elements for each data width – maximum vector length or MVL.

● What to do when the application vector length is not exactly MVL? – Vector-length register(VLR) controls the length of any vector operation, including a vector load or store – E.g. vadd with VL=10 is for (I=0; I<10; I++) V1[I]=V2[I]+V3[I]

● VL can be anything from 0 to MVL.

● How do you code an application where the vector length is not known until run-time?

275 Strip Mining

● Helps handle vector operations for sizes greater than MVL.

● Creates 2 loops:

– One that handles any number of iterations multiple of MVL. – Another that handles the remaining iterations.

● Code becomes vectorizable.

● Careful handling of VLR needed.

276 Example: Strip Mining

low = 1;                              /* assume vectors start at element 1 */
vL = n % mvL;                         /* find the odd-size piece first     */
for (j = 0; j <= n/mvL; j++) {        /* outer loop                        */
    for (i = low; i <= low+vL-1; i++) {   /* inner loop runs for length vL */
        y[i] = a*x[i] + y[i];
    }
    low = low + vL;                   /* find the start of the next vector */
    vL = mvL;                         /* reset length to the maximum       */
}

277 Present Applications of Vector Processors: Media Processing

● Desktop: – 3D graphics (games) – Speech recognition (voice input) – Video/audio decoding (mpeg-mp3 playback) ● Servers: – Video/audio encoding (video servers, IP telephony) – Digital libraries and media mining (video servers) – Computer animation, 3D modeling & rendering (movies) ● Embedded: – 3D graphics (game consoles) – Video/audio decoding & encoding (set top boxes) – Image processing (digital cameras) – Signal processing (cellular phones)

278 Basic Idea

● Media Processors are short vector processors.

● Exploit sub-word parallelism: – Treat a 64-bit register as a vector of 2 32-bit or 4 16-bit or 8 8-bit values (short vectors) – Partition 64-bit data paths to handle multiple narrow operations in parallel
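A software sketch of the same idea (sometimes called SWAR): four 16-bit lanes packed into one 64-bit word are added with a masking trick that keeps carries from crossing lane boundaries. This only illustrates the principle; real media extensions do it in hardware and typically add saturation as well.

#include <stdint.h>

/* Add four packed 16-bit lanes held in 64-bit words, without letting a
 * carry from one lane spill into the next (per-lane wrap-around add). */
uint64_t paddw_4x16(uint64_t x, uint64_t y) {
    const uint64_t H = 0x8000800080008000ULL;   /* top bit of every lane        */
    uint64_t low_sum = (x & ~H) + (y & ~H);     /* add the low 15 bits per lane */
    return low_sum ^ ((x ^ y) & H);             /* restore each lane's top bit  */
}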

279 Characteristics of Multimedia Applications ● Narrow data-types: – Typical width of data in memory: 8 to 16 bits – Typical width of data during computation: 16 to 32 bits – 64-bit data types are rarely needed. – Fixed-point arithmetic often replaces floating- point.

● Fine-grain (data) parallelism: – Identical operation applied on streams of data – Branches have high predictability – High instruction locality in small loops or kernels

280 Characteristics of Media Applications

● Most audio/video samples are processed independently of each other: – Very few branches. – Ten or more operations can be scheduled in parallel. – In contrast, other applications suffer from large data hazards and control hazards.

281 Overview of SIMD Extensions

Vendor     Extension      Year    # Instr         Registers
HP         MAX-1 and 2    94, 95  9, 8 (int)      Int 32x64b
Sun        VIS            95      121 (int)       FP 32x64b
Intel      MMX            97      57 (int)        FP 8x64b
AMD        3DNow!         98      21 (fp)         FP 8x64b
Motorola   Altivec        98      162 (int, fp)   32x128b (new)
Intel      SSE            98      70 (fp)         8x128b (new)
MIPS       MIPS-3D        ?       23 (fp)         FP 32x64b
AMD        E 3DNow!       99      24 (fp)         8x128 (new)
Intel      SSE-2          01      144 (int, fp)   8x128 (new)

282 Compilers for SIMD Extensions

● No commercially available compiler so far.

● Problems: – Language support for expressing fixed- point arithmetic and SIMD parallelism. – Complicated model for loading/storing vectors. – Frequent updates.

● Assembly coding is prevalent.

283 Superscalar Versus Vector Processing

● A vector processor can efficiently exploit parallelism from regular code:

– Matrix operations. – Multimedia operations, scientific computations, etc.

● A superscalar processor can exploit reasonable amount of parallelism in less structured code:

– Typical programs.

284 Intel MMX: A Vector Processor

● 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64bits – reuse 8 FP registers (FP and MMX cannot mix)

● short vector: load, add, store 8 8-bit operands

+

● Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ... – use in drivers or added to library routines; no compiler.

285 MMX Instructions

● Move 32b, 64b

● Add, Subtract in parallel: 8 8b, 4 16b, 2 32b: – Signed/unsigned saturate (set to max) if overflow

● Shifts (sll,srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b

● Multiply, Multiply-Add in parallel: 4 16b

● Compare = , > in parallel: 8 8b, 4 16b, 2 32b – sets field to 0s (false) or 1s (true);

286 Modern Computer Architectures

Lecture 18: A Survey of Some Commercial Processors

287 A Survey of Some Modern Processors

● In the subsequent slides: – We shall examine some modern commercial processors. – The objective is to examine how architectural innovations we discussed have been used in commercial processors.

288 Early Intel

● Intel 8080 – 64K addressable RAM – 8-bit registers – CP/M operating system – S-100 BUS architecture – 8-inch floppy disks!

● Intel 8086/8088 – Used in IBM-PCs – 1 MB addressable RAM – 16-bit registers – 16-bit data bus (8-bit for 8088) – separate floating-point unit (8087)

289 Microprocessors

● Intel 8086, 80286

● IA-32 processor family

processor family

● Netburst family

290 x86 Processor History ● IA-16 Processors

● 8086 - Intel's first 16bit PC microprocessor.

● 8088 - A minor refinement of the 8086.

● 80186 - An extension to the 8086.

● 80286 - A reasonably successful extension to the 8086.

● IA-32 Processors

● 80386 - Intel's first (32-bit, protected mode) processor.

● 80486 - A much improved 80386, use of instruction pipe.

● 80586 - A much improved 80486, named the Pentium.

● 80686 - An improved 80586, named the Pentium Pro.

● 80586+MMX - A refined 80586, faster, and with MMX extensions, named the Pentium MMX.

● 80686+MMX - A refined 80686, faster, and with MMX extensions, named the Pentium II.

● 80686+MMX+SSI - A refined 80686+MMX, with SSE extensions, named the Pentium III.

● IA-64 Processors

● Newer "Pentium" processors, Intel's attempt at a 64-bit architecture.

291 Intel IA-32 Family

● Intel 386: 1985 – 4 GB addressable RAM, 32-bit registers, paging ().

● Intel 486: 1989 – Instruction pipelining

● Pentium: 1993 – Superscalar, 32-bit address bus, 64- bit internal data path.

292 IA-32

● 8086 began an evolution:

– Eventually resulted in IA-32 family of object code compatible microprocessors.

● IA-32 is a CISC architecture:

– Variable length instructions and complex addressing modes. – Turned out to be the most dominant architecture of the time in terms of sales volume. – 1985: Intel 386 – 1989: First pipelined version of IA-32 family Intel 486 was introduced.

293 386 and onto 486

● 80386 was first IA-32 implementation: – Included several architectural improvements in addition to the wider data path.

● Perhaps the most important feature was extension of the virtual memory architecture: – Includes both the segmentation used in the 80286 and paging --- the preferred technique in the Unix world.

● 486 entirely improved 386: – Pipelined – With an on-chip floating point unit.

294 Intel 486 5-Stage “CISC” Pipeline

Stage name                 Function performed
1. Instruction Fetch       Fetch instruction from the 32-byte prefetch queue
2. Instruction Decode-1    Translate instruction into control signals or microcode address; initiate address generation and memory access
3. Instruction Decode-2    Access microcode memory; output the microinstruction to the execution unit
4. Execute                 Execute ALU and memory-accessing operations
5. Register Write-back     Write back results to the register file

295 i486 Pipeline

● Fetch – Load 16 bytes of instruction into the prefetch buffer
● Decode1 – Determine instruction length and instruction type
● Decode2 – Compute memory address – Generate immediate operands
● Execute – Register read – ALU operation – Memory read/write
● Write-Back – Update register file

296 A Reflection on 486 Pipeline

● Two Decoding Stages: –Harder to decode CISC instructions. –Inevitable due to microcoded control. –Effective address calculation in D2.

● Multicycle Decoding Stages: –For more difficult decodings. –Stalls incoming instructions.

297 A Note on the 486 CISC Pipeline ● The EXE stage performs : – Both ALU operations as well as cache access.

● Two penalty cycles are incurred: – If an instruction produces a register result and the next instruction uses this result for address generation.

● Pipelined 486 could achieve performance improvement by a factor of about 25 over 386.

298 486 vs. 386

● Cycles Per Instruction

Instruction Type    386 Cycles    486 Cycles
Load                4             1
Store               2             1
ALU                 2             1
Jump taken          9             3
Jump not taken      3             1
Call                9             3

● Reasons for Improvement: – On-chip cache ● Faster loads & stores – Deeper pipeline

299 The Intel P5 and P6 Family

Year   Type                 Transistors (x1000)   Feature size (µm)   Clock (MHz)   Issue   Word format   L1 cache          L2 cache

P5:
1993   Pentium              3100                  0.8                 66            2       32-bit        2 x 8 kB
1994   Pentium              3200                  0.6                 75-100        2       32-bit        2 x 8 kB
1995   Pentium              3200                  0.6/0.35            120-133       2       32-bit        2 x 8 kB
1996   Pentium              3300                  0.35                150-166       2       32-bit        2 x 8 kB
1997   Pentium MMX          4500                  0.35                200-233       2       32-bit        2 x 16 kB
1998   Mobile Pentium MMX   4500                  0.25                200-233       2       32-bit        2 x 16 kB

P6:
1995   PentiumPro           5500                  0.35                150-200       3       32-bit        2 x 8 kB          256/512 kB
1997   PentiumPro           5500                  0.35                200           3       32-bit        2 x 8 kB          1 MB
1998   Intel Celeron        7500                  0.25                266-300       3       32-bit        2 x 16 kB         --
1998   Intel Celeron        19000                 0.25                300-333       3       32-bit        2 x 16 kB         128 kB
1997   Pentium II           7000                  0.25                233-450       3       32-bit        2 x 16 kB         256 kB/512 kB
1998   Mobile Pentium II    7000                  0.25                300           3       32-bit        2 x 16 kB         256 kB/512 kB
1998   Pentium II           7000                  0.25                400-450       3       32-bit        2 x 16 kB         512 kB/1 MB
1999   Pentium II Xeon      7000                  0.25                450           3       32-bit        2 x 16 kB         512 kB/2 MB
1999   Pentium III          8200                  0.25                450-1000      3       32-bit        2 x 16 kB         512 kB
1999   Pentium III Xeon     8200                  0.25                500-1000      3       32-bit        2 x 16 kB         512 kB

NetBurst:
2000   Pentium 4            42000                 0.18                1500          3       32-bit        8 kB / 12k µOps   256 kB

(The Pentium 4 transistor count includes the L2 cache.)

300 Pentium Block Diagram

[Pentium block diagram omitted; source: Microprocessor Report, 10/28/92.]

301 Pentium Overview

● Architecturally Pentium is vastly different from 486.

● Pentium is essentially one full 486 execution unit (EU), called the U pipe: – Plus a second, stripped-down unit called the V pipe.

● The two pipes are capable of executing instructions simultaneously: – Separate write buffers and even simultaneous access to the data cache. – This is how the Pentium is a superscalar of degree two. 302

Pentium Overview cont… ● How can Pentium supply data and instructions at a much faster rate? – At least twice as fast as 486?

● 486 has a single 8K L1 data/instruction cache: – Pentium has two separate 8K L1 caches, one for code and the other for data.

● Also, the Pentium expands the 486's 32-byte prefetch queue to 128 bytes.

303 486 vs. 586 Pipeline

(a) The five-stage i486 scalar pipeline (b) The five-stage Pentium Parallel Pipeline of width=2

304 Pentium Pipeline

[Figure: the Pentium pipeline — stage 1: fetch and align instruction; stage 2: decode instruction, generate control word; then, in each of the U-pipe and V-pipe: decode control word and generate memory address; access the data cache or calculate the ALU result; write the register result.]

305 Superscalar Execution

● Can Execute Instructions I1 & I2 in Parallel if: – I1 or I2 is not a jump – Destination of I1 not source of I2 – Destination of I1 not destination of I2

● If Conditions Don’t Hold – Issue I1 to U Pipe – I2 issued on next cycle

● Possibly paired with following instruction
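A hedged sketch of how these pairing conditions could be checked for a simplified instruction record (the field names are made up for illustration and are not Intel's actual decode logic; the jump condition is interpreted conservatively):

typedef struct {
    int is_jump;     /* 1 if the instruction is a jump/branch        */
    int dst;         /* destination register number, -1 if none      */
    int src1, src2;  /* source register numbers, -1 if unused        */
} instr_t;

/* Returns 1 if I1 (U pipe) and I2 (V pipe) may issue together. */
int can_pair(const instr_t *i1, const instr_t *i2) {
    if (i1->is_jump || i2->is_jump) return 0;                    /* simplified jump rule       */
    if (i1->dst != -1 &&
        (i1->dst == i2->src1 || i1->dst == i2->src2)) return 0;  /* I1 destination is I2 source */
    if (i1->dst != -1 && i1->dst == i2->dst) return 0;           /* same destination            */
    return 1;
}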

306 Intel P6 Family

● Pentium Pro (1995)

● Pentium II – MMX (multimedia) instruction set

● Pentium III – SIMD (streaming extensions) instructions

● Pentium 4 and Xeon – Intel NetBurst micro-architecture, tuned for multimedia.

307 The P6

● Forms the basis of Pentium Pro, Pentium II and Pentium III: – Besides some specialized instruction set extensions (MMX and SSE), these processors differ in clock rate and cache architecture.

● Dynamically scheduled processor: – Translates each IA-32 instruction to a series of micro-operations (uops). – uops are similar to typical RISC instructions.

● Hardwired control unit

308 Superscalar Processing in P6

● Intel P6: – Five functional units: 2 IUs, separate load and store units, FPU. – 14-stage pipeline, – Since the P6 must execute the CISC-like 80X86 instruction set, instructions are decoded into simpler RISC-like micro-ops. – Out-of-order execution.

● DEC Alpha 21164: – 7-stage pipeline, 2 IUs, and 2 FPUs: one for add/subtract and one for multiply/divide, branch prediction. 309 Superscalar Processing in P6 Microarchitecture

● Up to 3 IA-32 instructions are fetched, decoded, and translated to uops every clock cycle.

● uops are executed by an out-of-order speculative pipeline: – Using register renaming and ROB.

● Processors in the P6 family may be thought of as three independent engines coupled with a single instruction pool.

310 P6 Microarchitecture

[Figure: P6 microarchitecture — instruction fetch (16 bytes/cycle) feeds decode (3 instructions/cycle, up to 6 uops), renaming and issue (3 uops/cycle) into 20 reservation stations serving 5 execution units, and a graduation unit retiring 3 uops/cycle from a 40-entry reorder buffer.]

311 Fetch/Decode Unit of Intel P6 Pipeline

Notes: uops employ a load/store model; decoder 0 is a generalized decoder; decoding needs multiple stages, which leads to the concept of predecoding. 312 Modern Computer Architectures

Lecture 19: A Survey of Some Commercial Processors (cont…)

313 Streaming SIMD Extensions 2 (SSE2) Technology

● SSE2 Extends MMX and SSE technology with the addition of 144 new instructions: – 128-bit SIMD integer arithmetic operations. – 128-bit SIMD double precision floating point operations. – Cache and memory management operations.

● Enhances encryption, video, speech, image and photo processing.

314 PentiumPro (1995)

● Supports predicated instructions.

● Instructions decoded into micro- operations (mops): –mops are register-renamed, –Placed into an out-of-order speculative pool of pending operations. –Executed in dataflow order (when operands are ready).

315 Pentium II/III

● The Pentium II/III processors use P6 microarchitecture: –Three-way superscalar, –Pipelined micro-architecture features a 12-stage superpipeline. –Trades less work per pipe stage for more stages –-- achieving higher clock rate.

316 Pentium® and Pentium II/III Microarchitecture

317 [Figure: Pentium II/III block diagram — the external bus and L2 cache attach through the bus interface unit and memory reorder buffer; the instruction fetch unit (with I-cache) and branch target buffer feed the instruction decode unit and microcode sequencer; the register alias table, reservation station, functional units, reorder buffer and retirement register file form the execution core.]

318 Pentium II/III: The In-Order Section

● Branch prediction: – Two-level scheme. – BTB contains 512 entries:

● Maintains branch history information and the predicted branch target address. – Branch misprediction penalty:

● At least 11 cycles, on an average 15 cycles.

319 Pentium II/III: The In-Order Section Cont…

● A decoder breaks the IA-32 instruction down to mops: –Each comprised of an opcode, two source and one destination operand. –mops are of fixed length.

320 Pentium II/III: The In-Order Section Cont…

● Most instructions are decoded into one-to-four mops.

● More complex instructions are handled as a sequence of mops.

321 Pentium II/III: The In-Order Section Cont…

● Register renaming: – Logical IA-32 based register references are converted into references to physical registers.

● Reservation station unit (RSU, 20 entries).

● Reorder buffer(40 entries ROB)

322 Out-of-Order Execution

● The RSU forms a centralized unit with 20 reservation stations (RS): –Each capable of hosting one mop.

● mops are issued to the FUs according to dataflow constraints and resource availability: –Without regard to the original ordering of the program. 323 The Out-of-Order Execute Section

● Execution is out of order.

● After completion the result goes to two different places, – RSU and ROB.

[Figure: the out-of-order execute section — the reservation station unit issues mops over five ports: port 0 (integer, floating-point and MMX functional units), port 1 (integer, jump and MMX functional units), port 2 (load unit), ports 3 and 4 (store units); results return to the reorder buffer.]

325 The In-Order Retire Section.

● A mop can be retired – if its execution is completed, – if it is its turn in program order, – and if no interrupt, trap, or misprediction occurred.

● Retirement means taking data that was speculatively created and writing it into register file.

● Three mops can be retired per clock cycle. 326 Pentium III

327 NetBurst MicroArchitecture

● Some times referred to as: – P7, Intel 80786, i786.

● Microarchitecture of Pentium 4: – Pentium 3 is based on P6 microarchitecture.

● Both P6 and NetBurst fetch up to 3 IA-32 instructions per cycle: – These are sent to an out-of-order execution engine that can graduate up to 3 uops per cycle.

328 NetBurst Differences over P6

● Uses a deeper pipeline: – 20 stages, compared to 10 in P6.

● 7 integer execution units: – Compared to 5 of P6.

● Branch target buffer is 8 times larger: – Also, uses improved prediction algorithm.

● Execution trace cache. 329 NetBurst MicroArchitecture

● Execution trace cache: – Incorporated in the L2 cache. – Stores decoded micro-operations. – When executing a new instruction, micro-operations can be fetched directly: ● Instead of fetching and decoding the instruction again. ● With the NetBurst architecture: – Intel was expecting to reach speeds of 10 GHz. – Faced increasing problems in keeping power dissipation within limits. – In practice it topped out at about 3.8 GHz.

330 NetBurst-Based Chips

● Celeron D

● Pentium 4 and Pentium 4 Extreme Edition

● Pentium D

● Intel has since replaced NetBurst: – With the Core micro-architecture.

331 NetBurst Micro-Architecture

332 Pentium 4

● Was announced in mid-2000

● Native IA-32 instructions

● NetBurst micro-architecture.

● 20 pipeline stages (integer pipeline).

● Original clock at 1.5 GHz.

● 42 million transistors.

333 Advanced Dynamic Execution

● Very deep, out-of-order, speculative execution engine –Up to 126 instructions in flight (3 times larger than the Pentium III processor). –Up to 48 loads and 24 stores in pipeline (2 times larger than the Pentium III processor).

334 Branch Prediction

● 4K entry branch target array: – 8 times larger than the Pentium III processor.

● New prediction algorithm (not specified): – Reduces mispredictions compared to P6 by about one third.

335 Second Level Cache

● Included on the die

● size: 256 kB

● Unified 8-way associative

● 256-bit data bus to the level 2 cache

● Delivers ~45 GB/s data throughput (at 1.4 GHz processor frequency) – Bandwidth and performance increases with processor frequency

336 Pentium 4

337 DEC Alpha 21264

● Superscalar of degree 4.

● Out of order execution with renaming.

● Up to 80 instructions in process simultaneously.

338 21264 Block Diagram

● 4 Integer ALUs

– Each can perform simple instructions – 2 handle address calculations

Microprocessor Report 10/28/96 339 21264 Pipeline

● Very Deep Pipeline

– Can’t do much in 2ns clock cycle! – 7 cycles for even simple instructions. – 9 cycles for load or store. – 7 cycle penalty for mispredicted branch.

Microprocessor Report 10/28/96

340 21264 Branch Prediction Logic

– Purpose: Predict whether or not a branch is taken – 35Kb of prediction information – 2% of total die size – Claim 0.7--1.0% misprediction 341 EPIC, IA-64, and Itanium

● EPIC: Explicit Parallel Instruction Computing, an architecture framework proposed by HP.

● IA-64: An architecture that HP and Intel developed under the EPIC framework.

● Itanium: The first commercial processor that implements IA-64 architecture; – Now Itanium 2.

342 EPIC Main Ideas

● Compiler does the scheduling.

● Hardware supports speculation: – Addressing the branch hazard problem: Branch prediction. – Addressing the memory problem: Prefetching.

343 IA-64 Micro- Architecture ● 128 64-bit integer registers + 128 82-bit floating point registers.

● Hardware checks dependencies .

● Nearly all instructions can be predicated.

● 128 bit Bundle: 5-bit template + 3 41-bit instructions.

344 IA-64 Instructions

● The 5-bit template field specifies: – The types of execution units each instruction in the bundle requires.

● Nearly every instruction of IA-64 can be predicated: – The lower order 6 bits of every instruction specifies the predicate register that guards the instruction.

345 Itanium: IA-64 Implementation by Intel

● Itanium™ (2001): –First implementation of the IA-64 Architecture. –6 issues per clock, 10-stage pipeline at 800 MHz on a 0.18 µm process. - Two bundles can be issued together in Itanium.

346 Itanium Functional Units

● 9 functional units: – 2 Integer, 2 Memory, 3 Branch, and 2 floating point,

● All functional units are pipelined.

● 10 stage pipeline:

● Divided into 4 main parts.

347 Itanium Pipeline

● Front-end: – Prefetches up to 32 bytes per clock. – Can handle up to 8 bundles, 24 instructions

● Instruction Delivery: – Distributes upto 6 instructions to 9 functional units. – Implements register renaming.

348 Itanium Pipeline

● Operand Delivery: – Accesses register file, updates scoreboard. – The scoreboard is used to detect when individual instructions can proceed. – This is the way to avoid lock step operations of the instructions in a bundle.

● Execution

349 Itanium™ Processor Silicon (Copyright: Intel)

[Die photo: the Itanium processor — IA-32 control, FPU, IA-64 control, TLB, integer units, instruction fetch & decode, caches and bus interface on the core processor die (25M transistors), plus 4 x 1 MB of L3 cache (4 x 75M transistors).]

350 Circuit View

351 Comments on Itanium

● Performance of the 800 MHz Itanium vs. a 1 GHz 21264 and a 2 GHz P4: – SPECint: 85% of the 21264, 60% of the P4. – SPECfp: 108% of the P4, 120% of the 21264. – Power consumption: 178% of the P4 (watts per FP operation).

● Surprising that an approach whose goal is to rely on compiler technology and simpler hardware: – Ends up at least as complex as dynamically scheduled processors! 352 A Commercial Superscalar Processor

● PowerPC: – Eleven pipelined functional units:

● 4 IUs, an FPU with a separate floating point register file,

● It is capable of executing sixteen instructions simultaneously.

353 AMD Athlon

● Advanced Micro Devices have carved themselves a niche in the Intel Architecture market: – with their line of instruction set compatible processors.

● The latest AMD offering is the Athlon family of processors

354 Athlon

● The micro-architecture of the Athlon family is of considerable interest, – In many respects more powerful than Pentium core.

● Athlon uses three integer units, three floating point units and three address calculation units: – For a total of nine execution units.

355 Athlon

● Ability to issue 9 operations concurrently: – Three integer, three address, and three floating point.

● A 10 stage integer and 15 stage floating pipeline are used.

● The floating point execution units can perform Intel SIMD MMX instructions as well as AMD 3DNow! instructions.

356 Athlon

● Like Intel's Pentium family, the AMD Athlons use a "RISC-like" core, – Intel CISC instructions are decoded by a three way Instruction Decoder into fixed length "MacroOPs", – Fed into the Instruction Control Unit, which has a 72 entry Reorder Buffer.

● Branch prediction is performed using a two- way 2048 entry branch prediction table: – A branch target address table and return address stack.

357 ARM

● ARM Ltd. (Formerly Advanced RISC Machines)

● Licenses its design to vendors:

– IBM, Intel, Philips, Samsung, TI, etc.

● 32-bit processor architecture:

– Widely used in embedded systems:

● Mobile phones, PDAs, Calculators, Routers, media players, etc. – Low power consumption is one of the critical goals.

358 ARM Architecture

● Load/Store

● 16 32-bit registers

● Predicated execution of most of the instructions.

359 Summary

● Pipeline Hazards – Data – Control – Structural

● Resolution of structural hazards – Stalling – Provide additional resources

360 Summary Cont…

● Resolution of Data hazards: – Stalling – Forwarding – Dynamic scheduling

● Resolution of Control hazards: – Stalling – Prediction schemes

● Dynamic scheduling can boost performance in presence of hazards.

361 Summary Cont…

● The key idea of Tomasulo’s scheme is the use of reservation stations

– implicit register renaming to larger set of registers + buffering source operands

– Avoids WAR, WAW hazards of Scoreboard

– Allows loop unrolling in HW

– Instructions not limited to basic blocks

362 Summary Cont…

● In addition to hardware instruction scheduling, software approaches can help

– Static loop unrolling

– Basic block transformations

– Software pipelining

● Modern processors are superscalar and dynamically scheduled.

363 References

J.L. Hennessy and D.A. Patterson, “Computer Architecture: A Quantitative Approach”, Morgan Kaufmann Publishers, 3rd Edition, 2003.
S. Muchnik, “Optimizing Compilers for Modern Architectures”, Morgan Kaufmann Publishers, 2nd Edition.

364 Modern Computer Architectures

Lecture 20: Memory System Basics

Prof. Sandeep Panda, Koustuv Group of Institutions

1 Modern Computer Architectures

Module-3: Memory Hierarchy Design and Optimizations

Prof. Sandeep Panda, Koustuv Group of Institutions

2 Introduction

● Even a sophisticated processor may perform well below an ordinary processor:

– Unless supported by matching performance by the memory system.

– Unfortunately the gap is widening.

● The focus of this module:

– Study how memory system performance has been enhanced through various

innovations and optimizations. 3 Widening Performance Gap

[Chart: processor vs. DRAM performance, 1980-2000 — processor performance grows about 60% per year (“Moore’s Law”, 2x every 1.5 years) while DRAM performance grows about 9% per year (2x every 10 years), so the processor-DRAM performance gap grows roughly 50% per year.]

4 An Unbalanced System

Source: Bob Colwell keynote ISCA’29 2002 5 Levels of the Memory

[Figure: levels of the memory hierarchy, from the fast/small upper level to the slow/large lower level —
Registers: 100s of bytes, <10 ns; managed by the compiler; operands of 1-8 bytes.
Cache: kilobytes, 10-100 ns, 1-0.1 cents/bit; managed by the cache controller; cache lines of 8-128 bytes (this lecture).
Main memory: megabytes, 200-500 ns, 0.0001-0.00001 cents/bit; managed by the operating system; pages of 512 bytes-4 KB.
Disk: gigabytes, ~10 ms (10,000,000 ns), 10^-5-10^-6 cents/bit; user files of megabytes.
Tape: effectively infinite capacity, seconds to minutes, ~10^-8 cents/bit.]

6 Model of Memory Hierarchy

[Figure: registers → split L1 (instruction and data) caches and L2 cache (SRAM) → main memory (DRAM) → file data on disk.]

7 What is the Role of a Cache?

● A small, fast storage used to improve average access time to a slow memory.

● Improves memory system performance: – Exploits spatial and temporal locality.

8 Processor-Memory Performance Gap Processor % Area %Transistors (~cost) (~power)

● Alpha 21164 37% 77%

● StrongArm SA110 61% 94%

● Pentium Pro 64% 88%

– 2 dies per package: Proc/I Cache/D Cache + L2 Cache

● Caches have no “inherent value”, only try to close the performance gap. 9 Case Study:Intel Core2 Duo

[Figure: Intel Core 2 Duo — each core (Core0, Core1) has a private L1 cache: 32 KB, 8-way, 64-byte lines, LRU, write-back, 3-cycle latency; both cores share an L2 cache: 4.0 MB, 16-way, 64-byte lines, LRU, write-back, 14-cycle latency.]

Source: http://www.sandpile.org 10 Case study: Intel Itanium 2

[Die photos: Intel Itanium 2 — the 3 MB version on a 180 nm process measures 421 mm²; the 6 MB version on a 130 nm process measures 374 mm².]

11 Memory Issues

● Latency – Time to move through the longest circuit path (from the start of request to the response)

● Bandwidth – Number of bits transported at one time

● Capacity – Size of memory

● Energy

– Cost of accessing memory (to read and write)12 Four Basic Questions

● Q1: Where can a block be placed in the cache? (Block placement) – Fully Associative, Set Associative, Direct Mapped

● Q2: How is a block found if it is in the cache? (Block identification) – Tag/Block

● Q3: Which block should be replaced on a miss? (Block replacement) – Random, LRU

● Q4: What happens on a write? (Write strategy) 13 – Write Back or Write Through (with Write Buffer) Block Placement

● If a block has only one possible place in the cache: direct mapped

● If a block can be placed anywhere: fully associative

● If a block can be placed in a restricted subset of the possible places: set associative – If there are n blocks in each subset: n- way set associative

● Note that direct-mapped = 1-way set

associative 14 Trade-offs ● n-way set associative becomes increasingly difficult (costly) to implement for large n – Most caches today are either 1-way (direct mapped), 2-way, or 4-way set associative

● The larger n the lower the likelihood of thrashing – e.g., two blocks competing for the same block frame and being accessed in sequence over and over

15 Types of Caches

Type of cache            Mapping of data from memory to cache                              Complexity of searching the cache
Direct mapped (DM)       A memory value can be placed at a single corresponding location   Fast indexing mechanism
Set-associative (SA)     A memory value can be placed in any of a set of locations         Slightly more involved search mechanism
Fully-associative (FA)   A memory value can be placed in any location in the cache         Extensive hardware resources required to search (CAM)

Note: DM and FA can be thought of as special cases of SA — DM is 1-way SA, FA is all-way SA.
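A minimal sketch of how the candidate set is computed under each mapping (assuming a cache of num_frames block frames organized as n-way set-associative; assoc = 1 gives direct mapped, assoc = num_frames gives fully associative):

/* A block with block address block_addr may be placed in any of the
 * 'assoc' frames of the set returned here. */
unsigned set_of_block(unsigned block_addr, unsigned num_frames, unsigned assoc) {
    unsigned num_sets = num_frames / assoc;   /* 1 set for fully associative */
    return block_addr % num_sets;
}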

16 Direct Mapping

Tag Index Data

00000 0 0x55 0

00000 1 0x0F 1 00001 0

Direct mapping: A memory value can only be placed at a single corresponding location 11111 0 0xAA in the cache 11111 1 0xF0

17 Set Associative Mapping (2-Way)

Tag Index Data Way 0 Way 1

0000 0 0 0x55 0

0000 1 1 0x0F 1 0

Set-associative mapping: A memory value can be placed in any location of a set in the cache 1111 0 0 0xAA 1111 1 1 0xF0

18 Fully Associative Mapping

Tag Data

0000000000 0x55

0000010000 0x0F 000110

Fully-associative mapping: A memory value can be placed 1111101111 0xAA anywhere in the cache 1111111111 0xF0

19 Direct Mapped Cache

[Figure: a direct-mapped cache with 4 cache lines (indices 0-3) and a 16-entry memory (addresses 0-F); each memory address maps to cache index (address mod 4).]

● Cache location 0 is occupied by data from: – Memory locations 0, 4, 8, and C.
● Which one should we place in the cache?
● How can we tell which one is in the cache?

20 Block Identification cont…

● Given an address, how do we find where it goes in the cache?

● This is done by first breaking down an address into three parts

Index of the set tag used for offset of the address in identifying a match the cache block

tag set index block offset

(The tag and set index together form the block address.)

21 Block Identification cont… ● Consider the following system: – Addresses are 64 bits ● Memory is byte-addressable – Block frame size is 2^6 = 64 bytes – Cache is 64 MByte (2^26 bytes) ● Consists of 2^20 block frames – Direct-mapped ● For each cache block brought in from memory, there is a single possible frame among the 2^20 available ● A 64-bit address can be decomposed as follows: a 58-bit block address (38-bit tag + 20-bit set index) followed by a 6-bit block offset
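A small C sketch of this decomposition and the direct-mapped lookup check, with the parameters hard-coded to the 6-bit offset / 20-bit index / 38-bit tag split used above (illustrative only; the data payload is omitted):

#include <stdint.h>

#define OFFSET_BITS 6                       /* 64-byte blocks    */
#define INDEX_BITS  20                      /* 2^20 block frames */
#define NUM_SETS    (1u << INDEX_BITS)

typedef struct { int valid; uint64_t tag; /* plus 64 bytes of data */ } frame_t;

static frame_t cache[NUM_SETS];

/* Returns 1 on a hit, 0 on a miss, for a direct-mapped lookup. */
int dm_lookup(uint64_t addr) {
    uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    (void)offset;                           /* offset selects the byte on a hit */
    return cache[index].valid && cache[index].tag == tag;
}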

22 Block Identification cont… set index tag cache block 20 bits 38 bits 26 bits 0...00 0...01 0...10 All addresses with 0...11 similar 20 set index bits “compete” for a single . . . block frame block frames

20 1...01 2 1...10 1...11

23 Block Identification

cont…

Address from CPU @ set index tag cache block 20 bits 58 bits 26 bits 0...00 ? 0...01 Find the set: 0...10 0...11 . . . block frames

20 1...01 2 1...10 1...11

24 Block Identification

Address from CPU @ set index tag cache block 20 bits 58 bits 26 bits 0...00 0...01 Find the set: 0...10 ? 0...11 Compare the tag: . . . block frames

20 1...01 2 If no match: miss 1...10 1...11 If match: hit and access the byte at the desired offset

25 Set Associative Cache (2- way)

● Cache index selects a “set” from the cache

● The two tags in the set are compared in parallel.
● Data is selected based on the tag result.
– Note: the additional circuitry, compared to DM caches, makes SA caches slower to access than a DM cache of comparable size.

[Figure: the cache index selects one cache line from each way; each entry holds a valid bit, a cache tag and the cache data.]

Adr Tag Compare Sel1 1 Mux 0 Sel0 Compare

OR Cache Line 26 Hit Set-Associative Cache (2- way) ● 32 bit address Tag Index offset

● lw from 0x77FF1C78

Tag array0 Data aray0 Data array1 Tag array1

27 Fully Associative Cache

tag offset

Tag Data = = =

Associative Search

=

Multiplexor

Rotate and Mask

28 Cache Write Policies

● Write-through: Information is written to both the block in the cache and that in memory

● Write-back: Information is written back to memory only when a block frame is replaced: – Uses a “dirty” bit to indicate whether a block was actually written to, – Saves unnecessary writes to memory when a block is “clean”
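A minimal sketch contrasting the two policies on a write hit, assuming a cache line structure with valid and dirty bits (the helper names are illustrative):

typedef struct { int valid, dirty; unsigned tag; unsigned char data[64]; } line_t;

/* Stub standing in for the bus transaction that copies a block to main memory. */
static void memory_write_block(const line_t *ln) { (void)ln; }

void write_hit_write_through(line_t *ln, int offset, unsigned char byte) {
    ln->data[offset] = byte;     /* update the cache ...            */
    memory_write_block(ln);      /* ... and memory, on every write  */
}

void write_hit_write_back(line_t *ln, int offset, unsigned char byte) {
    ln->data[offset] = byte;     /* update only the cache           */
    ln->dirty = 1;               /* memory is updated later, when   */
}                                /* this block is eventually replaced */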

29 Trade-offs

● Write back – Faster because writes occur at the speed of the cache, not the memory. – Faster because multiple writes to the same block are written back to memory only once, which uses less memory bandwidth.

● Write through – Easier to implement 30 Write Allocate, No-write Allocate

● What happens on a write miss? – On a read miss, a block has to be brought in from a lower level memory

● Two options: – Write allocate: a block is allocated in the cache. – No-write allocate: no block is allocated; the data is just written to main memory.

31 Write Allocate, No-write Allocate cont…

● In no-write allocate, – Only blocks that are read from can be in cache. – Write-only blocks are never in cache.

● But typically: – write-allocate used with write-back – no-write allocate used with write-through

● Why does this make sense?

32 Write-Through Policy

0x5678 0x12340x1234 0x1234

Processor Cache

Memory

33 Write Buffer

Cache Processor DRAM

Write Buffer

– Processor: writes data into the cache and the write buffer. – Memory controller: writes the contents of the buffer to memory.

● Write buffer is a FIFO structure:

– Typically 4 to 8 entries – Desirable: Occurrence of Writes << DRAM write cycles

● Memory system designer’s nightmare:

– Write buffer saturation (i.e., Writes  DRAM write cycles)

34 Writeback Policy

0x12340x1234 0x9ABC0x5678 0x56780x1234

Processor Cache

Memory

35 Memory System Performance ● Memory system performance is largely captured by three parameters,

– Latency, Bandwidth, Average memory access time (AMAT).

● Latency:

– The time it takes from the issue of a memory request to the time the data is available at the processor.

● Bandwidth:

– The rate at which data can be pumped to the processor by the memory system. 36 Average Memory Access Time (AMAT) • AMAT: The average time it takes for the processor to get a data item it requests. • The time it takes to get requested data to the processor can vary: – due to the memory hierarchy. • AMAT can be expressed as: AMAT  Cache hit time  Miss rate  Miss penalty

37 Modern Computer Architectures

Lecture 21: Memory System Basics

Prof. Sandeep Panda, Koustuv Group of Institutions

38 Cache Performance Parameters

● Performance of a cache is largely determined by: – Cache miss rate: number of cache misses divided by number of accesses. – Cache hit time: the time between sending address and data returning from cache. – Cache miss penalty: the extra processor stall cycles caused by access to the next-level cache.

39 Impact of Memory System on Processor Performance

CPU Performance with Memory Stall= CPI without stall + Memory Stall CPI Memory Stall CPI = Miss per inst × miss penalty = % Memory Access/Instr × Miss rate × Miss Penalty Example: Assume 20% memory acc/instruction, 2% miss rate, 400-cycle miss penalty. How much is memory stall CPI? Memory Stall CPI= 0.2*0.02*400=1.6 cycles

40 CPU Performance with Memory Stall

CPU Performance with Memory Stall= CPI without stall + Memory Stall CPI

CPU time IC  CPIexecution  CPImem_stall   Cycle Time

CPImem_stall = Miss per inst × miss penalty CPImem_stall  Memory Inst Frequency  Miss Rate  Miss Penalty
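The same formulas as a small C helper, checked against the numbers in the example above (20% memory accesses per instruction, 2% miss rate, 400-cycle miss penalty):

#include <stdio.h>

/* CPI contribution of memory stalls. */
double cpi_mem_stall(double mem_acc_per_instr, double miss_rate, double miss_penalty) {
    return mem_acc_per_instr * miss_rate * miss_penalty;
}

/* CPU time = IC x (CPI_execution + CPI_mem_stall) x cycle time. */
double cpu_time(double ic, double cpi_exec, double cpi_stall, double cycle_time) {
    return ic * (cpi_exec + cpi_stall) * cycle_time;
}

int main(void) {
    /* 0.2 accesses/instruction x 0.02 miss rate x 400 cycles = 1.6 stall cycles. */
    printf("memory-stall CPI = %.2f\n", cpi_mem_stall(0.2, 0.02, 400.0));
    return 0;
}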

41 Performance Example 1

● Suppose:

– Clock Rate = 200 MHz (5 ns per cycle), Ideal (no misses) CPI = 1.1

– 50% arith/logic, 30% load/store, 20% control

– 10% of data memory operations get 50 cycles miss penalty

– 1% of instruction memory operations also get 50 cycles miss penalty

● Compute AMAT.

42 Performance Example 1 cont… ● CPI = ideal CPI + average stalls per instruction = 1.1(cycles/ins) + [ 0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycle/miss)] +[ 1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycle/miss)] = (1.1 + 1.5 + .5) cycle/ins = 3.1

● AMAT=(1/1.3)x[1+0.01x50]+(0.3/1.3)x[1+0.1x5 0]=2.54

43 Example 2

● Assume 20% Load/Store instructions

● Assume CPI without memory stalls is 1

● Cache hit time = 1 cycle

● Cache miss penalty = 100 cycles

● Miss rate = 1%

● What is: – stall cycles per instruction? – average memory access time? – CPI with and without cache

44 Example 2: Answer

● Average memory accesses per instruction = 1.2

● AMAT = 1 + 0.01*100 = 2 cycles

● Stall cycles = 1.2 cycles

● CPI with cache = 1+1.2=2.2

● CPI without cache=1+1.2*100=121

45 Example 3

● Which has a lower miss rate? – A split cache (16KB instruction cache +16KB Data cache) or a 32 KB unified cache?

● Compute the respective AMAT also.

● 40% Load/Store instructions

● Hit time = 1 cycle

● Miss penalty = 100 cycles

● Simulator showed: – 40 misses per thousand instructions for data cache – 4 misses per thousand instr for instruction cache – 44 misses per thousand instr for unified cache 46

Example 3: Answer

● Miss rate = (misses/instructions)/(mem accesses/instruction)

● Instruction cache miss rate= 4/1000=0.004

● Data cache miss rate = (40/1000)/0.4 =0.1

● Unified cache miss rate = (44/1000)/1.4 =0.04

● Overall miss rate for split cache = 0.3*0.1+0.7*0.004=0.0303

47 Example 3: Answer cont…

● AMAT (split cache)= 0.7*(1+0.004*100)+0.3(1+0.1*100)=4.3

● AMAT (Unified)= 0.7(1+0.04*100)+0.3(1+1+0.04*100)=4.5

48 Cache Performance for Out of Order Processors

● Very difficult to define miss penalty to fit in out of order processing model: – Processor sees much reduced AMAT – Overlapping between computation and memory accesses

● We may assume a certain percentage of overlapping – In practice, the degree of overlapping varies significantly.

49 Unified vs Split Caches

● A Load or Store instruction requires two memory accesses: – One for the instruction itself – One for the data

● Therefore, unified cache causes a structural hazard!

● Modern processors use separate data and instruction L1 caches: – As opposed to “unified” or “mixed” caches ● The CPU sends simultaneously: – Instruction and data address to the two ports .

● Both caches can be configured differently – Size, associativity, etc. 50 Modern Computer Architectures

Lecture 22: Memory Hierarchy Optimizations

Prof. Sandeep Panda, Koustuv Group of Institutions 51

Unified vs Split Caches

● Separate Instruction and Data caches: – Avoids structural hazard – Also each cache can be tailored specific to need.

Processor Processor

I-Cache-1 D-Cache-1 Unified Cache-1 Unified Cache-2 Unified Cache-2

Unified Cache Split Cache

52 Example 4 – Assume 16KB Instruction and Data Cache: – Inst miss rate=0.64%, Data miss rate=6.47% – 32KB unified: Aggregate miss rate=1.99% – Assume 33% data ops ⇒ 75% of accesses are from instructions (1.0/1.33) – hit time=1, miss penalty=50 – Data hit has 1 additional stall for unified cache (why?) – Which is better (ignore L2 cache)?

AMATSplit =75%x(1+0.64%x50)+25%x(1+6.47%x50) = 2.05

AMATUnified=75%x(1+1.99%x50)+25%x(1+1+1.99%x50)= 2.24

53 Alpha 21264 Data Cache

● Let us understand the operation of the Alpha 21264 data cache.

● The Alpha processor presents a 44-bit address to the cache: – 29 tag bits – 9 index bits – 6 offset bits (29 + 9 + 6 = 44)

● Cache is 2:1 set associative

54 Case Study: The Alpha 21264 Cache

What happens on a read?

55 The Alpha 21264 Cache

Step 1: CPU generates 44-bit address The address is split into 29-bit tag 9-bit set index (29 = 512 sets) 6-bit block offset (26 = 64 bytes blocks)

56 The Alpha 21264 Cache

Step 2: The “right” set is selected using the index bits.

57 The Alpha 21264 Cache

Step 3: The tag is compared to both tags in the set; If a match AND valid=1: then a hit (If not: then a miss)

58 The Alpha 21264 Cache

Step 4: If a match, select the matching block and return the byte at the right offset The election of the matching block is done via a 2:1

59 Example 5

● What is the impact of 2 different cache organizations on the performance of CPU?

● Clock cycle time 1nsec

● 50% load/store instructions

● Size of both caches 64KB: – Both caches have a block size of 64 bytes – one is direct mapped, the other is 2-way set associative.

● Cache miss penalty=75 ns for both caches

● Miss rate DM= 1.4% Miss rate SA=1%

● CPU cycle time must be stretched 25% to accommodate the multiplexor for the SA 60 Example 5: Solution

● AMAT DM= 1+(0.014*75)=2.05nsec

● AMAT SA=1*1.25+(0.01*75)=2ns

● CPU Time= IC*(CPI+(Misses/Instr)*Miss Penalty)* Clock cycle time

● CPU Time DM= IC*(2*1.0+(1.5*0.014*75)=3.58*IC

● CPU Time SA= IC*(2*1.25+(1.5*0.01*75)=3.63*IC

61 How to Improve Cache Performance?

AMAT = Hit time + Miss rate × Miss penalty

1. Reduce miss rate.
2. Reduce miss penalty.

3. Reduce miss penalty or miss rates via parallelism.

4. Reduce hit time.

62 Reducing Miss Rates

● Techniques: – Larger block size – Larger cache size – Higher associativity – Way prediction – Pseudo-associativity – Compiler optimization

63 Reducing Miss Penalty

● Techniques: – Multilevel caches – Victim caches – Read miss first – Critical word first

64 Reducing Miss Penalty or Miss Rates via Parallelism

● Techniques: –Non-blocking caches –Hardware prefetching –Compiler prefetching

65 Reducing Cache Hit Time ● Techniques: –Small and simple caches –Avoiding address translation –Pipelined cache access –Trace caches

66 Modern Computer Architectures Lecture 23: Cache Optimizations

Prof. Sandeep Panda, Koustuv Group of Institutions

67 Reducing Miss Penalty

● Techniques: – Multilevel caches – Victim caches – Read miss first – Critical word first – Non-blocking caches

68 Reducing Miss Penalty(1): Multi-Level Cache

● Add a second-level cache.

● L2 Equations:

AMAT = Hit time_L1 + Miss rate_L1 × Miss penalty_L1

Miss penalty_L1 = Hit time_L2 + Miss rate_L2 × Miss penalty_L2

AMAT = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
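The two-level AMAT formula as a small C helper (a sketch; plug in whatever example values you like):

/* AMAT with an L2 cache: the L1 miss penalty is itself an AMAT into L2. */
double amat_two_level(double hit_l1, double miss_rate_l1,
                      double hit_l2, double miss_rate_l2,
                      double miss_penalty_l2) {
    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;
    return hit_l1 + miss_rate_l1 * miss_penalty_l1;
}

/* Example 6 below: amat_two_level(1, 0.04, 10, 0.5, 100) = 1 + 0.04*(10 + 50) = 3.4 */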

69 Multi-Level Cache: Some Definitions

● Local miss rate— misses in this cache divided by the total number of memory accesses to this cache (Miss rateL2)

● Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU

– For L2, this equals Local miss rate_L1 × Local miss rate_L2

● L1 Global miss rate = L1 Local miss rate

70 Global vs. Local Miss Rates ● At lower level caches (L2 or L3), global miss rates provide more useful information: – Indicate how effective is cache in reducing AMAT. – Who cares if the miss rate of L3 is 50% as long as only 1% of processor memory accesses ever benefit from it?

71 Performance Improvement Due to L2 Cache: Example 6 Assume: • For 1000 instructions: – 40 misses in L1, – 20 misses in L2 • L1 hit time: 1 cycle, • L2 hit time: 10 cycles, • L2 miss penalty=100 • 1.5 memory references per instruction • Assume ideal CPI=1.0 Find: Local miss rate, AMAT, stall cycles per instruction, and those without L2 cache. 72 Example 6: Solution

● With L2 cache: – Local miss rate = 50% – AMAT=1+4%X(10+50%X100)=3.4 – Average Memory Stalls per Instruction = (3.4-1.0)x1.5=3.6

● Without L2 cache: – AMAT=1+4%X100=5 – Average Memory Stalls per Inst=(5-1.0)x1.5=6

● Perf. Improv. with L2 =(6+1)/(3.6+1)=52%

73 Multilevel Cache

● The speed (hit time) of L1 cache affects the clock rate of CPU: – Speed of L2 cache only affects miss penalty of L1.

● Inclusion Policy: – Many designers keep L1 and L2 block sizes the same. – Otherwise on a L2 miss, several L1 blocks may have to be invalidated.

● Multilevel Exclusion: – L1 data never found in L2. – AMD Athlon follows exclusion policy . 74 Reducing Miss Penalty (2): Victim Cache

● How to combine fast hit time of direct mapped cache: – yet still avoid conflict misses?

● Add a fully associative buffer (victim cache) to keep data discarded from cache.

● Jouppi [1990]: – A 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache.

● Used in Alpha, HP machines.

● AMD uses 8-entry victim buffer. 75 Reducing Miss Penalty (3): Read Priority over Write on Miss ● In a write-back scheme: – Normally a dirty block is stored in a write buffer temporarily. – Usual:

● Write all blocks from the write buffer to memory, and then do the read. – Instead:

● Check write buffer first, if not found, then initiate read.

● CPU stall cycles would be less. 76 Reducing Miss Penalty (3): Read Priority over Write on Miss

● A write buffer with a write through: – Allows cache writes to occur at the speed of the cache.

● Write buffer however complicates memory access: – They may hold the updated value of a location needed on a read miss.

77 Reducing Miss Penalty (3): Read Priority over Write on Miss

● Write-through with write buffers: –Read priority over write: Check write buffer contents before read; if no conflicts, let the memory access continue. –Write priority over read: Waiting for write buffer to first empty, can increase read miss penalty. 78 Reducing Miss Penalty (4): Early Restart and Critical Word First

● Simple idea: Don’t wait for full block to be loaded before restarting CPU --- CPU needs only 1 word: – Early restart —As soon as the requested word of the block arrives, send it to the CPU. – Critical Word First —Request the missed word first from memory and send it to the CPU as soon as it arrives; Requested word ● Generally useful for large blocks.

block 79 Example 7

● AMD Athlon has 64-byte cache blocks.

● L2 cache takes 11 cycles to get the critical 8 bytes.

● To fetch the rest of the block: – 2 clock cycles per 8 bytes.

● AMD Athlon can issue 2 loads per cycle.

● Compute access time for 8 successive data accesses.

80 Solution

● 11+(8-1)*2=25 clock cycles for the CPU to read a full cache block.

● Without critical word first it would take 25 cycles to get the full block.

● After the first 8 bytes are delivered, it would take 7/2=4 clock cycles. – Total = 25+4 cycles=29 cycles.

81 Modern Computer Architectures Lecture 24: More Cache Optimizations

Prof. Sandeep Panda, Koustuv Group of Institutions

82 Reduce Miss Penalty (5): Non- blocking Caches

● Non-blocking cache: –Allow data cache to continue to serve other requests during a miss. –Meaningful only with out-of-order execution processor. –Requires multi-bank memories. –Pentium Pro allows 4 outstanding memory misses. 83 Non-blocking Caches to Reduce Stalls on Misses

● Hit under miss reduces the effective miss penalty by working during miss.

● “Hit under multiple miss” or “miss under miss” may further lower the effective miss penalty: – By overlapping multiple misses. – Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses. – Requires multiple memory banks. 84 Non-Blocking Caches

● Multiple memory controllers: – Allow memory banks to operate almost independently. – Each bank needs separate address lines and data lines.

● Input device may be served by one controller: – At the same time, other controllers may serve cache read/write requests.

85 Fast Hit Time(1): Simple L1 Cache

● Small and simple (direct mapped) caches have lower hit time (why?). ● This is possibly the reason why all modern processors have a direct mapped and small L1 cache: – In the Pentium 4, the L1 cache size was reduced from 16KB to 8KB in later versions. ● The L2 cache can be larger and set-associative. 86 Hit Time Reduction (2): Simultaneous Tag Comparison and Data Reading • After indexing: – The tag can be compared and, at the same time, the block can be fetched.

– If it’s a miss --- Tag and Comparator One Cache line of Data

then no harm Tag and Comparator One Cache line of Data done, miss must To Next Lower Level In be dealt with. Hierarchy

87 Example: 1KB DM Cache, 32-byte Lines
● The lowest M bits are the Offset (Line Size = 2^M).
● Index = log2(# of sets).
● [Figure: the 32-bit address is split into Tag (bits 31-10), Index (bits 9-5, e.g. 0x01) and Offset (bits 4-0, e.g. 0x00); the cache has 32 sets, each holding a valid bit, a cache tag and 32 bytes of data (Byte 0-Byte 31 in set 0, Byte 32-Byte 63 in set 1, ..., Byte 992-Byte 1023 in set 31).]
88 Example: 1KB DM Cache, 32-byte Lines
● lw from 0x77FF1C68:
– 0x77FF1C68 = 0111 0111 1111 1111 0001 1100 0110 1000 (Tag | Index | Offset)
● [Figure: the Index field selects one entry of the tag array and the corresponding line of the data array of the DM cache.]
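A small sketch (mine, not part of the slides) of how the tag, index and offset of this 1KB direct-mapped, 32-byte-line cache can be extracted from the 32-bit address used above; the constants simply mirror the field widths in the figure.

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 5u   /* 32-byte line        => 5 offset bits */
#define INDEX_BITS  5u   /* 1KB / 32B = 32 sets => 5 index bits  */

int main(void) {
    uint32_t addr   = 0x77FF1C68u;   /* the lw address from slide 88 */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1u);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag = 0x%X, index = %u, offset = %u\n", tag, index, offset);
    return 0;
}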

89 DM Cache Speed Advantage

● Tag and data access happen in parallel

– Faster cache access!
● [Figure: Tag, Index and Offset drive the tag array and the data array simultaneously.]

90 Reducing Hit Time (3): Write Strategy ● There are many more reads than writes – All instructions must be read! – Consider: 37% load instructions, 10% store instructions – The fraction of memory accesses that are writes is: .10 / (1.0 + .37 + .10) ~ 7% – The fraction of data memory accesses that are writes is: .10 / (.37 + .10) ~ 21%

● Remember the fundamental principle: make the common cases fast. 91 Write Strategy

● Cache designers have spent most of their efforts on making reads fast: – not so much writes.

● But, if writes are extremely slow, then Amdahl’s law tells us that overall performance will be poor. – Writes also need to be made faster.

92 Making Writes Fast

● Several strategies exist for making reads fast: – Simultaneous tag comparison and block reading, etc.

● Unfortunately making writes fast cannot be done the same way. – Tag comparison cannot be simultaneous with block writes: – One must be sure one doesn't overwrite

a block frame that isn’t a hit! 93 Write Policy 1: Write-Through vs Write-Back

● Write-through: all writes update cache and underlying memory/cache.

– Can always discard cached data - the most up-to-date data is in memory. – Cache control bit: only a valid bit.

● Write-back: all writes only update cache

– Memory write during block replacement. – Cache control bits: both valid and dirty bits.

94 Write-Through vs Write-Back

● Relative Advantages: – Write-through:

● Memory always has latest data. ● Simpler management of cache. – Write-back:

● Requires much lower bandwidth, since data is often overwritten multiple times. ● Better tolerance to long-latency memory. 95 Write Policy 2: Write Allocate vs Non-Allocate

● Write allocate: allocate new block on a miss: –Usually means that you have to do a “read miss” to fill in rest of the cache-line!

● Write non-allocate (or "write-around"): – Simply send write data through to the underlying memory/cache - don't allocate a new cache line!
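The toy sketch below (mine, not the lecture's) combines the two write policies and the allocate decision for a single cache line; a real controller would manage a full tag/index structure, but the policy choices appear in the same places.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

struct line { bool valid, dirty; uint32_t tag, data; };

static void memory_write(uint32_t addr, uint32_t data) {
    printf("  memory[0x%X] = %u\n", addr, data);
}

/* One cache line; write_through and write_allocate select the policy. */
void cache_write(struct line *l, uint32_t addr, uint32_t data,
                 bool write_through, bool write_allocate) {
    bool hit = l->valid && l->tag == addr;          /* toy: tag = whole address */
    if (!hit && !write_allocate) {                  /* write-around: bypass cache */
        memory_write(addr, data);
        return;
    }
    if (!hit) {                                     /* write-allocate: install block */
        if (l->valid && l->dirty) memory_write(l->tag, l->data);  /* write back victim */
        l->valid = true; l->tag = addr; l->dirty = false;
    }
    l->data = data;                                 /* update the cached copy */
    if (write_through) memory_write(addr, data);    /* memory is always current */
    else               l->dirty = true;             /* write-back: defer to eviction */
}

int main(void) {
    struct line l = {0};
    cache_write(&l, 0x10u, 1u, true,  false);       /* write-through, no-allocate */
    cache_write(&l, 0x20u, 2u, false, true);        /* write-back, write-allocate */
    return 0;
}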

96 Reducing Hit Time (4): Way Prediction and Pseudo-Associative Cache

● Extra bits are associated with each set. – Predicts the next block to be accessed. – Multiplexor could be set early to select the predicted block.

● Alpha 21264 uses way prediction on its 2-way set associative instruction cache. – If prediction is correct, access is 1 cycle. – Experiments with SPEC95 suggest higher than 85% prediction success.

97 Pseudo-Associativity

● How to combine fast hit time of Direct Mapped and have the lower conflict misses of 2-way SA cache?

● Divide cache: on a miss, check the other half of the cache to see if the data is there; if so, we have a pseudo-hit (slow hit).
● [Figure: access time line — hit time < pseudo-hit time < miss penalty.]

● Drawback: the CPU pipeline is harder to design if a hit can take either 1 or 2 cycles.

– Suitable for caches not tied directly to the processor (L2). – Used in the MIPS R10000 L2 cache; similar in UltraSPARC.
98 Reducing Hit Time (5): Virtual Cache
● Physical cache – physically indexed and physically tagged cache.
● Virtual cache – virtually indexed and virtually tagged cache.
– Must flush the cache at a process switch, or add a PID.
– Must handle virtual address aliases to an identical physical address.
● [Figure: a virtual cache is indexed and tagged with the V-addr (PID, V-tag, data), consulting the TLB only to go to L2; a physical cache first translates the V-addr through the TLB and then uses the P-index and P-tag.]
99 Virtually Indexed, Physically Tagged Cache
● Motivation: fast cache hit by accessing the cache with the V-index in parallel with the TLB access.
● Avoids a process id having to be associated with cache entries.
● [Figure: the V-index selects the set while the TLB produces the P-addr; the P-tag is compared with the tag stored in the cache.]

100 Virtual Cache

● Cache both indexed and tag checked using virtual address: – Virtual to Physical translation not necessary for cache hit.

● Issues: – How to get page protection information?

● Page-level protection information is checked during virtual to physical address translation. – How are process context switches handled? – How are synonyms handled?

● Cache indexed using virtual address: – Tag compared using physical address. – Indexing is carried out while virtual-to physical translation is occurring.

● Advantages: – No PID needs to be associated with cache blocks. – No protection information needed for cache blocks. – Synonym handling not a problem.

102 Pipelined Cache Access

● L1 cache access can take multiple cycles: – Giving fast cycle time and slower hits but lower average cache access time.

● Pentium takes 1 cycle to access cache: – Pentium III takes 2 cycles. – Pentium 4 takes 4 cycles.

103 Modern Computer Architectures Lecture 25: Some More Cache Optimizations

Prof. Sandeep Panda, Koustuv Group of Institutions 104 Reducing Miss Rate: An Anatomy of Cache Misses

● To be able to reduce miss rate, we should be able to classify misses by causes (3Cs): – Compulsory—To bring blocks into cache for the first time.

● Also called cold start misses or first reference misses. Misses in even an Infinite Cache. – Capacity—Cache is not large enough, some blocks are discarded and later retrieved.

● Misses even in Fully Associative cache. – Conflict—Blocks can be discarded and later retrieved if too many blocks map to a set.

● Also called collision misses or interference misses. (Misses in N-way Associative, Size X Cache) 105 Classifying Cache Misses

● Later we shall discuss a 4th "C": – Coherence - Misses caused by cache coherence actions (e.g., invalidations from other processors). To be discussed in the multiprocessors part.

106 Three (or Four) Cs (Cache Miss Terms)
● Compulsory Misses:
– Cold start misses (caches do not have valid data at the start of the program).
● Capacity Misses:
– Increase cache size.
● Conflict Misses:
– Increase cache size and/or associativity.
– Associative caches reduce conflict misses.
● Coherence Misses:
– In multiprocessor systems (later lectures…).
● [Figure: processor/cache diagrams with blocks 0x1234, 0x5678, 0x91B1, 0x1111 illustrating the miss types.]
107 Reducing Miss Rate (1): Hardware Prefetching

● Prefetch both data and instructions: –Instruction prefetching done in almost every processor. –Processors usually fetch two blocks on a miss: requested and the next block.

● UltraSPARC III computes strides in data accesses: – Prefetches data based on this. 108 Reducing Miss Rate (2): Software Prefetching

● Prefetch data: – Load data into register (HP PA-RISC loads). – Cache prefetch: load into a special prefetch cache (MIPS IV, PowerPC, SPARC v. 9). – Special prefetching instructions cannot cause faults; a form of speculative fetch.
109 Compiler Controlled Prefetching
● Compiler inserts instructions to prefetch data before it is needed.

● Two flavors: – Binding prefetch: Requests to load directly into register.

● Must be correct address and register! – Non-Binding prefetch: Load into cache.

● Can be incorrect. What if Faults occur?
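As an illustration of a non-binding prefetch, here is a minimal sketch of mine (assuming a GCC/Clang toolchain, whose __builtin_prefetch emits a prefetch hint that cannot fault); the prefetch distance of 16 elements is an arbitrary assumption.

#include <stdio.h>
#include <stddef.h>

/* Sum an array while hinting the cache about data needed a few iterations ahead. */
long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 1 /* low temporal locality */);
        s += a[i];
    }
    return s;
}

int main(void) {
    long a[1000];
    for (size_t i = 0; i < 1000; i++) a[i] = (long)i;
    printf("%ld\n", sum_with_prefetch(a, 1000));
    return 0;
}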

● Prefetch instructions incur instruction execution overhead: – Is cost of prefetch issues less than savings in reduced misses? 110 Example 8

● By how much would AMAT increase when the prefetch buffer is removed? – Assume a 64KB data cache. – 30 data misses per 1000 instructions. – L1 miss penalty 15 cycles. – 25% are load/store instructions. – Prefetching reduces miss rate by 20%. – 1 extra clock cycle is incurred if the access misses in the cache but is found in the prefetch buffer.

111 Solution

● Miss rate (data)=(30/1000)*(100/25) =12/100=0.12

● AMAT (prefetch) = 1+(0.12*20%*1)+(0.12*(1-20%)*15) = 1+0.12*0.2+0.12*0.8*15 = 1+0.024+1.44 ≈ 2.46 cycles

● AMAT (No-prefetch) = 1+0.12*15 = 1+1.8 = 2.8 cycles

● About a 14% increase in AMAT (2.8/2.46 ≈ 1.14).
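A small check of the arithmetic above (mine, not the slides'); the 20% prefetch-hit fraction, 15-cycle penalty, 1-cycle hit time and 25% load/store mix are the example's stated assumptions.

#include <stdio.h>

int main(void) {
    double miss_rate   = (30.0 / 1000.0) / 0.25;      /* data misses per data access = 0.12 */
    double amat_pref   = 1 + miss_rate * 0.20 * 1
                           + miss_rate * 0.80 * 15;
    double amat_nopref = 1 + miss_rate * 15;
    printf("AMAT with prefetch    = %.3f cycles\n", amat_pref);
    printf("AMAT without prefetch = %.3f cycles\n", amat_nopref);
    printf("Increase              = %.1f%%\n",
           100.0 * (amat_nopref - amat_pref) / amat_pref);
    return 0;
}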

112 Impact of Cache Parameters ● Parameters: Cache size, block size, and set associativity.

● Other parameters: cache set number, cache blocks per set, and cache block size.

● How do they affect miss rate? – Recall 3Cs: Compulsory, Capacity, Conflict cache misses?

● How about miss penalty?

● How about cache hit time? 113 3Cs Absolute Miss Rate (SPEC92)

● [Figure: miss rate per type vs. cache size (1KB-128KB) for 1-, 2-, 4- and 8-way associativity; conflict misses shrink with associativity, capacity misses dominate, and compulsory misses are vanishingly small.]
114 Cache Size
● [Figure: the same absolute miss-rate breakdown vs. cache size; the total miss rate falls as the cache grows.]

● Rule of thumb: 2x size => 25% cut in miss rate

● Which miss does it reduce?
115 3Cs Relative Miss Rate
● [Figure: relative (percentage) breakdown of conflict, capacity and compulsory misses vs. cache size (1KB-128KB) for 1- to 8-way associativity.]

116 Cache Insights

● Assume total cache size not changed:

● What happens if: 1)Change Block Size: 2)Change Associativity:

Which 3Cs are affected?

117 Larger Caches

● Larger caches are obvious ways to reduce capacity misses.

● However: – Larger caches have higher hit times. – Larger caches have higher cost.

● L2 caches have become much larger: – Not true for L1 caches.

118 Higher Associativity

● Higher associativity reduces conflict misses – Higher associativity increases hit time ● How do we decide associativity? ● First observation: associativity higher than 8-way is likely not useful. ● Second observation: “2:1 cache rule of thumb”: – The miss rate of a direct-mapped cache of size N is about the same as a 2-way set associative cache of size N/2. – Seems to hold for caches 128K and under. 119 Larger Block Size?

● [Figure: miss rate vs. block size (16-256 bytes) for cache sizes 1K, 4K, 16K, 64K and 256K; the miss rate first falls and then rises as blocks get larger, especially for small caches.]
120 Larger Block Sizes

● Larger blocks reduce compulsory misses: – Due to better spatial locality.

● Larger blocks increase miss penalty: – Although “critical word first” can help.

● Larger blocks may increase conflict misses: – Given a cache size, larger blocks mean fewer block frames.

● Therefore, there is a trade-off: – The best block size must be chosen carefully.

121 Compiler Optimization

● Ways in which code can be modified to have fewer misses: – By a compiler – By a user

● McFarling [1989] reported 75% reduction in cache misses on an 8KB direct mapped cache: – Reorder instructions so as to reduce conflict misses.

122 Reducing Misses by Compiler Optimizations ● Popular techniques: – Merging Arrays: improve spatial locality by single array of compound elements vs. 2 separate arrays. – Loop Interchange: change nesting of loops to access data. – Loop Fusion: Combine 2 independent loops that have same looping and some variables overlap. – Blocking: Improve temporal locality by accessing “blocks” of data repeatedly vs. processing whole columns or rows. 123 Merging Arrays Example /* Before: 2 sequential arrays */ int val[SIZE]; int key[SIZE]; /* After Merging: 1 array of stuctures */ struct merge { int val; int key; }; struct merge merged_array[SIZE];

● Reduces potential conflicts between val & key.
● Improves spatial locality.
124 Loop Interchange: Example 1
2-D array initialization, two alternatives:

/* i outer, j inner */
int a[200][200];
for (i=0; i<200; i++) {
  for (j=0; j<200; j++) {
    a[i][j] = 2;
  }
}

/* j outer, i inner */
int a[200][200];
for (j=0; j<200; j++) {
  for (i=0; i<200; i++) {
    a[i][j] = 2;
  }
}

● Which alternative is best? – i,j? – j,i?

● To answer this, one must understand the memory layout of a 2-D array.

125 2-D Arrays in Memory

● The elements of a 2-D array are stored in contiguous memory cells.

● The problem is that array is 2-D – Computer memory is 1-D.

● Therefore, there must be a mapping from 2-D to 1-D: – From a 2-D abstraction to a 1-D implementation.

126 Row-Major, Column-Major
● Row-Major:

– Rows are stored contiguously 1st row 2nd row 3rd row 4th row

● Column-Major: – Columns are stored contiguously 1st col 2nd col 3rd col 4th col
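To make the row-major layout concrete, here is a small sketch (mine, not from the slides) of the address calculation a C compiler effectively performs for int a[200][200]:

#include <stdio.h>

#define ROWS 200
#define COLS 200

int main(void) {
    static int a[ROWS][COLS];
    int i = 3, j = 7;
    /* Row-major: element (i,j) lives i*COLS + j elements past the base. */
    int *computed = (int *)a + (i * COLS + j);
    printf("&a[%d][%d] = %p, base + (i*COLS + j) = %p\n",
           i, j, (void *)&a[i][j], (void *)computed);
    return 0;
}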

127 Row-Major

● C uses Row-Major

● [Figure: rows laid out one after another in memory; each memory/cache line holds consecutive elements of a row.]
● Matrix elements are stored in contiguous memory lines.
128 Row-Major
Better choice:
int a[200][200];
for (i=0; i<200; i++)
  for (j=0; j<200; j++)
    a[i][j] = 2;

● Worse:
int a[200][200];
for (i=0; i<200; i++)
  for (j=0; j<200; j++)
    a[j][i] = 2;
129

Loop Interchange Example (row-major ordering)
/* Before */
for (j=0; j<100; j++)
  for (i=0; i<5000; i++)
    x[i][j] = 2*x[i][j];
What is the worst that could happen? Hint: DM cache.

/* After */
for (i=0; i<5000; i++)
  for (j=0; j<100; j++)
    x[i][j] = 2*x[i][j];

Improved cache efficiency

Is this always a safe transformation? Does it always lead to higher efficiency?

130 Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

2 misses per access to a & c vs. one miss per access; improve temporal locality

131 Blocking Example: Dense Matrix Multiplication
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }

● Two Inner Loops:

– Read all NxN elements of z[]

– Read N elements of 1 row of y[] repeatedly

– Write N elements of 1 row of x[]

● Idea: compute on a BxB submatrix that fits in the cache.
132 Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1,N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B-1,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }

● B is called the Blocking Factor.
● Capacity misses reduced from 2N³ + N² to N³/B + 2N².

● But may suffer from conflict misses.
133 Cache Optimization Summary

Technique                          MR  MP  HT  Complexity
Larger Block Size                  +   –       0
Higher Associativity               +       –   1
Victim Caches                      +           2
Pseudo-Associative Caches          +           2
HW Prefetching of Instr/Data       +           2
Compiler Controlled Prefetching    +           3
Compiler Reduce Misses             +           0
Priority to Read Misses                +       1
Early Restart & Critical Word 1st      +       2
Non-Blocking Caches                    +       3
Second Level Caches                    +       2
(MR = miss rate, MP = miss penalty, HT = hit time; the first group of techniques targets miss rate, the second miss penalty)

134 Summary of Fast Cache Design Parameters

135 Higher Bandwidth

● One obvious way to improve main memory performance is to have a higher memory bandwidth: – Can bring in more bytes per time unit from the memory up the hierarchy.

● Three popular techniques: – Wider memory – Interleaved memory – Independent memory banks 136 Wider Memory

● One way in which one can reduce miss penalty is to get data to the CPU faster.

● Achieved by "widening" the memory bus.

● Since the CPU needs one word at a time, there needs to be a multiplexer between the CPU and the cache.

137 Interleaved Memory

● Memory consists of multiple memory chips.

● Therefore, each chip could be made to serve part of a request at any time: – after all, it doesn’t cost us anything (beyond power) to use all the hardware we have.

● [Figure: four-way interleaving — consecutive 4-byte words 0, 1, 2, 3 are spread across memory banks 0-3.]
138 Independent Memory Banks

● Generalization of the interleaving.

● Multiple memory controllers, multiple banks, multiple buses: – Like having multiple memory systems. – Each such memory can itself be composed of interleaved memory banks. – Each such memory has a distinct use

● e.g. input devices

139 Modern Computer Architectures Lecture 26: Memory Techniques

Prof. Sandeep Panda, Koustuv Group of Institutions

140 A Typical Main Memory Organization

141 Attributes of memory hierarchy components

Component           Technology        Bandwidth       Latency   Cost per bit ($)  Cost per gigabyte ($)
Disk drive          Magnetic          10+ MB/s        10 ms     < 1x10E-9         < 1
Main memory         DRAM              2+ GB/s         50+ ns    < 2x10E-7         < 200
On-chip L2 cache    SRAM              10+ GB/s        2+ ns     < 1x10E-4         < 100K
On-chip L1 cache    SRAM              50+ GB/s        300+ ps   > 1x10E-4         > 100K
Register file       Multiported SRAM  200+ GB/s (?)   300+ ps   > 1x10E-2         > 10M (?)

142 Dynamic Random Access Memory (DRAM)

● Main Memory is DRAM-based: – Dynamic since needs to be refreshed periodically (8 ms, 1% time).

● Refreshing causes variability in AMAT. – Addresses divided into 2 halves (Memory as a 2D matrix):

● RAS or Row Access Strobe

● CAS or Column Access Strobe

143 DRAMS

● Access time: – Time between when a read request is made and when desired word arrives.

● Cycle time: – Minimum time between two requests to memory.

● DRAMs require the data to be written back after a read: – Causes access time and cycle time to differ.

● Just as virtually all desktops and servers use DRAM for main memory: – Virtually all processors use SRAM for caches.

145 Static RAM (SRAM)

● Six transistors in cross connected fashion: – Provides regular AND inverted outputs. – Implemented in CMOS process.

[Figure: single-port 6-T SRAM cell.]
146 Dynamic RAM

● SRAM cells exhibit high speed/poor density.

● DRAM: simple transistor/capacitor pairs in high-density form.
● [Figure: 1T-1C DRAM cell — the word line gates a transistor that connects the capacitor C to the bit line, which is read by a sense amp.]
147 Synchronous DRAM (SDRAM)

● Traditional DRAMs are asynchronous: –Overhead incurred to synchronize.

● SDRAMs use a clock input to overcome this overhead: –Faster AMAT.

148 DIMM

● Dual Inline Memory Modules

– DIMMs usually contain 4 to 16 DRAMs – Normally organized 8 bytes wide for desktops

149 DIMM ● SDRAM DIMMs - These first synchronous DRAM DIMMs had the same bus frequency for data, address and control lines.

– PC66 = 66 MHz, PC100 = 100 MHz, PC133 = 133 MHz

● DDR SDRAM (DDR1) SDRAM DIMMs - DIMMs based on Double Data Rate (DDR). This is achieved by clocking on both the rising and falling edge of the data strobes.

– PC1600 = 200 MHz data & strobe / 100 MHz clock for address and control – PC2100 = 266 MHz data & strobe / 133 MHz clock for address and control – PC2700 = 333 MHz data & strobe / 166 MHz clock for address and control – PC3200 = 400 MHz data & strobe / 200 MHz clock for address and control

● DDR2 SDRAM DIMMs - DIMMs based on Double Data Rate 2 (DDR2) DRAM also have data and data strobe frequencies at double the rate of the clock.

– PC2-3200 = 400 MHz data & strobe / 200 MHz clock for address and control – PC2-4200 = 533 MHz data & strobe / 266 MHz clock for address and control – PC2-5300 = 667 MHz data & strobe / 333 MHz clock for address and control – PC2-6400 = 800 MHz data & strobe / 400 MHz clock for address and control

150 RAMBUS DRAM(RDRAM)

● Takes standard DRAM core:

– Dropped RAS/CAS: provided a bus interface called a packet switched bus. – A single chip acts like a memory system.

[Figure: multiple RAMBUS banks composing an RDRAM memory system.]
151 Packet-Switched Bus

● Between sending the address of a request and return of data: – Allows other accesses over the bus.

152 RDRAM with Integrated Heat Sink

153 Issues with RDRAM

● Compared to other contemporary standards, Rambus shows significantly increased: – Latency, heat output, manufacturing complexity, and cost. ● RDRAM requires larger die size: – Required to house the added interface – Results in a 10-20 percent price premium.

154 Issues with RDRAM

● Few DRAM manufacturers ever obtained the license to produce RDRAM: – Those who did license the technology failed to make enough RIMMs to satisfy PC market demand. – Caused RIMM (Rambus Inline Memory Module) to be priced higher than SDRAM DIMMs.

● During RDRAM's decline, – DDR continued to advance in speed while, at the same time, it was still cheaper than RDRAM.

● While RDRAM is still produced today, – Few motherboards support RDRAM. – Between 2002 and 2007, the market share of RDRAM never went beyond 5%. 155 RDRAM in Use

● Though RDRAM has been less successful in desktops: – It has been used in video game consoles such as the Nintendo 64 and PlayStation 2. – RDRAM has also been used by Cirrus Logic in their video cards.

156 Flash Memory

● A form of EEPROM: – However, allows a block to be erased or written in a single operation (in a flash).

● Floating Gate Avalanche-injection Metal Oxide Semiconductor (FAMOS)

● Electrons are trapped in a floating gate.

● Writing a byte requires creating a new block: – Old block is copied along with the byte to be written.

157 Flash vs. EEPROM

● EEPROM can write to one location or byte at a time: – Flash writes multiKB blocks. – As a result flash memory is faster. – Also flash can be written in-system in contrast to EEPROM. – The control circuitry required for erasing is much less leading to higher capacity. 158 FLASH Memory cont…

● Performance: – Reads at the speed of DRAM (~ns). – Writes like disk (~ms); a write is a complex operation.

159 Flash Memory cont… ● Memory capacity is increased by reducing the area dedicated to control erasing.

● Number of writes is restricted due to wear in insulating oxide layer.

● Used to take 12V to write: – Present generation flash operates at 2.7V.

● Multilevel flash technology: – With precise multilevel voltage it becomes possible to store more than one bit per cell.

160 Virtual Memory

● Virtual memory – separation of logical memory from physical memory.

– Only a part of the program needs to be in memory for execution. Hence, the logical address space can be much larger than the physical address space. (Main memory is like a cache to the hard disk!) – Allows address spaces to be shared by several processes (or threads). – Allows for more efficient process creation.

● Virtual memory can be implemented via:

– Demand paging
– Demand segmentation
161 Virtual Address

● The concept of a virtual (or logical) address space that is bound to a separate physical address space is central to memory management – Virtual address – generated by the CPU – Physical address – seen by the memory

● Virtual and physical addresses are the same in compile-time and load-time address-binding schemes;

– Virtual and physical addresses differ in execution- time address-binding schemes

162 Advantages of Virtual Memory
● Translation:

– Program can be given consistent view of memory, even though physical memory is scrambled

– Only the most important part of program (“Working Set”) must be in physical memory.

– Contiguous structures (like stacks) use only as much physical memory as necessary yet grow later.

● Protection:

– Different threads (or processes) protected from each other.

– Different pages can be given special behavior

● (Read Only, Invisible to user programs, etc).

– Kernel data protected from User programs

– Very important for protection from malicious programs => Far more “viruses” under Microsoft Windows

● Sharing:
– Can map the same physical page to multiple users ("shared memory").
163 Use of Virtual Memory

[Figure: the address spaces of Process A and Process B — each contains stack, shared libs, heap, static data and code; a shared page is mapped into both.]
164 Virtual vs. Physical Address Space

[Figure: virtual pages A, B, C and D of a 4GB virtual address space are mapped to frames scattered through physical main memory or kept on disk.]
165 Paging

● Divide physical memory into fixed-size blocks (e.g., 4KB) called frames

● Divide logical memory into blocks of same size (4KB) called pages

● To run a program of size n pages, need to find n free frames and load program

● Set up a page table to map page addresses to frame addresses (operating system sets up the page table)

166 Page Table and Address Translation

[Figure: the virtual page number (VPN) indexes the page table held in main memory to obtain the physical page number (PPN); the PPN concatenated with the page offset forms the physical address.]
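A minimal single-level translation sketch of mine (the 4KB page size and the flat, fully populated table are assumptions) showing how the VPN selects a page-table entry and the PPN is rejoined with the offset:

#include <stdio.h>
#include <stdint.h>

#define PAGE_BITS 12u                       /* 4KB pages (assumption)        */
#define NPAGES    16u                       /* toy virtual address space     */

static uint32_t page_table[NPAGES];         /* VPN -> PPN, set up by the OS  */

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1u);
    uint32_t ppn    = page_table[vpn];
    return (ppn << PAGE_BITS) | offset;     /* physical address */
}

int main(void) {
    page_table[2] = 7;                      /* map virtual page 2 -> frame 7 */
    printf("0x%X -> 0x%X\n", 0x2ABCu, translate(0x2ABCu));
    return 0;
}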

167 Page Table Structure Examples

● One-to-one mapping — how much space does the page table need?
– Large pages → internal fragmentation (similar to having large line sizes in caches).
– Small pages → page table size issues.

● Multi-level Paging

● Inverted Page Table
Example: 64-bit address space, 4 KB pages (12 bits), 512 MB (29 bits) RAM
Number of pages = 2^64 / 2^12 = 2^52 (the page table has as many entries)
Each entry is ~4 bytes, so the size of the page table is 2^54 bytes = 16 petabytes!

Can’t fit the page table in the 512 MB RAM! 168 Multi-level (Hierarchical) Page Table ● Divide virtual address into multiple levels
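A sketch of mine of the two-level walk (the 10/10/12-bit split and the table names are assumptions); space is saved because a level-2 table is only allocated for regions of the address space that are actually used:

#include <stdio.h>
#include <stdint.h>

#define P1_BITS     10u
#define P2_BITS     10u
#define OFFSET_BITS 12u

static uint32_t level2_table[1u << P2_BITS];   /* one allocated level-2 table   */
static uint32_t *directory[1u << P1_BITS];     /* level-1 directory of pointers */

uint32_t walk(uint32_t vaddr) {
    uint32_t p1     = vaddr >> (P2_BITS + OFFSET_BITS);
    uint32_t p2     = (vaddr >> OFFSET_BITS) & ((1u << P2_BITS) - 1u);
    uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1u);
    uint32_t *l2    = directory[p1];           /* level-1 lookup               */
    uint32_t ppn    = l2[p2];                  /* level-2 lookup (stores PPN)  */
    return (ppn << OFFSET_BITS) | offset;
}

int main(void) {
    directory[0]    = level2_table;            /* only this region is mapped   */
    level2_table[3] = 0x42;                    /* virtual page 3 -> frame 0x42 */
    printf("0x%X -> 0x%X\n", 0x3123u, walk(0x3123u));
    return 0;
}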

● [Figure: the virtual address is split into P1, P2 and page offset; P1 indexes the level-1 page directory (a pointer array kept in main memory), P2 indexes the selected level-2 page table, which stores the PPN; the PPN plus the page offset give the physical address.]
169 Inverted Page Table

● One entry for each real page of memory

● Shared by all active processes

● Entry consists of the virtual address of the page stored in that real memory location, with Process ID information

● Decreases memory needed to store each page table, but increases time needed to search the table when a page reference occurs

170 Linear Inverted Page Table

● Contains one entry per physical page frame, kept in a linear array

● Need to traverse the array sequentially to find a match
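A toy scan of a linear inverted page table (my sketch; the entry layout mirrors the figure below): the array has one entry per physical frame, and the index of the matching (PID, VPN) entry is the PPN.

#include <stdio.h>
#include <stdint.h>

#define NFRAMES 8u                              /* toy physical memory */

struct ipt_entry { uint32_t pid, vpn; };
static struct ipt_entry ipt[NFRAMES];

/* Returns the PPN (the index of the matching entry) or -1 on a miss. */
int ipt_lookup(uint32_t pid, uint32_t vpn) {
    for (uint32_t ppn = 0; ppn < NFRAMES; ppn++)
        if (ipt[ppn].pid == pid && ipt[ppn].vpn == vpn)
            return (int)ppn;
    return -1;                                  /* not mapped: page fault */
}

int main(void) {
    ipt[5] = (struct ipt_entry){ 8u, 0x2AA70u };   /* frame 5 holds PID 8, VPN 0x2AA70 */
    printf("PPN = %d\n", ipt_lookup(8u, 0x2AA70u));
    return 0;
}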

● Can be time consuming
● [Figure: linear inverted page table lookup — for PID = 8, VPN = 0x2AA70, the table is scanned until the matching entry is found at index 0x120D, so PPN = 0x120D; the PPN concatenated with the offset gives the physical address.]
171 Hashed Inverted Page Table

● Use a hash table to limit the search to a smaller number of page-table entries
● [Figure: hashed inverted page table — (PID 8, VPN 0x2AA70) is hashed through a hash anchor table into the page table, and the chain linked by the Next field is followed until the matching entry (at 0x120D) is found.]
172 Fast Address Translation

● How often address translation occurs?

● Where the page table is kept?

● Keep translation in the hardware

● Use Translation Lookaside Buffer (TLB)

– Instruction-TLB & Data-TLB – Essentially a cache (tag array = VPN, data array=PPN) – Small (32 to 256 entries are typical) – Typically fully associative (implemented as a content addressable memory, CAM) or highly associative to minimize conflicts
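A small fully associative TLB sketch (mine, not the lecture's): every valid entry's VPN is compared against the lookup VPN; in hardware this comparison happens in parallel in a CAM rather than in this sequential loop.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 32u        /* small and fully associative (32-256 is typical) */

struct tlb_entry { bool valid; uint32_t vpn, ppn; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and writes the PPN; a miss would trigger a page walk. */
bool tlb_lookup(uint32_t vpn, uint32_t *ppn) {
    for (uint32_t i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) { *ppn = tlb[i].ppn; return true; }
    return false;
}

int main(void) {
    uint32_t ppn = 0;
    tlb[0] = (struct tlb_entry){ true, 0x12345u, 0x00677u };
    printf("hit = %d, ppn = 0x%X\n", tlb_lookup(0x12345u, &ppn), ppn);
    return 0;
}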

173 Example: Alpha 21264 data TLB
● [Figure: each TLB entry holds an Address Space Number (ASN) <8>, protection bits <4>, a valid bit <1>, a tag <35> and a PPN <31>; the VPN <35> of the virtual address is compared against the tags, a 128:1 mux selects the matching PPN, and the PPN plus the 13-bit page offset form a 44-bit physical address.]
174 TLB and Caches

● Several Design Alternatives

– VIVT: Virtually-indexed Virtually-tagged Cache – VIPT: Virtually-indexed Physically-tagged Cache – PIVT: Physically-indexed Virtually-tagged Cache

● Not generally useful; the MIPS R6000 is the only processor that used this.
– PIPT: Physically-indexed Physically-tagged Cache

175 Virtually-Indexed Virtually-Tagged (VIVT) Cache
● [Figure: the processor core sends the VA to the VIVT cache; on a hit the cache line is returned directly, and only on a miss does the address go through the TLB to main memory.]
• Fast cache access
• Only requires address translation when going to memory (miss)
• Issues?
176 VIVT Cache Issues - Aliasing

● Homonym

– Same VA maps to different PAs

– Occurs when there is a context switch

– Solutions

● Include process id (PID) in cache or

● Flush cache upon context switches

● Synonym (also a problem in VIPT)

– Different VAs map to the same PA

– Occurs when data is shared by multiple processes

– Duplicated cache line in VIPT cache and VIVT$ w/ PID

– Data is inconsistent due to duplicated locations

– Solutions:
● Can write-through solve the problem?
● Flush cache upon context switch
177 Physically-Indexed Physically-Tagged (PIPT)
● [Figure: the processor core's VA is first translated by the TLB to a PA, which then accesses the PIPT cache; on a hit the cache line is returned, on a miss main memory is accessed.]

• Slower, always translate address before accessing memory • Simpler for data coherence

178 Virtually-Indexed Physically-Tagged (VIPT)

● [Figure: the processor core sends the VA in parallel to the TLB and the VIPT cache; the TLB supplies the PA for tag comparison; on a hit the cache line is returned, on a miss main memory is accessed.]

● Gain benefit of a VIVT and PIPT

● Parallel Access to TLB and VIPT cache

● No Homonym

● How about Synonym? 179

Deal w/ Synonym in VIPT

● [Figure: Process A (VPN A) and Process B (VPN B) point to the same location within a page, but VPN A != VPN B, so the two VPNs can index different sets of the cache and the line is duplicated.]
● How to eliminate duplication? Make Index A == Index B?
180 Synonym in VIPT Cache

● [Figure: the virtual address splits into VPN and page offset; the cache splits it into cache tag, set index and line offset. The set-index bits that extend above the page offset are labelled 'a'.]
• If two VPNs do not differ in 'a' then there is no synonym problem, since they will be indexed to the same set of a VIPT cache.
• This implies the number of sets cannot be too big.
• Max number of sets = page size / cache line size
– Ex: 4KB page, 32B line, max sets = 128
181 A Complicated Solution: MIPS R10000's Solution to Synonym

● 32KB 2-way virtually-indexed L1
● [Figure: below the VPN lies the 12-bit page offset; the L1 uses a 10-bit index and a 4-bit line offset, so a = VPN[1:0]; these bits are stored as part of the L2 cache tag.]
● Direct-Mapped Physical L2

– L2 is Inclusive of L1

– VPN[1:0] is appended to the “tag” of L2

● Given two virtual addresses VA1 and VA2 that differs in VPN[1:0] and both map to the same physical address PA

– Suppose VA1 is accessed first so blocks are allocated in L1&L2

– What happens when VA2 is referenced?
1. VA2 indexes to a different block in L1 and misses.
2. VA2 translates to PA and goes to the same block as VA1 in L2.
3. Tag comparison fails (since VA1[1:0] != VA2[1:0]).

4. Treated just like an L2 conflict miss → VA1's entry in L1 is ejected (or dirty-written back if needed) due to the inclusion policy.
182

VA1 R10000 VA2 Page offset Page offset index index a1 a2

1

miss 0 TLB

L1 VIPT cache

L2 PIPT Cache Physical index || a2

a2 !=a1 a1 Phy. Tag data 183 Deal w/ Synonym in MIPS

VA1 R10000 VA2 Page offset Page offset index index a1 a2

0 Only one copy is present in L1 1 TLB

L1 VIPT cache

L2 PIPT Cache Data return

a2 Phy. Tag data 184 Issues in Virtual Memory

● Number of pages are so many that the page tables are kept in MM.

● Assume 8-byte virtual address.

● Page size is 10 bits

● Each page stores 64 page table entries

● How many pages needed to store the page table?

● Does access to a word, require two main memory accesses?

185 TLB
● [Figure: a TLB entry contains Tag, Physical Address, Prot, Valid, Dirty and ASN fields.]

● TLB is a fully associative cache

● TLB entry is like a cache entry: – Tag contains portion of the virtual address.

● Each TLB entry has an Address Space Number (ASN): – Plays the same role as PID.

186 Alpha 21264 Virtual Memory

● 48-bit virtual memory is divided into 3 segments: – Seg0 (bit 63 to bit 47 are 00000…00) – Seg1 (bit 63 to bit 46 are 00000…10) – Seg2 (kseg) (bit 63 to bit 46 are 11111…11)

● Kseg is reserved for kernel: – Has uniform protection for the whole space. – Does not use memory management.

187 Alpha 21264 Virtual Memory

● User processes live in Seg 0.

● Seg 1 is used to keep portions of the page table.

● A 3-level hierarchical page table is used.

188 Main Memory Optimizations: Summary

● Wider Memory

● Interleaved Memory: for sequential or independent accesses

● DRAM specific optimizations.

189 Sun Fire 6800: Case Study

● L1 cache: – On-chip – Data cache: 32KB – Instruction cache: 64KB – Block size 32B – 4-way set associative – Write through, no write allocate ● L2 cache: – 8MB Unified cache – Tags for L2 cache on chip --- Total size 90KB – Write back, write allocate 190 Sun Fire 6800: Case Study

● 2KB write buffer

● Outstanding memory accesses supported by the memory system: – Up to 15.

● Virtually indexed, physically tagged cache – In parallel with cache access, the 64-bit virtual address is translated to a 48-bit physical address.

● 2 TLBs: – A 16-entry fully associative cache – A 128-entry 4-way set associative cache

191 Sun Fire 6800: Case Study

● Both data and instruction prefetch supported.

● Data prefetch:

– Up to 8 prefetches by both hardware and software. – If a load hits in prefetch buffer:

● The next load address is prefetched.

● Small 32B instruction prefetch buffer:

– On an instruction cache miss, two blocks are requested:

● One for the instruction cache the other for the prefetch buffer. 192 Consistency of Cached Data

● Same data can be found in the cache as well as the memory. – When CPU is the sole accessing unit, there is no chance of inconsistency.

● What if I/O can occur independent of CPU? – In fact, in many systems I/O occurs directly to main memory. – In a write through cache, there will be no inconsistency with I/O writes.

● But, what about write-back caches?

193 Solutions to the Cache Consistency Problem 1. Guarantee no blocks marked for I/O are present in cache. 2. I/O always occurs to a buffer marked non-cachable. 3. Hardware solution: Check I/O address to see if it is in cache.

194 Summary Cont…
• Any high performance processor would be rendered ineffective without matching support from the memory hierarchy.
• Ways to improve cache performance: reduce AMAT = Hit time + Miss rate × Miss penalty, i.e. reduce the cache hit time, the miss rate and the miss penalty.

195 Summary Cont…

● Memory technologies: – DRAM – SRAM – SDRAM – RDRAM – DRDRAM – Flash Memory

196 References
[1] J.L. Hennessy & D.A. Patterson, "Computer Architecture: A Quantitative Approach," Morgan Kaufmann Publishers, 3rd Edition, 2003.
[2] John Paul Shen and Mikko Lipasti, "Modern Processor Design," Tata McGraw-Hill, 2005.
[3] S. McFarling, "Program Optimization for Instruction Caches," Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 183-191, April 1989.
[4] R. Mall, Slides from IIT Kharagpur.

197