
Chapter 8

Pipelining and Vector Processing

8–1 If the pipeline stages are heterogeneous (i.e., they take different amounts of time), the slowest stage determines the flow rate of the entire pipeline, leaving the faster stages idle for part of the time.

8–2 Pipeline stalls can be caused by three types of hazards: resource, data, and control hazards. Resource hazards result when two or more instructions in the pipeline want to use the same resource. Such resource conflicts can result in serialized execution, reducing the scope for overlapped execution.

Data hazards are caused by data dependencies among the instructions in the pipeline. As a simple example, suppose that the result produced by instruction I1 is needed as an input to instruction I2. We have to stall the pipeline until I1 has written the result so that I2 reads the correct input. If the pipeline is not designed properly, data hazards can produce wrong results by using incorrect operands. Therefore, we have to worry about the correctness first.

Control hazards are caused by control dependencies. As an example, consider the flow control altered by a branch instruction. If the branch is not taken, we can proceed with the instructions in the pipeline. But, if the branch is taken, we have to throw away all the instructions that are in the pipeline and fill the pipeline with instructions at the branch target.

8–3 Prefetching is a technique used to handle resource conflicts. Pipelining typically uses a just-in-time mechanism so that only a simple buffer is needed between the stages. We can minimize the performance impact if we relax this constraint by allowing a queue instead of a single buffer. The instruction fetch unit can prefetch instructions and place them in the instruction queue. The decoding unit will then have ample instructions to work on even if instruction fetch is occasionally delayed because of a cache miss or a resource conflict.
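The following sketch (in Python; not part of the original answer, and the program, fetch width, and stall cycles are made-up illustrative values) shows how a small instruction queue decouples the fetch and decode stages: fetch runs ahead whenever it can, so decode still finds an instruction during cycles in which fetch is delayed.

from collections import deque

def run(program, fetch_stalled_cycles, fetch_width=2):
    queue = deque()                       # instruction queue between fetch and decode
    program = list(program)
    for cycle in range(1, 7):
        if cycle not in fetch_stalled_cycles:
            for _ in range(fetch_width):  # prefetch up to fetch_width instructions
                if program:
                    queue.append(program.pop(0))
        decoded = queue.popleft() if queue else "bubble"
        print(f"cycle {cycle}: decode {decoded}")

# Fetch is stalled in cycles 3 and 4, yet decode never runs dry:
run(["add", "sub", "mult", "load", "store", "branch"], fetch_stalled_cycles={3, 4})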

8–4 Data hazards are caused by data dependencies among the instructions in the pipeline. As a simple example, suppose that the result produced by instruction I1 is needed as an input to instruction I2. We have to stall the pipeline until I1 has written the result so that I2 reads the correct input.

There are two techniques used to handle data dependencies: register interlocking and register forwarding. Register forwarding works if the two instructions involved in the dependency are both in the pipeline. The basic idea is to provide the output result as soon as it is available in the datapath. This technique is demonstrated in the following figure. For example, if we provide the output of I1 to I2 as we write it into the destination register of I1, we reduce the number of stall cycles by one (see part (a) of the figure). We can do even better if we feed the output from the IE stage, as shown in part (b). In this case, we completely eliminate the pipeline stalls.

[Figure: pipeline timing diagram over clock cycles 1–9 showing instructions I1–I4 in the IF, ID, OF, IE, and WB stages, with the result forwarded as it is written to the destination register. (a) Forward scheme 1]

[Figure: pipeline timing diagram over clock cycles 1–8 showing instructions I1–I4 in the IF, ID, OF, IE, and WB stages proceeding with no stalls. (b) Forward scheme 2]
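The following Python sketch (not part of the original answer) reproduces the stall counts implied by these figures and by the interlocking figure below for the 5-stage IF–ID–OF–IE–WB pipeline: it computes how many cycles a dependent instruction I2, starting one cycle behind the producer I1, must wait, given the cycle in which I1's result becomes available and the stage at which I2 needs it.

STAGES = ["IF", "ID", "OF", "IE", "WB"]

def stall_cycles(result_ready_cycle, consumer_needs_stage, consumer_start=2):
    # Cycle at which the consumer would reach the stage that needs the value if unstalled
    unstalled_cycle = consumer_start + STAGES.index(consumer_needs_stage)
    return max(0, result_ready_cycle - unstalled_cycle)

# I1 starts in cycle 1, so it occupies stage k (1-based) in cycle k.
# No forwarding: I2 may read the register in OF only after I1's WB (cycle 5) has completed.
print(stall_cycles(result_ready_cycle=6, consumer_needs_stage="OF"))   # 2 stalls
# Scheme 1: the result is forwarded while I1 writes it in WB, so I2's OF can occur in cycle 5.
print(stall_cycles(result_ready_cycle=5, consumer_needs_stage="OF"))   # 1 stall
# Scheme 2: the result comes straight out of I1's IE (cycle 4) into I2's IE input (cycle 5).
print(stall_cycles(result_ready_cycle=5, consumer_needs_stage="IE"))   # 0 stalls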

Register interlocking is a general technique to solve the correctness problem associated with data dependencies. In this method, a bit is associated with each register to specify whether the contents are correct. If the bit is 0, the contents of the register can be used. Instructions should not read contents of a register when this interlocking bit is 1, as the register is locked by another instruction. The following figure shows how the register interlocking works for the example given below:

I1: add  R2,R3,R4    /* R2 = R3 + R4 */
I2: sub  R5,R6,R2    /* R5 = R6 - R2 */
I1 locks the R2 register for clock cycles 3 to 5 so that I2 cannot proceed reading an incorrect R2 value. Clearly, register forwarding is more efficient than the interlocking method.
[Figure: pipeline timing diagram over clock cycles 1–10 showing instructions I1–I4 in the IF, ID, OF, IE, and WB stages, with R2 locked while I1 is in the pipeline and I2 stalled until R2 is released]
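A minimal Python sketch of the interlocking idea (not from the text; register names and values are illustrative) is shown below: the destination register is locked when the writing instruction issues, a reader stalls while the lock bit is set, and the lock is cleared at write-back.

lock_bit = {f"R{i}": False for i in range(8)}
register = {f"R{i}": 0 for i in range(8)}

def issue(dest):
    lock_bit[dest] = True                 # e.g., I1 locks R2 when it enters the pipeline

def writeback(dest, value):
    register[dest] = value
    lock_bit[dest] = False                # unlock once the correct value has been written

def operand_fetch(src):
    if lock_bit[src]:
        return None                       # stall: another instruction still owns src
    return register[src]

# I1: add R2,R3,R4 locks R2; I2: sub R5,R6,R2 must wait until R2 is unlocked
register["R3"], register["R4"], register["R6"] = 10, 5, 40
issue("R2")
print(operand_fetch("R2"))                # None -> I2 stalls while R2 is locked
writeback("R2", register["R3"] + register["R4"])
print(operand_fetch("R2"))                # 15 -> I2 can now proceed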

8–5 Flow-altering instructions such as branches require special handling in pipelined processors. The following figure shows the impact of a branch instruction on our pipeline. Here we assume that instruction Ib is a branch instruction; if the branch is taken, it transfers control to instruction It. If the branch is not taken, the instructions in the pipeline are useful. However, for a taken branch, we have to discard all the instructions that are in the pipeline at various stages. In our example, we have to discard instructions I2, I3, and I4 and start fetching instructions at the target address. This causes our pipeline to do wasteful work for three clock cycles, which is called the branch penalty.

[Figure: pipeline timing diagram over clock cycles 1–9 showing branch instruction Ib, discarded instructions I2, I3, and I4, and branch target instruction It. (a) Branch decision is known during the IE stage]

[Figure: pipeline timing diagram over clock cycles 1–9 showing branch instruction Ib, discarded instruction I2, and branch target instruction It. (b) Branch decision is known during the ID stage]
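To get a feel for how much these penalties matter, the short calculation below (not from the text; the branch and taken fractions are assumed, illustrative values) estimates the extra cycles per instruction caused by taken branches when the branch is resolved in the IE stage (3-cycle penalty) versus the ID stage (1-cycle penalty).

def average_branch_overhead(penalty_cycles, branch_fraction, taken_fraction):
    # Extra cycles per instruction caused by taken branches flushing the pipeline
    return branch_fraction * taken_fraction * penalty_cycles

branch_fraction = 0.20   # assumed: 1 in 5 instructions is a branch
taken_fraction = 0.60    # assumed: 60% of branches are taken

print(average_branch_overhead(3, branch_fraction, taken_fraction))   # 0.36 cycles/instruction (IE)
print(average_branch_overhead(1, branch_fraction, taken_fraction))   # 0.12 cycles/instruction (ID)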

8–6 Several techniques can be used to reduce the branch penalty.
• If we do nothing clever, we wait until the execution (IE) stage before initiating the instruction fetch at the branch target address. We can reduce the delay if we determine the branch outcome earlier. For example, if we find out whether the branch is taken, along with the target address, during the decode (ID) stage, we pay a penalty of only one cycle, as shown in the following figure.

[Figure: pipeline timing diagram over clock cycles 1–9 showing branch instruction Ib, discarded instructions I2, I3, and I4, and branch target instruction It. (a) Branch decision is known during the IE stage]

[Figure: pipeline timing diagram over clock cycles 1–9 showing branch instruction Ib, discarded instruction I2, and branch target instruction It. (b) Branch decision is known during the ID stage]

• Delayed branch execution effectively reduces the branch penalty further. The idea is based on the observation that we always fetch the instruction following the branch before we know whether the branch is taken. Why not execute this instruction instead of throwing it away? This implies that we have to place a useful instruction in this instruction slot. This instruction slot is called the delay slot. In other words, the branching is delayed until after the instruction in the delay slot is executed. Some processors like the SPARC and MIPS use delayed execution for both branching and procedure calls. When we apply this technique, we need to modify our program to put a useful instruction in the delay slot. We illustrate this by using an example. Consider the following code segment:

        add     R2,R3,R4
        branch  target
        sub     R5,R6,R7
        . . .
target: mult    R8,R9,R10
        . . .

If the branch is delayed, we can reorder the instructions so that the branch instruction is moved ahead by one instruction, as shown below:

        branch  target
        add     R2,R3,R4    /* Branch delay slot */
        sub     R5,R6,R7
        . . .
target: mult    R8,R9,R10
        . . .

Programmers do not have to worry about moving instructions into the delay slots. This job is done by the compilers and assemblers. When no useful instruction can be moved into the delay slot, a no operation (NOP) is placed.

• Branch prediction is traditionally used to handle the branch penalty problem. We discussed three branch prediction strategies: fixed, static, and dynamic. In the fixed strategy, as the name implies, the prediction is fixed: we assume that the branch is either never taken or always taken. Such strategies are simple to implement. The static strategy uses the instruction opcode to predict whether the branch is taken; for example, if the instruction is an unconditional branch, we use the "always taken" decision. The dynamic strategy looks at the run-time history to make more accurate predictions. The basic idea is to take the past executions of the branch in question and use this information to predict the next one.

8–7 Delayed branch execution effectively reduces the branch penalty. The idea is based on the observation that we always fetch the instruction following the branch before we know whether the branch is taken. Why not execute this instruction instead of throwing it away? This implies that we have to place a useful instruction in this instruction slot. This instruction slot is called the delay slot. In other words, the branching is delayed until after the instruction in the delay slot is executed.

8–8 In delayed branch execution, when the branch is not taken, sometimes we do not want to execute the delay slot instruction. That is, we want to nullify the delay slot instruction. Some processors like the SPARC provide this nullification option.

8–9 Branch prediction is traditionally used to handle the branch problem. We discussed three branch prediction strategies: fixed, static, and dynamic.

1. Fixed Branch Prediction: In this strategy, the prediction is fixed. Such strategies are simple to implement and assume that the branch is either never taken or always taken. The Motorola 68020 and VAX 11/780 use the branch-never-taken approach. The advantage of the never-taken strategy is that the processor can continue to fetch instructions sequentially to fill the pipeline, which involves minimum penalty in case the prediction is wrong. If, on the other hand, we use the always-taken approach, the processor prefetches the instruction at the branch target address. In a paged environment, this may lead to a page fault, and a special mechanism is needed to handle this situation. Furthermore, if the prediction is wrong, we have done a lot of unnecessary work. The branch-never-taken approach, however, is not appropriate for a loop structure: if a loop iterates 200 times, the branch is taken 199 out of 200 times. For loops, the always-taken approach is better. Similarly, the always-taken approach is preferred for procedure calls and returns.

2. Static Branch Prediction: Rather than following a fixed rule, this strategy uses the instruction opcode to predict whether the branch is taken. For example, if the instruction is an unconditional branch, we use the "always taken" decision; we use a similar decision for loop and call/return instructions. On the other hand, for conditional branches we may use the "never taken" decision. It has been shown that this strategy improves prediction accuracy.

3. Dynamic Branch Prediction: Dynamic strategy looks at the run-time history to make more accurate predictions. The basic idea is to take the past branch executions of the branch type in question and use this information to predict the next one. Will this work in practice? How much additional benefit can we derive over the static approach? An empirical study suggests that we can get significant improvement in prediction accuracy.

8–10 To show why the static strategy gives high prediction accuracy, we present sample data for commercial environments. In such environments, of all the branch-type operations, the branches are about 70%, loops are 10%, and the rest are procedure calls/returns. Of the total branches, 40% are unconditional. If we use a never-taken guess for the conditional branch and always-taken for the rest of the branch-type operations, we get a prediction accuracy of about 82% as shown in the following table.

Instruction type        Instruction          Prediction:       Correct
                        distribution (%)     branch taken?     prediction (%)
Unconditional branch    70 × 0.4 = 28        Yes               28
Conditional branch      70 × 0.6 = 42        No                42 × 0.6 = 25.2
Loop                    10                   Yes               10 × 0.9 = 9
Call/return             20                   Yes               20

Overall prediction accuracy = 82.2%

The data in this table assume that conditional branches are not taken about 60% of the time. Thus, our prediction that a conditional branch is never taken is correct only 60% of the time. This gives us 42 × 0.6 = 25.2% as the prediction accuracy for conditional branches. Similarly, loops jump back with 90% probability. Since loops appear about 10% of the time, the prediction is right 9% of the time. Surprisingly, even this simple static prediction strategy gives us about 82% accuracy!

If instead we apply the "never taken" decision to all branch-type instructions, our prediction accuracy drops to 26.2%, as shown in the following table. On the other hand, applying the "always taken" approach to all of them gives a prediction accuracy of 73.8%. In either case, the static strategy gives higher prediction accuracy.
Instruction type        Instruction          Correct prediction:    Correct prediction:
                        distribution (%)     never taken (%)        always taken (%)
Unconditional branch    70 × 0.4 = 28        0                      28
Conditional branch      70 × 0.6 = 42        42 × 0.6 = 25.2        42 × 0.4 = 16.8
Loop                    10                   10 × 0.1 = 1           10 × 0.9 = 9
Call/return             20                   0                      20

Overall prediction accuracy                  26.2%                  73.8%
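The three overall accuracies in these tables can be verified with a short calculation (a sketch, not from the text) using the stated mix: 70% branches (40% of them unconditional), 10% loops, 20% calls/returns, with conditional branches taken 40% of the time and loops jumping back 90% of the time.

mix = {
    "unconditional": 70 * 0.4,   # 28% of branch-type instructions
    "conditional":   70 * 0.6,   # 42%
    "loop":          10,
    "call_return":   20,
}
taken_probability = {"unconditional": 1.0, "conditional": 0.4, "loop": 0.9, "call_return": 1.0}

def accuracy(predict_taken):
    # Overall accuracy for a rule mapping instruction type -> predicted "taken"
    total = 0.0
    for kind, share in mix.items():
        p_taken = taken_probability[kind]
        p_correct = p_taken if predict_taken[kind] else 1.0 - p_taken
        total += share * p_correct
    return total

static = {"unconditional": True, "conditional": False, "loop": True, "call_return": True}
print(round(accuracy(static), 1))                          # 82.2 (static strategy)
print(round(accuracy({k: False for k in mix}), 1))         # 26.2 (never taken)
print(round(accuracy({k: True for k in mix}), 1))          # 73.8 (always taken)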

8–11 The static prediction strategy uses instruction opcode to predict whether the branch is taken. Dynamic strategy, on the other hand, looks at the run-time history to make more accurate predictions. The basic idea is to take the past branch executions of the branch type in question and use this information to predict the next one. Since this takes runtime conditions into account, it can potentially perform better than the static strategy.

How much additional benefit can we derive over the static approach? The empirical study by Lee and Smith suggests that we can get a significant improvement in prediction accuracy. A summary of their study is presented in the following table. The algorithm they implemented is simple: the prediction for the next branch is the majority of the previous branch executions. For example, with a history of three branch executions, if the branch was taken two or more times in those three executions, the prediction is that the branch will be taken.

                 Prediction accuracy (%) by number of past branch executions used
Type of mix      0       1       2       3       4       5
Compiler         64.1    91.9    93.3    93.7    94.5    94.7
Business         64.4    95.2    96.5    96.6    96.8    97.0
Scientific       70.4    86.6    90.8    91.0    91.8    92.0
The data in this table suggest that looking at the past two branch executions will give us over 90% prediction accuracy for most mixes. Beyond that, we get only marginal improvement.

8–12 The static prediction strategy uses the instruction opcode to predict whether the branch is taken. The dynamic strategy, on the other hand, looks at the run-time history to make more accurate predictions. The static strategy is simpler to implement than the dynamic strategy, whose implementation requires maintaining two bits for each branch instruction. However, the dynamic strategy improves prediction accuracy: in the example presented in the text (Section 8.4.2), the dynamic strategy gives over 90% prediction accuracy, whereas the static strategy achieves about 82%.

8–13 A summary of the empirical study by Lee and Smith is presented in the following table. The algorithm they implemented is simple: the prediction for the next branch is the majority of the previous branch executions. For example, with a history of three branch executions, if the branch was taken two or more times in those three executions, the prediction is that the branch will be taken.

                 Prediction accuracy (%) by number of past branch executions used
Type of mix      0       1       2       3       4       5
Compiler         64.1    91.9    93.3    93.7    94.5    94.7
Business         64.4    95.2    96.5    96.6    96.8    97.0
Scientific       70.4    86.6    90.8    91.0    91.8    92.0

The data in this table suggest that looking at the past two branch executions will give us over 90% prediction accuracy for most mixes. Beyond that, we get only marginal improvement. This implies that we need just two bits to track the history of the past two branch executions. The basic idea is simple: keep the current prediction unless the past two predictions were wrong. Specifically, we do not want to change our prediction just because our last prediction was wrong.
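A minimal sketch of such a two-bit scheme (not code from the text) is a saturating counter per branch: a single misprediction does not change the prediction, but two wrong predictions in a row do.

class TwoBitPredictor:
    def __init__(self):
        self.counter = 0                 # 0,1 -> predict not taken; 2,3 -> predict taken

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

# A loop branch taken 9 times with one not-taken exit, executed twice: the single
# mispredicted exit does not flip the prediction for the next execution of the loop.
predictor = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 2
correct = 0
for taken in outcomes:
    correct += predictor.predict() == taken
    predictor.update(taken)
print(correct, "of", len(outcomes), "outcomes predicted correctly")   # 16 of 20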

8–14 Superscalar processors improve performance by replicating the pipeline hardware. One simple technique is to have multiple pipelines. The following figure shows a dual pipeline design, somewhat similar to that present in the Pentium. The instruction fetch unit fetches two instructions each cycle and loads the two pipelines with one instruction each. Since these two pipelines are independent, instruction execution can proceed in parallel.

[Figure: dual-pipeline design with a common instruction fetch unit feeding the U pipeline and the V pipeline, each with its own instruction decode unit, operand fetch unit, instruction execution unit, and result write back unit]

We can also improve performance by providing multiple execution units linked to a single pipeline, as shown in the following figure. In this figure, we are using four execution units: two integer units and two floating-point units. Such designs are referred to as superscalar processors.
[Figure: a single pipeline with instruction fetch, instruction decode, operand fetch, and result write back units feeding four execution units: two integer execution units and two floating-point execution units]

8–15 Superscalar processors improve performance by replicating the pipeline hardware (multiple pipelines and multiple execution units).

8–16 Superscalar processors improve performance by replicating the pipeline hardware (multiple pipelines and multiple execution units). Superpipelined systems improve performance by increasing the pipeline depth.

8–17 The main difference is that vector machines are designed to operate at the vector level. In contrast, traditional processors are designed to work on scalars. Vector machines also exploit pipelining (i.e., overlapped execution) to the maximum extent. Vector machines not only use pipelining for integer and floating-point operations but also to feed data from one functional unit to another. This process is known as chaining. In addition, load and store operations are also pipelined.

8–18 Vector processing offers improved performance for several reasons. Some of them are listed below:

• Flynn's bottleneck can be reduced by using vector instructions, as each vector instruction specifies a large amount of work.

• Data hazards can be eliminated due to the structured nature of the data used by vector machines.

• Memory latency can be reduced by using pipelined load and store operations.

• Control hazards are reduced as a result of specifying a large number of iterations in a single vector instruction.

• Pipelining can be exploited to the maximum extent. This is facilitated by the absence of data and control hazards. Vector machines not only use pipelining for integer and floating-point operations but also to feed data from one functional unit to another. This process is known as chaining. In addition, as mentioned before, load and store operations also use pipelining.
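As a rough illustration of the first and fourth points above (not from the text; the per-element and per-strip instruction counts are assumed), the calculation below compares how many instructions, and how many loop-closing branches, a scalar loop issues for a 200-element operation against a vector machine that processes the same data in strips of 64 elements.

N = 200
MVL = 64                                   # assumed maximum vector length

scalar_instructions = N * (4 + 1)          # assumed 4 work instructions + 1 branch per element
scalar_branches = N                        # one conditional branch per loop iteration

strips = -(-N // MVL)                      # ceiling division: number of strips
vector_instructions = strips * 4           # assumed ~4 vector instructions per strip (2 loads, add, store)
vector_branches = strips                   # one loop-closing branch per strip

print(scalar_instructions, scalar_branches)    # 1000 instructions, 200 branches
print(vector_instructions, vector_branches)    # 16 instructions, 4 branches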

8–19 The vector length register holds the valid vector length (VL). All vector operations are done on the first VL elements (i.e., elements in the range 0 to VL − 1).

8–20 Larger vectors are handled by a technique known as strip mining. As an example, assume that the vector length supported by the machine is 64 and the vector to be processed consists of 200 elements. In strip mining, the vector is partitioned into strips of 64 elements. This leaves one odd-size piece that may be less than 64 elements; for a vector of n elements, the size of this piece is given by (n mod 64). We load each strip into a vector register and apply the vector operation. The number of strips is given by ⌈n/64⌉. For our example, we divide the 200 elements into four pieces: three pieces with 64 elements and one odd piece with 8 elements. We use a loop that iterates four times: one of the iterations sets VL to 8, and the remaining three iterations set the VL register to 64.
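A minimal sketch of the strip-mining loop (in Python; the text gives no code, so the element-wise add and the variable names are illustrative) follows the description above: VL is set to 64 for the full strips and to the size of the odd piece for the remaining elements.

MVL = 64                                  # maximum vector length supported by the machine

def strip_mine_add(a, b):
    # Element-wise add of two n-element vectors using strips of at most MVL elements
    n = len(a)
    result = []
    start = 0
    while start < n:
        VL = min(MVL, n - start)          # 64 for full strips, n mod 64 for the odd piece
        # one "vector add" over the current strip of VL elements
        result.extend(x + y for x, y in zip(a[start:start + VL], b[start:start + VL]))
        start += VL
    return result

# 200 elements -> 4 strips: three with VL = 64 and one odd piece with VL = 8
print(len(strip_mine_add(list(range(200)), list(range(200)))))   # 200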
