**Chapter 8: Pipelining and Vector Processing**

**8–1** If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline, leaving the faster stages idle part of the time.

**8–2** Pipeline stalls can be caused by three types of hazards: *resource*, *data*, and *control* hazards. Resource hazards result when two or more instructions in the pipeline want to use the same resource. Such resource conflicts can result in serialized execution, reducing the scope for overlapped execution.

Data hazards are caused by data dependencies among the instructions in the pipeline. As a simple example, suppose that the result produced by instruction I1 is needed as an input to instruction I2. We have to stall the pipeline until I1 has written its result so that I2 reads the correct input. If the pipeline is not designed properly, data hazards can produce wrong results by using incorrect operands, so correctness is the first concern.

Control hazards are caused by control dependencies. As an example, consider the flow control altered by a branch instruction. If the branch is not taken, we can proceed with the instructions already in the pipeline. But if the branch is taken, we have to throw away all the instructions that are in the pipeline and refill the pipeline with instructions at the branch target.

**8–3** Prefetching is a technique used to handle resource conflicts. Pipelining typically uses a just-in-time mechanism, so only a simple buffer is needed between stages. We can minimize the performance impact if we relax this constraint by allowing a queue instead of a single buffer. The instruction fetch unit can prefetch instructions and place them in the instruction queue. The decoding unit then has ample instructions even if instruction fetch is occasionally delayed by a cache miss or a resource conflict.
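As a rough illustration of why a prefetch queue helps, the following sketch (a toy model of my own, not from the text) simulates a fetch unit that stalls periodically, as if on a cache miss, feeding a decoder through a queue of varying capacity; the fetch width and miss timing are made-up parameters.

```python
from collections import deque

def decoder_idle_cycles(queue_capacity, cycles=1000, fetch_width=2,
                        miss_every=10, miss_penalty=4):
    """Cycles the decoder starves, given periodic fetch stalls (toy model)."""
    queue, idle, stall = deque(), 0, 0
    for cycle in range(cycles):
        if stall > 0:
            stall -= 1                       # fetch is waiting on memory
        else:
            for _ in range(fetch_width):     # fetch can run ahead of decode
                if len(queue) < queue_capacity:
                    queue.append(None)       # a prefetched instruction
            if cycle % miss_every == miss_every - 1:
                stall = miss_penalty         # simulated cache miss
        if queue:
            queue.popleft()                  # decoder consumes one per cycle
        else:
            idle += 1                        # decoder starves
    return idle

# A single buffer starves on every miss; a modest queue absorbs most of them.
for capacity in (1, 2, 4, 8):
    print(capacity, decoder_idle_cycles(capacity))
```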
**8–4** Data hazards are caused by data dependencies among the instructions in the pipeline. As a simple example, suppose that the result produced by instruction I1 is needed as an input to instruction I2. We have to stall the pipeline until I1 has written the result so that I2 reads the correct input.

Two techniques are used to handle data dependencies: *register interlocking* and *register forwarding*. Register forwarding works if the two instructions involved in the dependency are both in the pipeline. The basic idea is to provide the output result as soon as it is available in the datapath. For example, if we provide the output of I1 to I2 as we write into the destination register of I1, we reduce the number of stall cycles by one, as shown in scheme (a) below. We can do even better if we feed the output directly from the IE stage, as in scheme (b); in this case, we completely eliminate the pipeline stalls.

```text
Clock cycle    1   2   3   4   5   6   7   8   9
I1             IF  ID  OF  IE  WB
I2                 IF  ID  --  OF  IE  WB
I3                     IF  --  ID  OF  IE  WB
I4                             IF  ID  OF  IE  WB

(a) Forwarding scheme 1: I2 receives the result as I1 writes it back

Clock cycle    1   2   3   4   5   6   7   8
I1             IF  ID  OF  IE  WB
I2                 IF  ID  OF  IE  WB
I3                     IF  ID  OF  IE  WB
I4                         IF  ID  OF  IE  WB

(b) Forwarding scheme 2: the result is fed directly from the IE stage
```

Register interlocking is a general technique to solve the correctness problem associated with data dependencies. In this method, a bit is associated with each register to specify whether its contents are valid. If the bit is 0, the contents of the register can be used. Instructions should not read a register whose interlocking bit is 1, as the register is locked by another instruction. The following figure shows how register interlocking works for the example below:

```asm
I1:  add  R2,R3,R4    /* R2 = R3 + R4 */
I2:  sub  R5,R6,R2    /* R5 = R6 - R2 */
```

```text
Clock cycle    1   2   3   4   5   6   7   8   9   10
I1             IF  ID  OF  IE  WB           (R2 is locked in cycles 3-5)
I2                 IF  ID  --  --  OF  IE  WB
I3                     IF  --  --  ID  OF  IE  WB
I4                                 IF  ID  OF  IE  WB
```

I1 locks register R2 for clock cycles 3 to 5 so that I2 cannot proceed with an incorrect R2 value. Clearly, register forwarding is more efficient than the interlocking method.
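A quick arithmetic check of the three figures above: in a five-stage pipeline, n instructions finish in depth + (n − 1) cycles plus one cycle per stall, which reproduces the 10-, 9-, and 8-cycle totals.

```python
def total_cycles(n_instructions, stalls, depth=5):
    # Ideal pipeline time is depth + (n - 1); each stall cycle adds one.
    return depth + (n_instructions - 1) + stalls

print(total_cycles(4, stalls=2))  # register interlocking  -> 10
print(total_cycles(4, stalls=1))  # forwarding scheme (a)  -> 9
print(total_cycles(4, stalls=0))  # forwarding scheme (b)  -> 8
```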
**8–5** Flow-altering instructions such as branches require special handling in pipelined processors. Panel (a) of the figure below shows the impact of a branch instruction on our pipeline. Here we assume that instruction Ib is a branch instruction; if the branch is taken, it transfers control to instruction It. If the branch is not taken, the instructions already in the pipeline are useful. For a taken branch, however, we have to discard all the instructions that are in the pipeline at various stages. In our example, we have to discard instructions I2, I3, and I4 and start fetching instructions at the target address. This causes the pipeline to do wasteful work for three clock cycles, a cost called the *branch penalty*.

```text
Clock cycle              1   2   3   4   5   6   7   8   9
Branch instruction Ib    IF  ID  OF  IE  WB
I2 (discarded)               IF  ID  OF
I3 (discarded)                   IF  ID
I4 (discarded)                       IF
Branch target It                         IF  ID  OF  IE  WB

(a) Branch decision is known during the IE stage (three-cycle penalty)

Clock cycle              1   2   3   4   5   6   7   8   9
Branch instruction Ib    IF  ID  OF  IE  WB
I2 (discarded)               IF
Branch target It                 IF  ID  OF  IE  WB

(b) Branch decision is known during the ID stage (one-cycle penalty)
```
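To see what the penalty costs on average, here is a back-of-envelope calculation; the 20% branch frequency and 60% taken rate are assumed numbers for illustration, not figures from the text.

```python
def effective_cpi(branch_frac, taken_frac, penalty, base_cpi=1.0):
    # Each taken branch adds `penalty` wasted cycles to the pipeline.
    return base_cpi + branch_frac * taken_frac * penalty

print(effective_cpi(0.20, 0.60, penalty=3))  # decision in IE stage -> 1.36
print(effective_cpi(0.20, 0.60, penalty=1))  # decision in ID stage -> 1.12
```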
**8–6** Several techniques can be used to reduce the branch penalty.

- If we do not do anything clever, we wait until the execution (IE) stage before initiating the instruction fetch at the branch target address. We can reduce the delay if we can determine the outcome earlier. For example, if we establish whether the branch is taken, along with the target address, during the decode (ID) stage, we pay a penalty of only one cycle, as panels (a) and (b) of the figure in Exercise 8–5 show.

- Delayed branch execution reduces the branch penalty further. The idea is based on the observation that we always fetch the instruction following the branch before we know whether the branch is taken. Why not execute this instruction instead of throwing it away? This implies that we have to place a useful instruction in this slot, which is called the *delay slot*. In other words, branching is delayed until after the instruction in the delay slot is executed. Some processors like the SPARC and MIPS use delayed execution for both branching and procedure calls. When we apply this technique, we need to modify the program to put a useful instruction in the delay slot. We illustrate this with an example. Consider the following code segment:

  ```asm
      add     R2,R3,R4
      branch  target
      sub     R5,R6,R7
      ...
  target:
      mult    R8,R9,R10
      ...
  ```

  If the branch is delayed, we can reorder the instructions so that the branch instruction is moved ahead by one instruction:

  ```asm
      branch  target
      add     R2,R3,R4    /* branch delay slot */
      sub     R5,R6,R7
      ...
  target:
      mult    R8,R9,R10
      ...
  ```

  Programmers do not have to worry about moving instructions into delay slots; this job is done by compilers and assemblers. When no useful instruction can be moved into the delay slot, a no-operation (NOP) is placed there.

- Branch prediction is traditionally used to handle the branch penalty problem. There are three branch prediction strategies: fixed, static, and dynamic. In the fixed strategy, as the name implies, the prediction is fixed; such strategies are simple to implement and assume that the branch is either never taken or always taken. The static strategy uses the instruction opcode to predict whether the branch is taken; for example, for an unconditional branch we use the "always taken" decision. The dynamic strategy looks at run-time history to make more accurate predictions: the basic idea is to use the past n executions of the branch in question to predict the next one.

**8–7** Delayed branch execution effectively reduces the branch penalty. The idea is based on the observation that we always fetch the instruction following the branch before we know whether the branch is taken. Why not execute this instruction instead of throwing it away? This implies that we have to place a useful instruction in this slot, which is called the *delay slot*. In other words, branching is delayed until after the instruction in the delay slot is executed.

**8–8** In delayed branch execution, when the branch is not taken, we sometimes do not want to execute the delay-slot instruction; that is, we want to *nullify* it. Some processors like the SPARC provide this nullification option.
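The combined delay-slot and nullification rules fit in a few lines; this is a hypothetical sketch of the semantics just described (a conditional branch with a SPARC-style nullify option), not actual processor code.

```python
def delay_slot_executes(branch_taken, nullify_bit):
    """Does the delay-slot instruction execute?  Without the nullify option
    it always does (8-7); with it set, the delay-slot instruction is
    cancelled when the branch is not taken (8-8)."""
    return branch_taken or not nullify_bit

print(delay_slot_executes(branch_taken=True,  nullify_bit=True))   # True
print(delay_slot_executes(branch_taken=False, nullify_bit=False))  # True
print(delay_slot_executes(branch_taken=False, nullify_bit=True))   # False
```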
**8–9** Branch prediction is traditionally used to handle the branch penalty problem. There are three branch prediction strategies: fixed, static, and dynamic.

1. *Fixed branch prediction:* In this strategy, the prediction is fixed. These strategies are simple to implement and assume that the branch is either never taken or always taken. The Motorola 68020 and the VAX-11/780 use the branch-never-taken approach. The advantage of the never-taken strategy is that the processor can continue to fetch instructions sequentially to fill the pipeline, which involves minimum penalty when the prediction is wrong. If, on the other hand, we use the always-taken approach, the processor prefetches the instruction at the branch target address. In a paged environment this may cause a page fault, and a special mechanism is needed to handle that situation; furthermore, if the prediction is wrong, we will have done a lot of unnecessary work. The branch-never-taken approach, however, is not appropriate for loop structures: if a loop iterates 200 times, the branch is taken 199 out of 200 times. For loops, the always-taken approach is better. Similarly, the always-taken approach is preferred for procedure calls and returns.

2. *Static branch prediction:* This strategy, rather than following a fixed rule, uses the instruction opcode to predict whether the branch is taken. For example, if the instruction is an unconditional branch, we use the "always taken" decision; we use a similar decision for loop and call/return instructions. On the other hand, for conditional branches we may use the "never taken" decision. It has been shown that this strategy improves prediction accuracy.

3. *Dynamic branch prediction:* The dynamic strategy looks at run-time history to make more accurate predictions. The basic idea is to use the past n executions of the branch in question to predict the next one. Will this work in practice? How much additional benefit can we derive over the static approach? An empirical study suggests that we can get significant improvement in prediction accuracy.

**8–10** To show why the static strategy gives high prediction accuracy, we present sample data for commercial environments. In such environments, of all branch-type operations, branches are about 70%, loops about 10%, and the remaining 20% procedure calls/returns. Of the branches, 40% are unconditional. If we use a never-taken guess for conditional branches and always-taken for the rest of the branch-type operations, we get a prediction accuracy of about 82%, as shown in the following table.

| Instruction type | Distribution (%) | Prediction: branch taken? | Correct prediction (%) |
| --- | --- | --- | --- |
| Unconditional branch | 70 × 0.4 = 28 | Yes | 28 |
| Conditional branch | 70 × 0.6 = 42 | No | 42 × 0.6 = 25.2 |
| Loop | 10 | Yes | 10 × 0.9 = 9 |
| Call/return | 20 | Yes | 20 |

Overall prediction accuracy = 82.2%

The data in this table assume that conditional branches are not taken about 60% of the time. Thus, our prediction that a conditional branch is never taken is correct only 60% of the time, which gives us 42 × 0.6 = 25.2% as the prediction accuracy for conditional branches. Similarly, loops jump back with 90% probability; since loops appear about 10% of the time, that prediction is right 9% of the time. Surprisingly, even this simple static prediction strategy gives us about 82% accuracy!
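The table's bottom line can be reproduced directly; the tuples below encode (fraction of branch-type instructions, predicted taken?, actual taken rate) from the table above.

```python
# (fraction of all branch-type instructions, predicted taken?, actual taken rate)
mix = [
    (0.70 * 0.40, True,  1.00),  # unconditional branches: always taken
    (0.70 * 0.60, False, 0.40),  # conditional branches: taken 40% of the time
    (0.10,        True,  0.90),  # loops: jump back 90% of the time
    (0.20,        True,  1.00),  # calls/returns: always taken
]
accuracy = sum(frac * (rate if predicted_taken else 1 - rate)
               for frac, predicted_taken, rate in mix)
print(f"{accuracy:.1%}")  # -> 82.2%
```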
If we instead apply the "never taken" decision to all branch-type operations, prediction accuracy drops to 26.2%, as shown in the following table; the "always taken" approach gives a prediction accuracy of 73.8%. In either case, the static strategy gives higher prediction accuracy.

| Instruction type | Distribution (%) | Correct, never taken (%) | Correct, always taken (%) |
| --- | --- | --- | --- |
| Unconditional branch | 70 × 0.4 = 28 | 0 | 28 |
| Conditional branch | 70 × 0.6 = 42 | 42 × 0.6 = 25.2 | 42 × 0.4 = 16.8 |
| Loop | 10 | 10 × 0.1 = 1 | 10 × 0.9 = 9 |
| Call/return | 20 | 0 | 20 |

Overall prediction accuracy: 26.2% (never taken), 73.8% (always taken)

**8–11** The static prediction strategy uses the instruction opcode to predict whether the branch is taken. The dynamic strategy, on the other hand, looks at run-time history to make more accurate predictions: the basic idea is to use the past n executions of the branch in question to predict the next one. Since this takes run-time conditions into account, it can potentially perform better than the static strategy.

How much additional benefit can we derive over the static approach? The empirical study by Lee and Smith suggests that we can get significant improvement in prediction accuracy. A summary of their study is presented in the following table, which gives prediction accuracy (%) as a function of the history length n. The algorithm they implemented is simple: the prediction for the next branch is the majority of the previous n branch executions. For example, for n = 3, if the branch was taken two or more times in the past three executions, the prediction is that it will be taken.

| n | Compiler mix | Business mix | Scientific mix |
| --- | --- | --- | --- |
| 0 | 64.1 | 64.4 | 70.4 |
| 1 | 91.9 | 95.2 | 86.6 |
| 2 | 93.3 | 96.5 | 90.8 |
| 3 | 93.7 | 96.6 | 91.0 |
| 4 | 94.5 | 96.8 | 91.8 |
| 5 | 94.7 | 97.0 | 92.0 |

The data in this table suggest that looking at the past two branch executions gives over 90% prediction accuracy for most mixes; beyond that, we get only marginal improvement.

**8–12** The static prediction strategy uses the instruction opcode to predict whether the branch is taken; the dynamic strategy looks at run-time history to make more accurate predictions. The static strategy is simpler to implement: the dynamic strategy requires maintaining two bits for each branch instruction. The dynamic strategy, however, improves prediction accuracy. In the example presented in the text (Section 8.4.2), the dynamic strategy gives over 90% prediction accuracy, whereas the prediction accuracy of the static strategy is about 82%.

**8–13** The empirical study by Lee and Smith, summarized in the table of Exercise 8–11, implemented a simple algorithm: the prediction for the next branch is the majority of the previous n branch executions (for n = 3, for example, the branch is predicted taken if it was taken two or more times in the past three executions). Their data suggest that looking at the past two branch executions gives over 90% prediction accuracy for most mixes; beyond that, we get only marginal improvement. This implies that we need just two bits to record the history of the past two branch executions. The basic idea is simple: keep the current prediction unless the past two predictions were wrong. In particular, we do not want to change our prediction just because the last single prediction was wrong.
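One standard way to realize this two-bit rule is a saturating counter; the sketch below is a generic illustration of that idea, not Lee and Smith's exact mechanism.

```python
def run_two_bit_predictor(outcomes, state=3):
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 taken.
    One misprediction only nudges the state; two in a row flip the prediction."""
    correct = 0
    for taken in outcomes:
        correct += ((state >= 2) == taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

# A loop branch taken 9 times out of 10 is predicted with 90% accuracy.
print(run_two_bit_predictor([True] * 9 + [False]))  # -> 0.9
```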
**8–14** Superscalar processors improve performance by replicating the pipeline hardware. One simple technique is to have multiple pipelines. The first figure below shows a dual-pipeline design, somewhat similar to that of the Pentium. The instruction fetch unit fetches two instructions each cycle and loads the two pipelines with one instruction each. Since the two pipelines are independent, instruction execution can proceed in parallel.

```text
[Figure: a common instruction fetch unit feeds two independent pipelines,
U and V, each with its own instruction decode, operand fetch, instruction
execution, and result write-back units.]
```

We can also improve performance by providing multiple execution units linked to a single pipeline, as shown in the second figure, which uses four execution units: two integer units and two floating-point units. Such designs are referred to as *superscalar processors*.

```text
[Figure: a single fetch / decode / operand-fetch front end feeding two integer
execution units and two floating-point execution units, followed by a common
result write-back unit.]
```
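As a rough upper bound (my own arithmetic, assuming ideal issue with no dependences or resource conflicts), a k-wide design finishes n instructions in depth + ⌈n/k⌉ − 1 cycles:

```python
import math

def pipeline_cycles(n_instr, issue_width, depth=5):
    # Ideal case: the pipeline fills once, then retires issue_width per cycle.
    return depth + math.ceil(n_instr / issue_width) - 1

print(pipeline_cycles(100, issue_width=1))  # single pipeline  -> 104 cycles
print(pipeline_cycles(100, issue_width=2))  # U + V pipelines  -> 54 cycles
```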
**8–15** Superscalar processors improve performance by replicating the pipeline hardware (multiple pipelines and multiple execution units).

**8–16** Superscalar processors improve performance by replicating the pipeline hardware (multiple pipelines and multiple execution units). Superpipelined systems improve performance by increasing the pipeline depth.

**8–17** The main difference is that vector machines are designed to operate at the vector level, whereas traditional processors are designed to work on scalars. Vector machines also exploit pipelining (i.e., overlapped execution) to the maximum extent: they use pipelining not only for integer and floating-point operations but also to feed data from one functional unit to another, a process known as *chaining*. In addition, load and store operations are also pipelined.

**8–18** Vector processing offers improved performance for several reasons, some of which are listed below:

- *Flynn's bottleneck* can be reduced by using vector instructions, because each vector instruction specifies a great deal of work.
- *Data hazards* can be eliminated due to the structured nature of the data used by vector machines.
- *Memory latency* can be reduced by using pipelined load and store operations.
- *Control hazards* are reduced because a single vector instruction specifies a large number of iterations.
- *Pipelining* can be exploited to the maximum extent, facilitated by the absence of data and control hazards. Vector machines use pipelining not only for integer and floating-point operations but also to feed data from one functional unit to another (chaining); as mentioned before, load and store operations also use pipelining.

**8–19** The vector length register holds the valid vector length (VL). All vector operations are done on the first VL elements, that is, on elements in the range 0 to VL − 1.

**8–20** Larger vectors are handled by a technique known as *strip mining*. As an example, assume that the vector length supported by the machine is 64 and the vector to be processed has 200 elements. In strip mining, the vector is partitioned into strips of 64 elements, which leaves one odd-size piece of fewer than 64 elements; for an N-element vector, this piece has (N mod 64) elements and the number of strips is ⌈N/64⌉. We load each strip into a vector register and apply the vector operation. For our example, we divide the 200 elements into ⌈200/64⌉ = 4 pieces: three with 64 elements and one odd piece with 200 mod 64 = 8 elements. We use a loop that iterates four times: one iteration sets VL to 8, and the remaining three set the VL register to 64.
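A strip-mined loop is easy to sketch in scalar code; the function below mimics the VL-register logic for an element-wise add. The names and the 64-element register length follow the example above; the scalar inner loop stands in for a single vector instruction.

```python
import math

MVL = 64  # maximum vector length the machine's vector registers support

def strip_mined_add(a, b):
    """Element-wise add of arbitrary-length vectors, one strip at a time."""
    n = len(a)
    result = []
    for s in range(math.ceil(n / MVL)):          # number of strips
        start = s * MVL
        vl = min(MVL, n - start)                 # set the VL register
        # One "vector add" over the first vl elements of this strip.
        result.extend(a[start + i] + b[start + i] for i in range(vl))
    return result

# 200 elements -> three strips of 64 and one odd strip of 200 mod 64 = 8.
out = strip_mined_add(list(range(200)), list(range(200)))
print(len(out), out[:3])  # 200 [0, 2, 4]
```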
