Chapter 8
Pipelining and Vector Processing

8–1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. As a result, the faster stages sit idle for part of each cycle.

8–2 Pipeline stalls can be caused by three types of hazards: resource, data, and control hazards.

Resource hazards result when two or more instructions in the pipeline want to use the same resource. Such resource conflicts can result in serialized execution, reducing the scope for overlapped execution.

Data hazards are caused by data dependencies among the instructions in the pipeline. As a simple example, suppose that the result produced by instruction I1 is needed as an input to instruction I2. We have to stall the pipeline until I1 has written the result so that I2 reads the correct input. If the pipeline is not designed properly, data hazards can produce wrong results by using incorrect operands. Correctness therefore has to be addressed first.

Control hazards are caused by control dependencies. As an example, consider how a branch instruction alters the flow of control. If the branch is not taken, we can proceed with the instructions in the pipeline. But if the branch is taken, we have to throw away all the instructions that are in the pipeline and fill the pipeline with instructions at the branch target.

8–3 Prefetching is a technique used to handle resource conflicts. Pipelining typically uses a just-in-time mechanism, so that only a simple buffer is needed between the stages. We can minimize the performance impact if we relax this constraint by allowing a queue instead of a single buffer. The instruction fetch unit can prefetch instructions and place them in the instruction queue. The decoding unit will then have ample instructions even if the instruction fetch is occasionally delayed because of a cache miss or resource conflict.

8–4 Data hazards are caused by data dependencies among the instructions in the pipeline.
As a simple example, suppose that the result produced by instruction I1 is needed as an input to instruction I2. We have to stall the pipeline until I1 has written the result so that I2 reads the correct input.

There are two techniques used to handle data dependencies: register interlocking and register forwarding.

Register forwarding works if the two instructions involved in the dependency are in the pipeline. The basic idea is to provide the output result as soon as it is available in the datapath. This technique is demonstrated in the following figure. For example, if we provide the output of I1 to I2 as we write into the destination register of I1, we reduce the number of stall cycles by one (see Figure a). We can do even better if we feed the output from the IE stage, as shown in Figure b. In this case, we completely eliminate the pipeline stalls.

[Figure: pipeline timing diagrams for instructions I1 to I4 through the stages IF, ID, OF, IE, and WB. (a) Forward scheme 1, spanning clock cycles 1 to 9. (b) Forward scheme 2, spanning clock cycles 1 to 8.]

Register interlocking is a general technique to solve the correctness problem associated with data dependencies. In this method, a bit is associated with each register to specify whether the contents are correct. If the bit is 0, the contents of the register can be used. Instructions should not read the contents of a register when this interlocking bit is 1, as the register is locked by another instruction. The following figure shows how register interlocking works for the example given below:

    I1: add R2,R3,R4    /* R2 = R3 + R4 */
    I2: sub R5,R6,R2    /* R5 = R6 - R2 */

I1 locks the R2 register for clock cycles 3 to 5 so that I2 cannot proceed and read an incorrect R2 value. Clearly, register forwarding is more efficient than the interlocking method.
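The stall counts in this answer can be checked with a small timing sketch. It is only an illustrative model, not the textbook's own method: it assumes each of the five stages takes one clock cycle, that I1 enters IF in cycle 1 and I2 in cycle 2, and that I2 is the only instruction that depends on I1. The function and scheme names (`stalls_for_i2`, `interlock`, `forward_wb`, `forward_ie`) are hypothetical.

```python
# Stage order taken from the text; one clock cycle per stage is an assumption.
STAGES = ["IF", "ID", "OF", "IE", "WB"]


def stage_cycle(issue_cycle, stage):
    """Clock cycle in which an unstalled instruction occupies `stage`."""
    return issue_cycle + STAGES.index(stage)


def stalls_for_i2(scheme):
    """Stall cycles I2 needs when it consumes the result of I1."""
    i1_issue, i2_issue = 1, 2  # assumed back-to-back issue
    if scheme == "interlock":
        # I2 may fetch its operand only after I1 has finished WB.
        earliest, stage = stage_cycle(i1_issue, "WB") + 1, "OF"
    elif scheme == "forward_wb":
        # Forward scheme 1: the result reaches I2's OF while I1 writes it back.
        earliest, stage = stage_cycle(i1_issue, "WB"), "OF"
    elif scheme == "forward_ie":
        # Forward scheme 2: I1's IE output feeds I2's IE in the next cycle.
        earliest, stage = stage_cycle(i1_issue, "IE") + 1, "IE"
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return max(0, earliest - stage_cycle(i2_issue, stage))


def total_cycles(n_instructions, scheme):
    """Cycles to finish n instructions when only I2 stalls."""
    return len(STAGES) + (n_instructions - 1) + stalls_for_i2(scheme)


for scheme in ("interlock", "forward_wb", "forward_ie"):
    print(scheme, stalls_for_i2(scheme), total_cycles(4, scheme))
```

Under these assumptions the sketch reports 2, 1, and 0 stall cycles, and 10, 9, and 8 total cycles for four instructions, which agrees with the clock-cycle ranges of the interlocking and forwarding figures.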
[Figure: register interlocking timing diagram for instructions I1 to I4 through the stages IF, ID, OF, IE, and WB, spanning clock cycles 1 to 10, with the interval during which R2 is locked marked.]

8–5 Flow-altering instructions such as branches require special handling in pipelined processors. In the following figure, Figure a shows the impact of a branch instruction on our pipeline. Here we assume that instruction Ib is a branch instruction; if the branch is taken, it transfers control to instruction It. If the branch is not taken, the instructions in the pipeline are useful. However, for a taken branch, we have to discard all the instructions that are in the pipeline at various stages. In our example, we have to discard instructions I2, I3, and I4. We then start fetching instructions at the target address. This causes our pipeline to do wasteful work for three clock cycles. This is called the branch penalty.

[Figure: pipeline timing diagrams over clock cycles 1 to 9. (a) Branch decision is known during the IE stage: branch instruction Ib, discarded instructions I2, I3, and I4, and branch target instruction It. (b) Branch decision is known during the ID stage: branch instruction Ib, discarded instruction I2, and branch target instruction It.]

8–6 Several techniques can be used to reduce branch penalty.

• If we don’t do anything clever, we wait until the execution (IE) stage before initiating the instruction fetch at the branch target address. We can reduce the delay if we can determine this earlier. For example, if we find whether the branch is taken, along with the target address information, during the decode (ID) stage, we would just pay a penalty of one cycle, as shown in the following figure.
  [Figure: pipeline timing diagrams over clock cycles 1 to 9. (a) Branch decision is known during the IE stage: branch instruction Ib, discarded instructions I2, I3, and I4, and branch target instruction It. (b) Branch decision is known during the ID stage: branch instruction Ib, discarded instruction I2, and branch target instruction It.]

• Delayed branch execution effectively reduces the branch penalty further. The idea is based on the observation that we always fetch the instruction following the branch before we know whether the branch is taken. Why not execute this instruction instead of throwing it away? This implies that we have to place a useful instruction in this instruction slot, which is called the delay slot. In other words, the branching is delayed until after the instruction in the delay slot is executed. Some processors like the SPARC and MIPS use delayed execution for both branching and procedure calls.

  When we apply this technique, we need to modify our program to put a useful instruction in the delay slot. We illustrate this by using an example. Consider the following code segment:

            add     R2,R3,R4
            branch  target
            sub     R5,R6,R7
            ...
    target: mult    R8,R9,R10
            ...

  If the branch is delayed, we can reorder the instructions so that the branch instruction is moved ahead by one instruction, as shown below:

            branch  target
            add     R2,R3,R4    /* Branch delay slot */
            sub     R5,R6,R7
            ...
    target: mult    R8,R9,R10
            ...

  Programmers do not have to worry about moving instructions into the delay slots. This job is done by the compilers and assemblers. When no useful instruction can be moved into the delay slot, a no-operation (NOP) instruction is placed there.

• Branch prediction is traditionally used to handle the branch penalty problem. We discussed three branch prediction strategies: fixed, static, and dynamic.

  In the fixed strategy, as the name implies, prediction is fixed.
  These strategies are simple to implement and assume that the branch is either never taken or always taken.

  The static strategy uses the instruction opcode to predict whether the branch is taken. For example, if the instruction is an unconditional branch, we use the “always-taken” decision.

  The dynamic strategy looks at the run-time history to make more accurate predictions. The basic idea is to take the past n branch executions of the branch type in question and use this information to predict the next one.

8–7 Delayed branch execution effectively reduces the branch penalty. The idea is based on the observation that we always fetch the instruction following the branch before we know whether the branch is taken. Why not execute this instruction instead of throwing it away? This implies that we have to place a useful instruction in this instruction slot, which is called the delay slot. In other words, the branching is delayed until after the instruction in the delay slot is executed.

8–8 In delayed branch execution, when the branch is not taken, sometimes we do not want to execute the delay slot instruction. That is, we want to nullify the delay slot instruction. Some processors like the SPARC provide this nullification option.

8–9 Branch prediction is traditionally used to handle the branch problem. We discussed three branch prediction strategies: fixed, static, and dynamic.
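The dynamic strategy described in this chapter (use the past n executions of a branch type to predict the next one) can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular processor: the majority-vote rule, the tie-breaking default, and the names `HistoryPredictor`, `count_correct`, and `loop_branch` are all hypothetical.

```python
from collections import defaultdict, deque


class HistoryPredictor:
    """Predicts a branch from the last n outcomes of the same branch type."""

    def __init__(self, n=2, default_taken=True):
        # default_taken is used when there is no history or the vote is tied
        # (an assumption made for this sketch).
        self.default_taken = default_taken
        self.history = defaultdict(lambda: deque(maxlen=n))

    def predict(self, branch_type):
        past = self.history[branch_type]
        taken = sum(past)  # True counts as 1, False as 0
        if not past or 2 * taken == len(past):
            return self.default_taken
        return 2 * taken > len(past)  # majority vote over the last n outcomes

    def update(self, branch_type, taken):
        # Record the actual outcome; the deque drops the oldest entry itself.
        self.history[branch_type].append(bool(taken))


def count_correct(predictor, branch_type, outcomes):
    """Runs the predictor over a sequence of outcomes and counts the hits."""
    correct = 0
    for actual in outcomes:
        if predictor.predict(branch_type) == actual:
            correct += 1
        predictor.update(branch_type, actual)
    return correct


# A loop-closing branch: taken nine times, then not taken on loop exit.
outcomes = [True] * 9 + [False]
hits = count_correct(HistoryPredictor(n=2), "loop_branch", outcomes)
print(f"{hits} of {len(outcomes)} predictions correct")
```

For this loop-style pattern the sketch mispredicts only the final, not-taken execution. A fixed "never taken" strategy would mispredict the nine taken executions instead.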