Kamlesh Tiwari

1 Introduction: Parallel Computing

The evolution of computer architecture has gone through the following stages:

- Sequential machines
- Pipelined machines
- Vector machines
- Parallel machines

1.1 Pipelining

Instruction execution has the following stages:

[Instruction fetch] -> [instruction decode] -> [argument decode] -> [argument fetch] -> [execution] -> [result storage]

This can be made faster either by pushing technology or by efficient use of hardware. A pipeline is the continuous and somewhat overlapped movement of instructions into the processor, or of the arithmetic steps taken by the processor to perform an instruction. It is an efficient use of hardware that increases throughput.

1.1.1 Example

Consider a 5-stage pipeline of a RISC processor.

IF (instruction fetch): IR = Mem[PC]; NPC = PC + 4   (IR: instruction register, PC: program counter, NPC: next program counter)

ID (instruction decode / operand fetch): A = Regs[IR6..10]; B = Regs[IR11..15]; Immediate = IR16..31

EX (execution stage):
- Memory reference: ALUoutput = A + Imm
- Register-Register ALU operation: ALUoutput = A op B
- Register-Immediate ALU operation: ALUoutput = A op Imm
- Branch: ALUoutput = NPC + Imm; Cond = (A op 0)

MEM (memory access):
- Memory reference: MDR = Mem[ALUoutput], or Mem[ALUoutput] = B
- Branch: if (Cond) PC = ALUoutput else PC = NPC

WB (write-back stage):
- Reg-Reg ALU instruction: Regs[IR16..20] = ALUoutput
- Reg-Imm ALU operation: Regs[IR11..15] = ALUoutput
- Load instruction: Regs[IR11..15] = MDR

Instruction flow:

                    Clock number
Inst No.   1   2   3   4   5   6   7   8   9
Inst 1.    IF  ID  EX  MEM WB
Inst 2.        IF  ID  EX  MEM WB
Inst 3.            IF  ID  EX  MEM WB
Inst 4.                IF  ID  EX  MEM WB
Inst 5.                    IF  ID  EX  MEM WB

Question: Assume a non-pipelined version of the processor with a 10 ns clock, four cycles for non-memory operations (ALU and branch ops) and five cycles for memory operations; 40% of the instructions are memory ops. Pipelining the machine needs 1 ns extra on the clock. What is the speedup?

Answer: The average time to execute an instruction on the non-pipelined processor is (0.4 x 5 + 0.6 x 4) x 10 ns = 44 ns. The pipelined version executes one instruction every clock cycle; since the pipelined clock carries a 1 ns overhead, the time per instruction is 10 ns + 1 ns = 11 ns. Hence the speedup is 44 ns / 11 ns = 4.
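The arithmetic above can be checked with a few lines of C; the fractions and cycle counts come straight from the question:

#include <stdio.h>

int main(void) {
    double clock_ns = 10.0;   /* non-pipelined clock                     */
    double mem_frac = 0.4;    /* 40% memory ops, 5 cycles each           */
    double alu_frac = 0.6;    /* ALU/branch ops, 4 cycles each           */
    double overhead = 1.0;    /* extra ns on the pipelined clock         */

    double t_nonpipe = (mem_frac * 5 + alu_frac * 4) * clock_ns;  /* 44 ns */
    double t_pipe    = clock_ns + overhead;                       /* 11 ns */
    printf("non-pipelined: %.0f ns, pipelined: %.0f ns, speedup: %.0f\n",
           t_nonpipe, t_pipe, t_nonpipe / t_pipe);
    return 0;
}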


1.1.2 Pipeline Hazards

There are basically three types of hazards.

1. Structural hazards arise due to resource conflicts. When two different instructions in the pipeline want to use the same hardware, this kind of hazard arises; the only solution is to introduce a bubble/stall.

   Question: The ideal CPI (cycles per instruction) is 1.0 for a machine without structural hazards. 40% of instructions cause a structural hazard, and the clock rate of the machine with structural hazards is 1.05 times faster. Compare the performance.

   Answer: Out of 100 instructions, 60 are normal and take 1 CPI; the remaining 40 take one extra cycle because of the structural hazard. Therefore 100 instructions consume 60 x 1 + 40 x (1 + 1) = 140 cycles, i.e. 1.4 CPI on average. Since the clock is 1.05 times faster, the time taken by a single instruction is 1.4 / 1.05 = 1.33 (in units of the hazard-free machine's instruction time). Conclusion: if the machine takes t seconds per instruction in the pipeline without structural hazards, it takes 1.33t seconds with structural hazards.

2. Data hazards arise due to dependences between instructions. For example, in the following code

   ADD R1, R2, R3   ; R1 = R2 + R3
   SUB R4, R5, R1   ; R4 = R5 - R1

   the second instruction cannot be executed until the value of R1 becomes available. To minimize data hazards we can do internal forwarding: the ALU output is fed back to the ALU input (if the hardware detects that the previous ALU op has modified a register that is a source of the current ALU op, the ALU output is forwarded).

   Data hazards are classified as below (instruction i occurs before instruction j):

   - RAW (Read After Write): j tries to read a source before i writes it.
   - WAW (Write After Write): j writes an operand before i writes it.
   - WAR (Write After Read): j writes an operand before i reads it.
   - RAR (Read After Read): NOT a hazard!!!

   Not all, but some, data hazards can be avoided. The general practice for handling data hazards is to introduce sufficient bubbles via a pipeline interlock (done by hardware). Sometimes the compiler handles data hazards efficiently, for example by rescheduling instructions:

   With data hazard        Without data hazard
   ADD R1, R2, R3          ADD R1, R2, R3
   SUB R4, R5, R1          SUB R6, R2, R3
   SUB R6, R2, R3          MUL R7, R5, R3
   MUL R7, R5, R3          SUB R4, R5, R1

3. Control hazards arise due to branches (every fifth instruction is typically a branch in a compiled program). We can handle control hazards with the compiler's assistance by static branch prediction, or by reducing dependency stalls through instruction rescheduling.

1.2 Advanced Techniques

- Basic pipeline scheduling

  To reduce stalls between instructions, instructions should be fairly independent of each other. Consider the following loop

  for (i = 1; i <= 1000; i++) X[i] = X[i] + S;

  which compiles to

  X: LD   F0, 0(R1)   ; F0 = M(0+R1)
     ADD  F4, F0, F2  ; F4 = F0 + F2
     SD   0(R1), F4   ; M(0+R1) = F4
     SUB  R1, R1, 8   ; R1 = R1 - 8
     BNEZ R1, X

  Due to the stalls introduced, the execution of the program becomes

  X: LD   F0, 0(R1)
     Stall
     ADD  F4, F0, F2
     Stall
     Stall
     SD   0(R1), F4
     SUB  R1, R1, 8
     BNEZ R1, X
     Stall

  which takes 9 cycles per iteration. The compiler can reschedule the loop as below:

  X: LD   F0, 0(R1)
     Stall
     ADD  F4, F0, F2
     SUB  R1, R1, 8
     BNEZ R1, X
     SD   8(R1), F4

  which requires only 6 cycles per iteration and hence saves 33% of the execution time.

- Loop unrolling

  Loop unrolling reduces the number of branches and therefore control stalls. It allows better pipeline scheduling and also reduces the overhead of the loop-control code. Continuing with the previous for loop, its unrolling (by four) is shown below.

  X: LD   F0, 0(R1)
     ADD  F4, F0, F2
     SD   0(R1), F4
     LD   F6, -8(R1)
     ADD  F8, F6, F2
     SD   -8(R1), F8
     LD   F10, -16(R1)
     ADD  F12, F10, F2
     SD   -16(R1), F12
     LD   F14, -24(R1)
     ADD  F16, F14, F2
     SD   -24(R1), F16
     SUB  R1, R1, 32
     BNEZ R1, X

  It requires 27 cycles per four iterations (due to hazards).


But if the compiler reschedules the instructions as below

X: LD   F0, 0(R1)
   LD   F6, -8(R1)
   LD   F10, -16(R1)
   LD   F14, -24(R1)
   ADD  F4, F0, F2
   ADD  F8, F6, F2
   ADD  F12, F10, F2
   ADD  F16, F14, F2
   SD   0(R1), F4
   SD   -8(R1), F8
   SD   -16(R1), F12
   SUB  R1, R1, 32
   BNEZ R1, X
   SD   8(R1), F16

it requires only 14 cycles per four iterations, thereby saving about 60% compared with the original loop.

Further techniques, discussed in the following sections:

- Dynamic scheduling
- Register renaming
- Branch prediction
- Multiple issue of instructions
- Dependence analysis at compilation
- Software pipelining
- Memory pipelining

1.3 Data Dependence

There can be the following types of data dependence.

1. True dependence (or true data dependence). See the following example:

   SUB  R1, R1, 8   ; R1 = R1 - 8
   BNEZ R1, X

2. Name dependences: arise due to reuse of registers/memory.

3. Anti-dependence: instruction i executes first and reads a register which instruction j writes later. WAR hazards.

4. Output dependence: instruction i executes first and writes a register which instruction j writes later. There must be serialization. WAW hazards.

5. Control dependence: in the following code

   if P1 {S1;}

   statement S1 is control dependent on P1.

1.4 Loop level parallelism

Consider the following loops.

1. for (i = 0; i <= 1000; i++) A[i] = A[i] + s;

   There is no dependence between two iterations.

2. for (i = 0; i <= 1000; i++) {
       A[i+1] = A[i] + C[i];      // S1
       B[i+1] = B[i] + A[i+1];    // S2
   }

   S1 depends on S1 of the previous iteration. S2 depends on S2 of the previous iteration and on S1 of the current iteration. S1 can't be parallelized.

3. for (i = 0; i <= 1000; i++) {
       A[i]   = A[i] + B[i];      // S1
       B[i+1] = C[i] + D[i];      // S2
   }

   S1 depends on S2 of the previous iteration. Can it be parallelized? ......

4. A[1] = A[1] + B[1];
   for (i = 0; i <= 1000; i++) {
       B[i+1] = C[i] + D[i];      // S1
       A[i+1] = B[i+1] + A[i+1];  // S2
   }
   B[101] = C[100] + D[100];

   ? ......

1.5 Dynamic scheduling

The pipeline forces in-order instruction execution, which sometimes introduces stalls due to a dependence between two closely spaced instructions. This in-order execution is not always necessary; we can overcome data hazards by dynamically scheduling instructions. Consider the following example.

DIV F0, F2, F4    ; F0 = F2 / F4
ADD F10, F0, F8   ; F10 = F0 + F8
SUB F12, F8, F12

Because of the dependence between the DIV and ADD instructions, the ADD has to wait until the DIV is complete, and as a consequence the SUB also has to wait. This limitation can be removed if in-order execution is relaxed as below.

DIV F0, F2, F4
SUB F12, F8, F12
ADD F10, F0, F8

- Dynamic scheduling allows out-of-order execution as long as there is no dependence.


- Instructions are issued in order.
- An instruction begins execution as soon as its data operands are available.
- The ID pipeline stage is split into:
  - Issue: decode the instruction and check for structural hazards.
  - Read operands: wait until there are no data hazards, then read the operands.

1.5.1 Dynamic scheduling with Score-boarding

Score-boarding allows out-of-order execution when there are enough resources and no data dependence. Every instruction goes through the scoreboard. The scoreboard keeps track of data dependences and decides when an instruction can read its operands and begin execution. If an instruction can't begin execution immediately, the scoreboard monitors the changes in the hardware and decides when the instruction can begin execution. It also decides when an instruction can write its results.

Structure of the scoreboard:

1. Instruction status: indicates which of the four steps the instruction is in.
2. Functional unit status:
   - Busy
   - Op: operation to be performed
   - Fi: destination register
   - Fj, Fk: source registers
   - Qj, Qk: functional units producing source registers Fj, Fk
   - Rj, Rk: flags indicating when Fj, Fk are ready; set to No after the operands are read
3. Register result status: indicates which functional unit will write each register, if an active instruction has the register as its destination.

Steps of execution:

- Issue: the functional unit needed by the instruction is free and no other active instruction has the same destination register. Avoids WAW hazards.
- Read operands: a source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. RAW hazards are resolved and out-of-order execution may take place.
- Execution: the functional unit begins execution and notifies the scoreboard on completion.
- Write result: the scoreboard permits the write only after it ensures that no WAR hazard will take place. A completing instruction is not permitted to write its result if
  - there is a preceding instruction that has not yet read its operands, and
  - one of that instruction's operands is the same register as the result of the completing instruction.

The scoreboard does not take advantage of internal forwarding. It also has to take care of the number of buses available for data transfer.

Example-01 of scoreboard:

Instruction status (Issue / Read operands / Execution complete / Write result):
  LD F6,34(R2)     X  X  X  X
  LD F2,45(R3)     X  X  X
  MULTD F0,F2,F4   X
  SUBD F8,F6,F2    X
  DIVD F10,F0,F6   X
  ADDD F6,F8,F2

Functional unit status:
  Integer: Busy=Yes, Op=Load, Fi=F2, Fk=R3, Rk=No
  Mult1:   Busy=Yes, Op=Mult, Fi=F0, Fj=F2, Fk=F4, Qj=Integer, Rj=No, Rk=Yes
  Mult2:   Busy=No
  Add:     Busy=Yes, Op=Sub, Fi=F8, Fj=F6, Fk=F2, Qk=Integer, Rj=Yes, Rk=No
  Divide:  Busy=Yes, Op=Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes

Register result status (FU producing each register):
  F0: Mult1   F2: Integer   F8: Add   F10: Divide

Example-02 of scoreboard:

Instruction status (Issue / Read operands / Execution complete / Write result):
  LD F6,34(R2)     X  X  X  X
  LD F2,45(R3)     X  X  X  X
  MULTD F0,F2,F4   X  X  X
  SUBD F8,F6,F2    X  X  X  X
  DIVD F10,F0,F6   X
  ADDD F6,F8,F2    X  X  X


Functional unit status:
  Integer: Busy=No
  Mult1:   Busy=Yes, Op=Mult, Fi=F0, Fj=F2, Fk=F4, Rj=No, Rk=No
  Mult2:   Busy=No
  Add:     Busy=Yes, Op=Add, Fi=F6, Fj=F8, Fk=F2, Rj=No, Rk=No
  Divide:  Busy=Yes, Op=Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes

Register result status: F0: Mult1   F6: Add   F10: Divide

Example-03 of scoreboard:

Instruction status (Issue / Read operands / Execution complete / Write result):
  LD F6,34(R2)     X  X  X  X
  LD F2,45(R3)     X  X  X  X
  MULTD F0,F2,F4   X  X  X  X
  SUBD F8,F6,F2    X  X  X  X
  DIVD F10,F0,F6   X  X  X
  ADDD F6,F8,F2    X  X  X  X

Functional unit status: only Divide is busy (Op=Div, Fi=F10, Fj=F0, Fk=F6, Rj=No, Rk=No).

Register result status: F10: Divide

Limitations of the scoreboard:

1. Amount of parallelism available among instructions: determines whether independent instructions can be found to execute.
2. Number of scoreboard entries: this limits the window of instructions that can be looked at. The window generally can't cross branch instructions or loop boundaries.
3. Number and types of functional units.
4. Presence of anti-dependences and output dependences, which may lead to WAR or WAW stalls.

1.5.2 Register renaming

Reduces the name dependencies. See the following example.

ADD  R4, R1, R2
BNEZ R4, X
ADD  R4, R1, R3
BNEZ R4, Y

This can be transformed to

ADD  R4, R1, R2
BNEZ R4, X
ADD  R5, R1, R3
BNEZ R5, Y

Register renaming can be static or dynamic.

Dynamic scheduling: the Tomasulo approach

- Uses reservation stations (RS) in place of the scoreboard.
- Implements register renaming in hardware by buffering operands.
- More reservation stations than registers.
- Distributed approach for hazard detection and execution control.
- Results are passed directly from the reservation stations to the instructions.
- A reservation station fetches and buffers an operand as soon as it is available.
- Pending instructions designate the reservation stations that will provide their inputs.
- In case of successive writes to a register, only the last one updates the register.
- Register renaming essentially means specifying different reservation stations for pending operands, eliminating WAW and WAR hazards.

Steps of execution:

1. Issue: fetch the instruction from the queue. Issue it if an empty reservation station is available, and send the operands from the registers to the RS. In case of a load or store, issue the instruction if an empty buffer is available. Non-availability of an RS or buffer leads to a stall due to a structural hazard.
2. Execute: if some operand is not available, monitor the common data bus (CDB) while waiting for the register to be computed. The operand is placed in the RS as soon as it is available. Execution starts as soon as all operands are available. This checks RAW hazards.
3. Write result: the result is written on the CDB and into the registers and the reservation stations waiting for it.

Structure of a reservation station:

- Op: operation to be performed.
- Qj, Qk: reservation stations that will provide the source operands.
- Vj, Vk: values of the source operands.
- Busy.
- Qi: for each register and store buffer, the RS that will provide the result to be stored.
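The fields above map naturally onto a record type. A minimal C sketch of a reservation-station entry and the per-register Qi field, assuming integer tags (0 meaning "value already available"); the array sizes and field types are illustrative, not from the notes:

#include <stdbool.h>

#define NUM_RS   6    /* e.g. Add1-Add3, Mult1, Mult2, ... (assumed)   */
#define NUM_REGS 32   /* architectural registers (assumed)             */

typedef struct {
    bool   busy;      /* Busy: station currently holds an instruction  */
    int    op;        /* Op: operation to be performed                 */
    double Vj, Vk;    /* Vj, Vk: values of the source operands         */
    int    Qj, Qk;    /* Qj, Qk: station that will produce the operand;
                         0 when the value is already in Vj/Vk          */
} ReservationStation;

typedef struct {
    int Qi;           /* station (or store buffer) that will produce
                         this register's value; 0 if the register file
                         already holds it                              */
} RegisterStatus;

ReservationStation rs[NUM_RS];
RegisterStatus     regstat[NUM_REGS];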


Example-01: Dynamic scheduling

Instruction status (Issue / Execution / Write result):
  LD F6,34(R2)     X  X  X
  LD F2,45(R3)     X  X
  MULTD F0,F2,F4   X
  SUBD F8,F6,F2    X
  DIVD F10,F0,F6   X
  ADDD F6,F8,F2    X

Reservation stations:
  Add1:  Busy=Yes, Op=SUB,  Vj=Mem[34+Regs[R2]], Qk=Load2
  Add2:  Busy=Yes, Op=ADD,  Qj=Add1, Qk=Load2
  Add3:  Busy=No
  Mult1: Busy=Yes, Op=MULT, Vk=Regs[F4], Qj=Load2
  Mult2: Busy=Yes, Op=DIV,  Vk=Mem[34+Regs[R2]], Qj=Mult1

Register status (Qi):
  F0: Mult1   F2: Load2   F6: Add2   F8: Add1   F10: Mult2

Example-02: Dynamic scheduling

Instruction status (Issue / Execution / Write result):
  LD F6,34(R2)     X  X  X
  LD F2,45(R3)     X  X  X
  MULTD F0,F2,F4   X  X
  SUBD F8,F6,F2    X  X  X
  DIVD F10,F0,F6   X
  ADDD F6,F8,F2    X  X  X

Reservation stations:
  Add1:  Busy=No
  Add2:  Busy=No
  Add3:  Busy=No
  Mult1: Busy=Yes, Op=MULT, Vj=Mem[45+Regs[R3]], Vk=Regs[F4]
  Mult2: Busy=Yes, Op=DIV,  Vk=Mem[34+Regs[R2]], Qj=Mult1

Register status (Qi): F0: Mult1   F10: Mult2

======
insert page 66, 67 here
======

1.6 Branch Prediction

Reduces branch penalties by letting the processor guess the outcome of a branch early enough that stalls are reduced. For an incorrect prediction no harm is done, because a pipeline flush starts. A branch prediction buffer or branch history table is used to store the program behaviour. The table is indexed by the address of the branch instruction; the information in this table helps in predicting whether the branch is taken or not.

The simplest scheme stores a single bit of information giving whether the branch was taken the last time. The prediction is to take the branch if it was taken last time, and the information is updated when the prediction turns out to be false. Assume the table initially contains false (branch not taken).

How many miss-predictions for the following program?

for i = 1 to 100 do
    X[i] = Y[i]

Two times.

How many miss-predictions?

for i = 1 to 100 do
    if odd(i) X[i] = Y[i];

Always.

1.6.1 2-bit Branch Prediction

A miss-prediction must occur twice before the prediction is changed.

[Figure: 2-bit predictor state diagram -- two "predict taken" states and two "predict not taken" states; a taken branch moves the counter towards "predict taken", a not-taken branch towards "predict not taken".]

The generalized branch prediction, an n-bit prediction based on the last m branches, is called an (m, n) scheme. Generally a (2, 2) scheme is sufficient.

Pipeline improvements aim for one clock per instruction (CPI); because of stalls due to dependencies, the effective CPI is more than one. Our aim is to get CPI < 1. This is possible only if multiple instructions are issued in a single clock.

1.6.2 Correlating branch predictors

Branch predictors that use the behaviour of other branches to make a prediction are called correlating predictors or two-level predictors. In the general case an (m, n) predictor uses the behaviour of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch. The number of bits in an (m, n) predictor is

2^m x n x (number of prediction entries selected by the branch address).
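The 2-bit scheme of Section 1.6.1 is easy to prototype. A minimal C sketch, assuming a direct-mapped table of 1024 saturating counters (the table size and the branch address in main are illustrative); the driver replays the 100-iteration loop branch from the question above:

#include <stdbool.h>
#include <stdio.h>

#define TABLE_SIZE 1024                       /* assumed number of entries */
static unsigned char counter[TABLE_SIZE];     /* 2-bit counters, start at 0 (strongly not taken) */

/* Counter values 2 and 3 predict taken, 0 and 1 predict not taken. */
static bool predict(unsigned pc) {
    return counter[pc % TABLE_SIZE] >= 2;
}

/* Saturating update: the prediction has to be wrong twice before it flips. */
static void update(unsigned pc, bool taken) {
    unsigned char *c = &counter[pc % TABLE_SIZE];
    if (taken && *c < 3)       (*c)++;
    else if (!taken && *c > 0) (*c)--;
}

int main(void) {
    int miss = 0;
    for (int i = 1; i <= 100; i++) {
        bool taken = (i < 100);               /* loop branch: taken 99 times, then not taken */
        if (predict(0x40) != taken) miss++;
        update(0x40, taken);
    }
    printf("mispredictions: %d\n", miss);     /* prints 3 with the all-zero initial table */
    return 0;
}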


1.6.3 Multiple issue of instructions

Consider the following program.

X: LD   F0, 0(R1)   ; F0 = M(0+R1)
   ADD  F4, F0, F2  ; F4 = F0 + F2
   SD   0(R1), F4   ; M(0+R1) = F4
   SUB  R1, R1, 8   ; R1 = R1 - 8
   BNEZ R1, X

Assuming one floating-point unit and one integer unit in the processor, we can schedule the unrolled version of this code as below; unrolling of five iterations is performed.

   Integer unit         Floating point unit   Cycle
X: LD  F0, 0(R1)                              1
   LD  F6, -8(R1)                             2
   LD  F10, -16(R1)     ADD F4, F0, F2        3
   LD  F14, -24(R1)     ADD F8, F6, F2        4
   LD  F18, -32(R1)     ADD F12, F10, F2      5
   SD  0(R1), F4        ADD F16, F14, F2      6
   SD  -8(R1), F8       ADD F20, F18, F2      7
   SD  -16(R1), F12                           8
   SD  -24(R1), F16                           9
   SUB R1, R1, 40                             10
   BNEZ R1, X                                 11
   SD  8(R1), F20                             12

This requires only 12 cycles per 5 iterations!!!

Multiple issue with dynamic scheduling: issue as many instructions as possible; in case of a dependence or resource conflict, stop issuing instructions until the conflict is resolved.

Multiple issue with static scheduling: VLIW instructions and VLIW architectures, i.e. compiler-assisted scheduling into a very long instruction word. Difficult to scale; incompatibility.

Compiler support for ILP exploitation: dependence analysis puts independent instructions next to each other so that the hardware does not stall. Several dependence tests are available, for example the GCD test. Hardware can also support ILP exploitation; for example, to replace a statement of the kind "if (A==0) S=T;" the hardware can supply a conditional-move instruction such as "CMOVZ R1, R2, R3", which converts the control dependence into a data dependence.

1.6.4 GCD test

A dependence exists between two iterations of a loop if

1. there are two iteration indices j and k, both within the limits of the loop, i.e. m <= j <= n and m <= k <= n; and
2. the loop stores into an array element indexed by a x j + b and later fetches the same element when it is indexed by c x k + d, i.e. A[a x j + b] = A[c x k + d].

A loop-carried dependence can occur only if GCD(c, a) divides (d - b).

Example: in the following code snippet

for (i = 1; i <= 100; i++)
    X[2*i + 3] = X[2*i] * 5.0;

GCD(a, c) = 2 and d - b = -3, which is not divisible by 2; therefore there is NO dependence.
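The test is a one-liner once a GCD routine is available. A small C sketch for a loop that writes A[a*i + b] and reads A[c*i + d]; the name gcd_test is chosen here, not from the notes:

#include <stdio.h>
#include <stdlib.h>

static int gcd(int a, int b) {                /* Euclid's algorithm */
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* A loop-carried dependence can exist only if gcd(a, c) divides (d - b);
   returns 1 when a dependence is possible, 0 when the test rules it out. */
static int gcd_test(int a, int b, int c, int d) {
    return (d - b) % gcd(abs(a), abs(c)) == 0;
}

int main(void) {
    /* X[2*i + 3] = X[2*i] * 5.0  ->  a = 2, b = 3, c = 2, d = 0 */
    printf("dependence possible: %s\n", gcd_test(2, 3, 2, 0) ? "yes" : "no");  /* no */
    return 0;
}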

1.6.5 Software Pipelining

Reorganize loops such that each instruction in the software-pipelined loop body is chosen from a different iteration of the original loop. Like a hardware pipeline, start-up and clean-up code is needed. Consider the following code.

X: LD   F0, 0(R1)   ; F0 = M(0+R1)
   ADD  F4, F0, F2  ; F4 = F0 + F2
   SD   0(R1), F4   ; M(0+R1) = F4
   SUB  R1, R1, 8   ; R1 = R1 - 8
   BNEZ R1, X

iteration(i):    LD F0, 0(R1)   ADD F4, F0, F2   SD 0(R1), F4
iteration(i+1):  LD F0, 0(R1)   ADD F4, F0, F2   SD 0(R1), F4
iteration(i+2):  LD F0, 0(R1)   ADD F4, F0, F2   SD 0(R1), F4

The software-pipelined loop takes one instruction from each of three consecutive iterations:

X: SD   0(R1), F4    ; store A[i]
   ADD  F4, F0, F2   ; add to A[i+1]
   LD   F0, -16(R1)  ; load A[i+2]
   SUB  R1, R1, 8
   BNEZ R1, X

The above transformation requires 5 cycles/iteration, although it requires some start-up/clean-up code.

1.6.6 Memory Pipeline

RISC processors demand high memory bandwidth, and superscalar processors demand even higher memory bandwidth, so there is always a long job queue for memory. Through processor support we can split the phases of a memory request (issue the request vs. get the data), schedule another instruction in between, and allow multiple pending reads and out-of-sequence reads. Processors supporting split-phase memory operations include multi-threaded architectures, which run several threads of execution and schedule among them automatically in case of dependences (examples: PowerPC 620, DEC Alpha and other recent superscalars). The RAMBUS memory structure utilizes pipelined accesses.

1.6.7 Summary

Processors with low CPI may not always be fast. Increasing the issue rate while sacrificing the clock rate may lead to lower performance.


2 Scalable computer architecture

Scaling up refers to improving the computer resources to accommodate performance and functionality demand. Scaling down is done to reduce cost. Scaling up may include:

- Functionality and performance
- Scaling in cost
- Compatibility

Common architectures for scalable parallel computers are:

- Shared-nothing architecture: macro architecture.
- Shared-disk architecture.
- Shared-memory architecture: micro architecture.

Advantages: size scalability, resource scalability, software scalability, memory scalability, scalability in problem size, generation scalability, space scalability, heterogeneity scalability.

2.1 Parallel RAM (PRAM) model

Fully synchronous at instruction level. At each cycle, all memory reads from all n instructions must be performed before any processor can perform a memory write/branch.

[Figure: n processors P1, ..., Pn connected to a shared memory.]

2.2 Semantic Attributes

- Homogeneity: likeness of the processors. Single processor: SISD; multiple processors: MIMD, SIMD, SPMD (Single Program, Multiple Data).

- Synchrony: extent of synchronization among the processors. Fully synchronous at instruction level: PRAM (a Parallel Random Access Machine is a shared-memory abstract machine used by parallel algorithm designers to estimate algorithm performance), SIMD. MIMD: asynchronous; synchronization operations must be executed. BSP (Bulk Synchronous Parallel model): synchronization at every "superstep".

- Interaction mechanism: the way processors interact with each other.
  - Shared variable
  - Message passing
  - PRAM: shared variable/memory
  - Multiprocessors: asynchronous MIMD using shared variables/memory
  - Multicomputers: asynchronous MIMD using message passing

- Address space:
  - Single address space for all memory locations
  - Multiple address spaces in multicomputers
  - Uniform Memory Access (UMA) machine
  - Non-Uniform Memory Access (NUMA) machine
  - Distributed Shared Memory (DSM)
  - Concept of local and global memory

- Memory model: how to handle shared-memory access conflicts.
  - Consistency rules
  - PRAM: Exclusive Read Exclusive Write (EREW) rule
  - Concurrent Read Exclusive Write (CREW) rule
  - Concurrent Read Concurrent Write (CRCW) rule

2.2.1 Read/write conflicts in PRAM

Read/write conflicts in accessing the same shared memory location simultaneously are resolved by one of the following strategies:

1. Exclusive Read Exclusive Write (EREW): every memory cell can be read or written by only one processor at a time.
2. Concurrent Read Exclusive Write (CREW): multiple processors can read a memory cell, but only one can write at a time.
3. Exclusive Read Concurrent Write (ERCW): never considered.
4. Concurrent Read Concurrent Write (CRCW): multiple processors can read and write.

Here E and C stand for 'exclusive' and 'concurrent' respectively. Reads cause no discrepancies, while concurrent writes are further defined as:

- Common: all processors write the same value; otherwise the write is illegal.
- Arbitrary: only one arbitrary attempt is successful; the others retire.
- Priority: processor rank indicates who gets to write.


2.3 Flynn's classification

Single Instruction, Single Data (SISD): a sequential computer which exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines like a PC.

Single Instruction, Multiple Data (SIMD): a computer which exploits multiple data streams against a single instruction stream to perform operations which may be naturally parallelized, for example an array processor or a GPU. Fully synchronous at instruction level: at each cycle, all memory reads from all n instructions must be performed before any processor can perform a memory write/branch.

Multiple Instruction, Single Data (MISD): multiple instructions operate on a single data stream. An uncommon architecture which is generally used for fault tolerance: heterogeneous systems operate on the same data stream and must agree on the result. Examples include the Space Shuttle flight control computer.

Multiple Instruction, Multiple Data (MIMD): multiple autonomous processors simultaneously executing different instructions on different data. Distributed systems are generally recognized to be MIMD architectures, either exploiting a single shared memory space or a distributed memory space. Asynchronous; synchronization operations must be executed.

Single Program, Multiple Data (SPMD): multiple autonomous processors simultaneously executing the same program (but at independent points, rather than in the lockstep that SIMD imposes) on different data. Also referred to as Single Process, Multiple Data. SPMD is the most common style of parallel programming.

Multiple Program, Multiple Data (MPMD): multiple autonomous processors simultaneously operating at least two independent programs.

Bulk Synchronous Parallel model (BSP): a BSP computer consists of processors connected by a communication network. Each processor has a fast local memory and may follow different threads of computation. A BSP computation proceeds in a series of global supersteps. A superstep consists of three ordered stages:

1. Concurrent computation: several computations take place on every participating processor. Each process only uses values stored in the local memory of the processor. The computations are independent in the sense that they occur asynchronously of all the others.
2. Communication: at this stage the processes exchange data between themselves.
3. Barrier synchronization: when a process reaches this point (the barrier), it waits until all other processes have finished their communication actions.

BSP is a MIMD system which synchronizes at every superstep. Message passing or shared variables are used.

Cluster: a cluster is a group of linked computers working together closely so that in many respects they form a single computer. The components of a cluster are commonly, but not always, connected to each other through fast local area networks. Clusters are usually deployed to improve performance and/or availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.


The advantages of clusters are a single-system image, internode connection, and enhanced availability.

2.3.1 Interaction mechanism

There are two ways:

1. Shared variables.
2. Message passing.

2.3.2 Granularity

Granularity is the extent to which a system is broken down into small parts, either the system itself or its description or observation. It is the extent to which a larger entity is subdivided. For example, a yard broken into inches has finer granularity than a yard broken into feet.

Coarse-grained systems consist of fewer, larger components than fine-grained systems; a coarse-grained description of a system regards large subcomponents, while a fine-grained description regards the smaller components of which the larger ones are composed. The terms granularity, coarse and fine are relative, used when comparing systems or descriptions of systems.

2.3.3 Address space

Parallel Vector Processor (PVP): MIMD, UMA, large grain. A small number of powerful processors connected by a custom-designed crossbar switch.

Uniform Memory Access (UMA): all the processors in the UMA model share the physical memory uniformly. In a UMA architecture, access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data. UMA architectures are often contrasted with NUMA architectures. In the UMA architecture each processor may use a private cache; peripherals are also shared in some fashion. The UMA model is suitable for general-purpose and time-sharing applications by multiple users, and can be used to speed up the execution of a single large program in time-critical applications. Physical memory is uniformly shared by all processors, and all processors have equal access time to all memory words.

Non-Uniform Memory Access (NUMA): a design used in multiprocessors where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors. NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures.

Cache-coherent NUMA (CC-NUMA): nearly all CPU architectures use a small amount of very fast non-shared memory known as cache to exploit locality of reference in memory accesses. With NUMA, maintaining cache coherence across shared memory has a significant overhead. Although simpler to design and build, non-cache-coherent NUMA systems become prohibitively complex to program in the standard von Neumann architecture programming model. As a result, all NUMA computers sold on the market use special-purpose hardware to maintain cache coherence, and thus class as "cache-coherent NUMA", or CC-NUMA. Typically this takes place by using inter-processor communication between cache controllers to keep a consistent memory image when more than one cache stores the same memory location. For this reason CC-NUMA performs poorly when multiple processors attempt to access the same memory area in rapid succession. Operating-system support for NUMA attempts to reduce the frequency of this kind of access by allocating processors and memory in NUMA-friendly ways and by avoiding scheduling and locking algorithms that make NUMA-unfriendly accesses necessary. Alternatively, cache coherency protocols such as the MESIF protocol attempt to reduce the communication required to maintain cache coherency. Intel announced NUMA for its x86 and Itanium servers in late 2007 with the Nehalem and Tukwila CPUs; both CPU families share a common chipset, and the interconnection is called Intel QuickPath Interconnect (QPI).

Distributed Shared Memory (DSM): refers to a wide class of software and hardware implementations in which each node of a cluster has access to shared memory in addition to each node's non-shared private memory. It is also known as a distributed global address space (DGAS). MIMD, NUMA, NORMA, large grain. Shared-memory architecture; a cache directory is used to support distributed coherent caches.

Cluster of Workstations (COW):

It is a network that connects several computer workstations together with special software, forming a cluster. MIMD, NUMA, coarse grain. Distributed-memory architecture. Each node is a complete computer. A low-cost commodity network is used. There is always a local disk.

Cache-Only Memory Architecture (COMA): a computer memory organization for use in multiprocessors in which the local memories (typically DRAM) at each node are used as caches. This is in contrast to using the local memories as actual main memory, as in NUMA organizations. In NUMA, each address in the global address space is typically assigned a fixed home node; when processors access some data, a copy is made in their local cache, but space remains allocated in the home node. With COMA, instead, there is no home: an access from a remote node may cause that data to migrate. Compared to NUMA this reduces the number of redundant copies and may allow more efficient use of the memory resources. On the other hand, it raises the problems of how to find a particular datum and what to do if a local memory fills up (migrating some data into the local memory then requires evicting some other data, which has no home to go to). Hardware mechanisms are typically used to implement the migration. COMA is a special case of NUMA in which the distributed memories are converted to caches, and all caches form a global address space.

No Remote Memory Access (NORMA): a distributed-memory multicomputer system is called a NORMA model if all memories are private and accessible only by local processors. In a DSM system NORMA disappears.

2.3.4 Memory models

- Exclusive Read Exclusive Write (EREW)
- Concurrent Read Exclusive Write (CREW)
- Concurrent Read Concurrent Write (CRCW)

2.3.5 ACID

- Atomicity: transactions are guaranteed to either completely occur or have no effects.
- Consistency: always transfers from one consistent state to another.
- Isolation: results are not revealed to other transactions until committed.
- Durability: once committed, the effect of a transaction persists even if the system fails.

2.3.6 Overheads in parallel processing

- Parallelism overhead: due to process management.
- Communication overhead
- Synchronization overhead
- Load imbalance overhead

Example: Algorithms A, B, C with complexities 7n, (n log n)/4, n log log n on an n-processor computer. For n = 1024 the fastest one is B.

2.3.7 Process

A process is a four-tuple:

1. Program code
2. Control state
3. Data state
4. Status

A thread is a light-weight process which may share the address space with other threads and the parent process.

2.3.8 Interaction modes

- Synchronous: all participants must arrive before the interaction begins. A process can exit when all other processes have finished the interaction. A two-barrier synchronization.
- Blocking: a process can enter its portion of code as soon as it arrives and can exit as soon as it finishes.
- Non-blocking: a process can enter its portion of code as soon as it arrives and can exit even before it has finished.

Interaction patterns: point-to-point, broadcast, scatter, gather, total exchange, shift, reduction, scan.

2.3.9 Bulk Synchronous Parallel (BSP) model

Has the concept of a superstep (computation, communication, barrier).

- w: maximum computation time within each superstep; takes care of load imbalance.
- l: barrier synchronization overhead (lower bound on the communication network latency).
- g: h-relation coefficient (each node sends/receives at most h words within gh cycles).
- g is platform-dependent but independent of the communication pattern.

Time for a superstep = w + gh + l, or max(w, gh, l) in case of total overlapping of operations.

Example: inner product of two N-dimensional vectors on an 8-processor BSP machine.

1. Superstep 1
   - Computation: each processor computes its local sum in w = 2N/8 cycles.
   - Communication: processors 0, 2, 4, 6 send their local sums to processors 1, 3, 5, 7 using an (h=1) relation.


   - Barrier synchronization.

2. Superstep 2
   - Computation: processors 1, 3, 5, 7 perform one addition (w = 1).
   - Communication: processors 1 and 5 send their local sums to processors 3 and 7 using an (h=1) relation.
   - Barrier synchronization.

3. Superstep 3
   - Computation: processors 3 and 7 perform one addition (w = 1).
   - Communication: processor 3 sends its intermediate result to processor 7 using an (h=1) relation.
   - Barrier synchronization.

4. Superstep 4
   - Computation: processor 7 performs one addition (w = 1).

Execution time = 2N/8 + 3g + 3l + 3; in general, 2N/n + (g + l + 1) log n.
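The same total can be checked mechanically. A small C sketch that adds up the superstep costs for the inner-product example; the g and l values are placeholders, since they are platform parameters:

#include <stdio.h>
#include <math.h>

/* Cost of one non-overlapped BSP superstep: w + g*h + l. */
static double superstep(double w, double g, double h, double l) {
    return w + g * h + l;
}

int main(void) {
    double N = 1024, n = 8;   /* vector length and processor count (example values) */
    double g = 2, l = 5;      /* platform parameters (placeholders)                  */

    /* Superstep 1: local sums (w = 2N/n) plus an h=1 exchange and a barrier;
       then log2(n) - 1 combining supersteps with w = 1, h = 1; the final
       addition needs no further communication or barrier.                   */
    double t = superstep(2 * N / n, g, 1, l);
    for (int s = 0; s < (int)log2(n) - 1; s++)
        t += superstep(1, g, 1, l);
    t += 1;

    printf("total = %.1f  (formula 2N/n + (g+l+1)*log2(n) = %.1f)\n",
           t, 2 * N / n + (g + l + 1) * log2(n));
    return 0;
}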

2.4 Clusters

Collection of complete computers, physically interconnected by a high-performance network. Features: cluster nodes, single-system image, internode connection, enhanced availability, better performance. Benefits and difficulties of clusters: usability, availability, scalable performance, performance/cost ratio. Clusters have high availability thanks to: processors and memories, disk arrays, operating system. Clusters are highly scalable in terms of processors, memories, I/O devices and disks.

3 Physical machine model

1. PVP: Parallel Vector Processor

2. SMP: Symmetric Multiprocessor

3. MPP: Massively Parallel Processor

4. DSM: Distributed Shared Memory

5. COW: Cluster of Workstations

6. UMA: Uniform Memory Access


7. NUMA: Non-Uniform Memory Access

8. CC-NUMA: Cache-Coherent NUMA

9. COMA: Cache-Only Memory Architecture

10. NORMA: No Remote Memory Access

4 Parallel Programming

4.1 Process states

4.2 Fork() Example

main() {
    int i = 0;
    fork();
    fork();
    printf("Hello");
}

After the two fork() calls there are four processes, each executing the printf, so "Hello" is printed four times.

4.3 Parallel blocks

parbegin
    S1;
    S2;
    S3;
parend

parfor(i = 0; i <= 10; i++) {
    Process(i);
}

forall(i = 1, N) { ... }
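parfor is not standard C; a rough equivalent can be written with the fork()/wait() primitives already used in Section 4.2. This is only a sketch -- Process() stands in for the loop body, and one child per iteration is assumed:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void Process(int i) {                    /* stand-in for the loop body */
    printf("iteration %d handled by pid %d\n", i, (int)getpid());
}

int main(void) {
    /* One child per iteration: the children run concurrently,
       mirroring parfor(i = 0; i <= 10; i++) Process(i);        */
    for (int i = 0; i <= 10; i++) {
        pid_t pid = fork();
        if (pid == 0) { Process(i); _exit(0); }  /* child */
        if (pid < 0)  { perror("fork"); exit(1); }
    }
    while (wait(NULL) > 0)   /* parent waits for all children: the implicit barrier at parend */
        ;
    return 0;
}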


4.4 Dynamic Parallelism

Uses constructs such as fork() and join():

while (C > 0) begin
    fork(f1(C));
    C = f2(C);
end

4.5.1 Communication

Can be done 1) through shared variables, 2) through parameter passing, or 3) through message passing.

4.5.2 Synchronization

Causes processes to wait for one another, or allows waiting processes to resume; for example, wait(x > 1).

- Atomicity

4.5.3 Aggregation

To merge partial results generated by the component processes. Can be realized as a sequence of supersteps and communication/synchronization.

4.6 Commonly used interaction modes

4.6.1 Synchronous

All participants must arrive before the interaction begins. A process can exit the interaction and continue to execute the subsequent operation only when all other processes have finished the interaction.
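A synchronous interaction of this kind is essentially a barrier. A minimal sketch using POSIX threads (threads stand in for processes here; N and worker are illustrative names, and the program is compiled with -pthread):

#include <pthread.h>
#include <stdio.h>

#define N 4                                /* number of participants (illustrative) */

static pthread_barrier_t barrier;

static void *worker(void *arg) {
    long id = (long)arg;
    printf("participant %ld arrived\n", id);
    pthread_barrier_wait(&barrier);        /* nobody passes until all N have arrived */
    printf("participant %ld continues\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[N];
    pthread_barrier_init(&barrier, NULL, N);
    for (long i = 0; i < N; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}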


4.7 Interaction Patterns

5 Distributed memory and latency tolerance

5.1 Memory Hierarchy Properties

Three fundamental properties for designing an efficient memory hierarchy.

- Inclusion: M1 ⊂ M2 ⊂ M3 ⊂ ... ⊂ Mn.

- Coherence: the same information is kept consistent at all levels; affected by the memory/cache replacement policy.
  - Write-through (mainly for on-chip I-cache and D-cache)
  - Write-back (off-chip cache)

- Locality of Reference
  1. Temporal locality: favours LRU replacement.
  2. Spatial locality: assists in determining the size of the unit of data transfer.
  3. Sequential locality: affects the determination of grain size for optimal scheduling, and also affects prefetch techniques.

5.2 Memory Capacity Planning

- Hit ratio: h_i at level i is the probability of finding the information item at level i; h_0 = 0, h_n = 1.

- Access frequency to M_i:

  f_i = (1 - h_1)(1 - h_2) ... (1 - h_{i-1}) h_i,   with   f_1 >> f_2 >> f_3 >> ... >> f_n.

- Effective access time:

  T_eff = sum over i = 1..n of f_i * t_i

- Hierarchy optimization: minimize T_eff subject to

  C_total = sum over i = 1..n of c_i * S_i < C_0,

  where S_i is the size of memory M_i, c_i is the per-unit cost of M_i, and C_0 is a given bound on the total cost.
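The two formulas translate directly into code. A small C sketch that evaluates T_eff and C_total for a made-up three-level hierarchy (all the numbers below are illustrative, not from the notes):

#include <stdio.h>

#define LEVELS 3

int main(void) {
    /* Example hierarchy (cache, main memory, disk); h_n = 1 at the last level. */
    double h[LEVELS] = {0.95, 0.99, 1.0};     /* hit ratios                 */
    double t[LEVELS] = {2, 50, 1e6};          /* access times (cycles)      */
    double c[LEVELS] = {10.0, 0.05, 0.0001};  /* cost per unit of capacity  */
    double S[LEVELS] = {64e3, 8e9, 1e12};     /* sizes                      */

    double teff = 0, ctotal = 0, miss = 1.0;  /* miss = (1-h_1)...(1-h_{i-1}) */
    for (int i = 0; i < LEVELS; i++) {
        double f = miss * h[i];               /* access frequency f_i        */
        teff   += f * t[i];
        ctotal += c[i] * S[i];
        miss   *= 1.0 - h[i];
    }
    printf("T_eff = %.2f cycles, C_total = %.2f\n", teff, ctotal);
    return 0;
}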


5.3 Cache Coherence Protocols

5.3.1 Sources of incoherence

- Writes by different processors into their cached copies of the same cache line in memory, asynchronously.
- Process migration among multiple processors without alerting each other.
- I/O operations bypassing the owners of cached copies.

5.3.2 Snoopy coherency protocols

Constantly monitor the caching events across the bus between the processors and the memory modules. Generally implemented using snoopy buses, and require a broadcast mechanism. Write-invalidate and write-update protocols.

5.3.3 MESI snoopy protocol

Keeps track of the state of each cache line, considering all reads/writes, cache hits/misses etc. A write-invalidate protocol. Uses a non-write-allocate policy: no fill from memory in case of a write miss. A cache line can be in the following states:

- Modified (M): the cache line has been updated by a write hit; memory is not updated yet.
- Exclusive (E): the line is valid in this cache only and is invalid in the other caches.
- Shared (S): the line is valid in this cache and may be valid in one or more other caches and in memory.
- Invalid (I): the line is invalid after reset, or has been invalidated by a write hit in another cache.

5.3.4 Memory consistency models

- Sequential consistency: all reads, writes and swaps by all processors appear to execute serially in a single global memory order conforming to program order.
- Weak consistency: relates memory order to synchronization points.
- Processor consistency: writes issued by each individual processor are always in program order, but writes issued by different processors can be out of program order.
- Release consistency: requires that synchronization accesses in the program be identified as either acquires (locks) or releases (unlocks).

To reduce the overhead of cache coherence control without affecting correctness, relaxed memory consistency models have also been proposed.

5.3.5 Software-implemented DSM

Software-coherent NUMA (SC-NUMA), or the Distributed Shared Memory (DSM) model, is proposed to enable shared-memory computing on NORMA and NCC-NUMA machines. Software-implemented DSM uses software extensions to achieve a single address space, data sharing and coherence control. In shared virtual memory (SVM), the virtual-memory management mechanism of a traditional node OS is modified to provide data sharing at page level. Alternatively, instead of modifying the OS, compilers and library functions are used to convert single-address-space codes to run on multiple address spaces. Application codes may also have to be modified to include data-sharing, synchronization and coherence primitives.

5.3.6 Directory-based coherency protocols

Do not use broadcast-based snoopy protocols. Cache directory: a directory records the locations and states of all cached lines of shared data. A directory entry for each cache line of data contains a dirty bit and a number of pointers to specify the locations of all remote copies. The concept of a central directory recording all cache conditions (including the presence information and all cache-line states) is suitable only for a small-scale SMP with centralized shared memory. A central directory for a large-cache SMP, NUMA or distributed-memory platform would require a huge memory to implement and must be associatively searched to reduce the update time. The cache directory structure can be 1) a full-map cache directory, 2) a limited cache directory, or 3) a chained cache directory.
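The MESI state changes described in Section 5.3.3 can be written down as a small transition function. This is a much-simplified sketch: bus transactions and write-backs are omitted, only the state of one line in one cache is modelled, and the enum and function names are chosen here, not taken from the notes:

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;           /* I, S, E, M */
typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } event_t;

/* shared_elsewhere: on a local read miss, some other cache already holds the line. */
static mesi_t mesi_next(mesi_t s, event_t e, int shared_elsewhere) {
    switch (e) {
    case LOCAL_READ:
        return (s == INVALID) ? (shared_elsewhere ? SHARED : EXCLUSIVE) : s;
    case LOCAL_WRITE:
        /* Non-write-allocate, as in the notes: a write miss bypasses the cache. */
        return (s == INVALID) ? INVALID : MODIFIED;
    case BUS_READ:               /* another processor reads the line     */
        return (s == INVALID) ? INVALID : SHARED;
    case BUS_WRITE:              /* another processor writes: invalidate */
        return INVALID;
    }
    return s;
}

int main(void) {
    mesi_t s = INVALID;
    s = mesi_next(s, LOCAL_READ, 0);   /* I -> E (no other copy) */
    s = mesi_next(s, LOCAL_WRITE, 0);  /* E -> M (write hit)     */
    s = mesi_next(s, BUS_READ, 0);     /* M -> S (snooped read)  */
    printf("final state: %d\n", s);    /* prints 1 (SHARED)      */
    return 0;
}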


5.4 Latency Tolerance Techniques

1. Latency avoidance: organize user applications to achieve data/program locality, so that the long latency of remote accesses is avoided.
2. Latency reduction: data locality may be limited and difficult to discover and exploit; latency reduction is achieved by making the communication subsystem efficient.
3. Latency hiding: prefetching techniques, distributed coherent caches, relaxed memory consistency, multiple-context processors.

5.5 Data Prefetching

Uses knowledge about the expected misses in a program; controlled by software/hardware.

1. Binding prefetch: the value is loaded directly into a working register during the prefetch. The value may become stale if another processor modifies the location between the prefetch and the actual reference.
2. Non-binding prefetch: brings the data into the cache, remaining visible to the cache coherence protocol (CCP). Hardware-controlled prefetch includes schemes such as long cache lines (effectiveness limited by reduced spatial locality in multiprocessor applications) and instruction lookahead (effectiveness limited by branches and finite buffer size).

5.6 Context-switching policies

1. Switch on cache miss. R: average interval between misses; L: recovery time.
2. Switch on every load. R: average interval between loads; L1, L2: latencies with and without misses; p1, p2: probability of a switch with or without a miss.
3. Switch on every instruction: interleaving of instructions. Supports pipelined execution, but cache misses may increase because locality is broken. Can hide pipeline dependences.
4. Switch on a block of instructions: improved cache-hit ratio due to preservation of locality.

6 System Interconnect and Network Topologies

6.1 Switched Networks

Allocate/deallocate media resources to one request at a time. Even shared-media networks can be converted to switched networks.

1. Circuit-switched networks: the entire path from the source node to the destination is reserved for the entire period of transmission.
2. Packet-switched networks: long messages are broken into a sequence of smaller packets containing routing information and a segment of data payload. Packets can be routed separately on different paths. Better utilization of resources, but messages need disassembly and reassembly. Different messages may have different packet sizes.
3. Cell-switched networks: long packets are partitioned into fixed-size small cells. Small packets need not wait for long packets, and a constant transmission delay is possible. The fixed size simplifies the hardware design of the cell switches. May lead to low network performance if retransmission of lost cells is not allowed.

6.2 Network Performance Metrics (Communication Latency)

1. Software overhead: associated with sending and receiving messages at both ends of the network; contributed by the host kernels in handling the message.
2. Channel delay: caused by channel occupancy; determined by the bottleneck link or channel.
3. Routing delay: caused by the successive switches making a sequence of routing decisions along the routing path; depends on the routing distance.
4. Contention delay: caused by traffic contention in the network; difficult to predict.

Network latency is the sum of the channel delay and the routing delay.

6.3 Routing Schemes and Functions

1. Store-and-forward routing: a packet is stored in a packet buffer in a node before being forwarded to the outgoing link; non-overlapped transmission of packets. The header determines the routing path and then the entire packet is forwarded to the next node.
2. Cut-through or wormhole routing: each node uses a flit buffer to hold one flit (or cell) of the packet, which is automatically forwarded through an out-link to the next node once the header is decoded. The remaining data flits of the packet follow the same path in a pipelined fashion, reducing the transmission time significantly.
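A common back-of-the-envelope comparison of the two schemes (this latency model is a standard textbook approximation, not taken from these notes): store-and-forward retransmits the whole packet at every hop, while wormhole routing pipelines the flits behind the header.

#include <stdio.h>

int main(void) {
    double packet = 1024;   /* packet length in bytes (illustrative)  */
    double flit   = 8;      /* flit/cell size in bytes (illustrative) */
    double bw     = 1.0;    /* channel bandwidth, bytes per cycle     */
    int    hops   = 6;      /* routing distance D                     */

    double t_sf = hops * (packet / bw);              /* D * (L/W)       */
    double t_wh = packet / bw + hops * (flit / bw);  /* L/W + D * (f/W) */

    printf("store-and-forward: %.0f cycles, wormhole: %.0f cycles\n", t_sf, t_wh);
    return 0;
}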


6.4 Network Topologies

Binary tree, star, near-neighbour mesh, systolic array, Illiac mesh, 2-D torus, cube, binary 4-cube.

Kinds of networks:

- Multistage Interconnection Networks
- Mesh-Connected Illiac Network
- Cube Interconnection Network
- Shuffle-Exchange and Omega Networks

Matrix multiplication on a cube network is done as below.

1. Let p_{2m-1} ... p_m p_{m-1} ... p_1 p_0 be the PE address in a 2m-cube. Distribute the n rows of A over n distinct PEs whose addresses satisfy the condition p_{2m-1} ... p_m = p_{m-1} ... p_1 p_0.

2. Broadcast the rows of A over the fourth dimension and the front-to-back edges.

3. Transpose B to form B^T over the m-cubes x_{2m-1} ... x_m 0 ... 0 in n log n steps.

4. n-way broadcast each row of B^T to all PEs in the m-cube p_{2m-1} ... p_m x_{m-1} ... x_0 in n log n steps.

5. Get the multiplication in O(n) time.
