
Computer Architectures Branch Prediction + Hyper-Threading Richard Šusta, Pavel Píša Czech Technical University in Prague, Faculty of Electrical Engineering AE0B36APO Computer Architectures Ver.3.50 2019 1 Control Hazards • Jump and Branch are great performance losses. • Jump instruction needs only the jump target address • Branch instruction requires 2 operations: • Branch Result Taken or Not Taken • Branch Target Address • PC + 4 If Branch is NOT Taken • PC + 4 + 4 × immediate If Branch is Taken AE0B36APO Computer Architectures 2 Branch Not Taken Branch to Z A B C D Z cycle b cycle b+1 cycle b+2 cycle b+3 cycle b+4 Branch fetched Branch decoded Branch decision PC keeps D (br. not taken) A fetched A decoded A executed A continues B fetched B decoded B executed C fetched C decoded D fetched AE0B36APO Computer Architectures 3 Branch Hazard • Consider heuristic – branch Not taken. • Continue fetching instructions in sequence following the branch instructions. • If branch is taken (indicated by zero output of ALU): • Control generates branch signal in ID cycle. • branch activates PCSource signal in the MEM cycle to load PC with new branch address. • Instructions in the pipeline must be flushed if branch is taken – can this penalty be reduced? AE0B36APO Computer Architectures 4 Branch Taken Branch to Z A B C D Z cycle b cycle b+1 cycle b+2 cycle b+3 cycle b+4 Branch fetched Branch decoded Branch decision PC gets Z (br. taken) A fetched A decoded A executed Nop B fetched B decoded Nop C fetched Nop Three instructions are Z fetched flushed if branch is taken AE0B36APO Computer Architectures 5 Pipeline Flush • If branch is taken (as indicated by zero), then control does the following: • Change all control signals to 0, similar to the case of stall for data hazard, i.e., insert bubble in the pipeline. • Generate a signal IF.Flush that changes the instruction in the pipeline register IF/ID to 0 (nop). • Penalty of branch hazard is reduced by • Adding branch detection and address generation hardware in the decode cycle – one bubble needed – a next address generation logic in the decode stage writes PC+4, branch address, or jump address into PC. • Using branch prediction. AE0B36APO Computer Architectures 6 Control Hazards • The result of the comparison is only known in the 4th cycle. Why? AE0B36APO Computer Architectures 7 Control Hazard – rather know the result earlier... • If we can determine the result of the comparison already in the 2nd cycle, we can reduce misprediction penalty. • Moving forward can introduce new RAW hazards !!! AE0B36APO Computer Architectures 8 Solution of Hazards by Flush RegWriteD RegWriteE RegWriteM RegWriteW Control unit MemToRegD MemToRegE MemToRegM MemWriteD MemWriteE MemWriteM MemTo ALUControlE 31:26 ALUControlD RegW Op ALUSrcD ALUSrcE 5:0 RegDstD RegDstE Funct BranchD EquaD PCSrcD = 0 PC´ PC InstrD 25:21 WE3 00 SrcAE WE A RD A1 RD1 01 ALUOutM ReadDataW 1 EN 10 20:16 A RD A2 RD2 00 SrcBE ALU Instruction 01 0 Data Memory A3 Reg. 10 1 Memory WriteDataE WriteDataM WD3 File WD 1 RsD ALUOutW 25:21 RsE 0 20:16 RtD RtE 0 WriteRegE 4:0 WriteRegM 4:0 WriteRegW 4:0 15:11 RdD RdE 1 15:0 SignImmD + Sign 4 Ext SignImmE <<2 + CLR PCPlus4F PCPlus4D CLR EN PCBranchD ResultW Forward Forward RegWrite Stall F Stall D AE BE RegWriteM FlushE W Hazard unit MemToRegE AE0B36APO Computer Architectures 9 1-Cycle Jump Delay • If the control logic detects a Jump instruction in the 2nd Stage, then Next instruction is fetched anyway. • We flush only with one instruction. cc1 cc2 cc3 cc4 cc5 cc6 cc7 J L1 IF ID Next instruction IF Bubble Bubble Bubble Bubble . Jump L1: Target instruction Target IF Reg ALU DM Reg Addr AE0B36APO Computer Architectures 10 Solution of RAW hazards by forwarding Přeposlat / Pozastavit Netřeba Pozastavit RegWriteD RegWriteE RegWriteM RegWriteW Control unit MemToRegD MemToRegE MemToRegM MemWriteD MemWriteE MemWriteM MemTo ALUControlE 31:26 ALUControlD RegW Op ALUSrcD ALUSrcE 5:0 RegDstD RegDstE Funct BranchD EquaD PCSrcD = 0 PC´ PC InstrD 25:21 WE3 0 00 SrcAE WE A RD A1 RD1 01 ALUOutM ReadDataW 1 EN 1 10 20:16 A RD A2 RD2 00 SrcBE ALU Instruction 01 0 Data Memory A3 Reg. 0 10 1 Memory 1 WriteDataE WriteDataM WD3 File WD 1 RsD ALUOutW 25:21 RsE 0 20:16 RtD RtE 0 WriteRegE 4:0 WriteRegM 4:0 WriteRegW 4:0 15:11 RdD RdE 1 15:0 SignImmD + Sign 4 Ext SignImmE <<2 + CLR PCPlus4F PCPlus4D CLR EN PCBranchD ResultW Forward Forward Forward RegWrite Stall F Stall D BranchD AE BE RegWriteM BD FlushE W ForwardAD RegWriteE Hazard unit MemToRegE AE0B36APO Computer Architectures 11 Pipeline version RegWriteD RegWriteE RegWriteM RegWriteW Contr ol unit MemToRegD MemToRegE MemToRegM MemWriteD MemWriteE MemWriteM MemTo 31:26 ALUControlD ALUControlE RegW Op ALUSrcD ALUSrcE 5:0 RegDstD RegDstE Funct BranchD EquaD = PCSrcD 0 PC´ PC InstrD 25:21 WE3 0 00 SrcAE WE A RD A1 RD1 01 ALUOutM ReadDataW 1 EN 1 10 A RD 20:16 00 SrcBE Instruction A2 RD2 01 0 ALU Data Memory A3 0 10 1 Memory Reg. WriteDataM WD3 1 WriteDataE W 1 25:21 File RsD RsE ALUOutW 20:16 RtD RtE D 0 0 WriteRegE 4:0 WriteRegM 4:0 WriteRegW 4:0 15:11 RdD RdE 1 + 15:0 SignImmD 4 Sign SignImmE Ext <<2 + PCPlus4F PCPlus4D CLR EN PCBranchD ResultW Forward Forward Forward RegWrit Stall F Stall D BranchD AE BE RegWrite BD FlushE M e ForwardAD Hazard unit RegWriteE W MemToRegE AE0B36APO Computer Architectures 12 Single cycle version Return back to single cycle processor MemToReg Control MemWrite Unit Branch 31:26 Opcode ALUControl 2:0 ALUScr 5:0 RegDest Funct AluOutW RegWrite 25:21 WE3 SrcA Zero WE 0 PC’ PC A RD Instr A1 RD1 0 Result ALU 1 20:16 A RD 1 Instr. A2 RD2 AluOutM Data 0 SrcB ReadData Memory Memory A3 Reg. 1 WD3 WD 20:16 File WriteData Rt 0 WriteReg 15:11 Rd 1 + 4 15:0 <<2 Sign Ext SignImm PCBranch PCPlus4F PCPlus4D PCPlus4E + AE0B36APO Computer Architectures 13 What we have designed? Return back to single cycle processor Řídicí část MemToReg Paměti Control MemWrite (control path) Unit Branch 31:26 Opcode ALUControl 2:0 ALUScr 5:0 RegDest Funct AluOutW RegWrite 25:21 WE3 SrcA Zero WE A RD 0 PC’ PC Instr A1 RD1 A RD ALU 0 Result 1 20:16 A RD 1 Instr. A2 RD2 0 SrcB AluOutM ReadData Memory A3 Reg. 1 WD3 WDDatová část 20:16 File WriteData Rt 0 WriteReg (data path) WE 15:11 Rd 1 + A RD 4 15:0 <<2 Data Sign Ext SignImm PCBranch Memory PCPlus4F PCPlus4D PCPlus4E + WD AE0B36APO Computer Architectures 14 What we have designed? Processor Control unit PC Instruction PC A RD RD A Instr. Memory Enable write Datapath Address of WE RD/WR Read data Address A RD RD A Data Results Written data Memory WD WD AE0B36APO Computer Architectures 15 Done - designed pipeline processor RegWriteD RegWriteE RegWriteM RegWriteW Control unit MemToRegD MemToRegE MemToRegM MemWriteD MemWriteE MemWriteM MemTo ALUControlE 31:26 ALUControlD RegW Op ALUSrcD ALUSrcE 5:0 RegDstD RegDstE Funct BranchD EquaD PCSrcD = 0 PC´ PC InstrD 25:21 WE3 0 00 SrcAE WE A RD A1 RD1 01 ALUOutM ReadDataW 1 EN 1 10 20:16 A RD A2 RD2 00 SrcBE ALU Instruction 01 0 Data Memory A3 Reg. 0 10 1 Memory 1 WriteDataE WriteDataM WD3 File WD 1 RsD ALUOutW 25:21 RsE 0 20:16 RtD RtE 0 WriteRegE 4:0 WriteRegM 4:0 WriteRegW 4:0 15:11 RdD RdE 1 15:0 SignImmD + Sign 4 Ext SignImmE <<2 + CLR PCPlus4F PCPlus4D CLR EN PCBranchD ResultW Forward Forward Forward RegWrite Stall F Stall D BranchD AE BE RegWriteM BD FlushE W ForwardAD RegWriteE Hazard unit MemToRegE AE0B36APO Computer Architectures 16 Single cycle CPU – Throughput: IPS = IC / T = IPCstr.fCLK • What is the maximal possible frequency of the CPU? • It is given by latency on the critical path – it is lw instruction in our case: Tc = tPC + tMem + tRFread + tALU + tMem + tMux + tRFsetup 25:21 WE3 SrcA WE 0 PC’ PC Instr A1 RD1 Zero A RD ALU 0 Result 1 A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 4 15:0 <<2 Sign Ext SignImm PCBranch + PCPlus4 B35APO Computer Architectures 17 Single cycle CPU – Throughput: IPS = IC / T = IPCstr.fCLK • Tc = Tcinstr + Tcproc = (tPC + tMem) + (tRFread + tALU + tMem + tMux + tRFsetup) • Consider following parameters: tPC = 30 ns tMem = 300 ns tRFread = 50 ns tALU = 200 ns tMux = 20 ns tRFsetup = 20 ns If Tc instr is executed paralel with Tc proc, then Tc i < Tc p, and Tc p = 50+200+300+20+20 = 590 ns = 1.69 MHz -> IPS = 1 690 000 [instructions per second] B35APO Computer Architectures 18 Pipeline Processor Performance If pipeline processor has Tc clock cycle P number of pipeline steps N number of instructions in a program Tprogram = ( P + (N-1) ) * Tc because the 1st instruction needs P cycles to fill the pipeline but each additional instruction only adds one extra clock. AE0B36APO Computer Architectures 19 Pipeline Processor Performance : IPS = IC / T The cycle time is determined by the slowest step In our case memory is weak step: Tmen = 300 ns --> Tcmin = 300 ns --> 3 333 kHz • If we don't consider stall and flush pipelines, then we can say that a program with many N instructions will execute one instruction per cycle.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages51 Page
-
File Size-