Branch Prediction + Hyper-Threading

Computer Architectures
Branch Prediction + Hyper-Threading
Richard Šusta, Pavel Píša
Czech Technical University in Prague, Faculty of Electrical Engineering
AE0B36APO Computer Architectures, Ver. 3.50, 2019

Control Hazards
• Jumps and branches cause significant performance losses.
• A jump instruction needs only the jump target address.
• A branch instruction requires two results:
  • Branch outcome: Taken or Not Taken
  • Branch target address:
    • PC + 4 if the branch is NOT taken
    • PC + 4 + 4 × immediate if the branch is taken

Branch Not Taken

              cycle b   cycle b+1   cycle b+2   cycle b+3                    cycle b+4
Branch to Z   fetched   decoded     decision    PC keeps D (br. not taken)
A                       fetched     decoded     executed                     continues
B                                   fetched     decoded                      executed
C                                               fetched                      decoded
D                                                                            fetched
Z             (not fetched)

Branch Hazard
• Consider the heuristic "branch not taken".
• Continue fetching instructions in sequence after the branch instruction.
• If the branch is taken (indicated by the zero output of the ALU):
  • Control generates the branch signal in the ID cycle.
  • The branch signal activates the PCSource signal in the MEM cycle to load the PC with the new branch address.
• Instructions in the pipeline must be flushed if the branch is taken. Can this penalty be reduced?

Branch Taken

              cycle b   cycle b+1   cycle b+2   cycle b+3                cycle b+4
Branch to Z   fetched   decoded     decision    PC gets Z (br. taken)
A                       fetched     decoded     executed                 Nop
B                                   fetched     decoded                  Nop
C                                               fetched                  Nop
Z                                                                        fetched

Three instructions are flushed if the branch is taken.

Pipeline Flush
• If the branch is taken (as indicated by zero), control does the following:
  • Changes all control signals to 0, as in a stall for a data hazard, i.e., inserts a bubble into the pipeline.
  • Generates a signal IF.Flush that changes the instruction in the IF/ID pipeline register to 0 (nop).
• The penalty of a branch hazard is reduced by:
  • Adding branch detection and address generation hardware in the decode cycle. Only one bubble is then needed: next-address generation logic in the decode stage writes PC+4, the branch address, or the jump address into the PC.
  • Using branch prediction.

Control Hazards
• The result of the comparison is only known in the 4th cycle. Why?

Control Hazard: better to know the result earlier
• If we can determine the result of the comparison already in the 2nd cycle, we can reduce the misprediction penalty.
• Moving the comparison to an earlier stage can introduce new RAW hazards!

Solution of Hazards by Flush
[Figure: pipelined datapath with control unit and hazard unit. BranchD, PCSrcD and PCBranchD are produced in the decode stage; the hazard unit drives StallF, StallD, FlushE, ForwardAE and ForwardBE.]

1-Cycle Jump Delay
• If the control logic detects a jump instruction in the 2nd stage, the next instruction has already been fetched.
• Only this one instruction is flushed.

                         cc1   cc2   cc3      cc4      cc5      cc6      cc7
J L1                     IF    ID
Next instruction               IF    Bubble   Bubble   Bubble   Bubble
L1: Target instruction               IF       Reg      ALU      DM       Reg

Solution of RAW hazards by forwarding
[Figure: the same pipelined datapath with forwarding into the decode stage. The hazard unit additionally receives BranchD and drives ForwardAD and ForwardBD. Figure labels: "Forward / Stall" and "No need to stall".]

Pipeline version
[Figure: complete pipelined datapath with control unit and hazard unit.]

Single cycle version
Return back to the single cycle processor
[Figure: single-cycle datapath with control unit, instruction memory, register file, ALU and data memory.]
What have we designed?
Return back to the single cycle processor
[Figure: single-cycle processor divided into the control part (control path), the data part (data path) and the memories.]

What have we designed?
[Figure: block diagram of the processor (control unit and datapath) connected to the instruction memory (PC as address, instruction as read data) and to the data memory (address, write enable, written data, read data).]

Done - designed pipeline processor
[Figure: complete pipelined datapath with control unit and hazard unit.]

Single cycle CPU: Throughput
IPS = IC / T = IPC_avg · f_CLK
• What is the maximal possible frequency of the CPU?
• It is given by the latency of the critical path, which is the lw instruction in our case:
  Tc = tPC + tMem + tRFread + tALU + tMem + tMux + tRFsetup
[Figure: single-cycle datapath showing the critical path of lw.]

Single cycle CPU: Throughput
IPS = IC / T = IPC_avg · f_CLK
• Tc = Tc_instr + Tc_proc = (tPC + tMem) + (tRFread + tALU + tMem + tMux + tRFsetup)
• Consider the following parameters:
  tPC      =  30 ns
  tMem     = 300 ns
  tRFread  =  50 ns
  tALU     = 200 ns
  tMux     =  20 ns
  tRFsetup =  20 ns
• If Tc_instr is executed in parallel with Tc_proc, then Tc_instr < Tc_proc and the clock period is
  Tc_proc = 50 + 200 + 300 + 20 + 20 = 590 ns, i.e. f_CLK ≈ 1.69 MHz
  -> IPS ≈ 1 690 000 [instructions per second]

Pipeline Processor Performance
If a pipelined processor has
  Tc   clock cycle time
  P    number of pipeline stages
  N    number of instructions in a program
then
  Tprogram = (P + (N - 1)) * Tc
because the 1st instruction needs P cycles to fill the pipeline, but each additional instruction adds only one extra clock cycle.

Pipeline Processor Performance: IPS = IC / T
• The cycle time is determined by the slowest stage.
• In our case the memory is the weak stage: tMem = 300 ns --> Tc_min = 300 ns --> f_CLK = 3 333 kHz
• If we don't consider pipeline stalls and flushes, then a program with a large number N of instructions executes approximately one instruction per cycle.
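
The timing formulas above can be checked with a short script. This is only a sketch: the delays are the example values from the slides, while the function names, the choice P = 5 (assumed from the F/D/E/M/W stage suffixes in the datapath figures) and the optional flush term are illustrative assumptions, not part of the lecture.

```python
# Sketch only: delays are the slide's example values; all names are hypothetical.
DELAYS_NS = {"tPC": 30, "tMem": 300, "tRFread": 50,
             "tALU": 200, "tMux": 20, "tRFsetup": 20}


def single_cycle_period_ns(d):
    """Clock period of the single-cycle CPU when instruction fetch
    (Tc_instr) runs in parallel with the rest of the lw path (Tc_proc)."""
    tc_instr = d["tPC"] + d["tMem"]
    tc_proc = d["tRFread"] + d["tALU"] + d["tMem"] + d["tMux"] + d["tRFsetup"]
    return max(tc_instr, tc_proc)


def pipeline_period_ns(d):
    """Slide's simplification: the pipeline clock is set by the slowest
    single component (register overhead is ignored)."""
    return max(d.values())


def pipeline_program_time_ns(p, n, tc_ns, taken_branches=0, flushed_per_branch=0):
    """Tprogram = (P + (N - 1)) * Tc.
    Assumption added here: every taken branch costs `flushed_per_branch`
    extra cycles (3 with the decision in MEM, 1 with the decision in decode)."""
    cycles = p + (n - 1) + taken_branches * flushed_per_branch
    return cycles * tc_ns


tc_single = single_cycle_period_ns(DELAYS_NS)   # 590
tc_pipe = pipeline_period_ns(DELAYS_NS)         # 300
print("single-cycle Tc [ns]:", tc_single)
print("single-cycle IPS:", round(1e9 / tc_single))
print("pipeline Tc [ns]:", tc_pipe)
# 1000 instructions on an assumed 5-stage pipeline, no stalls or flushes
print("Tprogram [ns]:", pipeline_program_time_ns(5, 1000, tc_pipe))  # 301200
# same program, 100 taken branches, 3 flushed instructions each
print("Tprogram with flushes [ns]:",
      pipeline_program_time_ns(5, 1000, tc_pipe, 100, 3))            # 391200
```

With these numbers, the pipelined version needs about half the time of the single-cycle one for long programs, and the flush term shows how quickly taken branches erode that gain when the decision is made late.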
