TheThe PentiumPentium®® II/IIIII/III ProcessorProcessor ““CompilerCompiler onon aa ChipChip””

Ronny Ronen Senior Principal Engineer Director of Architecture Research Labs - Haifa

Intel Corporation

Tel Aviv University January 18, 2005 ® 1 G-Number AgendaAgenda

z Goal,Goal, ExpectationsExpectations…… z GeneralGeneral InformationInformation z µµarchitecurearchitecure basicsbasics ® z PentiumPentium ProPro ProcessorProcessor µµarchitecurearchitecure z SWSW aspectsaspects

® 2 G-Number TechnologyTechnology ProfileProfile Pro - 1995 Pentium-II - 1998 Pentium-III - 1999 z Core @200MHz z Core @333MHz z Core @600MHz z 256K L2 on package, z 512KB L2 in SEC z 512KB L2 @200MHz @167MHz @ ???MHz z Performance: z Performance: z Performance:  8.09 SPECint95  12.8 SPECint95  24.0 SPECint95  6.70 SPECfp95  9.14 SPECfp95  15.9 SPECfp95 z 0.35 µm BiCMOS (P55C: 7.12/5.21) z 5.5M z 0.25 µm CMOS z 0.25 µm CMOS process process z 195 sq mm (14x14) z 7.5M transistors z ???M transistors z 3.3V, 11.2A z 28.1W / 35.0W

® 3 G-Number TechnologyTechnology ProfileProfile (cont.)(cont.) z Pentium-III – 2000 (Coppermine) z Pentium-III - 2002 (Tualatin) z Core @1000MHz z Core @1400MHz z 256KB L2 on chip @ 1000MHz z 512KB L2 on chip @ 1400MHz z Performance: z Performance (estimated):  >46 SPECint95  >60 SPECint95  >20 SPECfp95  >30 SPECfp95 z 0.18 µm CMOS process z 0.13 µm CMOS process z ~20M transistors z ~44M transistors z Processor 2003 (Banias) z Pentium M Processor 2004 (Dothan) z Core @1800MHz z Core @2000MHz z 1024KB L2 on chip @ 1800MHz z 2048KB L2 on chip @ 2000MHz z Performance (estimated): z Performance (estimated):  >80 SPECint95  >90 SPECint95 (est.)  >50 SPECfp95  >60 SPECfp95 (est.) z 0.13 µm CMOS process z 0.09 µm CMOS process z ®~77M transistors z ~127M transistors 4 G-Number TerminologyTerminology z IntelIntel ArchitectureArchitecture z Pipeline,Pipeline, SuperSuper ScalarScalar z BranchBranch PredictionPrediction z SpeculativeSpeculative ExecutionExecution z DynamicDynamic SchedulingScheduling z DataData dependencydependency z RegisterRegister RenamingRenaming z OutOut OfOf OrderOrder z ReRe--orderorder BufferBuffer && MemoryMemory OrderOrder BufferBuffer z ReservationReservation StationsStations z MicroMicro--OperationsOperations

Skip to µarch ® 5 G-Number IntelIntel ArchitectureArchitecture ()(X86) z 88 registersregisters onlyonly – Can be partially accessed ⇒Many memory accesses, short life time z OneOne setset ofof conditioncondition codescodes – Modified by most ALU operations – Various operations affect various flags ⇒Short Generate/Use distance z ExplicitExplicit stackstack – Push/Pop/Call/Return operations z VariableVariable lengthlength instructionsinstructions – Implicit operands ® 6 G-Number PipelinePipeline

z BreakBreak thethe workwork toto smallersmaller piecespieces 0 1 2 3 4 5 6 7 8 9 10 11 12 I1 1/4 IPC = 4 CPI I2 I3

F D E W F D E W 1 IPC = 1 CPI F: Fetch F D E W D: Decode F D E W F D E W E: Execute F D E W W: Write Back z IncreasedIncreased throughputthroughput – increased # of completed instructions per cycle – Number of stages varies – Small: 4-5 (Pentium), “Superpipeline” ~14 () ® 7 G-Number PipelinePipeline StallsStalls z ButBut therethere areare ““stallsstalls”” inin thethe pipelinepipeline – Data flow dependency (instructions output/input) – Solved by: bypasses, renaming – Control flow dependencies – Solved by branch prediction – Other (Cache misses, long latency instructions)

0 1 2 3 4 5 6 7 8 9 10 11 12 F D E W F D stall E W F: Fetch F D stall E W D: Decode stall stall stall F D E W E: Execute F stall stall D E W W: Write Back Data Flow stall Control Flow stall Address Generation Interlock ® 8 G-Number SuperScalarSuperScalar z PerformsPerforms moremore inin aa singlesingle cyclecycle 0 1 2 3 4 5 6 7 8 9 10 11 12 F D E W F D E W F D E W F D E W F D E W F D E W F D E W 2 IPC = 1/2 CPI F D E W F D E W F D E W F D E W F D Stall E W z Ideally,Ideally, cancan multiplymultiply thethe throughputthroughput – but stall penalty is increased

® 9 G-Number SuperSuper PipelinePipeline

z Split to shorter stages - allows higher frequency Old clk = 0 1 2 3 4 5 6 7 8 9 10 11 12 New clk = 0123456789101112

F1 F2 D1 D2 D3 E1 W1 W2 F1 F2 D1 D2 D3 E1 W1 W2 1 IPC = 1 CPI F1 F2 D1 D2 D3 E1 W1 W2 33% higher freq! F1 F2 D1 D2 D3 E1 W1 W2 F: Fetch F1 F2 D1 D2 D3 E1 W1 W2 D: Decode F1 F2 D1 D2 D3 E1 W1 W2 E: Execute F1 F2 D1 D2 D3 E1 W1 W2 W: Write Back F1 F2 D1 D2 D3 E1 W1 W2

z Ideally, can (again) multiply the throughput, but – Stall penalties do not scale (e.g., control flow stall, cache misses) – Clock setup/hold reduces the amount of net cycle time more - each instruction takes longer! Ö In the example above: 2X stages, but performance gain is <<33%

® 10 G-Number OutOut OfOf OrderOrder ExecutionExecution z So far - instructions were processed in their program order. – Parallelism is limited. z OOO: Instructions are executed based on “data flow”

rather than program order 12 1D 1E 1E 1W 2 2 2 2 2 2 Before: src -> dest F D E E E W 3F 3D 3E 3E 3E 3E 3W 4 4 4 4 4 4 4 4 (1) load (r10), r21 F D E E E E E W 5 5 5 5 5 5 5 5 5 (2) mov r21, r31 (2 depends on 1) F D E E E E E E W 1 2 3 4 5 6 7 8 9 (3) load a, r11 In Order Processing (4) mov r11, r22 (4 depends on 3)

(5) mov r22, r23 (5 depends on 4) 12 1D 1E 1E 1W

2F 2D 2E 2E 2E 2W 3 3 3 3 3 3 After: F D E E w W After: 4F 4D 4E 4E 4E 4W 5 5 5 5 5 5 5 (1) load (r10), r21; (3) load a, r11; F D E E E E W 1 2 3 4 5 6 7 8 9 Out of Order Processing (2) mov r21,r31; (4) mov r11,r22; (5) mov r22,r23; In Order vs OOO execution. (5) mov r22,r23; Assuming: - Unlimited resources z Usually highly superscalar - 2 cycles load latency ® 11 G-Number CacheCache -- MotivationMotivation && PrinciplePrinciple Added z Memory consumption is growing about 2X every 2 years – Typical size: (Y2000) 64M-128M, (Y2002) 128M-256M z CPU speed grows faster than memory and buses – CPU/ grew from 1:1 to 6:1, and still growing 486 Pentium P-II P-III P4 25-66MHz 66-233MHz 200-450MH 0.5-1.33GHz 1.4-2.4GHz 33MHz 66MHZ 66-100MHz 133-200MHz 400MHz – Memory: DRAM: 60-100ns (“10-16MHz”), Cost: <10$ per 1M SRAM is faster but much more expensive Bandwidth: SDRAM 100-133MHz*8B; DDR 200-400MHz*8B; RDR 800MHz*2B ÖMemory becomes the bottleneck for both instructions and data! Slow or expensive z Solution: Cache - A Small, Fast, Close memory – Serves as a buffer between CPU and main memory z Contains copy of a portion of the main memory CPU Cache Memory – Small in size – Dynamically changed Small, fast Large, Slow z Exploit space and time locality: Close, Expensive Far, Cheap – Code is fetched sequentially (Space) – Code is re-executed (loops, procedures) (Time) – Access close or previous data (Space, Time)

® 12 G-Number BranchBranch PredictionPrediction z GoalGoal -- ensureensure enoughenough instructioninstruction supplysupply byby correctcorrect prefetchingprefetching z upup toto 486486 -- prefetcherprefetcher assumedassumed fallfall--throughthrough – Lose on unconditional branch (e.g., call) – Lose on frequently taken branches (e.g., loops) z BranchBranch predictionprediction – Predict branch taken/not taken – Predict the branch target address ?

® 13 G-Number BranchBranch PredictionPrediction (cont.)(cont.) z ImplementationImplementation – Use history (private or global) to predict direction (simple Lee&Smith, advanced Yeh&Patt) – Target address taken from table (faster) or from the instruction (slower) – Table updated first based on prediction, later based on actual execution. z MispredictionMisprediction costcost variesvaries (high(high onon PPro)PPro) z CurrentCurrent predictionprediction rate:rate: ~92%~92% -- 95%95% ~60~60--100100 instructionsinstructions betweenbetween mispredictionsmispredictions**

* Assuming branch every 5 instructions

® 14 G-Number SpeculativeSpeculative ExecutionExecution

z ExecutionExecution ofof instructionsinstructions fromfrom aa predictedpredicted (yet(yet unsure)unsure) path.path. Eventually,Eventually, pathpath maymay turnturn wrong.wrong. z Advantages:Advantages: – Ensure instruction supply – Allow scheduling window z Issues:Issues: – Misprediction cost – Misprediction recovery

® 15 G-Number DynamicDynamic SchedulingScheduling

z SchedulingScheduling instructionsinstructions atat runrun time,time, byby thethe HW,HW, andand notnot atat compilecompile time,time, byby thethe SWSW z Advantages:Advantages: – Works on the dynamic instruction flow: Can schedule across procedures, modules... – Can see dynamic values (memory addresses) – Can accommodate varying latencies z DisadvantagesDisadvantages – Can schedule within a limited window only – Should be fast - cannot be too smart

® 16 G-Number DataData DependencyDependency

(1) load R2,(R1) [format: op dest, src] (2) mov R3,R2 (3) mov R1,a (4) mov R2,R3 (5) mov R2,R1 z TrueTrue dependencydependency (R2:1>2, R3:2>4, R1:3>5) z FalseFalse dependenciesdependencies – Anti dependency (R1:1>3, R2: 2>4) – Output dependency (R2:1>4,R2:4>5)

® 17 G-Number RegisterRegister RenamingRenaming before after mapping (1) load R2,(R1) load r21,(r10) [R2->r21] (2) mov R3,R2 mov r31,r21 [R3->r31] (3) mov R1,a mov r11,a [R1->r11] (4) mov R2,R3 mov r22,r31 [R2->r22] (5) add R2,R1 add r23,r11,r22 [R2->r23] z RemoveRemove falsefalse dependenciesdependencies z RemoveRemove architecturearchitecture limitlimit forfor ## ofof regsregs z HelpHelp speculativespeculative executionexecution – Renamed register are kept until speculation is verified to be correct

® 18 G-Number OutOut OfOf OrderOrder ExecutionExecution z ExecuteExecute instructionsinstructions basedbased onon ““datadata flowflow”” ratherrather thanthan programprogram orderorder Before: (1) load r21,(r10) (2) mov r31,r21 (2 depends on 1) (3) mov r11,a (4) mov r22,r31 (4 depends on 2) (5) mov r23,r11 (5 depends on 3) After: (1) loadr21,(r10); (3) mov r11,a; (2) mov r31,r21; (5) mov r23,r11; (4) mov r22,r31;

® 19 G-Number OutOut OfOf OrderOrder (cont.)(cont.) z AdvantagesAdvantages – Help exploit Instruction Level Parallelism (ILP) – Help cover latencies (e.g., cache miss, divide) – Superior/complementary to compiler scheduler – Dynamic instruction window – Can use more than 8 registers z ComplexComplex microarchitecturemicroarchitecture – Complex scheduler – Requires reordering mechanism (retirement) in the back-end for: – Precise interrupt resolution – Misprediction/speculation recovery – Memory ordering ® 20 G-Number ReRe--orderorder BufferBuffer (ROB)(ROB) z MechanismMechanism forfor renamingrenaming andand retirementretirement z TableTable containscontains inin--orderorder instructionsinstructions – Instructions are entered in order – Registers renamed by the entry# – Once assigned: execution order unimportant – After execution: entries marked “executed” – An executed entry can be “retired” once all prior instruction have retired. That is: – Update “real registers” with value of renamed regs – Update memory – Leave the ROB

® 21 G-Number ReservationReservation Station(s)Station(s) z Pool(s)Pool(s) ofof allall ““notnot yetyet executedexecuted”” instructionsinstructions z MaintainsMaintains operandsoperands statusstatus ““readyready//notnot--readyready”” z EachEach cycle,cycle, executedexecuted instructionsinstructions makemake moremore operandsoperands ““readyready”” z InstructionsInstructions whosewhose allall operandsoperands areare ““readyready”” cancan bebe ““dispatcheddispatched”” forfor executionexecution z DispatcherDispatcher chooseschooses whichwhich ofof thethe ““readyready”” instructionsinstructions willwill bebe executedexecuted next.next.

® 22 G-Number MemoryMemory OrderOrder BufferBuffer (MOB)(MOB) z Idea - allow out of order among memory operations z Problem- Memory dependencies cannot fully resolved statically (memory disambiguation) – store r1,a; load r2,b => can advance load before store – store r1,[r3]; load r2,b => load should wait till r3 is known z Structure similar in concept to ROB z Every access is allocated an entry. Address & data (for stores), are updated when known z Load is checked against all previous stores: – Waits if store to same address exist, but data not ready – If store data exists, just use it – Waits if store to unknown address exists – No address collision - go to memory ® 23 G-Number DynamicDynamic ExecutionExecution z CombinationCombination ofof threethree techniques:techniques: z MultipleMultiple BranchBranch predictionprediction z OutOut OfOf OrderOrder execution:execution: DataflowDataflow analysisanalysis z SpeculativeSpeculative ExecutionExecution

HowHow doesdoes thisthis machinemachine reallyreally work?work?

® 24 G-Number OOOOOO demodemo z TheThe "Pentium"Pentium®® ProPro ProcessorProcessor MicroarchitectureMicroarchitecture Overview"Overview" tutorialtutorial – http://developer.intel.com/vtune/cbts/cbts.htm – http://developer.intel.com/vtune/cbts/pproarch/clikngo.htm

® 25 G-Number MicroMicro OperationsOperations (Uops)(Uops)

z Each “CISC” inst is broken into one or more uops – Simplicity: – Each uop is (relatively) simple – Canonical representation of src/dest (3 src, 2 dest) – Increased ILP e.g., pop eax becomes esp1<-esp0+4, eax1<-[esp0] z Typical uop count (it is not necessarily cycle count!) Reg-Reg ALU/Mov inst: 1 uop Mem-Reg Mov (load) 1 uop Mem-Reg ALU (load + op) 2 uops Reg-Mem Mov (store) 2 uops (st addr, st data) Reg-Mem ALU (ld + op + st) 4 uops Varies z MainlyMainly anan X86X86 artifactartifact

® 26 G-Number CPUCPU MicroarchitectureMicroarchitecture

External Bus L2 z In-Order Front End – BIU: Bus Interface Unit – IFU: Inst. Fetch Unit (includes IC) MOB – BTB: Branch Target Buffer BIU – ID: Instruction Decoder DCU – MIS: Micro-Instruction Sequencer – RAT: Register Alias Table MIU z Out-of-order Core IFU – ROB: Reorder Buffer AGU – RRF: Real R – RS: Reservation Stations BTB S IEU – IEU: Integer – FEU: Floating-point Execution Unit I FEU – AGU: Address Generation Unit MIS – MIU: Memory Interface Unit D MIS – DCU: Data Cache Unit ROB – MOB: Memory Order Buffer RAT – L2: Level 2 cache z In-Order Retire ® 27 G-Number MicroarchitectureMicroarchitecture

External Bus L2 L2 z In-Order Front End

MOB BIU z BTB: predicts the address of the DCU next instruction to be fetched z IFU: fetches bytes from the MIU MIU instruction cache (or L2, or IFU memory) AGU z ID: Decodes instructions and R R converts them to uops (up to 3 BTB IEU BTB S IEU uops/cycle). I FEU z MIS: Produces uops for complex MIS instructions. D MIS z RAT: Register Alias Table ROB RAT

® 28 G-Number BranchBranch PredictionPrediction z Implementation – Use local history to predict direction – Need to predict multiple branches ⇒ Need to predict branches before previous branches are resolved ⇒ Branch history updated first based on prediction, later based on actual execution (speculative history). – Target address taken from BTB z Prediction rate: ~92% – ~60 instructions between mispredictions (assuming 1 branch per 5 inst. on average) – High prediction rate is very crucial for long pipelines – Especially important for OOOE, : – On misprediction all instructions following the branch in the instruction window are flushed – Effective size of the window is determined by prediction accuracy. z RSB used for Call/Return pairs z Totally re-done on Banias!

® 29 G-Number TheThe P6P6 BTBBTB

z 2-level, local histories, per-set counters z 4-way set associative: 512 entries in 128 sets

Prediction bit 1001 0

counters IP V Tag ofst T Target Hist P LRR 9 Pred= msb of 128 1 9 4 2 32 4 1 2 32 counter sets 15 Branch Type 00- cond 01- ret 10- call 11- uncond

Return Stack Way 0 1 2 3 Per-Set y y y Buffer a a a W W W

z Up to 4 branches can have a tag match ® 30 G-Number MicroarchitectureMicroarchitecture

External Bus L2 z In-Order Front End

MOB BIU DCU IA instrs z Determine where each IA instruction starts MIU z Direct the bytes of each IFU alignment inst.inst. to the decoder AGU z Convert instructions into R uops. BTB IEU S z Buffers up to 6 uops: D0 D1 D2 Enables decoders keep I FEU working even when next MIS D MIS pipe stages are stalled.

ROBuop0 uop1 uop2 RAT

® 31 G-Number AllocAlloc && RATRAT

z The Allocator (Alloc) assigns each uop an entry number in the ROB z The Register Alias Table (RAT) maps the 8 IA architectural registers into the physical registers z Work together to perform the register allocation and renaming z Are able to work on up to 3 uops per clock z The Alloc also allocates Load & Store buffers in the MOB z Special treatment for FP stack renaming

® 32 G-Number MicroarchitectureMicroarchitecture

External Bus L2 L2 z In-Order Front End

MOB BIU DCU

MIU IFU AGU R BTB S IEU

I FEU MIS D MIS

ROB RAT ® 33 G-Number MicroarchitectureMicroarchitecture

External Bus L2 z Out-of-order Core

z ROB: Mechanism for renaming and MOB retirement BIU – 40 entries that hold instructions in- DCU – 40 entries that hold instructions in- order. MIU z RS: pool of all “not yet executed” IFU instructions (up to 20) AGU z Execution Units R – IEU: Integer Execution Unit BTB S IEU – FEU: Floating-point Execution Unit z Memory related units I FEU – AGU: Address Generation Unit MIU: MIS D MIS Memory Interface Unit – DCU: Data Cache Unit ROB – MOB: Orders Memory operations

® RAT – L2: Level 2 cache ® 34 G-Number ReRe--orderorder BufferBuffer (ROB)(ROB) z MechanismMechanism forfor renamingrenaming andand retirementretirement z BasicBasic ROBROB functionsfunctions – Provide large physical register space for – Buffer results of speculative execution in a 40 entry physical register file – Commit architectural state only after speculation (branch, exception) has resolved – Detect exceptions and mispredictions and initiate repair to get machine back on right track – Holds also the “Real Register File” - RRF

® 35 G-Number ReRe--orderorder BufferBuffer (ROB)(ROB) -- contcont z UopUop flowflow throughthrough thethe ROBROB – Uops are entered in order – Registers renamed by the entry# – Once assigned: execution order unimportant – After execution: entries marked “executed” and wait for retirement – An executed entry can be “retired” once all prior instruction have retired. That is: – Update “real registers” with value of renamed registers – Update memory – Leave the ROB z OnlyOnly 22 readread portsports (for(for upup toto 66 operands)operands)

® 36 G-Number ReservationReservation stationstation (RS)(RS) z Pool of all “not yet executed” uops (up to 20) – Holds the uop attributes and the uop source data z Maintains operands status “ready/not-ready” – Each cycle, executed uops make more operands “ready” – Uops whose all operands are ready can be dispatched for execution – Dispatcher chooses which of the ready uops to execute next z Responsible for: – Holding the uop till it is dispatched – Monitoring the WB bus to capture data needed by awaiting uop – Bypass control of data from WB bus directly to execution unit – Schedule the next uops – Dispatch uops to functional units – Arbitrate the WB busses between the units

® 37 G-Number MemoryMemory OrderOrder BufferBuffer (MOB)(MOB)

z Goal - allow out-of-order among memory operations z Problem- Memory dependencies cannot be fully resolved statically (memory disambiguation) – store r1,a; load r2,b ⇒ can advance load before store – store r1,[r3]; load r2,b ⇒ load should wait till r3 is known z Structure similar in concept to ROB z Every memory uop is allocated an entry in order. z Address & data (for stores), are updated when known z Loads may pass loads/stores z Stores are in order

® 38 G-Number MemoryMemory OrderOrder BufferBuffer (MOB)(MOB) z LoadLoad isis checkedchecked againstagainst allall previousprevious stores:stores: – Waits if store to same address exist, but data not ready – If store data exists, just use it – Waits till all previous store addresses are resolved – In case of no address collision - go to memory

® 39 G-Number MicroarchitectureMicroarchitecture

External Bus L2 SHF FMU MOB FDI IDI BIU V FAV DCU Port 0 IEUU MIU IFU JEU IEU AGU Port 1 R AGU BTB S IEU Port 2 LDA

I FEU Port 3,4 AGU MIS STA D MIS

ROB RAT

® 40 G-Number MicroarchitectureMicroarchitecture

External Bus L2

MOB BIU z Out-of-order Core DCU

MIU IFU AGU R BTB S IEU

I FEU MIS D MIS

ROB RAT ® 41 G-Number MicroarchitectureMicroarchitecture

External Bus L2

MOB BIU DCU zIn-Order Retire

MIU IFU AGU R BTB S IEU

I FEU MIS D MIS

ROB RAT ® 42 G-Number InIn orderorder RetirementRetirement

z The process of committing the results to the architectural state ofof thethe processorprocessor z Retires up to 3 uops per clock z Copies the values to the RRF z Retirement is done In Order z Performs exception checking z An instruction is retired after the following checks – Instruction has executed – All previous instructions have retired – Instruction isn’t mis-predicted – no exceptions

® 43 G-Number FlowFlow ofof UopsUops throughthrough OOOOOO ClusterCluster z ISSUE: – ALLOC unit allocates one entry per uop in the RS and in the ROB (for up to 3 uops per cycle) – If source data is available from the ROB (either from the RRF of from the Result Buffer (RB) it is written in the RS entry – Otherwise, it is marked invalid in the RS (and should be captured from the WB bus) z READY/SCHEDULE: – Data-ready uops are checked to see if desired functional unit available – Up to 5 resource-ready uops are selected, and dispatched per clock z DISPATCH: – Ship scheduled uops to appropriate functional unit (RS) z WRITEBACK: – Capture results returned by the functional units in a result buffer (ROB) – Snoop result writeback ports for results that are sources to uops in RS – Update data-ready status of these uops (RS)

® 44 G-Number FlowFlow ofof UopsUops throughthrough OOOOOO ClusterCluster (cont)(cont)

z RETIREMENT: – 3 consecutive entries read out of the ROB – these entries are candidates for retirement – Algorithm to determine fitness for retirement: candidate is retired – its ready bit is set – it will not cause an exception – all preceding candidates are eligible for retirement – Commit results from result buffer to architecturally visible state in original “Issue” order – Clear machine and restart execution if “badness” occurs (ROB)

® 45 G-Number JumpJump MispredictionMisprediction z When the JEU detects jump misprediction it – Flushes the in-order front-end and starts fetching and decoding from the “correct” execution path – The “correct” path may not be correct if a preceding uop that hasn’t executed yet could cause an exception – Therefore, the “correct” instruction stream is stalled at the RAT until the mispredicted branch retires – Does not flush the OOO part of the machine, since all instructions preceding the jump must be completed and retired z When the mispredicted branch retires from the ROB – Resets all state in the Out-of-Order Engine (RS, RB, MOB, etc.) – Un-stalls the in-order machine – The RS immediately grabs the correct uops from the RAT and starts scheduling and dispatching them

® 46 G-Number PipelinePipeline Next Reg RS Icache Decode IP Ren Wr

I1 I2 I3 I4 I5 I6 I7 I8

• In-Order Front End RS disp Ex

• Out-of-order Core O1 O2 O3 1: Next IP 2: ICache lookup 3: IC2 /ILD (instruction length decode) Retirement 4: IC3/rotate 5: ID1 • In-order Retirement R1 R2 R3 6: ID2 7: RAT- rename sources, ALLOC-assign destinations 8: ROB-read sources RS-schedule data-ready uops for dispatch 9: RS-dispatch uops 10:EX ® 47 11:Retirement G-Number OOOOOO ConceptConcept ExampleExample

z Lets follow this code: program counter Instruction [format: op src, dest] n mov r4,r1 n+1 add r1,r2 Cycle 1 decode n+2 mov M2,r1 n+3 add r1,r3 n+4 jmp L2 n+5 add r3,r4 Cycle 2 decode n+6 mov M3,r1 n+7 add r1,r4 n+8 dec r5 Cycle 3 decode z Every cycle, 4 instructions are decoded

A demonstrating example, where RS and Rob are combined ® 48 G-Number CodeCode ExampleExample (rename(rename && SchedSched)) Lets follow this code: PC Instructions After Renaming Execution n mov r4,r1 r4, r1_1 D E W n+1 add r1,r2 r1_1, r2, r2_1 D E W n+2 mov M2,r1 M2, r1_2 D E W n+3 add r1,r3 r1_2, r3, r3_1 D E W n+4 jmp L2 … D E W n+5 add r3,r4 r3_1, r4, r4_1 D E W n+6 mov M3,r1 M3, r1_3 D E W n+7 add r1,r4 r1_3, r4_1, r4_2 D E W n+8 dec r5 r5, r5_1 D E W cycle:0 1 2 3 4 5 6 z Every cycle, 4 instructions are decoded

® 49 G-Number Implementation Example Implementation Example Op Src, Dest CycleCycle 11 Icache PC Instruction nmovr4,r1 Decode & Issue Logic n+1 add r1,r2

Destination n+2 mov M2,r1 PC Instruction tag S1 val tag S2 val Exec Valid Tag Value n+3 add r1,r3 E 0 0 n+4 jmp L2 X 0 0 E 0 0 n+5 add r3,r4 C 0 0 n+6 mov M3,r1 U T 0 0 n+7 add r1,r4 I n+3 Add R1,R3 R1_2 R3 zzz R3_1 0 1 n+8 dec r5 O n+2 Mov M2,R1 M2 R1_2 0 1 N n+1 Add R1,R2 R1_1 R2 yyy R2_1 0 1 n Mov R4,R1 R4 ttt R1 xxx R1_1 0 1 Commit

R1 xxx Reg R2 yyy • 4 instructions entered into Reservation File R3 zzz R4 ttt stations/ Register renaming

• 2 copies of R1 alive - tags connected

• Available values read from committed reg file

® 50 G-Number ImplementationImplementation ExampleExample (Cont(Cont……)) CycleCycle 22 Op Src, Dest

Icache PC Instruction nmovr4,r1 Decode & Issue Logic n+1 add r1,r2

Destination n+2 mov M2,r1 PC Instruction tag S1 val tag S2 val Exec Valid Tag Value n+3 add r1,r3 E 0 0 n+4 jmp L2 X n+7 Add R1,R4 R1_3 R4_1 R4_2 0 1 E n+5 add r3,r4 n+6 Mov M3,R1 M3 R1_3 0 1 C ttt 0 1 n+6 mov M3,r1 U n+5 Add R3,R4 R3_1 R4 R4_1 T n+4 jmp L2 0 1 n+7 add r1,r4 I n+3 Add R1,R3 R1_2 sss R3 zzz R3_1 0 1 n+8 dec r5 O n+2 Mov M2,R1 M2 R1_2 sss 1 1 N n+1 Add R1,R2 R1_1 ttt R2 yyy R2_1 0 1 n Mov R4,R1 R4 ttt R1 xxx R1_1 ttt 1 1 Commit • 4 new instructions entered into Reservation R1 xxx stations and renamed (N+5, N+7) Reg R2 yyy File R3 zzz • N & N+2 execute - results go to N+1 & N+3 R4 ttt R5 vvv • N will commit next cycle and its value go to real R1

® 51 G-Number ImplementationImplementation ExampleExample (Cont(Cont……)) CycleCycle 33 Op Src, Dest

Icache PC Instruction nmovr4,r1 Decode & Issue Logic n+1 add r1,r2

Destination n+2 mov M2,r1 PC Instruction tag S1 val tag S2 val Exec Valid Tag Value n+3 add r1,r3 E 0 0 n+4 jmp L2 X n+8 Dec R5 R5 vvv R5_1 0 1 E n+7 Add R1,R4 R1_3 jjj R4_1 R4_2 0 1 n+5 add r3,r4 C n+6 Mov M3,R1 M3 jjj 1 1 n+6 mov M3,r1 U R1_3 T n+5 Add R3,R4 R3_1 lll R4 ttt R4_1 0 1 n+7 add r1,r4 I n+4 jmp L2 1 1 n+8 dec r5 O n+3 Add R1,R3 R1_2 sss R3 zzz R3_1 lll 1 1 N n+2 Mov M2,R1 M2 R1_2 sss 1 1 n+1 Add R1,R2 R1_1 ttt R2 yyy R2_1 kkk 1 1 Commit • N+8 added after decoding R1 ttt Reg R2 yyy • N+1, N+3, N+4 & N+6 execute - results go to R3 zzz File N+5 & N+7 R4 ttt • N+7 is not ready it still needs N+5 R5 vvv • N+1 -> N+4 will commit next cycle and their values written to real reg file. ® 52 G-Number ImplementationImplementation ExampleExample (Cont(Cont……)) CycleCycle 44 Op Src, Dest

Icache PC Instruction nmovr4,r1 Decode & Issue Logic n+1 add r1,r2

Destination n+2 mov M2,r1 PC Instruction tag S1 val tag S2 val Exec Valid Tag Value n+3 add r1,r3 E 0 0 X 0 0 n+4 jmp L2 E 0 0 n+5 add r3,r4 C 0 0 n+6 mov M3,r1 U T 0 0 n+7 add r1,r4 I n+8 Dec R5 R5 vvv R5_1 ppp 1 1 n+8 dec r5 O n+7 Add R1,R4 R1_3 jjj R4_1 ggg R4_2 0 1 N n+6 Mov M3,R1 M3 R1_3 jjj 1 1 n+5 Add R3,R4 R3_1 lll R4 ttt R4_1 ggg 1 1 Commit

R1 sss Reg R2 kkk • N+5 & N+8 execute - results go to N+7 File R3 lll R4 ttt • N+5 & 6 will commit next cycle and their R5 vvv values written to real reg file.

® 53 G-Number ImplementationImplementation ExampleExample (Cont(Cont……)) CycleCycle 55 Op Src, Dest

Icache PC Instruction nmovr4,r1 Decode & Issue Logic n+1 add r1,r2

Destination n+2 mov M2,r1 PC Instruction tag S1 val tag S2 val Exec Valid Tag Value n+3 add r1,r3 E 0 0 X 0 0 n+4 jmp L2 E 0 0 n+5 add r3,r4 C 0 0 n+6 mov M3,r1 U T 0 0 n+7 add r1,r4 I 0 0 n+8 dec r5 O 0 0 N n+8 Dec R5 R5 vvv R5_1 ppp 1 1 n+7 Add R1,R4 R1_3 jjj R4_1 ggg R4_2 ooo 1 1 Commit

R1 jjj Reg R2 kkk File R3 lll R4 ggg R5 vvv

® 54 G-Number ImplementationImplementation ExampleExample (Cont(Cont……)) CycleCycle 66 Op Src, Dest

Icache PC Instruction nmovr4,r1 Decode & Issue Logic n+1 add r1,r2

Destination n+2 mov M2,r1 PC Instruction tag S1 val tag S2 val Exec Valid Tag Value n+3 add r1,r3 E 0 0 X 0 0 n+4 jmp L2 E 0 0 n+5 add r3,r4 C 0 0 n+6 mov M3,r1 U T 0 0 n+7 add r1,r4 I 0 0 n+8 dec r5 O 0 0 N 0 0 0 0 Commit

R1 jjj Reg R2 kkk File R3 lll R4 ooo R5 ppp

® 55 G-Number RegisterRegister RenamingRenaming exampleexample

® 56 G-Number AnAn OOOEOOOE ExampleExample -- IssueIssue add EAX, EBX Î EAX Entry Uopcode Src1/Src2 V Psrcs Pdst subA nEAX,41 O-4 O-Î ECXO Example Valid 229 1 xxx 0->1 add 42 ISSUE 522 1 35 / 2 h 1 e e r c t e 1 2 t s de de u m e e h to r a c c o o a c p o D c c e ta s tir tir n a l o ob/ e e r re Ic IL de de al r di ex r r RS

add EAX, EBX, EAX

Data V Ldst

19 RB/RRF pointer RRFV 229 1 EAX EAX EAX --> 42 1 --> 0 35 EBX 522 1 EBX 35 0 RB EC X 42 xxx 0 EAX 57

Data EA X RAT 229 EBX xxx, 35, 42 312 RRF add EAX, EBX, EAX EC X RRFV ROB

® 57 G-Number AnAn OOOEOOOE ExampleExample -- DispatchDispatch READY/SCHEDULE/DISPATCH

Entry Uopcode Src1/Src2 V Psrcs Pdst add 229, 522, pdst=42 Valid 229 1 xxx add EAX, EBX Î EAX 1->0 add 42 sub EAX,414 Î ECX 522 1 35 / 1 2 h e e r e e c t 1 2 e t s o d d m h r t c re re o o a /

c RS a i i o c c ecu D n b t spa l a t t x Ic IL ro re re re de de al ro di e

Data VLdst 19 RB/RRF pointer RRFV 229 1 EAX EAX 42 0 35 EBX 522 1 EBX 35 0 RB EC X 42 xxx 0 EAX 57

Data EA X RAT 229 EBX 312 RRF EC X

ROB

® 58 G-Number AnAn OOOEOOOE ExampleExample -- ExecuteExecute WRITEBACK

result=751, pdst=42 Entry Uopcode Src1/Src2 V Psrcs Pdst Valid add EAX, EBX Î EAX sub EAX,414 Î ECX

xxx -> 751 0->1 42 0->1 sub 57 414 1 45 / 1 2 e e r ch t 1 2 e s u m h r at to c re re ode ode c a /

c RS p i i o D e c c n ta b l s i o Ica IL r re ret ret de de al ro d ex

Data V Ldst 19 RB/RRF pointer RRFV 229 1 EAX EAX 42 0 35 EBX 522 1 EBX 35 0 RB EC X 42 zzz --> 57 0 xxx --> 751 0->1 EAX 57 yyy 0 ECX

Data RAT EA X 229 EBX 312 RRF EC X

ROB

® 59 G-Number AnAn OOOEOOOE ExampleExample -- RetireRetire RETIREMENT

Entry Uopcode Src1/Src2 V Psrcs Pdst Valid

add EAX, EBX Î EAX sub EAX,414 Î ECX 751 1 42 1sub 57 / 1 2 e e 414 1 45 r ch t 1 2 e s o de de m h e e r at t c o o a / c a p o ecu c c D n b t l s a tir tir x i RS e e re r r Ic IL ro al ro d e de de

Data V Ldst 19 RB/RRF pointer RRFV 229 1 EAX EAX 42 0->1 35 EBX 522 1 EBX 35 0->1 RB EC X 42 57 0 751 1 EAX 57 yyy 0 ECX

Data RAT EA X 229->751 EBX 312->522 RRF EC X

ROB

® 60 G-Number HopeHope youyou likedliked that...that...

® 61 G-Number SWSW AspectsAspects z ““FlakyFlaky”” (unpredictable)(unpredictable) branchesbranches z PartialPartial StallsStalls z DataData mismis--alignmentalignment

® 62 G-Number PentiumPentium--MM ProcessorProcessor (Banias)(Banias) z MostMost significant,significant, majormajor improvementimprovement overover P6P6 architecturearchitecture everever z KeyKey µµarcharch changeschanges – New, improved branch prediction – BLG, loop predictor, indirect branch predictor – “Stack manager” – reduce >5% of uops – Uop fusion – reduce >10% of uops – P4 bus, bigger caches z AA lotlot ofof powerpower--awareaware techniquestechniques allall overover ® 63 G-Number