P6: Microarchitecture

TheThe PentiumPentium®® II/IIIII/III ProcessorProcessor ““CompilerCompiler onon aa ChipChip”” Ronny Ronen Senior Principal Engineer Director of Architecture Research Intel Labs - Haifa Intel Corporation Tel Aviv University January 18, 2005 ® 1 G-Number AgendaAgenda z Goal,Goal, ExpectationsExpectations…… z GeneralGeneral InformationInformation z µµarchitecurearchitecure basicsbasics ® z PentiumPentium ProPro ProcessorProcessor µµarchitecurearchitecure z SWSW aspectsaspects ® 2 G-Number TechnologyTechnology ProfileProfile Pentium Pro - 1995 Pentium-II - 1998 Pentium-III - 1999 z Core @200MHz z Core @333MHz z Core @600MHz z 256K L2 on package, z 512KB L2 in SEC z 512KB L2 @200MHz @167MHz @ ???MHz z Performance: z Performance: z Performance: 8.09 SPECint95 12.8 SPECint95 24.0 SPECint95 6.70 SPECfp95 9.14 SPECfp95 15.9 SPECfp95 z 0.35 µm BiCMOS (P55C: 7.12/5.21) z 5.5M transistors z 0.25 µm CMOS z 0.25 µm CMOS process process z 195 sq mm (14x14) z 7.5M transistors z ???M transistors z 3.3V, 11.2A z 28.1W / 35.0W ® 3 G-Number TechnologyTechnology ProfileProfile (cont.)(cont.) z Pentium-III – 2000 (Coppermine) z Pentium-III - 2002 (Tualatin) z Core @1000MHz z Core @1400MHz z 256KB L2 on chip @ 1000MHz z 512KB L2 on chip @ 1400MHz z Performance: z Performance (estimated): >46 SPECint95 >60 SPECint95 >20 SPECfp95 >30 SPECfp95 z 0.18 µm CMOS process z 0.13 µm CMOS process z ~20M transistors z ~44M transistors z Pentium M Processor 2003 (Banias) z Pentium M Processor 2004 (Dothan) z Core @1800MHz z Core @2000MHz z 1024KB L2 on chip @ 1800MHz z 2048KB L2 on chip @ 2000MHz z Performance (estimated): z Performance (estimated): >80 SPECint95 >90 SPECint95 (est.) >50 SPECfp95 >60 SPECfp95 (est.) z 0.13 µm CMOS process z 0.09 µm CMOS process z ®~77M transistors z ~127M transistors 4 G-Number TerminologyTerminology z IntelIntel ArchitectureArchitecture z Pipeline,Pipeline, SuperSuper ScalarScalar z BranchBranch PredictionPrediction z SpeculativeSpeculative ExecutionExecution z DynamicDynamic SchedulingScheduling z DataData dependencydependency z RegisterRegister RenamingRenaming z OutOut OfOf OrderOrder z ReRe--orderorder BufferBuffer && MemoryMemory OrderOrder BufferBuffer z ReservationReservation StationsStations z MicroMicro--OperationsOperations Skip to µarch ® 5 G-Number IntelIntel ArchitectureArchitecture (X86)(X86) z 88 registersregisters onlyonly – Can be partially accessed ⇒Many memory accesses, short life time z OneOne setset ofof conditioncondition codescodes – Modified by most ALU operations – Various operations affect various flags ⇒Short Generate/Use distance z ExplicitExplicit stackstack – Push/Pop/Call/Return operations z VariableVariable lengthlength instructionsinstructions – Implicit operands ® 6 G-Number PipelinePipeline z BreakBreak thethe workwork toto smallersmaller piecespieces 0 1 2 3 4 5 6 7 8 9 10 11 12 I1 1/4 IPC = 4 CPI I2 I3 F D E W F D E W 1 IPC = 1 CPI F: Fetch F D E W D: Decode F D E W F D E W E: Execute F D E W W: Write Back z IncreasedIncreased throughputthroughput – increased # of completed instructions per cycle – Number of stages varies – Small: 4-5 (Pentium), “Superpipeline” ~14 (Pentium Pro) ® 7 G-Number PipelinePipeline StallsStalls z ButBut therethere areare ““stallsstalls”” inin thethe pipelinepipeline – Data flow dependency (instructions output/input) – Solved by: bypasses, renaming – Control flow dependencies – Solved by branch prediction – Other (Cache misses, long latency instructions) 0 1 2 3 4 5 6 7 8 9 10 11 12 F D E W F D stall E W F: Fetch F D stall E W D: Decode stall stall stall F D E W E: Execute F stall stall D E W W: Write Back Data Flow stall Control Flow stall Address Generation Interlock ® 8 G-Number SuperScalarSuperScalar z PerformsPerforms moremore inin aa singlesingle cyclecycle 0 1 2 3 4 5 6 7 8 9 10 11 12 F D E W F D E W F D E W F D E W F D E W F D E W F D E W 2 IPC = 1/2 CPI F D E W F D E W F D E W F D E W F D Stall E W z Ideally,Ideally, cancan multiplymultiply thethe throughputthroughput – but stall penalty is increased ® 9 G-Number SuperSuper PipelinePipeline z Split to shorter stages - allows higher frequency Old clk = 0 1 2 3 4 5 6 7 8 9 10 11 12 New clk = 0123456789101112 F1 F2 D1 D2 D3 E1 W1 W2 F1 F2 D1 D2 D3 E1 W1 W2 1 IPC = 1 CPI F1 F2 D1 D2 D3 E1 W1 W2 33% higher freq! F1 F2 D1 D2 D3 E1 W1 W2 F: Fetch F1 F2 D1 D2 D3 E1 W1 W2 D: Decode F1 F2 D1 D2 D3 E1 W1 W2 E: Execute F1 F2 D1 D2 D3 E1 W1 W2 W: Write Back F1 F2 D1 D2 D3 E1 W1 W2 z Ideally, can (again) multiply the throughput, but – Stall penalties do not scale (e.g., control flow stall, cache misses) – Clock setup/hold reduces the amount of net cycle time more - each instruction takes longer! Ö In the example above: 2X stages, but performance gain is <<33% ® 10 G-Number OutOut OfOf OrderOrder ExecutionExecution z So far - instructions were processed in their program order. – Parallelism is limited. z OOO: Instructions are executed based on “data flow” rather than program order 12 1D 1E 1E 1W 2 2 2 2 2 2 Before: src -> dest F D E E E W 3F 3D 3E 3E 3E 3E 3W 4 4 4 4 4 4 4 4 (1) load (r10), r21 F D E E E E E W 5 5 5 5 5 5 5 5 5 (2) mov r21, r31 (2 depends on 1) F D E E E E E E W 1 2 3 4 5 6 7 8 9 (3) load a, r11 In Order Processing (4) mov r11, r22 (4 depends on 3) (5) mov r22, r23 (5 depends on 4) 12 1D 1E 1E 1W 2F 2D 2E 2E 2E 2W 3 3 3 3 3 3 After: F D E E w W After: 4F 4D 4E 4E 4E 4W 5 5 5 5 5 5 5 (1) load (r10), r21; (3) load a, r11; F D E E E E W <wait for loads to complete> 1 2 3 4 5 6 7 8 9 Out of Order Processing (2) mov r21,r31; (4) mov r11,r22; (5) mov r22,r23; In Order vs OOO execution. (5) mov r22,r23; Assuming: - Unlimited resources z Usually highly superscalar - 2 cycles load latency ® 11 G-Number CacheCache -- MotivationMotivation && PrinciplePrinciple Added z Memory consumption is growing about 2X every 2 years – Typical size: (Y2000) 64M-128M, (Y2002) 128M-256M z CPU speed grows faster than memory and buses – CPU/Bus grew from 1:1 to 6:1, and still growing 486 Pentium P-II P-III P4 25-66MHz 66-233MHz 200-450MH 0.5-1.33GHz 1.4-2.4GHz 33MHz 66MHZ 66-100MHz 133-200MHz 400MHz – Memory: DRAM: 60-100ns (“10-16MHz”), Cost: <10$ per 1M SRAM is faster but much more expensive Bandwidth: SDRAM 100-133MHz*8B; DDR 200-400MHz*8B; RDR 800MHz*2B ÖMemory becomes the bottleneck for both instructions and data! Slow or expensive z Solution: Cache - A Small, Fast, Close memory – Serves as a buffer between CPU and main memory z Contains copy of a portion of the main memory CPU Cache Memory – Small in size – Dynamically changed Small, fast Large, Slow z Exploit space and time locality: Close, Expensive Far, Cheap – Code is fetched sequentially (Space) – Code is re-executed (loops, procedures) (Time) – Access close or previous data (Space, Time) ® 12 G-Number BranchBranch PredictionPrediction z GoalGoal -- ensureensure enoughenough instructioninstruction supplysupply byby correctcorrect prefetchingprefetching z upup toto 486486 -- prefetcherprefetcher assumedassumed fallfall--throughthrough – Lose on unconditional branch (e.g., call) – Lose on frequently taken branches (e.g., loops) z BranchBranch predictionprediction – Predict branch taken/not taken – Predict the branch target address ? ® 13 G-Number BranchBranch PredictionPrediction (cont.)(cont.) z ImplementationImplementation – Use history (private or global) to predict direction (simple Lee&Smith, advanced Yeh&Patt) – Target address taken from table (faster) or from the instruction (slower) – Table updated first based on prediction, later based on actual execution. z MispredictionMisprediction costcost variesvaries (high(high onon PPro)PPro) z CurrentCurrent predictionprediction rate:rate: ~92%~92% -- 95%95% ~60~60--100100 instructionsinstructions betweenbetween mispredictionsmispredictions** * Assuming branch every 5 instructions ® 14 G-Number SpeculativeSpeculative ExecutionExecution z ExecutionExecution ofof instructionsinstructions fromfrom aa predictedpredicted (yet(yet unsure)unsure) path.path. Eventually,Eventually, pathpath maymay turnturn wrong.wrong. z Advantages:Advantages: – Ensure instruction supply – Allow scheduling window z Issues:Issues: – Misprediction cost – Misprediction recovery ® 15 G-Number DynamicDynamic SchedulingScheduling z SchedulingScheduling instructionsinstructions atat runrun time,time, byby thethe HW,HW, andand notnot atat compilecompile time,time, byby thethe SWSW z Advantages:Advantages: – Works on the dynamic instruction flow: Can schedule across procedures, modules... – Can see dynamic values (memory addresses) – Can accommodate varying latencies z DisadvantagesDisadvantages – Can schedule within a limited window only – Should be fast - cannot be too smart ® 16 G-Number DataData DependencyDependency (1) load R2,(R1) [format: op dest, src] (2) mov R3,R2 (3) mov R1,a (4) mov R2,R3 (5) mov R2,R1 z TrueTrue dependencydependency (R2:1>2, R3:2>4, R1:3>5) z FalseFalse dependenciesdependencies – Anti dependency (R1:1>3, R2: 2>4) – Output dependency (R2:1>4,R2:4>5) ® 17 G-Number RegisterRegister RenamingRenaming before after mapping (1) load R2,(R1) load r21,(r10) [R2->r21] (2) mov R3,R2 mov r31,r21 [R3->r31] (3) mov R1,a mov r11,a [R1->r11] (4) mov R2,R3 mov r22,r31 [R2->r22] (5) add R2,R1 add r23,r11,r22 [R2->r23] z RemoveRemove falsefalse dependenciesdependencies z RemoveRemove architecturearchitecture limitlimit forfor ## ofof regsregs z HelpHelp speculativespeculative executionexecution – Renamed register are kept until speculation is verified to be correct ® 18 G-Number OutOut OfOf OrderOrder ExecutionExecution z ExecuteExecute

P6: Microarchitecture

Intel® Architecture Instruction Set Extensions and Future Features Programming Reference

Inside Intel® Core™ Microarchitecture Setting New Standards for Energy-Efficient Performance

Pentium II Processor Performance Brief

Intel Architecture Optimization Manual

A Superscalar Out-Of-Order X86 Soft Processor for FPGA

Multiprocessing Contents

2Nd Generation Intel® Core™ Processor Family Mobile with ECC

Travelmate P6 Series Product Sheet

The Intel X86 Microarchitectures Map Version 2.0

Intel Xeon Processor Can Be Identified by the Following Values

I386-Based Computer Architecture and Elementary Data Operations

Energy Per Instruction Trends in Intel® Microprocessors