3. The microarchitecture of Intel, AMD and VIA CPUs
An optimization guide for assembly programmers and compiler makers
By Agner Fog. Copenhagen University College of Engineering.
Copyright © 1996 - 2012. Last updated 2012-02-29.

Contents
1 Introduction ... 4
  1.1 About this manual ... 4
  1.2 Microprocessor versions covered by this manual ... 6
2 Out-of-order execution (All processors except P1, PMMX) ... 8
  2.1 Instructions are split into µops ... 8
  2.2 Register renaming ... 9
3 Branch prediction (all processors) ... 11
  3.1 Prediction methods for conditional jumps ... 11
  3.2 Branch prediction in P1 ... 16
  3.3 Branch prediction in PMMX, PPro, P2, and P3 ... 20
  3.4 Branch prediction in P4 and P4E ... 21
  3.5 Branch prediction in PM and Core2 ... 24
  3.6 Branch prediction in Intel Nehalem ... 26
  3.7 Branch prediction in Intel Sandy Bridge ... 27
  3.8 Branch prediction in Intel Atom ... 27
  3.9 Branch prediction in VIA Nano ... 28
  3.10 Branch prediction in AMD K8 and K10 ... 29
  3.11 Branch prediction in AMD Bulldozer ... 31
  3.12 Branch prediction in AMD Bobcat ... 32
  3.13 Indirect jumps on older processors ... 33
  3.14 Returns (all processors except P1) ... 33
  3.15 Static prediction ... 33
  3.16 Close jumps ... 34
4 Pentium 1 and Pentium MMX pipeline ... 36
  4.1 Pairing integer instructions ... 36
  4.2 Address generation interlock ... 40
  4.3 Splitting complex instructions into simpler ones ... 40
  4.4 Prefixes ... 41
  4.5 Scheduling floating point code ... 42
5 Pentium Pro, II and III pipeline ... 45
  5.1 The pipeline in PPro, P2 and P3 ... 45
  5.2 Instruction fetch ... 45
  5.3 Instruction decoding ... 46
  5.4 Register renaming ... 50
  5.5 ROB read ... 50
  5.6 Out of order execution ... 54
  5.7 Retirement ... 55
  5.8 Partial register stalls ... 56
  5.9 Store forwarding stalls ... 59
  5.10 Bottlenecks in PPro, P2, P3 ... 60
6 Pentium M pipeline ... 62
  6.1 The pipeline in PM ... 62
  6.2 The pipeline in Core Solo and Duo ... 63
  6.3 Instruction fetch ... 63
  6.4 Instruction decoding ... 63
  6.5 Loop buffer ... 65
  6.6 Micro-op fusion ... 65
  6.7 Stack engine ... 67
  6.8 Register renaming ... 69
  6.9 Register read stalls ... 69
  6.10 Execution units ... 71
  6.11 Execution units that are connected to both port 0 and 1 ... 71
  6.12 Retirement ... 73
  6.13 Partial register access ... 73
  6.14 Store forwarding stalls ... 75
  6.15 Bottlenecks in PM ... 75
7 Core 2 and Nehalem pipeline ... 78
  7.1 Pipeline ... 78
  7.2 Instruction fetch and predecoding ... 78
  7.3 Instruction decoding ... 81
  7.4 Micro-op fusion ... 81
  7.5 Macro-op fusion ... 82
  7.6 Stack engine ... 83
  7.7 Register renaming ... 83
  7.8 Register read stalls ... 84
  7.9 Execution units ... 85
  7.10 Retirement ... 89
  7.11 Partial register access ... 89
  7.12 Store forwarding stalls ... 90
  7.13 Cache and memory access ... 92
  7.14 Breaking dependency chains ... 92
  7.15 Multithreading in Nehalem ... 93
  7.16 Bottlenecks in Core2 and Nehalem ... 94
8 Sandy Bridge pipeline ... 96
  8.1 Pipeline ... 96
  8.2 Instruction fetch and decoding ...