The Microarchitecture of Intel and AMD CPUs

Total Pages: 16

File Type: PDF, Size: 1020 KB

3. The microarchitecture of Intel, AMD and VIA CPUs
An optimization guide for assembly programmers and compiler makers
By Agner Fog. Copenhagen University College of Engineering. Copyright © 1996 - 2012. Last updated 2012-02-29.

Contents
1 Introduction ... 4
  1.1 About this manual ... 4
  1.2 Microprocessor versions covered by this manual ... 6
2 Out-of-order execution (All processors except P1, PMMX) ... 8
  2.1 Instructions are split into µops ... 8
  2.2 Register renaming ... 9
3 Branch prediction (all processors) ... 11
  3.1 Prediction methods for conditional jumps ... 11
  3.2 Branch prediction in P1 ... 16
  3.3 Branch prediction in PMMX, PPro, P2, and P3 ... 20
  3.4 Branch prediction in P4 and P4E ... 21
  3.5 Branch prediction in PM and Core2 ... 24
  3.6 Branch prediction in Intel Nehalem ... 26
  3.7 Branch prediction in Intel Sandy Bridge ... 27
  3.8 Branch prediction in Intel Atom ... 27
  3.9 Branch prediction in VIA Nano ... 28
  3.10 Branch prediction in AMD K8 and K10 ... 29
  3.11 Branch prediction in AMD Bulldozer ... 31
  3.12 Branch prediction in AMD Bobcat ... 32
  3.13 Indirect jumps on older processors ... 33
  3.14 Returns (all processors except P1) ... 33
  3.15 Static prediction ... 33
  3.16 Close jumps ... 34
4 Pentium 1 and Pentium MMX pipeline ... 36
  4.1 Pairing integer instructions ... 36
  4.2 Address generation interlock ... 40
  4.3 Splitting complex instructions into simpler ones ... 40
  4.4 Prefixes ... 41
  4.5 Scheduling floating point code ... 42
5 Pentium Pro, II and III pipeline ... 45
  5.1 The pipeline in PPro, P2 and P3 ... 45
  5.2 Instruction fetch ... 45
  5.3 Instruction decoding ... 46
  5.4 Register renaming ... 50
  5.5 ROB read ... 50
  5.6 Out of order execution ... 54
  5.7 Retirement ... 55
  5.8 Partial register stalls ... 56
  5.9 Store forwarding stalls ... 59
  5.10 Bottlenecks in PPro, P2, P3 ... 60
6 Pentium M pipeline ... 62
  6.1 The pipeline in PM ... 62
  6.2 The pipeline in Core Solo and Duo ... 63
  6.3 Instruction fetch ... 63
  6.4 Instruction decoding ... 63
  6.5 Loop buffer ... 65
  6.6 Micro-op fusion ... 65
  6.7 Stack engine ... 67
  6.8 Register renaming ... 69
  6.9 Register read stalls ... 69
  6.10 Execution units ... 71
  6.11 Execution units that are connected to both port 0 and 1 ... 71
  6.12 Retirement ... 73
  6.13 Partial register access ... 73
  6.14 Store forwarding stalls ... 75
  6.15 Bottlenecks in PM ... 75
7 Core 2 and Nehalem pipeline ... 78
  7.1 Pipeline ... 78
  7.2 Instruction fetch and predecoding ... 78
  7.3 Instruction decoding ... 81
  7.4 Micro-op fusion ... 81
  7.5 Macro-op fusion ... 82
  7.6 Stack engine ... 83
  7.7 Register renaming ... 83
  7.8 Register read stalls ... 84
  7.9 Execution units ... 85
  7.10 Retirement ... 89
  7.11 Partial register access ... 89
  7.12 Store forwarding stalls ... 90
  7.13 Cache and memory access ... 92
  7.14 Breaking dependency chains ... 92
  7.15 Multithreading in Nehalem ... 93
  7.16 Bottlenecks in Core2 and Nehalem ... 94
8 Sandy Bridge pipeline ... 96
  8.1 Pipeline ... 96
  8.2 Instruction fetch and decoding ...
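The table of contents above is all the preview shows of the manual, but one of its recurring topics (section 7.14, "Breaking dependency chains") can be illustrated briefly. The sketch below is not code from the manual; it is a generic C example, with illustrative function names, of splitting one long loop-carried dependency into several independent accumulators so an out-of-order core can overlap the additions.

    /* A minimal sketch of dependency-chain breaking (illustrative names,
       not code from the manual). */
    double sum_one_chain(const double *x, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i];              /* each add must wait for the previous one */
        return s;
    }

    double sum_four_chains(const double *x, int n)
    {
        /* Four independent partial sums let the adds overlap in the pipeline. */
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i = 0;
        for (; i + 3 < n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        for (; i < n; i++)          /* remainder */
            s0 += x[i];
        return (s0 + s1) + (s2 + s3);
    }

Because floating-point addition is not associative, the two versions can differ in the last bits of the result; the reordering of the additions is precisely what removes the serial dependency.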
Recommended publications
  • Intel® IA-64 Architecture Software Developer's Manual
    Intel® IA-64 Architecture Software Developer's Manual
    Volume 1: IA-64 Application Architecture
    Revision 1.1, July 2000. Document Number: 245317-002

    THIS DOCUMENT IS PROVIDED "AS IS" WITH NO WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY, NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR SAMPLE. Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. Intel® IA-64 processors may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's website at http://developer.intel.com/design/litcentr.
  • A Superscalar Out-of-Order x86 Soft Processor for FPGA
    A Superscalar Out-of-Order x86 Soft Processor for FPGA
    Henry Wong, University of Toronto / Intel, [email protected]
    June 5, 2019, Stanford University EE380

    Hi!
    ● CPU architect, Intel Hillsboro
    ● Ph.D., University of Toronto
    ● Today: x86 OoO processor for FPGA (Ph.D. work)
      – Motivation
      – High-level design and results
      – Microarchitecture details and some circuits

    FPGA: Field-Programmable Gate Array
    ● Is a digital circuit (logic gates and wires)
    ● Is field-programmable (at power-on, not in the fab)
    ● Pre-fab everything you'll ever need
      – 20x area, 20x delay cost
      – Circuit building blocks (6-LUTs) are somewhat bigger than logic gates

    FPGA Soft Processors
    ● FPGA systems often have software components
      – Often running on a soft processor
    ● Need more performance?
      – Parallel code and hardware accelerators need effort
      – Less effort if soft processors got faster
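    The "6-LUT" building block on these slides is a 64-entry truth table addressed by six input bits. As a rough illustration only (not taken from the talk), a 6-input LUT can be modeled in C like this:

        #include <stdint.h>

        /* Software model of a 6-input LUT: the 64-bit mask is the truth table,
           and the six input bits select one of its 64 entries. Real FPGA LUTs
           are small SRAMs whose contents are loaded at configuration time. */
        static inline unsigned lut6(uint64_t truth_table, unsigned inputs)
        {
            return (unsigned)((truth_table >> (inputs & 0x3F)) & 1u);
        }

        /* Example: a 6-input AND gate is the truth table with only bit 63 set,
           so lut6(1ULL << 63, 0x3F) returns 1 and any other input returns 0. */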
  • SIMD Extensions
    SIMD Extensions
    PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 12 May 2012 17:14:46 UTC

    Contents
      Articles: SIMD (1), MMX (instruction set) (6), 3DNow! (8), Streaming SIMD Extensions (12), SSE2 (16), SSE3 (18), SSSE3 (20), SSE4 (22), SSE5 (26), Advanced Vector Extensions (28), CVT16 instruction set (31), XOP instruction set (31)
      References: Article Sources and Contributors (33), Image Sources, Licenses and Contributors (34)
      Article Licenses: License (35)

    SIMD
    Single instruction, multiple data (SIMD) is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously. Thus, such machines exploit data-level parallelism. In Flynn's taxonomy:

                        Single instruction   Multiple instruction
        Single data     SISD                 MISD
        Multiple data   SIMD                 MIMD

    History
    The first use of SIMD instructions was in vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a vector of data with a single instruction. Vector processing was especially popularized by Cray in the 1970s and 1980s. Vector-processing architectures are now considered separate from SIMD machines, based on the fact that vector machines processed the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of the vector simultaneously.[1]
    The first era of modern SIMD machines was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2. These machines had many limited-functionality processors that would work in parallel.
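    The excerpt above describes SIMD in general terms; a tiny C example using SSE intrinsics (one of the extensions listed in this collection) shows the idea concretely. This is a generic sketch, not text from the collection, and it assumes an x86 compiler with SSE support.

        #include <xmmintrin.h>   /* SSE intrinsics */

        /* Add two float arrays four elements at a time.
           For brevity, n is assumed to be a multiple of 4. */
        void add_f32(const float *a, const float *b, float *out, int n)
        {
            for (int i = 0; i < n; i += 4) {
                __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats  */
                __m128 vb = _mm_loadu_ps(b + i);
                _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); /* 4 adds at once */
            }
        }

    One _mm_add_ps performs the same operation on four data elements simultaneously, which is exactly the "single instruction, multiple data" pattern the article defines.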
  • Operating Guide
    Operating Guide: EPIA-P830 Mainboard

    Table of Contents
      Table of Contents ... i
      VIA EPIA-P830 overview ... 1
      VIA EPIA-P830 layout ... 2
      VIA EPIA-P830 specifications ... 3
      VIA EPIA-P830 processor SKUs ... 4
      VIA VX900 chipset overview ... 5
      VIA EPIA-P830 and P830-A board dimensions ... 6
      VIA P830-B board dimensions ... 7
  • The Microarchitecture of the Pentium 4 Processor
    The Microarchitecture of the Pentium 4 Processor
    Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, and Patrice Roussel; Desktop Platforms Group, Intel Corp.
    Index words: Pentium® 4 processor, NetBurst™ microarchitecture, Trace Cache, double-pumped ALU, deep pipelining

    ABSTRACT
    This paper describes the Intel® NetBurst™ microarchitecture of Intel's new flagship Pentium 4 processor. This microarchitecture is the basis of a new family of processors from Intel starting with the Pentium 4 processor. The Pentium 4 processor provides a substantial performance gain for many key application areas where the end user can truly appreciate the difference.
    The Pentium 4 processor is designed to deliver performance across applications where end users can truly appreciate and experience its performance. For example, it allows a much better user experience in areas such as Internet audio and streaming video, image processing, video content creation, speech recognition, 3D applications and games, multi-media, and multi-tasking user environments. The Pentium 4 processor enables real-time MPEG2 video encoding and near real-time MPEG4 encoding, allowing efficient video editing and video conferencing. It delivers world-class performance on 3D [...]
    In this paper we describe the main features and functions of the NetBurst microarchitecture. We present the front-end of the machine, including its new form of instruction cache called the Execution Trace Cache. The paper provides an in-depth examination of the features and functions of the Intel NetBurst microarchitecture.
  • Apparecchiature Medicali: Il Ruolo Delle Nanotecnologie
    EO Medical
    Apparecchiature Medicali: Il Ruolo delle Nanotecnologie (Medical Equipment: The Role of Nanotechnologies)

    In this issue:
      III   Markets / News
      VIII  Stanford: an 'electronic skin' under development
      X     Dialysis directly at home
      XII   Meeting high-performance demands in medical image display
      XIV   Selection criteria for power supplies compliant with IEC 60601-1, 3rd edition
      XVII  News
    Photo: Future Electronics

    Murata MEMS Solutions for Medical and Healthcare: Enabling Improved Care
    In medical and healthcare applications Murata's medical MEMS sensors enable improved care and a better quality of life for patients and elderly people. Medical sensors increase the intelligence of life-supporting transplants, and they can be used in new types of patient monitoring applications that allow patients to lead more independent lives. Detecting signals triggered by symptoms helps optimize medication and prevent serious attacks of illness. Murata's unique MEMS design, which combines single crystal silicon and glass, ensures exceptional reliability, unprecedented accuracy and excellent stability over time. The power requirements of these medical sensors are extremely low, which gives them a significant advantage in small battery-operated devices.

    MEMS Sensing Elements (Dies)
      SCG12S and SCG14S vertical accelerometer elements: size 3 mm x 2.12 mm x 1.95 or 1.25 mm; various measuring ranges possible (1 - 12 g); proven capacitive 3D-MEMS technology.
      SCG10X and SCG10Z horizontal accelerometer elements: size SCG10X 2.55 mm x 2.95 mm x 1.91 mm, SCG10Z 1.50 mm x 1.70 mm x 1.83 mm; various measuring ranges possible (1 - 12 g); proven capacitive 3D-MEMS technology.
      SCB10H pressure sensor elements: size 1.4 mm x 1.4 mm x 0.85 mm; high pressure shock survival (> 200 bar); various pressure ranges possible (1.2 - 25 bar); proven capacitive 3D-MEMS technology.
  • Memorandum in Opposition to Hewlett-Packard Company's Motion to Quash Intel's Subpoena Duces Tecum
    ORIGINAL
    UNITED STATES OF AMERICA
    BEFORE THE FEDERAL TRADE COMMISSION
    In the Matter of INTEL CORPORATION, a corporation
    DOCKET NO. 9341
    PUBLIC

    MEMORANDUM IN OPPOSITION TO HEWLETT-PACKARD COMPANY'S MOTION TO QUASH INTEL'S SUBPOENA DUCES TECUM

    Intel Corporation ("Intel") submits this memorandum in opposition to Hewlett-Packard Company's ("HP") motion to quash Intel's subpoena duces tecum issued on March 11, 2010 ("Subpoena"). HP's motion should be denied, and it should be ordered to comply with Intel's Subpoena, as narrowed by Intel's April 19, 2010 letter.

    Intel's Subpoena seeks documents necessary to defend against Complaint Counsel's broad allegations and claimed relief. The Complaint alleges that Intel engaged in unfair business practices that maintained its monopoly over central processing units ("CPUs") and threatened to give it a monopoly over graphics processing units ("GPUs"). See Compl. ¶¶ 2-28. Complaint Counsel's Interrogatory Answers state that it views HP, the world's largest manufacturer of personal computers, as a centerpiece of its case. See, e.g., Complaint Counsel's Resp. and Obj. to Respondent's First Set of Interrogatories Nos. 7-8 (attached as Exhibit A). Complaint Counsel intends to call eight HP witnesses at trial on topics crossing virtually all of HP's business lines, including its purchases of CPUs for its commercial desktop, commercial notebook, and server businesses. See Complaint Counsel's May 5, 2010 Revised Preliminary Witness List (attached as Exhibit B). Complaint Counsel may also call HP witnesses on other topics, including its assessment and purchases of GPUs and chipsets and evaluation of compilers, benchmarks, interface standards, and standard-setting bodies.
  • Microprocessor Report: Slot vs. Socket Battle Heats Up
    MICROPROCESSOR REPORT: THE INSIDERS' GUIDE TO MICROPROCESSOR HARDWARE
    VOLUME 12, NUMBER 1, JANUARY 26, 1998

    Slot vs. Socket Battle Heats Up
    Intel Prepares for Transition as Competitors Boost Socket 7
    by Michael Slater

    The past year has brought a great deal of change to the x86 microprocessor market, with Intel, AMD, and Cyrix replacing virtually their entire product lines with new devices. But despite high hopes, AMD and Cyrix struggled in vain for profits. The financial contrast is stark: in 1997, Intel earned a record $6.9 billion in net profit, while AMD lost $21 million for the year and Cyrix lost $6 million in the six months before it was acquired by National. New entrant IDT added another competitor to the mix but hasn't shipped enough products to become a significant force. [...] ship as many parts as they hoped, especially at the highest clock speeds where profits are much greater.

    The shift to 0.25-micron technology will be central to 1998's CPU developments. Intel began shipping 0.25-micron processors in 3Q97; AMD followed late in 1997, IDT plans to join in by mid-98, and Cyrix expects to catch up in 3Q98. The more advanced process technology will cut power consumption, allowing sixth-generation CPUs to be used in notebook systems. The smaller die sizes will enable higher production volumes and make it possible to integrate an L2 cache on the CPU chip.

    The processors from Intel's challengers have lagged in floating-point and MMX performance, which the vendors [...]
  • World's First High-Performance x86 with Integrated AI Coprocessor
    World's First High-Performance x86 with Integrated AI Coprocessor
    Linley Spring Processor Conference 2020, April 8, 2020
    Glenn Henry, Chief AI Architect; Dr. Parviz Palangpour, AI Software

    Deep Dive into Centaur's New x86 AI Coprocessor (Ncore)
    ● Background ● Motivations ● Constraints ● Architecture ● Software ● Benchmarks ● Conclusion
    Demonstrated working silicon for a video-analytics edge server, Nov. 2019

    Centaur Technology Background
    ● 25-year-old startup in Austin, owned by Via Technologies
    ● We design, from scratch, low-cost x86 processors
    ● Everything to produce a custom x86 SoC with ~100 people
      – Architecture, logic design and microcode
      – Design, verification, and layout
      – Physical build, fab interface, and tape-out
    ● Shipped by IBM, HP, Dell, Samsung, Lenovo…

    Genesis of the AI Coprocessor (Ncore)
    ● Centaur was developing SoC (CHA) with new x86 cores
      – Targeted at edge/cloud server market (high-end x86 features)
    ● Huge inference markets beyond hyperscale cloud, IoT and mobile
      – Video analytics, edge computing, on-premise servers
    ● However, x86 isn't efficient at inference
      – High-performance inference requires external accelerator
      – CHA has 44x PCIe to support GPUs, etc.
      – But adds cost, power, another point of failure, etc.

    Why not integrate a coprocessor?
    ● Very low cost
      – Many components already on SoC ("free" to Ncore): caches, memory, clock, power, package, pins, busses, etc.
      – There often is "free" space on complex SoCs due to I/O & pins
    ● Having many high-performance x86 cores allows flexibility
      – The x86 cores can do some of the work, in parallel
      – Didn't have to implement all strange/new functions
      – Allows fast prototyping of new things
    ● For customer: nothing extra to buy
  • Reverse Engineering x86 Processor Microcode
    Reverse Engineering x86 Processor Microcode
    Philipp Koppe, Benjamin Kollenda, Marc Fyrbiak, Christian Kison, Robert Gawlik, Christof Paar, and Thorsten Holz, Ruhr-Universität Bochum
    https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/koppe
    This paper is included in the Proceedings of the 26th USENIX Security Symposium, August 16-18, 2017, Vancouver, BC, Canada. ISBN 978-1-931971-40-9. Open access to the Proceedings of the 26th USENIX Security Symposium is sponsored by USENIX.

    Abstract
    Microcode is an abstraction layer on top of the physical components of a CPU and present in most general-purpose CPUs today. In addition to facilitating complex and vast instruction sets, it also provides an update mechanism that allows CPUs to be patched in-place without requiring any special hardware. While it is well known that CPUs are regularly updated with this mechanism, very little is known about its inner workings given that microcode and the update mechanism are proprietary and have not been thoroughly analyzed yet.

    [...] hardware modifications [48]. Dedicated hardware units to counter bugs are imperfect [36, 49] and involve non-negligible hardware costs [8]. The infamous Pentium fdiv bug [62] illustrated a clear economic need for field updates after deployment in order to turn off defective parts and patch erroneous behavior. Note that the implementation of a modern processor involves millions of lines of HDL code [55] and verification of functional correctness for such processors is still an unsolved problem [4, 29]. Since the 1970s, x86 processor manufacturers have [...]
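    The abstract notes that CPUs are regularly updated with microcode patches. On an x86 Linux system the currently loaded microcode revision is visible without special tooling; the short C program below (a generic sketch, not part of the paper) reads it from the "microcode" field of /proc/cpuinfo.

        #include <stdio.h>
        #include <string.h>

        /* Print the microcode revision of the first logical CPU listed in
           /proc/cpuinfo (x86 Linux only). */
        int main(void)
        {
            char line[256];
            FILE *f = fopen("/proc/cpuinfo", "r");
            if (!f) { perror("fopen"); return 1; }
            while (fgets(line, sizeof line, f)) {
                if (strncmp(line, "microcode", 9) == 0) {  /* e.g. "microcode : 0xde" */
                    fputs(line, stdout);
                    break;                                 /* first CPU is enough */
                }
            }
            fclose(f);
            return 0;
        }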
  • Centaur Adds AI to Server Processor (Microprocessor Report)
    CENTAUR ADDS AI TO SERVER PROCESSOR
    First x86 SoC to Integrate Deep-Learning Accelerator
    By Linley Gwennap (December 2, 2019)

    Centaur is galloping back into the x86 market with an innovative processor design that combines eight high-performance CPUs with a custom deep-learning accelerator (DLA). The company is the first to announce a server-processor design that integrates a DLA. The new accelerator, called Ncore, delivers better neural-network performance than even the most powerful Xeon, but without incurring the high cost of an external GPU card. The Via Technologies subsidiary began testing the silicon in September; we estimate the first products based on this design could enter production in 2H20, although Via has disclosed no product plans.

    Ncore, which operates as a coprocessor, avoids the vogue MAC array, instead relying on a more traditional programmable SIMD engine. But Centaur took the multiple in SIMD to the extreme, designing a unit that processes 4,096 bytes in parallel to achieve peak performance of 20 trillion operations per second (TOPS) for 8-bit integers (INT8).

    Centaur Technology has designed x86 CPUs and processors for Via for more than 20 years, but we've heard little from the Texan design shop since the debut of the dual-core Nano X2, which was built in a leading-edge 40nm process (see MPR 1/24/11, "Via Doubles Down at CES"). Henry, who managed the company since its inception, recently handed the reins to new president Al Loper, another long-time Centaurian. Henry, boasting 50 years of CPU-design experience, continues as the company's AI architect.

    One Tractor Pulling 4,096 Trailers
    When designing the Ncore accelerator, Centaur was concerned that the rapid pace of neural-network evolution [...]
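    As a rough sanity check on the 20 TOPS figure (the clock frequency below is an assumption; the excerpt does not state it), 4,096 byte-wide lanes each performing a multiply-accumulate (2 operations) per cycle reach that number at roughly 2.4 to 2.5 GHz:

        4096 \times 2\,\tfrac{\text{ops}}{\text{cycle}} \times 2.44\times10^{9}\,\tfrac{\text{cycles}}{\text{s}} \approx 2.0\times10^{13}\,\tfrac{\text{ops}}{\text{s}} = 20\ \text{TOPS (INT8)}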
  • 6th Gen Intel® Core™ Processors
    6th Generation Intel® Processor Family Specification Update
    Supporting the Intel® Pentium® Processor Family based on the U-Processor
    Supporting the 6th Generation Intel® Core™ Processor Family based on the Y-Processor
    September 2015, Version 1.0, Order Number: 332994-001EN

    Preface
    You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or visit www.intel.com/design/literature.htm.