Ibm Power8 Processors Analýza Výkonnosti Procesorů Ibm Power8

Total Page:16

File Type:pdf, Size:1020Kb

Ibm Power8 Processors Analýza Výkonnosti Procesorů Ibm Power8 BRNO UNIVERSITY OF TECHNOLOGY VYSOKÉ UČENÍ TECHNICKÉ V BRNĚ FACULTY OF INFORMATION TECHNOLOGY DEPARTMENT OF COMPUTER SYSTEMS FAKULTA INFORMAČNÍCH TECHNOLOGIÍ ÚSTAV POČÍTAČOVÝCH SYSTÉMŮ PERFORMANCE ANALYSIS OF IBM POWER8 PROCESSORS ANALÝZA VÝKONNOSTI PROCESORŮ IBM POWER8 MASTER’S THESIS DIPLOMOVÁ PRÁCE AUTHOR Bc. JAKUB JELEN AUTOR PRÁCE SUPERVISOR Ing. JIŘÍ JAROŠ, Ph.D. VEDOUCÍ PRÁCE BRNO 2016 Abstract This paper describes the IBM Power8 system in comparison to the Intel Xeon processors, widely used in today’s solutions. The performance is not evaluated only on the whole system level but also on the level of threads, cores and a memory. Different metrics are demonstrated on typical optimized algorithms. The benchmarked Power8 processor pro- vides extremely fast memory providing sustainable bandwidth up to 145 GB/s between main memory and processor, which Intel is unable to compete. Computation power is comparable (Matrix multiplication) or worse (N-body simulation, division, more complex algorithms) in comparison with current Intel Haswell-EP. The IBM Power8 is able to com- pete Intel processors these days and it will be interesting to observe the future generation of Power9 and its performance in comparison to current and future Intel processors. Abstrakt Práce se zabývá systémem IBM Power8 v porovnání s dnes běžně používanými řešeními s procesory Intel Xeon. Výkonnost je vyhodnocována nejen na úrovni celého systému, ale také na úrovni jednotlivých vláken a jader a paměti. Různé metriky jsou demonstrovány na typických optimalizovaných algoritmech. Testovaný stroj Power8 disponuje extrémně rychlou pamětí poskytující rychlost až 145 GB/s mezi pamětí a procesorem, které se dnešní procesory Intel nevyrovnají. Výpočetní síla je pouze srovnatelná (Násobení matic) nebo slabší (N-body simulace, dělení, složitější algoritmy) v porovnání s aktuálním Intel Haswell- EP. Procesor IBM Power8 je dnes schopný konkurovat procesorům Intel a bude zajímavé sledovat následující generaci Power9 a jeho výkonnost v porovnání s aktuálními a budoucími procesory Intel. Keywords Power 8, IBM, Intel, Haswell, Haswell-EP, Performance Analysis, STREAM, Steady state, N-body Klíčová slova Power 8, IBM, Intel, Haswell, Haswell-EP, Analýza výkonnosti, STREAM, Steady state, N-body Reference JELEN, Jakub. Performance analysis of IBM Power8 processors. Brno, 2016. Master’s thesis. Brno University of Technology, Faculty of Information Technology. Supervisor Jaroš Jiří. Performance analysis of IBM Power8 processors Declaration Hereby I declare that this master’s thesis was prepared as an original author’s work under the supervision of Mr. Jiří Jaroš. All the relevant information sources, which were used during preparation of this thesis, are properly cited and included in the list of references. ....................... Jakub Jelen May 23, 2016 Acknowledgements I would like to thank my thesis supervisor Jiří Jaroš for the useful comments, remarks and engagement through the learning process of this master thesis. Furthermore I would like to thank Red Hat and IBM Linux Technology Center for providing access to the Open Power Hub with latest Power8 servers. This work was supported by The Ministry of Education, Youth and Sports from the Large Infrastructures for Research, Experimental Development and Innovations project "IT4Innovations National Supercomputing Center – LM2015070". c Jakub Jelen, 2016. This thesis was created as a school work at the Brno University of Technology, Faculty of Information Technology. The thesis is protected by copyright law and its use without author’s explicit consent is illegal, except for cases defined by law. Contents 1 Motivation4 2 Architecture overview5 2.1 IBM Power8 architecture............................5 2.2 Intel Haswell-EP architecture.......................... 12 2.3 Architecture comparison............................. 14 3 Development tools overview 16 3.1 GNU Compiler Collection (GCC)........................ 16 3.2 Advance Toolchain for PowerLinux 8.0 and 9.0................ 16 3.3 IBM XL C/C++ for AIX............................ 17 3.4 Intel C++ Compiler............................... 17 4 Performed tests 18 4.1 Optimization techniques and performance monitoring............ 18 4.2 Micro-tests.................................... 22 4.3 Test-suites..................................... 27 4.4 Power Consumption............................... 33 5 Real world applications 35 5.1 N-body simulation................................ 35 5.2 Steady state heat distribution.......................... 38 5.3 Architectures comparison............................ 40 6 Conclusion 43 Bibliography 44 Appendices 45 List of Appendices................................... 46 A CD Content 47 1 List of Figures 2.1 IBM Power8 processor die with Centaur chips.................6 2.2 IBM Power8 2-hop SMP interconnect.....................9 2.3 The front view of the IBM Power S812L server................ 12 2.4 Intel Haswell-EP processor block diagram................... 13 4.1 Example optimization report from GCC.................... 19 4.2 Matrix and vector multiplication: Single thread performance comparison.. 26 4.3 Copy memory benchmark: Caches usage.................... 27 4.4 Copy memory benchmark: Bandwidth..................... 28 4.5 STREAM benchmark: IBM Power8...................... 30 4.6 STREAM benchmark: Intel Haswell-EP and different compilers....... 31 4.7 Comparison of power consumption....................... 34 5.1 N-body: IBM Power8 unroll factor effect.................... 37 5.2 N-body performance for small problem..................... 38 5.3 Steady state heat distribution: IBM Power8 scaling performance...... 40 5.4 Steady state heat distribution: Scaling performance on Intel processors... 41 2 List of Tables 2.1 List of available PAPI events on Power8 architecture............. 10 2.2 List of derived hardware counters........................ 11 2.3 Comparison of all evaluated processor architectures.............. 15 4.1 Matrix and vector multiplication: Various effects on performance...... 25 4.2 Native Linux benchmark: perf bench...................... 33 5.1 Architecture comparison............................. 42 3 Chapter 1 Motivation At the time of writing this document, there is no supercomputer in the list (from November 2015) of TOP 500 1 based on IBM Power8 architecture. There are only some based on previous generation IBM Power7, but in the latest model IBM did a good job with improving specifications in per-thread, per-core and per-socket performance, based onthe published numbers. Although Power8 does not have ambitions to became one of the top supercomputers, the overall performance is increasing and it will be an interesting base for the next generation of Power9 processors. Power8 instead aims for cloud based workloads, as well as big data and business analytics [9]. The improvements applicable in these areas are also important back in the scientific and technical computing and supercomputers. In this technical report, I would like to describe what is improved on this new archi- tecture and how it performs in comparison with a recent Intel Haswell-EP architecture, which provides similar improvements, but is better known and understood, since it is foun- dation for many supercomputers in the above mentioned list, but also for general purpose computers on our tables. I will not aim for the results of mass distributed parallel algorithms on thousands cores, but instead I will focus on the level of one processor core, one socket, instructions vectoriza- tion, simultaneous multithreading and memory management. These parameters measured on the low level also affect results of supercomputers in large scale. After gaining the theoretical knowledge about the architecture, I will describe advanced tools for development on both architectures. Later on, I will prepare a set of micro-tests to evaluate this architecture under extreme cases for memory access, raw computational power or combinations in comparison with an adequate Intel architecture. Based on these results, I will implement and optimize more complex simulation algo- rithms, such as N-body or Steady state heat distribution and compare their performance on Power8 and Intel processors with appropriate explanations and recommend workload for this architecture. 1Source http://www.top500.org/list/2015/11/ 4 Chapter 2 Architecture overview To measure the performance of specific architecture, there are different aspects to follow. First of them is certainly clock frequency, but with this one we have reached a limit of few Gigahertz, which exceeding does not provide significant gain and the energy cost is getting unbearable. A direct response to the frequency limit is placing more processor cores on one chip. This allows to run more or less separate programs on different cores, but it gets quite expensive because all the logic has to be duplicated and synchronization mechanisms on cores and memory introduced. Also, one might needs to speed up just a part of application, which brings a significant overhead of synchronization between cores. This was solved by the introduction of vector instructions working not on one value but on the vector of values. Another solution for processor is to execute more instructions from different threads at same time on the single core, so called Simultaneous multi-threading (SMT). All these features are available in nowadays processors and can make a huge difference both for everyday use or special applications. Some of them are implicit so they don’t
Recommended publications
  • Wind Rose Data Comes in the Form >200,000 Wind Rose Images
    Making Wind Speed and Direction Maps Rich Stromberg Alaska Energy Authority [email protected]/907-771-3053 6/30/2011 Wind Direction Maps 1 Wind rose data comes in the form of >200,000 wind rose images across Alaska 6/30/2011 Wind Direction Maps 2 Wind rose data is quantified in very large Excel™ spreadsheets for each region of the state • Fields: X Y X_1 Y_1 FILE FREQ1 FREQ2 FREQ3 FREQ4 FREQ5 FREQ6 FREQ7 FREQ8 FREQ9 FREQ10 FREQ11 FREQ12 FREQ13 FREQ14 FREQ15 FREQ16 SPEED1 SPEED2 SPEED3 SPEED4 SPEED5 SPEED6 SPEED7 SPEED8 SPEED9 SPEED10 SPEED11 SPEED12 SPEED13 SPEED14 SPEED15 SPEED16 POWER1 POWER2 POWER3 POWER4 POWER5 POWER6 POWER7 POWER8 POWER9 POWER10 POWER11 POWER12 POWER13 POWER14 POWER15 POWER16 WEIBC1 WEIBC2 WEIBC3 WEIBC4 WEIBC5 WEIBC6 WEIBC7 WEIBC8 WEIBC9 WEIBC10 WEIBC11 WEIBC12 WEIBC13 WEIBC14 WEIBC15 WEIBC16 WEIBK1 WEIBK2 WEIBK3 WEIBK4 WEIBK5 WEIBK6 WEIBK7 WEIBK8 WEIBK9 WEIBK10 WEIBK11 WEIBK12 WEIBK13 WEIBK14 WEIBK15 WEIBK16 6/30/2011 Wind Direction Maps 3 Data set is thinned down to wind power density • Fields: X Y • POWER1 POWER2 POWER3 POWER4 POWER5 POWER6 POWER7 POWER8 POWER9 POWER10 POWER11 POWER12 POWER13 POWER14 POWER15 POWER16 • Power1 is the wind power density coming from the north (0 degrees). Power 2 is wind power from 22.5 deg.,…Power 9 is south (180 deg.), etc… 6/30/2011 Wind Direction Maps 4 Spreadsheet calculations X Y POWER1 POWER2 POWER3 POWER4 POWER5 POWER6 POWER7 POWER8 POWER9 POWER10 POWER11 POWER12 POWER13 POWER14 POWER15 POWER16 Max Wind Dir Prim 2nd Wind Dir Sec -132.7365 54.4833 0.643 0.767 1.911 4.083
    [Show full text]
  • Copyrighted Material
    CHAPTER 1 MULTI- AND MANY-CORES, ARCHITECTURAL OVERVIEW FOR PROGRAMMERS Lasse Natvig, Alexandru Iordan, Mujahed Eleyat, Magnus Jahre and Jorn Amundsen 1.1 INTRODUCTION 1.1.1 Fundamental Techniques Parallelism hasCOPYRIGHTED been used since the early days of computing MATERIAL to enhance performance. From the first computers to the most modern sequential processors (also called uni- processors), the main concepts introduced by von Neumann [20] are still in use. How- ever, the ever-increasing demand for computing performance has pushed computer architects toward implementing different techniques of parallelism. The von Neu- mann architecture was initially a sequential machine operating on scalar data with bit-serial operations [20]. Word-parallel operations were made possible by using more complex logic that could perform binary operations in parallel on all the bits in a computer word, and it was just the start of an adventure of innovations in parallel computer architectures. Programming Multicore and Many-core Computing Systems, 3 First Edition. Edited by Sabri Pllana and Fatos Xhafa. © 2017 John Wiley & Sons, Inc. Published 2017 by John Wiley & Sons, Inc. 4 MULTI- AND MANY-CORES, ARCHITECTURAL OVERVIEW FOR PROGRAMMERS Prefetching is a 'look-ahead technique' that was introduced quite early and is a way of parallelism that is used at several levels and in different components of a computer today. Both data and instructions are very often accessed sequentially. Therefore, when accessing an element (instruction or data) at address k, an auto- matic access to address k+1 will bring the element to where it is needed before it is accessed and thus eliminates or reduces waiting time.
    [Show full text]
  • Power1.Ps (Mpage)
    Low Energy & Power Design Issues Low Power Design Problem • Processor trends Microprocessor Power • Circuit and Technology Issues (source ISSCC) 30 • Architectural optimizations • Low power µP research project 20 10 Power (Watt) 0 75 80 85 90 95 Year When supply voltage drops to 1Volt, then 100Watts = 100 Amps Slide 2 Portable devices Two Kinds of Computation Required • General purpose processing (what you have been Portable Functions studying so far) • Multimodal radio • Bursty - mostly idle with bursts of computation • Protocols, ECC, ... • Maximum possible throughput required during active • Voice I/O compression & periods decompression • Handwriting recognition • Signal processing (for multimedia, wireless Battery • Text/Graphics processing communications, etc.) (40+ lbs) • Video decompression • Stream based computation • Speech recognition • No advantage in increasing processing rate above • Java interpreter required for real-time requirements How to get 1 month of operation? Slide 3 Slide 4 Optimizing for Energy Consumption Switching Energy • Conventional General Purpose processors (e.g. Vdd Pentiums) • Performance is everything ... somehow we’ll get the Vin Vout power in and back out • 10-100 Watts, 100-1000 Mips = .01 Mips/mW CL • Energy Optimized but General Purpose • Keep the generality, but reduce the energy as much as 2 possible - e.g. StrongArm Energy/transition = CL * Vdd • .5 Watts, 160 Mips = .3 Mips/mW 2 Power = Energy/transition * f = CL * Vdd * f • Energy Optimized and Dedicated • 100 Mops/mW Slide 5 Slide 6 Low Power
    [Show full text]
  • Chapter 1-Introduction to Microprocessors File
    Chapter 1 Introduction to Microprocessors Expected Outcomes Explain the role of the CPU, memory and I/O device in a computer Distinguish between the microprocessor and microcontroller Differentiate various form of programming languages Compare between CISC vs RISC and Von Neumann vs Harvard architecture NMKNYFKEEUMP Introduction A microprocessor is an integrated circuit built on a tiny piece of silicon It contains thousands or even millions of transistors which are interconnected via superfine traces of aluminum The transistors work together to store and manipulate data so that the microprocessor can perform a wide variety of useful functions The particular functions a microprocessor perform are dictated by software The first microprocessor was the Intel 4004 (16-pin) introduced in 1971 containing 2300 transistors with 46 instruction sets Power8 processor, by contrast, contains 4.2 billion transistors NMKNYFKEEUMP Introduction Computer is an electronic machine that perform arithmetic operation and logic in response to instructions written Computer requires hardware and software to function Hardware is electronic circuit boards that provide functionality of the system such as power supply, cable, etc CPU – Central Processing Unit/Microprocessor Memory – store all programming and data Input/Output device – the flow of information Software is a programming that control the system operation and facilitate the computer usage Programming is a group of instructions that inform the computer to perform certain task NMKNYFKEEUMP Introduction Computer
    [Show full text]
  • Power Architecture® ISA 2.06 Stride N Prefetch Engines to Boost Application's Performance
    Power Architecture® ISA 2.06 Stride N prefetch Engines to boost Application's performance History of IBM POWER architecture: POWER stands for Performance Optimization with Enhanced RISC. Power architecture is synonymous with performance. Introduced by IBM in 1991, POWER1 was a superscalar design that implemented register renaming andout-of-order execution. In Power2, additional FP unit and caches were added to boost performance. In 1996 IBM released successor of the POWER2 called P2SC (POWER2 Super chip), which is a single chip implementation of POWER2. P2SC is used to power the 30-node IBM Deep Blue supercomputer that beat world Chess Champion Garry Kasparov at chess in 1997. Power3, first 64 bit SMP, featured a data prefetch engine, non-blocking interleaved data cache, dual floating point execution units, and many other goodies. Power3 also unified the PowerPC and POWER Instruction set and was used in IBM's RS/6000 servers. The POWER3-II reimplemented POWER3 using copper interconnects, delivering double the performance at about the same price. Power4 was the first Gigahertz dual core processor launched in 2001 which was awarded the MicroProcessor Technology Award in recognition of its innovations and technology exploitation. Power5 came in with symmetric multi threading (SMT) feature to further increase application's performance. In 2004, IBM with 15 other companies founded Power.org. Power.org released the Power ISA v2.03 in September 2006, Power ISA v.2.04 in June 2007 and Power ISA v.2.05 with many advanced features such as VMX, virtualization, variable length encoding, hyper visor functionality, logical partitioning, virtual page handling, Decimal Floating point and so on which further boosted the architecture leadership in the market place and POWER5+, Cell, POWER6, PA6T, Titan are various compliant cores.
    [Show full text]
  • Floboss 107 Flow Manager Instruction Manual
    Part Number D301232X012 May 2018 FloBoss™ 107 Flow Manager Instruction Manual Remote Automation Solutions FloBoss 107 Flow Manager Instruction Manual System Training A well-trained workforce is critical to the success of your operation. Knowing how to correctly install, configure, program, calibrate, and trouble-shoot your Emerson equipment provides your engineers and technicians with the skills and confidence to optimize your investment. Remote Automation Solutions offers a variety of ways for your personnel to acquire essential system expertise. Our full-time professional instructors can conduct classroom training at several of our corporate offices, at your site, or even at your regional Emerson office. You can also receive the same quality training via our live, interactive Emerson Virtual Classroom and save on travel costs. For our complete schedule and further information, contact the Remote Automation Solutions Training Department at 800-338-8158 or email us at [email protected]. ii Revised May-2018 FloBoss 107 Flow Manager Instruction Manual Contents Chapter 1 – General Information 1-1 1.1 Scope of Manual ............................................................................................................................1-2 1.2 FloBoss 107 Overview ...................................................................................................................1-2 1.3 Hardware ........................................................................................................................................1-5
    [Show full text]
  • Computer Architectures an Overview
    Computer Architectures An Overview PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 25 Feb 2012 22:35:32 UTC Contents Articles Microarchitecture 1 x86 7 PowerPC 23 IBM POWER 33 MIPS architecture 39 SPARC 57 ARM architecture 65 DEC Alpha 80 AlphaStation 92 AlphaServer 95 Very long instruction word 103 Instruction-level parallelism 107 Explicitly parallel instruction computing 108 References Article Sources and Contributors 111 Image Sources, Licenses and Contributors 113 Article Licenses License 114 Microarchitecture 1 Microarchitecture In computer engineering, microarchitecture (sometimes abbreviated to µarch or uarch), also called computer organization, is the way a given instruction set architecture (ISA) is implemented on a processor. A given ISA may be implemented with different microarchitectures.[1] Implementations might vary due to different goals of a given design or due to shifts in technology.[2] Computer architecture is the combination of microarchitecture and instruction set design. Relation to instruction set architecture The ISA is roughly the same as the programming model of a processor as seen by an assembly language programmer or compiler writer. The ISA includes the execution model, processor registers, address and data formats among other things. The Intel Core microarchitecture microarchitecture includes the constituent parts of the processor and how these interconnect and interoperate to implement the ISA. The microarchitecture of a machine is usually represented as (more or less detailed) diagrams that describe the interconnections of the various microarchitectural elements of the machine, which may be everything from single gates and registers, to complete arithmetic logic units (ALU)s and even larger elements.
    [Show full text]
  • The POWER4 Processor Introduction and Tuning Guide
    Front cover The POWER4 Processor Introduction and Tuning Guide Comprehensive explanation of POWER4 performance Includes code examples and performance measurements How to get the most from the compiler Steve Behling Ron Bell Peter Farrell Holger Holthoff Frank O’Connell Will Weir ibm.com/redbooks International Technical Support Organization The POWER4 Processor Introduction and Tuning Guide November 2001 SG24-7041-00 Take Note! Before using this information and the product it supports, be sure to read the general information in “Special notices” on page 175. First Edition (November 2001) This edition applies to AIX 5L for POWER Version 5.1 (program number 5765-E61), XL Fortran Version 7.1.1 (5765-C10 and 5765-C11) and subsequent releases running on an IBM ^ pSeries POWER4-based server. Unless otherwise noted, all performance values mentioned in this document were measured on a 1.1 GHz machine, then normalized to 1.3 GHz. Note: This book is based on a pre-GA version of a product and may not apply when the product becomes generally available. We recommend that you consult the product documentation or follow-on versions of this redbook for more current information. Comments may be addressed to: IBM Corporation, International Technical Support Organization Dept. JN9B Building 003 Internal Zip 2834 11400 Burnet Road Austin, Texas 78758-3493 When you send information to IBM, you grant IBM a non-exclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you. © Copyright International Business Machines Corporation 2001. All rights reserved. Note to U.S Government Users – Documentation related to restricted rights – Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.
    [Show full text]
  • Ilore: Discovering a Lineage of Microprocessors
    iLORE: Discovering a Lineage of Microprocessors Samuel Lewis Furman Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Master of Science in Computer Science & Applications Kirk Cameron, Chair Godmar Back Margaret Ellis May 24, 2021 Blacksburg, Virginia Keywords: Computer history, systems, computer architecture, microprocessors Copyright 2021, Samuel Lewis Furman iLORE: Discovering a Lineage of Microprocessors Samuel Lewis Furman (ABSTRACT) Researchers, benchmarking organizations, and hardware manufacturers maintain repositories of computer component and performance information. However, this data is split across many isolated sources and is stored in a form that is not conducive to analysis. A centralized repository of said data would arm stakeholders across industry and academia with a tool to more quantitatively understand the history of computing. We propose iLORE, a data model designed to represent intricate relationships between computer system benchmarks and computer components. We detail the methods we used to implement and populate the iLORE data model using data harvested from publicly available sources. Finally, we demonstrate the validity and utility of our iLORE implementation through an analysis of the characteristics and lineage of commercial microprocessors. We encourage the research community to interact with our data and visualizations at csgenome.org. iLORE: Discovering a Lineage of Microprocessors Samuel Lewis Furman (GENERAL AUDIENCE ABSTRACT) Researchers, benchmarking organizations, and hardware manufacturers maintain repositories of computer component and performance information. However, this data is split across many isolated sources and is stored in a form that is not conducive to analysis. A centralized repository of said data would arm stakeholders across industry and academia with a tool to more quantitatively understand the history of computing.
    [Show full text]
  • POWER Processor
    POWER Processor Technology Overview Myron Slota POWER Systems, IBM Systems © 2017 IBM Corporation Quarter Century of POWER 22nm Legacy of Leadership Innovation 45/32nm Driving Client Value 65nm POWER8 0.18um 0.25um 130/90nm POWER7/7+ 0.35um Business 0.5um RS64IV Sstar 180/130nm POWER6 0.5um RS64III Pulsar RS64II North Star 0.5um POWER5/5+ RS64I Apache 0.22um Cobra A10 Muskie POWER4/4+ A35 Modern UNIX Era 0.35um Workstation POWER3 -630 0.72um POWER2 P2SC 1.0um RSC 0.25um POWER1 0.35um PC 0.6um 604e 603 601 1990 1995 2000 2005 2010 2015 © 2017 IBM Corporation 2 IBM Optimized Semiconductor Technology World class technology with value-added features for server business. POWER9 is built on 14nm finFET technology transitioned to Global Foundaries 17-layer copper wire Silicon On Insulator On-chip eDRAM (14nm) -Faster Transistor, Less Noise - 6x latency improvement - No off-chip signaling required - 8x bandwidth improvement - 3x less area than SRAM - 5x less energy than SRAM Dense interconnect - Faster connections - Low latency distance paths - High density complex circuits - 2X wire per transistor DT DT eDRAM Cell “IBM is committed to meeting the rising demands of cognitive systems and cloud computing. GF’s leading performance in 7LP process technology, reflecting our joint Research collaboration, will allow IBM Power and Mainframe systems to push beyond limitations to provide high-performance computing solutions while aggressively pursuing 5nm to advance our leadership for years to come.” Tom Rosamilia, Senior Vice President, IBM Systems © 2017
    [Show full text]
  • Qoriq LS1012A SDK V0.3 Contents
    NXP Semiconductors Document Number: QORIQLS1012ASDK03 Rev. 0, Aug 2016 QorIQ LS1012A SDK v0.3 Contents Contents Chapter 1 SDK Overview................................................................................ 8 1.1 What's New.......................................................................................................................... 8 1.2 Components.........................................................................................................................8 1.3 Known Issues.......................................................................................................................9 Chapter 2 Getting Started with Yocto Project.............................................11 2.1 Yocto SDK File System Images..........................................................................................11 2.1.1 fsl-image-minimal.................................................................................................................11 2.1.2 fsl-image-mfgtool................................................................................................................. 11 2.1.3 fsl-image-full........................................................................................................................12 2.1.4 fsl-image-core......................................................................................................................12 2.1.5 fsl-image-virt........................................................................................................................13 2.2 Essential
    [Show full text]
  • The Design Space of Shelving
    The Design Space of Shelving Dezs Sima Institute of Informatics, Kandó Polytechnic Budapest, 19 Nagyszombat utca, 1034 Budapest, Hungary Abstract While using the direct issue mode, dependent instructions cause issue blockages and thus an issue bottleneck. Shelving is a technique to avoid this and to increase the sustained issue rate. It takes advantage of two concepts: (a) the decoupling of dependency checking from instruction issue and (b) significantly widening the instruction window that is scanned in each clock cycle for executable instructions. In this paper we identify and explore the design space of shelving. We first outline its main dimensions, then we present and discuss feasible design alternatives along three of its crucial dimensions. Finally, we point out which design choices have been made in important superscalar processors. For a concise graphical representation of the design space we make use of DS-trees. Keywords: Shelving, Dispatching, Instruction issue, Microarchitecture of superscalar processors, ILP-processing 1. Introduction Although shelving was introduced more than thirty years ago in scalar supercomputers of that time, it has come into widespread use only recently in high end superscalars. While pioneering the basic approaches of parallel instruction execution, that is replication of execution units (EUs) [1] and pipelining [2], designers of Control Data’s 6600 and IBM’s 360/91 made use of shelving to avoid instruction issue blockages caused by data dependencies. In this way, the sustained issue rate and the overall performance of their machines could be increased. Nevertheless, for a number of reasons, including the complexity of its implementation, and the slight performance gain in scalar processors, the concept of shelving was itself shelved for a quarter of a century.
    [Show full text]