Empirical Study of Power Consumption of x86-64 Instruction Decoder


Mikael Hirki1,2, Zhonghong Ou3, Kashif N. Khan1,2, Jukka K. Nurminen1,2,4, and Tapio Niemi1
1Helsinki Institute of Physics  2Aalto University  3Beijing University of Posts and Telecommunications  4VTT Research

Abstract

It has been a common myth that x86-64 processors suffer in terms of energy efficiency because of their complex instruction set. In this paper, we aim to investigate whether this myth holds true, and determine the power consumption of the instruction decoders of an x86-64 processor. To that end, we design a set of microbenchmarks that specifically trigger the instruction decoders by exceeding the capacity of the decoded instruction cache. We measure the power consumption of the processor package using a hardware-level energy metering model called the Running Average Power Limit (RAPL), which is supported in the latest Intel architectures. We leverage linear regression modeling to break down the power consumption of each processor component, including the instruction decoders. Through a comprehensive set of experiments, we demonstrate that the instruction decoders can consume between 3% and 10% of the package power when the capacity of the decoded instruction cache is exceeded. Overall, this is a somewhat limited amount of power compared with the other components in the processor core, e.g., the L2 cache. We hope our findings can shed light on the future optimization of processor architectures.

1 Introduction

Making data centers more energy efficient requires finding and eliminating new sources of inefficiency. With the recent introduction of ARM-based servers, we will soon have more CPU architectures to choose from. These servers are based on the new ARMv8 architecture, which adds support for 64-bit computing. The first prototypes started shipping in 2014 [1]. Energy efficiency of the first models has been somewhat poor, but efficiency is expected to improve in later models [2].

The most commonly cited difference between ARM and Intel processors is the instruction set [5]. Intel processors use the x86-64 instruction set, which is a Complex Instruction Set Computer (CISC) architecture, while ARM designs are based on the Reduced Instruction Set Computer (RISC) architecture. CISC instructions are more complex: a single instruction can load a value from memory and add it to a register. Today these CISC instructions are decoded into smaller micro-operations, i.e. one instruction is translated into one or more micro-operations. A common myth is that decoding x86-64 instructions is somehow expensive.

In this paper, we aim to investigate the myth that decoding x86-64 instructions is expensive. We leverage two features that are present in Intel processor architectures from Sandy Bridge onwards: the Running Average Power Limit (RAPL) and a micro-op (short for micro-operation) cache. RAPL [6, 8] allows measuring the processor energy consumption with high accuracy. The micro-op cache allows the instruction decoders to be shut down while decoded instructions are served from the cache, thus improving energy efficiency. Our core idea is to design microbenchmarks that exhaust the micro-op cache, thus triggering the instruction decoders. This allows us to specifically measure the power consumed by the instruction decoders. The code for our microbenchmarks is available at https://github.com/mhirki/idq-bench2.

Our paper makes the following major contributions:

• We develop a set of microbenchmarks to accurately measure the power consumption of the instruction decoders in an x86-64 processor.

• We show that the percentage of power consumed by the instruction decoders is between 3% and 10% of the total package power.

• We conclude that the x86-64 instruction set is not a major hindrance in producing an energy-efficient processor architecture.

The rest of the paper is structured as follows. Section 2 gives a more detailed explanation of processor architecture and the Intel RAPL feature. Section 3 presents the hardware and software configuration as well as the implementation of our microbenchmarks. Section 4 presents the results obtained using our microbenchmarks. Section 5 reviews the related work. Finally, Section 6 concludes our paper and presents an idea for future work.

2 Background

2.1 Processor Architecture

[Figure 1: Simplified version of the Intel Haswell pipeline. The L1 instruction cache feeds pre-decoding (16 bytes per cycle, up to 6 instructions) into the instruction queue; a microcode unit (4 µops), one complex decoder (up to 4 µops), and three simple decoders (1 µop each) feed the 56-micro-op decode queue, which can also receive up to 4 µops per cycle from the micro-op cache; micro-ops then proceed to renaming and scheduling, the execution ports, and retirement.]

Figure 1 depicts a simplified version of the Intel Haswell pipeline. This simplified version highlights the micro-op cache, a new feature first introduced in the Sandy Bridge architecture. Before the new cache, the Intel architecture relied on a very small decoded instruction buffer and four instruction decoders. Three of the decoders are simple decoders that can only decode a single instruction into a single micro-op. The fourth decoder can receive any instruction and produce up to four micro-ops per cycle. Thus, instructions that are translated into more than one micro-op can only be decoded by the complex decoder. This creates a potential bottleneck if the frequency of complex instructions is high.

The micro-op cache solves these performance problems in some cases. In addition, it allows shutting down the instruction decoders when they are not used, thus saving power. The capacity of the micro-op cache is 1536 micro-ops, which is equivalent to eight kilobytes of x86 code at maximum.
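For illustration, consider the following constructed loop (an editorial sketch, not code from the paper; decode_mix is a hypothetical function). With matching operand widths a compiler will typically fold the load into the add, producing a memory-access instruction that decodes into more than one micro-op and is therefore restricted to the complex decoder, while the register-only filler decodes into one micro-op per instruction and can use any of the four decoders:

    /* Hypothetical illustration of decoder pressure (not from the paper). */
    long decode_mix(const long *a, int n) {
        long sum = 0, t = 1;
        for (int j = 0; j < n; j++) {
            sum += a[j];       /* memory access: >= 2 micro-ops, complex decoder only */
            t = (t << 1) + 3;  /* shift and add: 1 micro-op each, any decoder */
            t ^= sum;          /* register-register xor: 1 micro-op */
        }
        return sum + t;
    }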
2.2 Running Average Power Limit (RAPL)

Intel introduced the RAPL feature in their Sandy Bridge architecture. It produces an estimate of the energy consumed by the processor using a model based on microarchitectural counters. RAPL is always enabled in supported models, and the energy estimates are updated every millisecond. RAPL supports up to four different power domains depending on the processor model. The Package domain estimates the energy consumed by the entire processor chip. Power plane 0 is a subdomain that estimates the energy consumed by the processor cores. Power plane 1 is another subdomain that estimates the energy consumed by the integrated graphics in desktop models. The DRAM domain is a separate domain that estimates the energy used by the dual in-line memory modules (DIMMs) installed in the system. In this paper, we present measurements of the RAPL Package domain exclusively.
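As a minimal sketch of how the Package domain counter can be read through the MSR driver interface mentioned in Section 3 (an illustrative example, not the paper's measurement tooling; it assumes the documented register layout, with MSR_RAPL_POWER_UNIT at address 0x606 and MSR_PKG_ENERGY_STATUS at 0x611, and requires root privileges plus the msr kernel module):

    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>

    static uint64_t read_msr(int fd, uint32_t reg) {
        uint64_t value = 0;
        /* The MSR address doubles as the file offset in /dev/cpu/N/msr. */
        if (pread(fd, &value, sizeof value, reg) != sizeof value)
            perror("pread");
        return value;
    }

    int main(void) {
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

        /* Bits 12:8 of MSR_RAPL_POWER_UNIT give the energy unit: 1 / 2^ESU joules. */
        uint64_t unit_reg = read_msr(fd, 0x606);
        double joules_per_tick = 1.0 / (double)(1u << ((unit_reg >> 8) & 0x1f));

        /* Two samples one second apart; the energy counter is a 32-bit
         * accumulator, so unsigned subtraction tolerates one wraparound. */
        uint32_t before = (uint32_t)read_msr(fd, 0x611);
        sleep(1);
        uint32_t after = (uint32_t)read_msr(fd, 0x611);

        printf("package power: %.2f W\n", (after - before) * joules_per_tick);
        close(fd);
        return 0;
    }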
3 Methodology

3.1 Hardware and Software

Table 1: Hardware used in the experiments.

    Processor:      Intel Core i7-4770 @ 3.40 GHz
    Architecture:   Haswell
    Cores:          4
    L3 cache:       8 MB
    Turbo Boost:    Supported but disabled
    Hyperthreading: Supported but disabled
    RAM:            2x 8 GB Kingston DDR3, 1600 MHz (dual-channel)
    Motherboard:    Intel DH87RL
    Power supply:   Corsair TX750

The hardware used in the experiments is listed in Table 1. An Intel Haswell desktop processor, released in 2013, is used for the experiments. The Turbo Boost feature is disabled to make the experiments repeatable without depending on thermal conditions; disabling it also ensures that the processor does not exceed its base frequency of 3.4 GHz. In addition, the hyperthreading feature is deliberately disabled in the BIOS of the machine. This gives access to eight programmable performance counters, compared with only four when hyperthreading is enabled.

Table 2: Software and kernel parameters used in the experiments.

    Operating system:        Scientific Linux 6.6
    Kernel:                  Linux 4.1.6
    Cpufreq governor:        Performance
    NMI watchdog:            Disabled
    Transparent hugepages:   Disabled
    Compiler:                GCC 4.4.7
    Performance measurement: PAPI library 5.4.1 and Perf 4.1.6

Table 2 lists the software components used in the experiments. Certain kernel settings are adjusted to ensure that the experiments are repeatable. The Cpufreq governor is set to performance, which runs the processor at the highest available frequency when it is not idle. This prevents the processor from running at frequencies lower than 3.4 GHz. The non-maskable interrupt (NMI) watchdog is disabled, which frees up one programmable performance counter. In addition, support for transparent hugepages is disabled since it can change performance characteristics while programs are running. The performance counters are read using both the PAPI (Performance Application Programming Interface) library [14] and the Perf tool. Perf is a low-level performance analysis tool that is shipped with the Linux kernel. The RAPL counters are read using the MSR driver interface.

GCC 4.4.7 is used as the compiler in the experiments since it is the default for the Linux distribution. The same compiler is likely still used by many applications running on Scientific Linux 6. The compiler flag -O2 is used to produce optimized binaries. Model-specific optimizations, such as Advanced Vector eXtensions (AVX), are not used.

3.2 Microbenchmark Design

[...] stored in a register. The benchmarks do not write to memory because that would require an additional parameter in the power model. The following C code shows microbenchmark #2:

    D += (A[j] << 3) * (A[j] << 4) * ((B[j] << 2) * 5 + 1);

In this case, A and B are integer arrays. Since instructions that access memory generate at least two micro-ops and hence are forced through the complex decoder, we need to add several filler operations that generate only one micro-op. Here we use bitwise shifts, multiplications, and additions as filler operations. This approach eliminates the complex decoder as a performance bottleneck, allowing us to stress all four decoders.

We write many variations of each benchmark. In addition to the previous two benchmarks, we have dozens of different variations of them.
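The full benchmark sources are in the repository linked in Section 1. The sketch below shows the general shape such a kernel can take (an editorial reconstruction, not the paper's exact code; the REP* macros, array sizes, and iteration counts are illustrative). The statement of microbenchmark #2 is pasted 256 times inside the loop, so the loop body exceeds the 1536-micro-op cache capacity and micro-ops must be delivered by the legacy decoders on every iteration:

    #include <stdio.h>

    #define N 4096
    #define REP4(x)   x x x x
    #define REP16(x)  REP4(REP4(x))
    #define REP256(x) REP16(REP16(x))

    static int A[N], B[N];

    int main(void) {
        long D = 0;
        /* Small nonzero values defeat constant folding and keep the
         * int arithmetic below free of signed overflow. */
        for (int j = 0; j < N; j++) {
            A[j] = j & 7;
            B[j] = (j >> 3) & 7;
        }
        /* Outer loop runs long enough for stable RAPL power readings. */
        for (int i = 0; i < 1000; i++) {
            for (int j = 0; j < N; j++) {
                /* ~2500 instructions per iteration: well beyond the
                 * 1536-micro-op cache, so the decoders stay active. */
                REP256(D += (A[j] << 3) * (A[j] << 4) * ((B[j] << 2) * 5 + 1);)
            }
        }
        printf("%ld\n", D);  /* keep the result live */
        return 0;
    }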