Processors That Can Do 20+ GFLOPS Per Watt - 2012-08-27 by Vincent - StreamComputing


Processors that can do 20+ GFLOPS per Watt
by vincent – Monday, 27 August 2012
http://www.streamcomputing.eu/blog/2012-08-27/processors-that-can-do-20-gflops-watt/

[Image caption: System for communicating the power-efficiency of new equipment. "A" being best, "F" being worst. 2011-A is not comparable with 2012-A.]

For yearly power usage there is a rule of thumb which states that a device that is continuously on costs its wattage times 1.5 in Euro per year. So the computer in front of me, which draws around 107 Watt, would cost me about €160 a year if I left it on. A moderate cluster with several GPUs of a few hundred Watts each would cost a few thousand Euros a year. I would say: very doable for most companies. So why care about performance per Watt?

There is more to a Watt than just the cost. The energy needed to cool a cluster is quite high, as most of the energy escapes as heat. And then there is the growing demand for portable power. In case you are thinking of swiping your credit card for a top-10 supercomputer, these energy costs become extremely high. In this article I try to get an overview of who is entering the 20+ GFLOPS/Watt area. All processors that do less than 20 GFLOPS/Watt need other advantages to survive. And you'll see that all the green processors are programmed with OpenCL, the technology StreamComputing is all about.

IMPORTANT: the total power used sometimes includes and sometimes excludes memory transfers, so the comparison below is not entirely fair. The figures for the graphics cards include memory transfers, while those for the CPUs and SoCs do not.

The list

Understand that since I mix CPUs, GPUs and SoCs (= CPU + GPU), the list is really only an indication of what is possible. Also, a computer is built up of more energy-consuming parts than just the processors: interconnects, memory, hard drives, etc.

Disclaimer: the list below is incomplete and based on theoretical values. TDP is assumed to be consumed when the processor is working at maximum performance. Actual FLOPS/Watt values can be much lower, depending on many factors. If you want to buy hardware specifically for the highest FLOPS/Watt, have your software tested on the device.

| Processor | Type | GFLOPS (32-bit) | GFLOPS (64-bit) | Watt (TDP) | GFLOPS/Watt (32-bit) | GFLOPS/Watt (64-bit) |
|---|---|---|---|---|---|---|
| Adapteva Epiphany-IV | Epiphany | 100 | N/A | 2 | 50 | N/A |
| Movidius Myriad | ARM SoC: LEON3 + SHAVE | 15.28 | N/A | 0.32 | 48 | N/A |
| ZiiLabs | ARM SoC | 58 | N/A | ? | 20? | N/A |
| Nvidia Tesla K10 | X86 GPU | 4577 | 190 | 225 | 20.34 | ? |
| ARM + MALI T604 | ARM SoC | 8 + 68 | N/A | 4? | 19? | N/A |
| NVidia GTX 690 | X86 GPU x 2 | 5621 | 234? | 300 | 18.74 | 0.78 |
| GeForce GTX 680 | X86 GPU | 3090 | 128 | 195 | 15.85 | 0.65 |
| AMD Radeon HD 7970 GHz | X86 GPU | 4300 | 1075 | 300+ | 14.3 | 3.58 |
| Intel Knight's Corner (Xeon Phi) | X87? | 2000? | 1000 | 200? | 10? | 5? |
| AMD A10-5800K + HD 7660D | X86 SoC | 121 + 614 | ? | 100 | 7.35 | ? |
| Intel Core i7-3770 + HD4000 | X86 SoC | 225 + 294.4 | 112 + 73.6 | 77 | 6.74 | 2.41 |
| NVIDIA CARMA (complete board) | ARM + GPU | ? + 200 | ? | 40 | 5.00 | ? |
| IBM Power A2 | Power CPU | 204? | 204 | 55 | 3.72? | 3.72 |
| Intel Core i7-3770 | X86 CPU | 225 | 112 | ? | ? | ? |
| AMD A10-5800K | X86 CPU | 121 | 60? | ? | ? | ? |
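As a back-of-the-envelope aid (not part of the original article), the sketch below shows how the theoretical GFLOPS/Watt figures in the table follow from peak GFLOPS and TDP, and how the "Watt times 1.5" rule of thumb above turns wattage into a yearly Euro cost. The input numbers are the theoretical 32-bit values from the table; the helper names are purely illustrative.

```python
# Hypothetical helpers; theoretical values taken from the table above.

def gflops_per_watt(peak_gflops, tdp_watt):
    """Energy efficiency as used in the table: theoretical peak divided by TDP."""
    return peak_gflops / tdp_watt

def yearly_cost_eur(average_watt, eur_per_watt_year=1.5):
    """Rule of thumb from the article: a device that is continuously on
    costs roughly its wattage times 1.5 in Euro per year."""
    return average_watt * eur_per_watt_year

print(gflops_per_watt(100, 2))     # Adapteva Epiphany-IV: 50.0 GFLOPS/Watt
print(gflops_per_watt(4577, 225))  # Nvidia Tesla K10: ~20.3 GFLOPS/Watt
print(yearly_cost_eur(107))        # the 107 W desktop mentioned above: ~160 EUR/year
```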
The list contains recent and generally available processors, but I will add any processor you want to see in the list – just request it in a comment. Please also point me to sources where official data on these processors can be found, as it seems to be top-secret data. As not all the data was available, I had to make some guesses.

Below you find a graph of the list, with architectures grouped by GFLOPS and GFLOPS/Watt.

[Figure: GFLOPS/Watt for 32-bit. Red: CPUs, orange: APUs, yellow: GPUs, light blue: ARM, green: grid processors, not circled: Phi. The upper-right area is where we need to go.]

Below is a maybe more interesting view: Watt/GFLOPS. This projection has the advantage that low-power processors (< 2 Watt) don't get overrated and lie closer together.

[Figure: Watt/GFLOPS (lower is better) vs GFLOPS, excluding the CPUs. You see the Radeons doing best when it comes to performance and Watt/GFLOPS. The upper-left area is where we need to go.]

CPU vs GPU

Let's be clear:

1. A GPU needs a CPU as a host.
2. A GPU is great at vector computations, while a CPU is much better at scalar computations.

In other words, a mix of a scalar and a vector processor is best. But once a problem can be defined as a vector problem, the GPU is much, much faster than a CPU.

64 bit vs 32 bit

As memory access consumes energy, and 64-bit data halves the number of values that arrive at the processor per transfer, there are two reasons why more energy is consumed. Due to architecture differences, CPUs have a penalty for 32-bit and GPUs a penalty for 64-bit. Notice that most X86 alternatives have no 64-bit support, or only recently started with it. GPUs crunch double-precision numbers at a fourth or less of their 32-bit performance roof.

Architectures

ARM, X86/X87, Power and Epiphany all make different architecture choices to reach their targeted trade-off between precision, power consumption and performance optimisation (control unit). These choices sometimes make it impossible to keep pace with other architectures in a certain direction.

Current winner: Adapteva Epiphany

Their 64-core Epiphany-IV is programmable with OpenCL, and its 50 GFLOPS/Watt makes it worth putting time into porting software if you need a portable device. People who have already ported their software to OpenCL have an advantage here. Adapteva even claims 72 GFLOPS/Watt, as you can read here. With a 100-core CPU coming up, they will probably raise the bar even further.

X86 CPUs have the advantage of precision and legacy code, of which precision is the biggest advantage. As X86 GPUs (with Nvidia on top) offer great performance per Watt, entering the 20+ GFLOPS/Watt range, this could be very interesting for defending the X86 market against ARM.

ARM processors have a lot of software written for them (via Android) and are very flexible in design, while keeping power usage for the CPU part around 1 Watt. For instance, ZiiLabs' processor can be compared to Adapteva's design, but with an ARM CPU attached to it.
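To make the "vector problem" point concrete, here is a minimal OpenCL sketch: a plain vector addition written with pyopencl (my own assumption for illustration; the article ships no code, and any OpenCL host API such as C or C++ would do equally well). It runs on whichever device an OpenCL driver exposes. Switching the float buffers and kernel to double is exactly where the 64-bit penalty discussed above shows up on most GPUs.

```python
# Minimal OpenCL vector addition via pyopencl (assumed installed together with
# an OpenCL driver); purely illustrative, not code from the article.
import numpy as np
import pyopencl as cl

n = 1_000_000
a = np.random.rand(n).astype(np.float32)   # 32-bit floats: the fast path on most GPUs
b = np.random.rand(n).astype(np.float32)

ctx = cl.create_some_context()             # picks any available OpenCL device
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

program = cl.Program(ctx, """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *out)
{
    int gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}
""").build()

program.vadd(queue, a.shape, None, a_buf, b_buf, out_buf)   # one work-item per element

result = np.empty_like(a)
cl.enqueue_copy(queue, result, out_buf)
print(np.allclose(result, a + b))          # True if the device result matches the host
```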
Conclusion

There is much more to it than just this GFLOPS/Watt number, and which architecture will be the mainstream architecture in a few years one can only speculate on. Luckily, recompiling for other architectures is getting easier with compiler technologies such as LLVM, so we don't need to worry too much. Except that we have to redesign our software for multi-core, of course. You have read above that the new architectures are programmed with OpenCL. It is better to invest in this technology now than later.

More reading

As memory access takes energy, minimising memory calls can lower consumption. This article on the ARM blog explains how this is done with MALI GPUs.

The Mont Blanc project is a supercomputer based on ARM. This 12-page PDF shows some numbers and specifications of this supercomputer.

As supercomputers eat lots of power, the Green 500 tries to stimulate building greener HPC.

Also check out these posts

The entanglement of Bitcoins and compute-capa…
The OpenCL power: offloading to the CPU (AVX+…
AMD positions FirePro S10000 against both TES…
Intel's answer to AMD and NVIDIA: the XEON Ph…
Recommended publications
  • Accelerating HPL Using the Intel Xeon Phi 7120P Coprocessors
    Accelerating HPL using the Intel Xeon Phi 7120P Coprocessors. Authors: Saeed Iqbal and Deepthi Cherlopalle. The Intel Xeon Phi series can be used to accelerate HPC applications in the C4130. The highly parallel architecture of the Phi coprocessors can boost parallel applications. These coprocessors work seamlessly with the standard Xeon E5 processor series to provide additional parallel hardware to boost parallel applications. A key benefit of the Xeon Phi series is that it does not require redesigning the application; only compiler directives are required to be able to use the Xeon Phi coprocessor. Fundamentally, the Intel Xeon Phi series are many-core parallel processors, with each core having a dedicated L2 cache. The cores are connected through a bidirectional ring interconnect. Intel offers a complete set of development, performance-monitoring and tuning tools through its Parallel Studio and VTune. The goal is to enable HPC users to take advantage of the parallel hardware with minimal changes to the code. The Xeon Phi has two modes of operation, the offload mode and the native mode. In the offload mode, designated parts of the application are "offloaded" to the Xeon Phi, if available in the server. Required code and data are copied from the host to the coprocessor, processing is done in parallel on the Phi coprocessor, and results are moved back to the host. There are two kinds of offload modes, non-shared and virtual-shared memory modes. Each offload mode offers different levels of user control over data movement to and from the coprocessor and incurs different types of overheads. In the native mode, the application runs on both the host and the Xeon Phi simultaneously, communicating required data between them as needed.
  • Intel Cirrascale and Petrobras Case Study
    Case Study: Intel® Xeon Phi™ Coprocessor, Intel® Xeon® Processor E5 Family. Big Data Analytics, High-Performance Computing, Energy. Accelerating Energy Exploration with Intel® Xeon Phi™ Coprocessors: Cirrascale delivers scalable performance by combining its innovative PCIe switch riser with Intel® processors and coprocessors. To find new oil and gas reservoirs, organizations are focusing exploration in the deep sea and in complex geological formations. As energy companies such as Petrobras work to locate and map those reservoirs, they need powerful IT resources that can process multiple iterations of seismic models and quickly deliver precise results. IT solution provider Cirrascale began building systems with Intel® Xeon Phi™ coprocessors to provide the scalable performance Petrobras and other customers need while holding down costs. Challenges: • Enable deep-sea exploration. Improve reservoir mapping accuracy with detailed seismic processing. • Accelerate performance of seismic applications. Speed time to results while controlling costs. • Improve scalability. Enable server performance and density to scale as data volumes grow and workloads become more demanding. Solution: • Custom Cirrascale servers with Intel Xeon Phi coprocessors. Employ new compute blades with the Intel® Xeon® processor E5 family and Intel Xeon Phi coprocessors. Cirrascale uses custom PCIe switch risers for fast, peer-to-peer communication among coprocessors. Results: • Linear scaling. Performance increases linearly as Intel Xeon Phi coprocessors are added to the system. • Simplified development model. Developers no longer need to spend time optimizing data placement. Business value: • Faster, better analysis. More detailed and accurate modeling in less time improves oil and gas exploration. "Working together, the Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors help HPC applications shed…"
  • SIMD Extensions
    SIMD Extensions. Contents: SIMD; MMX (instruction set); 3DNow!; Streaming SIMD Extensions; SSE2; SSE3; SSSE3; SSE4; SSE5; Advanced Vector Extensions; CVT16 instruction set; XOP instruction set. Single instruction, multiple data (SIMD) is a class of parallel computers in Flynn's taxonomy (single instruction + single data = SISD, multiple instruction + single data = MISD, single instruction + multiple data = SIMD, multiple instruction + multiple data = MIMD). It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously. Thus, such machines exploit data-level parallelism. History: The first use of SIMD instructions was in vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a vector of data with a single instruction. Vector processing was especially popularized by Cray in the 1970s and 1980s. Vector-processing architectures are now considered separate from SIMD machines, based on the fact that vector machines processed the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of the vector simultaneously.[1] The first era of modern SIMD machines was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2. These machines had many limited-functionality processors that would work in parallel.
  • Recent International Trade Commission Representations
    Recent International Trade Commission Representations. Certain Mobile Electronic Devices and Radio Frequency and Processing Components Thereof (II), Inv. No. 337-TA-1093 (ITC 2019). Quinn Emanuel was lead counsel for Qualcomm in a patent infringement action against Apple in the International Trade Commission. Qualcomm alleged that Apple engaged in the unlawful importation and sale of iPhones that infringe one or more claims of five Qualcomm patents covering key technologies that enable important features and functions in the iPhones. After a seven-day hearing, Administrative Law Judge McNamara issued an Initial Determination finding for Qualcomm on all issues related to claim 1 of U.S. Patent 8,063,674, related to an improved "Power on Control" circuit. ALJ McNamara recommended that the Commission issue a limited exclusion order with respect to the accused iPhone devices. Although the case settled shortly after ALJ McNamara recommended the exclusion order, the order would have resulted in the exclusion of all iPhones and iPads without Qualcomm baseband processors from being imported into the United States. Certain Magnetic Tape Cartridges and Components Thereof, Inv. No. 337-TA-1058 (ITC 2019): We represented Sony in a multifront battle against Fujifilm arising from Fujifilm's anticompetitive conduct seeking to exclude Sony from the Linear Tape-Open magnetic tape market. LTO tape products are used to store large quantities of data by companies in a wide range of industries, including health care, education, finance and banking. Sony filed a complaint in the ITC seeking an exclusion order against Fujifilm's products based on their infringement of three Sony patents covering various aspects of magnetic data storage technology.
  • Power Measurement Tutorial for the Green500 List
    Power Measurement Tutorial for the Green500 List. R. Ge, X. Feng, H. Pyla, K. Cameron, W. Feng. June 27, 2007. Contents: 1 The Metric for Energy-Efficiency Evaluation; 2 How to Obtain P̄(Rmax)? (2.1 The Definition of P̄(Rmax); 2.2 Deriving P̄(Rmax) from Unit Power; 2.3 Measuring Unit Power); 3 The Measurement Procedure (3.1 Equipment Check List; 3.2 Software Installation; 3.3 Hardware Connection; 3.4 Power Measurement Procedure); 4 Appendix (4.1 Frequently Asked Questions; 4.2 Resources). 1 The Metric for Energy-Efficiency Evaluation: This tutorial serves as a practical guide for measuring the computer system power that is required as part of a Green500 submission. It describes the basic procedures to be followed in order to measure the power consumption of a supercomputer. A supercomputer that appears on The TOP500 List can easily consume megawatts of electric power. This power consumption may lead to operating costs that exceed acquisition costs as well as intolerable system failure rates. In recent years, we have witnessed an increasingly stronger movement towards energy-efficient computing systems in academia, government, and industry. Thus, the purpose of the Green500 List is to provide a ranking of the most energy-efficient supercomputers in the world and serve as a complementary view to the TOP500 List. However, as pointed out in [1, 2], identifying a single objective metric for energy efficiency in supercomputers is a difficult task. Based on [1, 2] and given the already existing use of the "performance per watt" metric, the Green500 List uses "performance per watt" (PPW) as its metric to rank the energy efficiency of supercomputers.
  • Towards Better Performance Per Watt in Virtual Environments on Asymmetric Single-ISA Multi-Core Systems
    Towards Better Performance Per Watt in Virtual Environments on Asymmetric Single-ISA Multi-core Systems. Viren Kumar and Alexandra Fedorova, Simon Fraser University, 8888 University Dr, Vancouver, Canada. [email protected], [email protected]. ABSTRACT: Single-ISA heterogeneous multicore architectures promise to deliver plenty of cores with varying complexity, speed and performance in the near future. Virtualization enables multiple operating systems to run concurrently as distinct, independent guest domains, thereby reducing core idle time and maximizing throughput. This paper seeks to identify a heuristic that can aid in intelligently scheduling these virtualized workloads to maximize performance while reducing power consumption. We propose that the controlling domain in a Virtual Machine Monitor or hypervisor is relatively insensitive to changes in core frequency, and thus scheduling it on a slower core saves power while only slightly affecting guest domain performance.
    … performance per watt than homogeneous multicore processors. As power consumption in data centers becomes a growing concern [3], deploying ASISA multicore systems is an increasingly attractive opportunity. These systems perform at their best if application workloads are assigned to heterogeneous cores in consideration of their runtime properties [4][13][12][18][24][21]. Therefore, understanding how to schedule data-center workloads on ASISA systems is an important problem. This paper takes the first step towards understanding the properties of data center workloads that determine how they should be scheduled on ASISA multicore processors. Since virtual machine technology is a de facto standard for data centers, we study virtual machine (VM) workloads.
  • Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs
    Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs. Ehsan Totoni, Babak Behzad, Swapnil Ghike, Josep Torrellas, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA. E-mail: totoni2, bbehza2, ghike2, [email protected]. Abstract: Power dissipation and energy consumption are becoming increasingly important architectural design constraints in different types of computers, from embedded systems to large-scale supercomputers. To continue the scaling of performance, it is essential that we build parallel processor chips that make the best use of exponentially increasing numbers of transistors within the power and energy budgets. The Intel SCC is an appealing option for future many-core architectures. In this paper, we use various scalable applications to quantitatively compare and analyze the performance, power consumption and energy efficiency of different cutting-edge platforms that differ in architectural build. These platforms include the Intel Single-Chip Cloud Computer (SCC) many-core, the Intel Core i7 general-purpose multi-core, the Intel Atom low-power processor, and the Nvidia ION2 …
    A key architectural challenge now is how to support increasing parallelism and scale performance, while being power and energy efficient. There are multiple options on the table, namely "heavy-weight" multi-cores (such as general-purpose processors), "light-weight" many-cores (such as Intel's Single-Chip Cloud Computer (SCC) [1]), low-power processors (such as embedded processors), and SIMD-like highly-parallel architectures (such as General-Purpose Graphics Processing Units (GPGPUs)). The Intel SCC [1] is a research chip made by Intel Labs to explore future many-core architectures. It has 48 Pentium (P54C) cores in 24 tiles of two cores each.
  • NVIDIA Ampere GA102 GPU Architecture Whitepaper
    NVIDIA Ampere GA102 GPU Architecture: Second-Generation RTX. Updated with NVIDIA RTX A6000 and NVIDIA A40 Information, V2.0. Table of Contents: Introduction; GA102 Key Features (2x FP32 Processing; Second-Generation RT Core; Third-Generation Tensor Cores; GDDR6X and GDDR6 Memory; Third-Generation NVLink®; PCIe Gen 4); Ampere GPU Architecture In-Depth (GPC, TPC, and SM High-Level Architecture; ROP Optimizations; GA10x SM Architecture; 2x FP32 Throughput; Larger and Faster Unified Shared Memory and L1 Data Cache; Performance Per Watt); Second-Generation Ray Tracing Engine in GA10x GPUs (Ampere Architecture RTX Processors in Action; GA10x GPU Hardware Acceleration for Ray-Traced Motion Blur); Third-Generation Tensor Cores in GA10x GPUs (Comparison of Turing vs GA10x GPU Tensor Cores; NVIDIA Ampere Architecture Tensor Cores Support New DL Data Types; Fine-Grained Structured Sparsity; NVIDIA DLSS 8K); GDDR6X Memory; RTX IO (Introducing NVIDIA RTX IO; How NVIDIA RTX IO Works); Display and Video Engine (DisplayPort 1.4a with DSC 1.2a; HDMI 2.1 with DSC 1.2a; Fifth Generation NVDEC - Hardware-Accelerated Video Decoding; AV1 Hardware Decode; Seventh Generation NVENC - Hardware-Accelerated Video Encoding); Conclusion; Appendix A - Additional GeForce GA10x GPU Specifications (GeForce RTX 3090; GeForce RTX 3070); Appendix B - New Memory Error Detection and Replay (EDR) Technology; Appendix C - RTX A6000 GPU Performance; List of Figures.
  • Using Intel Processors for DSP Applications
    Using Intel® Processors for DSP Applications: Comparing the Performance of Freescale MPC8641D and Two Intel Core™2 Duo Processors. Mike Delves, N.A. Software Ltd; David Tetley, GE Fanuc Intelligent Platforms. Introduction: The high performance embedded DSP processor market has moved steadily over the past decade from one dominated by specialised fixed point processors towards those with recognisable similarities to the general-purpose processor market: DSP processors have provided floating point, substantial caches, and capable C/C++ compilers, as well as SIMD (Single Instruction Multiple Data) features providing fast vector operations. The inclusion of SIMD features in general purpose processors has led to them replacing specialised DSPs in many high performance embedded applications (such as Radar, SAR, SIGINT and image processing). Indeed, probably the most widely used general purpose processor for DSP in the past five years is the PowerPC family, which was used for a long while in the Apple range of PCs. The G4 generation of the PowerPC has dominated the embedded high performance computing market for over 8 years. If the PowerPC can be accepted both in the general-purpose and the embedded DSP markets, what about other processor families? This question is motivated by the very fast rate of development of general-purpose silicon over the past four years: faster cycle times, larger caches, faster front side buses, lower-power variants, multi-core technology, vector instruction sets, and plans for more of everything well into the future, with development funded by the huge general-purpose market. Here, we look in particular at how well the current family of Intel® low power processors perform against the PowerPC.
  • The OpenGL ES Shading Language
    The OpenGL ES® Shading Language. Language Version: 3.20, Document Revision: 12, 2015. Editor: Robert J. Simpson, Qualcomm OpenGL GLSL editor: John Kessenich, LunarG GLSL version 1.1 Authors: John Kessenich, Dave Baldwin, Randi Rost. Copyright (c) 2013-2015 The Khronos Group Inc. All Rights Reserved. This specification is protected by copyright laws and contains material proprietary to the Khronos Group, Inc. It or any components may not be reproduced, republished, distributed, transmitted, displayed, broadcast, or otherwise exploited in any manner without the express prior written permission of Khronos Group. You may use this specification for implementing the functionality therein, without altering or removing any trademark, copyright or other notice from the specification, but the receipt or possession of this specification does not convey any rights to reproduce, disclose, or distribute its contents, or to manufacture, use, or sell anything that it may describe, in whole or in part. Khronos Group grants express permission to any current Promoter, Contributor or Adopter member of Khronos to copy and redistribute UNMODIFIED versions of this specification in any fashion, provided that NO CHARGE is made for the specification and the latest available update of the specification for any version of the API is used whenever possible. Such distributed specification may be reformatted AS LONG AS the contents of the specification are not changed in any way. The specification may be incorporated into a product that is sold as long as such product includes significant independent work developed by the seller. A link to the current version of this specification on the Khronos Group website should be included whenever possible with specification distributions.
  • The OpenGL ES Shading Language
    The OpenGL ES® Shading Language Language Version: 3.00 Document Revision: 6 29 January 2016 Editor: Robert J. Simpson, Qualcomm OpenGL GLSL editor: John Kessenich, LunarG GLSL version 1.1 Authors: John Kessenich, Dave Baldwin, Randi Rost Copyright © 2008-2016 The Khronos Group Inc. All Rights Reserved. This specification is protected by copyright laws and contains material proprietary to the Khronos Group, Inc. It or any components may not be reproduced, republished, distributed, transmitted, displayed, broadcast, or otherwise exploited in any manner without the express prior written permission of Khronos Group. You may use this specification for implementing the functionality therein, without altering or removing any trademark, copyright or other notice from the specification, but the receipt or possession of this specification does not convey any rights to reproduce, disclose, or distribute its contents, or to manufacture, use, or sell anything that it may describe, in whole or in part. Khronos Group grants express permission to any current Promoter, Contributor or Adopter member of Khronos to copy and redistribute UNMODIFIED versions of this specification in any fashion, provided that NO CHARGE is made for the specification and the latest available update of the specification for any version of the API is used whenever possible. Such distributed specification may be reformatted AS LONG AS the contents of the specification are not changed in any way. The specification may be incorporated into a product that is sold as long as such product includes significant independent work developed by the seller. A link to the current version of this specification on the Khronos Group website should be included whenever possible with specification distributions.
  • Quantitative Analysis Modern Processor Design: Fundamentals of Superscalar Processors
    Chapter 1: Quantitative Analysis. Modern Processor Design: Fundamentals of Superscalar Processors. Mark Heinrich, School of Computer Science, University of Central Florida. Define and quantify power (1/2): For CMOS chips, the traditionally dominant energy consumption has been in switching transistors, called dynamic power: Power_dynamic = 1/2 × CapacitiveLoad × Voltage² × FrequencySwitched. For mobile devices, energy is the better metric: Energy_dynamic = CapacitiveLoad × Voltage². For a fixed task, slowing the clock rate (frequency switched) reduces power, but not energy. Capacitive load is a function of the number of transistors connected to an output and of the technology, which determines the capacitance of wires and transistors. Dropping voltage helps both, so supply voltages went from 5V to 1V. To save energy and dynamic power, most CPUs now turn off the clock of inactive modules (e.g. the floating-point unit). Example of quantifying power: Suppose a 15% reduction in voltage results in a 15% reduction in frequency. What is the impact on dynamic power? Power_dynamic = 1/2 × CapacitiveLoad × (0.85 × Voltage)² × (0.85 × FrequencySwitched) = (0.85)³ × OldPower_dynamic ≈ 0.6 × OldPower_dynamic. Because the voltage and frequency reductions are linear while their combined effect on power is cubic, be careful with statements like "I saved x% in power with ONLY a y% performance decrease" - use better metrics. Define and quantify power (2/2): Because leakage current flows even when a transistor is off, static power is now important too: Power_static = Current_static × Voltage. Leakage…