What Every Scientist Should Know About Floating-Point Arithmetic

Total Page:16

File Type:pdf, Size:1020Kb

What Every Scientist Should Know About Floating-Point Arithmetic What Every Computer Scientist Should Know About Floating-Point Arithmetic 2550 Garcia Avenue Mountain View, CA 94043 U.S.A. Part No: 800-7895-10 Revision A, June 1992 1994 Sun Microsystems, Inc. 2550 Garcia Avenue, Mountain View, California 94043-1100 U.S.A. All rights reserved. This product and related documentation are protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or related documentation may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Portions of this product may be derived from the UNIX® and Berkeley 4.3 BSD systems, licensed from UNIX System Laboratories, Inc., a wholly owned subsidiary of Novell, Inc., and the University of California, respectively. Third-party font software in this product is protected by copyright and licensed from Sun’s font suppliers. RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the United States Government is subject to the restrictions set forth in DFARS 252.227-7013 (c)(1)(ii) and FAR 52.227-19. The product described in this manual may be protected by one or more U.S. patents, foreign patents, or pending applications. TRADEMARKS Sun, the Sun logo, Sun Microsystems, Sun Microsystems Computer Corporation, Solaris, are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and certain other countries. UNIX is a registered trademark of Novell, Inc., in the United States and other countries; X/Open Company, Ltd., is the exclusive licensor of such trademark. OPEN LOOK® is a registered trademark of Novell, Inc. PostScript and Display PostScript are trademarks of Adobe Systems, Inc. All other product names mentioned herein are the trademarks of their respective owners. All SPARC trademarks, including the SCD Compliant Logo, are trademarks or registered trademarks of SPARC International, Inc. SPARCstation, SPARCserver, SPARCengine, SPARCstorage, SPARCware, SPARCcenter, SPARCclassic, SPARCcluster, SPARCdesign, SPARC811, SPARCprinter, UltraSPARC, microSPARC, SPARCworks, and SPARCompiler are licensed exclusively to Sun Microsystems, Inc. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. The OPEN LOOKand Sun™ Graphical User Interfaces were developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun’s licensees who implement OPEN LOOK GUIs and otherwise comply with Sun’s written license agreements. X Window System is a product of the Massachusetts Institute of Technology. THIS PUBLICATION IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT. THIS PUBLICATION COULD INCLUDE TECHNICAL INACCURACIES OR TYPOGRAPHICAL ERRORS. CHANGES ARE PERIODICALLY ADDED TO THE INFORMATION HEREIN; THESE CHANGES WILL BE INCORPORATED IN NEW EDITIONS OF THE PUBLICATION. SUN MICROSYSTEMS, INC. MAY MAKE IMPROVEMENTS AND/OR CHANGES IN THE PRODUCT(S) AND/OR THE PROGRAM(S) DESCRIBED IN THIS PUBLICATION AT ANY TIME Please Recycle Contents Abstract . 1 Introduction . 2 Rounding Error . 2 Floating-point Formats . 3 Relative Error and Ulps . 5 Guard Digits. 6 Cancellation . 8 The IEEE Standard. 17 Formats and Operations . 18 Special Quantities . 24 NaNs. 25 Exceptions, Flags and Trap Handlers . 32 Systems Aspects. 37 Instruction Sets. 38 Languages and Compilers . 40 iii Exception Handling. 48 The Details . 50 Rounding Error . 50 Errors In Summation . 61 Summary. 62 Acknowledgments. 63 References . 63 Theorem 14 and Theorem 8 . 65 iv What Every Computer Scientist Should Know About Floating-Point Arithmetic—June 1992 What Every Computer Scientist Should Know About Floating-Point Arithmetic Note – This document is a reprint of the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, published in the March, 1991 issue of Computing Surveys. Copyright 1991, Association for Computing Machinery, Inc., reprinted by permission. Abstract Floating-point arithmetic is considered an esoteric subject by many people. This is rather surprising because floating-point is ubiquitous in computer systems. Almost every language has a floating-point datatype; computers from PC’s to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow. This paper presents a tutorial on those aspects of floating- point that have a direct impact on designers of computer systems. It begins with background on floating-point representation and rounding error, continues with a discussion of the IEEE floating-point standard, and concludes with numerous examples of how computer builders can better support floating-point. Categories and Subject Descriptors: (Primary) C.0 [Computer Systems Organization]: General — instruction set design; D.3.4 [Programming Languages]: Processors — compilers, optimization; G.1.0 [Numerical Analysis]: General — computer arithmetic, error analysis, numerical algorithms (Secondary) 1 D.2.1 [Software Engineering]: Requirements/Specifications — languages; D.3.4 [Programming Languages]: Formal Definitions and Theory — semantics; D.4.1 [Operating Systems]: Process Management — synchronization. General Terms: Algorithms, Design, Languages Additional Key Words and Phrases: Denormalized number, exception, floating- point, floating-point standard, gradual underflow, guard digit, NaN, overflow, relative error, rounding error, rounding mode, ulp, underflow. Introduction Builders of computer systems often need information about floating-point arithmetic. There are, however, remarkably few sources of detailed information about it. One of the few books on the subject, Floating-Point Computation by Pat Sterbenz, is long out of print. This paper is a tutorial on those aspects of floating-point arithmetic (floating-point hereafter) that have a direct connection to systems building. It consists of three loosely connected parts. The first (Section , “Rounding Error,” on page 2) discusses the implications of using different rounding strategies for the basic operations of addition, subtraction, multiplication and division. It also contains background information on the two methods of measuring rounding error, ulps and relative error. The second part discusses the IEEE floating-point standard, which is becoming rapidly accepted by commercial hardware manufacturers. Included in the IEEE standard is the rounding method for basic operations. The discussion of the standard draws on the material in Section , “Rounding Error,” on page 2. The third part discusses the connections between floating-point and the design of various aspects of computer systems. Topics include instruction set design, optimizing compilers and exception handling. I have tried to avoid making statements about floating-point without also giving reasons why the statements are true, especially since the justifications involve nothing more complicated than elementary calculus. Those explanations that are not central to the main argument have been grouped into a section called “The Details,” so that they can be skipped if desired. In particular, the proofs of many of the theorems appear in this section. The end of each proof is marked with the ❚ symbol; when a proof is not included, the ❚ appears immediately following the statement of the theorem. Rounding Error Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In 2 What Every Computer Scientist Should Know About Floating-Point Arithmetic contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation. “Relative Error and Ulps” on page 5 describes how it is measured. Since most floating-point calculations have rounding error anyway, does it matter if the basic arithmetic operations introduce a little bit more rounding error than necessary? That question is a main theme throughout this section. “Guard Digits” on page 6 discusses guard digits, a means of reducing the error when subtracting two nearby numbers. Guard digits were considered sufficiently important by IBM that in 1968 it added a guard digit to the double precision format in the System/360 architecture (single precision already had a guard digit), and retrofitted all existing machines in the field. Two examples are given to illustrate the utility of guard digits. The IEEE standard goes further than just requiring the use of a guard digit. It gives an algorithm for addition, subtraction, multiplication, division and square root, and requires that implementations produce the same result as that algorithm. Thus when a program is moved from one machine to another, the results of the basic operations will
Recommended publications
  • Nios II Custom Instruction User Guide
    Nios II Custom Instruction User Guide Subscribe UG-20286 | 2020.04.27 Send Feedback Latest document on the web: PDF | HTML Contents Contents 1. Nios II Custom Instruction Overview..............................................................................4 1.1. Custom Instruction Implementation......................................................................... 4 1.1.1. Custom Instruction Hardware Implementation............................................... 5 1.1.2. Custom Instruction Software Implementation................................................ 6 2. Custom Instruction Hardware Interface......................................................................... 7 2.1. Custom Instruction Types....................................................................................... 7 2.1.1. Combinational Custom Instructions.............................................................. 8 2.1.2. Multicycle Custom Instructions...................................................................10 2.1.3. Extended Custom Instructions................................................................... 11 2.1.4. Internal Register File Custom Instructions................................................... 13 2.1.5. External Interface Custom Instructions....................................................... 15 3. Custom Instruction Software Interface.........................................................................16 3.1. Custom Instruction Software Examples................................................................... 16
    [Show full text]
  • Bit Nove Signal Interface Processor
    www.audison.eu bit Nove Signal Interface Processor POWER SUPPLY CONNECTION Voltage 10.8 ÷ 15 VDC From / To Personal Computer 1 x Micro USB Operating power supply voltage 7.5 ÷ 14.4 VDC To Audison DRC AB / DRC MP 1 x AC Link Idling current 0.53 A Optical 2 sel Optical In 2 wire control +12 V enable Switched off without DRC 1 mA Mem D sel Memory D wire control GND enable Switched off with DRC 4.5 mA CROSSOVER Remote IN voltage 4 ÷ 15 VDC (1mA) Filter type Full / Hi pass / Low Pass / Band Pass Remote OUT voltage 10 ÷ 15 VDC (130 mA) Linkwitz @ 12/24 dB - Butterworth @ Filter mode and slope ART (Automatic Remote Turn ON) 2 ÷ 7 VDC 6/12/18/24 dB Crossover Frequency 68 steps @ 20 ÷ 20k Hz SIGNAL STAGE Phase control 0° / 180° Distortion - THD @ 1 kHz, 1 VRMS Output 0.005% EQUALIZER (20 ÷ 20K Hz) Bandwidth @ -3 dB 10 Hz ÷ 22 kHz S/N ratio @ A weighted Analog Input Equalizer Automatic De-Equalization N.9 Parametrics Equalizers: ±12 dB;10 pole; Master Input 102 dBA Output Equalizer 20 ÷ 20k Hz AUX Input 101.5 dBA OPTICAL IN1 / IN2 Inputs 110 dBA TIME ALIGNMENT Channel Separation @ 1 kHz 85 dBA Distance 0 ÷ 510 cm / 0 ÷ 200.8 inches Input sensitivity Pre Master 1.2 ÷ 8 VRMS Delay 0 ÷ 15 ms Input sensitivity Speaker Master 3 ÷ 20 VRMS Step 0,08 ms; 2,8 cm / 1.1 inch Input sensitivity AUX Master 0.3 ÷ 5 VRMS Fine SET 0,02 ms; 0,7 cm / 0.27 inch Input impedance Pre In / Speaker In / AUX 15 kΩ / 12 Ω / 15 kΩ GENERAL REQUIREMENTS Max Output Level (RMS) @ 0.1% THD 4 V PC connections USB 1.1 / 2.0 / 3.0 Compatible Microsoft Windows (32/64 bit): Vista, INPUT STAGE Software/PC requirements Windows 7, Windows 8, Windows 10 Low level (Pre) Ch1 ÷ Ch6; AUX L/R Video Resolution with screen resize min.
    [Show full text]
  • Scilab Textbook Companion for Numerical Methods for Engineers by S
    Scilab Textbook Companion for Numerical Methods For Engineers by S. C. Chapra And R. P. Canale1 Created by Meghana Sundaresan Numerical Methods Chemical Engineering VNIT College Teacher Kailas L. Wasewar Cross-Checked by July 31, 2019 1Funded by a grant from the National Mission on Education through ICT, http://spoken-tutorial.org/NMEICT-Intro. This Textbook Companion and Scilab codes written in it can be downloaded from the "Textbook Companion Project" section at the website http://scilab.in Book Description Title: Numerical Methods For Engineers Author: S. C. Chapra And R. P. Canale Publisher: McGraw Hill, New York Edition: 5 Year: 2006 ISBN: 0071244298 1 Scilab numbering policy used in this document and the relation to the above book. Exa Example (Solved example) Eqn Equation (Particular equation of the above book) AP Appendix to Example(Scilab Code that is an Appednix to a particular Example of the above book) For example, Exa 3.51 means solved example 3.51 of this book. Sec 2.3 means a scilab code whose theory is explained in Section 2.3 of the book. 2 Contents List of Scilab Codes4 1 Mathematical Modelling and Engineering Problem Solving6 2 Programming and Software8 3 Approximations and Round off Errors 10 4 Truncation Errors and the Taylor Series 15 5 Bracketing Methods 25 6 Open Methods 34 7 Roots of Polynomials 44 9 Gauss Elimination 51 10 LU Decomposition and matrix inverse 61 11 Special Matrices and gauss seidel 65 13 One dimensional unconstrained optimization 69 14 Multidimensional Unconstrainted Optimization 72 3 15 Constrained
    [Show full text]
  • The MPFR Team [email protected] This Manual Documents How to Install and Use the Multiple Precision Floating-Point Reliable Library, Version 2.2.0
    MPFR The Multiple Precision Floating-Point Reliable Library Edition 2.2.0 September 2005 The MPFR team [email protected] This manual documents how to install and use the Multiple Precision Floating-Point Reliable Library, version 2.2.0. Copyright 1991, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005 Free Software Foundation, Inc. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, with the Front-Cover Texts being “A GNU Manual”, and with the Back-Cover Texts being “You have freedom to copy and modify this GNU Manual, like GNU software”. A copy of the license is included in Appendix A [GNU Free Documentation License], page 30. i Table of Contents MPFR Copying Conditions ................................ 1 1 Introduction to MPFR ................................. 2 1.1 How to use this Manual ........................................................ 2 2 Installing MPFR ....................................... 3 2.1 How to install ................................................................. 3 2.2 Other make targets ............................................................ 3 2.3 Known Build Problems ........................................................ 3 2.4 Getting the Latest Version of MPFR ............................................ 4 3 Reporting Bugs ........................................ 5 4 MPFR Basics .........................................
    [Show full text]
  • Extended Precision Floating Point Arithmetic
    Extended Precision Floating Point Numbers for Ill-Conditioned Problems Daniel Davis, Advisor: Dr. Scott Sarra Department of Mathematics, Marshall University Floating Point Number Systems Precision Extended Precision Patriot Missile Failure Floating point representation is based on scientific An increasing number of problems exist for which Patriot missile defense modules have been used by • The precision, p, of a floating point number notation, where a nonzero real decimal number, x, IEEE double is insufficient. These include modeling the U.S. Army since the mid-1960s. On February 21, system is the number of bits in the significand. is expressed as x = ±S × 10E, where 1 ≤ S < 10. of dynamical systems such as our solar system or the 1991, a Patriot protecting an Army barracks in Dha- This means that any normalized floating point The values of S and E are known as the significand • climate, and numerical cryptography. Several arbi- ran, Afghanistan failed to intercept a SCUD missile, number with precision p can be written as: and exponent, respectively. When discussing float- trary precision libraries have been developed, but leading to the death of 28 Americans. This failure E ing points, we are interested in the computer repre- x = ±(1.b1b2...bp−2bp−1)2 × 2 are too slow to be practical for many complex ap- was caused by floating point rounding error. The sentation of numbers, so we must consider base 2, or • The smallest x such that x > 1 is then: plications. David Bailey’s QD library may be used system measured time in tenths of seconds, using binary, rather than base 10.
    [Show full text]
  • Computer Architectures
    Computer Architectures Motorola 68000, 683xx a ColdFire – CISC CPU Principles Demonstrated Czech Technical University in Prague, Faculty of Electrical Engineering AE0B36APO Computer Architectures Ver.1.10 1 Original Desktop/Workstation 680X0 Feature 68000 'EC000 68010 68020 68030 68040 68060 Data bus 16 8/16 16 8/16/32 8/16/32 32 32 Addr bus 23 23 23 32 32 32 32 Misaligned Addr - - - Yes Yes Yes Yes Virtual memory - - Yes Yes Yes Yes Yes Instruct Cache - - 3 256 256 4096 8192 Data Cache - - - - 256 4096 8192 Memory manager 68451 or 68851 68851 Yes Yes Yes ATC entries - - - - 22 64/64 64/64 FPU interface - - - 68881 or 68882 Internal FPU built-in FPU - - - - - Yes Yes Burst Memory - - - - Yes Yes Yes Bus Cycle type asynchronous both synchronous Data Bus Sizing - - - Yes Yes use 68150 Power (watts) 1.2 0.13-0.26 0.13 1.75 2.6 4-6 3.9-4.9 at frequency of 8.0 8-16 8 16-25 16-50 25-40 50-66 MIPS/kDhryst. 1.2/2.1 2.5/4.3 6.5/11 14/23 35/60 100/300 Transistors 68k 84k 190k 273k 1,170k 2,500k Introduction 1979 1982 1984 1987 1991 1994 AE0B36APO Computer Architectures 2 M68xxx/CPU32/ColdFire – Basic Registers Set 31 16 15 8 7 0 User programming D0 D1 model registers D2 D3 DATA REGISTERS D4 D5 D6 D7 16 15 0 A0 A1 A2 A3 ADDRESS REGISTERS A4 A5 A6 16 15 0 A7 (USP) USER STACK POINTER 0 PC PROGRAM COUNTER 15 8 7 0 0 CCR CONDITION CODE REGISTER 31 16 15 0 A7# (SSP) SUPERVISOR STACK Supervisor/system POINTER 15 8 7 0 programing model (CCR) SR STATUS REGISTER 31 0 basic registers VBR VECTOR BASE REGISTER 31 3 2 0 SFC ALTERNATE FUNCTION DFC CODE REGISTERS AE0B36APO Computer Architectures 3 Status Register – Conditional Code Part USER BYTE SYSTEM BYTE (CONDITION CODE REGISTER) 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 T1 T0 S 0 0 I2 I1 I0 0 0 0 X N Z V C TRACE INTERRUPT EXTEND ENABLE PRIORITY MASK NEGATIVE SUPERVISOR/USER ZERO STATE OVERFLOW CARRY ● N – negative ..
    [Show full text]
  • RTEMS CPU Supplement Documentation Release 4.11.3 ©Copyright 2016, RTEMS Project (Built 15Th February 2018)
    RTEMS CPU Supplement Documentation Release 4.11.3 ©Copyright 2016, RTEMS Project (built 15th February 2018) CONTENTS I RTEMS CPU Architecture Supplement1 1 Preface 5 2 Port Specific Information7 2.1 CPU Model Dependent Features...........................8 2.1.1 CPU Model Name...............................8 2.1.2 Floating Point Unit..............................8 2.2 Multilibs........................................9 2.3 Calling Conventions.................................. 10 2.3.1 Calling Mechanism.............................. 10 2.3.2 Register Usage................................. 10 2.3.3 Parameter Passing............................... 10 2.3.4 User-Provided Routines............................ 10 2.4 Memory Model..................................... 11 2.4.1 Flat Memory Model.............................. 11 2.5 Interrupt Processing.................................. 12 2.5.1 Vectoring of an Interrupt Handler...................... 12 2.5.2 Interrupt Levels................................ 12 2.5.3 Disabling of Interrupts by RTEMS...................... 12 2.6 Default Fatal Error Processing............................. 14 2.7 Symmetric Multiprocessing.............................. 15 2.8 Thread-Local Storage................................. 16 2.9 CPU counter...................................... 17 2.10 Interrupt Profiling................................... 18 2.11 Board Support Packages................................ 19 2.11.1 System Reset................................. 19 3 ARM Specific Information 21 3.1 CPU Model Dependent Features..........................
    [Show full text]
  • IEEE Std 754-1985 Revision of Reaffirmed1990 IEEE Standard for Binary Floating-Point Arithmetic
    Recognized as an American National Standard (ANSI) IEEE Std 754-1985 An American National Standard IEEE Standard for Binary Floating-Point Arithmetic Sponsor Standards Committee of the IEEE Computer Society Approved March 21, 1985 Reaffirmed December 6, 1990 IEEE Standards Board Approved July 26, 1985 Reaffirmed May 21, 1991 American National Standards Institute © Copyright 1985 by The Institute of Electrical and Electronics Engineers, Inc 345 East 47th Street, New York, NY 10017, USA No part of this publication may be reproduced in any form, in an electronic retrieval system or otherwise, without the prior written permission of the publisher. IEEE Standards documents are developed within the Technical Committees of the IEEE Societies and the Standards Coordinating Committees of the IEEE Standards Board. Members of the committees serve voluntarily and without compensation. They are not necessarily members of the Institute. The standards developed within IEEE represent a consensus of the broad expertise on the subject within the Institute as well as those activities outside of IEEE which have expressed an interest in participating in the development of the standard. Use of an IEEE Standard is wholly voluntary. The existence of an IEEE Standard does not imply that there are no other ways to produce, test, measure, purchase, market, or provide other goods and services related to the scope of the IEEE Standard. Furthermore, the viewpoint expressed at the time a standard is approved and issued is subject to change brought about through developments in the state of the art and comments received from users of the standard. Every IEEE Standard is subjected to review at least once every five years for revision or reaffirmation.
    [Show full text]
  • Signedness-Agnostic Program Analysis: Precise Integer Bounds for Low-Level Code
    Signedness-Agnostic Program Analysis: Precise Integer Bounds for Low-Level Code Jorge A. Navas, Peter Schachte, Harald Søndergaard, and Peter J. Stuckey Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia Abstract. Many compilers target common back-ends, thereby avoid- ing the need to implement the same analyses for many different source languages. This has led to interest in static analysis of LLVM code. In LLVM (and similar languages) most signedness information associated with variables has been compiled away. Current analyses of LLVM code tend to assume that either all values are signed or all are unsigned (except where the code specifies the signedness). We show how program analysis can simultaneously consider each bit-string to be both signed and un- signed, thus improving precision, and we implement the idea for the spe- cific case of integer bounds analysis. Experimental evaluation shows that this provides higher precision at little extra cost. Our approach turns out to be beneficial even when all signedness information is available, such as when analysing C or Java code. 1 Introduction The “Low Level Virtual Machine” LLVM is rapidly gaining popularity as a target for compilers for a range of programming languages. As a result, the literature on static analysis of LLVM code is growing (for example, see [2, 7, 9, 11, 12]). LLVM IR (Intermediate Representation) carefully specifies the bit- width of all integer values, but in most cases does not specify whether values are signed or unsigned. This is because, for most operations, two’s complement arithmetic (treating the inputs as signed numbers) produces the same bit-vectors as unsigned arithmetic.
    [Show full text]
  • Introduction & Floating Point NMMC §1.4.1, Atkinson §1.2 1 Floating
    Introduction & Floating Point NMMC x1.4.1, Atkinson x1.2 1 Floating-point representation, IEEE 754. In binary, 11:01 = 1 × 21 + 1 × 20 + 0 × 2−1 + 1 × 2−2: Double-precision floating-point format has 64 bits: ±; a1; : : : ; a11; b1; : : : ; b52 The associated number is a1···a11−1023 # = ±1:b1 ··· b52 × 2 : If all the a are zero, then the number is `subnormal' and the representation is −1022 # = ±0:b1 ··· b52 × 2 If the a are all 1 and the b are all 0 then it's ±∞ (± 1/0). If the a are all 1 and the b are not all 0 then it's NaN (±0=0). Machine-epsilon is the difference between 1 and the smallest representable number larger than 1, i.e. = 2−52 ≈ 2 × 10−16. The smallest and largest positive numbers are ≈ 10±308. Numbers larger than the max are set to 1. 2 Rounding & Arithmetic. Round-toward-nearest rounds a number towards the nearest floating-point num- ber. For nonzero numbers x within the range of max/min, the relative errors due to rounding are jround(x) − xj ≤ 2−53 ≈ 10−16 = u jxj which is half of machine epsilon. Floating-point arithmetic: add, subtract, multiply, divide. IEEE 754: If you start with 2 exactly- represented floating-point numbers x and y and compute any arithmetic operation `op', the magnitude of the relative error in the result is less than half machine precision. I.e. there is some δ with jδj ≤u such that fl(x op y) = (x op y)(1 + δ): The exception is underflow/overflow, where the result is either larger or smaller than can be represented.
    [Show full text]
  • Metaclasses: Generative C++
    Metaclasses: Generative C++ Document Number: P0707 R3 Date: 2018-02-11 Reply-to: Herb Sutter ([email protected]) Audience: SG7, EWG Contents 1 Overview .............................................................................................................................................................2 2 Language: Metaclasses .......................................................................................................................................7 3 Library: Example metaclasses .......................................................................................................................... 18 4 Applying metaclasses: Qt moc and C++/WinRT .............................................................................................. 35 5 Alternatives for sourcedefinition transform syntax .................................................................................... 41 6 Alternatives for applying the transform .......................................................................................................... 43 7 FAQs ................................................................................................................................................................. 46 8 Revision history ............................................................................................................................................... 51 Major changes in R3: Switched to function-style declaration syntax per SG7 direction in Albuquerque (old: $class M new: constexpr void M(meta::type target,
    [Show full text]
  • 1.2 Round-Off Errors and Computer Arithmetic
    1.2 Round-off Errors and Computer Arithmetic 1 • In a computer model, a memory storage unit – word is used to store a number. • A word has only a finite number of bits. • These facts imply: 1. Only a small set of real numbers (rational numbers) can be accurately represented on computers. 2. (Rounding) errors are inevitable when computer memory is used to represent real, infinite precision numbers. 3. Small rounding errors can be amplified with careless treatment. So, do not be surprised that (9.4)10= (1001.0110)2 can not be represented exactly on computers. • Round-off error: error that is produced when a computer is used to perform real number calculations. 2 Binary numbers and decimal numbers • Binary number system: A method of representing numbers that has 2 as its base and uses only the digits 0 and 1. Each successive digit represents a power of 2. (… . … ) where 0 1, for each = 2,1,0, 1, 2 … 3210 −1−2−3 2 • Binary≤ to decimal:≤ ⋯ − − (… . … ) (… 2 + 2 + 2 + 2 = 3210 −1−2−3 2 + 2 3 + 22 + 1 2 …0) 3 2 1 0 −1 −2 −3 −1 −2 −3 10 3 Binary machine numbers • IEEE (Institute for Electrical and Electronic Engineers) – Standards for binary and decimal floating point numbers • For example, “double” type in the “C” programming language uses a 64-bit (binary digit) representation – 1 sign bit (s), – 11 exponent bits – characteristic (c), – 52 binary fraction bits – mantissa (f) 1. 0 2 1 = 2047 11 4 ≤ ≤ − This 64-bit binary number gives a decimal floating-point number (Normalized IEEE floating point number): 1 2 (1 + ) where 1023 is called exponent − 1023bias.
    [Show full text]