1 Logistics 2 Binary Floating Point
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Scilab Textbook Companion for Numerical Methods for Engineers by S
Scilab Textbook Companion for Numerical Methods For Engineers by S. C. Chapra And R. P. Canale1 Created by Meghana Sundaresan Numerical Methods Chemical Engineering VNIT College Teacher Kailas L. Wasewar Cross-Checked by July 31, 2019 1Funded by a grant from the National Mission on Education through ICT, http://spoken-tutorial.org/NMEICT-Intro. This Textbook Companion and Scilab codes written in it can be downloaded from the "Textbook Companion Project" section at the website http://scilab.in Book Description Title: Numerical Methods For Engineers Author: S. C. Chapra And R. P. Canale Publisher: McGraw Hill, New York Edition: 5 Year: 2006 ISBN: 0071244298 1 Scilab numbering policy used in this document and the relation to the above book. Exa Example (Solved example) Eqn Equation (Particular equation of the above book) AP Appendix to Example(Scilab Code that is an Appednix to a particular Example of the above book) For example, Exa 3.51 means solved example 3.51 of this book. Sec 2.3 means a scilab code whose theory is explained in Section 2.3 of the book. 2 Contents List of Scilab Codes4 1 Mathematical Modelling and Engineering Problem Solving6 2 Programming and Software8 3 Approximations and Round off Errors 10 4 Truncation Errors and the Taylor Series 15 5 Bracketing Methods 25 6 Open Methods 34 7 Roots of Polynomials 44 9 Gauss Elimination 51 10 LU Decomposition and matrix inverse 61 11 Special Matrices and gauss seidel 65 13 One dimensional unconstrained optimization 69 14 Multidimensional Unconstrainted Optimization 72 3 15 Constrained -
Extended Precision Floating Point Arithmetic
Extended Precision Floating Point Numbers for Ill-Conditioned Problems Daniel Davis, Advisor: Dr. Scott Sarra Department of Mathematics, Marshall University Floating Point Number Systems Precision Extended Precision Patriot Missile Failure Floating point representation is based on scientific An increasing number of problems exist for which Patriot missile defense modules have been used by • The precision, p, of a floating point number notation, where a nonzero real decimal number, x, IEEE double is insufficient. These include modeling the U.S. Army since the mid-1960s. On February 21, system is the number of bits in the significand. is expressed as x = ±S × 10E, where 1 ≤ S < 10. of dynamical systems such as our solar system or the 1991, a Patriot protecting an Army barracks in Dha- This means that any normalized floating point The values of S and E are known as the significand • climate, and numerical cryptography. Several arbi- ran, Afghanistan failed to intercept a SCUD missile, number with precision p can be written as: and exponent, respectively. When discussing float- trary precision libraries have been developed, but leading to the death of 28 Americans. This failure E ing points, we are interested in the computer repre- x = ±(1.b1b2...bp−2bp−1)2 × 2 are too slow to be practical for many complex ap- was caused by floating point rounding error. The sentation of numbers, so we must consider base 2, or • The smallest x such that x > 1 is then: plications. David Bailey’s QD library may be used system measured time in tenths of seconds, using binary, rather than base 10. -
Introduction & Floating Point NMMC §1.4.1, Atkinson §1.2 1 Floating
Introduction & Floating Point NMMC x1.4.1, Atkinson x1.2 1 Floating-point representation, IEEE 754. In binary, 11:01 = 1 × 21 + 1 × 20 + 0 × 2−1 + 1 × 2−2: Double-precision floating-point format has 64 bits: ±; a1; : : : ; a11; b1; : : : ; b52 The associated number is a1···a11−1023 # = ±1:b1 ··· b52 × 2 : If all the a are zero, then the number is `subnormal' and the representation is −1022 # = ±0:b1 ··· b52 × 2 If the a are all 1 and the b are all 0 then it's ±∞ (± 1/0). If the a are all 1 and the b are not all 0 then it's NaN (±0=0). Machine-epsilon is the difference between 1 and the smallest representable number larger than 1, i.e. = 2−52 ≈ 2 × 10−16. The smallest and largest positive numbers are ≈ 10±308. Numbers larger than the max are set to 1. 2 Rounding & Arithmetic. Round-toward-nearest rounds a number towards the nearest floating-point num- ber. For nonzero numbers x within the range of max/min, the relative errors due to rounding are jround(x) − xj ≤ 2−53 ≈ 10−16 = u jxj which is half of machine epsilon. Floating-point arithmetic: add, subtract, multiply, divide. IEEE 754: If you start with 2 exactly- represented floating-point numbers x and y and compute any arithmetic operation `op', the magnitude of the relative error in the result is less than half machine precision. I.e. there is some δ with jδj ≤u such that fl(x op y) = (x op y)(1 + δ): The exception is underflow/overflow, where the result is either larger or smaller than can be represented. -
1.2 Round-Off Errors and Computer Arithmetic
1.2 Round-off Errors and Computer Arithmetic 1 • In a computer model, a memory storage unit – word is used to store a number. • A word has only a finite number of bits. • These facts imply: 1. Only a small set of real numbers (rational numbers) can be accurately represented on computers. 2. (Rounding) errors are inevitable when computer memory is used to represent real, infinite precision numbers. 3. Small rounding errors can be amplified with careless treatment. So, do not be surprised that (9.4)10= (1001.0110)2 can not be represented exactly on computers. • Round-off error: error that is produced when a computer is used to perform real number calculations. 2 Binary numbers and decimal numbers • Binary number system: A method of representing numbers that has 2 as its base and uses only the digits 0 and 1. Each successive digit represents a power of 2. (… . … ) where 0 1, for each = 2,1,0, 1, 2 … 3210 −1−2−3 2 • Binary≤ to decimal:≤ ⋯ − − (… . … ) (… 2 + 2 + 2 + 2 = 3210 −1−2−3 2 + 2 3 + 22 + 1 2 …0) 3 2 1 0 −1 −2 −3 −1 −2 −3 10 3 Binary machine numbers • IEEE (Institute for Electrical and Electronic Engineers) – Standards for binary and decimal floating point numbers • For example, “double” type in the “C” programming language uses a 64-bit (binary digit) representation – 1 sign bit (s), – 11 exponent bits – characteristic (c), – 52 binary fraction bits – mantissa (f) 1. 0 2 1 = 2047 11 4 ≤ ≤ − This 64-bit binary number gives a decimal floating-point number (Normalized IEEE floating point number): 1 2 (1 + ) where 1023 is called exponent − 1023bias. -
Floating Point Representation (Unsigned) Fixed-Point Representation
Floating point representation (Unsigned) Fixed-point representation The numbers are stored with a fixed number of bits for the integer part and a fixed number of bits for the fractional part. Suppose we have 8 bits to store a real number, where 5 bits store the integer part and 3 bits store the fractional part: 1 0 1 1 1.0 1 1 $ 2& 2% 2$ 2# 2" 2'# 2'$ 2'% Smallest number: Largest number: (Unsigned) Fixed-point representation Suppose we have 64 bits to store a real number, where 32 bits store the integer part and 32 bits store the fractional part: "# "% + /+ !"# … !%!#!&. (#(%(" … ("% % = * !+ 2 + * (+ 2 +,& +,# "# "& & /# % /"% = !"#× 2 +!"&× 2 + ⋯ + !&× 2 +(#× 2 +(%× 2 + ⋯ + ("%× 2 0 ∞ (Unsigned) Fixed-point representation Range: difference between the largest and smallest numbers possible. More bits for the integer part ⟶ increase range Precision: smallest possible difference between any two numbers More bits for the fractional part ⟶ increase precision "#"$"%. '$'#'( # OR "$"%. '$'#'(') # Wherever we put the binary point, there is a trade-off between the amount of range and precision. It can be hard to decide how much you need of each! Scientific Notation In scientific notation, a number can be expressed in the form ! = ± $ × 10( where $ is a coefficient in the range 1 ≤ $ < 10 and + is the exponent. 1165.7 = 0.0004728 = Floating-point numbers A floating-point number can represent numbers of different order of magnitude (very large and very small) with the same number of fixed bits. In general, in the binary system, a floating number can be expressed as ! = ± $ × 2' $ is the significand, normally a fractional value in the range [1.0,2.0) . -
On Various Ways to Split a Floating-Point Number
On various ways to split a floating-point number Claude-Pierre Jeannerod Jean-Michel Muller Paul Zimmermann Inria, CNRS, ENS Lyon, Universit´ede Lyon, Universit´ede Lorraine France ARITH-25 June 2018 -1- Splitting a floating-point number X = ? round(x) frac(x) 0 2 First bit of x = ufp(x) (for scalings) X All products are s computed exactly 2 2 X with one FP p 2 2 k a b multiplication a + b ! 2 k + k X (Dekker product) 2 2 X Dekker product (1971) -2- Splitting a floating-point number absolute splittings (e.g., bxc), vs relative splittings (e.g., most significant bits, splitting of the significands for multiplication); In each "bin", the sum is no bit manipulations of the computed exactly binary representations (would result in less portable Matlab program in a paper by programs) ! onlyFP Zielke and Drygalla (2003), operations. analysed and improved by Rump, Ogita, and Oishi (2008), reproducible summation, by Demmel & Nguyen. -3- Notation and preliminary definitions IEEE-754 compliant FP arithmetic with radix β, precision p, and extremal exponents emin and emax; F = set of FP numbers. x 2 F can be written M x = · βe ; βp−1 p M, e 2 Z, with jMj < β and emin 6 e 6 emax, and jMj maximum under these constraints; significand of x: M · β−p+1; RN = rounding to nearest with some given tie-breaking rule (assumed to be either \to even" or \to away", as in IEEE 754-2008); -4- Notation and preliminary definitions Definition 1 (classical ulp) The unit in the last place of t 2 R is ( βblogβ jt|c−p+1 if jtj βemin , ulp(t) = > βemin−p+1 otherwise. -
Floating Point Arithmetic
Systems Architecture Lecture 14: Floating Point Arithmetic Jeremy R. Johnson Anatole D. Ruslanov William M. Mongan Some or all figures from Computer Organization and Design: The Hardware/Software Approach, Third Edition, by David Patterson and John Hennessy, are copyrighted material (COPYRIGHT 2004 MORGAN KAUFMANN PUBLISHERS, INC. ALL RIGHTS RESERVED). Lec 14 Systems Architecture 1 Introduction • Objective: To provide hardware support for floating point arithmetic. To understand how to represent floating point numbers in the computer and how to perform arithmetic with them. Also to learn how to use floating point arithmetic in MIPS. • Approximate arithmetic – Finite Range – Limited Precision • Topics – IEEE format for single and double precision floating point numbers – Floating point addition and multiplication – Support for floating point computation in MIPS Lec 14 Systems Architecture 2 Distribution of Floating Point Numbers e = -1 e = 0 e = 1 • 3 bit mantissa 1.00 X 2^(-1) = 1/2 1.00 X 2^0 = 1 1.00 X 2^1 = 2 1.01 X 2^(-1) = 5/8 1.01 X 2^0 = 5/4 1.01 X 2^1 = 5/2 • exponent {-1,0,1} 1.10 X 2^(-1) = 3/4 1.10 X 2^0 = 3/2 1.10 X 2^1= 3 1.11 X 2^(-1) = 7/8 1.11 X 2^0 = 7/4 1.11 X 2^1 = 7/2 0 1 2 3 Lec 14 Systems Architecture 3 Floating Point • An IEEE floating point representation consists of – A Sign Bit (no surprise) – An Exponent (“times 2 to the what?”) – Mantissa (“Significand”), which is assumed to be 1.xxxxx (thus, one bit of the mantissa is implied as 1) – This is called a normalized representation • So a mantissa = 0 really is interpreted to be 1.0, and a mantissa of all 1111 is interpreted to be 1.1111 • Special cases are used to represent denormalized mantissas (true mantissa = 0), NaN, etc., as will be discussed. -
Absolute and Relative Error All Computations on a Computer Are Approximate by Nature, Due to the Limited Precision on the Computer
4 2 0 −1 −2 • 5.75 = 4 + 1 + 0.5 + 0.25 = 1 × 2 + 1 × 2 + 1 × 2 + 1 × 2 = 101.11(2) 4 0 −1 • 17.5 = 16 + 1 + 0.5 = 1 × 2 + 1 × 2 + 1 × 2 = 10001.1(2) 2 0 −1 −2 • 5.75 = 4 + 1 + 0.5 + 0.25 = 1 × 2 + 1 × 2 + 1 × 2 + 1 × 2 = 101.11(2) Note that certain numbers are finite (terminating) decimals, actually are peri- odic in binary, e.g. 0.4(10) = 0.01100110011 ...(2) = 0.00110011(2) Lecture of: Machine numbers (a.k.a. binary floating point numbers) 24 Jan 2013 The numbers stored on the computer are, essentially, “binary numbers” in sci- entific notation x = ±a × 2b. Here, a is called the mantissa and b the exponent. We also follow the convention that 1 ≤ a < 2; the idea is that, for any number x, we can always divide it by an appropriate power of 2, such that the result will be within [1, 2). For example: 2 2 x = 5(10) = 1.25(10) × 2 = 1.01(2) × 2 Thus, a machine number is stored as: b x = ±1.a1a2 ··· ak-1ak × 2 • In single precision we store k = 23 binary digits, and the exponent b ranges between −126 ≤ b ≤ 127. The largest number we can thus represent is (2 − 2−23) × 2127 ≈ 3.4 × 1038. • In double precision we store k = 52 binary digits, and the exponent b ranges between −1022 ≤ b ≤ 1023. The largest number we can thus represent is (2 − 2−52) × 21023 ≈ 1.8 × 10308. -
Floating Point Square Root Under HUB Format
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Repositorio Institucional Universidad de Málaga Floating Point Square Root under HUB Format Julio Villalba-Moreno Javier Hormigo Dept. of Computer Architecture Dept. of Computer Architecture University of Malaga University of Malaga Malaga, SPAIN Malaga, SPAIN [email protected] [email protected] Abstract—Unit-Biased (HUB) is an emerging format based on maintaining the same accuracy, the area cost and delay of FIR shifting the representation line of the binary numbers by half filter implementations has been drastically reduced in [3], and unit in the last place. The HUB format is specially relevant similarly for the QR decomposition in [4]. for computers where rounding to nearest is required because it is performed simply by truncation. From a hardware point In [5] the authors carry out an analysis of using HUB format of view, the circuits implementing this representation save both for floating point adders, multipliers and converters from a area and time since rounding does not involve any carry propa- quantitative point of view. Experimental studies demonstrate gation. Designs to perform the four basic operations have been that HUB format keeps the same accuracy as conventional proposed under HUB format recently. Nevertheless, the square format for the aforementioned units, simultaneously improving root operation has not been confronted yet. In this paper we present an architecture to carry out the square root operation area, speed and power consumption (14% speed-up, 38% less under HUB format for floating point numbers. The results of area and 26% less power for single precision floating point this work keep supporting the fact that the HUB representation adder, 17% speed-up, 22% less area and slightly less power for involves simpler hardware than its conventional counterpart for the floating point multiplier). -
Numerical Computations – with a View Towards R
Numerical computations { with a view towards R Søren Højsgaard Department of Mathematical Sciences Aalborg University, Denmark October 8, 2012 Contents 1 Computer arithmetic is not exact 1 2 Floating point arithmetic 1 2.1 Addition and subtraction . .2 2.2 The Relative Machine Precision . .2 2.3 Floating-Point Precision . .3 2.4 Error in Floating-Point Computations . .3 3 Example: A polynomial 3 4 Example: Numerical derivatives 6 5 Example: The Sample Variance 8 6 Example: Solving a quadratic equation 8 7 Example: Computing the exponential 9 8 Ressources 10 1 Computer arithmetic is not exact The following R statement appears to give the correct answer. R> 0.3 - 0.2 [1] 0.1 1 But all is not as it seems. R> 0.3 - 0.2 == 0.1 [1] FALSE The difference between the values being compared is small, but important. R> (0.3 - 0.2) - 0.1 [1] -2.775558e-17 2 Floating point arithmetic Real numbers \do not exist" in computers. Numbers in computers are represented in floating point form s × be where s is the significand, b is the base and e is the exponent. R has \numeric" (floating points) and \integer" numbers R> class(1) [1] "numeric" R> class(1L) [1] "integer" 2.1 Addition and subtraction Let s be a 7{digit number A simple method to add floating-point numbers is to first represent them with the same exponent. 123456.7 = 1.234567 * 10^5 101.7654 = 1.017654 * 10^2 = 0.001017654 * 10^5 So the true result is (1.234567 + 0.001017654) * 10^5 = 1.235584654 * 10^5 But the approximate result the computer would give is (the last digits (654) are lost) 2 1.235585 * 10^5 (final sum: 123558.5) In extreme cases, the sum of two non-zero numbers may be equal to one of them Quiz: how to sum a sequence of numbers x1; : : : ; xn to obtain large accuracy? 2.2 The Relative Machine Precision The accuracy of a floating-point system is measured by the relative machine precision or machine epsilon. -
AC Library for Emulating Low-Precision Arithmetic Fasi
CPFloat: A C library for emulating low-precision arithmetic Fasi, Massimiliano and Mikaitis, Mantas 2020 MIMS EPrint: 2020.22 Manchester Institute for Mathematical Sciences School of Mathematics The University of Manchester Reports available from: http://eprints.maths.manchester.ac.uk/ And by contacting: The MIMS Secretary School of Mathematics The University of Manchester Manchester, M13 9PL, UK ISSN 1749-9097 CPFloat: A C library for emulating low-precision arithmetic∗ Massimiliano Fasi y Mantas Mikaitis z Abstract Low-precision floating-point arithmetic can be simulated via software by executing each arith- metic operation in hardware and rounding the result to the desired number of significant bits. For IEEE-compliant formats, rounding requires only standard mathematical library functions, but handling subnormals, underflow, and overflow demands special attention, and numerical errors can cause mathematically correct formulae to behave incorrectly in finite arithmetic. Moreover, the ensuing algorithms are not necessarily efficient, as the library functions these techniques build upon are typically designed to handle a broad range of cases and may not be optimized for the specific needs of rounding algorithms. CPFloat is a C library that of- fers efficient routines for rounding arrays of binary32 and binary64 numbers to lower precision. The software exploits the bit level representation of the underlying formats and performs only low-level bit manipulation and integer arithmetic, without relying on costly library calls. In numerical experiments the new techniques bring a considerable speedup (typically one order of magnitude or more) over existing alternatives in C, C++, and MATLAB. To the best of our knowledge, CPFloat is currently the most efficient and complete library for experimenting with custom low-precision floating-point arithmetic available in any language. -
CS 542G: Preliminaries, Floating-Point, Errors
CS 542G: Preliminaries, Floating-Point, Errors Robert Bridson September 8, 2008 1 Preliminaries The course web-page is at: http://www.cs.ubc.ca/∼rbridson/courses/542g. Details on the instructor, lectures, assignments and more will be available there. There is no assigned text book, as there isn’t one available that approaches the material with this breadth and at this level, but for some of the more basic topics you might try M. Heath’s “Scientific Computing: An Introductory Survey”. This course aims to provide a graduate-level overview of scientific computing: it counts for the “breadth” requirement in our degrees, and should be excellent preparation for research and further stud- ies involving scientific computing, whether for another subject (such as computer vision, graphics, statis- tics, or data mining to name a few within the department, or in another department altogether) or in scientific computing itself. While we will see many algorithms for many different problems, some histor- ical and others of critical importance today, the focus will try to remain on the bigger picture: powerful ways of analyzing and solving numerical problems, with an emphasis on common approaches that work in many situations. Alongside a practical introduction to doing scientific computing, I hope to develop your insight and intuition for the field. 2 Floating Point Unfortunately, the natural starting place for virtual any scientific computing course is in some ways the most tedious and frustrating part of the subject: floating point arithmetic. Some familiarity with floating point is necessary for everything that follows, and the main message—that all our computations will make (somewhat predictable) errors along the way—must be taken to heart, but we’ll try to get through it to more fun stuff fairly quickly.