Table 1: Values Represented by Bit Patterns in IEEE Single Format


Table 1: Values Represented by Bit Patterns in IEEE Single Format

| Single-Format Bit Pattern | Value |
|---|---|
| 0 < e < 255 | $(-1)^s \times 2^{e-127} \times 1.f$ (normal numbers) |
| e = 0; f ≠ 0 (at least one bit in f is nonzero) | $(-1)^s \times 2^{-126} \times 0.f$ (subnormal numbers) |
| e = 0; f = 0 (all bits in f are zero) | $(-1)^s \times 0.0$ (signed zero) |
| s = 0; e = 255; f = 0 (all bits in f are zero) | +INF (positive infinity) |
| s = 1; e = 255; f = 0 (all bits in f are zero) | -INF (negative infinity) |
| s = u; e = 255; f ≠ 0 (at least one bit in f is nonzero) | NaN (Not-a-Number) |

Bit Patterns in Single-Storage Format and their IEEE Values

| Common Name | Bit Pattern (Hex) | Decimal Value |
|---|---|---|
| +0 | 00000000 | 0.0 |
| -0 | 80000000 | -0.0 |
| 1 | 3f800000 | 1.0 |
| 2 | 40000000 | 2.0 |
| maximum normal number | 7f7fffff | 3.40282347e+38 |
| minimum positive normal number | 00800000 | 1.17549435e-38 |
| maximum subnormal number | 007fffff | 1.17549421e-38 |
| minimum positive subnormal number | 00000001 | 1.40129846e-45 |
| +∞ | 7f800000 | Infinity |
| −∞ | ff800000 | -Infinity |
| Not-a-Number | 7fc00000 | NaN |

Table 2: Values Represented by Bit Patterns in IEEE Double Format

| Double-Format Bit Pattern | Value |
|---|---|
| 0 < e < 2047 | $(-1)^s \times 2^{e-1023} \times 1.f$ (normal numbers) |
| e = 0; f ≠ 0 (at least one bit in f is nonzero) | $(-1)^s \times 2^{-1022} \times 0.f$ (subnormal numbers) |
| e = 0; f = 0 (all bits in f are zero) | $(-1)^s \times 0.0$ (signed zero) |
| s = 0; e = 2047; f = 0 (all bits in f are zero) | +INF (positive infinity) |
| s = 1; e = 2047; f = 0 (all bits in f are zero) | -INF (negative infinity) |
| s = u; e = 2047; f ≠ 0 (at least one bit in f is nonzero) | NaN (Not-a-Number) |

Bit Patterns in Double-Storage Format and their IEEE Values

| Common Name | Bit Pattern (Hex) | Decimal Value |
|---|---|---|
| +0 | 00000000 00000000 | 0.0 |
| -0 | 80000000 00000000 | -0.0 |
| 1 | 3ff00000 00000000 | 1.0 |
| 2 | 40000000 00000000 | 2.0 |
| max normal number | 7fefffff ffffffff | 1.7976931348623157e+308 |
| min positive normal number | 00100000 00000000 | 2.2250738585072014e-308 |
| max subnormal number | 000fffff ffffffff | 2.2250738585072009e-308 |
| min positive subnormal number | 00000000 00000001 | 4.9406564584124654e-324 |
| +∞ | 7ff00000 00000000 | Infinity |
| −∞ | fff00000 00000000 | -Infinity |
| Not-a-Number | 7ff80000 00000000 | NaN |

Table 3: Double-Extended Bit Pattern (x86)

| Double-Extended Bit Pattern (x86) | Value |
|---|---|
| j = 0; 0 < e < 32767 | Unsupported |
| j = 1; 0 < e < 32767 | $(-1)^s \times 2^{e-16383} \times 1.f$ (normal numbers) |
| j = 0; e = 0; f ≠ 0 (at least one bit in f is nonzero) | $(-1)^s \times 2^{-16382} \times 0.f$ (subnormal numbers) |
| j = 1; e = 0 | $(-1)^s \times 2^{-16382} \times 1.f$ (pseudo-denormal numbers) |
| j = 0; e = 0; f = 0 (all bits in f are zero) | $(-1)^s \times 0.0$ (signed zero) |
| j = 1; s = 0; e = 32767; f = 0 (all bits in f are zero) | +INF (positive infinity) |
| j = 1; s = 1; e = 32767; f = 0 (all bits in f are zero) | -INF (negative infinity) |
| j = 1; s = u; e = 32767; f = .1uuu...uu | QNaN (quiet NaNs) |
| j = 1; s = u; e = 32767; f = .0uuu...uu ≠ 0 (at least one of the u in f is nonzero) | SNaN (signaling NaNs) |

Figure 1: The floating-point number line

    #include <stdio.h>

    int main() {
        float y, z;

        /* Neither constant is exactly representable in single precision. */
        y = 838861.2;
        z = 1.3;
        printf("y: %18.11f\n", y);
        printf("z: %18.11f\n", z);
        return 0;
    }

The output from this program should be similar to:

    y: 838861.18750000000
    z: 1.29999995232

Range and Precision of Storage Formats

| Format | Significant Digits (Binary) | Smallest Positive | Largest Positive | Significant Digits (Decimal) |
|---|---|---|---|---|
| single | 24 | $1.175\ldots \times 10^{-38}$ | $3.402\ldots \times 10^{+38}$ | 6-9 |
| double | 53 | $2.225\ldots \times 10^{-308}$ | $1.797\ldots \times 10^{+308}$ | 15-17 |
| double extended (x86) | 64 | $3.362\ldots \times 10^{-4932}$ | $1.189\ldots \times 10^{+4932}$ | 18-21 |
| double extended (x86 64) | 113 | $3.362\ldots \times 10^{-4932}$ | $1.189\ldots \times 10^{+4932}$ | 33-36 |

Standards: POSIX, BSD 4.3, ISO 9899

| Function | Description |
|---|---|
| acos | arccosine; returns a value in [0, π] |
| asin | arcsine; returns a value in [−π/2, π/2] |
| atan | arctangent; returns a value in [−π/2, π/2] |
| atan2 | takes y and x separately to break the degeneracy in atan(y/x) |
| ceil | smallest integral value not less than x |
| cos | cosine |
| cosh | hyperbolic cosine |
| exp | exponentiate |
| fabs | absolute value of a floating-point number |
| floor | largest integral value not greater than x |
| fmod | floating-point remainder function |
| frexp | convert a floating-point number to fractional and integral components |
| ldexp | multiply a floating-point number by an integral power of 2 |
| log | natural log |
| log10 | log base ten |
| modf | extract signed integral and fractional values from a floating-point number |
| pow | raise a number to a power |
| sin | sine |
| sinh | hyperbolic sine |
| sqrt | square root of a number |
| tan | tangent |
| tanh | hyperbolic tangent |