Chapter 2 DIRECT METHODS—PART II


If we use finite precision arithmetic, the results obtained by direct methods are contaminated with roundoff error, which is not always negligible.

2.1 Finite Precision Computation
2.2 Residual vs. Error
2.3 Pivoting
2.4 Scaling
2.5 Iterative Improvement

2.1 Finite Precision Computation

2.1.1 Floating-point numbers

decimal floating-point numbers

The essence of floating-point numbers is best illustrated by an example, such as that of a 3-digit floating-point calculator which accepts numbers like the following:

    123.    50.4    −0.62    −0.02    7.00

Any such number can be expressed as

    ±d1.d2d3 × 10^e  where  e ∈ {0, 1, 2}.    (2.1)

Here

    t := precision = 3,    [L : U] := exponent range = [0 : 2].

The exponent range is rather limited. If the calculator display accommodates scientific notation, e.g.,

    3.46  3  (i.e., 3.46 × 10^3),    −1.56 −3  (i.e., −1.56 × 10^−3),

then we might use [L : U] = [−9 : 9].

Some numbers have multiple representations in form (2.1), e.g., 2.00 × 10^1 = 0.20 × 10^2. Hence, there is a normalization:

• choose the smallest possible exponent,
• choose the + sign for zero,

e.g.,

    0.52 × 10^2 → 5.20 × 10^1,    0.08 × 10^−8 → 0.80 × 10^−9,    −0.00 × 10^0 → +0.00 × 10^−9.

Nonzero numbers of the form ±0.d2d3 × 10^−9 are denormalized. But for large-scale scientific computation base 2 is preferred.

binary floating-point numbers

This is an important matter because numbers like 0.2 do not have finite representations in base 2:

    0.2 = (0.001100110011···)_2.

For the IEEE Standard for Binary Floating Point Arithmetic, the set of machine representable numbers (single precision) is

    (±b1.b2···b24)_2 × 2^e,  where bj ∈ {0, 1}, e ∈ [−126 : 127]  (normalization: b1 = 1 if e ≥ −125),

together with +0, −0, and also −∞, +∞, NaN. Normally +0 and −0 are indistinguishable. The important question is not how the numbers are represented but rather WHICH NUMBERS ARE EXACTLY REPRESENTABLE. Answer:

    integer × 2^power,  where |integer| ≤ 16 777 215 and −149 ≤ power ≤ 104.

You do not have to know binary to understand this. For double precision the set of machine representable numbers is

    (±b1.b2···b53)_2 × 2^e,  where e ∈ [−1022 : 1023].

general floating-point numbers

A set of floating-point numbers (=: machine numbers) can be specified by four parameters:

    β = base,  t = precision,  L = lower limit of exponent,  U = upper limit of exponent.

The set of machine numbers is

    ±d1.d2···dt × β^e  where  di ∈ [0 : β − 1],  e ∈ [L : U].

(We call e the exponent.) It is a finite set of numbers; e.g., β = 2, t = 3, L = −2, U = 1 gives the "3-bit numbers":

[Figure: a number line marking the machine numbers −7/2, −3, −2, −1, −1/2, −1/4, 0, 1/4, 1/2, 1, 2, 3, 7/2, with the denormalized numbers clustered near zero.]

2.1.2 Rounding

    rounding         round to nearest; in a tie, round away from zero
    round-to-even    round to nearest; in a tie, round to the number whose last digit is even
    chopping         round toward zero

E.g., for β = 10, t = 3, L = −9, U = 9:

    number           chopping      rounding      round-to-even
    99.9499          99.9          99.9          99.9
    99.95            99.9          100.          100.
    −99.95           −99.9         −100.         −100.
    99.85            99.8          99.9          99.8
    9.995 × 10^9     9.99 × 10^9   overflow      overflow
    9.995 × 10^−10   0.99 × 10^−9  1.00 × 10^−9  1.00 × 10^−9
    99.851           99.8          99.9          99.9

We use fl(x) to denote the result of rounding x. If fl(x) ≠ x, then we have a roundoff error.

definition of error (for floating-point computation)

(The number of exactly correct digits is for most purposes irrelevant.) There is yet another measure of accuracy which is meaningful and useful in a fixed precision context. Suppose α̃ = d1.d2···d(t−1)dt × β^e (normalized) approximates α. Define

    1 ulp = 0.0···01 × β^e = β^(e−t+1),  one unit in the last place.

Then

    error in ulps := (α̃ − α) / β^(e−t+1).

E.g., in a 3-digit environment, π ≈ 3.14 has an error of −0.1592 ulps.
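Python's decimal module implements precisely this kind of base-10, t-digit arithmetic, so the rounding table and the ulp example can be checked mechanically. A minimal sketch, assuming Python 3 and nothing beyond the standard library (decimal's ROUND_HALF_UP is the "rounding" rule above, i.e., ties away from zero):

```python
import math
from decimal import (Context, Decimal, Overflow,
                     ROUND_DOWN, ROUND_HALF_UP, ROUND_HALF_EVEN)

# beta = 10, t = 3, [L:U] = [-9:9], as in the table above.
MODES = [("chopping",      ROUND_DOWN),       # round toward zero
         ("rounding",      ROUND_HALF_UP),    # ties away from zero
         ("round-to-even", ROUND_HALF_EVEN)]  # ties to even last digit

NUMBERS = ["99.9499", "99.95", "-99.95", "99.85",
           "9.995E+9", "9.995E-10", "99.851"]

for name, mode in MODES:
    ctx = Context(prec=3, Emin=-9, Emax=9, rounding=mode)
    row = []
    for s in NUMBERS:
        try:
            row.append(str(ctx.plus(Decimal(s))))  # unary plus applies fl(.) in ctx
        except Overflow:                           # Overflow is trapped by default
            row.append("overflow")
    print(f"{name:>13}: {row}")

# error in ulps: pi ~ 3.14 in the 3-digit environment, 1 ulp = 10**(0-3+1)
print((Decimal("3.14") - Decimal(repr(math.pi))) / Decimal("0.01"))  # -0.1592...
```

Each printed row matches the table, including the overflow of 9.995 × 10^9 under the two round-to-nearest rules and the denormalized result for 9.995 × 10^−10; the last line reproduces the −0.1592 ulps error of π ≈ 3.14.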
Clearly |error| ≤ 0.5 ulp represents perfection, and a routine that delivers sin(x) to within 1 ulp is doing very well.

a single roundoff error

Let x be a representable real, meaning that its rounded value has an exponent that is in range. If x is a machine number, then fl(x) = x and δ = 0 below. Assume then that x is not a machine number, so that, in particular, it is nonzero and can be expressed in normalized form

    x = β^e × f,  1 ≤ |f| < β.

As shown below, x lies between two consecutive machine numbers (if for the moment we ignore out-of-range exponents), which differ by one unit in the t-th place of x:

[Figure: x bracketed by two consecutive machine numbers, which are a distance β^e × (0.00···01)_β = β^(e−t+1) apart.]

The number x is rounded to the nearer of the two bracketing numbers, so the rounding error is at most half the distance between them:

    |error| ≤ (1/2) × β^(e−t+1).

Equality occurs when x is in the middle and gets rounded in either direction. It is the relative error |error/x| that we wish to bound, so we need a lower bound on x, which can be obtained from the normalization assumption:

    |x| = β^e × |f| > β^e × 1 = β^e.

The strict inequality is justified because x is not a machine number. Combine the two inequalities to get

    |relative error| < ((1/2) × β^(e−t+1)) / β^e = (1/2) β^(1−t),  e.g., 2^−24 in single precision.

This can be expressed as

    fl(x) = x(1 + δ),

where the relative roundoff error δ satisfies¹

    |δ| < u := β^(1−t) for chopping,  (1/2) β^(1−t) for rounding and round-to-even.

Often u is called the unit roundoff error.

¹ For normalized numbers. This can be modified to include denormalized numbers.

2.1.3 Floating-point arithmetic

Let ◦ denote one of +, −, *, /. If a, b are machine numbers, then the floating-point operation ◦̂ is defined by

    a ◦̂ b = fl(a ◦ b).

E.g., 9.99 −̂ .00501 = fl(9.98499) = 9.98. Thus in principle we first compute the result to infinite precision; in practice, computers do it some other way, but the end result is the same. (The details of how computers do it are an entirely different topic.) We have

    a ◦̂ b = fl(a ◦ b) = (a ◦ b)(1 + δ),  |δ| ≤ u.

Thus a single operation a ◦̂ b is always performed very accurately. The same holds for input/output, e.g., in single precision 0.1F = 0.1(1 + 2^−26).

cancellation occurs when the result has fewer digits than the operands and hence the result is a machine number:

    CANCELLATION ⟹ NO ROUNDOFF ERROR.

Elementary functions can be evaluated with an error of less than 1 ulp.

2.1.4 Algorithms and numerical instability

algorithms

An algorithm repeatedly decomposes a problem into subproblems, e.g.,

    solve for x1, x2, …, xn  −→  if (n > 1) { eliminate x1; solve for x2, …, xn; } solve for x1;

    polynomial interpolation  −→  set up system of linear equations; solve system of linear equations;

In general, a problem f : R^m → R^n, y = f(x), is decomposed as

    R^m —f1→ R^p —f2→ R^n,  i.e.,  z = f1(x); y = f2(z),  f = f2 ◦ f1.

Thus, an algorithm is the repeated decomposition of problems (mappings) into smaller subproblems until all subproblems can be solved by existing hardware and software. Different decompositions yield different algorithms.

propagated and accumulated roundoff error

Below, FE and FÊ denote, respectively, forward elimination using exact and 3-digit rounded arithmetic; similarly BS and BŜ denote back substitution. Start from the system

    3.00 x1 + 4.13 x2 = 15.4
    1.00 x1 + 1.37 x2 = 5.15

FE (exact) reduces this to

    3.00 x1 + 4.13 x2 = 15.4
        −.0066666… x2 = .0166666…

while FÊ (3-digit) produces

    3.00 x1 + 4.13 x2 = 15.4
            −.0100 x2 = .0200

— the 0.02% difference between the two reduced systems is the FÊ error. Now:

    BS (exact) on the exact system gives the exact solution (8.575, −2.5).
    BS (exact) on the 3-digit system gives (7.8866…, −2); the 9% difference is the propagated FÊ error.
    BŜ (3-digit) on the 3-digit system gives (7.90, −2.00); the 1% difference is the BŜ error.

The 9% difference between the exact solution (8.575, −2.5) and the computed solution (7.90, −2.00) is the accumulated error. For comparison, the unit roundoff error is u = 0.5%.

(Question: why do we consider FÊ; BS rather than FE; BŜ? Answer: because the 3-digit arithmetic used in BŜ is defined only for 3-digit operands.)
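The diagram can be replayed with Python's decimal module standing in for the 3-digit calculator. The sketch below is ours (the notes give no code): the same elimination/substitution routine runs once with exact rational arithmetic and once with 3-digit rounded arithmetic.

```python
from decimal import Context, Decimal, ROUND_HALF_UP
from fractions import Fraction

def fe_bs(num, div, sub, mul):
    """FE then BS on  3.00 x1 + 4.13 x2 = 15.4 ;  1.00 x1 + 1.37 x2 = 5.15,
    with the number type and arithmetic operations supplied by the caller."""
    a11, a12, b1 = num("3.00"), num("4.13"), num("15.4")
    a21, a22, b2 = num("1.00"), num("1.37"), num("5.15")
    m    = div(a21, a11)               # FE: multiplier
    a22p = sub(a22, mul(m, a12))       # FE: eliminate x1 from row 2
    b2p  = sub(b2,  mul(m, b1))
    x2   = div(b2p, a22p)              # BS: last unknown first
    x1   = div(sub(b1, mul(a12, x2)), a11)
    return x1, x2

# FE; BS with exact rational arithmetic: (343/40, -5/2) = (8.575, -2.5)
print(fe_bs(Fraction, lambda a, b: a / b, lambda a, b: a - b, lambda a, b: a * b))

# 3-digit rounded arithmetic, as in the diagram
c3 = Context(prec=3, rounding=ROUND_HALF_UP)
print(fe_bs(Decimal, c3.divide, c3.subtract, c3.multiply))
```

The intermediate quantities agree with the diagram: under c3 the multiplier is .333 and the reduced second row is −.0100 x2 = .0200, and back substitution returns (7.9, −2) — decimal division does not retain trailing zeros — matching (7.90, −2.00) above.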
catastrophic cancellation

Consider (1000 +̂ 1) −̂ 1000 in 3-digit arithmetic:

    (1000 +̂ 1) −̂ 1000 = fl(1001) −̂ 1000 = 1000 −̂ 1000 = 0,

whereas the exact result is 1001 − 1000 = 1. The subtraction was performed exactly. The 0.1% addition error was amplified to a 100% error in the result because of cancellation.

Another example:

    1.07 ×̂ 1.07 −̂ 1.14 = fl(1.1449) −̂ 1.14 = 1.14 −̂ 1.14 = 0,

whereas the exact result is 1.1449 − 1.14 = .0049. The multiplication error is −.0049; the subtraction is exact. The relative error −.0049/1.1449 is amplified to become −.0049/.0049.

    CANCELLATION ⟹ ERROR AMPLIFICATION.

Often an expression has several arrangements which are mathematically equivalent.

Example. Suppose x2 ≈ x1 ≈ x0; take x2 = .723, x1 = .711, x0 = .701.

    formula 1: x2 − 2*x1 + x0 in 3-digit precision:
        .723 − 2*.711 + .701 = .723 − 1.42 + .701 = −.697 + .701 = .004

    formula 2: (x2 − x1) − (x1 − x0) in 3-digit precision:
        (.723 − .711) − (.711 − .701) = .012 − .010 = .002,  exact!

Moral: cancellation is good if the operands have no computational error, e.g., if they are inputs to the algorithm. Example: change sin x − sin y to 2 sin((x − y)/2) cos((x + y)/2). (There is a calculus of finite differences that enables one to perform such transformations.) Therefore, if cancellation cannot be avoided, then it should occur at the beginning of the computation.

numerical instability

For a given problem y = x^2 − 1 there may be several algorithms:

    1. y = (x ×̂ x) −̂ 1,
    2. y = (x −̂ 1) ×̂ (x +̂ 1).

Both of these algorithms introduce a few very tiny roundoff errors, and yet algorithm 1 can sometimes yield a result y with a relative error far greater than the unit roundoff error.
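The instability of algorithm 1 is easy to exhibit in IEEE single precision. A small sketch, assuming NumPy; the test point x = 1 + 2^−12 is our choice, not from the notes:

```python
import numpy as np

u = 2.0 ** -24                                    # unit roundoff, single precision
x = np.float32(1 + 2.0 ** -12)                    # a machine number just above 1

alg1 = x * x - np.float32(1)                      # y = (x *^ x) -^ 1
alg2 = (x - np.float32(1)) * (x + np.float32(1))  # y = (x -^ 1) *^ (x +^ 1)
exact = 2.0 ** -11 + 2.0 ** -24                   # x^2 - 1, exact (fits in a double)

for name, y in (("algorithm 1", alg1), ("algorithm 2", alg2)):
    rel = abs(float(y) - exact) / exact
    print(f"{name}: y = {float(y):.9e}, relative error = {rel / u:.0f} u")
```

Here x ×̂ x rounds 1 + 2^−11 + 2^−24 to 1 + 2^−11, and the final subtraction cancels exactly, exposing that error: algorithm 1 comes out about 2048u off, while algorithm 2, whose cancellation involves only exact inputs, is exact for this x.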