A Study on Linear Algebra Operations Using Extended Precision Floating-Point Arithmetic on GPUs


A study on linear algebra operations using extended precision floating-point arithmetic on GPUs

Author: Daichi Mukunoki (椋木 大地)
Title in Japanese: GPUにおける拡張精度浮動小数点演算を用いた線形計算の研究
Degree-granting institution: University of Tsukuba (筑波大学)
Year of degree conferral: 2013
Report number: 12102甲第6722号
URL: http://hdl.handle.net/2241/00122774

Graduate School of Systems and Information Engineering, University of Tsukuba, November 2013

Contents

1 Introduction  1
  1.1 Precision of Floating-Point Arithmetic  2
    1.1.1 IEEE Standard for Floating-Point Arithmetic  2
    1.1.2 Extended Precision Floating-Point Arithmetic  3
  1.2 GPU Computing  4
    1.2.1 Performance Characteristics  5
    1.2.2 CUDA  5
  1.3 Overview of the Thesis  8
    1.3.1 Scope of the Study  8
    1.3.2 Contributions  9
    1.3.3 Organization of the Thesis  9
2 Triple- and Quadruple-Precision BLAS Subroutines on GPUs  12
  2.1 Introduction  12
  2.2 Quadruple-Precision Floating-Point Operations  13
    2.2.1 DD-type Quadruple-Precision Floating-Point Format  14
    2.2.2 DD-type Quadruple-Precision Floating-Point Arithmetic  15
  2.3 Triple-Precision Floating-Point Operations  18
    2.3.1 D+S-type Triple-Precision Floating-Point Format  19
    2.3.2 D+S-type Triple-Precision Floating-Point Arithmetic  19
    2.3.3 Computation of D+S-type Values using DD Arithmetic  21
    2.3.4 D+I-type Triple-Precision Floating-Point Format  23
  2.4 Implementation of Triple- and Quadruple-Precision BLAS Subroutines on GPUs  25
    2.4.1 DD Arithmetic Functions  26
    2.4.2 Implementation of BLAS Kernels  27
    2.4.3 Data Structures of Triple- and Quadruple-Precision Value Arrays in Global Memory  28
  2.5 Performance Prediction  30
    2.5.1 Theoretical Peak Performance on DD Arithmetic  30
    2.5.2 Performance Prediction using the Bytes/Flop and Bytes/DDFlop Ratios  32
  2.6 Performance Evaluation  33
    2.6.1 Evaluation Methods  34
    2.6.2 Performance Comparison of Double-, Triple- and Quadruple-Precision BLAS Subroutines  34
    2.6.3 AoS Layout vs. SoA Layout  39
    2.6.4 D+S Arithmetic vs. DD Arithmetic  39
    2.6.5 Accuracy Evaluation  40
  2.7 Related Work  41
  2.8 Conclusion  42
3 Optimization of Sparse Matrix-Vector Multiplication on NVIDIA Kepler Architecture GPUs  44
  3.1 Introduction  44
  3.2 Related Work  46
  3.3 Kepler Architecture GPUs  47
  3.4 Implementation  48
    3.4.1 48 KB Read-Only Data Cache  48
    3.4.2 Avoiding the Outermost Loop  49
    3.4.3 Shuffle Instruction  52
  3.5 Performance Evaluation  52
    3.5.1 Evaluation Methods  52
    3.5.2 Results  54
  3.6 Conclusion  59
4 Krylov Subspace Methods using Quadruple-Precision Arithmetic on GPUs  61
  4.1 Introduction  61
  4.2 Related Work  64
  4.3 Implementation  65
    4.3.1 Overview of Quadruple-Precision Versions  65
    4.3.2 Implementation of CG and BiCGStab Methods on GPUs  66
  4.4 Experimental Results  67
    4.4.1 Unpreconditioned Methods  67
    4.4.2 Cases with Preconditioning  73
  4.5 Conclusion  77
5 Conclusion  79
  5.1 Summary  79
  5.2 Future Work  82
Bibliography  84
A List of Publications  92

Listings

2.1 Conversion from DD-type to D+I-type  25
2.2 Conversion from D+I-type to DD-type  26
2.3 Implementation of QuadMul  27
2.4 Triple-precision AXPY using D+I-type format  28
3.1 Host code of SpMV  49
3.2 Kernel code of SpMV for the Fermi architecture  50
3.3 Kernel code of SpMV for the Kepler architecture  51

List of Figures

1.1 GPU architecture (NVIDIA Fermi architecture)  6
1.2 CUDA device and thread hierarchy  7
2.1 Quadruple-precision floating-point format on DD arithmetic  14
2.2 Concept of DD arithmetic  15
2.3 D+S-type triple-precision floating-point format  19
2.4 Triple-precision operations using DD arithmetic on GPUs  23
2.5 D+I-type triple-precision floating-point format  24
2.6 GEMV kernel  29
2.7 GEMM kernel  30
2.8 Array-of-Structures (AoS) and Structure-of-Arrays (SoA) layouts for D+S-type values  31
2.9 Performance of triple- and quadruple-precision AXPY (DDFlops: DD-type floating-point operations per second)  36
2.10 Relative execution time of AXPY (as a multiple of the execution time of the double-precision subroutine)  36
2.11 Performance of triple- and quadruple-precision GEMV (DDFlops: DD-type floating-point operations per second)  37
2.12 Relative execution time of GEMV (as a multiple of the execution time of the double-precision subroutine)  37
2.13 Performance of triple- and quadruple-precision GEMM (DDFlops: DD-type floating-point operations per second)  38
2.14 Relative execution time of GEMM (as a multiple of the execution time of the double-precision subroutine)  38
2.15 Performance comparison of AoS and SoA layouts on D+S-type triple-precision AXPY  39
2.16 Performance comparison of D+S and DD arithmetic on GEMM  40
3.1 CRS format  45
3.2 Thread mapping for NT = 1, 2 and 4 on the CRS-vector method  47
3.3 Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. A (only using the 48 KB read-only data cache) over Ver. Fermi  54
3.4 Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. B (only avoiding the outermost loop) over Ver. Fermi  55
3.5 Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. C (only using the shuffle instruction) over Ver. Fermi  56
3.6 Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. Kepler over Ver. Fermi  57
3.7 Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. Kepler over cuSPARSE's subroutine  58
4.1 Preconditioned CG method  62
4.2 Preconditioned BiCGStab method  63
4.3 Accelerating iterative methods using high precision arithmetic  64

List of Tables

1.1 Binary floating-point basic formats defined in IEEE 754-2008  3
2.1 Number of double-precision floating-point instructions and Flops for quadruple-precision multiply-add  18
2.2 Floating-point instruction and cycle counts for D+S and DD arithmetic on Tesla M2050  22
2.3 Bytes/Flop and Bytes/DDFlop for BLAS subroutines  34
2.4 Evaluation environment  34
2.5 Relative error to the octuple-precision result (input: uniform random numbers in the range 0 to 1 in double precision)  41
3.1 Evaluation environment  53
3.2 Properties of the matrices shown in Figure 3.3  54
3.3 Properties of the matrices shown in Figure 3.4  55
3.4 Properties of the matrices shown in Figure 3.5  56
3.5 Properties of the matrices shown in Figure 3.6  57
3.6 Properties of the matrices shown in Figure 3.7  58
4.1 Evaluation environment  67
4.2 Number of problems solved by the quadruple-precision (QP) versions with stopping criterion ||r||_2 / ||r_0||_2 ≤ ε while satisfying the true relative residual ||b − Ax||_2 / ||b||_2 ≤ ε  68
4.3 Cases solved by both the quadruple-precision (QP) and double-precision (DP) versions with stopping criterion ||r||_2 / ||r_0||_2 ≤ 10^-8 and satisfying the true relative residual (TRR) ||b − Ax||_2 / ||b||_2 ≤ 10^-8, where the QP versions are faster  69
4.4 Cases solved by the QP versions with stopping criterion ||r||_2 / ||r_0||_2 ≤ 10^-8 and satisfying the TRR ||b − Ax||_2 / ||b||_2 ≤ 10^-8, but where the DP versions do not converge to the stopping criterion within the maximum of 30,000 iterations  70
4.5 Cases solved by the QP versions and converged by the DP versions with stopping criterion ||r||_2 / ||r_0||_2 ≤ 10^-8, but where the DP versions do not satisfy the TRR ||b − Ax||_2 / ||b||_2 ≤ 10^-8  71
4.6 The 2 largest and 2 smallest relative execution times per iteration (QP/DP ratio)  72
4.7 Number of problems solved by the preconditioned QP versions with stopping criterion ||r||_2 / ||r_0||_2 ≤ ε while satisfying the TRR ||b − Ax||_2 / ||b||_2 ≤ ε  74
4.8 Preconditioned cases solved by both the QP and DP versions with stopping criterion ||r||_2 / ||r_0||_2 ≤ 10^-8 and satisfying the TRR ||b − Ax||_2 / ||b||_2 ≤ 10^-8, where the QP versions are faster  75
4.9 Preconditioned cases solved by the QP versions and converged by the DP versions with stopping criterion ||r||_2 / ||r_0||_2 ≤ 10^-8, but where the DP versions do not satisfy the TRR ||b − Ax||_2 / ||b||_2 ≤ 10^-8  76

Abstract

Many computational tasks in scientific and engineering applications reduce to linear algebra operations on vectors and matrices, such as solving systems of linear equations. On computers, real numbers are represented as floating-point numbers and computed with floating-point arithmetic. Floating-point operations incur rounding errors, and these errors can become a critical issue for some applications.
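The quadruple precision in the thesis is built on double-double (DD) arithmetic: each value is held as an unevaluated pair of doubles, and error-free transformations recover what ordinary additions round away. As a rough illustration of that idea, here is a minimal host-side C sketch of the common "sloppy" DD addition built from Knuth's TwoSum; it is not the thesis's CUDA code:

```c
#include <stdio.h>

/* Double-double value: the unevaluated sum hi + lo, with |lo| tiny vs. hi. */
typedef struct { double hi, lo; } dd_t;

/* Knuth's TwoSum: returns s = fl(a + b) and e with a + b = s + e exactly. */
static void two_sum(double a, double b, double *s, double *e) {
    *s = a + b;
    double v = *s - a;
    *e = (a - (*s - v)) + (b - v);
}

/* Faster variant, valid when |a| >= |b|. */
static void fast_two_sum(double a, double b, double *s, double *e) {
    *s = a + b;
    *e = b - (*s - a);
}

/* "Sloppy" DD addition: a handful of double-precision flops per DD flop,
 * which is why DD BLAS performance is counted in DDFlops. */
static dd_t dd_add(dd_t a, dd_t b) {
    double s, e;
    two_sum(a.hi, b.hi, &s, &e);
    e += a.lo + b.lo;
    dd_t r;
    fast_two_sum(s, e, &r.hi, &r.lo);
    return r;
}

int main(void) {
    dd_t x = {1.0, 1e-20};          /* 1 + 1e-20: inexpressible in one double */
    dd_t y = dd_add(x, x);
    printf("hi = %.17g  lo = %.17g\n", y.hi, y.lo);  /* lo keeps the 2e-20 */
    return 0;
}
```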
Recommended publications
  • Fujitsu SPARC64™ X+/X Software on Chip Overview for Developers
    White paper: Fujitsu SPARC64™ X+/X Software on Chip Overview for Developers

    Table of Contents
    1 Software on Chip Innovative Technology  3
    2 SPARC64™ X+/X SIMD Vector Processing  4
      2.1 How Oracle Databases take advantage of SPARC64™ X+/X SIMD vector processing  4
      2.2 How to use SPARC64™ X+/X SIMD instructions in user applications  5
      2.3 How to check whether the system and operating system are capable of SIMD execution  6
      2.4 How to check whether SPARC64™ X+/X SIMD instructions have indeed been generated upon compilation of user source code  6
      2.5 Is the SPARC64™ X+/X SIMD implementation compatible with Oracle's SPARC SIMD?  6
    3 SPARC64™ X+/X Decimal Floating-Point Processing  8
      3.1 How Oracle Databases take advantage of SPARC64™ X+/X Decimal Floating-Point  8
      3.2 Decimal Floating-Point processing in user applications  8
    4 SPARC64™ X+/X Extended Floating-Point Registers  9
      4.1 How Oracle Databases take advantage of SPARC64™ X+/X Decimal Floating-Point  9
    5 SPARC64™ X+/X On-Chip Cryptographic Processing Capabilities  10
      5.1 How to use the On-Chip Cryptographic Processing Capabilities  10
      5.2 How to use the On-Chip Cryptographic Processing Capabilities in user applications  10
    6 Conclusions  12

    1 Software on Chip Innovative Technology
    Fujitsu brings together innovations in supercomputing, business computing, and mainframe computing in the Fujitsu M10 enterprise server family to help organizations meet their business challenges.
  • Arithmetic Algorithms for Extended Precision Using Floating-Point Expansions, by Mioara Joldes, Olivier Marty, Jean-Michel Muller, and Valentina Popescu
    Mioara Joldes, Olivier Marty, Jean-Michel Muller, Valentina Popescu. Arithmetic algorithms for extended precision using floating-point expansions. IEEE Transactions on Computers, 2016, 65 (4), pp. 1197-1210. doi: 10.1109/TC.2015.2441714. hal-01111551v2, submitted on 2 Jun 2015: https://hal.archives-ouvertes.fr/hal-01111551v2

    Abstract. Many numerical problems require a higher computing precision than the one offered by standard floating-point (FP) formats. One common way of extending the precision is to represent numbers in a multiple-component format. By using so-called floating-point expansions, real numbers are represented as the unevaluated sum of standard machine-precision FP numbers. This representation offers the simplicity of using directly available, hardware-implemented and highly optimized FP operations.
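To make the unevaluated-sum idea concrete, here is a small C sketch (an illustration of the representation, not the authors' library) of one error-free "VecSum" distillation pass: each TwoSum is exact, so the pass rearranges an expansion without changing its exact sum, concentrating the value in the last term and the rounding errors in the earlier ones.

```c
#include <stdio.h>

/* Error-free transformation: a + b == s + e exactly (Knuth's TwoSum). */
static void two_sum(double a, double b, double *s, double *e) {
    *s = a + b;
    double v = *s - a;
    *e = (a - (*s - v)) + (b - v);
}

/* One distillation pass over an expansion: afterwards x[n-1] is a
 * floating-point approximation of the total and x[0..n-2] hold the
 * exact rounding errors; the exact sum of the array is unchanged. */
static void vec_sum(double *x, int n) {
    for (int i = 1; i < n; i++)
        two_sum(x[i], x[i - 1], &x[i], &x[i - 1]);
}

int main(void) {
    double x[4] = {1e16, 1.0, -1e16, 1e-8};   /* exact sum: 1.00000001 */
    vec_sum(x, 4);
    printf("approx   = %.17g\n", x[3]);
    printf("repaired = %.17g\n", x[3] + (x[0] + x[1] + x[2]));
    return 0;
}
```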
  • Floating Points
    Jin-Soo Kim ([email protected]), Systems Software & Architecture Lab., Seoul National University. Floating Points, Fall 2018 (4190.308: Computer Architecture).

    How do we represent fractional values with a finite number of bits?
    • 0.1
    • 0.612
    • 3.14159265358979323846264338327950288...

    And wide ranges of numbers?
    • 1 light-year = 9,460,730,472,580.8 km
    • The radius of a hydrogen atom: 0.000000000025 m

    Representation: bits to the right of the "binary point" represent fractional (negative) powers of 2, so the bit string b_i b_{i-1} ... b_1 b_0 . b_{-1} b_{-2} ... b_{-j} represents the rational number Σ_{k=-j}^{i} b_k · 2^k, where b_0 has weight 1, b_{-1} has weight 1/2, b_{-2} has weight 1/4, and so on down to b_{-j} with weight 2^{-j}.

    Examples:
    • 5-3/4 = 101.11₂
    • 2-7/8 = 10.111₂
    • 63/64 = 0.111111₂

    Observations:
    • Divide by 2 by shifting right; multiply by 2 by shifting left.
    • Numbers of the form 0.111111...₂ lie just below 1.0, since 1/2 + 1/4 + 1/8 + ... + 1/2^i + ... → 1.0; use the notation 1.0 − ε.

    Representable numbers: we can only exactly represent numbers of the form x / 2^k; other numbers have repeating bit representations:
    • 1/3 = 0.0101010101[01]...₂
    • 1/5 = 0.001100110011[0011]...₂
    • 1/10 = 0.0001100110011[0011]...₂

    Fixed points: a p.q fixed-point representation uses the rightmost q bits of an integer to represent a fraction. Example: 17.14 fixed-point representation (see the sketch below).
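A minimal C sketch of the 17.14 fixed-point representation (the type and helper names are illustrative, not from the slides): the integer holds the value scaled by 2^14, and multiplication needs a wider intermediate plus a shift back.

```c
#include <stdio.h>
#include <stdint.h>

#define Q 14                       /* 17.14: value = raw / 2^14 */
typedef int32_t fix17_14;

static fix17_14 to_fix(double x)      { return (fix17_14)(x * (1 << Q)); }
static double   to_double(fix17_14 x) { return (double)x / (1 << Q); }

/* The product of two scaled values carries a factor 2^(2Q); shift one off. */
static fix17_14 fix_mul(fix17_14 a, fix17_14 b) {
    return (fix17_14)(((int64_t)a * b) >> Q);
}

int main(void) {
    fix17_14 a = to_fix(5.75);   /* 5 + 3/4: exactly representable */
    fix17_14 b = to_fix(2.5);
    printf("5.75 * 2.5 = %f\n", to_double(fix_mul(a, b)));  /* 14.375 */
    printf("0.1 stored as %f\n", to_double(to_fix(0.1)));   /* truncated */
    return 0;
}
```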
  • Hacking in C 2020: The C Programming Language, by Thom Wiggers
    Hacking in C 2020: The C programming language. Thom Wiggers.

    Table of Contents: Introduction; Undefined behaviour; Abstracting away from bytes in memory; Integer representations.

    The C programming language:
    • Invented by Dennis Ritchie in 1972–1973.
    • Not one of the first programming languages: ALGOL, for example, is older; another predecessor is B.
    • Closely tied to the development of the Unix operating system; Unix and Linux are mostly written in C.
    • Compilers are widely available for many, many, many platforms.
    • Still in development: the latest release of the standard is C18; popular versions are C99 and C11.
    • Many compilers implement extensions, leading to versions such as gnu18 and gnu11; the default version in GCC is gnu11.
  • Chapter 2 DIRECT METHODS—PART II
    Chapter 2: DIRECT METHODS—PART II

    If we use finite precision arithmetic, the results obtained by direct methods are contaminated with roundoff error, which is not always negligible.

    2.1 Finite Precision Computation
    2.2 Residual vs. Error
    2.3 Pivoting
    2.4 Scaling
    2.5 Iterative Improvement

    2.1 Finite Precision Computation. 2.1.1 Floating-point numbers.

    Decimal floating-point numbers. The essence of floating-point numbers is best illustrated by an example, such as a 3-digit floating-point calculator which accepts numbers like the following: 123., 50.4, −0.62, −0.02, 7.00. Any such number can be expressed as

        ±d₁.d₂d₃ × 10^e, where e ∈ {0, 1, 2}.    (2.1)

    Here t := precision = 3 and [L : U] := exponent range = [0 : 2]. The exponent range is rather limited. If the calculator display accommodates scientific notation, e.g., 3.46 3 (meaning 3.46 × 10³) and −1.56 −3 (meaning −1.56 × 10⁻³), then we might use [L : U] = [−9 : 9]. Some numbers have multiple representations in form (2.1), e.g., 2.00 × 10¹ = 0.20 × 10². Hence, there is a normalization:
    • choose the smallest possible exponent,
    • choose the + sign for zero,
    e.g., 0.52 × 10² → 5.20 × 10¹, 0.08 × 10⁻⁸ → 0.80 × 10⁻⁹, −0.00 × 10⁰ → 0.00 × 10⁻⁹. Nonzero numbers of the form ±0.d₂d₃ × 10⁻⁹ are denormalized.

    Binary floating-point numbers. For large-scale scientific computation, base 2 is preferred. This is an important matter because numbers like 0.2 do not have finite representations in base 2:

        0.2 = (0.001100110011···)₂.
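The non-representability of 0.2 in base 2 is easy to observe directly; a short illustrative C check, not part of the chapter:

```c
#include <stdio.h>

int main(void) {
    /* 0.2 = (0.001100110011...)_2 never terminates, so a double stores
     * only the nearest 53-bit approximation. */
    double x = 0.2;
    printf("%.20f\n", x);   /* 0.20000000000000001110... */
    printf("%a\n", x);      /* 0x1.999999999999ap-3: the repeating pattern */
    /* A famous consequence of such approximations: */
    printf("0.1 + 0.2 == 0.3?  %s\n", (0.1 + 0.2 == 0.3) ? "yes" : "no");
    return 0;
}
```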
  • Decimal Floating Point for Future Processors, by Hossam A. H. Fahmy et al.
    Decimal Floating Point for future processors

    Hossam A. H. Fahmy, Electronics and Communications Department, Cairo University, Egypt. Email: [email protected]
    Tarek ElDeeb, Mahmoud Yousef Hassan, Yasmin Farouk, Ramy Raafat Eissa, SilMinds, LLC. Email: [email protected]

    Abstract. Many new designs for Decimal Floating Point (DFP) hardware units have been proposed in the last few years. To date, only the IBM POWER6 and POWER7 processors include internal units for decimal floating point processing. We have designed and tested several DFP units, including an adder, multiplier, divider, square root, and fused multiply-add, compliant with the IEEE 754-2008 standard. This paper presents the results of using our units as part of a vector co-processor and the anticipated gains once the units are moved on chip with the main processor.

    1 WHY DECIMAL HARDWARE?

    Ten is the natural number base or radix for humans, resulting in a decimal number system, while computers work in binary. Banking, billing, and other financial applications use decimal extensively. Simple decimal fractions such as 1/10, which might represent a tax amount or a sales discount, yield an infinitely recurring number if converted to a binary representation. Hence, a binary number system with a finite number of bits cannot accurately represent such fractions. When an approximated representation is used in a series of computations, the final result may deviate from the correct result expected by a human and required by the law [3], [4]. One study [5] shows that in a large billing application such an error may be up to $5 million per year.
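The kind of billing error the paper cites is easy to reproduce in miniature. The following C sketch (an illustration of the motivating problem, not of the paper's hardware) accumulates a $0.10 charge a million times in binary single precision and compares against exact integer cents:

```c
#include <stdio.h>

int main(void) {
    float total = 0.0f;   /* binary approximation of each $0.10 charge */
    long cents = 0;       /* exact decimal bookkeeping in integer cents */
    for (int i = 0; i < 1000000; i++) {
        total += 0.10f;   /* 1/10 has no finite binary representation */
        cents += 10;
    }
    printf("binary float total: %.2f\n", total);         /* drifts well off */
    printf("exact total:        %.2f\n", cents / 100.0); /* 100000.00 */
    return 0;
}
```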
  • Floating Point Formats
    Telemetry Standard RCC Document 106-07, Appendix O, September 2007

    APPENDIX O: FLOATING POINT FORMATS

    1.0 Introduction  O-1
    2.0 IEEE 32 Bit Single Precision Floating Point  O-1
    3.0 IEEE 64 Bit Double Precision Floating Point  O-2
    4.0 MIL-STD-1750A 32 Bit Single Precision Floating Point  O-2
    5.0 MIL-STD-1750A 48 Bit Double Precision Floating Point  O-2
    6.0 DEC 32 Bit Single Precision Floating Point  O-3
    7.0 DEC 64 Bit Double Precision Floating Point  O-3
    8.0 IBM 32 Bit Single Precision Floating Point  O-3
    9.0 IBM 64 Bit Double Precision Floating Point  O-4
    10.0 TI (Texas Instruments) 32 Bit Single Precision Floating Point  O-4
    11.0 TI (Texas Instruments) 40 Bit Extended Precision Floating Point  O-4

    List of Tables: Table O-1. Floating Point Formats  O-1
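For the IEEE 32-bit single precision format listed above, the three fields (1 sign bit, 8 exponent bits with bias 127, 23 fraction bits) can be pulled apart in a few lines of C; an illustrative decoding, not text from the standard:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = -6.25f;                 /* -1.5625 * 2^2 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* well-defined way to view the bits */

    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFF;     /* biased by 127 */
    uint32_t fraction = bits & 0x7FFFFF;         /* implicit leading 1 */

    printf("sign=%u  exponent=%u (unbiased %d)  fraction=0x%06X\n",
           sign, exponent, (int)exponent - 127, fraction);
    return 0;
}
```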
  • Extended Precision Floating Point Arithmetic
    Extended Precision Floating Point Numbers for Ill-Conditioned Problems
    Daniel Davis; Advisor: Dr. Scott Sarra. Department of Mathematics, Marshall University.

    Floating Point Number Systems. Floating point representation is based on scientific notation, where a nonzero real decimal number x is expressed as x = ±S × 10^E, with 1 ≤ S < 10. The values of S and E are known as the significand and exponent, respectively. When discussing floating points, we are interested in the computer representation of numbers, so we must consider base 2, or binary, rather than base 10.

    Precision. The precision p of a floating point number system is the number of bits in the significand. This means that any normalized floating point number with precision p can be written as

        x = ±(1.b₁b₂...b_{p−2}b_{p−1})₂ × 2^E.

    The smallest x such that x > 1 is then 1 + 2^{1−p}.

    Extended Precision. An increasing number of problems exist for which IEEE double is insufficient. These include modeling of dynamical systems, such as our solar system or the climate, and numerical cryptography. Several arbitrary precision libraries have been developed, but they are too slow to be practical for many complex applications. David Bailey's QD library may be used...

    Patriot Missile Failure. Patriot missile defense modules have been used by the U.S. Army since the mid-1960s. On February 25, 1991, a Patriot protecting an Army barracks in Dhahran, Saudi Arabia failed to intercept a SCUD missile, leading to the death of 28 Americans. This failure was caused by floating point rounding error: the system measured time in tenths of seconds, using a truncated binary approximation of 1/10, and over long continuous operation the accumulated error grew large enough for the intercept to fail.
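That smallest-gap quantity can be measured directly. A small C sketch, assuming IEEE double (p = 53) with round-to-nearest and no excess precision, which should report 2^-52 ≈ 2.22e-16:

```c
#include <stdio.h>

int main(void) {
    /* Halve eps until 1 + eps/2 rounds back to 1; what remains is the
     * gap between 1.0 and the next representable number, 2^(1-p). */
    double eps = 1.0;
    while (1.0 + eps / 2.0 > 1.0)
        eps /= 2.0;
    printf("machine epsilon = %g\n", eps);   /* 2.22045e-16 = 2^-52 */
    return 0;
}
```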
  • Design of 32 Bit Floating Point Addition and Subtraction Units Based on IEEE 754 Standard, by Ajay Rathor and Lalit Bandil
    International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, Vol. 2, Issue 6, June 2013.

    Design of 32 Bit Floating Point Addition and Subtraction Units Based on IEEE 754 Standard
    Ajay Rathor, Lalit Bandil. Department of Electronics & Communication Eng., Acropolis Institute of Technology and Research, Indore, India.

    Abstract. Floating point arithmetic implementations provide various arithmetic operations such as addition, subtraction, multiplication, and division; such circuits are useful in scientific computation, where a general-purpose arithmetic unit is required for all operations. This paper describes single precision floating point addition and subtraction: it covers the single precision floating point format representation and presents an implementation technique for the addition and subtraction calculations. Single precision is the most basic format of the ANSI/IEEE 754-1985 binary floating-point arithmetic standard; double precision and quad precision use more bits, so 32 and 64 bit operation of an arithmetic unit can be considered at the same time. Normalization and alignment are needed for the operations, and a floating point number should be normalized before any calculation [3]. Low precision custom formats are useful for reducing the associated circuit costs and increasing speed. We consider the implementation of a 32 bit addition and subtraction unit for floating point arithmetic.

    Keywords: addition, subtraction, single precision, double precision, Verilog HDL.

    2. Floating Point Format Representation

    Floating point numbers have mainly three formats: single precision, double precision, and quad precision. Single precision format: single-precision floating-point format is a computer number...
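As a rough sketch of what such an adder does internally (align the exponents, add the significands, renormalize), here is a deliberately simplified C model for positive, normal single-precision inputs. It is hypothetical and omits the rounding, subnormals, overflow, and subtraction paths that the paper's actual hardware unit must handle:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static float fp_add_sketch(float fa, float fb) {
    uint32_t a, b;
    memcpy(&a, &fa, 4);
    memcpy(&b, &fb, 4);
    /* For positive floats the bit patterns order like the values. */
    if (a < b) { uint32_t t = a; a = b; b = t; }

    int32_t ea = (a >> 23) & 0xFF, eb = (b >> 23) & 0xFF;
    uint64_t ma = (a & 0x7FFFFF) | 0x800000;   /* explicit leading 1 */
    uint64_t mb = (b & 0x7FFFFF) | 0x800000;

    int shift = ea - eb;                       /* alignment step */
    mb = (shift < 64) ? (mb >> shift) : 0;     /* truncating, no rounding */

    uint64_t m = ma + mb;                      /* significand addition */
    int32_t e = ea;
    if (m & 0x1000000) { m >>= 1; e += 1; }    /* carry out: renormalize */

    uint32_t r = ((uint32_t)e << 23) | ((uint32_t)m & 0x7FFFFF);
    float out;
    memcpy(&out, &r, 4);
    return out;
}

int main(void) {
    printf("%g\n", fp_add_sketch(1.5f, 2.25f));   /* 3.75 */
    return 0;
}
```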
  • Floating Point Representation
    Floating point representation

    (Unsigned) fixed-point representation. The numbers are stored with a fixed number of bits for the integer part and a fixed number of bits for the fractional part. Suppose we have 8 bits to store a real number, where 5 bits store the integer part and 3 bits store the fractional part, e.g.

        1 0 1 1 1 . 0 1 1
        2⁴ 2³ 2² 2¹ 2⁰ . 2⁻¹ 2⁻² 2⁻³

    Smallest positive number: 00000.001₂ = 2⁻³ = 0.125. Largest number: 11111.111₂ = 2⁵ − 2⁻³ = 31.875.

    Suppose instead we have 64 bits to store a real number, where 32 bits store the integer part and 32 bits store the fractional part:

        x₃₁ ... x₁x₀ . b₁b₂ ... b₃₂ = Σ_{k=0}^{31} x_k 2^k + Σ_{k=1}^{32} b_k 2^{−k}
                                    = x₃₁ × 2³¹ + x₃₀ × 2³⁰ + ... + x₀ × 2⁰ + b₁ × 2⁻¹ + b₂ × 2⁻² + ... + b₃₂ × 2⁻³²

    Range: the difference between the largest and smallest numbers possible; more bits for the integer part increase the range. Precision: the smallest possible difference between any two numbers; more bits for the fractional part increase the precision. Whether we write b₂b₁b₀.b₋₁b₋₂b₋₃ or b₁b₀.b₋₁b₋₂b₋₃b₋₄, wherever we put the binary point there is a trade-off between the amount of range and precision, and it can be hard to decide how much you need of each!

    Scientific notation. In scientific notation, a number can be expressed in the form x = ±S × 10^E, where S is a coefficient in the range 1 ≤ S < 10 and E is the exponent. For example, 1165.7 = 1.1657 × 10³ and 0.0004728 = 4.728 × 10⁻⁴.

    Floating-point numbers. A floating-point number can represent numbers of different orders of magnitude (very large and very small) with the same number of fixed bits. In general, in the binary system, a floating-point number can be expressed as x = ±m × 2^E, where the significand m is normally a fractional value in the range [1.0, 2.0).
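The ±m × 2^E decomposition can be inspected with the standard C library; a short illustrative example (link with -lm where required). frexp returns m in [0.5, 1), so doubling it and decrementing the exponent renormalizes to the [1.0, 2.0) convention used above:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double x = 1165.7;
    int e;
    double m = frexp(x, &e);            /* x = m * 2^e, with m in [0.5, 1) */
    printf("%g = %.17g * 2^%d\n", x, 2.0 * m, e - 1);  /* m' in [1.0, 2.0) */
    printf("reassembled: %g\n", ldexp(m, e));          /* back to 1165.7 */
    return 0;
}
```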
  • Floating Point Numbers and Arithmetic
    Overview: Floating Point Numbers & Arithmetic (CS 160, Ward)
    • Floating point numbers
    • Motivation: decimal scientific notation, binary scientific notation
    • Floating point representation inside the computer (binary): greater range and precision
    • Decimal to floating point conversion, and vice versa
    • Big idea: type is not associated with data
    • MIPS floating point instructions and registers

    Review of numbers. Computers are made to deal with numbers. What can we represent in N bits?
    • Unsigned integers: 0 to 2^N − 1
    • Signed integers (two's complement): −2^(N−1) to 2^(N−1) − 1

    What about other numbers?
    • Very large numbers (seconds/century): 3,155,760,000₁₀ = 3.15576 × 10⁹
    • Very small numbers (atomic diameter): 0.00000001₁₀ = 1.0 × 10⁻⁸
    • Rationals (repeating pattern): 2/3 = 0.666666666...
    • Irrationals: 2^(1/2) = 1.414213562373...
    • Transcendentals: e (2.718...), π (3.141...)
    All can be represented in scientific notation.

    Scientific notation review. In 6.02 × 10²³, 6.02 is the mantissa (with the decimal point after the leading digit), 23 is the exponent, and 10 is the radix (base). Normalized form has no leading 0s (exactly one nonzero digit to the left of the decimal point). Alternatives for representing 1/1,000,000,000: normalized, 1.0 × 10⁻⁹; not normalized, 0.1 × 10⁻⁸ or 10.0 × 10⁻¹⁰.

    Scientific notation for binary numbers. In 1.0₂ × 2⁻¹, the mantissa is written around the "binary point" and the radix (base) is 2. Computer arithmetic that supports this is called floating point, because it represents numbers where the binary point is not fixed, as it is for integers. Such a variable is declared in C as float.
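C's %a conversion prints exactly this normalized binary scientific notation, with one nonzero digit to the left of the binary point and a power-of-two exponent, which makes normalization easy to inspect; a small illustrative example:

```c
#include <stdio.h>

int main(void) {
    printf("%a\n", 1.0);       /* 0x1p+0 */
    printf("%a\n", 0.15625);   /* 0x1.4p-3, i.e. 1.25 * 2^-3 */
    printf("%a\n", 1.0e-9);    /* the nearest double to 10^-9, normalized */
    return 0;
}
```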
  • Floating Point Numbers Dr
    Number Representation: Floating Point Numbers. Dr. Tassadaq Hussain, Riphah International University.

    Real numbers. Two's complement representation deals with signed integer values only. Without modification, these formats are not useful in scientific or business applications that deal with real number values. Floating-point representation solves this problem.

    Floating-point representation. If we are clever programmers, we can perform floating-point calculations using any integer format. This is called floating-point emulation, because floating-point values aren't stored as such; we just create programs that make it seem as if floating-point values are being used. Most of today's computers are equipped with specialized hardware that performs floating-point arithmetic with no special programming required (but not embedded processors!).

    Floating-point numbers allow an arbitrary number of decimal places to the right of the decimal point; for example, 0.5 × 0.25 = 0.125. They are often expressed in scientific notation; for example, 0.125 = 1.25 × 10⁻¹ and 5,000,000 = 5.0 × 10⁶.

    Computers use a form of scientific notation for floating-point representation. Numbers written in scientific notation have three components: a sign, an exponent, and a significand. The computer representation of a floating-point number accordingly consists of three fixed-size fields, in a standard arrangement. Note: although "significand" and "mantissa" do not technically mean the same thing, many people use these terms interchangeably; we use the term "significand" to refer to the fractional part of a floating point number. The one-bit sign field is the sign of the stored value.
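As a toy illustration of floating-point emulation (hypothetical struct and helper names, not from the slides, and without the rounding and special-case handling a real soft-float library needs), the three fields can be held in integers and multiplied using integer operations only:

```c
#include <stdio.h>
#include <stdint.h>

typedef struct {
    int sign;        /* 0 = positive, 1 = negative */
    int exp;         /* unbiased power of two */
    uint64_t sig;    /* significand scaled by 2^32, kept in [2^32, 2^33) */
} soft_t;

static soft_t soft_mul(soft_t a, soft_t b) {
    soft_t r;
    r.sign = a.sign ^ b.sign;              /* signs combine by XOR */
    r.exp  = a.exp + b.exp;                /* exponents add */
    /* Dropping 16 low bits from each factor keeps the product within 64
     * bits while preserving the 2^32 scaling (some low bits are lost). */
    r.sig = (a.sig >> 16) * (b.sig >> 16);
    if (r.sig >= (1ULL << 33)) {           /* renormalize into [2^32, 2^33) */
        r.sig >>= 1;
        r.exp += 1;
    }
    return r;
}

static double soft_to_double(soft_t x) {   /* decode, for checking only */
    double v = (double)x.sig / 4294967296.0;
    for (int i = 0; i < x.exp; i++) v *= 2.0;
    for (int i = 0; i > x.exp; i--) v /= 2.0;
    return x.sign ? -v : v;
}

int main(void) {
    soft_t a = {0, 1, 3ULL << 31};   /* +1.5  * 2^1 = 3.0  */
    soft_t b = {1, 0, 5ULL << 30};   /* -1.25 * 2^0 = -1.25 */
    printf("%g\n", soft_to_double(soft_mul(a, b)));   /* -3.75 */
    return 0;
}
```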