
15th European Signal Processing Conference (EUSIPCO 2007), Poznan, Poland, September 3-7, 2007, copyright by EURASIP

EXTENDED PRECISION ACCUMULATION OF FLOATING-POINT DATA FOR DIGITAL SIGNAL GENERATION AND PROCESSING

Adam Dąbrowski and Paweł Pawłowski
Chair of Control and System Engineering, Department of Computing and Management
Poznań University of Technology, Piotrowo 3a, 60-965 Poznań, Poland
phone: +(48 61) 665 2831, fax: +(48 61) 665 2840, email: [email protected]
web: www.dsp.put.poznan.pl

ABSTRACT

In this paper we present a novel approach to the realization of floating-point computations, which are typical of various digital signal processing tasks. The proposed idea, referred to as "the two-accumulator concept", overcomes the phenomenon of accuracy reduction of floating-point computations in case of a long series of additions of small numbers with relatively large intermediate results. This phenomenon is a substantial drawback of the classic floating-point arithmetic, particularly in signal processing applications.

1. INTRODUCTION

Digital signal processing is a quickly evolving scientific and technological area of the modern electronic world. However, the fast growth and new facilities bring many problems to designers and programmers. One of them is a natural need of assurance of the maximal accuracy of computations at the lowest computation cost [7] for both the fixed-point and the floating-point number representation formats standardized by the IEEE [4, 9, 10]. In many applications floating-point data processing is particularly important, because it guarantees, among other features, an extremely large dynamic range. Therefore, floating-point computations are widely used in software and hardware, if a high dynamic range and/or precision of data processing is required [8, 12]. Unfortunately, even with this potentially precise approach large and unexpected errors can occur. One of the cases of a significant loss of accuracy of the final result is a long series of additions of small numbers successively summed to relatively large intermediate results.

2. NUMBERS IN THE FLOATING-POINT FORMAT

The well-known and widely used IEEE-754 floating-point standard was established in 1985 and two years later the generalized IEEE-854 standard appeared [5, 6]. Although they are widely accepted by companies producing processors and software, non-standard formats are also used (e.g., in the Texas Instruments TMS320C3x, 4x processor families [14]). The IEEE-754 standard defines the so-called single precision (32-bit) format (Fig. 1) as well as the double precision (64-bit) format. Single precision numbers consist of a sign bit s, an 8-bit biased exponent e and a 23-bit fraction f. The mantissa, which is longer than the fraction by one bit, i.e., by the most significant bit (MSB), is 24 bits long and is typically normalized to the range <1,2). Hence the MSB is equal to 1 and thus it can be a hidden bit, i.e., not physically represented in the register (see Fig. 1).

Figure 1 – IEEE-754 floating-point single precision format

Apart from special cases, the value of the represented number is equal to

ν = (−1)^s · 2^(e−127) · (1.f), if 0 < e < 255 .   (1)

This is the so-called normalized number [5] because of the normalization of the mantissa to the range <1,2). This ensures a unique number representation but it lowers the relative precision of numbers near zero and even makes it impossible to represent zero. Because of the condition 0 < e < 255, the effective (unbiased) exponents range from −126 to 127 and the smallest absolute value of a normalized single precision number is equal to

L_min_32bit = 2^(1−127) · 1 = 2^−126 .   (2)

It should be noticed that this value does not depend on the fraction length. If we allow the so-called denormalized numbers, i.e. those with the smallest possible, fixed exponent (thus constituting in fact a subset of the fixed-point numbers), defined as

ν = (−1)^s · 2^(e−126) · (0.f), for e = 0 ,   (3)

we make a clear representation of zero possible (for f = 0) and also extend the precision of the number representation near zero (f ≠ 0). To emphasize that these are fixed-point numbers, we can write formula (3) in the simplified form

ν = (−1)^s · 2^−126 · (0.f) .   (4)

Taking the denormalized numbers into account, the lowest absolute value of the single precision numbers is equal to

L_min_32bit_nn = 2^−126 · 2^−23 = 2^−149 .   (5)

If we extend the fraction length to 31 bits, i.e. consider a 40-bit extension of the 32-bit single precision format, we obtain


L_min_40bit = 2^−126 · 2^−31 = 2^−157 .   (6)

Typical contemporary floating-point DSPs offer single precision computations with the normalized numbers only, since if we wanted to allow also the denormalized numbers, the arithmetic unit would be much more complicated and power consuming. On the contrary, common PC software (e.g., Matlab) realizes double precision arithmetic and also uses the denormalized numbers.

3. ADDITION OF FLOATING-POINT NUMBERS

3.1 Algorithm for floating-point addition

Floating-point addition is a much more complicated operation than fixed-point addition. It can be split into four steps:
• Comparison of the exponents of the numbers to be added.
• Possible equalization of these exponents. The lower exponent is increased to the larger exponent together with the respective right shift of the fraction of the lower number. It should be noticed that after this shift the hidden bit (the MSB of the fraction) of the lower number appears as a physical bit.
• Addition of the fractions.
• Possible normalization of the result. If the fraction of the result remains in the normalization range, it is unchanged. If the fraction is greater than or equal to 2, it is right shifted by one bit and the exponent is increased by one. If the fraction is lower than 1, it is respectively left shifted and the exponent is correspondingly decreased until the denormalized number range is reached.

3.2 Sources of Errors

The sources of errors in the floating-point addition described above lie in steps 2 and 4. A substantial error can be made at the correction step of the fraction of the lower number after the equalization of the exponents. In the extreme case, if we assume m bits of the fraction and the difference between the exponents is greater than m+1, the lower number will be zeroed. The result will stay equal to the greater number and the error of such an addition will be equal to the lower number. Thus we lose as many bits of the fraction as are needed to represent the difference between the exponents of the added numbers.

Mainly because of the above effect, Analog Devices (in the Sharc DSP family) but also other companies extended the single precision IEEE-754 standard and offer fractions longer by 8 bits, i.e., 32-bit long mantissas (including the hidden bit) [2, 3].

The second error source is the fraction normalization of the result, but this can bring at most a loss of merely one bit. Rarely, overflow and underflow errors can occur if the result exceeds the dynamic range, but these situations should not appear in correctly functioning algorithms.

Figure 2 shows the error of a single addition of a small constant to the variable x, which changes from 0 to 1000. Figure 3 depicts the total error of a long series (2 500 000) of additions (accumulations) of a small constant x. It can be noticed that at each power of 2 (i.e., 128, 256, 512) the slope of the error curve rises. This phenomenon results from the error sources described above.

Figure 2 – Error of the addition of x plus a constant

Figure 3 – Error of a long series of additions

3.3 Two Accumulator Method

To reduce errors produced during a long series of additions (accumulations) [11], the authors proposed to use two accumulators.

The proposed, modified algorithm for the floating-point addition computes the difference between the exponents of the numbers to be added. If this difference is lower than a given threshold Tr, the input number is accumulated to the first accumulator. Otherwise the two following steps are made:
• the value of the first accumulator is added to the number stored in the second accumulator and
• the input number is moved to the first accumulator.
The code for this algorithm is presented in Fig. 4.

The main advantage of the two accumulator method is its simple realization. This method needs no non-standard components. It can be easily implemented even in typical floating-point signal processors.

Variations of the contents of both accumulators during the multiple accumulation of the number Wm = 1.25·10^−5 are presented in Figs. 5 and 6. Accumulator 1 works periodically with small numbers, while accumulator 2 stores large portions of the results. If intermediate results are needed (in most cases they are not) the contents of both accumulators should be added.

1.  Input: a1, a2, in, Tr
2.  Output: a1, a2
3.  mask = 0x7F800000


4.  d = (a1 & mask) - (in & mask)
5.  if(d < Tr)
6.  {
7.      a1 = a1 + in;
8.  }else
9.  {
10.     a2 = a2 + a1;
11.     a1 = in;
12. }

Figure 4 – C code for the two accumulator method

Generally, it is possible to extend the proposed two accumulator method to a larger number of accumulators, e.g. to three, four, five, and so on, but it would be related to much more complicated algorithms. In software realizations it would slow down the system performance and in hardware realizations it would unacceptably increase the hardware complexity. Additionally, to calculate intermediate results or the final result, more than one addition would be required. Furthermore, on the market there are many processors with dual processing units or even dual cores, e.g. the ADSP-21160M SHARC DSP by Analog Devices [3]. Thus implementation of the proposed two accumulator method seems to be the most adequate.

4. DIGITAL GENERATOR AS EXAMPLE OF APPLICATION

To verify the proposed method the authors designed a digital sine waveform generator with slow frequency modulation (FM). The designed system is based on the ADSP-21061 Sharc DSP by Analog Devices [1, 2]. Simulations were made using specially prepared software realizing the DSP arithmetical unit in the Matlab environment.

4.1 Digital Generator of Sine Waveforms with FM

Real-time digital generation of a sine waveform frequency modulated by a triangle signal can be realized as follows

x_n = A · sin(φ_n)   (7)

where:

φ_n = φ_(n−1) + W_g · (M_n + 1),   M_n = M_(n−1) + W_m,

W_g = 2π · T_s · f_g,   W_m = ± 4 · g · f_m / f_s

and A – amplitude, f_g – frequency of the sine signal (without modulation), f_m – frequency of the FM modulation signal, f_s – sampling frequency, g – modulation factor [13].

It can be noticed that accumulation occurs in two stages: during the computation of the phase of the generated sine waveform and during the computation of the factor M_n (i.e., during the accumulation of W_m). If we assume f_s = 48 kS/s, f_m = 0.05 Hz and g = 0.3, we get W_m = 1.25·10^−5. The maximal value of M_n is equal to g, so the difference of the exponents of the accumulator and W_m is equal to 15. Thus the classic addition brings a loss of 15 or 14 bits of the fraction of the lower number. These addition errors lead to an unacceptable mismatch of the parameters of the generated signal. A simple, partial solution of this problem is an extension of the fraction size. Indeed, usage of the 40-bit extended single precision format brings a significant improvement of the final signal quality (see Tab. 1) [1, 2, 13].

Figure 5 – Variations of the contents of accumulator 1 during the multiple accumulation of the number Wm = 1.25·10^−5

Figure 6 – Variations of the contents of accumulator 2 during the multiple accumulation of the number Wm = 1.25·10^−5

4.2 Simulation and Measurement Results

The effectiveness of the proposed two accumulator method has been checked by means of software experiments with the arithmetical unit realized in the Matlab environment. The designed experimental arithmetical unit utilizes two 32-bit or 40-bit accumulators. As an illustrative example, Figures 7 and 8 present the maximal error of the accumulation of the numbers Wm = 1.25·10^−4 and Wm = 1.25·10^−5, respectively, in the modulation process at different thresholds Tr. A significant minimization of the total error can be noticed at the thresholds Tr equal to 5 and 9, respectively.

The accumulation error obtained with the optimal threshold Tr is plotted in Fig. 9. The curve is close to linear, but it contains tiny oscillations visible in the zoomed plot (Fig. 10). At a threshold lower than the optimal one (Fig. 11) the error has a piecewise linear waveform and looks like that of the classic adder with only one accumulator (cf. Fig. 3). A threshold larger than the optimal one (Fig. 12) brings more errors in accumulator 1, visible as the local oscillatory growth of the slope of the error curve. The four mentioned methods are compared in Table 1 and these data are depicted in Fig. 13.


Table 1 – Accumulation of Factor Wm

  Accumulation method          Accumulator length   Maximal computation error
  Single accumulator           32 bit               2.2·10^−4
  Single accumulator           40 bit               4.3·10^−7
  Two accumulators (Tr = 9)    32 bit               7.0·10^−7
  Two accumulators (Tr = 9)    40 bit               8.6·10^−9

It is worth stressing that the two accumulator method always reaches results equal to or better than the one accumulator method with the same length of the accumulator and at any threshold Tr. Additionally, the two accumulator method based on 32-bit accumulators guarantees a similar accuracy as the classic method with one 40-bit accumulator. This can be helpful if we use a commercial 32-bit processor and there is no possibility to extend the data wordlength. Furthermore, at the same wordlength the two accumulator method, in comparison with the classic adder, reaches a better exactness by 2 to 3 decimal digits.

Figure 9 – Maximal error of the accumulation of the number Wm = 1.25·10^−5 at Tr = 9

Figure 10 – Zoomed fragment of the error during the accumulation of the number Wm = 1.25·10^−5 at Tr = 9 depicted in Fig. 9

Figure 7 – Maximal error of the accumulation of the number Wm = 1.25·10^−4 at different Tr

Figure 11 – Error of the accumulation of the number Wm = 1.25·10^−5 at Tr = 3

Figure 8 – Maximal error of the accumulation of the number Wm = 1.25·10^−5 at different Tr

Figure 12 – Error of the accumulation of the number Wm = 1.25·10^−5 at Tr = 12


The two accumulator method was also tested with random input numbers. The set of 100 000 random numbers was prepared in such a way that the exact sum was known and equal to 4.99995·10^4 + 4.99995·10^9. The lowest number was equal to 0.90440, the largest number was equal to 99 999.95370 and the probability distribution was uniform. Every number was defined with an exactness of five decimal digits. An additional error, which results from the less exact number representation with the 32-bit numbers, does not dominate and can be omitted. Indeed, the exact sum of the 100 000 32-bit numbers is lower by 0.446 than the exact result. The accumulation of random numbers with the two accumulator method is compared with the classical floating-point accumulation in Fig. 14. Two additional experiments were performed on the sorted numbers, in ascending and descending orders, respectively. Both compared methods gain better results for the input numbers sorted in the ascending order and somewhat worse for the numbers sorted in the descending order, but the two accumulator method is less sensitive to a change of the order of the input numbers. Note that the sorting of the input numbers is one of the most time consuming steps in the maximum accuracy algorithms [11].

For random numbers the same conclusion is valid as for the accumulation of the constant numbers, namely that the two accumulator method always reaches results which are not worse than those of the one accumulator method at any threshold Tr. However, if the threshold is optimal, the two accumulator method brings about 1000 times more exact results.

Figure 13 – Errors of the accumulation of different numbers

Figure 14 – Errors of the accumulation of 100 000 random numbers

5. SUMMARY

The experiments the authors made clearly show that the proposed method of computation of a series of floating-point additions with two accumulators substantially extends the precision of the data accumulation. The presented considerations point to straightforward possibilities of a simple DSP realization of the proposed modified accumulation algorithm. It does not bring important complications to the digital signal generation software running on the commercially available DSP platforms but brings a significantly greater accuracy of the signal parameters. The authors plan to continue this research and to start a design project in order to produce a hardware arithmetical unit on a chip, which will use the presented two accumulator method on the fly, i.e., without any intervention of the user.

REFERENCES

[1] Analog Devices, "ADSP-2106x SHARC Processor, User's Manual," Analog Devices, Inc., Rev. 2.1, 2004.
[2] Analog Devices, "ADSP-2106x SHARC DSP Microcomputer Family," Rev. B, 2000.
[3] Analog Devices, "ADSP-21160M SHARC DSP Microcomputer," Analog Devices, Inc., Rev. 0, 2001.
[4] C. E. Fang, R. A. Rutenbar, T. Chen, "Fast, accurate static analysis for fixed-point finite-precision effects in DSP designs," ICCAD'03, November 11-13, 2003, San Jose, California, USA, pp. 275–282.
[5] IEEE Standard, "IEEE Standard for Binary Floating-Point Arithmetic," IEEE Std 754-1985.
[6] IEEE Standard, "IEEE Standard for Radix-Independent Floating-Point Arithmetic," IEEE Std 854-1987.
[7] R. K. Kolagotla et al., "High Performance Dual-MAC DSP Architecture," IEEE Signal Processing Magazine, July 2002, pp. 42–53.
[8] W. Kramer, "A priori worst case error bounds for floating-point computations," IEEE Trans. on Computers, Vol. 47, July 1998, pp. 750–756.
[9] D. M. Lewis, "114 MFLOPS logarithmic number system arithmetic unit for DSP applications," IEEE J. Solid-State Circuits, Vol. 30, No. 12, 1995, pp. 1547–1553.
[10] M. N. Mahesh, M. Mehendale, "Improving performance of high precision signal processing algorithms on programmable DSPs," Proc. Int. Symposium on Circuits and Systems, 1999, Vol. 3, pp. 488–491.
[11] M. Olejniczak, "Digital filters realizations with the use of floating-point arithmetic" (in Polish), Ph.D. Thesis, Poznan University of Technology, 1993, unpublished.
[12] V. Paliouras, K. Karagianni, T. Stouraitis, "A Floating-Point Processor for Fast and Accurate Sine/Cosine Evaluation," IEEE Trans. Circuits and Systems—II, Vol. 47, No. 5, May 2000, pp. 441–451.
[13] M. Portalski, P. Pawłowski, A. Dąbrowski, "Synthesis of some class of nonharmonic tones," XXVIII IC-SPETO, 2005, Vol. 2, pp. 439–442.
[14] Texas Instruments, "TMS320C4x User's Guide," Texas Instruments Inc., 1994.

Work supported by 93-1580, BW 93-41/07, DS-93-152/07.
