
15th European Signal Processing Conference (EUSIPCO 2007), Poznan, Poland, September 3-7, 2007, copyright by EURASIP

EXTENDED PRECISION ACCUMULATION OF FLOATING-POINT DATA FOR DIGITAL SIGNAL GENERATION AND PROCESSING

Adam Dąbrowski and Paweł Pawłowski
Chair of Control and System Engineering, Department of Computing and Management
Poznań University of Technology, Piotrowo 3a, 60-965 Poznań, Poland
phone: +(48 61) 665 2831, fax: +(48 61) 665 2840, email: [email protected]
web: www.dsp.put.poznan.pl

ABSTRACT

In this paper we present a novel approach to the realization of floating-point computations, which are typical of various digital signal processing tasks. The proposed idea, referred to as "the two-accumulator concept", overcomes the phenomenon of accuracy reduction of floating-point computations in case of a long series of additions of small numbers with relatively large intermediate results. This phenomenon is a substantial drawback of the classic floating-point arithmetic, particularly in signal processing applications.

1. INTRODUCTION

Digital signal processing is a quickly evolving scientific and technological area of the modern electronic world. However, the fast growth and new facilities bring many problems to designers and programmers. One of them is a natural need of assurance of the maximal accuracy of computations at the lowest computation cost [7] for both the fixed-point and the floating-point number representation formats standardized by the IEEE [4, 9, 10]. In many applications floating-point data processing is particularly important, because it guarantees, among other features, an extremely large dynamic range. Therefore, floating-point computations are widely used in software and hardware, if a high dynamic range and/or precision of data processing is required [8, 12]. Unfortunately, even with this potentially precise approach large and unexpected errors can occur. One of the cases of a significant loss of accuracy of the final result is a long series of additions of small numbers successively summed to relatively large intermediate results.

2. NUMBERS IN THE FLOATING-POINT FORMAT

The well-known and widely used IEEE-754 floating-point standard was established in 1985 and two years later the generalized IEEE-854 standard appeared [5, 6]. Although they are widely accepted by companies producing processors and software, non-standard formats are also used (e.g., in the Texas Instruments TMS320C3x, 4x processor families [14]). The IEEE-754 standard defines the so-called single precision (32-bit) format (Fig. 1) as well as the double precision (64-bit) format. Single precision numbers consist of a sign bit s, an 8-bit biased exponent e and a 23-bit fraction f. The mantissa, which is longer than the fraction by one bit, i.e., by the most significant bit (MSB), is 24 bits long and is typically normalized to the range <1,2). Hence the MSB is equal to 1 and thus it can be a hidden bit, i.e., not physically represented in the register (see Fig. 1).

Figure 1 – IEEE-754 floating-point single precision format

Apart from special cases, the value of the represented number is equal to

ν = (−1)^s · 2^(e−127) · (1.f), if 0 < e < 255 .   (1)

This is the so-called normalized number [5] because of the normalization of the mantissa to the range <1,2). This ensures a unique number representation but it lowers the relative precision of numbers near zero and even makes it impossible to represent zero. Because of the condition 0 < e < 255, the effective (unbiased) exponents range from −126 to 127 and the smallest absolute value of a normalized single precision number is equal to

L_min_32bit = 2^(1−127) · 1 = 2^−126 .   (2)

It should be noticed that this value does not depend on the fraction length. If we allow the so-called denormalized numbers, i.e. those with the smallest possible, fixed exponent (thus constituting in fact a subset of the fixed-point numbers), defined as

ν = (−1)^s · 2^(e−126) · (0.f), for e = 0 ,   (3)

we make a clear representation of zero possible (for f = 0) and also extend the precision of the number representation near zero (f ≠ 0). To emphasize that these are fixed-point numbers, we can write formula (3) in the simplified form

ν = (−1)^s · 2^−126 · (0.f) .   (4)

Taking the denormalized numbers into account, the lowest absolute value of the single precision numbers is equal to

L_min_32bit_nn = 2^−126 · 2^−23 = 2^−149 .   (5)

If we extend the fraction length to 31 bits, i.e. consider a 40-bit extension of the 32-bit single precision format, we obtain


L_min_40bit = 2^−126 · 2^−31 = 2^−157 .   (6)

Typical contemporary floating-point DSPs offer single precision computations with the normalized numbers only, since if we wanted to allow also the denormalized numbers, the arithmetic unit would be much more complicated and power consuming. On the contrary, common PC software (e.g., Matlab) realizes double precision arithmetic and also uses the denormalized numbers.

3. ADDITION OF FLOATING-POINT NUMBERS

3.1 Algorithm for floating-point addition

Floating-point addition is a much more complicated operation than fixed-point addition. It can be split into four steps:
• Comparison of the exponents of the numbers to be added.
• Possible equalization of these exponents. The lower exponent is increased to the larger exponent together with the respective right shift of the fraction of the lower number. It should be noticed that after this shift the hidden bit (the MSB of the fraction) of the lower number appears as a physical bit.
• Addition of the fractions.
• Possible normalization of the result. If the fraction of the result remains in the normalization range, it is unchanged. If the fraction is greater than or equal to 2, it is right shifted by one bit and the exponent is increased by one. If the fraction is lower than 1, it is respectively left shifted and the exponent is correspondingly decreased until the denormalized number range is reached.

3.2 Sources of Errors

The sources of errors in the floating-point addition described above lie in steps 2 and 4. A substantial error can be made at the correction step of the fraction of the lower number after the equalization of the exponents. In the extreme case, if we assume m bits of the fraction and the difference between the exponents is greater than m+1, the lower number will be zeroed. The result will stay equal to the greater number and the error of such an addition will be equal to the lower number. Thus we lose as many bits of the fraction as are needed to represent the difference between the exponents of the added numbers.

Mainly because of the above effect, Analog Devices (in the Sharc DSP family) but also other companies extended the single precision IEEE-754 standard and offer fractions longer by 8 bits, i.e., 32-bit long mantissas (including the hidden bit) [2, 3].

The second error source is the fraction normalization of the result, but this can bring at most a loss of merely one bit. Rarely, overflow and underflow errors can occur if the result exceeds the dynamic range, but these situations should not appear in correctly functioning algorithms.

Figure 2 shows the error of a single addition of a small constant to the variable x, which changes from 0 to 1000. Figure 3 depicts the total error of a long series (2 500 000) of additions (accumulations) of a small constant x. It can be noticed that at each power of 2 (i.e., 128, 256, 512) the slope of the error curve rises. This phenomenon results from the error sources described above.

Figure 2 – Error of the addition of x plus a constant

Figure 3 – Error of a long series of additions

3.3 Two Accumulator Method

To reduce errors produced during a long series of additions (accumulations) [11], the authors proposed to use two accumulators.

The proposed, modified algorithm for the floating-point addition computes the difference between the exponents of the numbers to be added. If this difference is lower than a given threshold Tr, the input number is accumulated to the first accumulator. Otherwise the two following steps are made:
• the value of the first accumulator is added to the number stored in the second accumulator and
• the input number is moved to the first accumulator.
The code for this algorithm is presented in Fig. 4.

The main advantage of the two accumulator method is its simple realization. This method needs no non-standard components. It can be easily implemented even in typical floating-point signal processors.

Variations of the contents of both accumulators during the multiple accumulation of the number Wm = 1.25·10^−5 are presented in Figs. 5 and 6. Accumulator 1 works periodically with small numbers, while accumulator 2 stores large portions of the results. If intermediate results are needed (in most cases they are not) the contents of both accumulators should be added.

1.  Input: a1, a2, in, Tr
2.  Output: a1, a2
3.  mask = 0x7F800000


4.  d = (a1 & mask) - (in & mask)
5.  if(d < Tr)
6.  {
7.      a1 = a1 + in;
8.  }else
9.  {
10.     a2 = a2 + a1;
11.     a1 = in;
12. }

Figure 4 – C code for the two accumulator method

Generally, it is possible to extend the proposed two accumulator method to a larger number of accumulators, e.g. to three, four, five, and so on, but it would be related to much more complicated algorithms. In software realizations it would slow down the system performance and in hardware realizations it would unacceptably increase the hardware complexity. Additionally, to calculate intermediate results or the final result, more than one addition would be required. Furthermore, on the market there are many processors with dual processing units or even dual cores, e.g. the ADSP-21160M SHARC DSP by Analog Devices [3]. Thus implementation of the proposed two accumulator method seems to be the most adequate.

4. DIGITAL GENERATOR AS EXAMPLE OF APPLICATION

To verify the proposed method the authors designed a digital sine waveform generator with slow frequency modulation (FM). The designed system is based on the ADSP-21061 Sharc DSP by Analog Devices [1, 2]. Simulations were made using specially prepared software realizing the DSP arithmetical unit in the Matlab environment.

4.1 Digital Generator of Sine Waveforms with FM

Real-time digital generation of a sine waveform frequency modulated by a triangle signal can be realized as follows

x_n = A · sin(φ_n)   (7)

where:

φ_n = φ_(n−1) + W_g · (M_n + 1),   M_n = M_(n−1) + W_m,

W_g = 2π · T_s · f_g,   W_m = ± 4 · g · f_m / f_s

and A – amplitude, f_g – frequency of the sine signal (without modulation), f_m – frequency of the FM modulation signal, f_s – sampling frequency, g – modulation factor [13].

It can be noticed that accumulation occurs in two stages: during the computation of the phase of the generated sine waveform and during the computation of the factor M_n (i.e., during the accumulation of W_m). If we assume f_s = 48 kS/s, f_m = 0.05 Hz and g = 0.3, we get W_m = 1.25·10^−5. The maximal value of M_n is equal to g, so the difference of the exponents of the accumulator and W_m is equal to 15. Thus the classic addition brings a loss of 15 or 14 bits of the fraction of the lower number. These addition errors lead to an unacceptable mismatch of the parameters of the generated signal. A simple, partial solution of this problem is an extension of the fraction size. Indeed, usage of the 40-bit extended single precision format brings a significant improvement of the final signal quality (see Tab. 1) [1, 2, 13].

Figure 5 – Variations of the contents of accumulator 1 during the multiple accumulation of the number Wm = 1.25·10^−5

Figure 6 – Variations of the contents of accumulator 2 during the multiple accumulation of the number Wm = 1.25·10^−5

4.2 Simulation and Measurement Results

The effectiveness of the proposed two accumulator method has been checked by means of software experiments with the arithmetical unit realized in the Matlab environment. The designed experimental arithmetical unit utilizes two 32-bit or 40-bit accumulators. As an illustrative example, Figures 7 and 8 present the maximal error of the accumulation of the numbers Wm = 1.25·10^−4 and Wm = 1.25·10^−5, respectively, in the modulation process at different thresholds Tr. A significant minimization of the total error can be noticed at the thresholds Tr equal to 5 and 9, respectively.

The accumulation error obtained with the optimal threshold Tr is plotted in Fig. 9. The curve is close to linear, but it contains tiny oscillations visible in the zoomed plot (Fig. 10). At a threshold lower than the optimal one (Fig. 11) the error has a piecewise linear waveform and looks like that of the classic adder with only one accumulator (cf. Fig. 3). A threshold larger than the optimal one (Fig. 12) brings more errors in accumulator 1, visible as the local oscillatory growth of the slope of the error curve. The four mentioned methods are compared in Table 1 and these data are depicted in Fig. 13.


Table 1 – Accumulation of Factor Wm

  Accumulation method          Accumulator length   Maximal computation error
  Single accumulator           32 bit               2.2·10^−4
  Single accumulator           40 bit               4.3·10^−7
  Two accumulators (Tr = 9)    32 bit               7.0·10^−7
  Two accumulators (Tr = 9)    40 bit               8.6·10^−9

It is worth stressing that the two accumulator method always reaches results equal to or better than the one accumulator method with the same length of the accumulator and at any threshold Tr. Additionally, the two accumulator method based on 32-bit accumulators guarantees a similar accuracy as the classic method with one 40-bit accumulator. This can be helpful if we use a commercial 32-bit processor and there is no possibility to extend the data wordlength. Furthermore, at the same wordlength the two accumulator method, in comparison with the classic adder, reaches a better exactness by 2 to 3 decimal digits.

Figure 9 – Maximal error of the accumulation of the number Wm = 1.25·10^−5 at Tr = 9

Figure 10 – Zoomed fragment of the error during the accumulation of the number Wm = 1.25·10^−5 at Tr = 9 depicted in Fig. 9

Figure 7 – Maximal error of the accumulation of the number Wm = 1.25·10^−4 at different Tr

Figure 11 – Error of the accumulation of the number Wm = 1.25·10^−5 at Tr = 3

Figure 8 – Maximal error of the accumulation of the number Wm = 1.25·10^−5 at different Tr

Figure 12 – Error of the accumulation of the number Wm = 1.25·10^−5 at Tr = 12


The two accumulator method was also tested with random input numbers. The set of 100 000 random numbers was prepared in such a way that the exact sum was known and equal to 4.99995·10^4 + 4.99995·10^9. The lowest number was equal to 0.90440, the largest number was equal to 99 999.95370 and the probability distribution was uniform. Every number was defined with an exactness of five decimal digits. An additional error, which results from the less exact number representation with the 32-bit numbers, does not dominate and can be omitted. Indeed, the exact sum of the 100 000 32-bit numbers is lower by 0.446 than the exact result. The accumulation of random numbers with the two accumulator method is compared with the classical floating-point accumulation in Fig. 14. Two additional experiments were performed on the sorted numbers, in ascending and descending orders, respectively. Both compared methods gain better results for the input numbers sorted in the ascending order and somewhat worse for the numbers sorted in the descending order, but the two accumulator method is less sensitive to a change of the order of the input numbers. Note that the sorting of the input numbers is one of the most time consuming steps in the maximum accuracy algorithms [11].

For random numbers the same conclusion is valid as for the accumulation of the constant numbers, namely that the two accumulator method always reaches results which are not worse than those of the one accumulator method at any threshold Tr. However, if the threshold is optimal, the two accumulator method brings about 1000 times more exact results.

Figure 13 – Errors of the accumulation of different numbers

Figure 14 – Errors of the accumulation of 100 000 random numbers

5. SUMMARY

The experiments the authors made clearly show that the proposed method of computation of a series of floating-point additions with two accumulators substantially extends the precision of the data accumulation. The presented considerations point to straightforward possibilities of a simple DSP realization of the proposed modified accumulation algorithm. It does not bring important complications to the digital signal generation software running on the commercially available DSP platforms but brings a significantly greater accuracy of the signal parameters. The authors plan to continue this research and to start a design project in order to produce a hardware arithmetical unit on a chip, which will use the presented two accumulator method on the fly, i.e., without any intervention of the user.

REFERENCES

[1] Analog Devices, "ADSP-2106x SHARC Processor, User's Manual," Analog Devices, Inc., Rev. 2.1, 2004.
[2] Analog Devices, "ADSP-2106x SHARC DSP Microcomputer Family," Rev. B, 2000.
[3] Analog Devices, "ADSP-21160M SHARC DSP Microcomputer," Analog Devices, Inc., Rev. 0, 2001.
[4] C. E. Fang, R. A. Rutenbar, T. Chen, "Fast, accurate static analysis for fixed-point finite-precision effects in DSP designs," ICCAD'03, November 11-13, 2003, San Jose, California, USA, pp. 275–282.
[5] IEEE Standard, "IEEE Standard for Binary Floating-Point Arithmetic," IEEE Std 754-1985.
[6] IEEE Standard, "IEEE Standard for Radix-Independent Floating-Point Arithmetic," IEEE Std 854-1987.
[7] R. K. Kolagotla et al., "High Performance Dual-MAC DSP Architecture," IEEE Signal Processing Magazine, July 2002, pp. 42–53.
[8] W. Kramer, "A priori worst case error bounds for floating-point computations," IEEE Trans. on Computers, Vol. 47, July 1998, pp. 750–756.
[9] D. M. Lewis, "114 MFLOPS logarithmic number system arithmetic unit for DSP applications," IEEE J. Solid-State Circuits, Vol. 30, No. 12, 1995, pp. 1547–1553.
[10] M. N. Mahesh, M. Mehendale, "Improving performance of high precision signal processing algorithms on programmable DSPs," Proc. Int. Symposium on Circuits and Systems, 1999, Vol. 3, pp. 488–491.
[11] M. Olejniczak, "Digital filters realizations with the use of floating-point arithmetic" (in Polish), Ph.D. Thesis, Poznan University of Technology, 1993, unpublished.
[12] V. Paliouras, K. Karagianni, T. Stouraitis, "A Floating-Point Processor for Fast and Accurate Sine/Cosine Evaluation," IEEE Trans. Circuits and Systems—II, Vol. 47, No. 5, May 2000, pp. 441–451.
[13] M. Portalski, P. Pawłowski, A. Dąbrowski, "Synthesis of some class of nonharmonic tones," XXVIII IC-SPETO, 2005, Vol. 2, pp. 439–442.
[14] Texas Instruments, "TMS320C4x User's Guide," Texas Instruments Inc., 1994.

Work supported by 93-1580, BW 93-41/07, DS-93-152/07.
