
International Journal of Computing, www.computingonline.net
Print ISSN 1727-6209, On-line ISSN 2312-5381, [email protected]

SIMPLE EFFECTIVE FAST INVERSE SQUARE ROOT ALGORITHM WITH TWO MAGIC CONSTANTS

Oleh Horyachyy, Leonid Moroz, Viktor Otenko

Lviv Polytechnic National University, 12 Bandera str., 79013 Lviv, Ukraine, [email protected], [email protected], [email protected], http://lpnu.ua

Paper history:
Received 09 January 2019
Received in revised form 04 June 2019
Accepted 25 October 2019
Available online 31 December 2019

Keywords:
inverse square root; FISR algorithm; initial approximation; magic constant; IEEE 754 standard; floating-point arithmetic; FMA function; maximum relative error; Newton-Raphson; Householder.

Abstract: The purpose of this paper is to introduce a modification of the Fast Inverse Square Root (FISR) approximation algorithm with reduced relative errors. The original algorithm uses a magic constant trick with the input floating-point number to obtain a clever initial approximation and then utilizes the classical iterative Newton-Raphson formula. It was first used in the computer game Quake III Arena, causing widespread discussion among scientists and programmers, and it can now be frequently found in many scientific applications, although it has some drawbacks. The proposed algorithm has parameters of the modified inverse square root algorithm that minimize the relative error and includes two magic constants in order to avoid one floating-point multiplication. In addition, we use the fused multiply-add function and iterative methods of higher order in the second iteration to improve the accuracy. Such algorithms do not require storage of large tables for the initial approximation and can be effectively used on field-programmable gate arrays (FPGAs) and other platforms without hardware support for this function.

Copyright © Research Institute for Intelligent Computer Systems, 2019. All rights reserved.

1. INTRODUCTION

When solving many important problems in the field of digital signal processing, computer 3D graphics, scientific computing, etc., it is necessary to use floating-point arithmetic [1-7] as the most accurate and common way of representing real numbers in computers. In particular, a common practical task that arises when working with floating-point numbers is the calculation of elementary functions, including the inverse square root:

  y = 1/√x.   (1)

The algorithms implementing the inverse square root calculation are widely described in the scientific literature [1, 6, 8-14]. Most of them are iterative and require the formation of an initial approximation. Moreover, the more precise the initial approximation, the fewer iterations are needed to calculate the function with the required accuracy. As a rule, the initial approximation is formed using a lookup table (LUT) [1-2]. However, a group of iterative algorithms is also known that do not use LUTs [6, 8-15]. Here, the initial approximation is formed using integer arithmetic in the form of a so-called magic constant, and upon transition from integer arithmetic to floating-point arithmetic, the resulting approximation is fed into an iterative process using corresponding Newton-Raphson formulae.

In this article, we consider algorithms for calculating the inverse square root using magic constants, which offer reduced relative errors and apply to floating-point numbers represented in the IEEE 754 standard format. In addition, an important advantage of the proposed algorithms is a reduction (by one) in the number of multiplication operations.

The best-known version of this algorithm, called Fast Inverse Square Root (FISR) [8, 15-18], which was used in the computer game Quake III Arena [7], is given below:

1. float InvSqrt1 (float x){
2.  float half = 0.5f * x;
3.  int i = *(int*)&x;
4.  i = 0x5F3759DF - (i>>1);
5.  x = *(float*)&i;
6.  x = x*(1.5f - half*x*x);
7.  x = x*(1.5f - half*x*x);
8.  return x;
9. }

This InvSqrt1 code, written in the C/C++ language, implements a fast algorithm for inverse square root calculation. In line 3, we represent the bits of the input variable x (float) as the variable i (int). Lines 4 and 5 determine the initial approximation y_0 of the inverse square root. Note that the InvSqrt1 algorithm reuses the variable x for this purpose. It then becomes a part of the iterative process. Here in line 4, R = 0x5f3759df is a "magic constant". In line 5, we represent the bits of the variable i (int) back into the variable x (float). Note that henceforth in the code, the variable x acts as a new approximation to y. Lines 6 and 7 contain two classic successive Newton-Raphson iterations to refine the result. If the maximum relative error of calculations after the second iteration is denoted by δ_2max, then the accuracy of this algorithm is only

  δ_2max = 4.73·10⁻⁶, or −log₂(δ_2max) = 17.69   (2)

correct bits. These results are obtained by practical calculations on the Intel Core i7-7700HQ platform using the method that will be discussed in more detail in Section 3.
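As an aside for readers reusing this listing today: the pointer casts in lines 3 and 5 formally violate the C/C++ strict aliasing rules. A behaviorally identical sketch (our illustration, not one of the paper's listings, and InvSqrt1_safe is our name) replaces them with std::memcpy, which modern compilers reduce to the same machine code:

#include <cstdint>
#include <cstring>

float InvSqrt1_safe(float x) {
    float half = 0.5f * x;
    uint32_t i;
    std::memcpy(&i, &x, sizeof i);   // reinterpret the float bits as an integer
    i = 0x5F3759DF - (i >> 1);       // magic constant trick
    std::memcpy(&x, &i, sizeof x);   // back to float: initial approximation
    x = x * (1.5f - half * x * x);   // first Newton-Raphson iteration
    x = x * (1.5f - half * x * x);   // second Newton-Raphson iteration
    return x;
}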

Compared to this algorithm, the following algorithm with a magic constant from [12] has better accuracy, namely 20.37 correct bits:

1. float InvSqrt2 (float x){
2.  float half = 0.5f * x;
3.  int i = *(int*)&x;
4.  i = 0x5F376908 - (i>>1);
5.  x = *(float*)&i;
6.  x = x*(1.5008789f - half*x*x);
7.  x = x*(1.5000006f - half*x*x);
8.  return x;
9. }

It has a different magic constant and two modified Newton-Raphson iterations. The maximum relative error of the algorithm is approximately

  δ_2max = 7.37·10⁻⁷,   (3)

which is slightly worse than declared by the authors, even when using the fused multiply-add (FMA) function to increase the accuracy of calculations.

Lemaitre et al. [6] suggest using for type float Householder's method of order 4 with fused operations. For the initial approximation they consider the magic constant R = 0x5f375a86 [8]. Thus, their proposed algorithm has the form:

1. float InvSqrt3 (float x){
2.  int i = *(int*)&x;
3.  i = 0x5F375A86 - (i>>1);
4.  float y = *(float*)&i;
5.  float a = x*y*y;
6.  float t = fmaf(0.2734375f, a, -1.40625f);
7.  t = fmaf(a, t, 2.953125f);
8.  t = fmaf(a, t, -3.28125f);
9.  y = y*fmaf(a, t, 2.4609375f);
10. return y;
11. }

It has a maximum relative error of

  δ_1max = 6.58·10⁻⁷,   (4)

or 20.54 (out of 24 possible) bits of accuracy. Consequently, there is still a need to develop algorithms of higher accuracy, also for IEEE 754 numbers of higher precision, such as double and quadruple, without forgetting about the speed of practical computations.

The remainder of this paper is organized as follows. Section 2 gives a theoretical description of the aforementioned algorithms with a magic constant. In Sections 3 and 4, we elucidate the proposed algorithms and give their implementations in C++. Section 5 contains simulation results. Conclusions are given in Section 6.

2. ANALYTICAL DESCRIPTION OF KNOWN ALGORITHMS

In this section, we briefly present the main results of [11-12] to explain how InvSqrt1, InvSqrt2, and InvSqrt3 work. Assume that we have a positive floating-point number

  x = (1 + m_x)·2^(E_x),   (5)

where

  m_x ∈ [0,1)   (6)

and

  E_x = ⌊log₂(x)⌋.   (7)


In the IEEE 754 standard, the floating-point number x is encoded by 32, 64, or 128 bits for the three basic binary floating-point formats. In this paper, we will mostly consider single precision (float type) numbers (as in the above algorithms), which have 32 bits, but all considerations can also be generalized to other data types, as will be shown later in Section 4. The first bit corresponds to a sign (sign field). In our case, this bit is equal to zero. The next eight bits correspond to a biased exponent E_x (exponent field), and the last 23 bits encode a fractional part of the mantissa m_x (mantissa field). The integer encoded by these 32 bits, I_x, is given by

  I_x = (bias + E_x + m_x)·N_m,   (8)

where N_m = 2²³ and bias = 127 for float. Line 4 of the code InvSqrt1 or InvSqrt2 and line 3 of InvSqrt3 can be written as

  I_y0 = R − ⌊I_x / 2⌋.   (9)

The result of subtracting the integer ⌊I_x / 2⌋ from the magic constant R is the integer number I_y0, which is then represented as a float (lines 5 and 4, respectively) and gives the initial (zeroth) piecewise linear approximation y_0 of the function 1/√x. Further on, this approximation is refined (lines 6-7 and 5-9). The InvSqrt1 and InvSqrt2 algorithms use two classical (in the case of the InvSqrt1 algorithm)

  y_1 = 1/2 · y_0·(3 − x·y_0·y_0),   (10)
  y_2 = 1/2 · y_1·(3 − x·y_1·y_1),   (11)

or modified (for the InvSqrt2 algorithm)

  y_1 = 1/2 · y_0·(k_2 − x·y_0·y_0),   (12)
  y_2 = 1/2 · y_1·(k_4 − x·y_1·y_1)   (13)

Newton-Raphson iterations, which provide quadratic convergence of the iterative process. In the last formulas, k_2 and k_4 are constants (see InvSqrt2). Householder's formula in the InvSqrt3 algorithm,

  a = x·y_0·y_0,   (14)
  y_1 = y_0·(a·(a·(a·(a·t_1 + t_2) + t_3) + t_4) + t_5),   (15)

has a rate of convergence equal to 5. Here in (15), the values of the constants are as follows:

  t_1 = 35/128, t_2 = −45/32, t_3 = 189/64, t_4 = −105/32, t_5 = 315/128.   (16)

As proven in [11-12], in order to know the behaviour of the relative error for y_0 in the whole range of normalized floating-point numbers, it suffices to consider the range x ∈ [1,4). In this range, a piecewise linear analytical approximation y_0 of the function consists of three separate parts:

  y_01 = −(1/4)·x + 3/4 + (1/8)·t,  x ∈ [1,2),   (17)
  y_02 = −(1/8)·x + 1/2 + (1/8)·t,  x ∈ [2,t),   (18)
  y_03 = −(1/16)·x + 1/2 + (1/16)·t,  x ∈ [t,4),   (19)

where

  t = 2 + 4·m_R + 2/N_m   (20)

and the fractional part of the mantissa of the magic constant R is determined by the formula

  m_R = R/N_m − ⌊R/N_m⌋.   (21)

In this case, the maximum relative error of such piecewise linear approximations does not exceed the value 1/(2·N_m). Moreover, if we denote the magic constant R by

  R = Q·N_m + m_R·N_m,   (22)

where

  Q = ⌊R/N_m⌋,   (23)

then under the condition m_R < 1/2, the equality Q = 190 will be satisfied for Eq. (17)-(19). This condition is true for the magic constants of all the considered algorithms. However, in general, m_R can take any value from the range m_R ∈ [0,1). We will investigate this case in the next section.

3. DESCRIPTION OF THE PROPOSED ALGORITHM

Consider the case when Q = 190 and 1/2 ≤ m_R < 1. Then, according to the theory, Eq. (17)-(19) will take the form (24)-(26).


  y_01 = −(1/2)·x + 1 + (1/2)·t,  x ∈ [1,t),   (24)
  y_02 = −(1/4)·x + 1 + (1/4)·t,  x ∈ [t,2),   (25)
  y_03 = −(1/8)·x + 3/4 + (1/4)·t,  x ∈ [2,4).   (26)

Here

  t = 2·m_R + 1/N_m.   (27)

In order to minimize the relative error, we must set out the requirement to align all the maxima of the relative error for the first iteration in the y_01 and y_03 segments of the initial approximation. For this, the first iteration will be performed according to the formula

  y_1 = k_1·y_0·(k_2 − x·y_0·y_0).   (28)

Note that this iteration requires four multiplication operations. Then, taking into account formulas (24)-(26), we write (28) in the form of three corresponding iterative equations:

  y_1i = k_1·y_0i·(k_2 − x·y_0i·y_0i),  i = 1,3.   (29)

Let us try to reduce the number of multiplications in the first iteration to three. Such algorithms with a reduced number of floating-point multiplications will be especially useful when implemented on FPGAs [13-14]. To do this, we set the condition k_1 = 1/4. Then the multiplication by this value in a software or hardware implementation can be replaced by an integer subtraction. The basic idea is to subtract two from the exponent of a number represented in the IEEE 754 format.
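To see this trick in isolation: subtracting 2·N_m from the integer representation of a normalized float decrements its exponent field by two, i.e., divides the value by 4 with a single integer subtraction. A minimal sketch (not part of the paper's listings) illustrating this:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float y = 0.8125f;                  // any normalized float
    uint32_t i;
    std::memcpy(&i, &y, sizeof i);      // integer view of the bits
    i -= 2u << 23;                      // subtract 2 from the exponent field
    float q;
    std::memcpy(&q, &i, sizeof q);
    printf("%g %g\n", q, 0.25f * y);    // both print 0.203125
    return 0;
}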

Next, we need to find the best values of the coefficient k_2 and parameter t, which will determine the magic constant R. To do this, we write the relative errors of the first iteration based on the initial approximations y_01, y_02, and y_03:

  δ_1i = 0.25·y_0i·(k_2 − x·y_0i·y_0i)·√x − 1,  i = 1,3.   (30)

Here we use the formula

  δ_1 = y_1·√x − 1   (31)

to determine the relative error of the approximation y_1 for the inverse square root function. Note also that to calculate the maximum relative errors after the first iteration, we use the formula

  δ_1max = max_{x ∈ [1,4)} |y_1·√x − 1|.   (32)

Denote the corresponding upper and lower bounds of the relative errors by δ_1max⁺ and δ_1max⁻. The errors (30) have local maxima, in particular, at the points

  x_11max = (2 + t)/3,   (33)
  x_13max = (6 + 2·t)/3.   (34)

For the second component y_02, one should also take into account the point t, where the relative error δ_12 (as well as δ_11) has the negative maximum δ_12t (see Fig. 1). Now substituting (33), (34), and (27) into (30), we obtain expressions for δ_11max, δ_13max, and δ_12t. Based on them, we construct a system of two equations:

  δ_11max + δ_12t = 0,
  δ_13max − δ_11max = 0,   (35)

which will ensure the fulfillment of the above condition and the requirement to minimize the relative error. The solution of this system yields the following values of t and k_2:

  t = 1.495535739131174024621926122,   (36)
  k_2 = 4.764267011868366097360194544.   (37)

Given (27), we can determine that for numbers of type float

  m_R = 0.747767809960942236920338061.   (38)

Experimental results have shown that in practice, for single-precision floating-point numbers (float), the best values of these parameters are k_2 = 4.764266968f with the corresponding magic constant R = 0x5F5FB6D3. Similarly, for double-precision numbers: k_2 = 4.7642670066528519 and R = 0x5FEBF6DB526DE7D9.

In the next section, we will look at the implementation of the proposed algorithms in C++ for type float and consider the peculiarities for numbers of higher precision.

4. IMPLEMENTATION OF THE PROPOSED ALGORITHMS

4.1 SINGLE PRECISION

The final C++ code of the proposed InvSqrt41 algorithm with a single modified Newton-Raphson iteration for type float has the form:

1. float InvSqrt41 (float x){
2.  int i = *(int*)&x;
3.  i = i>>1;
4.  int ii = 0x5E5FB6D3 - i;
5.  i = 0x5F5FB6D3 - i;
6.  float y = *(float*)&i;
7.  float yy = *(float*)&ii;
8.  y = yy*(4.764266968f - x*y*y);
9.  return y;
10. }

The maximum values of the relative error in this case are:

  δ_1max⁺ = 6.502572·10⁻⁴,
  δ_1max⁻ = −6.502245·10⁻⁴.   (39)

These error values correspond to 10.59 exact bits of the result. Fig. 1 below shows a graph of the relative errors of the InvSqrt41 algorithm.

[Figure 1 – The graph of the relative errors of the InvSqrt41 algorithm on the interval x ∈ [1,4)]
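The maximum errors above can be reproduced by exhaustively checking every float in [1,4), as in Eq. (32). The following harness is a sketch for illustration (not the paper's measurement code); it assumes InvSqrt41 is defined as listed above:

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

float InvSqrt41(float x);   // assumed defined exactly as listed above

int main() {
    uint32_t i, end;
    const float lo = 1.0f, hi = 4.0f;
    std::memcpy(&i, &lo, sizeof i);
    std::memcpy(&end, &hi, sizeof end);
    double maxErr = 0.0;
    for (; i < end; ++i) {              // every float x in [1,4)
        float x;
        std::memcpy(&x, &i, sizeof x);
        double err = std::fabs((double)InvSqrt41(x) * std::sqrt((double)x) - 1.0);
        if (err > maxErr) maxErr = err;
    }
    printf("max relative error = %.6e (%.2f bits)\n", maxErr, -std::log2(maxErr));
    return 0;
}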

Thus, the above algorithm contains only three floating-point multiplication operations, and the maximum error is reduced compared to the InvSqrt2 algorithm (after the first iteration) by 26%.

If we add a second Newton-Raphson iteration using the FMA function, then we can achieve much smaller maximum errors. The accuracy of the algorithm can also be improved using the method of practical optimization of the algorithm constants, as we did for InvSqrt41. See the InvSqrt42 algorithm. The FMA function is used here to reduce the rounding error of the corresponding floating-point operations, which becomes noticeable at the last iteration of the algorithm (lines 9-11).

1. float InvSqrt42 (float x){
2.  int i = *(int*)&x;
3.  i = i>>1;
4.  int ii = 0x5E5FB432 - i;
5.  i = 0x5F5FB432 - i;
6.  float y = *(float*)&i;
7.  float yy = *(float*)&ii;
8.  y = yy*(4.76405191f - x*y*y);
9.  float c = x*y;
10. c = fmaf(y, c, -1.0000006f);
11. y = fmaf(-0.500097573f*y, c, y);
12. return y;
13. }

The maximum relative errors of this algorithm are:

  δ_2max⁺ = 3.756709·10⁻⁷,
  δ_2max⁻ = −3.973408·10⁻⁷,   (40)

which correspond to 21.26 correct bits of the result. This algorithm has seven floating-point multiplication operations, including FMAs. Compared to the InvSqrt2 and InvSqrt3 algorithms, which also have seven multiplications, the errors of the proposed algorithm are almost 46% and 40% smaller, respectively.

The following modification of the algorithm is a compromise between accuracy and speed:

1. float InvSqrt43 (float x){
2.  int i = *(int*)&x;
3.  int ix = i - 0x80800000;
4.  i = i>>1;
5.  int ii = 0x5E5FB3E2 - i;
6.  i = 0x5F5FB3E2 - i;
7.  float y = *(float*)&i;
8.  float yy = *(float*)&ii;
9.  y = yy*(4.76424932f - x*y*y);
10. float mhalf = *(float*)&ix;
11. float t = fmaf(mhalf, y*y, 0.500000298f);
12. y = fmaf(y, t, y);
13. return y;
14. }

This algorithm has six multiplication operations, performing the multiplication of the number x by −1/2 in the integer representation (line 3). The accuracy of the InvSqrt43 algorithm is 21.21 correct bits.


The accuracy could be further improved by using Householder's formula of order 2 in the second iteration [6]:

  a = x·y_1·y_1,   (41)
  y_2 = y_1·(a·(a·t_1 + t_2) + t_3),   (42)

where

  t_1 = 3/8, t_2 = −5/4, t_3 = 15/8.   (43)

As a result, we get the following modified algorithm with optimized constants:

1. float InvSqrt44 (float x){
2.  int i = *(int*)&x;
3.  i = i>>1;
4.  int ii = 0x5E5FB414 - i;
5.  i = 0x5F5FB414 - i;
6.  float y = *(float*)&i;
7.  float yy = *(float*)&ii;
8.  y = yy*(4.76410007f - x*y*y);
9.  float c = x*y;
10. float r = fmaf(y, c, -1.0f);
11. c = fmaf(0.374000013f, r, -0.5f);
12. y = fmaf(r*y, c, y);
13. return y;
14. }

This algorithm has eight floating-point multiplications. The maximum values of the relative errors of the algorithm are:

  δ_2max⁺ = 8.604127·10⁻⁸,
  δ_2max⁻ = −8.176169·10⁻⁸,   (44)

or 23.47 correct bits of the result out of 24 possible. Compared to Householder's method of order 5 [6], which has the same number of multiplications, the error of the InvSqrt44 algorithm is reduced by 6 times.

4.2 DOUBLE PRECISION

The same algorithms can also be implemented in double. In this case, in (8), N_m = 2⁵² and bias = 1023 for 64-bit IEEE 754 numbers. Also, Q = 1534 in the magic constant (22). For example, an analogue of the InvSqrt41 algorithm for double will be:

1. double InvSqrt45_d (double x){
2.  long long i = *(long long*)&x;
3.  i = i>>1;
4.  long long ii = 0x5FCBF6DB526DE7D9 - i;
5.  i = 0x5FEBF6DB526DE7D9 - i;
6.  double y = *(double*)&i;
7.  double yy = *(double*)&ii;
8.  y = yy*(4.7642670066528519 - x*y*y);
9.  return y;
10. }

This algorithm has almost the same accuracy, namely

  δ_1max⁺ = 6.501427·10⁻⁴,
  δ_1max⁻ = −6.501427·10⁻⁴,   (45)

but the use of a double-precision data type will give a tangible improvement after two or three iterations. To illustrate this, the double version of the InvSqrt43 algorithm will have an accuracy of 21.59 bits, and it will be 30.44 bits for the InvSqrt44 algorithm.

A modification of the InvSqrt43 algorithm with three Newton-Raphson iterations, optimized for speed, will look like:

1. double InvSqrt46_d (double x){
2.  long long i = *(long long*)&x;
3.  long long ix = i - 0x8010000000000000;
4.  i = i>>1;
5.  long long ii = 0x5FCBF6D99EF4C0F4 - i;
6.  i = 0x5FEBF6D99EF4C0F4 - i;
7.  double y = *(double*)&i;
8.  double yy = *(double*)&ii;
9.  y = yy*(4.7642669737958503 - x*y*y);
10. double mhalf = *(double*)&ix;
11. double t = fma(mhalf, y*y, 0.50000031699508796);
12. y = fma(y, t, y);
13. t = fma(mhalf, y*y, 0.50000000000007538);
14. y = fma(y, t, y);
15. return y;
16. }

This algorithm has nine multiplications and provides 43.59 correct bits. For comparison, the corresponding FISR [6, 8] and Walczyk [12] algorithms have 10 multiplications and an accuracy of 34.88 and 41.85 bits.

To further improve the accuracy of the algorithm in type double, one needs to perform an additional iteration of quadratic convergence in the algorithm based on InvSqrt44 or replace the last iteration in the previous algorithm (lines 13-14) with Householder's formula of order 2 (lines 13-16 of the InvSqrt47_d algorithm). Both algorithms will have similar accuracy for type double, but the second option should be faster:

1. double InvSqrt47_d (double x){
2.  long long i = *(long long*)&x;
3.  long long ix = i - 0x8010000000000000;
4.  i = i>>1;
5.  long long ii = 0x5FCBF6D9DB9A45CD - i;
6.  i = 0x5FEBF6D9DB9A45CD - i;
7.  double y = *(double*)&i;
8.  double yy = *(double*)&ii;
9.  y = yy*(4.7642670025852993 - x*y*y);
10. double mhalf = *(double*)&ix;
11. double t = fma(mhalf, y*y, 0.50000031697852854);
12. y = fma(y, t, y);
13. double c = x*y;
14. double r = fma(y, c, -1.0);
15. c = fma(0.375, r, -0.5);
16. y = fma(r*y, c, y);
17. return y;
18. }

The maximum relative errors of this algorithm are:

  δ_3max⁺ = 1.387779·10⁻¹⁶,
  δ_3max⁻ = −1.387779·10⁻¹⁶.   (46)

As can be seen, the algorithm provides an accuracy of 52.68 bits out of 53 possible for double, and it has eleven multiplications. Compared with the algorithm suggested by Lemaitre et al. [6] for double-precision numbers, which includes two iterations of the order-3 Householder's method using FMA and contains 12 multiplication operations, the errors of our algorithm are reduced by 3.5 times.

4.3 HIGHER PRECISION

In general, for numbers of any precision, the value of Q in the magic constant R is obtained by the formula

  Q = bias + ⌊bias / 2⌋.   (47)

Then the theoretical value of the basic magic constant R for the proposed method is determined by (22), where the value of the mantissa m_R is equal to (38) and the parameters bias and N_m are determined by the data type that is applied. The value of the constant k_2 in the first iteration is (37). The second magic constant R2 is obtained from R using the following expression:

  R2 = R − 2·N_m.   (48)

Table 1 summarizes the process of determining the theoretical values of the magic constants R and R2 for the three basic data types of the IEEE 754 standard (float, double, and __float128, respectively).

Table 1. Theoretical values of the magic constants

Constant   single (32 bits)   double (64 bits)    quadruple (128 bits)
bias       127                1023                16383
N_m        2^23               2^52                2^112
Q          190                1534                24574
R          5F5FB6DB           5FEBF6DB610C8A67    5FFEBF6DB610C8A675345B2E4EFA8BA0
R2         5E5FB6DB           5FCBF6DB610C8A67    5FFCBF6DB610C8A675345B2E4EFA8BA0

Please note that in practice, the theoretical values of the magic constants indicated in the table may not be optimal due to the rounding of arithmetic operations.
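As a cross-check of Table 1, the float and double rows can be rebuilt directly from Eq. (47), (22), (38), and (48). The following sketch is our illustration only; with an 80-bit long double the printed constants match the table, while on platforms where long double equals double the last bit may differ due to rounding:

#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    const long double mR = 0.747767809960942236920338061L;            // Eq. (38)
    // single precision: bias = 127, N_m = 2^23
    uint32_t Qf = 127 + 127 / 2;                                      // Eq. (47): 190
    uint32_t Rf = (Qf << 23) + (uint32_t)llroundl(mR * 8388608.0L);   // Eq. (22)
    printf("float:  R = %08X, R2 = %08X\n", Rf, Rf - (2u << 23));     // Eq. (48)
    // double precision: bias = 1023, N_m = 2^52
    uint64_t Qd = 1023 + 1023 / 2;                                    // 1534
    uint64_t Rd = (Qd << 52) + (uint64_t)llroundl(mR * 4503599627370496.0L);
    printf("double: R = %016llX, R2 = %016llX\n",
           (unsigned long long)Rd, (unsigned long long)(Rd - (2ull << 52)));
    return 0;
}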

5. EXPERIMENTAL RESULTS AND DISCUSSION

We have also tested the proposed algorithms on the ESP-WROOM-32 microcontroller [19]. In Table 2 you can find the results of measuring the latency and accuracy of the algorithms for single and double precision numbers.

5.1 SINGLE PRECISION

After analyzing the results for single precision numbers, we can say that the InvSqrt1, InvSqrt2, and InvSqrt43 algorithms have the highest performance, but our algorithm (InvSqrt43) provides the best accuracy among them. InvSqrt44 has the highest accuracy among all the considered algorithms, namely 23.47 bits, and its speed is the same as that of Householder's algorithm of order 5. As can be seen, all FISR-based algorithms on this platform are superior in speed to the sqrtf function from the cmath library. However, only this mathematical function supports subnormal numbers.

Please note that relative errors may differ slightly on different platforms depending on the compiler and the implementation of arithmetic operations; in particular, the InvSqrt2 algorithm has higher accuracy on ESP-32 than on Intel. The reason is that this microcontroller has fast hardware FMA instructions for type float [20], which are

automatically used by the compiler to perform all arithmetic operations. Here we use the following compiler options (for GCC 5.2.0): -std=gnu++11 -Os -ffp-contract=fast.

Table 2. Evaluation of the algorithms on ESP-32

Algorithm                                  Accuracy, bits   Latency, ns
float:
1.0f/sqrtf(x)                              23.42            801.5
FISR [7] (InvSqrt1)                        17.69            281.2
Walczyk [12] (InvSqrt2)                    20.40            281.2
Householder (order 4) [6] (InvSqrt3)       20.54            323.1
Householder (order 5) [6]                  20.87            348.3
InvSqrt42                                  21.26            306.4
InvSqrt43                                  21.21            281.2
InvSqrt44                                  23.47            348.3
double:
1.0/sqrt(x)                                52.42            6893.5
FISR [6, 8] (3 iter.)                      34.88            5373.9
FISR [6, 8] (4 iter.)                      51.68            6849.9
Walczyk [12] (3 iter.)                     41.85            5471.4
Walczyk [12] (4 iter.)                     51.69            6982.3
Householder (order 3) [6] (2 iter.)        50.86            8010.5
Householder (order 3) [6] (2 iter.) (*)    50.71            7548.1
Householder (order 3) [6] (2 iter.) (**)   50.62            7455.6
InvSqrt46_d                                43.59            6106.0
InvSqrt46_d (**)                           43.59            5258.7
InvSqrt47_d                                52.68            7286.9
InvSqrt47_d (*)                            52.68            6777.1

5.2 DOUBLE PRECISION

Regarding numbers of type double on the microcontroller, first of all, it should be said that ESP-32 does not have the hardware to support them [20], so all the corresponding operations are emulated in software, in particular, multiplication, addition, and FMA. That is why all the considered algorithms are inefficient in terms of speed, and using the FMA function on this platform is especially costly. Therefore, to increase the performance of the approximation algorithms, the sign "*" in Table 2 indicates an implementation of the algorithm using only one FMA function, which is applied at the last iteration. For example, for the InvSqrt47_d (*) algorithm, this is the FMA in line 14. Accordingly, "**" indicates the absence of FMA. It should be said that this step may cause a decrease in accuracy.

Based on Table 2, it can be said that our InvSqrt47_d (*) algorithm provides both the best accuracy and latency when compared to the sqrt function and the corresponding FISR and Walczyk algorithms. Regarding the InvSqrt46_d (**) algorithm, it is also more accurate than the Walczyk algorithm with three iterations and has better performance on ESP-32.

Thus, based on the obtained results, we can conclude that for the effective implementation of the proposed algorithms with two magic constants, it is important to have fast hardware multiplication and FMA operations for the corresponding floating-point data type.
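The latency figures above are specific to ESP-32 and its toolchain. As a rough, platform-neutral way to compare such implementations, a minimal harness along the following lines can be used (a sketch of ours, not the benchmark code used in the paper; InvSqrt43 is assumed to be defined as in Section 4.1):

#include <chrono>
#include <cstdio>

float InvSqrt43(float x);   // assumed defined as in Section 4.1

int main() {
    const int N = 10000000;
    volatile float sink = 0.0f;         // keeps the loop from being optimized away
    float x = 1.0f;
    auto t0 = std::chrono::steady_clock::now();
    for (int k = 0; k < N; ++k) {
        sink = InvSqrt43(x);
        x += 1e-6f;                     // vary the argument a little
    }
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / N;
    printf("average latency: %.1f ns per call (sink = %g)\n", ns, (double)sink);
    return 0;
}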
6. CONCLUSIONS

We have proposed FISR-based inverse square root calculation algorithms that contain one less floating-point multiplication thanks to the introduction of an additional magic constant. The advantage of the considered algorithms is that they do not use lookup tables to form the initial approximation and offer increased calculation accuracy by minimizing the maximum relative error and using FMA.

The practical value of the results lies in the fact that the proposed algorithms can easily replace the basic FISR method in many applications, which increases the accuracy of calculations and reduces the latency on some platforms. In particular, the FISR algorithm with the magic constant R = 0x5f3759df is widely used to implement the approximation of the reciprocal square root function in computer vision and object detection tasks [14, 21-24] when it is necessary to improve the performance of a software and/or hardware application, especially in the case of real-time computing.

These algorithms can be effectively used in microcontrollers that support floating-point calculations but do not have an FPU (floating-point unit) to calculate the inverse square root, such as, for example, the ESP-WROOM-32 [19-20]. They can also be effective in hardware implementations on FPGA platforms that have pipelining and cheap integer operations. An example of such a platform is Intel Cyclone, which has fast single-precision floating-point blocks [25].

It should also be noted that only normalized floating-point numbers of single and double precision were discussed in this article. As was shown, these algorithms can be easily generalized to other data formats of the IEEE 754 standard, namely to quadruple and octuple precision.


7. REFERENCES

[1] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, second ed., Oxford Univ. Press, New York, 2010, 641 p.
[2] N.H.F. Beebe, The Mathematical-Function Computation Handbook: Programming Using the MathCW Portable Software Library, Springer, 2017, 1115 p.
[3] M.D. Ercegovac, T. Lang, Division and Square Root: Digit Recurrence Algorithms and Implementations, Boston: Kluwer Academic Publishers, 1994, 230 p.
[4] J.-M. Muller, N. Brisebarre, F. de Dinechin, C.-P. Jeannerod, V. Lefevre, G. Melquiond, N. Revol, D. Stehle, S. Torres, Handbook of Floating-Point Arithmetic, Springer, 2010, 572 p.
[5] J.-M. Muller, N. Brunie, F. de Dinechin, C.-P. Jeannerod, M. Joldes, V. Lefevre, G. Melquiond, N. Revol, S. Torres, "Software implementation of floating-point arithmetic," in: Handbook of Floating-Point Arithmetic, Birkhäuser, Cham, 2018, pp. 321-374.
[6] F. Lemaitre, B. Couturier, L. Lacassagne, "Cholesky factorization on SIMD multi-core architectures," Journal of Systems Architecture, vol. 79, pp. 1-15, 2017.
[7] id Software, Quake III Arena, quake3-1.32b code, [Online]. Available: https://github.com/id-Software/Quake-III-Arena/blob/master/code/game/q_math.c
[8] C. Lomont, Fast Inverse Square Root, Purdue University, Tech. Rep., 2003, [Online]. Available at: http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf
[9] P. Martin, "Eight rooty pieces," Overload Journal, vol. 24, issue 135, pp. 8-12, 2016.
[10] D.H. Eberly, GPGPU Programming for Games and Science, CRC Press, 2014, 469 p.
[11] L. Moroz, C.J. Walczyk, A. Hrynchyshyn, V. Holimath, J.L. Cieśliński, "Fast calculation of inverse square root with the use of magic constant – analytical approach," Applied Mathematics and Computation, vol. 316, issue C, pp. 245-255, 2018.
[12] C.J. Walczyk, L.V. Moroz, J.L. Cieśliński, "Improving the accuracy of the fast inverse square root algorithm," ArXiv Preprint, arXiv:1802.06302, 2018.
[13] A. Hasnat, T. Bhattacharyya, A. Dey, S. Halder and D. Bhattacharjee, "A fast FPGA based architecture for computation of square root and inverse square root," Proceedings of the International Conference on Devices for Integrated Circuit (DevIC), Kalyani, India, 2017, pp. 383-387.
[14] C.J. Hsu, J.L. Chen and L.G. Chen, "An efficient hardware implementation of HON4D feature extraction for real-time action recognition," Proceedings of the 2015 IEEE International Symposium on Consumer Electronics (ISCE), 2015, pp. 1-2.
[15] Z. Li, Y. Chen and X. Zeng, "OFDM synchronization implementation based on Chisel platform for 5G research," Proceedings of the 2015 IEEE 11th International Conference on ASIC (ASICON), 2015, pp. 1-4.
[16] S. Zafar and R. Adapa, "Hardware architecture design and mapping of fast inverse square root's algorithm," Proceedings of the International Conference on Advances in Electrical Engineering (ICAEE), 2014, pp. 1-4.
[17] C.A. Navarro, M. Vernier, N. Hitschfeld, B. Bustos, "Competitiveness of a non-linear block-space GPU thread map for simplex domains," IEEE Transactions on Parallel and Distributed Systems, vol. 29, issue 12, pp. 2728-2741, 2018.
[18] A. Gazneli et al., Adaptive Nonlinear System Control using Robust and Low-complexity Coefficient Estimation, U.S. Patent No. 15/648,205, 2018.
[19] Espressif Systems, ESP32-WROOM-32 (ESP-WROOM-32) datasheet, Version 2.4, 2018.
[20] Tensilica Inc., Xtensa Instruction Set Architecture (ISA), Reference Manual, 2018.
[21] S. Xiao, T. Isshiki, D. Li, H. Kunieda, "HOG-based object detection processor design using ASIP methodology," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 100, issue 12, pp. 2972-2984, 2017.
[22] M. Ramirez-Martinez, F. Sanchez-Fernandez, P. Brunet, S.M. Senouci and El-Bay Bourennane, "Dynamic management of a partial reconfigurable hardware architecture for pedestrian detection in regions of interest," Proceedings of the 2017 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2017, pp. 1-7.
[23] J. Lv, F. Wang and Z. Ma, "Peach fruit recognition method under natural environment," Proceedings of the Eighth International Conference on Digital Image Processing (ICDIP 2016), Chengdu, China, August 29, 2016, vol. 10033, paper 1003317.
[24] B.K. Anirudh, V. Venkatraman, A.R. Kumar and S.D. Sumam, "Accelerating real-time computer vision applications using HW/SW co-design," Proceedings of the 2017 International Conference on Computer, Communications and Electronics (Comptelix), Jaipur, India, July 1-2, 2017, pp. 458-463.
[25] Intel Corporation, Intel® Cyclone® 10 GX device overview, C10GX51001, 2018.


Oleh Horyachyy received his MSc degrees in Applied Computer Science from Ivan Franko National University of Lviv in 2013 and in Information and Communication Systems Security from Lviv Polytechnic National University in 2017. Currently, he is a PhD student at the Institute of Computer Technologies, Automation, and Metrology at Lviv Polytechnic National University, Ukraine, and an engineer at the Department of Information Technologies Security. His research interests include iterative methods, secure multiparty computations, biometric identification, security of communication technologies, cryptography, and public key infrastructure.

Leonid Moroz received his MSc and PhD degrees from Lviv Polytechnic National University, Ukraine, in 1978 and 1985, respectively. He defended his DSc degree in Computer Systems and Components in 2013. Currently, he works as a Professor at the Department of Information Technologies Security of the Institute of Computer Technologies, Automation, and Metrology at Lviv Polytechnic National University. His research interests include computer arithmetic, numerical methods, and digital signal processing.

Viktor Otenko graduated from Lviv Polytechnic National University, Ukraine, and received his MSc degree in Automation and Telemechanics in 1979. In 1986, he was awarded the title "Best young inventor of Ukraine". He received a PhD degree in Elements and Appliances of Computer Technology and Control Systems in 1998 from the same university. He has been working at the Department of Information Security of the Institute of Computer Technologies, Automation, and Metrology at Lviv Polytechnic National University as an Assoc. Professor since 2006. His research interests are software engineering, program security, and pulse-numerical functional transformation.
